Simply put, OOMKilled is an error in Kubernetes that emerges because a pod or container has exceeded the amount of memory allocated to it. OOM stands for Out of Memory, and Killed means a process has been terminated.

This is a common problem with a seemingly easy fix: increase the memory allocation. However, that straightforward solution would only work if resources were infinite and available memory were inexhaustible. Read on to learn more about OOMKilled, its causes, the ways to resolve it, and how to balance memory allocations in light of this Kubernetes error.

OOMKilled basics

Also known as Exit Code 137, the OOMKilled error is based on a Linux kernel feature called the OOM Killer, which Kubernetes relies on to manage container lifecycles when a host runs out of memory. It is not a native Kubernetes feature, but it is one of the important errors Kubernetes users should be familiar with.

In Kubernetes, pods can set minimum and maximum values for the memory their containers may use on the host machine. The minimum is referred to as the “memory request,” while the maximum is the “memory limit.” The OOMKilled error appears whenever a container uses more memory than the limit granted to it.
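
As an illustration, a minimal pod specification with both values set might look like the following sketch (the pod name, image, and amounts are placeholders rather than recommendations):

  apiVersion: v1
  kind: Pod
  metadata:
    name: example-app
  spec:
    containers:
      - name: example-app
        image: nginx:1.25
        resources:
          requests:
            memory: "256Mi"   # memory request: the minimum reserved for the container
          limits:
            memory: "512Mi"   # memory limit: exceeding this triggers OOMKilled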

The error is identified by running the “kubectl get pods” command. The command displays the name of the pod, status, number of restarts, and age. Initially, the pod status indicates “Terminating” but eventually shows “OOMKilled”.
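
The output for an affected pod typically looks something like this (the pod name, restart count, and age below are illustrative):

  $ kubectl get pods
  NAME          READY   STATUS      RESTARTS   AGE
  example-app   0/1     OOMKilled   2          6m14s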

The Exit Code 137 error appears when running the “kubectl describe pod [name]” command, which shows the state of the pod (waiting/running/terminated), the date and time when it started, the last state, the reason for termination, and the exit code. The Reason field shows “OOMKilled” and the Exit Code reads “137” when the pod used more memory than was allocated to it and was stopped.
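
The relevant portion of the output generally resembles the following (the pod name and timestamps are illustrative):

  $ kubectl describe pod example-app
  ...
  State:          Waiting
    Reason:       CrashLoopBackOff
  Last State:     Terminated
    Reason:       OOMKilled
    Exit Code:    137
    Started:      Mon, 01 Jan 2024 10:00:00 +0000
    Finished:     Mon, 01 Jan 2024 10:05:23 +0000
  ...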

Causes of Exit Code 137

There are three main reasons why containers or pods use more memory than they are expected to utilize and exceed the configured memory limit. They are as follows:

  • The application could be handling a heavier workload than it regularly does. This can happen when multiple instances of the application are running, or when the app operates at its highest settings. The increased load can also be caused by the presence of malware.
  • The application could be suffering from a memory leak. Runtimes typically perform “garbage collection” to reclaim memory from applications that no longer need the memory they were initially allocated. However, there are instances when an app defeats the garbage collector by continuing to hold on to memory it was granted. For example, an object in the app could persistently hold a reference to an allocation, so the memory remains in use even though the application supposedly no longer needs it.
  • Another reason for Exit Code 137 is node overcommitment. This occurs when the memory available on a node is lower than the combined memory its pods are allowed to use. Pod and container memory limits are set independently of node capacity, so it is possible for the pods assigned to a node to be permitted, in total, more memory than that specific node actually has. A quick way to check for this is sketched after this list.
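
One way to spot overcommitment is to compare a node's allocatable memory against the totals its pods request and are limited to, which “kubectl describe node” reports; the node name and figures below are illustrative:

  $ kubectl describe node example-node
  ...
  Allocatable:
    memory:  8026636Ki
  ...
  Allocated resources:
    (Total limits may be over 100 percent, i.e., overcommitted.)
    Resource  Requests      Limits
    --------  --------      ------
    memory    6400Mi (82%)  12800Mi (163%)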

Troubleshooting memory overuse

In the case of apps that are experiencing greater than usual loads, the solution is to increase the memory limits in the pod specification. For apps with memory leaks, the solution calls for debugging the application to remove the underlying cause of the leak. If malware is detected, it must be eliminated completely.
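
Before raising a limit, it helps to check how much memory the container actually uses so the new values reflect reality. One way is “kubectl top pod,” assuming the cluster has metrics-server installed (the pod name and figures are illustrative), after which the container's resources block can be adjusted accordingly:

  $ kubectl top pod example-app
  NAME          CPU(cores)   MEMORY(bytes)
  example-app   120m         498Mi

  resources:
    requests:
      memory: "512Mi"   # raised to cover the observed baseline usage
    limits:
      memory: "1Gi"     # raised to leave headroom for peak load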

On the other hand, to resolve Exit Code 137 on overcommitted nodes, it is necessary to revisit the memory request and memory limit values of the containers to make sure they fit within the memory available on the nodes. Expanding the memory available to the nodes is also possible, but it is more logical to look at the minimum and maximum memory allocations of the containers and pods first. It rarely happens that all or almost all of the pods experience the OOMKilled error; in most cases only a few pods are affected, so it makes sense to fix them individually instead of applying a blanket solution at the node level.
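
To see how close a node already is to its capacity, actual usage can be compared against the allocation figures above with “kubectl top nodes,” again assuming metrics-server is available (the output below is illustrative):

  $ kubectl top nodes
  NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
  example-node   500m         25%    6800Mi          87%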

Another solution to the memory overuse problem is to reduce the number of parallel runners. Kubernetes supports parallel processing, which allows multiple workloads to run simultaneously and finish more work in less time. This is a boon to efficiency, but it inevitably results in more memory utilization and strain on the overall Kubernetes ecosystem. If the Exit Code 137 error keeps showing up after container, pod, and node memory allocations have already been adjusted, the difficulty could lie in parallel processing. Evaluate parallel workloads thoroughly and identify those that are not crucial and could be removed or scaled back, as in the sketch below.
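
One concrete place this can be tuned, assuming the parallel work runs as a Kubernetes Job, is the Job's parallelism field; the sketch below (with a hypothetical job name and image) shows it lowered so fewer worker pods compete for memory at once:

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: example-batch-job
  spec:
    parallelism: 2    # run at most two worker pods at a time instead of, say, eight
    completions: 8    # total number of successful completions still required
    template:
      spec:
        containers:
          - name: worker
            image: example-worker:latest   # hypothetical worker image
        restartPolicy: Never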

Points on memory adjustment

As mentioned, the primary solution to OOMKilled is to increase memory limits. The error would never appear if memory allocations could be set extremely high. Unfortunately, memory is a finite resource, so it is important to balance memory allocations across the different containers and nodes that serve different purposes.

Even if memory were unlimited, it would still be essential to examine memory utilization and set just the right memory minimums and maximums based on what applications are expected to use. It would be unwise to allow apps to operate inefficiently. An app with a memory leak, for example, will continue using more memory and eventually cause issues for the overall Kubernetes environment.

When adjusting memory allocations, it helps to remember the order of priority in which nodes terminate pods. When memory runs short, the first to be terminated are the pods that have no memory requests or limits. Next are the pods that have memory requests but no limits. Third in line are the pods that use more memory than their memory request (the minimum) but stay within the memory limit. The last to be terminated are the pods that take up more than their limits allow.
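
These priority tiers broadly correspond to the pod's quality of service class (BestEffort, Burstable, or Guaranteed), which Kubernetes records on each pod. One way to check it, using a placeholder pod name, is:

  $ kubectl get pod example-app -o jsonpath='{.status.qosClass}'
  Burstable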

Observe pod terminations based on the prioritization order above; the pods killed last merit the most prompt attention. Reviewing all of these pods, containers, and nodes and then making the adjustments is a manual and somewhat unwieldy process. However, it can be completed much faster with the help of an automated troubleshooting platform. There are Kubernetes troubleshooting solutions that expedite error fixing through seamless notifications, greater system visibility, insights into service dependencies, and change intelligence.
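
For manual observation, one lightweight starting point is to scan recent cluster events for out-of-memory kills; this is only a rough sketch, and the grep pattern simply filters the event text:

  $ kubectl get events --all-namespaces --sort-by=.lastTimestamp | grep -i oom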

In summary

OOMKilled is basically a problem with inadequate memory, to which the solution is to have more memory. However, simply increasing memory allocations is not going to be a sensible solution, even if the error disappears (temporarily). It is important to ensure the efficient operation of applications. Memory requests and limits are not set whimsically. They are based on what applications are designed to do.

If apps require more memory than expected, this could indicate issues like bugs, malware infection, or more users than projected. Exit Code 137 is not itself the issue, but a symptom of issues that should be addressed properly. Eliminating it by simply allocating more memory is but a band-aid remedy.
