Debugging OOMKilled Pods in Production: A Step-by-Step Guide
Munish Thakur
It’s Monday morning, and Slack is blowing up. The backend service is down. Again.
I SSH into the node and run kubectl get pods:
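The output looked roughly like this (pod names, counts, and ages here are illustrative, not the real ones):
```
NAME                          READY   STATUS      RESTARTS   AGE
backend-api-7d9fb6c4d-x2kqp   0/1     OOMKilled   4          12m
backend-api-7d9fb6c4d-9hzvw   1/1     Running     0          3d
```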
If you’ve worked with Kubernetes, you’ve seen this. OOMKilled means your pod tried to use more memory than allowed and got terminated by the kernel.
Here’s how I debug it.
What is OOMKilled?
OOMKilled = Out Of Memory Killed
It happens when a container exceeds its memory limit. The Linux kernel’s OOM (Out Of Memory) killer terminates the process to protect the node.
Why It Happens
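The snippet that belongs here is a container resource spec along these lines (reconstructed; the 256Mi request is an assumed value, the 512Mi limit is the one discussed below):
```
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves
  limits:
    memory: "512Mi"   # hard ceiling enforced via cgroups
```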
If your app tries to allocate more than 512Mi, the kernel’s OOM killer terminates it and Kubernetes reports the container as OOMKilled.
Step 1: Confirm OOMKilled
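A quick check from the command line (pod name is a placeholder):
```
# Look at the STATUS column and restart count
kubectl get pod backend-api-7d9fb6c4d-x2kqp

# Or ask for the last termination reason directly
kubectl get pod backend-api-7d9fb6c4d-x2kqp \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# → OOMKilled
```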
Step 2: Describe the Pod
This is where the investigation begins:
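The command itself is simple (pod name is a placeholder):
```
kubectl describe pod backend-api-7d9fb6c4d-x2kqp
```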
Look for these sections:
Containers Section
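The container’s last state is the smoking gun. An abridged, illustrative example of what you’ll see:
```
Containers:
  backend-api:
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Restart Count:  4
    Limits:
      memory:  512Mi
    Requests:
      memory:  256Mi
```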
Events Section
Events:
  Type     Reason   Message
  ----     ------   -------
  Warning  BackOff  Back-off restarting failed container
  Warning  Failed   Error: OOMKilled
Exit Code 137 means the process was killed with SIGKILL (128 + 9, where 9 is SIGKILL). Paired with Reason: OOMKilled, it confirms the OOM killer was the cause.
Step 3: Check Actual Memory Usage
Option A: kubectl top (if metrics-server is running)
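Assuming metrics-server is installed (pod name and namespace are placeholders):
```
kubectl top pod backend-api-7d9fb6c4d-x2kqp

# or everything in the namespace
kubectl top pods -n production
```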
But wait: the pod has crashed. You can’t get metrics from a dead pod.
Option B: Check Previous Instance Logs
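The --previous flag pulls logs from the last, crashed container instance (names are placeholders):
```
kubectl logs backend-api-7d9fb6c4d-x2kqp --previous

# If the pod runs more than one container, name it explicitly
kubectl logs backend-api-7d9fb6c4d-x2kqp -c backend-api --previous
```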
Look for memory-related errors:
- java.lang.OutOfMemoryError (Java)
- MemoryError in Python
- FATAL: out of memory in Postgres
Option C: Check Prometheus/Grafana
Query historical memory usage before the crash:
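A typical query, assuming the standard cAdvisor metric is being scraped (label values are placeholders):
```
# Working-set memory over time; graph this in Grafana
container_memory_working_set_bytes{namespace="production", pod=~"backend-api-.*", container!=""}

# Peak over the last 24 hours
max_over_time(
  container_memory_working_set_bytes{namespace="production", pod=~"backend-api-.*", container!=""}[24h]
)
```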
Step 4: Analyze the Root Cause
Scenario 1: Memory Leak
If memory usage steadily increases over time, it’s likely a memory leak in your application.
Solution: Fix the code, not the limits.
Scenario 2: Traffic Spike
If memory spikes correlate with traffic, your app is under-resourced for peak load.
Solution: Increase resources OR add autoscaling.
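If memory scales with request load across replicas, a memory-based HPA can absorb the spikes. A minimal sketch, with assumed names and thresholds:
```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-api
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75   # percent of the memory *request*
```
Note that scaling out only helps if the memory pressure is spread across requests; a single huge job will still OOM one pod.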
Scenario 3: Limits Too Low
If your app consistently uses near its limit during normal operation, limits are just too low.
Solution: Increase memory limits.
Step 5: The Fix
Quick Fix: Increase Memory Limits
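The change is just bumping the numbers in the Deployment spec (values here are illustrative):
```
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"   # was 512Mi
```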
Apply the change:
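(File and Deployment names are placeholders.)
```
kubectl apply -f deployment.yaml
kubectl rollout status deployment/backend-api
```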
Long-term Fix: Optimize the Application
Sometimes the app is inefficient:
Common issues:
- Loading entire datasets into memory
- Not using database pagination
- Memory-intensive operations without cleanup
- Cache without eviction policy
My Real Production Case
At Solytics, our model-estimation service was getting OOMKilled:
Memory usage: 6.2 GiB
Memory limit: 4 GiB
The Investigation
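The snippet that was here didn’t survive formatting; the check itself is the same Prometheus query from Step 3, scoped to the service (label values below are illustrative):
```
max_over_time(
  container_memory_working_set_bytes{pod=~"model-estimation-.*", container!=""}[24h]
)
```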
The Fix
Option A: Increase memory to 7 GiB
Option B: Optimize code to load data in batches
We did Option A first (get service back up), then Option B (proper fix).
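The stop-gap change, reconstructed (the 7Gi limit is from above; the request value is an assumption):
```
resources:
  requests:
    memory: "4Gi"
  limits:
    memory: "7Gi"   # up from 4Gi
```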
Then the dev team optimized data loading:
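Not their exact code, but the shape of the change: stream the data in fixed-size chunks instead of materializing the whole dataset. A sketch using pandas, where estimate() and combine() stand in for the service’s own routines and the file/column names are made up:
```
import pandas as pd

def estimate(df: pd.DataFrame) -> float:
    # stand-in for the real model-estimation step
    return df["value"].sum()

def combine(parts: list[float]) -> float:
    # stand-in for merging per-chunk results
    return sum(parts)

# Before: the whole file was loaded at once
# df = pd.read_csv("observations.csv")
# result = estimate(df)

# After: only one chunk is in memory at a time
partials = [estimate(chunk)
            for chunk in pd.read_csv("observations.csv", chunksize=100_000)]
result = combine(partials)
```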
Final memory usage: 2.5 GiB (back under 4 GiB limit).
Common Mistakes
Mistake 1: Only Increasing Limits
If it’s a memory leak, increasing limits just delays the inevitable.
Mistake 2: Setting Limits = Requests
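That is, something like this (values illustrative):
```
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "512Mi"   # identical to the request
```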
This prevents efficient bin-packing. Better:
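Leave headroom between request and limit:
```
resources:
  requests:
    memory: "256Mi"   # what scheduling and bin-packing are based on
  limits:
    memory: "512Mi"   # burst ceiling
```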
Mistake 3: No Limits At All
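That is, a container spec with no resources block at all (name and image are illustrative):
```
containers:
  - name: backend-api
    image: registry.example.com/backend-api:latest
    # no resources: section, so no request and no limit
```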
One pod can consume all node memory and crash everything else.
Prevention: Set Up Alerts
Don’t wait for OOMKilled. Monitor memory usage:
Prometheus Alert
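A rule along these lines works if cAdvisor and kube-state-metrics are both scraped; metric names shift between versions, so treat it as a sketch:
```
groups:
  - name: memory
    rules:
      - alert: ContainerNearMemoryLimit
        expr: |
          max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
            /
          max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
            > 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is above 90% of its memory limit"
```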
kubectl Events
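You can also keep an eye on Warning events cluster-wide (namespace is a placeholder):
```
# Recent warnings across all namespaces
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp

# Watch a single namespace live
kubectl get events -n production -w
```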
Debugging Checklist
When you see OOMKilled:
- Run kubectl describe pod <pod-name>
- Check Exit Code (137 = OOMKilled)
- Review memory limits in pod spec
- Check logs: kubectl logs <pod> --previous
- Query Prometheus for historical memory usage
- Identify if it’s a leak, spike, or under-provisioning
- Apply appropriate fix (increase limit OR optimize code)
- Monitor for 24-48 hours
- Set up alerts to prevent recurrence
The Decision Tree
Pod OOMKilled?
↓
Is memory usage growing steadily?
├─ Yes → Memory leak → Fix code
└─ No → Continue
↓
Is it only during traffic spikes?
├─ Yes → Under-provisioned → Increase limits + HPA
└─ No → Continue
↓
Is it consistently near limit?
├─ Yes → Limits too low → Increase by 50-100%
└─ No → Investigate specific cause
Tools for Memory Profiling
For Python Apps
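Common choices are tracemalloc (standard library), memory_profiler, and objgraph. A minimal tracemalloc sketch, with a stand-in workload:
```
import tracemalloc

tracemalloc.start()

# ... run the suspicious code path here ...
data = [bytes(1024) for _ in range(10_000)]   # stand-in workload

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)   # top allocation sites by source line
```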
For JVM Apps
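Heap dumps plus the built-in JDK tooling cover most cases (pid, paths, and jar name are placeholders):
```
# Class histogram of live objects
jmap -histo:live <pid> | head -20

# Full heap dump for Eclipse MAT or VisualVM
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>

# Helpful flags when running in a container
java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp \
     -XX:MaxRAMPercentage=75.0 -jar app.jar
```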
For Node.js
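The built-in inspector plus on-demand heap snapshots usually get you there (app name and pid are placeholders):
```
# Attach Chrome DevTools (Memory tab) to a running app
node --inspect app.js

# Write a heap snapshot whenever the process receives SIGUSR2 (Node 12+)
node --heapsnapshot-signal=SIGUSR2 app.js
kill -USR2 <pid>
```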
Conclusion
OOMKilled errors are common in Kubernetes. The key is:
- Diagnose properly - Don’t just blindly increase resources
- Understand the pattern - Leak, spike, or under-provisioned?
- Fix the root cause - Code optimization often better than bigger limits
- Monitor proactively - Catch issues before they become incidents
Pro tip: If you’re getting OOMKilled on startup, check your application’s initialization phase. Loading large datasets or caches on startup is a common culprit.
Remember: The goal isn’t to eliminate restarts, it’s to eliminate the root cause.