Munish Thakur

Kubernetes Cost Optimization: How I Saved 81% CPU and 68% Memory

During my time at Solytics Partners, I was asked to analyze our Kubernetes cluster resource usage. The CFO wanted to know: “Are we wasting money on over-provisioned infrastructure?”

Spoiler: We were. By a lot.

The Investigation

I was given access to our pre-production cluster and asked to conduct a comprehensive resource utilization analysis. The goal: identify optimization opportunities without risking service stability.

The Methodology

I wrote a script to collect resource metrics across all namespaces:

# For each pod in the cluster (capture namespace + name, since `-o name` drops the namespace)
kubectl get pods -A --no-headers \
  -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name' |
while read -r ns pod; do
  # Get configured resource requests/limits
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].resources}'
  echo

  # Get actual usage from metrics-server
  kubectl top pod "$pod" -n "$ns" --no-headers
done

The data went into CSV files for analysis. After 2 weeks of collection, the results were shocking.
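
A rough sketch of how the per-pod numbers can be flattened into CSV for later analysis (the usage.csv file name and column layout are mine, not the original script):

# Sketch: one CSV row per pod, with live usage from metrics-server
echo "namespace,pod,cpu_used,mem_used" > usage.csv
kubectl top pods -A --no-headers |
  awk '{ print $1 "," $2 "," $3 "," $4 }' >> usage.csv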

The Findings

Frontend Services: 0% Utilization 🔴

Service                Allocated CPU   Used CPU      Allocated Memory   Used Memory   Efficiency
frontend               3 cores         0.002 cores   6 GiB              8 MiB         0%
admin-panel-frontend   3 cores         0.003 cores   6 GiB              12 MiB        0%

Total waste: 6 CPU cores, 12 GiB memory doing almost nothing.

These were static file servers. They didn’t need enterprise-grade resources.

JupyterHub: 98% Over-Provisioned 🔴

Service              Allocated         Used                 Efficiency
jupyterhub-jupyter   7 cores, 13 GiB   0.1 cores, 256 MiB   2%

The JupyterHub service was allocated like it would handle 100 concurrent users. Actual usage? 2-3 users per day.

RabbitMQ: 140% CPU Usage 🚨

While most services were over-provisioned, I found the opposite problem:

Service      Allocated    Used         Status
rabbitmq-1   0.25 cores   0.35 cores   Throttled
rabbitmq-0   0.25 cores   0.35 cores   Throttled

RabbitMQ was CPU-starved, causing message queue delays.
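
One way to confirm the throttling rather than infer it from kubectl top is to read the kernel's CFS statistics directly; a sketch (the messaging namespace is an assumption, and the path differs between cgroup v1 and v2):

# cgroup v2: nr_throttled / throttled_usec growing between samples means the CPU quota is biting
kubectl exec -n messaging rabbitmq-0 -- cat /sys/fs/cgroup/cpu.stat

# cgroup v1 equivalent
kubectl exec -n messaging rabbitmq-0 -- cat /sys/fs/cgroup/cpu/cpu.stat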

ELK Stack: 95% Waste

Service         Allocated CPU   Used CPU     Waste
elasticsearch   6 cores         0.2 cores    96%
logstash        4 cores         0.1 cores    97%
kibana          4 cores         0.05 cores   98%

14 CPU cores and 30 GiB memory for a logging stack serving a small team.

The Recommendations

High Impact: Frontend Services

# Before
resources:
  requests:
    cpu: 3
    memory: 6Gi
  limits:
    cpu: 3
    memory: 6Gi

# After
resources:
  requests:
    cpu: 100m  # 0.1 cores
    memory: 256Mi
  limits:
    cpu: 200m
    memory: 512Mi

Savings: 5.8 CPU cores and 11.5 GiB of memory across the two frontend services

Critical Fix: RabbitMQ (Under-provisioned)

# Before (CPU throttled!)
resources:
  requests:
    cpu: 250m
    memory: 512Mi

# After
resources:
  requests:
    cpu: 400m
    memory: 512Mi

This fixed message processing delays.

Total Cluster Savings

Resource   Allocated    Actually Needed   Waste        Savings
CPU        73.6 cores   13.9 cores        59.7 cores   81%
Memory     201.25 GiB   63.65 GiB         137.6 GiB    68%

The Implementation Strategy

I didn’t change everything at once. That’s a recipe for disaster.

Phase 1: Quick Wins (Week 1)

✅ Increase RabbitMQ CPU (critical)
✅ Reduce frontend services by 90%
✅ Reduce nginx/pgbouncer (infrastructure)

Phase 2: Low-Risk (Weeks 2-3)

✅ Reduce JupyterHub services by 85%
✅ Optimize ELK stack

Phase 3: Careful (Week 4)

✅ Right-size backend Django services
✅ Monitor and adjust

Phase 4: Monitor (Ongoing)

✅ Set up Prometheus alerts at 80% of new limits
✅ Weekly resource usage reviews
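
One way to apply and verify a single change in each phase without editing manifests; a minimal sketch (the deployment name and namespace here are placeholders):

# Apply the new requests/limits to one service at a time
kubectl -n production set resources deployment/frontend \
  --requests=cpu=100m,memory=256Mi --limits=cpu=200m,memory=512Mi

# Wait for the rollout, then watch real usage against the new sizing
kubectl -n production rollout status deployment/frontend
kubectl top pods -n production --sort-by=cpu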

Tools I Used

1. kubectl top

# Real-time resource usage
kubectl top pods -n production --sort-by=memory
kubectl top nodes

2. Prometheus Queries

# CPU usage over time
rate(container_cpu_usage_seconds_total[5m])

# Memory usage
container_memory_working_set_bytes

3. Metrics Server

# Ensure metrics-server is installed
kubectl get deployment metrics-server -n kube-system
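
If it's missing, the upstream components manifest installs it (double-check the version against your cluster before applying):

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml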

4. Excel/CSV Analysis

Exported data to CSV, created pivot tables, identified patterns.
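
Pivot tables aren't strictly necessary; assuming a second CSV that pairs each pod's configured CPU request with its observed usage (a hypothetical layout, not the original sheet), a one-liner surfaces the worst offenders:

# Hypothetical columns: pod,cpu_request_m,cpu_used_m (millicores, header on line 1)
awk -F, 'NR > 1 { printf "%-40s %5.1f%% of CPU request used\n", $1, 100 * $3 / $2 }' requests_vs_usage.csv | sort -k2 -n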

Real-World Impact

Cost Savings

  • Cluster size reduction: 20 nodes → 12 nodes
  • Annual savings: ~$12,000
  • Image pull time: Faster, with less resource contention on the nodes

Performance Improvements

  • RabbitMQ throughput: +40% (after CPU increase)
  • No service degradation: All SLAs maintained
  • Faster deployments: Less resource scheduling time

Operational Benefits

  • Better monitoring: Set proper alert thresholds
  • Faster autoscaling: HPA works better with accurate baselines, since its CPU target is a percentage of the request (see the sketch after this list)
  • Improved planning: Data-driven capacity planning
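
Because the HPA compares live usage to the configured request, a utilization target only means something once the request reflects reality; a minimal example (names and bounds are placeholders):

# Scale frontend between 2 and 5 replicas, targeting 70% of its (now accurate) CPU request
kubectl -n production autoscale deployment frontend --cpu-percent=70 --min=2 --max=5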

Lessons Learned

1. Most Services Are Over-Provisioned

Default to measuring first, not guessing. Engineers typically over-allocate “to be safe.”

2. Static Content Needs Minimal Resources

If it’s just serving files (frontend, nginx), start with:

  • CPU: 100m
  • Memory: 128Mi

Scale up if needed.

3. Some Services Need MORE Resources

Don’t assume everything is over-provisioned. Always check actual usage.

4. Gradual Rollout Is Key

Never change production resources all at once. Do it in phases with monitoring.

5. Document Everything

I created a spreadsheet showing:

  • Current allocation
  • Actual usage (P50, P95, P99)
  • Recommendation
  • Justification
  • Risk level

This made stakeholder buy-in easy.

The Process I Follow Now

1. Collect data (2 weeks minimum)
   ↓
2. Analyze usage patterns (identify outliers)
   ↓
3. Calculate recommendations (with buffer)
   ↓
4. Get stakeholder approval
   ↓
5. Implement in phases
   ↓
6. Monitor closely (alert if >80% of new limits)
   ↓
7. Document results

Red Flags to Watch For

  • Service using 0-5% of resources → Massively over-provisioned
  • Service using 90-100% → Under-provisioned, needs more
  • Wide variance → Unpredictable, needs investigation
  • Service using 40-70% → Well-sized

Conclusion

Cloud cost optimization isn’t about starving services of resources. It’s about:

  1. Measuring actual usage
  2. Right-sizing allocations
  3. Monitoring continuously
  4. Adjusting based on data

Our cluster went from wasteful to efficient, and services ran better (RabbitMQ throughput improved 40%).

The best part? No service outages, no performance degradation, just a leaner, faster infrastructure.

Key takeaway: If you haven’t analyzed your K8s resource usage in the last 6 months, you’re probably wasting money.
