Building a Production-Ready Kubernetes Health Check Script
Munish Thakur
“Can you write a script to monitor our K8s cluster health and send email alerts if something breaks?”
This was my task as a DevOps intern. Simple request, but I learned a lot about production monitoring patterns.
The Requirements
- Check cluster health - nodes, pods, services
- Monitor resource usage - CPU, memory across nodes
- Score the health - 0-100 points system
- Email alerts - When score drops below threshold
- No external dependencies - Just bash and kubectl
- Production-ready - Run via cron every 30 minutes
The Architecture
┌─────────────┐
│ Cron Job │
│ (*/30 * *) │
└──────┬──────┘
│
v
┌─────────────────────────────────────────┐
│ Health Check Script (latest.sh) │
│ │
│ 1. Check K8s Cluster Status │
│ 2. Check Resource Utilization │
│ 3. Check Platform Services │
│ 4. Check Network/DNS │
│ 5. Calculate Health Score (0-100) │
│ 6. Send Email if Score < 70 │
└─────────────────────────────────────────┘
│
v
┌─────────────┐
│ msmtp │ → Email to DevOps team
└─────────────┘
The Implementation
Part 1: Cluster Status Check
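Roughly, this check boils down to two kubectl calls: count nodes whose status isn't Ready, and count pods that are neither Running nor Completed. A minimal sketch (variable names are illustrative; $NAMESPACE comes from the script argument shown later under Advanced Features):

```bash
# Count nodes whose STATUS column is anything other than Ready
NOT_READY=$(kubectl get nodes --no-headers | awk '$2 != "Ready"' | wc -l)
if [ "$NOT_READY" -eq 0 ]; then
    echo "OK: All cluster nodes ready"
else
    echo "CRITICAL: $NOT_READY node(s) not ready"
fi

# Count pods that are neither Running nor Completed
FAILED_PODS=$(kubectl get pods -n "$NAMESPACE" --no-headers 2>/dev/null \
    | grep -Ev 'Running|Completed' | wc -l)
if [ "$FAILED_PODS" -gt 0 ]; then
    echo "CRITICAL: $FAILED_PODS pods in failed state (Namespace: $NAMESPACE)"
    echo "Failed Pods:"
    kubectl get pods -n "$NAMESPACE" --no-headers | grep -Ev 'Running|Completed'
fi
```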
Part 2: Resource Utilization
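This part leans on kubectl top, so metrics-server must be installed in the cluster. A sketch with illustrative warning thresholds:

```bash
# Warn when any node crosses the CPU or memory thresholds (requires metrics-server)
CPU_WARN=80
MEM_WARN=75

kubectl top nodes --no-headers | while read -r node cpu cpu_pct mem mem_pct; do
    cpu_pct=${cpu_pct%\%}   # strip the trailing % sign
    mem_pct=${mem_pct%\%}
    if [ "$mem_pct" -ge "$MEM_WARN" ]; then
        echo "WARNING: Node $node memory usage is ${mem_pct}%"
    fi
    if [ "$cpu_pct" -ge "$CPU_WARN" ]; then
        echo "WARNING: Node $node CPU usage is ${cpu_pct}%"
    fi
done
```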
Part 3: Platform Services
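The platform checks are plain systemctl queries on the node the script runs on:

```bash
# Verify the node-level services Kubernetes depends on are active
for svc in containerd kubelet; do
    if systemctl is-active --quiet "$svc"; then
        echo "OK: $svc service running"
    else
        echo "CRITICAL: $svc service is not running"
    fi
done
```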
Part 4: Health Scoring System
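Scoring starts at 100 and subtracts a weighted penalty for each problem found. The weights below are the ones discussed under Lessons Learned; the counter variables and the 90-point cutoff are illustrative:

```bash
# Start from 100 and deduct weighted penalties for each problem found
SCORE=100

[ "$NOT_READY"    -gt 0 ] && SCORE=$((SCORE - 40))   # node failure: catastrophic
[ "$FAILED_PODS"  -gt 0 ] && SCORE=$((SCORE - 30))   # failed pods: major
[ "$MEM_WARNINGS" -gt 0 ] && SCORE=$((SCORE - 25))   # memory pressure: warning

[ "$SCORE" -lt 0 ] && SCORE=0   # never go below zero

if [ "$SCORE" -lt 70 ]; then
    STATUS="CRITICAL"
elif [ "$SCORE" -lt 90 ]; then
    STATUS="WARNING"
else
    STATUS="HEALTHY"
fi
echo "PLATFORM STATUS: $STATUS ($SCORE/100)"
```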
Part 5: Email Alerts
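Alerting is just msmtp reading a sendmail-style message on stdin. A sketch that assumes the earlier checks appended their lines to $REPORT_FILE; the recipient address is a placeholder:

```bash
# Email the report when the score drops below the alert threshold
THRESHOLD=70
RECIPIENT="devops-team@example.com"   # placeholder address

if [ "$SCORE" -lt "$THRESHOLD" ]; then
    {
        echo "Subject: [CRITICAL] Platform Health Alert - $NAMESPACE (Score: $SCORE/100)"
        echo ""
        cat "$REPORT_FILE"
    } | msmtp "$RECIPIENT"
fi
```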
The Full Script
The complete script checks:
✅ Node readiness
✅ Pod health (Running, Failed, Pending states)
✅ Resource utilization (CPU, memory per node)
✅ Platform services (containerd, kubelet)
✅ Network connectivity (DNS, API server)
✅ Disk space (/ and /var/lib/kubelet) - see the sketch after this list
✅ Critical directories accessibility
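The disk and directory checks are small enough to sketch here; the 85% threshold and the choice of "critical" directories are illustrative:

```bash
# Warn when / or /var/lib/kubelet is close to full
DISK_WARN=85
for mount in / /var/lib/kubelet; do
    used=$(df -P "$mount" | awk 'NR==2 {gsub("%","",$5); print $5}')
    [ "$used" -ge "$DISK_WARN" ] && echo "WARNING: $mount is ${used}% full"
done

# Make sure the directories the node components depend on are still readable
for dir in /etc/kubernetes /var/lib/kubelet; do
    [ -r "$dir" ] || echo "CRITICAL: cannot read $dir"
done
```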
Real Production Alert Example
Here’s an actual email I received:
Subject: [CRITICAL] Platform Health Alert - production (Score: 65/100)
CLUSTER & PLATFORM HEALTH CHECK - 2025-01-14_15-30-45
Cluster Node: aks-worker-node-1
Scope: Namespace: production
===============================================
=== KUBERNETES CLUSTER STATUS ===
OK: All cluster nodes ready
CRITICAL: 3 pods in failed state (Namespace: production)
Failed Pods:
NAMESPACE NAME STATUS
production backend-django-7d8f9c-xyz CrashLoopBackOff
production worker-celery-6f8d9-abc OOMKilled
production redis-sentinel-8g7h-def Error
=== CLUSTER RESOURCE UTILIZATION ===
WARNING: Node aks-worker-2 memory usage is 78%
=== PLATFORM SERVICES STATUS ===
OK: containerd service running
OK: kubelet service running
=== PLATFORM HEALTH ASSESSMENT ===
PLATFORM STATUS: CRITICAL (65/100)
Platform requires immediate intervention
This alert woke me up (literally; alerts were set up to go to my phone), and I fixed the CrashLoopBackOff within 20 minutes.
Automation with Cron
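The crontab entry is a single line; the install path and log file below are illustrative:

```
# Run the health check every 30 minutes and keep a log
*/30 * * * * /opt/scripts/latest.sh >> /var/log/k8s-health.log 2>&1
```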
Every 30 minutes, the script:
- Checks cluster health
- Calculates score
- Sends email if score < 70
- Logs results
Advanced Features I Added
1. Namespace-Specific Monitoring
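A sketch of the argument handling; the variable names are illustrative:

```bash
# Optional first argument narrows the checks to a single namespace
NAMESPACE="${1:-}"

if [ -n "$NAMESPACE" ]; then
    POD_ARGS=(-n "$NAMESPACE")
    SCOPE="Namespace: $NAMESPACE"
else
    POD_ARGS=(--all-namespaces)
    SCOPE="All namespaces"
fi

echo "Scope: $SCOPE"
kubectl get pods "${POD_ARGS[@]}" --no-headers | grep -Ev 'Running|Completed'
```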
Usage:
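For example:

```bash
# Check a single namespace
./latest.sh production

# Or the whole cluster
./latest.sh
```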
2. Historical Tracking
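The idea is to append one line per run to a CSV so trends can be reviewed or graphed later; the file path is illustrative:

```bash
# One record per run: timestamp, scope, score
HISTORY_FILE=/var/log/k8s-health-history.csv
echo "$(date +%Y-%m-%d_%H-%M-%S),$NAMESPACE,$SCORE" >> "$HISTORY_FILE"
```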
3. Detailed Pod Analysis
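For every unhealthy pod, the alert can include the events section of kubectl describe, so recent failures are visible without opening a terminal. A sketch (the pod filter mirrors the earlier check):

```bash
# Append recent events for each pod that is not Running/Completed
kubectl get pods -n "$NAMESPACE" --no-headers \
    | grep -Ev 'Running|Completed' | awk '{print $1}' \
    | while read -r pod; do
        echo "--- $pod ---"
        kubectl describe pod "$pod" -n "$NAMESPACE" | grep -A 10 '^Events:'
    done
```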
4. Slack Integration (Bonus)
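Slack incoming webhooks accept a simple JSON POST, so this bolts on with a single curl call; the webhook URL is a placeholder and the variables come from the scoring section:

```bash
# Post a one-line summary to Slack when the score is below the threshold
SLACK_WEBHOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder

if [ "$SCORE" -lt "$THRESHOLD" ]; then
    curl -s -X POST -H 'Content-type: application/json' \
        --data "{\"text\": \"Platform health is $STATUS ($SCORE/100) - $SCOPE\"}" \
        "$SLACK_WEBHOOK"
fi
```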
Lessons Learned
1. Start Simple, Iterate
Version 1: Just checked if nodes were ready
Version 2: Added pod status checks
Version 3: Added resource monitoring
Version 4: Added health scoring
Version 5: Added email alerts
Each version added value. I didn’t try to build everything at once.
2. Health Scores Are Subjective
I weighted penalties based on business impact:
- Node failure: -40 points (catastrophic)
- Pod failure: -30 points (major)
- High memory: -25 points (warning)
Your priorities might differ. Adjust the weights.
3. Alert Fatigue Is Real
Don’t alert on every minor issue. We set the threshold at 70/100 to reduce noise.
Too many alerts = team ignores them.
4. Bash Is Powerful
You don’t always need Python/Go. For simple monitoring scripts, bash + kubectl is perfect:
- No dependencies
- Easy to modify
- Runs everywhere
- Team can read it
The Tools I Used
kubectl
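A handful of kubectl commands do the heavy lifting:

```bash
kubectl get nodes --no-headers                   # node readiness
kubectl get pods --all-namespaces --no-headers   # pod states across the cluster
kubectl top nodes                                # per-node CPU/memory (needs metrics-server)
kubectl cluster-info                             # quick API server reachability check
```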
msmtp (Lightweight Email)
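msmtp only needs a small config file; a sketch of ~/.msmtprc with placeholder host, account, and credentials:

```
# ~/.msmtprc - placeholder values, adjust for your SMTP relay
defaults
auth           on
tls            on
logfile        ~/.msmtp.log

account        default
host           smtp.example.com
port           587
from           k8s-alerts@example.com
user           k8s-alerts@example.com
passwordeval   "cat /etc/msmtp-password"
```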
systemctl
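Used to confirm the node-level services are alive:

```bash
systemctl is-active containerd   # prints "active" and exits 0 when the service is running
systemctl is-active kubelet
```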
Running in Production
The script runs on all our K8s nodes:
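Deployment is just copying the script onto each node and registering the schedule; a sketch using /etc/cron.d with illustrative paths:

```bash
# Install the script and register the cron schedule on a node
sudo install -m 0755 latest.sh /opt/scripts/latest.sh
echo '*/30 * * * * root /opt/scripts/latest.sh >> /var/log/k8s-health.log 2>&1' \
    | sudo tee /etc/cron.d/k8s-health-check
```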
It catches issues before they become incidents:
- Disk space filling up
- Memory pressure before OOMKills
- Failed pods before services degrade
- DNS issues before total outage
The Impact
After deploying this script:
- Faster incident response: Alerts often arrive before users notice
- Better visibility: Weekly health reports showed trends
- Reduced downtime: Caught 3 potential outages in first month
- Team confidence: Engineers trusted the monitoring
Improvements I’d Make
If I started over:
- ✅ Add Prometheus integration for historical data
- ✅ Graph health scores over time
- ✅ Predict failures with trend analysis
- ✅ Auto-remediation for common issues
- ✅ Mobile-friendly alert format
But honestly? The simple version works great.
Conclusion
You don’t need enterprise monitoring tools for effective cluster monitoring. A well-written bash script with:
- Clear checks
- Sensible scoring
- Timely alerts
- Easy maintenance
…can catch 90% of issues before they become critical.
Pro tip: When writing monitoring scripts, optimize for readability over cleverness. Future you (at 3am during an incident) will thank you.
The full script is ~300 lines. Small enough to understand, comprehensive enough to be useful.
Sometimes the best tools are the ones you build yourself.