Real Engineering Stories
The Memory Leak That Caused Gradual Degradation
A production incident where a memory leak in notification processing code caused a gradual memory increase over 21 days, eventually leading to pod OOM kills and service degradation. Learn about memory leak detection, bounded data structures, and monitoring memory trends.
This is a story about how a small bug—an unbounded in-memory cache—caused a memory leak that took 21 days to surface. It's also a story about why gradual problems are harder to detect than sudden failures, and how we changed our monitoring and code review processes to catch them earlier.
Context
We had a microservices architecture with a notification service that sent emails, SMS, and push notifications. The service processed about 1M notifications per day, running on Kubernetes with auto-scaling enabled.
Original Architecture:
graph TB
  Queue[Message Queue<br/>RabbitMQ] --> Worker1[Worker Pod 1]
  Queue --> Worker2[Worker Pod 2]
  Queue --> Worker3[Worker Pod 3]
  Worker1 --> Email[Email Service]
  Worker2 --> SMS[SMS Service]
  Worker3 --> Push[Push Service]
  Worker1 --> DB[(Database)]
  Worker2 --> DB
  Worker3 --> DB
Technology Choices:
- Workers: Node.js (3 pods, auto-scaling 1-10 pods)
- Queue: RabbitMQ
- Database: PostgreSQL
- Orchestration: Kubernetes
Assumptions Made:
- Memory usage would be stable
- Auto-scaling would handle traffic spikes
- Pod restarts would clear any memory issues
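For concreteness, a worker in this setup might look roughly like the sketch below. This is a minimal illustration only, assuming the amqplib RabbitMQ client; the queue name and the sendNotification helper are placeholders, not the team's actual production code.

// Hypothetical worker sketch (assumes the amqplib client; queue name is a placeholder)
const amqp = require('amqplib');

async function startWorker() {
  const connection = await amqp.connect(process.env.RABBITMQ_URL);
  const channel = await connection.createChannel();
  const queue = 'notifications';

  await channel.assertQueue(queue, { durable: true });
  channel.prefetch(10); // cap unacknowledged messages per worker

  channel.consume(queue, async (msg) => {
    if (msg === null) return;
    try {
      const notification = JSON.parse(msg.content.toString());
      await sendNotification(notification); // email, SMS, or push depending on type
      channel.ack(msg);
    } catch (err) {
      channel.nack(msg, false, true); // requeue so another worker can retry
    }
  });
}

Each pod runs one such worker process, which is why any per-process state (like an in-memory cache) grows independently in every pod.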
The Incident
Timeline:
- Day 1: Service deployed, memory usage at 200MB per pod
- Day 3: Memory usage at 300MB per pod (normal variation)
- Day 7: Memory usage at 500MB per pod (investigated, no issues found)
- Day 14: Memory usage at 800MB per pod (alert fired, investigated)
- Day 21: Memory usage at 1.2GB per pod (approaching pod memory limit of 1.5GB)
- Day 21, 3:00 PM: First pod OOM (Out of Memory) killed
- Day 21, 3:05 PM: Second pod OOM killed
- Day 21, 3:10 PM: Third pod OOM killed
- Day 21, 3:15 PM: All pods restarting, queue backing up
- Day 21, 3:20 PM: On-call engineer paged
- Day 21, 3:30 PM: Identified memory leak in notification processing
- Day 21, 4:00 PM: Hotfix deployed
- Day 21, 4:30 PM: Service recovered
Symptoms
What We Saw:
- Memory Usage: Gradual increase from 200MB to 1.2GB over 21 days
- Pod Restarts: Pods killed by Kubernetes due to OOM
- Queue Depth: Increased from 0 to 50K messages during incident
- Processing Rate: Dropped from 1000 notifications/min to 200 notifications/min
- User Impact: ~200K notifications delayed by 2-4 hours
How We Detected It:
- Memory usage alert fired on Day 14 (but investigation found "no issues")
- OOM alerts fired on Day 21 when pods started dying
- Queue depth alert fired when queue exceeded 10K messages
Monitoring Gaps:
- No memory leak detection (trending analysis)
- No alert for gradual memory increase
- No alert for pod restart frequency
Root Cause Analysis
Primary Cause: Memory leak in notification processing code.
The Bug:
// BAD CODE (simplified)
const notificationCache = new Map();

async function processNotification(notification) {
  // Store notification in cache (never cleared!)
  notificationCache.set(notification.id, notification);

  // Process notification
  await sendNotification(notification);

  // Cache never cleared, grows indefinitely
}
What Happened:
- Each processed notification was stored in an in-memory Map
- Map was never cleared (no TTL, no size limit)
- Over 21 days, each pod's Map grew to millions of entries
- Each entry consumed memory (notification object + metadata)
- Memory usage increased linearly with processed notifications
- Pods hit memory limits and were killed by Kubernetes
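To make the failure mode concrete, here is a small standalone script (illustrative only, with made-up notification objects, not the production code) that reproduces the pattern: an unbounded Map plus a steady stream of inserts yields roughly linear heap growth, which is what the per-pod memory graphs showed, just compressed from 21 days into a few seconds.

// Standalone demo of unbounded Map growth (made-up data, not production code)
const cache = new Map();
let processed = 0;
const BATCH = 100000;

function fakeNotification(i) {
  return { id: `n-${i}`, type: 'email', to: `user${i}@example.com`, body: 'x'.repeat(100) };
}

const timer = setInterval(() => {
  for (let i = 0; i < BATCH; i++) {
    const n = fakeNotification(processed++);
    cache.set(n.id, n); // never evicted, exactly like the bug above
  }
  const heapMB = (process.memoryUsage().heapUsed / 1024 / 1024).toFixed(1);
  console.log(`entries=${cache.size} heapUsed=${heapMB}MB`);
  if (processed >= 1000000) clearInterval(timer);
}, 100);

Running it shows heapUsed climbing with every batch and never coming back down, even though nothing looks obviously wrong at any single point in time.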
Why It Wasn't Caught:
- Memory increase was gradual (not sudden)
- No memory leak detection in monitoring
- Code review missed the unbounded Map
- Tests didn't run long enough to catch the leak
Contributing Factors:
- No memory profiling in production
- No alert for gradual memory trends
- No code review checklist for memory management
- Pod memory limits too high (1.5GB), which allowed the leak to grow unnoticed for weeks
Fix & Mitigation
Immediate Fix:
// FIXED CODE
const notificationCache = new Map();
const MAX_CACHE_SIZE = 1000;
const CACHE_TTL = 5 * 60 * 1000; // 5 minutes

async function processNotification(notification) {
  // Only cache if under the size limit
  if (notificationCache.size < MAX_CACHE_SIZE) {
    notificationCache.set(notification.id, {
      data: notification,
      timestamp: Date.now()
    });
  }

  // Process notification
  await sendNotification(notification);

  // Clean up expired entries
  cleanupCache();
}

function cleanupCache() {
  const now = Date.now();

  // Drop entries older than the TTL
  for (const [id, entry] of notificationCache.entries()) {
    if (now - entry.timestamp > CACHE_TTL) {
      notificationCache.delete(id);
    }
  }

  // Also enforce the size limit by evicting the oldest entries
  // (a Map iterates in insertion order, so the first key is the oldest)
  while (notificationCache.size > MAX_CACHE_SIZE) {
    const oldestKey = notificationCache.keys().next().value;
    notificationCache.delete(oldestKey);
  }
}
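One design note on the hotfix: cleanupCache() scans the whole Map on every message, which is harmless at 1,000 entries but wasteful at larger limits. An alternative, sketched below (this is not the code that shipped), is to run the TTL sweep on a timer and unref() it so the interval never keeps the process alive on its own.

// Alternative: periodic TTL sweep instead of a per-message scan (illustrative sketch)
const CLEANUP_INTERVAL = 60 * 1000; // run once a minute

const cleanupTimer = setInterval(() => {
  const now = Date.now();
  for (const [id, entry] of notificationCache.entries()) {
    if (now - entry.timestamp > CACHE_TTL) {
      notificationCache.delete(id);
    }
  }
}, CLEANUP_INTERVAL);

cleanupTimer.unref(); // don't let the timer block process shutdown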
Long-Term Improvements:
- Memory Leak Detection:
  - Added memory trend monitoring (alert if memory increases > 10% per day)
  - Added memory profiling in production (weekly heap dumps; see the sketch after this list)
  - Added per-endpoint memory usage tracking
- Code Review Process:
  - Added a checklist for memory management (bounded caches, TTLs, cleanup)
  - Added code review focus on unbounded data structures
  - Added memory leak testing in CI/CD
- Monitoring & Alerting:
  - Added an alert for pod restart frequency
  - Added an alert for gradual memory increase
  - Added per-pod memory usage tracking
- Process Improvements:
  - Reduced pod memory limits (from 1.5GB to 512MB) to catch leaks earlier
  - Added memory profiling to the deployment process
  - Created a runbook for memory leak incidents
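On the Node.js side, heap dumps and trend samples need no extra libraries. A minimal sketch of how this could be wired (an assumption about the setup, not the exact production code): write a heap snapshot when the process receives SIGUSR2, and emit a periodic heap sample that the metrics pipeline can alert on when the slope stays positive.

// Illustrative sketch: on-demand heap snapshots plus periodic memory samples
const v8 = require('v8');

// kill -USR2 <pid> writes a .heapsnapshot file (openable in Chrome DevTools);
// v8.writeHeapSnapshot() is built into Node.js 11.13+
process.on('SIGUSR2', () => {
  const file = v8.writeHeapSnapshot();
  console.log(`heap snapshot written to ${file}`);
});

// Emit a heap sample every minute; trend alerts fire on the slope, not the absolute value
setInterval(() => {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  console.log(JSON.stringify({ metric: 'memory', rss, heapUsed, heapTotal, ts: Date.now() }));
}, 60 * 1000).unref();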
Architecture After Fix
Key Changes:
- Bounded in-memory caches with TTLs
- Memory leak detection and alerting
- Lower pod memory limits (fail fast)
- Memory profiling in production
Key Lessons
- Memory leaks are gradual: They don't cause immediate failures, making them hard to detect. Monitor memory trends, not just absolute values.
- Bounded data structures: Always set limits and TTLs for in-memory caches. Unbounded growth will eventually cause OOM.
- Memory profiling matters: Regular heap dumps and profiling help catch leaks before they cause production issues.
- Lower limits fail fast: Setting lower memory limits helps catch leaks earlier, before they cause widespread issues.
- Code review for memory: Add memory management to code review checklists.
Interview Takeaways
Common Questions:
- "How do you detect memory leaks?"
- "What causes memory leaks in Node.js/Python/Java?"
- "How do you prevent memory leaks?"
What Interviewers Are Looking For:
- Understanding of memory management
- Knowledge of memory leak detection strategies
- Experience with production memory issues
- Awareness of bounded data structures
What a Senior Engineer Would Do Differently
From the Start:
- Use bounded caches: Always set size limits and TTLs
- Monitor memory trends: Alert on gradual increases, not just absolute values
- Memory profiling: Regular heap dumps and profiling
- Lower memory limits: Fail fast to catch leaks earlier
- Code review checklist: Include memory management in reviews
The Real Lesson: Memory leaks are silent killers. They don't cause immediate failures, but they will eventually bring down your service. Monitor trends, not just absolute values.
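In code terms, "use bounded caches" usually means reaching for a small reusable structure, or an off-the-shelf package such as lru-cache, instead of a raw Map. A minimal sketch of the idea, not a drop-in for the service above:

// Minimal bounded cache with a max size and TTL (illustrative; production code
// would more likely use a battle-tested package such as lru-cache)
class BoundedCache {
  constructor({ maxSize = 1000, ttlMs = 5 * 60 * 1000 } = {}) {
    this.maxSize = maxSize;
    this.ttlMs = ttlMs;
    this.map = new Map(); // insertion order doubles as eviction order
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key); // refresh position on overwrite
    this.map.set(key, { value, expires: Date.now() + this.ttlMs });
    while (this.map.size > this.maxSize) {
      this.map.delete(this.map.keys().next().value); // evict the oldest entry
    }
  }

  get(key) {
    const entry = this.map.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expires) {
      this.map.delete(key);
      return undefined;
    }
    return entry.value;
  }
}

// Usage
const recentlySent = new BoundedCache({ maxSize: 1000, ttlMs: 60 * 1000 });
recentlySent.set('notification-123', { status: 'sent' });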
FAQs
Q: What causes memory leaks in production?
A: Unbounded data structures (Maps, Arrays, Objects) that grow indefinitely. Missing cleanup code, event listeners not removed, and closures holding references are common causes.
Q: How do you detect memory leaks?
A: Monitor memory trends (alert if memory increases > 10% per day), take regular heap dumps, and profile memory usage per endpoint. Lower memory limits help catch leaks earlier.
Q: How do you prevent memory leaks?
A: Use bounded data structures with TTLs, implement cleanup code, remove event listeners, and profile memory regularly. Add memory management to code review checklists.
Q: Why are memory leaks harder to detect than other bugs?
A: Memory leaks are gradual—they don't cause immediate failures. They grow slowly over time, making them easy to miss in monitoring. You need to monitor trends, not just absolute values.
Q: Should you set lower or higher memory limits for pods?
A: Lower limits help catch leaks earlier (fail fast), but too low can cause unnecessary restarts. Balance based on your workload, but err on the side of lower limits for leak detection.
Q: How do you debug memory leaks in production?
A: Take heap dumps, analyze memory usage patterns, use memory profilers, and track memory usage per endpoint. Compare memory usage over time to identify trends.
Q: What's the difference between a memory leak and high memory usage?
A: A memory leak is unbounded growth—memory increases continuously and never decreases. High memory usage is stable but high. Leaks will eventually cause OOM, while high usage might just be inefficient.
Keep exploring
Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.