Real Engineering Stories
The Memory Leak That Caused Gradual Degradation
A production incident where a memory leak in notification processing code caused a gradual memory increase over 21 days, eventually leading to pod OOM kills and service degradation. Learn about memory leak detection, bounded data structures, and monitoring memory trends.
This is a story about how a small bug—an unbounded in-memory cache—caused a memory leak that took 21 days to surface. It's also a story about why gradual problems are harder to detect than sudden failures, and how we changed our monitoring and code review processes to catch them earlier.
Context
We had a microservices architecture with a notification service that sent emails, SMS, and push notifications. The service processed about 1M notifications per day, running on Kubernetes with auto-scaling enabled.
Original Architecture:
graph TB
  Queue[Message Queue<br/>RabbitMQ] --> Worker1[Worker Pod 1]
  Queue --> Worker2[Worker Pod 2]
  Queue --> Worker3[Worker Pod 3]
  Worker1 --> Email[Email Service]
  Worker2 --> SMS[SMS Service]
  Worker3 --> Push[Push Service]
  Worker1 --> DB[(Database)]
  Worker2 --> DB
  Worker3 --> DB
Technology Choices:
- Workers: Node.js (3 pods, auto-scaling 1-10 pods)
- Queue: RabbitMQ
- Database: PostgreSQL
- Orchestration: Kubernetes
Assumptions Made:
- Memory usage would be stable
- Auto-scaling would handle traffic spikes
- Pod restarts would clear any memory issues
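For concreteness, a worker in this setup might look roughly like the sketch below. This is a minimal illustration only, assuming the amqplib RabbitMQ client; the queue name and the sendNotification helper are placeholders, not the team's actual production code.

// Hypothetical worker sketch (assumes the amqplib client; queue name is a placeholder)
const amqp = require('amqplib');

async function startWorker() {
  const connection = await amqp.connect(process.env.RABBITMQ_URL);
  const channel = await connection.createChannel();
  const queue = 'notifications';

  await channel.assertQueue(queue, { durable: true });
  channel.prefetch(10); // cap unacknowledged messages per worker

  channel.consume(queue, async (msg) => {
    if (msg === null) return;
    try {
      const notification = JSON.parse(msg.content.toString());
      await sendNotification(notification); // email, SMS, or push depending on type
      channel.ack(msg);
    } catch (err) {
      channel.nack(msg, false, true); // requeue so another worker can retry
    }
  });
}

Each pod runs one such worker process, which is why any per-process state (like an in-memory cache) grows independently in every pod.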
The Incident
Timeline:
- Day 1: Service deployed, memory usage at 200MB per pod
- Day 3: Memory usage at 300MB per pod (normal variation)
- Day 7: Memory usage at 500MB per pod (investigated, no issues found)
- Day 14: Memory usage at 800MB per pod (alert fired, investigated)
- Day 21: Memory usage at 1.2GB per pod (approaching pod memory limit of 1.5GB)
- Day 21, 3:00 PM: First pod OOM (Out of Memory) killed
- Day 21, 3:05 PM: Second pod OOM killed
- Day 21, 3:10 PM: Third pod OOM killed
- Day 21, 3:15 PM: All pods restarting, queue backing up
- Day 21, 3:20 PM: On-call engineer paged
- Day 21, 3:30 PM: Identified memory leak in notification processing
- Day 21, 4:00 PM: Hotfix deployed
- Day 21, 4:30 PM: Service recovered
Symptoms
What We Saw:
- Memory Usage: Gradual increase from 200MB to 1.2GB over 21 days
- Pod Restarts: Pods killed by Kubernetes due to OOM
- Queue Depth: Increased from 0 to 50K messages during incident
- Processing Rate: Dropped from 1000 notifications/min to 200 notifications/min
- User Impact: ~200K notifications delayed by 2-4 hours
How We Detected It:
- Memory usage alert fired on Day 14 (but investigation found "no issues")
- OOM alerts fired on Day 21 when pods started dying
- Queue depth alert fired when queue exceeded 10K messages
Monitoring Gaps:
- No memory leak detection (trending analysis)
- No alert for gradual memory increase
- No alert for pod restart frequency
Root Cause Analysis
Primary Cause: Memory leak in notification processing code.
The Bug:
// BAD CODE (simplified)
const notificationCache = new Map();

async function processNotification(notification) {
  // Store notification in cache (never cleared!)
  notificationCache.set(notification.id, notification);

  // Process notification
  await sendNotification(notification);

  // Cache never cleared, grows indefinitely
}
What Happened:
- Each processed notification was stored in an in-memory Map
- Map was never cleared (no TTL, no size limit)
- Over 21 days, each pod's Map grew to millions of entries
- Each entry consumed memory (notification object + metadata)
- Memory usage increased linearly with processed notifications
- Pods hit memory limits and were killed by Kubernetes
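To make the failure mode concrete, here is a small standalone script (illustrative only, with made-up notification objects, not the production code) that reproduces the pattern: an unbounded Map plus a steady stream of inserts yields roughly linear heap growth, which is what the per-pod memory graphs showed, just compressed from 21 days into a few seconds.

// Standalone demo of unbounded Map growth (made-up data, not production code)
const cache = new Map();
let processed = 0;
const BATCH = 100000;

function fakeNotification(i) {
  return { id: `n-${i}`, type: 'email', to: `user${i}@example.com`, body: 'x'.repeat(100) };
}

const timer = setInterval(() => {
  for (let i = 0; i < BATCH; i++) {
    const n = fakeNotification(processed++);
    cache.set(n.id, n); // never evicted, exactly like the bug above
  }
  const heapMB = (process.memoryUsage().heapUsed / 1024 / 1024).toFixed(1);
  console.log(`entries=${cache.size} heapUsed=${heapMB}MB`);
  if (processed >= 1000000) clearInterval(timer);
}, 100);

Running it shows heapUsed climbing with every batch and never coming back down, even though nothing looks obviously wrong at any single point in time.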
Why It Wasn't Caught:
- Memory increase was gradual (not sudden)
- No memory leak detection in monitoring
- Code review missed the unbounded Map
- Tests didn't run long enough to catch the leak
Contributing Factors:
- No memory profiling in production
- No alert for gradual memory trends
- No code review checklist for memory management
- Pod memory limits too high (1.5GB), which allowed the leak to grow unnoticed for weeks
Fix & Mitigation
Immediate Fix:
// FIXED CODE
const notificationCache = new Map();
const MAX_CACHE_SIZE = 1000;
const CACHE_TTL = 5 * 60 * 1000; // 5 minutes

async function processNotification(notification) {
  // Only cache if under the size limit
  if (notificationCache.size < MAX_CACHE_SIZE) {
    notificationCache.set(notification.id, {
      data: notification,
      timestamp: Date.now()
    });
  }

  // Process notification
  await sendNotification(notification);

  // Clean up expired entries
  cleanupCache();
}

function cleanupCache() {
  const now = Date.now();

  // Drop entries older than the TTL
  for (const [id, entry] of notificationCache.entries()) {
    if (now - entry.timestamp > CACHE_TTL) {
      notificationCache.delete(id);
    }
  }

  // Also enforce the size limit by evicting the oldest entries
  // (a Map iterates in insertion order, so the first key is the oldest)
  while (notificationCache.size > MAX_CACHE_SIZE) {
    const oldestKey = notificationCache.keys().next().value;
    notificationCache.delete(oldestKey);
  }
}
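One design note on the hotfix: cleanupCache() scans the whole Map on every message, which is harmless at 1,000 entries but wasteful at larger limits. An alternative, sketched below (this is not the code that shipped), is to run the TTL sweep on a timer and unref() it so the interval never keeps the process alive on its own.

// Alternative: periodic TTL sweep instead of a per-message scan (illustrative sketch)
const CLEANUP_INTERVAL = 60 * 1000; // run once a minute

const cleanupTimer = setInterval(() => {
  const now = Date.now();
  for (const [id, entry] of notificationCache.entries()) {
    if (now - entry.timestamp > CACHE_TTL) {
      notificationCache.delete(id);
    }
  }
}, CLEANUP_INTERVAL);

cleanupTimer.unref(); // don't let the timer block process shutdown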
Long-Term Improvements:
- Memory Leak Detection:
  - Added memory trend monitoring (alert if memory increases > 10% per day)
  - Added memory profiling in production (weekly heap dumps; see the sketch after this list)
  - Added per-endpoint memory usage tracking
- Code Review Process:
  - Added a checklist for memory management (bounded caches, TTLs, cleanup)
  - Added code review focus on unbounded data structures
  - Added memory leak testing in CI/CD
- Monitoring & Alerting:
  - Added an alert for pod restart frequency
  - Added an alert for gradual memory increase
  - Added per-pod memory usage tracking
- Process Improvements:
  - Reduced pod memory limits (from 1.5GB to 512MB) to catch leaks earlier
  - Added memory profiling to the deployment process
  - Created a runbook for memory leak incidents
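On the Node.js side, heap dumps and trend samples need no extra libraries. A minimal sketch of how this could be wired (an assumption about the setup, not the exact production code): write a heap snapshot when the process receives SIGUSR2, and emit a periodic heap sample that the metrics pipeline can alert on when the slope stays positive.

// Illustrative sketch: on-demand heap snapshots plus periodic memory samples
const v8 = require('v8');

// kill -USR2 <pid> writes a .heapsnapshot file (openable in Chrome DevTools);
// v8.writeHeapSnapshot() is built into Node.js 11.13+
process.on('SIGUSR2', () => {
  const file = v8.writeHeapSnapshot();
  console.log(`heap snapshot written to ${file}`);
});

// Emit a heap sample every minute; trend alerts fire on the slope, not the absolute value
setInterval(() => {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  console.log(JSON.stringify({ metric: 'memory', rss, heapUsed, heapTotal, ts: Date.now() }));
}, 60 * 1000).unref();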
Architecture After Fix
Key Changes:
- Bounded in-memory caches with TTLs
- Memory leak detection and alerting
- Lower pod memory limits (fail fast)
- Memory profiling in production
Key Lessons
- Memory leaks are gradual: They don't cause immediate failures, making them hard to detect. Monitor memory trends, not just absolute values.
- Bounded data structures: Always set limits and TTLs for in-memory caches. Unbounded growth will eventually cause OOM.
- Memory profiling matters: Regular heap dumps and profiling help catch leaks before they cause production issues.
- Lower limits fail fast: Setting lower memory limits helps catch leaks earlier, before they cause widespread issues.
- Code review for memory: Add memory management to code review checklists.
Interview Takeaways
Common Questions:
- "How do you detect memory leaks?"
- "What causes memory leaks in Node.js/Python/Java?"
- "How do you prevent memory leaks?"
What Interviewers Are Looking For:
- Understanding of memory management
- Knowledge of memory leak detection strategies
- Experience with production memory issues
- Awareness of bounded data structures
What a Senior Engineer Would Do Differently
From the Start:
- Use bounded caches: Always set size limits and TTLs
- Monitor memory trends: Alert on gradual increases, not just absolute values
- Memory profiling: Regular heap dumps and profiling
- Lower memory limits: Fail fast to catch leaks earlier
- Code review checklist: Include memory management in reviews
The Real Lesson: Memory leaks are silent killers. They don't cause immediate failures, but they will eventually bring down your service. Monitor trends, not just absolute values.
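In code terms, "use bounded caches" usually means reaching for a small reusable structure, or an off-the-shelf package such as lru-cache, instead of a raw Map. A minimal sketch of the idea, not a drop-in for the service above:

// Minimal bounded cache with a max size and TTL (illustrative; production code
// would more likely use a battle-tested package such as lru-cache)
class BoundedCache {
  constructor({ maxSize = 1000, ttlMs = 5 * 60 * 1000 } = {}) {
    this.maxSize = maxSize;
    this.ttlMs = ttlMs;
    this.map = new Map(); // insertion order doubles as eviction order
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key); // refresh position on overwrite
    this.map.set(key, { value, expires: Date.now() + this.ttlMs });
    while (this.map.size > this.maxSize) {
      this.map.delete(this.map.keys().next().value); // evict the oldest entry
    }
  }

  get(key) {
    const entry = this.map.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expires) {
      this.map.delete(key);
      return undefined;
    }
    return entry.value;
  }
}

// Usage
const recentlySent = new BoundedCache({ maxSize: 1000, ttlMs: 60 * 1000 });
recentlySent.set('notification-123', { status: 'sent' });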
FAQs
Q: What causes memory leaks in production?
A: Unbounded data structures (Maps, Arrays, Objects) that grow indefinitely. Missing cleanup code, event listeners not removed, and closures holding references are common causes.
Q: How do you detect memory leaks?
A: Monitor memory trends (alert if memory increases > 10% per day), take regular heap dumps, and profile memory usage per endpoint. Lower memory limits help catch leaks earlier.
Q: How do you prevent memory leaks?
A: Use bounded data structures with TTLs, implement cleanup code, remove event listeners, and profile memory regularly. Add memory management to code review checklists.
Q: Why are memory leaks harder to detect than other bugs?
A: Memory leaks are gradual—they don't cause immediate failures. They grow slowly over time, making them easy to miss in monitoring. You need to monitor trends, not just absolute values.
Q: Should you set lower or higher memory limits for pods?
A: Lower limits help catch leaks earlier (fail fast), but too low can cause unnecessary restarts. Balance based on your workload, but err on the side of lower limits for leak detection.
Q: How do you debug memory leaks in production?
A: Take heap dumps, analyze memory usage patterns, use memory profilers, and track memory usage per endpoint. Compare memory usage over time to identify trends.
Q: What's the difference between a memory leak and high memory usage?
A: A memory leak is unbounded growth—memory increases continuously and never decreases. High memory usage is stable but high. Leaks will eventually cause OOM, while high usage might just be inefficient.
Keep exploring
Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.