
Real Engineering Stories

The On-Call Fatigue That Led to a Critical Bug

A production incident where on-call fatigue led an engineer to make a critical mistake during incident response, making the outage worse. Learn about on-call practices, incident response, and team health.

Beginner · 20 min read

This is a story about how on-call fatigue led to a mistake that made an outage worse. It's also about why taking care of your team matters as much as taking care of your systems, and how we learned to prevent on-call burnout.


Context

We were running a small engineering team (5 engineers) with a shared on-call rotation. Each engineer was on-call for one week, then off for four weeks. The system had occasional incidents, but nothing critical.

Original On-Call Setup:

  • Rotation: 5 engineers, 1 week on-call each
  • Coverage: 24/7 on-call coverage
  • Incidents: ~2-3 incidents per month
  • Response Time: < 15 minutes SLA
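
The original rotation is simple enough to model directly. A minimal sketch in Python (engineer names are placeholders, not the team's):

```python
from itertools import cycle

# Sketch of the original rotation: 5 engineers, one week on-call each,
# repeating. With 5 engineers, each person rests 4 weeks between shifts.
def oncall_schedule(engineers, weeks):
    rotation = cycle(engineers)
    return [next(rotation) for _ in range(weeks)]

schedule = oncall_schedule(["A", "B", "C", "D", "E"], 6)
# Week 6 wraps back to the first engineer.
```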

Assumptions Made:

  • 1 week on-call every 5 weeks was sustainable
  • Engineers could handle occasional incidents
  • On-call wasn't too stressful

The Incident

Timeline:

  • Week 1: Engineer A on-call, 2 minor incidents (resolved quickly)
  • Week 2: Engineer B on-call, 1 incident (resolved in 30 minutes)
  • Week 3: Engineer C on-call, 3 incidents (one took 2 hours)
  • Week 4: Engineer D on-call, 2 incidents (resolved quickly)
  • Week 5: Engineer E on-call (first time, new to team)
  • Week 5, Day 3, 2:00 AM: Database connection pool exhausted
  • Week 5, Day 3, 2:05 AM: Engineer E paged (woken up)
  • Week 5, Day 3, 2:10 AM: Engineer E tried to restart database (wrong action)
  • Week 5, Day 3, 2:15 AM: Database restart caused 15-minute downtime
  • Week 5, Day 3, 2:30 AM: Service still down, senior engineer called
  • Week 5, Day 3, 2:45 AM: Senior engineer identified correct fix (increase pool size)
  • Week 5, Day 3, 3:00 AM: Service recovered; total outage: 1 hour

Symptoms

What We Saw:

  • Initial Issue: Database connection pool exhausted (15-minute fix)
  • Engineer Action: Restarted database (wrong fix, made it worse)
  • Extended Outage: A 15-minute issue became a 1-hour outage
  • Team Impact: Engineer E stressed, team morale low
  • User Impact: Extended service downtime

How We Detected It:

  • Alert fired for database connection pool
  • Engineer E responded but made wrong decision
  • Senior engineer noticed extended outage and intervened

Monitoring Gaps:

  • No on-call fatigue monitoring
  • No incident response guidance
  • No escalation procedures
  • No post-incident support

Root Cause Analysis

Primary Cause: On-call fatigue and lack of incident response guidance.

What Happened:

  1. Engineer E was on-call for the first time and new to the team
  2. Paged at 2:00 AM (woken up from sleep)
  3. Tired, stressed, and unfamiliar with system
  4. Saw database connection pool error
  5. Made quick decision: restart database (seemed like reasonable fix)
  6. Database restart caused 15-minute downtime
  7. Service still down after restart (original issue not fixed)
  8. Senior engineer called, identified correct fix
  9. Total outage: 1 hour (should have been 15 minutes)
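
A toy model makes the failure mode concrete: when every pool slot is held, new checkouts time out, and restarting the database frees nothing on the application side, while a larger pool does have room. This is an illustrative sketch, not the production code (the story doesn't describe the real pool implementation):

```python
import threading
from contextlib import contextmanager

# Toy connection pool. Names and sizes are illustrative only.
class ConnectionPool:
    def __init__(self, size):
        self._slots = threading.BoundedSemaphore(size)

    @contextmanager
    def connection(self, timeout=0.05):
        # When every slot is already held, new checkouts time out:
        # this is what "connection pool exhausted" means.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("pool exhausted")
        try:
            yield object()  # stand-in for a real DB connection
        finally:
            self._slots.release()

exhausted = recovered = False

small = ConnectionPool(size=1)
small._slots.acquire()            # a long-running request holds the only slot
try:
    with small.connection():
        pass
except TimeoutError:
    exhausted = True              # restarting the DB would not free this slot
small._slots.release()

bigger = ConnectionPool(size=2)
bigger._slots.acquire()           # same long-running request...
with bigger.connection():         # ...but the larger pool still has room
    recovered = True
```

Increasing the pool size (the senior engineer's fix) addresses the exhaustion directly; a restart only drops every connection and adds downtime on top of the original problem.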

Why It Was So Bad:

  • On-call fatigue: Engineer tired, stressed, not thinking clearly
  • Lack of guidance: No runbook or incident response procedures
  • Wrong decision: Restarting database made outage worse
  • No escalation: Engineer didn't know when to escalate
  • Team impact: Engineer felt guilty, team morale affected

Contributing Factors:

  • New engineer on-call without proper training
  • No incident response runbooks
  • No escalation procedures
  • On-call rotation too frequent for small team
  • No post-incident support or debrief

Fix & Mitigation

Immediate Fix:

  1. Corrected the issue: Increased database connection pool size
  2. Service recovered: Back online after 1 hour
  3. Post-incident review: Identified on-call fatigue as root cause

Long-Term Improvements:

  1. On-Call Practices:

    • Changed rotation to 1 week on-call every 6 weeks (more rest)
    • Added primary and secondary on-call (backup support)
    • Added on-call training for new engineers
    • Added on-call compensation (extra PTO days)
  2. Incident Response:

    • Created runbooks for common incidents
    • Added escalation procedures (when to call senior engineer)
    • Added incident response training
    • Added "stop and think" checklist before major actions
  3. Team Health:

    • Added post-incident debriefs (blameless)
    • Added on-call fatigue monitoring
    • Added mental health support
    • Added on-call rotation feedback process
  4. Process Improvements:

    • Required runbook review before on-call
    • Added "test" environment for incident response practice
    • Created incident response playbook
    • Added on-call handoff procedures
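
The escalation procedure above boils down to a few questions a responder can ask before acting alone. The conditions and thresholds below are illustrative assumptions, not the team's actual policy:

```python
# Hedged sketch of a "when to call a senior engineer" rule.
# Field names and the SLA threshold are assumptions for illustration.
def should_escalate(minutes_elapsed, fix_known, getting_worse,
                    responder_rested, sla_minutes=15):
    if not fix_known:           # unsure of the fix -> ask for help
        return True
    if getting_worse:           # your actions are making things worse
        return True
    if not responder_rested:    # paged awake at 2 AM -> don't go it alone
        return True
    return minutes_elapsed > sla_minutes  # blown past the response SLA
```

By this rule, Engineer E's situation (woken at 2 AM, no runbook, unsure of the fix) would have triggered escalation immediately, before the restart.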

Architecture After Fix

Key Changes:

  • Primary and secondary on-call rotation
  • Incident response runbooks
  • Escalation procedures
  • On-call training and support
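
The primary/secondary rotation pairs each week's on-call engineer with a designated backup. A minimal sketch, assuming a simple round-robin pairing (the team's actual scheduling isn't specified):

```python
# Each week pairs a primary with the next engineer in the rotation as
# secondary backup. Names are placeholders.
def primary_secondary(engineers, weeks):
    n = len(engineers)
    return [(engineers[w % n], engineers[(w + 1) % n]) for w in range(weeks)]

# primary_secondary(["A", "B", "C", "D", "E"], 2) -> [("A", "B"), ("B", "C")]
```

A new engineer like E would serve as secondary first, with an experienced primary to shadow, before taking primary alone.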

Key Lessons

  1. On-call fatigue is real: Tired engineers make mistakes. Ensure adequate rest and rotation.

  2. Provide guidance: Runbooks and incident response procedures help engineers make better decisions under stress.

  3. Know when to escalate: Engineers should know when to call for help, not try to fix everything alone.

  4. Support your team: Post-incident debriefs, mental health support, and compensation matter.

  5. Learn from incidents: Every incident is a learning opportunity. Review and improve processes.


Interview Takeaways

Common Questions:

  • "How do you handle on-call?"
  • "What is on-call fatigue?"
  • "How do you prevent incidents?"

What Interviewers Are Looking For:

  • Understanding of on-call practices
  • Knowledge of incident response
  • Awareness of team health
  • Experience with process improvement

What a Senior Engineer Would Do Differently

From the Start:

  1. Better on-call rotation: More rest between on-call weeks
  2. Provide runbooks: Clear guidance for common incidents
  3. Add escalation procedures: Know when to call for help
  4. Train new engineers: On-call training before going on-call
  5. Support team health: Mental health support, compensation, debriefs

The Real Lesson: Taking care of your team is as important as taking care of your systems. On-call fatigue leads to mistakes, which lead to longer outages.


FAQs

Q: What is on-call fatigue?

A: On-call fatigue is exhaustion and stress from being on-call, especially when woken up frequently or dealing with stressful incidents. It leads to poor decision-making and mistakes.

Q: How do you prevent on-call fatigue?

A: Ensure adequate rest between on-call weeks, provide runbooks and guidance, add backup on-call support, compensate on-call time, and provide mental health support.
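
Fatigue monitoring can start very small, such as counting overnight pages per engineer per rotation. A hedged sketch (the night-hours window, threshold, and data shape are all assumptions):

```python
from collections import Counter
from datetime import datetime

# Flag anyone paged too often overnight (here: midnight to 6 AM).
def night_page_report(pages, limit=2):
    """pages: iterable of (engineer, datetime). Returns engineers at/over limit."""
    counts = Counter(eng for eng, ts in pages if 0 <= ts.hour < 6)
    return {eng: n for eng, n in counts.items() if n >= limit}

pages = [
    ("E", datetime(2024, 1, 3, 2, 0)),    # the 2:00 AM page from the story
    ("E", datetime(2024, 1, 4, 3, 30)),
    ("C", datetime(2024, 1, 2, 14, 0)),   # daytime page, not counted
]
# night_page_report(pages) -> {"E": 2}
```

A report like this gives the team an objective signal to swap in the secondary or adjust the rotation before fatigue causes a mistake.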

Q: How do you handle on-call for a small team?

A: Use longer rotation cycles (more rest), add primary and secondary on-call, provide clear runbooks, and know when to escalate to external support.

Q: What should be in an incident response runbook?

A: Common incidents, step-by-step resolution procedures, escalation contacts, rollback procedures, and "stop and think" checklists before major actions.

Q: When should you escalate an incident?

A: Escalate when you're unsure of the fix, when the incident is getting worse, when you're tired or stressed, or when the incident exceeds SLA.

Q: How do you support engineers after incidents?

A: Hold blameless post-incident reviews, provide mental health support, recognize their effort, and learn from incidents to improve processes.

Q: Should on-call be compensated?

A: Yes. On-call is additional work and stress. Compensate with extra PTO, bonuses, or reduced regular work hours. This shows you value your team's time and health.

Keep exploring

Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.