
Real Engineering Stories

The On-Call Fatigue That Led to a Critical Bug

A production incident where on-call fatigue led an engineer to make a critical mistake during incident response, making the outage worse. Learn about on-call practices, incident response, and team health.

Beginner · 20 min read

This is a story about how on-call fatigue led to a mistake that made an outage worse. It's also about why taking care of your team matters as much as taking care of your systems, and how we learned to prevent on-call burnout.


Context

We were running a small engineering team (5 engineers) with a shared on-call rotation. Each engineer was on-call for one week, then off for four weeks. The system had occasional incidents, but nothing critical.

Original On-Call Setup:

  • Rotation: 5 engineers, 1 week on-call each
  • Coverage: 24/7 on-call coverage
  • Incidents: ~2-3 incidents per month
  • Response Time: < 15 minutes SLA
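
The original rotation is simple enough to model directly. A minimal sketch in Python (engineer names are placeholders, not the team's):

```python
from itertools import cycle

# Sketch of the original rotation: 5 engineers, one week on-call each,
# repeating. With 5 engineers, each person rests 4 weeks between shifts.
def oncall_schedule(engineers, weeks):
    rotation = cycle(engineers)
    return [next(rotation) for _ in range(weeks)]

schedule = oncall_schedule(["A", "B", "C", "D", "E"], 6)
# Week 6 wraps back to the first engineer.
```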

Assumptions Made:

  • 1 week on-call every 5 weeks was sustainable
  • Engineers could handle occasional incidents
  • On-call wasn't too stressful

The Incident

Timeline:

  • Week 1: Engineer A on-call, 2 minor incidents (resolved quickly)
  • Week 2: Engineer B on-call, 1 incident (resolved in 30 minutes)
  • Week 3: Engineer C on-call, 3 incidents (one took 2 hours)
  • Week 4: Engineer D on-call, 2 incidents (resolved quickly)
  • Week 5: Engineer E on-call (first time, new to team)
  • Week 5, Day 3, 2:00 AM: Database connection pool exhausted
  • Week 5, Day 3, 2:05 AM: Engineer E paged (woken up)
  • Week 5, Day 3, 2:10 AM: Engineer E tried to restart database (wrong action)
  • Week 5, Day 3, 2:15 AM: Database restart caused 15-minute downtime
  • Week 5, Day 3, 2:30 AM: Service still down, senior engineer called
  • Week 5, Day 3, 2:45 AM: Senior engineer identified correct fix (increase pool size)
  • Week 5, Day 3, 3:00 AM: Service recovered; total outage: 1 hour

Symptoms

What We Saw:

  • Initial Issue: Database connection pool exhausted (15-minute fix)
  • Engineer Action: Restarted database (wrong fix, made it worse)
  • Extended Outage: A 15-minute issue became a 1-hour outage
  • Team Impact: Engineer E stressed, team morale low
  • User Impact: Extended service downtime

How We Detected It:

  • Alert fired for database connection pool
  • Engineer E responded but made wrong decision
  • Senior engineer noticed extended outage and intervened

Monitoring Gaps:

  • No on-call fatigue monitoring
  • No incident response guidance
  • No escalation procedures
  • No post-incident support

Root Cause Analysis

Primary Cause: On-call fatigue and lack of incident response guidance.

What Happened:

  1. Engineer E was on-call for the first time and new to the team
  2. Paged at 2:00 AM (woken up from sleep)
  3. Tired, stressed, and unfamiliar with system
  4. Saw database connection pool error
  5. Made quick decision: restart database (seemed like reasonable fix)
  6. Database restart caused 15-minute downtime
  7. Service still down after restart (original issue not fixed)
  8. Senior engineer called, identified correct fix
  9. Total outage: 1 hour (should have been 15 minutes)
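
A toy model makes the failure mode concrete: when every pool slot is held, new checkouts time out, and restarting the database frees nothing on the application side, while a larger pool does have room. This is an illustrative sketch, not the production code (the story doesn't describe the real pool implementation):

```python
import threading
from contextlib import contextmanager

# Toy connection pool. Names and sizes are illustrative only.
class ConnectionPool:
    def __init__(self, size):
        self._slots = threading.BoundedSemaphore(size)

    @contextmanager
    def connection(self, timeout=0.05):
        # When every slot is already held, new checkouts time out:
        # this is what "connection pool exhausted" means.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("pool exhausted")
        try:
            yield object()  # stand-in for a real DB connection
        finally:
            self._slots.release()

exhausted = recovered = False

small = ConnectionPool(size=1)
small._slots.acquire()            # a long-running request holds the only slot
try:
    with small.connection():
        pass
except TimeoutError:
    exhausted = True              # restarting the DB would not free this slot
small._slots.release()

bigger = ConnectionPool(size=2)
bigger._slots.acquire()           # same long-running request...
with bigger.connection():         # ...but the larger pool still has room
    recovered = True
```

Increasing the pool size (the senior engineer's fix) addresses the exhaustion directly; a restart only drops every connection and adds downtime on top of the original problem.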

Why It Was So Bad:

  • On-call fatigue: Engineer tired, stressed, not thinking clearly
  • Lack of guidance: No runbook or incident response procedures
  • Wrong decision: Restarting database made outage worse
  • No escalation: Engineer didn't know when to escalate
  • Team impact: Engineer felt guilty, team morale affected

Contributing Factors:

  • New engineer on-call without proper training
  • No incident response runbooks
  • No escalation procedures
  • On-call rotation too frequent for small team
  • No post-incident support or debrief

Fix & Mitigation

Immediate Fix:

  1. Corrected the issue: Increased database connection pool size
  2. Service recovered: Back online after 1 hour
  3. Post-incident review: Identified on-call fatigue as root cause

Long-Term Improvements:

  1. On-Call Practices:

    • Changed rotation to 1 week on-call every 6 weeks (more rest)
    • Added primary and secondary on-call (backup support)
    • Added on-call training for new engineers
    • Added on-call compensation (extra PTO days)
  2. Incident Response:

    • Created runbooks for common incidents
    • Added escalation procedures (when to call senior engineer)
    • Added incident response training
    • Added "stop and think" checklist before major actions
  3. Team Health:

    • Added post-incident debriefs (blameless)
    • Added on-call fatigue monitoring
    • Added mental health support
    • Added on-call rotation feedback process
  4. Process Improvements:

    • Required runbook review before on-call
    • Added "test" environment for incident response practice
    • Created incident response playbook
    • Added on-call handoff procedures
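
The escalation procedure above boils down to a few questions a responder can ask before acting alone. The conditions and thresholds below are illustrative assumptions, not the team's actual policy:

```python
# Hedged sketch of a "when to call a senior engineer" rule.
# Field names and the SLA threshold are assumptions for illustration.
def should_escalate(minutes_elapsed, fix_known, getting_worse,
                    responder_rested, sla_minutes=15):
    if not fix_known:           # unsure of the fix -> ask for help
        return True
    if getting_worse:           # your actions are making things worse
        return True
    if not responder_rested:    # paged awake at 2 AM -> don't go it alone
        return True
    return minutes_elapsed > sla_minutes  # blown past the response SLA
```

By this rule, Engineer E's situation (woken at 2 AM, no runbook, unsure of the fix) would have triggered escalation immediately, before the restart.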

Architecture After Fix

Key Changes:

  • Primary and secondary on-call rotation
  • Incident response runbooks
  • Escalation procedures
  • On-call training and support
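
The primary/secondary rotation pairs each week's on-call engineer with a designated backup. A minimal sketch, assuming a simple round-robin pairing (the team's actual scheduling isn't specified):

```python
# Each week pairs a primary with the next engineer in the rotation as
# secondary backup. Names are placeholders.
def primary_secondary(engineers, weeks):
    n = len(engineers)
    return [(engineers[w % n], engineers[(w + 1) % n]) for w in range(weeks)]

# primary_secondary(["A", "B", "C", "D", "E"], 2) -> [("A", "B"), ("B", "C")]
```

A new engineer like E would serve as secondary first, with an experienced primary to shadow, before taking primary alone.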

Key Lessons

  1. On-call fatigue is real: Tired engineers make mistakes. Ensure adequate rest and rotation.

  2. Provide guidance: Runbooks and incident response procedures help engineers make better decisions under stress.

  3. Know when to escalate: Engineers should know when to call for help, not try to fix everything alone.

  4. Support your team: Post-incident debriefs, mental health support, and compensation matter.

  5. Learn from incidents: Every incident is a learning opportunity. Review and improve processes.


Interview Takeaways

Common Questions:

  • "How do you handle on-call?"
  • "What is on-call fatigue?"
  • "How do you prevent incidents?"

What Interviewers Are Looking For:

  • Understanding of on-call practices
  • Knowledge of incident response
  • Awareness of team health
  • Experience with process improvement

What a Senior Engineer Would Do Differently

From the Start:

  1. Better on-call rotation: More rest between on-call weeks
  2. Provide runbooks: Clear guidance for common incidents
  3. Add escalation procedures: Know when to call for help
  4. Train new engineers: On-call training before going on-call
  5. Support team health: Mental health support, compensation, debriefs

The Real Lesson: Taking care of your team is as important as taking care of your systems. On-call fatigue leads to mistakes, which lead to longer outages.


FAQs

Q: What is on-call fatigue?

A: On-call fatigue is exhaustion and stress from being on-call, especially when woken up frequently or dealing with stressful incidents. It leads to poor decision-making and mistakes.

Q: How do you prevent on-call fatigue?

A: Ensure adequate rest between on-call weeks, provide runbooks and guidance, add backup on-call support, compensate on-call time, and provide mental health support.
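
Fatigue monitoring can start very small, such as counting overnight pages per engineer per rotation. A hedged sketch (the night-hours window, threshold, and data shape are all assumptions):

```python
from collections import Counter
from datetime import datetime

# Flag anyone paged too often overnight (here: midnight to 6 AM).
def night_page_report(pages, limit=2):
    """pages: iterable of (engineer, datetime). Returns engineers at/over limit."""
    counts = Counter(eng for eng, ts in pages if 0 <= ts.hour < 6)
    return {eng: n for eng, n in counts.items() if n >= limit}

pages = [
    ("E", datetime(2024, 1, 3, 2, 0)),    # the 2:00 AM page from the story
    ("E", datetime(2024, 1, 4, 3, 30)),
    ("C", datetime(2024, 1, 2, 14, 0)),   # daytime page, not counted
]
# night_page_report(pages) -> {"E": 2}
```

A report like this gives the team an objective signal to swap in the secondary or adjust the rotation before fatigue causes a mistake.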

Q: How do you handle on-call for a small team?

A: Use longer rotation cycles (more rest), add primary and secondary on-call, provide clear runbooks, and know when to escalate to external support.

Q: What should be in an incident response runbook?

A: Common incidents, step-by-step resolution procedures, escalation contacts, rollback procedures, and "stop and think" checklists before major actions.

Q: When should you escalate an incident?

A: Escalate when you're unsure of the fix, when the incident is getting worse, when you're tired or stressed, or when the incident exceeds SLA.

Q: How do you support engineers after incidents?

A: Hold blameless post-incident reviews, provide mental health support, recognize their effort, and learn from incidents to improve processes.

Q: Should on-call be compensated?

A: Yes. On-call is additional work and stress. Compensate with extra PTO, bonuses, or reduced regular work hours. This shows you value your team's time and health.

Keep exploring

Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.