Real Engineering Stories
The N+1 Query Problem That Slowed Down Our API
A production incident where an N+1 query problem in a user feed endpoint caused database load to spike, slowing down the entire API. Learn about N+1 queries, eager loading, and query optimization.
This is a story about how a seemingly simple feature addition—showing user avatars in a feed—caused our API to slow down from 100ms to 10+ seconds. It's also about why understanding database query patterns matters, and how we learned to profile queries before deploying.
Context
We were running a social media API with a feed endpoint that returned posts from users you follow. The system handled about 2M feed requests per day. Initially, the feed only showed post content and timestamps.
Original Architecture:
graph TB
  Client[Client] --> API[API Server]
  API --> Cache[Redis Cache]
  API --> DB[(PostgreSQL Database)]
  Cache --> DB
Technology Choices:
- API: Node.js with Express
- Database: PostgreSQL with connection pooling
- Cache: Redis for feed data
- ORM: Sequelize
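For reference, here is a minimal sketch of the Sequelize setup assumed by the code samples later in this story. The model names, columns, connection handling, and pool sizes are illustrative, not the production schema:

const { Sequelize, DataTypes } = require('sequelize');

// Connection details are illustrative; the real service used pooled PostgreSQL connections.
const sequelize = new Sequelize(process.env.DATABASE_URL, {
  dialect: 'postgres',
  pool: { max: 20, min: 2 },
});

// Hypothetical models matching the later code samples.
const User = sequelize.define('User', {
  name: DataTypes.STRING,
  avatar: DataTypes.STRING,
}, { underscored: true });

const Post = sequelize.define('Post', {
  content: DataTypes.TEXT,
  userId: DataTypes.INTEGER,
}, { underscored: true });

// Each post belongs to its author; this association is what eager loading relies on.
Post.belongsTo(User, { foreignKey: 'userId' });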
Assumptions Made:
- Feed queries would be fast (indexed user_id and created_at)
- Database could handle feed request load
- Adding user avatars wouldn't significantly impact performance
The Incident
Timeline:
- Day 1: Feature deployed: show user avatars in feed
- Day 2: API response time increased from 100ms to 500ms (noticed but not investigated)
- Day 3: Response time increased to 2 seconds (alert fired)
- Day 4: Response time increased to 5 seconds (investigation started)
- Day 5: Response time increased to 10+ seconds (service degradation)
- Day 5, 2:00 PM: Database CPU usage at 90%
- Day 5, 2:15 PM: On-call engineer paged
- Day 5, 2:30 PM: Identified N+1 query problem
- Day 5, 3:00 PM: Hotfix deployed (eager loading)
- Day 5, 3:15 PM: Response time back to 100ms
Symptoms
What We Saw:
- Response Time: Increased from 100ms to 10+ seconds over 5 days
- Database Queries: Increased from 1 query per request to 1000+ queries per request
- Database CPU: Spiked from 20% to 90%
- Error Rate: Increased from 0.1% to 5% (timeouts)
- User Impact: ~100K feed requests failed or timed out
How We Detected It:
- Alert fired when response time exceeded 2 seconds
- Database monitoring showed high query count
- Slow query log showed thousands of user lookup queries
Monitoring Gaps:
- No alert for query count per request
- No alert for N+1 query patterns
- No query profiling in CI/CD
Root Cause Analysis
Primary Cause: N+1 query problem in feed endpoint.
The Bug:
// BAD CODE (simplified)
async function getFeed(userId) {
  // 1 query: Get posts
  const posts = await Post.findAll({
    where: { userId: userId },
    limit: 50
  });

  // N queries: Get user for each post (N+1 problem!)
  for (const post of posts) {
    post.user = await User.findByPk(post.userId); // 50 queries!
  }

  return posts;
}
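A quick way to make this pattern visible in development is Sequelize's built-in query logging. A minimal sketch, with the counter and connection handling as illustrative stand-ins:

// Log every statement this (illustrative) connection executes.
let queryCount = 0;
const sequelize = new Sequelize(process.env.DATABASE_URL, {
  logging: (sql) => {
    queryCount += 1;
    console.log(`[query ${queryCount}] ${sql}`);
  },
});

// Calling getFeed(userId) now prints one SELECT on the posts table followed by
// roughly 50 single-row SELECTs on the users table -- the classic N+1 signature.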
What Happened:
- Feed endpoint fetched 50 posts (1 query)
- For each post, it then fetched the user to get the avatar (50 more queries)
- Total: 51 queries per feed request
- With 2M requests/day, that's ~23 feed requests/second
- At 51 queries per request, that's ~1,173 queries/second just for feeds
- Database couldn't handle the load
- Queries queued up, causing timeouts
Why It Was So Bad:
- No eager loading: Users weren't loaded with posts
- No query profiling: We didn't know about the N+1 problem
- Gradual degradation: Performance worsened over days, not immediately
- No query monitoring: We didn't track queries per request
Contributing Factors:
- ORM made it easy to write N+1 queries
- No code review checklist for database queries
- No query profiling in development
- Feature deployed without performance testing
Fix & Mitigation
Immediate Fix:
// FIXED CODE
async function getFeed(userId) {
  // 1 query: Get posts with users (eager loading)
  const posts = await Post.findAll({
    where: { userId: userId },
    include: [{ model: User, attributes: ['id', 'name', 'avatar'] }],
    limit: 50
  });

  // No additional queries needed!
  return posts;
}
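Callers barely change; the author simply arrives with each post. A usage sketch (the exact property name depends on how the association is aliased, so treat post.User here as an assumption):

async function renderFeed(userId) {
  const posts = await getFeed(userId);
  return posts.map((post) => {
    // With a default Post.belongsTo(User) association, Sequelize attaches the
    // eager-loaded record as post.User; with an `as: 'user'` alias it would be post.user.
    const author = post.User;
    return { content: post.content, authorName: author.name, avatar: author.avatar };
  });
}

Under the hood, the include becomes a single SELECT with a LEFT OUTER JOIN onto the users table, so the endpoint issues one query no matter how many posts come back.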
Long-Term Improvements:
- Query Optimization:
  - Added eager loading for all associations
  - Added query profiling in development
  - Added query count limits per request (see the middleware sketch after this list)
- Monitoring & Alerting:
  - Added alert for query count per request (alert if > 10)
  - Added slow query log analysis
  - Added database query rate monitoring
- Code Review Process:
  - Added checklist for database queries (eager loading, indexes, N+1)
  - Added query profiling to code review
  - Added performance testing for database-heavy endpoints
- Process Improvements:
  - Required query profiling before deploying database changes
  - Added performance benchmarks for critical endpoints
  - Created runbook for query optimization
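Here is a sketch of what the per-request query limit could look like, assuming the Express and Sequelize setup from earlier; the threshold, context handling, and log output are illustrative stand-ins for a real metrics pipeline:

const { AsyncLocalStorage } = require('async_hooks');
const { Sequelize } = require('sequelize');

const requestContext = new AsyncLocalStorage();
const QUERY_LIMIT = 10; // illustrative per-request threshold matching the alert rule

// Count every statement Sequelize executes against the current request's context.
const sequelize = new Sequelize(process.env.DATABASE_URL, {
  logging: () => {
    const ctx = requestContext.getStore();
    if (ctx) ctx.queries += 1;
  },
});

// Express middleware: give each request its own counter and flag offenders.
function queryBudget(req, res, next) {
  const ctx = { queries: 0 };
  res.on('finish', () => {
    if (ctx.queries > QUERY_LIMIT) {
      // In production you'd emit a metric that feeds the alert; here we just log.
      console.warn(`${req.method} ${req.path} ran ${ctx.queries} queries`);
    }
  });
  requestContext.run(ctx, next);
}

// Wired in before the routes, e.g. app.use(queryBudget);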
Architecture After Fix
Key Changes:
- Eager loading for all database associations
- Query profiling in development and staging
- Query count monitoring and alerting
- Performance testing in CI/CD
Key Lessons
- N+1 queries are silent killers: They don't cause immediate failures, but they will slow down your API. Always use eager loading.
- Profile queries before deploying: Know how many queries your endpoints execute. Set limits and alerts (a CI-style sketch follows this list).
- Monitor query patterns: Track queries per request, not just total queries. N+1 problems show up as high query counts.
- ORMs make it easy to write bad queries: ORMs are convenient, but they can hide performance problems. Always check the generated SQL.
- Performance testing matters: Test database-heavy endpoints under load before deploying.
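One lightweight way to enforce a query budget before deploying is to capture every statement a test connection emits and assert on the count. A sketch, assuming a test-only Sequelize instance and the getFeed function above; the budget, environment variable, and user id are illustrative:

const assert = require('node:assert');
const { Sequelize } = require('sequelize');

// A test-only connection whose logger captures every statement it runs.
const captured = [];
const sequelize = new Sequelize(process.env.TEST_DATABASE_URL, {
  logging: (sql) => captured.push(sql),
});

async function assertQueryBudget(fn, budget) {
  captured.length = 0; // reset the capture buffer
  await fn();          // exercise the endpoint or query function under test
  assert.ok(
    captured.length <= budget,
    `expected <= ${budget} queries, got ${captured.length}:\n${captured.join('\n')}`
  );
}

// Example: the fixed feed query should stay at one query regardless of page size.
// await assertQueryBudget(() => getFeed(42), 1);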
Interview Takeaways
Common Questions:
- "What is the N+1 query problem?"
- "How do you optimize database queries?"
- "How do you prevent N+1 queries?"
What Interviewers Are Looking For:
- Understanding of N+1 query problem
- Knowledge of eager loading and query optimization
- Experience with database performance issues
- Awareness of ORM pitfalls
What a Senior Engineer Would Do Differently
From the Start:
- Use eager loading: Always load associations in the initial query
- Profile queries: Know how many queries each endpoint executes
- Monitor query patterns: Alert on high query counts per request
- Performance test: Test database-heavy endpoints under load
- Code review for queries: Add query optimization to code review checklist
The Real Lesson: ORMs make it easy to write code, but they also make it easy to write slow code. Always check the generated SQL and profile your queries.
FAQs
Q: What is the N+1 query problem?
A: The N+1 query problem occurs when you fetch N records, then make N additional queries to fetch related data. For example, fetching 50 posts, then making 50 queries to fetch each post's user. This should be done in 1 query with eager loading.
Q: How do you prevent N+1 queries?
A: Use eager loading to fetch associations in the initial query. In Sequelize, use include. In raw SQL, use JOINs. Always load related data in one query, not in a loop.
Q: How do you detect N+1 queries?
A: Monitor query count per request. If one request executes 100+ queries, you likely have an N+1 problem. Use query profiling tools to identify the pattern.
Q: Are ORMs bad for performance?
A: ORMs aren't inherently bad, but they can hide performance problems. Always check the generated SQL and profile queries. Use eager loading and avoid lazy loading in loops.
Q: How do you optimize slow queries?
A: Use indexes, eager loading, query caching, and pagination. Profile queries to identify bottlenecks. Consider denormalization for read-heavy workloads.
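For the indexing piece, a sketch of what that could look like as a Sequelize migration; the table name, column names, and index name are the illustrative ones used throughout this story:

// migrations/add-posts-user-created-index.js (illustrative)
module.exports = {
  async up(queryInterface) {
    // Supports the feed query: WHERE user_id = ? ORDER BY created_at DESC LIMIT 50
    await queryInterface.addIndex('Posts', ['user_id', 'created_at'], {
      name: 'posts_user_id_created_at_idx',
    });
  },
  async down(queryInterface) {
    await queryInterface.removeIndex('Posts', 'posts_user_id_created_at_idx');
  },
};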
Q: Should you always use eager loading?
A: Not always. Eager loading is good for frequently accessed associations. For rarely accessed data, lazy loading might be fine. The key is to avoid loading in loops.
Q: How do you test query performance?
A: Use query profiling tools, load testing, and database monitoring. Set up alerts for slow queries and high query counts. Test in staging with production-like data volumes.
Keep exploring
Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.