
Case Studies in Design Thinking

Learn from real-world systems: Instagram's sharded MySQL, WhatsApp's Erlang architecture, Uber's dynamic pricing, Slack's WebSocket management, and Netflix's microservices migration.

Advanced · 30 min read

Case studies show how real companies weighed trade-offs, why they chose one architecture over another, and what failures shaped their decisions.


Why Case Studies Matter

Studying real-world systems teaches you:

  • How trade-offs are made in practice: Not just theory, but real decisions
  • Why certain architectures were chosen: The reasoning behind choices
  • What failures shaped decisions: Learning from mistakes
  • How systems evolve: From MVP to scale

Case Study 1: Instagram's Sharded MySQL

The Problem

Instagram started with a single PostgreSQL database. As they grew to millions of users, they hit scaling limits:

  • Database couldn't handle the load
  • Writes were slow
  • Reads were slow
  • Single point of failure

The Solution: Sharded MySQL

Architecture:

  • Sharded MySQL (thousands of shards)
  • Each shard handles a subset of users
  • Sharding by user ID (hash-based)
  • Read replicas for each shard
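The hash-based user-to-shard mapping described above can be sketched in a few lines. This is a minimal illustration, not Instagram's actual code; the shard count and hash choice are assumptions made for the example.

```python
import hashlib

NUM_SHARDS = 4096  # illustrative; the real system used thousands of logical shards

def shard_for_user(user_id: int) -> int:
    """Map a user ID to a shard with a stable hash, so the same user
    always lands on the same shard across requests."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

Because all of a user's rows live on one shard, single-user queries stay local; the cross-shard queries flagged in the trade-offs below are exactly the ones this mapping makes hard. In practice, many logical shards are mapped onto fewer physical servers so shards can be moved without rehashing every user.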

Why MySQL over PostgreSQL?

  • Better tooling for sharding
  • More mature replication
  • Better performance at scale
  • Larger community

Trade-offs:

  • ✅ Horizontal scaling (can add shards)
  • ✅ High availability (shard failure doesn't affect all users)
  • ✅ Better performance (smaller databases)
  • ❌ Complex sharding logic
  • ❌ Cross-shard queries difficult
  • ❌ Data migration complexity

Key Learnings

  1. Start simple, scale when needed: Started with single database, sharded when needed
  2. Sharding is hard: Requires careful planning, data migration, application changes
  3. Choose technology for scale: MySQL chosen for sharding tooling, not just performance
  4. Design for sharding from day 1: Even if you don't shard initially, design with sharding in mind

Case Study 2: WhatsApp's Erlang Architecture

The Problem

WhatsApp needed to handle:

  • Billions of messages per day
  • Real-time delivery
  • Low latency (< 100ms)
  • High availability (99.99%)
  • Small team (50 engineers)

The Solution: Erlang/OTP

Architecture:

  • Erlang/OTP for backend
  • One lightweight process per user (process-per-user pattern)
  • Message queue per user
  • Minimal infrastructure (few servers)

Why Erlang?

  • Lightweight processes: Millions of concurrent processes
  • Fault tolerance: "Let it crash" philosophy
  • Hot code reloading: Update without downtime
  • Built-in message passing: Perfect for messaging systems
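The process-per-user idea can be approximated in other runtimes. Here is a minimal asyncio sketch in Python: one task and one queue per user, loosely mirroring an Erlang process with a mailbox. It is an analogy for the pattern, not WhatsApp's code, and the names are invented for the example.

```python
import asyncio

async def user_mailbox(user_id, inbox, delivered):
    """One lightweight task per user, draining that user's own queue.
    A crash here doesn't take down other users' mailboxes."""
    while True:
        msg = await inbox.get()
        if msg is None:  # sentinel shuts this mailbox down
            break
        delivered.append((user_id, msg))

async def demo():
    delivered = []
    inbox = asyncio.Queue()
    task = asyncio.create_task(user_mailbox("alice", inbox, delivered))
    await inbox.put("hello")
    await inbox.put(None)
    await task
    return delivered

result = asyncio.run(demo())
```

Isolation between tasks is a much weaker cousin of Erlang's "let it crash" guarantees, but the mental model (one concurrent unit per user, each with its own message queue) carries over.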

Trade-offs:

  • ✅ Handles millions of concurrent connections
  • ✅ Fault tolerant (process crash doesn't affect others)
  • ✅ Low latency (in-memory message passing)
  • ✅ Small team (Erlang is productive)
  • ❌ Smaller talent pool (harder to hire)
  • ❌ Less ecosystem (fewer libraries)
  • ❌ Learning curve (different paradigm)

Key Learnings

  1. Choose technology for the problem: Erlang perfect for messaging (concurrent processes, fault tolerance)
  2. Small team can build at scale: Right technology + right architecture = small team success
  3. Fault tolerance is built-in: Erlang's "let it crash" philosophy enables resilience
  4. Process per user pattern: Simple mental model, scales naturally

Case Study 3: Uber's Dynamic Pricing System

The Problem

Uber needed to:

  • Adjust prices based on demand/supply
  • Update prices in real-time
  • Handle millions of price calculations per second
  • Ensure consistency (same price for same conditions)

The Solution: Event-Driven Architecture

Architecture:

  • Event stream (Kafka) for demand/supply events
  • Pricing service consumes events
  • Real-time price calculation
  • Cache prices (Redis) for fast reads
  • Database for price history

Why Event-Driven?

  • Real-time updates: Events trigger price recalculation
  • Scalability: Multiple services can consume events
  • Decoupling: Services don't need to know about each other
  • Replayability: Can replay events for debugging
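A stripped-down sketch of that pricing loop, with a dict standing in for the Redis cache. The multiplier formula, cap, and field names are invented for illustration; Uber's actual model is far more sophisticated.

```python
price_cache = {}  # stands in for Redis: zone -> current fare
BASE_FARE = 10.0

def surge_multiplier(demand: int, supply: int) -> float:
    """More riders than drivers pushes the multiplier above 1.0, capped at 3.0."""
    if supply == 0:
        return 3.0
    return min(3.0, max(1.0, demand / supply))

def on_event(event: dict) -> None:
    """Consume a demand/supply event and refresh the cached fare for its zone."""
    mult = surge_multiplier(event["demand"], event["supply"])
    price_cache[event["zone"]] = round(BASE_FARE * mult, 2)

on_event({"zone": "downtown", "demand": 150, "supply": 50})
```

Reads hit the cache directly; only events trigger recomputation. That is the eventual-consistency trade-off in the list below: a rider may briefly see a fare computed from slightly stale demand data.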

Trade-offs:

  • ✅ Real-time price updates
  • ✅ Scalable (multiple consumers)
  • ✅ Decoupled services
  • ❌ Eventual consistency (prices may be slightly stale)
  • ❌ Complex event processing
  • ❌ Debugging challenges

Key Learnings

  1. Event-driven for real-time: Events enable real-time price updates
  2. Cache for performance: Cache prices for fast reads, recalculate on events
  3. Accept eventual consistency: Prices may be slightly stale, but acceptable for user experience
  4. Design for scale: Event stream handles millions of events per second

Case Study 4: Slack's WebSocket Management

The Problem

Slack needed to:

  • Handle millions of concurrent WebSocket connections
  • Deliver messages in real-time
  • Scale horizontally
  • Handle connection failures gracefully

The Solution: WebSocket Gateway + Message Queue

Architecture:

  • WebSocket gateway (multiple instances)
  • Message queue (RabbitMQ/Kafka) for message routing
  • Presence service (track online/offline users)
  • Database for message persistence

Why This Architecture?

  • WebSocket gateway: Handles connections, scales horizontally
  • Message queue: Routes messages to correct gateway instance
  • Presence service: Tracks which gateway has which user
  • Database: Persists messages for offline users
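The routing step reduces to a presence lookup. A minimal sketch, with a dict standing in for the presence service and hypothetical gateway IDs:

```python
from typing import Optional

presence = {}  # user_id -> gateway_id, standing in for the presence service

def register(user_id: str, gateway_id: str) -> None:
    """Called when a user's WebSocket connection lands on a gateway instance."""
    presence[user_id] = gateway_id

def route(user_id: str) -> Optional[str]:
    """Return the gateway to publish to, or None if the user is offline
    (in which case the message is persisted for later delivery)."""
    return presence.get(user_id)

register("alice", "gateway-3")
```

The real service must also handle the hard part this sketch skips: expiring entries when a gateway dies, and re-registering users as they reconnect to a different instance.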

Trade-offs:

  • ✅ Horizontal scaling (add gateway instances)
  • ✅ Fault tolerant (gateway failure doesn't affect all users)
  • ✅ Real-time delivery (WebSocket)
  • ❌ Complex routing (need to find correct gateway)
  • ❌ Connection state management
  • ❌ Message queue overhead

Key Learnings

  1. Gateway pattern for WebSockets: Gateway handles connections, scales horizontally
  2. Message queue for routing: Routes messages to correct gateway instance
  3. Presence service for state: Tracks which gateway has which user
  4. Design for connection failures: Handle reconnections, message delivery

Case Study 5: Netflix's Microservices Migration

The Problem

Netflix started as a monolith. As they grew, they faced:

  • Deployment bottlenecks (all teams had to deploy together)
  • Scaling challenges (services couldn't scale independently)
  • Technology constraints (locked into one stack)
  • Team coordination overhead (every team worked in the same codebase)

The Solution: Microservices

Architecture:

  • Hundreds of microservices
  • Each service owns a domain (video, recommendations, billing)
  • Independent deployments
  • Service mesh for communication
  • API gateway for external access

Why Microservices?

  • Independent scaling: Scale services based on demand
  • Technology freedom: Each service can use different stack
  • Team autonomy: Teams work independently
  • Fault isolation: Service failure doesn't affect others

Trade-offs:

  • ✅ Independent scaling
  • ✅ Technology freedom
  • ✅ Team autonomy
  • ✅ Fault isolation
  • ❌ Increased complexity
  • ❌ Network latency
  • ❌ Distributed system challenges
  • ❌ Operational overhead

The Migration

Phase 1: Strangler Pattern

  • Build new services alongside monolith
  • Gradually migrate functionality
  • Keep monolith running

Phase 2: Service Extraction

  • Extract services one by one
  • Migrate users gradually
  • Monitor and optimize

Phase 3: Full Migration

  • All functionality in microservices
  • Decommission monolith
  • Optimize and scale
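The gradual migration in the phases above can be sketched as a router that sends a growing percentage of traffic per path to the new services, with the rest still hitting the monolith. Paths and rollout percentages here are illustrative, not Netflix's.

```python
# Percent of traffic per path that has been moved off the monolith.
MIGRATED = {"/recommendations": 100, "/billing": 25}

def route_request(path: str, user_id: int) -> str:
    """Deterministically bucket users so the same user always gets the
    same backend at a given rollout percentage (easier to debug and roll back)."""
    rollout = MIGRATED.get(path, 0)
    return "microservice" if user_id % 100 < rollout else "monolith"
```

Raising a path's percentage migrates more users; setting it back to 0 is the rollback plan. Unlisted paths default to the monolith, which is what keeps the strangler pattern safe: nothing moves until you explicitly move it.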

Key Learnings

  1. Strangler pattern for migration: Build new alongside old, migrate gradually
  2. Start with high-value services: Extract services that provide most value first
  3. Independent scaling is powerful: Scale services based on demand (video vs billing)
  4. Complexity is real: Microservices add complexity, but provide benefits at scale
  5. Team autonomy matters: Teams can move faster when independent

Thinking Aloud Like a Senior Engineer

Let me walk you through how I'd actually learn from a case study and apply it to my own problem. This is the real-time reasoning that happens when you're trying to extract useful lessons.

Problem: "I'm designing a messaging system that needs to handle 1B messages/day. I read that WhatsApp used Erlang. Should I use Erlang too?"

My first instinct: "Yes! If it worked for WhatsApp, it'll work for me. Erlang it is!"

But wait—that's copying without understanding. Let me think about what WhatsApp actually did and why.

What was WhatsApp's problem?

  • Billions of messages per day
  • Real-time delivery
  • Low latency (< 100ms)
  • High availability (99.99%)
  • Small team (50 engineers)

What did they choose? Erlang/OTP with process-per-user pattern.

Why did they choose Erlang?

  • Lightweight processes (millions of concurrent)
  • Fault tolerance ("let it crash")
  • Built-in message passing
  • Hot code reloading

Now, what's my problem?

  • 1B messages/day (similar scale)
  • Real-time delivery (similar requirement)
  • Low latency (similar requirement)
  • Team: 10 engineers (smaller than WhatsApp, but still small)

My next thought: "Erlang seems like a good fit. Similar requirements, similar team size."

But wait—do I have Erlang expertise? "My team knows Node.js and Python. Learning Erlang would take time."

I'm thinking: "Can I achieve the same with Node.js? Node.js has good concurrency, WebSocket support, and my team knows it."

Actually, let me think about the core pattern: "WhatsApp's key insight wasn't Erlang itself—it was the process-per-user pattern. Each user has a lightweight process that handles their messages."

Can I do that in Node.js? "Yes! I can use a Map to track user connections, handle messages per user, and scale horizontally with multiple Node.js servers."

I'm choosing Node.js: "Because:

  • My team knows it (faster development)
  • Can achieve similar concurrency (Node.js is good at I/O)
  • Similar architecture pattern (process-per-user concept)
  • This is the trade-off: familiar technology over Erlang's specific benefits"

But what about fault tolerance? "Erlang's 'let it crash' is powerful. In Node.js, I need to handle errors explicitly."

I'm thinking: "I can add error handling, retries, and monitoring. It's more work, but acceptable for my team size."

Now, what about the architecture? "WhatsApp used a simple architecture: process per user, message queue per user, minimal infrastructure."

Can I do that? "Yes! I can use:

  • WebSocket server (one connection per user)
  • Redis for message queues (one queue per user concept)
  • Database for persistence
  • Multiple servers for scale"

This is how I learn from case studies: I don't copy the solution. I understand the problem, the solution, the trade-offs, and then adapt it to my context.

What if my problem is different? "Let's say I need to handle group chats with 10K members. WhatsApp's process-per-user doesn't work as well for groups."

I'm thinking: "For groups, I'd need a different pattern. Maybe a group service that manages group state, and fan-out to members when messages are sent."

This is the key insight: Case studies teach you patterns and trade-offs, not solutions. You adapt them to your context.

Notice how I didn't just say "use Erlang because WhatsApp did." I understood why they chose it, what trade-offs they made, and then adapted the pattern to my context.


How a Senior Engineer Learns from Case Studies

A senior engineer:

  1. Identifies the problem: What problem were they solving?
  2. Understands the solution: What architecture did they choose?
  3. Analyzes trade-offs: What were the pros and cons?
  4. Learns from failures: What went wrong? How did they fix it?
  5. Applies learnings: How can I apply this to my problems?

Example: Learning from Instagram

  • Problem: Single database can't scale
  • Solution: Sharded MySQL
  • Trade-offs: Complexity vs scalability
  • Learning: Design for sharding from day 1, even if you don't shard initially
  • Application: When designing my system, I'll consider sharding strategy even for the MVP


Best Practices

  1. Study real systems: Learn from companies that solved similar problems
  2. Understand the context: Why did they make these choices?
  3. Analyze trade-offs: What were the pros and cons?
  4. Learn from failures: What went wrong? How did they fix it?
  5. Apply selectively: Not every pattern applies to every problem
  6. Think critically: Question decisions, understand alternatives

Common Interview Questions

Beginner

Q: Why did Instagram choose sharded MySQL over a single database?

A: Instagram started with a single PostgreSQL database but hit scaling limits. They chose sharded MySQL because:

  • Better tooling for sharding
  • More mature replication
  • Better performance at scale
  • Can scale horizontally by adding shards

The trade-off was increased complexity (sharding logic, cross-shard queries) for better scalability.


Intermediate

Q: How did WhatsApp handle billions of messages with a small team?

A: WhatsApp used Erlang/OTP architecture:

  • Lightweight processes (millions of concurrent processes)
  • Process per user pattern (simple mental model)
  • Fault tolerance built-in ("let it crash")
  • Hot code reloading (update without downtime)

This allowed a small team (50 engineers) to handle billions of messages because Erlang is designed for concurrent, fault-tolerant systems.


Senior

Q: How would you apply Netflix's microservices migration strategy to a monolith you're working on?

A: I would use the strangler pattern:

  1. Identify high-value services: Extract services that provide most value first (e.g., payment, user management)
  2. Build alongside monolith: Don't rewrite, build new services alongside
  3. Migrate gradually: Move users to new services incrementally
  4. Monitor and optimize: Measure performance, optimize based on data
  5. Decommission monolith: Once all functionality migrated, decommission

Key considerations:

  • Start with services that have clear boundaries
  • Ensure backward compatibility during migration
  • Have rollback plan for each service
  • Monitor metrics (latency, error rate, throughput)

Summary

Case studies teach you how real companies applied design thinking:

  • Instagram: Sharded MySQL for horizontal scaling
  • WhatsApp: Erlang/OTP for concurrent, fault-tolerant messaging
  • Uber: Event-driven architecture for real-time pricing
  • Slack: WebSocket gateway + message queue for real-time delivery
  • Netflix: Microservices migration using strangler pattern

Key takeaways:

  • Study real systems to learn from their choices
  • Understand the context and trade-offs
  • Learn from failures and how they were fixed
  • Apply learnings selectively to your problems
  • Think critically about decisions

FAQs

Q: Should I memorize these case studies for interviews?

A: Not exactly. Understand the principles (why they made these choices, what trade-offs they accepted), not just the facts. Interviewers care more about your reasoning than your memory.

Q: How do I find more case studies?

A:

  • Engineering blogs (Instagram, Uber, Netflix, etc.)
  • Conference talks (QCon, AWS re:Invent, etc.)
  • Books (Designing Data-Intensive Applications, etc.)
  • Podcasts (Software Engineering Daily, etc.)

Q: Can I use case studies in my designs?

A: Yes, but adapt them. Don't copy blindly. Understand why they worked, then adapt to your context. What worked for Instagram may not work for you.

Q: How do I know if a case study is relevant to my problem?

A: Look for:

  • Similar scale (users, requests, data)
  • Similar requirements (latency, availability, consistency)
  • Similar constraints (team size, budget, timeline)
  • Similar domain (messaging, social media, e-commerce)

Q: What if I don't know any case studies?

A: That's okay. Focus on understanding principles (trade-offs, patterns, architectures) rather than memorizing specific companies. Principles are more valuable than facts.

Q: How do I apply case studies to interview problems?

A:

  • Reference similar systems: "This is similar to how Instagram handled..."
  • Explain trade-offs: "Like WhatsApp, we're choosing X because..."
  • Learn from failures: "Netflix learned that Y, so we'll avoid..."
  • Adapt to context: "For our problem, we'll modify this because..."

Q: Are case studies only for large companies?

A: No. Case studies from large companies are useful because they've solved scale problems, but principles apply to smaller systems too. Start simple, scale when needed—that's the lesson.
