
Case Studies in Design Thinking

Learn from real-world systems: Instagram's sharded MySQL, WhatsApp's Erlang architecture, Uber's dynamic pricing, Slack's WebSocket management, and Netflix's microservices migration.

Advanced · 30 min read

Case studies show how real companies weighed trade-offs, why they chose one architecture over another, and what failures shaped their decisions.


Why Case Studies Matter

Studying real-world systems teaches you:

  • How trade-offs are made in practice: Not just theory, but real decisions
  • Why certain architectures were chosen: The reasoning behind choices
  • What failures shaped decisions: Learning from mistakes
  • How systems evolve: From MVP to scale

Case Study 1: Instagram's Sharded MySQL

The Problem

Instagram started with a single PostgreSQL database. As they grew to millions of users, they hit scaling limits:

  • Database couldn't handle the load
  • Writes were slow
  • Reads were slow
  • Single point of failure

The Solution: Sharded MySQL

Architecture:

  • Sharded MySQL (thousands of shards)
  • Each shard handles a subset of users
  • Sharding by user ID (hash-based)
  • Read replicas for each shard
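The hash-based user-to-shard mapping described above can be sketched in a few lines. This is a minimal illustration, not Instagram's actual code; the shard count and hash choice are assumptions made for the example.

```python
import hashlib

NUM_SHARDS = 4096  # illustrative; the real system used thousands of logical shards

def shard_for_user(user_id: int) -> int:
    """Map a user ID to a shard with a stable hash, so the same user
    always lands on the same shard across requests."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

Because all of a user's rows live on one shard, single-user queries stay local; the cross-shard queries flagged in the trade-offs below are exactly the ones this mapping makes hard. In practice, many logical shards are mapped onto fewer physical servers so shards can be moved without rehashing every user.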

Why MySQL over PostgreSQL?

  • Better tooling for sharding
  • More mature replication
  • Better performance at scale
  • Larger community

Trade-offs:

  • ✅ Horizontal scaling (can add shards)
  • ✅ High availability (shard failure doesn't affect all users)
  • ✅ Better performance (smaller databases)
  • ❌ Complex sharding logic
  • ❌ Cross-shard queries difficult
  • ❌ Data migration complexity

Key Learnings

  1. Start simple, scale when needed: Started with single database, sharded when needed
  2. Sharding is hard: Requires careful planning, data migration, application changes
  3. Choose technology for scale: MySQL chosen for sharding tooling, not just performance
  4. Design for sharding from day 1: Even if you don't shard initially, design with sharding in mind

Case Study 2: WhatsApp's Erlang Architecture

The Problem

WhatsApp needed to handle:

  • Billions of messages per day
  • Real-time delivery
  • Low latency (< 100ms)
  • High availability (99.99%)
  • Small team (50 engineers)

The Solution: Erlang/OTP

Architecture:

  • Erlang/OTP for backend
  • One lightweight process per user (process-per-user pattern)
  • Message queue per user
  • Minimal infrastructure (few servers)

Why Erlang?

  • Lightweight processes: Millions of concurrent processes
  • Fault tolerance: "Let it crash" philosophy
  • Hot code reloading: Update without downtime
  • Built-in message passing: Perfect for messaging systems
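The process-per-user idea can be approximated in other runtimes. Here is a minimal asyncio sketch in Python: one task and one queue per user, loosely mirroring an Erlang process with a mailbox. It is an analogy for the pattern, not WhatsApp's code, and the names are invented for the example.

```python
import asyncio

async def user_mailbox(user_id, inbox, delivered):
    """One lightweight task per user, draining that user's own queue.
    A crash here doesn't take down other users' mailboxes."""
    while True:
        msg = await inbox.get()
        if msg is None:  # sentinel shuts this mailbox down
            break
        delivered.append((user_id, msg))

async def demo():
    delivered = []
    inbox = asyncio.Queue()
    task = asyncio.create_task(user_mailbox("alice", inbox, delivered))
    await inbox.put("hello")
    await inbox.put(None)
    await task
    return delivered

result = asyncio.run(demo())
```

Isolation between tasks is a much weaker cousin of Erlang's "let it crash" guarantees, but the mental model (one concurrent unit per user, each with its own message queue) carries over.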

Trade-offs:

  • ✅ Handles millions of concurrent connections
  • ✅ Fault tolerant (process crash doesn't affect others)
  • ✅ Low latency (in-memory message passing)
  • ✅ Small team (Erlang is productive)
  • ❌ Smaller talent pool (harder to hire)
  • ❌ Less ecosystem (fewer libraries)
  • ❌ Learning curve (different paradigm)

Key Learnings

  1. Choose technology for the problem: Erlang perfect for messaging (concurrent processes, fault tolerance)
  2. Small team can build at scale: Right technology + right architecture = small team success
  3. Fault tolerance is built-in: Erlang's "let it crash" philosophy enables resilience
  4. Process per user pattern: Simple mental model, scales naturally

Case Study 3: Uber's Dynamic Pricing System

The Problem

Uber needed to:

  • Adjust prices based on demand/supply
  • Update prices in real-time
  • Handle millions of price calculations per second
  • Ensure consistency (same price for same conditions)

The Solution: Event-Driven Architecture

Architecture:

  • Event stream (Kafka) for demand/supply events
  • Pricing service consumes events
  • Real-time price calculation
  • Cache prices (Redis) for fast reads
  • Database for price history

Why Event-Driven?

  • Real-time updates: Events trigger price recalculation
  • Scalability: Multiple services can consume events
  • Decoupling: Services don't need to know about each other
  • Replayability: Can replay events for debugging
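A stripped-down sketch of that pricing loop, with a dict standing in for the Redis cache. The multiplier formula, cap, and field names are invented for illustration; Uber's actual model is far more sophisticated.

```python
price_cache = {}  # stands in for Redis: zone -> current fare
BASE_FARE = 10.0

def surge_multiplier(demand: int, supply: int) -> float:
    """More riders than drivers pushes the multiplier above 1.0, capped at 3.0."""
    if supply == 0:
        return 3.0
    return min(3.0, max(1.0, demand / supply))

def on_event(event: dict) -> None:
    """Consume a demand/supply event and refresh the cached fare for its zone."""
    mult = surge_multiplier(event["demand"], event["supply"])
    price_cache[event["zone"]] = round(BASE_FARE * mult, 2)

on_event({"zone": "downtown", "demand": 150, "supply": 50})
```

Reads hit the cache directly; only events trigger recomputation. That is the eventual-consistency trade-off in the list below: a rider may briefly see a fare computed from slightly stale demand data.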

Trade-offs:

  • ✅ Real-time price updates
  • ✅ Scalable (multiple consumers)
  • ✅ Decoupled services
  • ❌ Eventual consistency (prices may be slightly stale)
  • ❌ Complex event processing
  • ❌ Debugging challenges

Key Learnings

  1. Event-driven for real-time: Events enable real-time price updates
  2. Cache for performance: Cache prices for fast reads, recalculate on events
  3. Accept eventual consistency: Prices may be slightly stale, but acceptable for user experience
  4. Design for scale: Event stream handles millions of events per second

Case Study 4: Slack's WebSocket Management

The Problem

Slack needed to:

  • Handle millions of concurrent WebSocket connections
  • Deliver messages in real-time
  • Scale horizontally
  • Handle connection failures gracefully

The Solution: WebSocket Gateway + Message Queue

Architecture:

  • WebSocket gateway (multiple instances)
  • Message queue (RabbitMQ/Kafka) for message routing
  • Presence service (track online/offline users)
  • Database for message persistence

Why This Architecture?

  • WebSocket gateway: Handles connections, scales horizontally
  • Message queue: Routes messages to correct gateway instance
  • Presence service: Tracks which gateway has which user
  • Database: Persists messages for offline users
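The routing step reduces to a presence lookup. A minimal sketch, with a dict standing in for the presence service and hypothetical gateway IDs:

```python
from typing import Optional

presence = {}  # user_id -> gateway_id, standing in for the presence service

def register(user_id: str, gateway_id: str) -> None:
    """Called when a user's WebSocket connection lands on a gateway instance."""
    presence[user_id] = gateway_id

def route(user_id: str) -> Optional[str]:
    """Return the gateway to publish to, or None if the user is offline
    (in which case the message is persisted for later delivery)."""
    return presence.get(user_id)

register("alice", "gateway-3")
```

The real service must also handle the hard part this sketch skips: expiring entries when a gateway dies, and re-registering users as they reconnect to a different instance.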

Trade-offs:

  • ✅ Horizontal scaling (add gateway instances)
  • ✅ Fault tolerant (gateway failure doesn't affect all users)
  • ✅ Real-time delivery (WebSocket)
  • ❌ Complex routing (need to find correct gateway)
  • ❌ Connection state management
  • ❌ Message queue overhead

Key Learnings

  1. Gateway pattern for WebSockets: Gateway handles connections, scales horizontally
  2. Message queue for routing: Routes messages to correct gateway instance
  3. Presence service for state: Tracks which gateway has which user
  4. Design for connection failures: Handle reconnections, message delivery

Case Study 5: Netflix's Microservices Migration

The Problem

Netflix started as a monolith. As they grew, they faced:

  • Deployment bottlenecks (all teams had to deploy together)
  • Scaling challenges (services couldn't scale independently)
  • Technology constraints (locked into one stack)
  • Team coordination overhead (every team worked in the same codebase)

The Solution: Microservices

Architecture:

  • Hundreds of microservices
  • Each service owns a domain (video, recommendations, billing)
  • Independent deployments
  • Service mesh for communication
  • API gateway for external access

Why Microservices?

  • Independent scaling: Scale services based on demand
  • Technology freedom: Each service can use different stack
  • Team autonomy: Teams work independently
  • Fault isolation: Service failure doesn't affect others

Trade-offs:

  • ✅ Independent scaling
  • ✅ Technology freedom
  • ✅ Team autonomy
  • ✅ Fault isolation
  • ❌ Increased complexity
  • ❌ Network latency
  • ❌ Distributed system challenges
  • ❌ Operational overhead

The Migration

Phase 1: Strangler Pattern

  • Build new services alongside monolith
  • Gradually migrate functionality
  • Keep monolith running

Phase 2: Service Extraction

  • Extract services one by one
  • Migrate users gradually
  • Monitor and optimize

Phase 3: Full Migration

  • All functionality in microservices
  • Decommission monolith
  • Optimize and scale
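The gradual migration in the phases above can be sketched as a router that sends a growing percentage of traffic per path to the new services, with the rest still hitting the monolith. Paths and rollout percentages here are illustrative, not Netflix's.

```python
# Percent of traffic per path that has been moved off the monolith.
MIGRATED = {"/recommendations": 100, "/billing": 25}

def route_request(path: str, user_id: int) -> str:
    """Deterministically bucket users so the same user always gets the
    same backend at a given rollout percentage (easier to debug and roll back)."""
    rollout = MIGRATED.get(path, 0)
    return "microservice" if user_id % 100 < rollout else "monolith"
```

Raising a path's percentage migrates more users; setting it back to 0 is the rollback plan. Unlisted paths default to the monolith, which is what keeps the strangler pattern safe: nothing moves until you explicitly move it.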

Key Learnings

  1. Strangler pattern for migration: Build new alongside old, migrate gradually
  2. Start with high-value services: Extract services that provide most value first
  3. Independent scaling is powerful: Scale services based on demand (video vs billing)
  4. Complexity is real: Microservices add complexity, but provide benefits at scale
  5. Team autonomy matters: Teams can move faster when independent

Thinking Aloud Like a Senior Engineer

Let me walk you through how I'd actually learn from a case study and apply it to my own problem. This is the real-time reasoning that happens when you're trying to extract useful lessons.

Problem: "I'm designing a messaging system that needs to handle 1B messages/day. I read that WhatsApp used Erlang. Should I use Erlang too?"

My first instinct: "Yes! If it worked for WhatsApp, it'll work for me. Erlang it is!"

But wait—that's copying without understanding. Let me think about what WhatsApp actually did and why.

What was WhatsApp's problem?

  • Billions of messages per day
  • Real-time delivery
  • Low latency (< 100ms)
  • High availability (99.99%)
  • Small team (50 engineers)

What did they choose? Erlang/OTP with process-per-user pattern.

Why did they choose Erlang?

  • Lightweight processes (millions of concurrent)
  • Fault tolerance ("let it crash")
  • Built-in message passing
  • Hot code reloading

Now, what's my problem?

  • 1B messages/day (similar scale)
  • Real-time delivery (similar requirement)
  • Low latency (similar requirement)
  • Team: 10 engineers (smaller than WhatsApp, but still small)

My next thought: "Erlang seems like a good fit. Similar requirements, similar team size."

But wait—do I have Erlang expertise? "My team knows Node.js and Python. Learning Erlang would take time."

I'm thinking: "Can I achieve the same with Node.js? Node.js has good concurrency, WebSocket support, and my team knows it."

Actually, let me think about the core pattern: "WhatsApp's key insight wasn't Erlang itself—it was the process-per-user pattern. Each user has a lightweight process that handles their messages."

Can I do that in Node.js? "Yes! I can use a Map to track user connections, handle messages per user, and scale horizontally with multiple Node.js servers."

I'm choosing Node.js: "Because:

  • My team knows it (faster development)
  • Can achieve similar concurrency (Node.js is good at I/O)
  • Similar architecture pattern (process-per-user concept)
  • This is the trade-off: familiar technology over Erlang's specific benefits"

But what about fault tolerance? "Erlang's 'let it crash' is powerful. In Node.js, I need to handle errors explicitly."

I'm thinking: "I can add error handling, retries, and monitoring. It's more work, but acceptable for my team size."

Now, what about the architecture? "WhatsApp used a simple architecture: process per user, message queue per user, minimal infrastructure."

Can I do that? "Yes! I can use:

  • WebSocket server (one connection per user)
  • Redis for message queues (one queue per user concept)
  • Database for persistence
  • Multiple servers for scale"

This is how I learn from case studies: I don't copy the solution. I understand the problem, the solution, the trade-offs, and then adapt it to my context.

What if my problem is different? "Let's say I need to handle group chats with 10K members. WhatsApp's process-per-user doesn't work as well for groups."

I'm thinking: "For groups, I'd need a different pattern. Maybe a group service that manages group state, and fan-out to members when messages are sent."

This is the key insight: Case studies teach you patterns and trade-offs, not solutions. You adapt them to your context.

Notice how I didn't just say "use Erlang because WhatsApp did." I understood why they chose it, what trade-offs they made, and then adapted the pattern to my context.


How a Senior Engineer Learns from Case Studies

A senior engineer:

  1. Identifies the problem: What problem were they solving?
  2. Understands the solution: What architecture did they choose?
  3. Analyzes trade-offs: What were the pros and cons?
  4. Learns from failures: What went wrong? How did they fix it?
  5. Applies learnings: How can I apply this to my problems?

Example: Learning from Instagram

  • Problem: Single database can't scale
  • Solution: Sharded MySQL
  • Trade-offs: Complexity vs scalability
  • Learning: Design for sharding from day 1, even if you don't shard initially
  • Application: When designing my system, I'll consider sharding strategy even for the MVP


Best Practices

  1. Study real systems: Learn from companies that solved similar problems
  2. Understand the context: Why did they make these choices?
  3. Analyze trade-offs: What were the pros and cons?
  4. Learn from failures: What went wrong? How did they fix it?
  5. Apply selectively: Not every pattern applies to every problem
  6. Think critically: Question decisions, understand alternatives

Common Interview Questions

Beginner

Q: Why did Instagram choose sharded MySQL over a single database?

A: Instagram started with a single PostgreSQL database but hit scaling limits. They chose sharded MySQL because:

  • Better tooling for sharding
  • More mature replication
  • Better performance at scale
  • Can scale horizontally by adding shards

The trade-off was increased complexity (sharding logic, cross-shard queries) for better scalability.


Intermediate

Q: How did WhatsApp handle billions of messages with a small team?

A: WhatsApp used Erlang/OTP architecture:

  • Lightweight processes (millions of concurrent processes)
  • Process per user pattern (simple mental model)
  • Fault tolerance built-in ("let it crash")
  • Hot code reloading (update without downtime)

This allowed a small team (50 engineers) to handle billions of messages because Erlang is designed for concurrent, fault-tolerant systems.


Senior

Q: How would you apply Netflix's microservices migration strategy to a monolith you're working on?

A: I would use the strangler pattern:

  1. Identify high-value services: Extract services that provide most value first (e.g., payment, user management)
  2. Build alongside monolith: Don't rewrite, build new services alongside
  3. Migrate gradually: Move users to new services incrementally
  4. Monitor and optimize: Measure performance, optimize based on data
  5. Decommission monolith: Once all functionality migrated, decommission

Key considerations:

  • Start with services that have clear boundaries
  • Ensure backward compatibility during migration
  • Have rollback plan for each service
  • Monitor metrics (latency, error rate, throughput)

Summary

Case studies teach you how real companies applied design thinking:

  • Instagram: Sharded MySQL for horizontal scaling
  • WhatsApp: Erlang/OTP for concurrent, fault-tolerant messaging
  • Uber: Event-driven architecture for real-time pricing
  • Slack: WebSocket gateway + message queue for real-time delivery
  • Netflix: Microservices migration using strangler pattern

Key takeaways:

  • Study real systems to learn from their choices
  • Understand the context and trade-offs
  • Learn from failures and how they were fixed
  • Apply learnings selectively to your problems
  • Think critically about decisions

FAQs

Q: Should I memorize these case studies for interviews?

A: Not exactly. Understand the principles (why they made these choices, what trade-offs they accepted), not just the facts. Interviewers care more about your reasoning than your memory.

Q: How do I find more case studies?

A:

  • Engineering blogs (Instagram, Uber, Netflix, etc.)
  • Conference talks (QCon, AWS re:Invent, etc.)
  • Books (Designing Data-Intensive Applications, etc.)
  • Podcasts (Software Engineering Daily, etc.)

Q: Can I use case studies in my designs?

A: Yes, but adapt them. Don't copy blindly. Understand why they worked, then adapt to your context. What worked for Instagram may not work for you.

Q: How do I know if a case study is relevant to my problem?

A: Look for:

  • Similar scale (users, requests, data)
  • Similar requirements (latency, availability, consistency)
  • Similar constraints (team size, budget, timeline)
  • Similar domain (messaging, social media, e-commerce)

Q: What if I don't know any case studies?

A: That's okay. Focus on understanding principles (trade-offs, patterns, architectures) rather than memorizing specific companies. Principles are more valuable than facts.

Q: How do I apply case studies to interview problems?

A:

  • Reference similar systems: "This is similar to how Instagram handled..."
  • Explain trade-offs: "Like WhatsApp, we're choosing X because..."
  • Learn from failures: "Netflix learned that Y, so we'll avoid..."
  • Adapt to context: "For our problem, we'll modify this because..."

Q: Are case studies only for large companies?

A: No. Case studies from large companies are useful because they've solved scale problems, but principles apply to smaller systems too. Start simple, scale when needed—that's the lesson.
