Rate Limiter System Design (Token Bucket Explained)

Visual Problem Diagram

[Architecture diagram: Rate Limiter System Design (Token Bucket Explained)]

Scenario

An API gateway must enforce per-client limits (by user, IP, API key, or route) to protect backends and keep usage fair. The interesting part is doing this at millions of checks per second with sub-millisecond overhead, keeping limits correct across many gateway instances, and behaving honestly when the limiter or its store blips: fail open vs. fail closed is a product and security decision you should state explicitly, not hide.

Design a rate limiter service that restricts the number of requests a client can make within a given time period. This is essential for preventing abuse and ensuring fair resource usage.

In production, rate limiters sit in front of APIs (often at the gateway) and decide per request whether to allow or reject based on the client’s usage so far. You need to support many clients, many gateway nodes, and strict latency and availability requirements. The system should support configurable limits per client (e.g. by user ID, IP, or API key), per endpoint, and per tier (e.g. free vs premium). You should be able to explain how you would implement at least one algorithm (e.g. fixed window, sliding window, or token bucket), how state is shared across nodes, and what happens when the rate limiter or its store fails.
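The token bucket named in the title can be sketched in a few lines: refill tokens at a steady rate up to a capacity, and spend one token per request. A minimal single-node version in Python (class and parameter names are illustrative, not from the source; the clock is injectable so the refill logic can be tested deterministically):

```python
import time


class TokenBucket:
    """Single-node token bucket: refill `rate` tokens/sec up to `capacity`;
    each allowed request costs one token."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # burst size
        self.tokens = capacity      # start full
        self.now = now              # injectable clock for testing
        self.last = now()

    def allow(self) -> bool:
        t = self.now()
        # Lazy refill: add tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The lazy refill is the key trick: no background timer is needed, and state per client is just two numbers (token count and last-refill timestamp), which is what keeps per-client storage small.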

Constraints

Functional

- Limit requests per client per time window
- Multiple strategies (fixed window, sliding window, token bucket)
- Per-user and per-tier limits, whitelisting
- Rate-limit headers in responses
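The simplest of the listed strategies, a fixed window counter, can be sketched as a map from (client, window index) to a count. A hypothetical minimal sketch (a real deployment would evict expired windows and key limits per endpoint and tier as well):

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Fixed-window counter: at most `limit` requests per client per window."""

    def __init__(self, limit: int, window_s: int):
        self.limit = limit
        self.window_s = window_s
        self.counts = defaultdict(int)  # (client, window index) -> count

    def allow(self, client: str, now_s: float) -> bool:
        window = int(now_s // self.window_s)  # which window this instant falls in
        key = (client, window)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

The well-known weakness is the window boundary: a client can send `limit` requests at the end of one window and `limit` more at the start of the next, briefly doubling the effective rate. That burst problem is what sliding-window variants and the token bucket address.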

Non-functional

- Low latency: under 1 ms of added overhead per request
- Throughput: millions of requests per second
- Accurate limits, consistent across multiple gateway servers
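One way to get more accurate limits than a fixed window without storing every request timestamp is the weighted sliding-window estimate: keep counts for the current and previous windows, and scale the previous count by the fraction of it that still falls inside the sliding window. A sketch (function name and signature are illustrative):

```python
def sliding_window_allow(prev_count: int, curr_count: int,
                         limit: int, elapsed_frac: float) -> bool:
    """Weighted sliding-window check.

    elapsed_frac is how far we are into the current window (0.0..1.0);
    the previous window's count decays linearly as the window slides past it.
    """
    estimated = prev_count * (1.0 - elapsed_frac) + curr_count
    return estimated < limit
```

This needs only two counters per client instead of a log of timestamps, trading a small approximation error for constant memory and O(1) checks, which is what makes the "millions of requests/second" target plausible.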

Scale

- 10M requests/second
- 100M unique clients
- ~100 bytes of state per client (~10 GB total)
- 1-minute window
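A quick sanity check of the stated capacity numbers confirms the state footprint is small enough to hold in memory:

```python
clients = 100_000_000       # 100M unique clients
bytes_per_client = 100      # ~100 bytes of limiter state each (e.g. counters + timestamp)
total_gb = clients * bytes_per_client / 1e9
print(total_gb)             # 10 GB -> fits comfortably in an in-memory store
```

At ~10 GB, the whole keyspace fits in a single in-memory store or a small sharded cluster; the hard part of the scale target is the 10M checks/second, not the storage.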

Stages ahead

1. Requirement Analysis
2. API Design
3. High-Level Design
4. HLD Extensions
5. Trade-offs