Logging System Design (Ingestion, Storage & Query at Scale)
Scenario
Services emit structured logs at massive scale. Operators need search, dashboards, and retention that meets compliance requirements for audit trails, without paying petabyte-scale storage prices for every debug printf. The interview is about ingestion pipelines, tiered storage, and cost control, not “we put logs in Elasticsearch” without cardinality discipline.
Design a centralized logging platform that collects logs from thousands of services, stores them durably, and supports interactive search and alerting.
Constraints
Ingest structured logs; tag with service, env, trace/span ids; query by time range and filters; saved searches and alerts (high level)
Handle burst traffic; configurable retention per tenant/tier; durability for compliance logs
Millions of events per second in aggregate; petabytes stored; globally distributed services