Job Scheduler System Design (Queues, Retries & Reliability)
Scenario
Teams need cron-like scheduling, one-off delayed jobs, and dependency graphs at very large scale. No job should run twice from the business's point of view unless the design states when and why that can happen, and no job should silently fail to run because a worker died. The core tension is between time (what fires next), durability (never lose a job), and concurrency (many schedulers and workers with no double firing).
Design a distributed job scheduler that can schedule and execute jobs at specific times or intervals. The system should support one-time jobs, recurring jobs, job dependencies, and handle millions of jobs reliably.
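One common way to reconcile "no double fire" with "never silently skipped when a worker dies" is a lease: a worker claims a due job with a conditional update and must finish before the lease expires, after which another worker may reclaim it. The sketch below illustrates that idea only; the table schema, the function names (open_store, claim_due_job, complete_job), and the 30-second lease are assumptions for illustration, and an in-memory SQLite database stands in for whatever durable store the real system would use.

```python
import sqlite3

LEASE_SECONDS = 30  # assumed lease length; not specified in the text


def open_store() -> sqlite3.Connection:
    """Create an in-memory job table (hypothetical schema for illustration)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs ("
        " id TEXT PRIMARY KEY,"
        " run_at REAL NOT NULL,"   # when the job is due (epoch seconds)
        " status TEXT NOT NULL,"   # pending / running / completed / failed
        " owner TEXT,"             # worker currently holding the lease
        " lease_until REAL)"       # lease expiry (epoch seconds)
    )
    return db


def claim_due_job(db: sqlite3.Connection, worker: str, now: float):
    """Try to claim one due job; return its id, or None if nothing was claimed.

    A job is claimable if it is pending and due, or if it is running but
    its lease has expired (the previous worker is presumed dead).
    """
    row = db.execute(
        "SELECT id FROM jobs"
        " WHERE (status = 'pending' AND run_at <= ?)"
        "    OR (status = 'running' AND lease_until <= ?)"
        " ORDER BY run_at LIMIT 1",
        (now, now),
    ).fetchone()
    if row is None:
        return None
    # Conditional update: it only matches if the row is still claimable,
    # so two schedulers racing on the same row cannot both win.
    cur = db.execute(
        "UPDATE jobs SET status = 'running', owner = ?, lease_until = ?"
        " WHERE id = ?"
        "   AND ((status = 'pending' AND run_at <= ?)"
        "     OR (status = 'running' AND lease_until <= ?))",
        (worker, now + LEASE_SECONDS, row[0], now, now),
    )
    db.commit()
    return row[0] if cur.rowcount == 1 else None


def complete_job(db: sqlite3.Connection, job_id: str, worker: str) -> bool:
    """Mark a job completed, but only if this worker still owns it."""
    cur = db.execute(
        "UPDATE jobs SET status = 'completed', owner = NULL, lease_until = NULL"
        " WHERE id = ? AND owner = ? AND status = 'running'",
        (job_id, worker),
    )
    db.commit()
    return cur.rowcount == 1
```

Note what this does and does not guarantee. A slow worker whose lease expired may already have produced side effects before a second worker reclaims the job, so the scheme is at-least-once rather than exactly-once. That is the case where the design has to say why a duplicate can occur, typically by requiring job handlers to be idempotent.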
Constraints
Functional: schedule at a specific time or interval; one-time and recurring (cron-like) jobs; job dependencies; priorities; retries with a configurable policy; job status (pending/running/completed/failed); cancellation; execution history
Non-functional: no lost or duplicate executions; millions of jobs; execution within seconds of the scheduled time; 99.9% uptime; durable job data
Scale: 100M scheduled jobs; 10M executions/day (about 116/s on average), peak ~500/s; ~1 KB per job (~100 GB total); ~5 GB/day of history (~1.8 TB/year); 1000 workers
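The scale figures above can be cross-checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, assuming decimal units (1 KB = 1000 bytes) and a 365-day year; the variable names are illustrative, and the derived values (average rate, peak-to-average ratio, bytes per history record, downtime budget) are consequences of the stated numbers rather than additional requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60
HOURS_PER_YEAR = 365 * 24

# Inputs taken from the constraints (decimal units assumed).
scheduled_jobs = 100_000_000
bytes_per_job = 1_000
executions_per_day = 10_000_000
peak_per_second = 500
history_bytes_per_day = 5 * 10**9
workers = 1_000
uptime_target = 0.999

# Derived figures.
job_store_gb = scheduled_jobs * bytes_per_job / 10**9               # 100 GB
avg_per_second = executions_per_day / SECONDS_PER_DAY               # about 116/s
peak_to_avg = peak_per_second / avg_per_second                      # about 4.3x
history_tb_per_year = history_bytes_per_day * 365 / 10**12          # about 1.8 TB
bytes_per_history_row = history_bytes_per_day / executions_per_day  # 500 bytes
peak_starts_per_worker = peak_per_second / workers                  # 0.5 per second
downtime_hours_per_year = (1 - uptime_target) * HOURS_PER_YEAR      # about 8.76 h

print(f"job store:          {job_store_gb:.0f} GB")
print(f"average rate:       {avg_per_second:.1f} executions/s")
print(f"peak / average:     {peak_to_avg:.1f}x")
print(f"history per year:   {history_tb_per_year:.2f} TB")
print(f"history record:     {bytes_per_history_row:.0f} bytes")
print(f"peak starts/worker: {peak_starts_per_worker:.1f} per second")
print(f"downtime budget:    {downtime_hours_per_year:.2f} hours/year")
```

Two things stand out. The peak is only about four times the average, so the system must absorb bursts but is not dominated by them. And at peak each of the 1000 workers starts a job roughly every two seconds, which suggests worker count is driven more by job duration than by dispatch rate.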