Design a real-time data pipeline in Python capable of processing 1 million events per day
Model answer
I would use Apache Kafka as the event bus (high-throughput, durable, replayable), with Python consumers via confluent-kafka. Each consumer group represents a pipeline stage: ingest, enrich, and store. The ingest layer is a FastAPI service that validates events and produces them to Kafka synchronously, so a client only receives a success response once the broker has acknowledged the write. Enrichment workers use asyncio to fan out lookups (geo-IP, user profiles) without blocking. Processed events land in ClickHouse for analytics and in PostgreSQL for transactional records.

At 1M events/day (~12 events/s on average, much higher in bursts), capacity is not the bottleneck: Kafka handles millions of events per second. A topic with three partitions, replication factor 3, and up to three consumer instances per group is more than sufficient. Note that within a consumer group each partition is consumed by at most one instance, so a single partition would cap every stage at one active consumer; a few partitions buy burst parallelism and failover headroom.

Monitoring via Prometheus + Grafana watches consumer lag, and alerting fires when lag grows beyond a threshold; growing lag is the earliest signal that a stage has stalled or fallen behind.
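As a concrete sketch of the ingest stage, the endpoint below validates the event with Pydantic, produces to Kafka via confluent-kafka, and blocks on broker acknowledgement before responding. The topic name (`events.raw`), the event fields, and the broker address are illustrative assumptions, not prescribed by the answer above.

```python
# Hypothetical ingest service: validate, produce synchronously, ack the client.
from confluent_kafka import Producer
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
producer = Producer({
    "bootstrap.servers": "kafka:9092",  # assumed broker address
    "acks": "all",                      # wait for all in-sync replicas
    "enable.idempotence": True,         # safe retries, no duplicates
})

class Event(BaseModel):                 # assumed event shape
    user_id: str
    event_type: str
    payload: dict

@app.post("/events", status_code=202)
def ingest(event: Event):
    errors = []

    def on_delivery(err, msg):
        if err is not None:
            errors.append(err)

    producer.produce(
        "events.raw",                   # assumed topic name
        key=event.user_id.encode(),
        value=event.model_dump_json().encode(),
        on_delivery=on_delivery,
    )
    producer.flush(5)                   # block until the broker acks
    if errors:
        raise HTTPException(status_code=503, detail=str(errors[0]))
    return {"status": "queued"}
```

Blocking on `flush()` per request is deliberate: at ~12 events/s the latency cost is negligible, and the client never gets a success for an event the broker has not durably accepted.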
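And a minimal sketch of an enrichment worker, assuming `events.raw` in and `events.enriched` out; `lookup_geoip` and `lookup_profile` are hypothetical stand-ins for whatever async clients (HTTP, Redis, etc.) the real lookups would use. The blocking `poll()` is pushed onto a thread so the event loop stays free to run the fan-out lookups concurrently.

```python
# Hypothetical enrichment stage: consume, fan out async lookups, re-produce.
import asyncio
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "enrich",               # one group per pipeline stage
    "enable.auto.commit": False,        # commit only after re-producing
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "kafka:9092"})

async def lookup_geoip(event: dict) -> dict:
    return {"country": "??"}            # stand-in for an async geo-IP client

async def lookup_profile(event: dict) -> dict:
    return {"plan": "free"}             # stand-in for an async profile fetch

async def enrich(event: dict) -> dict:
    # Run the independent lookups concurrently rather than sequentially.
    geo, profile = await asyncio.gather(lookup_geoip(event), lookup_profile(event))
    return {**event, "geo": geo, "profile": profile}

async def run() -> None:
    consumer.subscribe(["events.raw"])
    while True:
        # poll() blocks, so run it in a thread to keep the loop responsive
        msg = await asyncio.to_thread(consumer.poll, 1.0)
        if msg is None or msg.error():
            continue
        enriched = await enrich(json.loads(msg.value()))
        producer.produce("events.enriched", json.dumps(enriched).encode())
        producer.poll(0)                # serve delivery callbacks
        consumer.commit(message=msg)    # at-least-once: commit after produce

if __name__ == "__main__":
    asyncio.run(run())
```

Committing offsets only after handing the enriched event to the producer gives at-least-once delivery, so the downstream stores should upsert by event ID to stay idempotent.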
Follow-up
How do you handle schema evolution — specifically adding a mandatory field while old producers are still running?