Describe a time you drove a critical technical decision at the system architecture level
py-sys-004
Your answer
Answer as you would in a real interview — explain your thinking, not just the conclusion.
Model answer
STAR structure. Situation: a Python ML serving platform was receiving 200k inference requests/day. The existing approach — spinning a new FastAPI process per request in Docker — had 2–3 second cold starts destroying p99 SLA. Task: own the platform redesign to achieve sub-100ms p99 warm path and sub-500ms cold path. Action: (1) Evaluated three options: keep-alive FastAPI containers (high memory), TorchServe (good for PyTorch, poor multi-framework), Ray Serve (multi-framework, batching, autoscaling). (2) Built a 2-day proof of concept for Ray Serve. (3) Presented results to VP Engineering with a migration risk matrix — blast radius if Ray Serve failed, rollback time, migration effort. (4) Proposed a blue-green deployment: run new and old platform in parallel for 30 days with shadow traffic. Result: p99 cold start dropped to 400ms, warm path to 12ms. Managed 3 engineers through the migration. Lesson: when making a platform bet, build a PoC with real traffic before committing — a week of engineering time avoided months of potential regret.
Follow-up
How do you handle the organisational politics of a platform migration when teams are comfortable with the status quo?