Tell me about a bug that was hard to find and how you debugged it
py-jun-006
Your answer
Answer as you would in a real interview — explain your thinking, not just the conclusion.
Model answer
STAR structure:
Situation: a background Celery task was silently dropping 2-3% of webhook deliveries, with no exception in the logs.
Task: find the root cause and fix it without breaching the task's 99.9% SLA.
Action: added structured logging (structlog) with a unique delivery_id at each step, then replayed the failed deliveries in a local environment with CELERY_ALWAYS_EAGER=True (named task_always_eager since Celery 4.0) so that tasks ran synchronously in the calling process. The issue was in the retry decorator: it caught the base Exception, which masked a requests.Timeout. The timeout was retried immediately with no backoff, which hit the server hard during a spike and caused it to time out again, a self-reinforcing loop. Fixed by (1) catching specific exceptions (requests.RequestException) and (2) adding exponential backoff with jitter: 2^attempt * 0.5 + random.uniform(0, 0.5) seconds.
Result: delivery rate rose to 99.97%, with no more silent drops.
Lesson: always log a unique identifier at every step in async pipelines, and in retry logic catch the specific exceptions you intend to handle rather than the base Exception class.
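The fix described in the model answer could look roughly like the sketch below. It is not the original code from the incident: the names backoff_delay, deliver_with_retry, send, retry_on and max_attempts are hypothetical, and the retry loop is written with the standard library only so that it runs without Celery or requests installed. The delay formula is the one quoted in the answer.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.5, jitter: float = 0.5) -> float:
    """Seconds to wait before retrying after 0-based attempt number `attempt`.

    Implements 2^attempt * base + uniform(0, jitter), the formula from the answer.
    """
    return (2 ** attempt) * base + random.uniform(0, jitter)


def deliver_with_retry(send, retry_on, max_attempts=5, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is reached (hypothetical helper).

    Only exception types listed in retry_on are retried. Anything else
    propagates immediately, so unexpected errors stay visible in the logs
    instead of being swallowed by a blanket `except Exception`.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the failure rather than drop it
            sleep(backoff_delay(attempt))
```

With requests, the call might be deliver_with_retry(lambda: requests.post(url, json=payload, timeout=5), retry_on=(requests.RequestException,)), where url and payload are placeholders. In a real Celery task the same behaviour is usually configured through the task's built-in retry options instead of a hand-written loop.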
Follow-up
What is the difference between catching Exception and BaseException in Python, and when does it matter?