Production is broken. Alerts are firing. Someone important is asking what's going on.
This is not the time to panic. This is the time to be methodical.
Here's the approach I use when debugging production issues — the same process whether it's a 500 error at 2 AM or a subtle data corruption discovered during an audit.
The first 5 minutes
Before you touch anything, gather context.
What changed? Check recent deployments, config changes, infrastructure updates. Most production issues correlate with recent changes. git log --oneline -10 and your deployment history are your friends.
What's the blast radius? Is this affecting all users, specific regions, certain account types? The scope tells you where to look and how urgent this really is.
What do the metrics say? CPU, memory, error rates, latency. Don't guess — look at the data. A CPU spike tells a different story than a memory leak.
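Much of this context can be pulled from a shell in under a minute. A rough sketch of the "when did it break?" check, run here against a throwaway sample log — the real path and log format will differ per stack:

```shell
#!/bin/sh
# Sample log standing in for /var/log/app.log (path and format are illustrative).
log=$(mktemp)
cat > "$log" <<'EOF'
16:01:12 INFO  request ok /api/orders 200
16:02:40 ERROR connection refused to redis
16:02:41 ERROR connection refused to redis
16:03:05 INFO  request ok /api/health 200
EOF

# When did errors start? Grab the timestamp of the first error line.
first_error=$(grep -m1 ERROR "$log" | awk '{print $1}')
echo "first error at: $first_error"   # prints: first error at: 16:02:40

# Count errors per minute to see the spike shape.
grep ERROR "$log" | cut -d: -f1-2 | sort | uniq -c

rm -f "$log"
```

Now line that first-error timestamp up against your deploy history: an incident that starts two minutes after a deploy is rarely a coincidence.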
The debugging loop
Once you have context, enter the loop:
1. Form hypothesis
2. Gather evidence
3. Confirm or reject
4. Repeat until root cause
This sounds obvious, but under pressure, people skip steps. They jump to solutions before understanding the problem. Don't be that person.
Form hypothesis
Based on your context, what could be causing this? Be specific. "The database is slow" is not a hypothesis. "The new query on the orders endpoint is doing a full table scan" is a hypothesis.
Good hypotheses are:
- Falsifiable — you can prove them wrong
- Specific — they point to a particular component
- Connected to evidence — based on what you observed, not random guessing
Gather evidence
Now prove or disprove your hypothesis. The tools depend on your stack, but the pattern is universal:
Logs — grep for error patterns, trace request IDs, look for the timestamp when things broke
grep -i error /var/log/app.log | tail -100
Metrics — correlate with the incident time window. What spiked? What flatlined?
Traces — follow a failing request through your system. Where does it slow down or error?
The database — running queries, connection pools, explain plans
SELECT * FROM pg_stat_activity WHERE state != 'idle';
EXPLAIN ANALYZE SELECT ...;
Confirm or reject
This is where discipline matters. If your hypothesis was wrong, don't cling to it. Form a new one based on what you learned.
If your hypothesis was right, dig deeper. "The query is slow" isn't enough. Why is it slow? Missing index? Lock contention? Bad plan because of stale statistics?
Common root causes
After debugging enough incidents, patterns emerge. Here's what I see most often:
1. The deploy that should have been fine
Something changed. It might be code, config, or dependencies. Rollback first, investigate second. You can always redeploy once you understand the issue.
2. Resource exhaustion
Connection pools, file descriptors, memory limits. Systems have boundaries, and hitting them causes cascading failures. Check your limits, check your current usage.
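A quick way to check whether you're near a ceiling (Linux-specific sketch; in practice substitute the PID of the suspect process for $$):

```shell
#!/bin/sh
# File descriptor headroom for the current shell; use the suspect
# process's PID under /proc in practice.
echo "fd limit:   $(ulimit -n)"
echo "fds in use: $(ls /proc/$$/fd 2>/dev/null | wc -l)"

# Established TCP connections grouped by remote endpoint -- a crude view
# of where your connection pools are pointed and how full they are.
ss -tn 2>/dev/null | awk '$1 == "ESTAB" {print $5}' | sort | uniq -c | sort -rn | head
```

A usage count sitting right at the limit is your smoking gun; a count that grows steadily and never shrinks is a leak.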
3. External dependencies
Third-party APIs, DNS, CDNs. Your code might be perfect, but if Stripe is having a bad day, so are you. Check status pages, check your outbound requests.
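curl can break down where the time goes on a single outbound call — DNS, TCP connect, TLS, first byte. The URL below is a placeholder; point it at the dependency you suspect:

```shell
#!/bin/sh
# Timing breakdown for one request. Slow time_namelookup points at DNS,
# slow time_starttransfer points at the dependency itself.
curl -s -o /dev/null \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  https://api.example.com/health
```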
4. Data issues
Bad data in, bad behavior out. Null values where you expected objects. Missing records. Schema drift between services. Validate your assumptions about data.
5. The thundering herd
Cache expires, everyone hits the database at once, the database falls over. Cron jobs that all fire at midnight. Retry storms after an outage. Coordination failures.
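One common mitigation is jitter: stagger the herd so it never arrives at once. A minimal sketch for the midnight-cron case (bash; run-nightly-job is a hypothetical stand-in for your actual payload):

```shell
#!/bin/bash
# Sleep a random 0-299 seconds before the real job, so a fleet of hosts
# spreads its midnight work over five minutes instead of one second.
jitter=$((RANDOM % 300))
echo "jitter: sleeping ${jitter}s before the job"
# sleep "$jitter"
# run-nightly-job   # hypothetical: your real cron payload goes here
```

The same idea applies to cache TTLs: add a small random offset to each key's expiry so a popular cache entry doesn't expire everywhere simultaneously.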
Tools I reach for
General:
- htop/top — what's using resources
- netstat/ss — network connections
- strace — what syscalls is this process making
- lsof — what files/sockets are open
Logs:
- journalctl — systemd logs with filtering
- grep, awk, jq — parsing log output
- Your log aggregator (Datadog, CloudWatch, etc.)
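If your logs are structured JSON, jq gets you from raw lines to the failing requests quickly. The sample lines below stand in for real aggregator output:

```shell
#!/bin/sh
# Keep only 5xx responses and print a compact summary line for each.
cat <<'EOF' | jq -r 'select(.status >= 500) | "\(.ts) \(.path) \(.status)"'
{"ts":"16:05:01","path":"/api/orders","status":500}
{"ts":"16:05:02","path":"/api/health","status":200}
{"ts":"16:05:03","path":"/api/orders","status":503}
EOF
# prints:
# 16:05:01 /api/orders 500
# 16:05:03 /api/orders 503
```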
Database (Postgres):
- pg_stat_activity — current queries
- pg_stat_statements — query performance history
- EXPLAIN ANALYZE — understand query plans
Application:
- Request tracing (your APM tool)
- Feature flags (check what's enabled)
- Error tracking (Sentry, Bugsnag, etc.)
What to write down
While debugging, keep a running log:
16:05 - Alert fired: 500 errors spike on /api/orders
16:06 - Checked recent deploys: v2.3.1 deployed at 16:00
16:08 - Error logs show: "connection refused to redis"
16:10 - Redis status: OOM killer triggered at 16:02
16:12 - Root cause: Redis memory limit too low for new caching feature
16:15 - Fix deployed: increased Redis memory, restarted
16:20 - Error rate back to normal, monitoring
This log is your incident report draft. It's also invaluable when someone asks "what happened?" while you're still fixing things.
After the fire is out
Production is stable. Now do the actual work:
Write the postmortem. What happened, why, and how to prevent it. Be blameless and specific. "Human error" is not a root cause. "Lack of validation on the memory config field allowed deployment of an invalid value" is a root cause.
Fix the systemic issue. If you can push a bad config to production, the config system needs guardrails. If the database can be overwhelmed by a single query, you need query timeouts. The incident revealed a gap — close it.
Update your runbooks. The next person who sees this alert should have better starting context than you did.
The mindset
Debugging production is stressful, but it's also where you level up fastest. Each incident teaches you something about your system that no amount of code review would reveal.
Stay calm. Be methodical. Take notes. Fix it for real, not just for now.
And when it's over, get some sleep. You earned it.
Need help debugging a production issue or building systems that are easier to debug? Let's talk.