Fixing a Server Crash in the MCP Python SDK

Last week I got a PR merged into the Model Context Protocol Python SDK. It wasn't a feature—it was a crash fix. The boring kind of contribution that doesn't get celebrated much, but keeps real systems running.

Here's what happened, what I learned, and why handling transport failures is trickier than it looks.

The Bug

Issue #2328 described a crash: when an MCP server's transport closed unexpectedly while handlers were still processing requests, the server would crash with a ClosedResourceError.

This happens more often than you'd think:

Client disconnects mid-request (network hiccup, user closes app)
stdin EOF during a long-running tool call
Malicious client sends invalid UTF-8 bytes that crash the transport

The server shouldn't die because a client misbehaved. It should log, clean up, and keep running.

The Fix

The PR made four changes:

1. Cancel in-flight handlers when transport closes

When the message loop detects transport closure, it now explicitly cancels all handlers that are still running. Previously, handlers would finish execution and then try to send a response through an already-closed stream—hence the crash.

# Wrap the message loop in try/finally
try:
    async for message in read_stream:
        ...  # handle message
finally:
    # Cancel any handlers still in flight
    for task in _in_flight_handlers:
        task.cancel()
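
To make the pattern concrete, here is a minimal, self-contained sketch using plain asyncio. It is not the SDK's code: serve, slow_handler, and the queue standing in for the read stream are hypothetical names chosen for illustration, and a None sentinel is assumed to represent the transport closing.

```python
import asyncio


async def slow_handler(request_id, results):
    """Hypothetical handler that outlives the transport."""
    try:
        await asyncio.sleep(10)
        results.append(("done", request_id))
    except asyncio.CancelledError:
        results.append(("cancelled", request_id))
        raise  # let the cancellation propagate


async def serve(incoming, results):
    """Message loop: one task per message, cancel stragglers on exit."""
    in_flight = set()
    try:
        while True:
            message = await incoming.get()
            if message is None:  # assumed sentinel for "transport closed"
                break
            task = asyncio.create_task(slow_handler(message, results))
            in_flight.add(task)
            task.add_done_callback(in_flight.discard)
    finally:
        # Snapshot first: the done callbacks mutate the set.
        pending = list(in_flight)
        for task in pending:
            task.cancel()
        # Wait for the cancellations to land before returning.
        await asyncio.gather(*pending, return_exceptions=True)


async def demo():
    incoming = asyncio.Queue()
    results = []
    server = asyncio.create_task(serve(incoming, results))
    for request_id in (1, 2):
        await incoming.put(request_id)
    await asyncio.sleep(0.05)  # give both handlers time to start
    await incoming.put(None)   # simulate the transport closing
    await server
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Neither handler gets to finish its work; both observe the cancellation and unwind instead of reaching for a stream that no longer exists.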

2. Catch ClosedResourceError on respond

Even with cancellation, there's a race condition: a handler might try to respond in the tiny window between transport closure and cancellation arriving. So we also catch ClosedResourceError at the response boundary and handle it gracefully.
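
A rough sketch of what catching the error at the response boundary can look like. The ClosedResourceError class below is a local stand-in for anyio.ClosedResourceError so the example has no dependencies, and FakeWriteStream and respond are hypothetical names, not the SDK's API.

```python
import asyncio
import logging

logger = logging.getLogger("sketch")


class ClosedResourceError(Exception):
    """Local stand-in for anyio.ClosedResourceError (assumption)."""


class FakeWriteStream:
    """Hypothetical write stream that refuses sends once closed."""

    def __init__(self):
        self.closed = False
        self.sent = []

    async def send(self, item):
        if self.closed:
            raise ClosedResourceError
        self.sent.append(item)


async def respond(stream, request_id, result):
    """Send a response; return False instead of crashing if the peer is gone."""
    try:
        await stream.send({"id": request_id, "result": result})
        return True
    except ClosedResourceError:
        logger.debug("transport closed before response %s was sent", request_id)
        return False


async def demo():
    stream = FakeWriteStream()
    first = await respond(stream, 1, "ok")
    stream.closed = True  # the transport dies between two responses
    second = await respond(stream, 2, "late")
    return first, second, stream.sent


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The late response is dropped and logged rather than taking the server down with it.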

3. Distinguish cancellation types

This one was subtle. When a handler gets cancelled, we need to know why:

Client-initiated cancellation: The client sent a cancel request. We've already sent a response. Don't re-raise.
Transport-close cancellation: The transport died. Re-raise so cleanup can happen properly.

Getting this wrong means either swallowing important exceptions or double-responding.
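
One way to picture the distinction is a flag set before the task is cancelled. This is an illustrative sketch in plain asyncio, not the SDK's implementation: Handler, cancelled_by_client, and cancel_from_client are hypothetical names.

```python
import asyncio


class Handler:
    """Hypothetical wrapper that records why its task was cancelled."""

    def __init__(self):
        self.cancelled_by_client = False
        self.task = None

    async def run(self, work):
        try:
            return await work
        except asyncio.CancelledError:
            if self.cancelled_by_client:
                # The client asked for this and already has its response.
                return "cancelled-by-client"
            # Transport closed: propagate so outer cleanup can run.
            raise

    def cancel_from_client(self):
        self.cancelled_by_client = True
        self.task.cancel()


async def demo():
    outcomes = []
    for client_initiated in (True, False):
        handler = Handler()
        handler.task = asyncio.create_task(handler.run(asyncio.sleep(10)))
        await asyncio.sleep(0.01)  # let the handler start
        if client_initiated:
            handler.cancel_from_client()
        else:
            handler.task.cancel()  # what the loop does on transport close
        try:
            outcomes.append(await handler.task)
        except asyncio.CancelledError:
            outcomes.append("re-raised")
    return outcomes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The same CancelledError arrives in both cases; only the flag tells the handler whether to swallow it or let it propagate.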

4. Fix dict iteration during cleanup

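Classic Python gotcha: iterating over a dictionary while something else adds or removes keys raises RuntimeError: dictionary changed size during iteration. In async code this can happen whenever the loop body awaits and another coroutine touches the dict in the meantime. The fix is simple: snapshot the keys with list():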

for stream_id in list(_response_streams.keys()):
    ...  # safe to modify _response_streams in here now
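
The failure is easy to reproduce without any async code at all. The sketch below mutates the dict from inside the loop itself, which triggers the same error a concurrent coroutine would; close_all_unsafe and close_all_safe are hypothetical helper names.

```python
def close_all_unsafe(streams):
    """Delete entries while iterating the live dict; returns the error text."""
    try:
        for stream_id in streams:
            del streams[stream_id]
    except RuntimeError as exc:
        return str(exc)
    return "ok"


def close_all_safe(streams):
    """Iterate a snapshot of the keys, so deleting entries is fine."""
    for stream_id in list(streams.keys()):
        streams.pop(stream_id, None)
    return "ok"


if __name__ == "__main__":
    print(close_all_unsafe({1: "a", 2: "b"}))
    print(close_all_safe({1: "a", 2: "b"}))
```

Using pop with a default in the safe version also tolerates an entry that was already removed by someone else between the snapshot and the loop step.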

What I Learned

Transport failure is a first-class event. It's not an edge case you can ignore. Any networked service needs explicit handling for "the other side disappeared." The question isn't if it will happen, but when.

Cancellation semantics matter. Python's asyncio.CancelledError is a blunt instrument. When you catch it, you often need context about why the cancellation happened to respond correctly. The MCP SDK tracks this with explicit flags.

Tests prove the fix. I added two test cases that reproduce the exact crash scenarios. Without them, this would be a "trust me" PR. With them, any future regression gets caught immediately.

The Takeaway

Open-source contributions don't have to be features. Crash fixes, test coverage, documentation: this is the maintenance work that keeps projects usable. It's not glamorous, but someone clicking through their MCP-powered app won't see it crash because of a network blip. That's real value.

The PR took a few hours to write and a few days to get reviewed and merged. Not a huge investment, but now it's in the SDK forever, helping everyone who uses it.

If you're looking to contribute to open source, start with the bug tracker. Find a crash. Fix it. Ship it.
