I recently dove into Python threading and concurrency, and honestly, it was confusing at first. The GIL? Locks? Race conditions? It felt like learning a new language. But after banging my head against it for a while, things started to click. Here's what I learned.

The GIL: Why Python Threading Is... Weird

Before writing any code, you need to understand the Global Interpreter Lock (GIL). This confused me for weeks.

The GIL is a mutex in CPython (the standard interpreter) that protects access to Python objects. Only one thread can execute Python bytecode at a time, even on a multi-core machine.

# Even with 4 threads, only ONE runs Python code at a time
# The GIL switches between them rapidly

This sounds terrible, right? Why even have threading? Here's the key insight that finally made it click for me:

The GIL is released during I/O operations. When your thread is waiting for a network response or file read, it releases the GIL and another thread can run.

# Thread 1: makes HTTP request, releases GIL while waiting
# Thread 2: can now run! Makes its own request
# Thread 3: also waiting on I/O, GIL released
# All three can "wait" simultaneously

This is why threading works great for I/O-bound tasks but not CPU-bound ones. We'll come back to this.

Threading Module Basics

Let's start simple. The threading module is Python's high-level threading interface.

import threading
import time
 
def worker(name):
    print(f"{name} starting")
    time.sleep(2)  # Simulates I/O - GIL is released!
    print(f"{name} finished")
 
# Create threads
t1 = threading.Thread(target=worker, args=("Thread-1",))
t2 = threading.Thread(target=worker, args=("Thread-2",))
 
# Start them
t1.start()
t2.start()
 
# Wait for completion
t1.join()
t2.join()
 
print("All done!")

Output:

Thread-1 starting
Thread-2 starting
Thread-1 finished  # Both finish around the same time!
Thread-2 finished
All done!

Both threads sleep concurrently, so this takes ~2 seconds total, not 4.
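
You can verify that with a quick measurement. This is a sketch using the same worker shape as above, timed with time.perf_counter():

```python
import threading
import time

def worker(name):
    time.sleep(2)  # Simulated I/O - the GIL is released while sleeping

start = time.perf_counter()
threads = [threading.Thread(target=worker, args=(f"Thread-{i}",)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

elapsed = time.perf_counter() - start
print(f"Elapsed: {elapsed:.1f}s")  # ~2.0s, not 4.0s
```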

The Thread Class

You can subclass Thread for more complex scenarios:

import threading
import time
 
class DownloadThread(threading.Thread):
    def __init__(self, url):
        super().__init__()
        self.url = url
        self.result = None
    
    def run(self):
        # This method is called when you call .start()
        print(f"Downloading {self.url}")
        time.sleep(1)  # Simulate download
        self.result = f"Data from {self.url}"
 
# Usage
threads = [
    DownloadThread("https://api.example.com/users"),
    DownloadThread("https://api.example.com/posts"),
    DownloadThread("https://api.example.com/comments"),
]
 
for t in threads:
    t.start()
 
for t in threads:
    t.join()
 
for t in threads:
    print(t.result)

I prefer the function-based approach for simple cases, but subclassing is nice when threads need to return values or maintain state.

Daemon Threads

Daemon threads are background threads that die when your main program exits:

import threading
import time
 
def background_task():
    while True:
        print("Background working...")
        time.sleep(1)
 
# Without the daemon flag, the program would wait for this thread forever
t = threading.Thread(target=background_task)
t.daemon = True  # Make it a daemon
t.start()
 
time.sleep(3)
print("Main thread exiting")
# Daemon thread is killed automatically

Use daemons for cleanup tasks, monitoring, or anything that shouldn't keep your program alive.
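
Since Python 3.3 you can also pass daemon=True straight to the constructor instead of setting the attribute afterwards:

```python
import threading
import time

def heartbeat():
    while True:
        time.sleep(0.1)

# Equivalent to setting t.daemon = True before start()
t = threading.Thread(target=heartbeat, daemon=True)
t.start()
print(t.daemon)  # True
```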

Locks and Synchronization

Here's where I made my first big mistake. I had multiple threads updating a shared counter:

import threading
 
counter = 0
 
def increment():
    global counter
    for _ in range(100000):
        counter += 1  # NOT thread-safe!
 
threads = [threading.Thread(target=increment) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
 
print(counter)  # Expected: 500000, Actual: something less!

The problem? counter += 1 isn't atomic. It's actually:

  1. Read counter
  2. Add 1
  3. Write counter

If two threads read the same value before either writes, you lose an increment. This is a race condition.
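
You can watch those three steps happen in the bytecode with the standard dis module (the exact opcode names vary between Python versions):

```python
import dis

counter = 0

def increment_once():
    global counter
    counter += 1  # Compiles to separate load, add, and store opcodes

# Prints something like LOAD_GLOBAL / BINARY_OP / STORE_GLOBAL -
# a thread switch can happen between any two of those steps
dis.dis(increment_once)
```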

Using Locks

The fix is a Lock:

import threading
 
counter = 0
lock = threading.Lock()
 
def increment():
    global counter
    for _ in range(100000):
        with lock:  # Only one thread can hold this at a time
            counter += 1
 
threads = [threading.Thread(target=increment) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
 
print(counter)  # Always 500000

The with lock: syntax acquires the lock, runs your code, then releases it. Even if an exception occurs, the lock is released.
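
Under the hood, with lock: is roughly this acquire/release pair wrapped in try/finally:

```python
import threading

lock = threading.Lock()
counter = 0

def increment_verbose():
    global counter
    # What `with lock:` expands to, approximately:
    lock.acquire()
    try:
        counter += 1
    finally:
        lock.release()  # Runs even if the body raises

increment_verbose()
print(counter)  # 1
```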

RLock (Reentrant Lock)

A regular Lock can't be acquired twice by the same thread - it'll deadlock. RLock can:

import threading
 
lock = threading.RLock()
 
def outer():
    with lock:
        print("In outer")
        inner()  # This acquires the same lock
 
def inner():
    with lock:  # With Lock, this would deadlock!
        print("In inner")
 
outer()

I use RLock when I have nested function calls that all need the lock.

Other Synchronization Primitives

import threading
import time
 
# Semaphore - allows N threads at once
semaphore = threading.Semaphore(3)
 
def limited_worker():
    with semaphore:
        # Only 3 threads run this at a time
        do_work()
 
# Event - simple flag for signaling
event = threading.Event()
 
def waiter():
    print("Waiting...")
    event.wait()  # Blocks until event is set
    print("Done waiting!")
 
def setter():
    time.sleep(2)
    event.set()  # Unblocks all waiters
 
# Condition - more complex waiting
condition = threading.Condition()
 
def consumer():
    with condition:
        condition.wait()  # Wait for notification
        process_item()
 
def producer():
    with condition:
        add_item()
        condition.notify()  # Wake up one waiter
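
The Condition sketch above omits the bookkeeping. Here's a minimal runnable version with one producer and one consumer sharing a buffer (the buffer list and DONE sentinel are inventions of this sketch, not part of the threading API). wait_for rechecks its predicate on every wakeup, which avoids the classic lost-notification bug:

```python
import threading

buffer = []
condition = threading.Condition()
DONE = object()  # Sentinel telling the consumer to stop

def consumer(results):
    while True:
        with condition:
            # wait_for only returns once the predicate is true
            condition.wait_for(lambda: buffer)
            item = buffer.pop(0)
        if item is DONE:
            break
        results.append(item)

def producer():
    for i in range(5):
        with condition:
            buffer.append(i)
            condition.notify()  # Wake up the consumer
    with condition:
        buffer.append(DONE)
        condition.notify()

results = []
c = threading.Thread(target=consumer, args=(results,))
p = threading.Thread(target=producer)
c.start()
p.start()
p.join()
c.join()
print(results)  # [0, 1, 2, 3, 4]
```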

Thread Pools with concurrent.futures

Creating individual threads works, but managing them is tedious. ThreadPoolExecutor is much cleaner:

from concurrent.futures import ThreadPoolExecutor
import time
 
def download(url):
    print(f"Downloading {url}")
    time.sleep(1)
    return f"Data from {url}"
 
urls = [
    "https://example.com/1",
    "https://example.com/2",
    "https://example.com/3",
    "https://example.com/4",
]
 
# Use a pool of 3 workers
with ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(download, urls)
    
for result in results:
    print(result)

The pool reuses threads, handles the join logic, and provides a clean interface.

Submitting Individual Tasks

For more control, use submit():

from concurrent.futures import ThreadPoolExecutor, as_completed
import time
 
def slow_task(n):
    time.sleep(n)
    return f"Task {n} done"
 
with ThreadPoolExecutor(max_workers=3) as executor:
    # Submit tasks
    futures = {executor.submit(slow_task, i): i for i in [3, 1, 2]}
    
    # Process as they complete (not in submission order!)
    for future in as_completed(futures):
        task_id = futures[future]
        result = future.result()
        print(f"Task {task_id}: {result}")

Output:

Task 1: Task 1 done
Task 2: Task 2 done
Task 3: Task 3 done

as_completed() yields futures as they finish, which is great for handling results as soon as they're ready.

Error Handling

from concurrent.futures import ThreadPoolExecutor
 
def might_fail(n):
    if n == 2:
        raise ValueError("I don't like 2")
    return n * 2
 
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(might_fail, i) for i in range(5)]
    
    for i, future in enumerate(futures):
        try:
            result = future.result()
            print(f"Task {i}: {result}")
        except Exception as e:
            print(f"Task {i} failed: {e}")

When Threading Helps (and When It Doesn't)

This is the most important section. I wasted hours trying to speed up CPU-bound code with threads.

I/O-Bound: Threading Wins

I/O-bound means your code spends most of its time waiting for external things:

  • Network requests (APIs, databases)
  • File reads/writes
  • User input

from concurrent.futures import ThreadPoolExecutor
import requests
import time
 
def fetch(url):
    response = requests.get(url)
    return len(response.content)
 
urls = ["https://example.com"] * 10
 
# Sequential: slow
start = time.time()
results = [fetch(url) for url in urls]
print(f"Sequential: {time.time() - start:.2f}s")
 
# Threaded: fast!
start = time.time()
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fetch, urls))
print(f"Threaded: {time.time() - start:.2f}s")

Output:

Sequential: 5.23s
Threaded: 0.68s

When one thread waits for the network, others run. Big win.

CPU-Bound: Threading Fails

CPU-bound means your code does heavy computation:

  • Number crunching
  • Image processing
  • Data parsing

from concurrent.futures import ThreadPoolExecutor
import time
 
def cpu_intensive(n):
    # Simulates heavy computation
    total = 0
    for i in range(n):
        total += i * i
    return total
 
# Sequential
start = time.time()
results = [cpu_intensive(10_000_000) for _ in range(4)]
print(f"Sequential: {time.time() - start:.2f}s")
 
# Threaded - NOT faster due to GIL!
start = time.time()
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(cpu_intensive, [10_000_000] * 4))
print(f"Threaded: {time.time() - start:.2f}s")

Output:

Sequential: 3.42s
Threaded: 3.51s  # Same or worse!

The GIL means only one thread runs Python code at a time. For CPU-bound work, use multiprocessing or ProcessPoolExecutor instead:

from concurrent.futures import ProcessPoolExecutor

# This DOES parallelize CPU work - each process has its own GIL
if __name__ == "__main__":  # Required where workers are spawned (Windows, macOS)
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(cpu_intensive, [10_000_000] * 4))

Common Pitfalls

Here are mistakes I made so you don't have to:

1. Forgetting to Join

# Bad - program might exit before thread finishes
t = threading.Thread(target=long_task)
t.start()
# No join!
 
# Good
t = threading.Thread(target=long_task)
t.start()
t.join()  # Wait for completion

2. Sharing Mutable State Without Locks

# Bad - race condition
shared_list = []
 
def append_items():
    for i in range(100):
        shared_list.append(i)  # Can cause issues!
 
# Good - use a lock
lock = threading.Lock()
shared_list = []
 
def append_items():
    for i in range(100):
        with lock:
            shared_list.append(i)
 
# Better - use thread-safe structures
from queue import Queue
shared_queue = Queue()
 
def append_items():
    for i in range(100):
        shared_queue.put(i)  # Thread-safe!
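
The Queue really shines in a worker pattern: threads pull items until they see a sentinel. This sketch (the None sentinel and the doubling "work" are arbitrary choices) also shows task_done()/join(), which let the main thread wait until every queued item has been processed:

```python
import threading
from queue import Queue

q = Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        item = q.get()
        if item is None:  # Sentinel: no more work
            q.task_done()
            break
        with results_lock:
            results.append(item * 2)
        q.task_done()

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for i in range(10):
    q.put(i)
for _ in threads:
    q.put(None)  # One sentinel per worker

q.join()  # Blocks until every task has been marked done
for t in threads:
    t.join()

print(sorted(results))  # [0, 2, 4, ..., 18]
```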

3. Deadlocks

import threading
import time

# Deadlock! Thread 1 has lock_a, waits for lock_b
#           Thread 2 has lock_b, waits for lock_a
lock_a = threading.Lock()
lock_b = threading.Lock()
 
def thread1():
    with lock_a:
        time.sleep(0.1)
        with lock_b:  # Waits forever
            pass
 
def thread2():
    with lock_b:
        time.sleep(0.1)
        with lock_a:  # Waits forever
            pass
 
# Fix: Always acquire locks in the same order
def thread1_fixed():
    with lock_a:
        with lock_b:
            pass
 
def thread2_fixed():
    with lock_a:  # Same order!
        with lock_b:
            pass

4. Using Threading for CPU-Bound Work

See the section above. Use multiprocessing for CPU work.

5. Too Many Threads

# Bad - thousands of threads = overhead
with ThreadPoolExecutor(max_workers=1000) as executor:
    results = executor.map(fetch, urls)
 
# Good - reasonable pool size
# Rule of thumb: 2-4x CPU cores for I/O-bound
import os
workers = min(32, (os.cpu_count() or 1) * 4)  # cpu_count() can return None
with ThreadPoolExecutor(max_workers=workers) as executor:
    results = executor.map(fetch, urls)

6. Not Handling Exceptions in Threads

# Bad - exception in thread silently disappears
def worker():
    raise ValueError("Oops!")
 
t = threading.Thread(target=worker)
t.start()
t.join()
# No error shown!
 
# Good - catch exceptions in the thread
def worker():
    try:
        raise ValueError("Oops!")
    except Exception as e:
        print(f"Error in thread: {e}")
 
# Or use ThreadPoolExecutor and check results
with ThreadPoolExecutor() as executor:
    future = executor.submit(worker)
    try:
        future.result()  # Raises the exception
    except ValueError as e:
        print(f"Caught: {e}")

Quick Reference

import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
 
# Simple thread
t = threading.Thread(target=func, args=(arg1, arg2))
t.start()
t.join()
 
# Thread pool
with ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(func, items)
 
# Lock for shared state
lock = threading.Lock()
with lock:
    modify_shared_state()
 
# Thread-safe queue
from queue import Queue
q = Queue()
q.put(item)
item = q.get()

Final Thoughts

Threading in Python isn't as scary as it seems once you understand:

  1. The GIL makes threading great for I/O, useless for CPU
  2. Use locks when multiple threads modify shared state
  3. ThreadPoolExecutor is cleaner than managing threads manually
  4. Use multiprocessing for CPU-bound parallelism

Start with ThreadPoolExecutor for most cases. It handles the hard parts and keeps your code clean. Only reach for lower-level primitives when you need more control.
