A Junior Engineer's Guide to Python's difflib Module

I recently needed to compare two configuration files and show the user what changed. My first instinct was to write a manual loop comparing lines, which would have been painful. Then I discovered difflib. This module does all the heavy lifting for sequence comparison, and it's been in the standard library since Python 2.1.

Here's everything I've learned about it.

The Core: SequenceMatcher

SequenceMatcher is the foundation of difflib. It compares any two sequences (strings, lists of lines, lists of words, anything whose elements are hashable) and figures out how similar they are and what's different.

from difflib import SequenceMatcher
 
text_a = "the quick brown fox"
text_b = "the quack brown box"
 
matcher = SequenceMatcher(None, text_a, text_b)

The first argument is an optional "junk" function (isjunk). Pass None for now. We'll cover that later.

Similarity with ratio()

The most common thing you'll want: how similar are these two strings?

print(matcher.ratio())  # 0.8947...

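A ratio of 1.0 means identical. 0.0 means nothing in common. The value is 2 * M / T, where M is the number of matching elements and T is the total number of elements in both sequences. Here 17 characters match and the two strings have 38 characters between them, so the ratio is 34 / 38.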
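As a sanity check, here is a small sketch that recomputes the score by hand as 2 * M / T from the matching blocks. The variable names are mine, not part of difflib.

```python
from difflib import SequenceMatcher

text_a = "the quick brown fox"
text_b = "the quack brown box"
matcher = SequenceMatcher(None, text_a, text_b)

# M: total size of all matching blocks (the zero-size sentinel adds nothing)
matches = sum(block.size for block in matcher.get_matching_blocks())
# T: combined length of both sequences
total = len(text_a) + len(text_b)

manual_ratio = 2 * matches / total
print(matches, total)  # 17 38
print(abs(manual_ratio - matcher.ratio()) < 1e-12)  # True
```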

Faster Alternatives: quick_ratio() and real_quick_ratio()

Computing ratio() can be slow for very long sequences. If a cheap upper bound on the score is enough:

# Upper bound (may overestimate)
print(matcher.quick_ratio())      # 0.8947...
 
# Even faster upper bound
print(matcher.real_quick_ratio()) # 1.0

When would you use these? Say you're fuzzy-searching through thousands of strings. Use real_quick_ratio() first to eliminate obvious non-matches, then quick_ratio() to narrow further, then ratio() for your final candidates.

def fuzzy_search(query: str, candidates: list[str], cutoff: float = 0.6) -> list[tuple[str, float]]:
    """Efficient fuzzy search using tiered ratio checks.

    Returns (candidate, score) pairs, best match first.
    """
    results = []
    query_lower = query.lower()
    
    for candidate in candidates:
        matcher = SequenceMatcher(None, query_lower, candidate.lower())
        
        # Quick reject if either upper bound is below cutoff
        if matcher.real_quick_ratio() < cutoff:
            continue
        if matcher.quick_ratio() < cutoff:
            continue
        
        score = matcher.ratio()  # exact (and most expensive) score
        if score >= cutoff:
            results.append((candidate, score))
    
    return sorted(results, key=lambda x: x[1], reverse=True)
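The tiering is safe because each cheaper method is an upper bound on the next one: if a cheap bound is already below the cutoff, the exact ratio must be too. A quick sketch that checks this ordering on a few sample pairs of my own choosing:

```python
from difflib import SequenceMatcher

pairs = [
    ("the quick brown fox", "the quack brown box"),
    ("restart", "start"),
    ("abcd", "dcba"),
]

for a, b in pairs:
    m = SequenceMatcher(None, a, b)
    cheap, medium, exact = m.real_quick_ratio(), m.quick_ratio(), m.ratio()
    # Each cheaper method is an upper bound on the more expensive one
    assert cheap >= medium >= exact
    print(f"{a!r} vs {b!r}: {cheap:.3f} >= {medium:.3f} >= {exact:.3f}")
```

The "abcd" / "dcba" pair shows how loose the bounds can be: both strings contain the same letters, so the cheap checks cannot rule the pair out, while the order-aware ratio() is much lower.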

Understanding What Changed: get_matching_blocks()

This shows you exactly which parts match between two sequences:

s1 = "abXcd"
s2 = "abYcd"
 
matcher = SequenceMatcher(None, s1, s2)
 
for block in matcher.get_matching_blocks():
    print(f"Match: s1[{block.a}:{block.a + block.size}] = '{s1[block.a:block.a + block.size]}'")

Output:

Match: s1[0:2] = 'ab'
Match: s1[3:5] = 'cd'
Match: s1[5:5] = ''

The last block always has size 0. It is a "sentinel" of the form (len(s1), len(s2), 0) marking the end.
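The same call works on any sequence, not just strings. A minimal sketch comparing two lists of words (the sample sentences are mine):

```python
from difflib import SequenceMatcher

words_a = "the quick brown fox".split()
words_b = "the slow brown fox".split()

matcher = SequenceMatcher(None, words_a, words_b)
blocks = matcher.get_matching_blocks()

# Each block is a (a_start, b_start, size) named tuple
for a_start, b_start, size in blocks:
    print(a_start, b_start, size, words_a[a_start:a_start + size])
# 0 0 1 ['the']
# 2 2 2 ['brown', 'fox']
# 4 4 0 []
```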

The Edit Script: get_opcodes()

This is powerful. It tells you exactly what operations transform sequence A into sequence B:

s1 = "hello world"
s2 = "hello there"
 
matcher = SequenceMatcher(None, s1, s2)
 
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(f"{tag:8} s1[{i1}:{i2}] '{s1[i1:i2]}' -> s2[{j1}:{j2}] '{s2[j1:j2]}'")

Output:

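equal    s1[0:6] 'hello ' -> s2[0:6] 'hello '
replace  s1[6:8] 'wo' -> s2[6:9] 'the'
equal    s1[8:9] 'r' -> s2[9:10] 'r'
replace  s1[9:11] 'ld' -> s2[10:11] 'e'

Notice that the matcher keeps the shared "r" in "world" and "there" as its own equal block, so the change is reported as two small replacements rather than one.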

The operation tags:

equal: No change needed
replace: Replace with different content
insert: Add new content (s1 range will be empty)
delete: Remove content (s2 range will be empty)
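One way to convince yourself that the opcodes really are a complete edit script is to replay them. The helper below is a hypothetical sketch (apply_opcodes is my own name, not a difflib function) that rebuilds the second string from the first:

```python
from difflib import SequenceMatcher

def apply_opcodes(a: str, b: str) -> str:
    """Rebuild b by walking the opcodes that turn a into b (hypothetical helper)."""
    pieces = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            pieces.append(a[i1:i2])      # copy unchanged text from a
        elif tag in ("replace", "insert"):
            pieces.append(b[j1:j2])      # take the new text from b
        # "delete": contribute nothing, a[i1:i2] is dropped
    return "".join(pieces)

rebuilt = apply_opcodes("hello world", "hello there")
print(rebuilt)  # hello there
```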

This is how you'd highlight changes in a UI:

import html

def highlight_changes(old: str, new: str) -> str:
    """Return HTML with changes highlighted (text is HTML-escaped)."""
    matcher = SequenceMatcher(None, old, new)
    result = []
    
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Escape user text so characters like < and & can't break the markup
        old_part = html.escape(old[i1:i2])
        new_part = html.escape(new[j1:j2])
        if tag == 'equal':
            result.append(old_part)
        elif tag == 'delete':
            result.append(f'<del>{old_part}</del>')
        elif tag == 'insert':
            result.append(f'<ins>{new_part}</ins>')
        elif tag == 'replace':
            result.append(f'<del>{old_part}</del><ins>{new_part}</ins>')
    
    return ''.join(result)
 
print(highlight_changes("hello world", "hello there"))
# hello <del>wo</del><ins>the</ins>r<del>ld</del><ins>e</ins>

Differ: Line-by-Line Text Comparison

For comparing multiline text, Differ produces human-readable output:

from difflib import Differ
 
text1 = ["one\n", "two\n", "three\n"]
text2 = ["one\n", "TWO\n", "three\n", "four\n"]
 
differ = Differ()
diff = differ.compare(text1, text2)
print(''.join(diff))

Output:

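  one
- two
+ TWO
  three
+ four

Differ can also emit ? lines that point at the exact characters that changed, but only when the old and new lines are similar enough to be treated as one edited line. Here "two" and "TWO" share no characters apart from the newline, so Differ reports a plain removal and addition.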

Understanding Differ Markers

'  '  (two spaces) - line exists in both
'- '               - line only in first sequence
'+ '               - line only in second sequence
'? '               - hints about character-level changes
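The ? guide lines only show up when Differ decides that an old and a new line are close enough to be the same line edited. A minimal sketch with sample strings of my own, where the lines differ by a single character:

```python
from difflib import Differ

before = ["the second line\n"]
after = ["the Second line\n"]

# The lines differ by one character, so Differ pairs them up and
# emits a "? " guide line under each version.
diff = list(Differ().compare(before, after))
print("".join(diff), end="")

hint_lines = [line for line in diff if line.startswith("? ")]
print(len(hint_lines))  # 2
```

The ^ in each guide line sits under the character that was replaced.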

unified_diff: Git-Style Diffs

This is what you want for patches and code reviews:

from difflib import unified_diff
 
old_code = """def greet(name):
    print("Hello " + name)
    return True
""".splitlines(keepends=True)
 
new_code = """def greet(name):
    print(f"Hello {name}!")
    return True
""".splitlines(keepends=True)
 
diff = unified_diff(
    old_code, new_code,
    fromfile='greet.py.old',
    tofile='greet.py'
)
 
print(''.join(diff))

Output:

--- greet.py.old
+++ greet.py
@@ -1,3 +1,3 @@
 def greet(name):
-    print("Hello " + name)
+    print(f"Hello {name}!")
     return True

The @@ -1,3 +1,3 @@ header describes the hunk: it covers 3 lines starting at line 1 of the old file (-1,3) and 3 lines starting at line 1 of the new file (+1,3).
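To see how the hunk header follows the amount of context, here is a small sketch using the n and lineterm parameters of unified_diff. The file names and sample lines are placeholders of mine.

```python
from difflib import unified_diff

old = ["a", "b", "c", "d", "e"]
new = ["a", "b", "X", "d", "e"]

# lineterm="" because these lines carry no trailing newlines;
# n=1 keeps one line of context on each side of the change.
lines = list(unified_diff(old, new, fromfile="old.txt", tofile="new.txt",
                          n=1, lineterm=""))
print("\n".join(lines))
# --- old.txt
# +++ new.txt
# @@ -2,3 +2,3 @@
#  b
# -c
# +X
#  d
```

With one line of context the hunk starts at line 2 and spans 3 lines on each side, which is exactly what -2,3 +2,3 says.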

context_diff: More Context

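context_diff produces the older "context" format (like the classic diff -c command). It shows the same amount of surrounding context as unified_diff (3 lines by default, controlled by n), but lists the old and new versions of each hunk separately: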

from difflib import context_diff
 
diff = context_diff(
    old_code, new_code,
    fromfile='greet.py.old',
    tofile='greet.py',
    n=3  # lines of context
)
 
print(''.join(diff))

Output:

*** greet.py.old
--- greet.py
***************
*** 1,3 ****
  def greet(name):
!     print("Hello " + name)
      return True
--- 1,3 ----
  def greet(name):
!     print(f"Hello {name}!")
      return True

The ! marks changed lines. I find unified diffs easier to read, but context diffs are still used in some tools.

HtmlDiff: Visual HTML Output

Need to show diffs in a web UI? HtmlDiff generates a side-by-side HTML table:

from difflib import HtmlDiff
 
old_lines = ["line one", "line two", "line three"]
new_lines = ["line one", "line 2", "line three", "line four"]
 
differ = HtmlDiff()
 
# Full HTML document
html_doc = differ.make_file(old_lines, new_lines, 
                             fromdesc='Original',
                             todesc='Modified')
 
# Just the table (for embedding)
html_table = differ.make_table(old_lines, new_lines,
                                fromdesc='Original',
                                todesc='Modified')
 
# Write to file (make_file declares a utf-8 charset by default, so write it as utf-8)
with open('diff_output.html', 'w', encoding='utf-8') as f:
    f.write(html_doc)

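The document from make_file() includes CSS styling with deletions in red, additions in green, and changed text in yellow. make_table() returns only the table, so you supply the styles yourself when embedding it. It's not beautiful, but it works.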
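For long files a full side-by-side table gets unwieldy. A minimal sketch, using sample data of my own, of the context and numlines parameters that limit the table to the rows around each change, plus wrapcolumn to wrap long lines:

```python
from difflib import HtmlDiff

old_lines = [f"line {i}" for i in range(1, 21)]
new_lines = list(old_lines)
new_lines[9] = "line ten"  # one change in the middle

# context=True with numlines=2 keeps only two unchanged lines
# around each change instead of the whole file.
table = HtmlDiff(wrapcolumn=60).make_table(
    old_lines, new_lines,
    fromdesc="Original", todesc="Modified",
    context=True, numlines=2,
)
full_table = HtmlDiff(wrapcolumn=60).make_table(old_lines, new_lines)

print(len(table) < len(full_table))
```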

Practical Patterns

Pattern 1: File Diffing

from difflib import unified_diff
from pathlib import Path
 
def diff_files(path1: str, path2: str) -> str:
    """Generate a unified diff between two files."""
    file1 = Path(path1)
    file2 = Path(path2)
    
    lines1 = file1.read_text().splitlines(keepends=True)
    lines2 = file2.read_text().splitlines(keepends=True)
    
    diff = unified_diff(
        lines1, lines2,
        fromfile=str(path1),
        tofile=str(path2)
    )
    
    return ''.join(diff)
 
# Usage
result = diff_files('config.old.yaml', 'config.yaml')
if result:
    print(result)
else:
    print("Files are identical")

Pattern 2: Fuzzy Matching with get_close_matches

get_close_matches is a convenience function for finding similar strings:

from difflib import get_close_matches
 
commands = ["start", "stop", "status", "restart", "help", "quit"]
 
# Find matches for a typo
matches = get_close_matches("strt", commands)
print(matches)  # ['start', 'restart', 'status']
 
# Control number of results and minimum similarity
matches = get_close_matches(
    "sta", 
    commands, 
    n=3,       # max 3 results
    cutoff=0.6  # minimum 60% similar
)
print(matches)  # ['start', 'status', 'restart']

This is perfect for "did you mean?" suggestions:

def suggest_command(user_input: str, valid_commands: list[str]) -> str:
    """Suggest a correction if command is misspelled."""
    if user_input in valid_commands:
        return user_input
    
    suggestions = get_close_matches(user_input, valid_commands, n=1, cutoff=0.6)
    
    if suggestions:
        return f"Unknown command '{user_input}'. Did you mean '{suggestions[0]}'?"
    return f"Unknown command '{user_input}'."
 
print(suggest_command("strat", commands))
# Unknown command 'strat'. Did you mean 'start'?

Pattern 3: Configuration Change Detection

from typing import NamedTuple
 
class ConfigChanges(NamedTuple):
    added: list[str]
    removed: list[str]
    modified: list[tuple[str, str]]
 
def detect_changes(old_config: dict, new_config: dict) -> ConfigChanges:
    """Detect what changed between two config dictionaries."""
    added = []
    removed = []
    modified = []
    
    all_keys = set(old_config) | set(new_config)
    
    # Sort so the reported order is stable from run to run
    for key in sorted(all_keys):
        if key not in old_config:
            added.append(f"{key}={new_config[key]}")
        elif key not in new_config:
            removed.append(f"{key}={old_config[key]}")
        elif old_config[key] != new_config[key]:
            modified.append((
                f"{key}={old_config[key]}",
                f"{key}={new_config[key]}"
            ))
    
    return ConfigChanges(added, removed, modified)
 
# Usage
old = {"debug": False, "port": 8080, "timeout": 30}
new = {"debug": True, "port": 8080, "workers": 4}
 
changes = detect_changes(old, new)
print(f"Added: {changes.added}")      # ['workers=4']
print(f"Removed: {changes.removed}")  # ['timeout=30']
print(f"Modified: {changes.modified}")  # [('debug=False', 'debug=True')]
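The function above compares the dictionaries key by key without touching difflib. If you would rather show the change as a patch, one option is to serialize both configs and diff the text. This is a sketch under my own assumptions: config_diff is a hypothetical helper, and sorted, indented JSON is just one convenient stable serialization.

```python
import json
from difflib import unified_diff

def config_diff(old_config: dict, new_config: dict) -> list[str]:
    """Unified diff of two configs serialized as sorted, indented JSON."""
    old_lines = json.dumps(old_config, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new_config, indent=2, sort_keys=True).splitlines()
    return list(unified_diff(old_lines, new_lines,
                             fromfile="old", tofile="new", lineterm=""))

old = {"debug": False, "port": 8080, "timeout": 30}
new = {"debug": True, "port": 8080, "workers": 4}

diff_lines = config_diff(old, new)
print("\n".join(diff_lines))
```

Sorting the keys matters: without it, two equal configs built in a different key order would show up as changed.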

Pattern 4: Similarity Ranking

from difflib import SequenceMatcher
 
def rank_by_similarity(query: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Rank candidates by similarity to query."""
    scored = []
    
    for candidate in candidates:
        ratio = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
        scored.append((candidate, ratio))
    
    return sorted(scored, key=lambda x: x[1], reverse=True)
 
# Finding the best product match
products = ["iPhone 15 Pro", "iPhone 15", "iPhone 14", "iPad Pro", "MacBook Pro"]
 
results = rank_by_similarity("iphone 15 por", products)
for product, score in results[:3]:
    print(f"{product}: {score:.1%}")
# iPhone 15 Pro: 92.3%
# iPhone 15: 81.8%
# iPhone 14: 72.7%

Pattern 5: Ignoring Whitespace

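The junk function marks elements that SequenceMatcher should not use as anchors when it looks for matches. It does not remove them from the sequences, so on its own it won't make a whitespace difference disappear. To actually ignore whitespace, normalize both strings before comparing:

from difflib import SequenceMatcher
 
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())
 
s1 = "hello   world"
s2 = "hello world"
 
# Raw comparison: the extra spaces count against the score
raw = SequenceMatcher(None, s1, s2)
print(f"Raw: {raw.ratio():.3f}")  # 0.917
 
# A junk function alone does not change the score here
junked = SequenceMatcher(lambda ch: ch in " \t\n", s1, s2)
print(f"With junk function: {junked.ratio():.3f}")  # 0.917
 
# Normalizing first really does ignore the whitespace difference
clean = SequenceMatcher(None, normalize_whitespace(s1), normalize_whitespace(s2))
print(f"Normalized: {clean.ratio():.3f}")  # 1.000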
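You don't always have to write junk predicates yourself. difflib ships two ready-made ones, IS_CHARACTER_JUNK (spaces and tabs) and IS_LINE_JUNK (blank lines and lines holding only a "#"). A quick sketch of what they accept:

```python
from difflib import IS_CHARACTER_JUNK, IS_LINE_JUNK

# Character-level predicate: True for a space or a tab
char_results = [IS_CHARACTER_JUNK(ch) for ch in (" ", "\t", "x")]
print(char_results)  # [True, True, False]

# Line-level predicate: True for blank lines and bare "#" lines
line_results = [IS_LINE_JUNK(line) for line in ("\n", "  #  \n", "x = 1\n")]
print(line_results)  # [True, True, False]
```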

Performance Considerations

SequenceMatcher has O(n²) worst-case complexity. For very long sequences:

Use the tiered ratio approach shown earlier
Disable autojunk for accuracy: By default, when the second sequence has 200 or more elements, SequenceMatcher treats any element that accounts for more than roughly 1% of it as "junk" and stops using it to find matches. This speeds things up, but it can cause weird results:

# Long sequences that differ only in the first character
s1 = "x" + "a" * 200
s2 = "y" + "a" * 200
 
# Default behavior: "a" is too frequent in s2, so no match is found
matcher1 = SequenceMatcher(None, s1, s2)
print(matcher1.ratio())  # 0.0
 
# Disable autojunk for accuracy
matcher2 = SequenceMatcher(None, s1, s2, autojunk=False)
print(matcher2.ratio())  # 0.995...

For huge files, consider line-by-line comparison instead of character-by-character.
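Switching granularity needs no new API: pass lists of lines instead of strings. A minimal sketch with sample text of my own:

```python
from difflib import SequenceMatcher

old_text = "alpha\nbeta\ngamma\ndelta\n"
new_text = "alpha\nbeta\nGAMMA\ndelta\n"

# Character level: every character is a sequence element
char_ratio = SequenceMatcher(None, old_text, new_text).ratio()

# Line level: each whole line is one element, so there is far less to compare
line_matcher = SequenceMatcher(None, old_text.splitlines(), new_text.splitlines())
line_ratio = line_matcher.ratio()

print(f"lines: {line_ratio:.2f}")  # lines: 0.75
```

The trade-off is precision: at line level a one-character edit makes the whole line count as different, which is usually fine when you only need to know which lines changed.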

Quick Reference

from difflib import (
    SequenceMatcher,   # Core comparison engine
    Differ,            # Line-by-line with markers
    get_close_matches, # Find similar strings
    unified_diff,      # Git-style patches
    context_diff,      # Context-style patches
    HtmlDiff,          # Visual HTML tables
    ndiff,             # Differ-style line diff as a function
)
 
# Similarity ratio (0 to 1)
SequenceMatcher(None, a, b).ratio()
 
# What operations to transform a->b
SequenceMatcher(None, a, b).get_opcodes()
 
# Where sequences match
SequenceMatcher(None, a, b).get_matching_blocks()
 
# Find similar strings
get_close_matches(word, candidates, n=3, cutoff=0.6)
 
# Generate patch
''.join(unified_diff(old_lines, new_lines, fromfile='a', tofile='b'))

Wrapping Up

I started this exploration because I needed to compare two config files. What I found was a complete toolkit for sequence comparison. The key insight: SequenceMatcher is the engine, everything else is convenience wrappers.

For simple "are these similar?" questions, use ratio(). For "what changed?", use get_opcodes(). For patches, use unified_diff(). And for fuzzy search, get_close_matches() saves you from writing loops.

The next time you need to compare anything sequential in Python, skip the manual approach. difflib has been solving this problem for decades.
