I recently needed to compare two configuration files and show the user what changed. My first instinct was to write a manual loop comparing lines—that would've been painful. Then I discovered difflib. This module does all the heavy lifting for sequence comparison, and it's been in the standard library since Python 2.1.
Here's everything I've learned about it.
The Core: SequenceMatcher
SequenceMatcher is the foundation of difflib. It compares any two sequences (strings, lists, whatever) and figures out how similar they are and what's different.
from difflib import SequenceMatcher
text_a = "the quick brown fox"
text_b = "the quack brown box"
matcher = SequenceMatcher(None, text_a, text_b)
The first argument is a "junk" function; pass None for now. We'll cover that later.
Similarity with ratio()
The most common thing you'll want: how similar are these two strings?
print(matcher.ratio())  # 0.8947...
A ratio of 1.0 means identical. 0.0 means nothing in common. The value is 2.0 * M / T, where M is the number of matching characters and T is the total number of characters in both sequences.
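To make the formula concrete, here is a small sketch that recomputes the ratio by hand from the matching blocks. The variable names (matched, total, manual_ratio) are mine, not part of difflib.

```python
from difflib import SequenceMatcher

text_a = "the quick brown fox"
text_b = "the quack brown box"
matcher = SequenceMatcher(None, text_a, text_b)

# M: total size of all matching blocks (the zero-length sentinel adds nothing)
matched = sum(block.size for block in matcher.get_matching_blocks())
# T: combined length of both sequences
total = len(text_a) + len(text_b)

manual_ratio = 2.0 * matched / total
print(matched, total)          # 17 38
print(round(manual_ratio, 4))  # 0.8947
```

The factor of 2 is what lets two identical strings reach 1.0, since every matched character is counted once per sequence in T.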
Faster Alternatives: quick_ratio() and real_quick_ratio()
Computing ratio() can be slow for very long sequences. If you just need a ballpark estimate:
# Upper bound (may overestimate)
print(matcher.quick_ratio()) # 0.8947...
# Even faster upper bound
print(matcher.real_quick_ratio())  # 1.0
When would you use these? Say you're fuzzy-searching through thousands of strings. Use real_quick_ratio() first to eliminate obvious non-matches, then quick_ratio() to narrow further, then ratio() for your final candidates.
def fuzzy_search(query: str, candidates: list[str], cutoff: float = 0.6) -> list[tuple[str, float]]:
    """Efficient fuzzy search using tiered ratio checks."""
    results = []
    for candidate in candidates:
        matcher = SequenceMatcher(None, query.lower(), candidate.lower())
        # Quick reject if either cheap upper bound is below the cutoff
        if matcher.real_quick_ratio() < cutoff:
            continue
        if matcher.quick_ratio() < cutoff:
            continue
        score = matcher.ratio()  # exact score, computed once
        if score >= cutoff:
            results.append((candidate, score))
    # Best matches first
    return sorted(results, key=lambda x: x[1], reverse=True)
Understanding What Changed: get_matching_blocks()
This shows you exactly which parts match between two sequences:
s1 = "abXcd"
s2 = "abYcd"
matcher = SequenceMatcher(None, s1, s2)
for block in matcher.get_matching_blocks():
    print(f"Match: s1[{block.a}:{block.a + block.size}] = '{s1[block.a:block.a + block.size]}'")
Output:
Match: s1[0:2] = 'ab'
Match: s1[3:5] = 'cd'
Match: s1[5:5] = ''
The last block with size 0 is always a "sentinel" marking the end.
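Matching blocks are not limited to strings. As a small sketch (the word lists are made up for illustration), here is the same call on lists of words, with the sentinel filtered out:

```python
from difflib import SequenceMatcher

words_a = "the quick brown fox jumps".split()
words_b = "the slow brown fox leaps".split()

matcher = SequenceMatcher(None, words_a, words_b)
blocks = matcher.get_matching_blocks()

# Drop the zero-length sentinel and keep the matched word runs
matched_runs = [words_a[b.a:b.a + b.size] for b in blocks if b.size]
print(matched_runs)  # [['the'], ['brown', 'fox']]
```

Any sequence of hashable items works the same way, which is what makes word-level or line-level comparison possible.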
The Edit Script: get_opcodes()
This is powerful. It tells you exactly what operations transform sequence A into sequence B:
s1 = "hello world"
s2 = "hello there"
matcher = SequenceMatcher(None, s1, s2)
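for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(f"{tag:8} s1[{i1}:{i2}] '{s1[i1:i2]}' -> s2[{j1}:{j2}] '{s2[j1:j2]}'")
Output:
equal    s1[0:6] 'hello ' -> s2[0:6] 'hello '
replace  s1[6:8] 'wo' -> s2[6:9] 'the'
equal    s1[8:9] 'r' -> s2[9:10] 'r'
replace  s1[9:11] 'ld' -> s2[10:11] 'e'
Note that "world" and "there" share an 'r', so the matcher keeps it as a one-character match and splits the change in two.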
The operation tags:
equal: No change needed
replace: Replace with different content
insert: Add new content (s1 range will be empty)
delete: Remove content (s2 range will be empty)
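The insert and delete tags are easier to see on a list than on a string. A minimal sketch, using made-up word lists (old_words and new_words are just illustration names):

```python
from difflib import SequenceMatcher

old_words = ["alpha", "beta", "gamma", "delta"]
new_words = ["alpha", "gamma", "delta", "epsilon"]

opcodes = SequenceMatcher(None, old_words, new_words).get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(f"{tag:8} old[{i1}:{i2}] {old_words[i1:i2]} -> new[{j1}:{j2}] {new_words[j1:j2]}")

# The four tags, in order: equal, delete, equal, insert
tags = [op[0] for op in opcodes]
```

For the delete opcode the new-side range is empty (j1 == j2), and for the insert opcode the old-side range is empty (i1 == i2).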
This is how you'd highlight changes in a UI:
from html import escape
def highlight_changes(old: str, new: str) -> str:
    """Return HTML with changes highlighted."""
    matcher = SequenceMatcher(None, old, new)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Escape the text so that <, > and & in the input cannot break the markup
        old_part, new_part = escape(old[i1:i2]), escape(new[j1:j2])
        if tag == 'equal':
            result.append(old_part)
        elif tag == 'delete':
            result.append(f'<del>{old_part}</del>')
        elif tag == 'insert':
            result.append(f'<ins>{new_part}</ins>')
        elif tag == 'replace':
            result.append(f'<del>{old_part}</del><ins>{new_part}</ins>')
    return ''.join(result)
print(highlight_changes("hello world", "hello there"))
# hello <del>wo</del><ins>the</ins>r<del>ld</del><ins>e</ins>
Differ: Line-by-Line Text Comparison
For comparing multiline text, Differ produces human-readable output:
from difflib import Differ
text1 = ["one\n", "two\n", "three\n"]
text2 = ["one\n", "TWO\n", "three\n", "four\n"]
differ = Differ()
diff = differ.compare(text1, text2)
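print(''.join(diff))
Output:
  one
- two
+ TWO
  three
+ four
When a removed line and an added line are similar enough, Differ also emits ? lines between them, with markers such as ^ pointing at the characters that changed. That does not happen here: "two" and "TWO" share no letters, so Differ reports a plain removal and addition.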
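The ? guide lines only show up when Differ considers two lines close enough to be an edit of one another. Here is a small sketch with a pair of lines that differ by a single character; the sample sentences are my own.

```python
from difflib import Differ

before = ["the quick brown fox\n"]
after = ["the quack brown fox\n"]

delta = list(Differ().compare(before, after))
print("".join(delta), end="")

# Lines starting with "? " are the intraline guides
guides = [line for line in delta if line.startswith("? ")]
print(len(guides) > 0)
```

Each guide line sits under the line it annotates, and the ^ marker lines up with the character that changed.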
Understanding Differ Markers
'  ' (two spaces) - line exists in both
'- ' - line only in first sequence
'+ ' - line only in second sequence
'? ' - hints about character-level changes
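Those markers carry enough information to rebuild either input. A short sketch using ndiff (the function form of Differ.compare) together with difflib.restore:

```python
from difflib import ndiff, restore

text1 = ["one\n", "two\n", "three\n"]
text2 = ["one\n", "TWO\n", "three\n", "four\n"]

delta = list(ndiff(text1, text2))

# which=1 rebuilds the first sequence, which=2 rebuilds the second
original = list(restore(delta, 1))
modified = list(restore(delta, 2))
print(original == text1, modified == text2)  # True True
```

This is handy when you only stored the delta and need one side back.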
unified_diff: Git-Style Diffs
This is what you want for patches and code reviews:
from difflib import unified_diff
old_code = """def greet(name):
    print("Hello " + name)
    return True
""".splitlines(keepends=True)
new_code = """def greet(name):
    print(f"Hello {name}!")
    return True
""".splitlines(keepends=True)
diff = unified_diff(
    old_code, new_code,
    fromfile='greet.py.old',
    tofile='greet.py'
)
print(''.join(diff))
Output:
--- greet.py.old
+++ greet.py
@@ -1,3 +1,3 @@
 def greet(name):
-    print("Hello " + name)
+    print(f"Hello {name}!")
     return True
The @@ -1,3 +1,3 @@ header means the hunk starts at line 1 and spans 3 lines in the old file, and starts at line 1 and spans 3 lines in the new file.
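If you ever need those numbers programmatically, the header is easy to pull apart. This is only a sketch: HUNK_RE is a hypothetical pattern I wrote for illustration, not something difflib provides, and the sample lines are made up.

```python
import re
from difflib import unified_diff

old_lines = ["a\n", "b\n", "c\n", "d\n", "e\n", "f\n", "g\n", "h\n"]
new_lines = ["a\n", "b\n", "c\n", "d\n", "E\n", "f\n", "g\n", "h\n"]

# n=1 keeps one line of context on each side of the change
diff = list(unified_diff(old_lines, new_lines, fromfile="old", tofile="new", n=1))

# Hypothetical helper pattern: @@ -start,count +start,count @@
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

hunks = []
for line in diff:
    m = HUNK_RE.match(line)
    if m:
        old_start, old_count, new_start, new_count = m.groups()
        # A missing count means the hunk covers exactly one line
        hunks.append((int(old_start), int(old_count or 1),
                      int(new_start), int(new_count or 1)))

print(hunks)  # [(4, 3, 4, 3)]
```

The change is on line 5, and with one line of context the hunk covers lines 4 through 6 on both sides.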
context_diff: The Older Context Format
If you need the older context format (like the diff -c command), which shows the before and after versions of each hunk separately:
from difflib import context_diff
diff = context_diff(
    old_code, new_code,
    fromfile='greet.py.old',
    tofile='greet.py',
    n=3  # lines of context
)
print(''.join(diff))
Output:
*** greet.py.old
--- greet.py
***************
*** 1,3 ****
  def greet(name):
!     print("Hello " + name)
      return True
--- 1,3 ----
  def greet(name):
!     print(f"Hello {name}!")
      return True
The ! marks changed lines. I find unified diffs easier to read, but context diffs are still used in some tools.
HtmlDiff: Visual HTML Output
Need to show diffs in a web UI? HtmlDiff generates a side-by-side HTML table:
from difflib import HtmlDiff
old_lines = ["line one", "line two", "line three"]
new_lines = ["line one", "line 2", "line three", "line four"]
differ = HtmlDiff()
# Full HTML document
html_doc = differ.make_file(old_lines, new_lines,
                            fromdesc='Original',
                            todesc='Modified')
# Just the table (for embedding)
html_table = differ.make_table(old_lines, new_lines,
                               fromdesc='Original',
                               todesc='Modified')
# Write to file (make_file declares utf-8 in the page by default)
with open('diff_output.html', 'w', encoding='utf-8') as f:
    f.write(html_doc)
The output includes CSS styling with deletions in red, additions in green, and changed text in yellow. It's not beautiful, but it works.
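For longer inputs the full table gets unwieldy. make_table and make_file both accept context and numlines arguments to show only the neighbourhood of each change. A small sketch with generated sample lines:

```python
from difflib import HtmlDiff

old_lines = [f"line {i}" for i in range(1, 21)]
new_lines = list(old_lines)
new_lines[9] = "line ten"  # change a single line in the middle

differ = HtmlDiff()
full_table = differ.make_table(old_lines, new_lines,
                               fromdesc="Original", todesc="Modified")
# context=True hides unchanged regions; numlines sets how many
# neighbouring lines stay visible around each change
context_table = differ.make_table(old_lines, new_lines,
                                  fromdesc="Original", todesc="Modified",
                                  context=True, numlines=2)

print(len(context_table) < len(full_table))  # True
```

With 20 lines and one change, the contextual table keeps only a handful of rows around line 10.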
Practical Patterns
Pattern 1: File Diffing
from difflib import unified_diff
from pathlib import Path
def diff_files(path1: str, path2: str) -> str:
    """Generate a unified diff between two files."""
    file1 = Path(path1)
    file2 = Path(path2)
    lines1 = file1.read_text().splitlines(keepends=True)
    lines2 = file2.read_text().splitlines(keepends=True)
    diff = unified_diff(
        lines1, lines2,
        fromfile=str(path1),
        tofile=str(path2)
    )
    return ''.join(diff)
# Usage
result = diff_files('config.old.yaml', 'config.yaml')
if result:
    print(result)
else:
    print("Files are identical")
Pattern 2: Fuzzy Matching with get_close_matches
get_close_matches is a convenience function for finding similar strings:
from difflib import get_close_matches
commands = ["start", "stop", "status", "restart", "help", "quit"]
# Find matches for a typo
matches = get_close_matches("strt", commands)
print(matches)  # ['start', 'restart', 'status']
# Control number of results and minimum similarity
matches = get_close_matches(
    "sta",
    commands,
    n=3,         # max 3 results (the default)
    cutoff=0.6   # minimum similarity ratio (the default)
)
print(matches)  # ['start', 'status', 'restart']
This is perfect for "did you mean?" suggestions:
def suggest_command(user_input: str, valid_commands: list[str]) -> str:
    """Suggest a correction if command is misspelled."""
    if user_input in valid_commands:
        return user_input
    suggestions = get_close_matches(user_input, valid_commands, n=1, cutoff=0.6)
    if suggestions:
        return f"Unknown command '{user_input}'. Did you mean '{suggestions[0]}'?"
    return f"Unknown command '{user_input}'."
print(suggest_command("strat", commands))
# Unknown command 'strat'. Did you mean 'start'?
Pattern 3: Configuration Change Detection
from typing import NamedTuple
class ConfigChanges(NamedTuple):
    added: list[str]
    removed: list[str]
    modified: list[tuple[str, str]]
def detect_changes(old_config: dict, new_config: dict) -> ConfigChanges:
    """Detect what changed between two config dictionaries."""
    added = []
    removed = []
    modified = []
    # Union of keys from both configs, sorted for a stable report order
    all_keys = sorted(set(old_config) | set(new_config), key=str)
    for key in all_keys:
        if key not in old_config:
            added.append(f"{key}={new_config[key]}")
        elif key not in new_config:
            removed.append(f"{key}={old_config[key]}")
        elif old_config[key] != new_config[key]:
            modified.append((
                f"{key}={old_config[key]}",
                f"{key}={new_config[key]}"
            ))
    return ConfigChanges(added, removed, modified)
# Usage
old = {"debug": False, "port": 8080, "timeout": 30}
new = {"debug": True, "port": 8080, "workers": 4}
changes = detect_changes(old, new)
print(f"Added: {changes.added}")  # ['workers=4']
print(f"Removed: {changes.removed}")  # ['timeout=30']
print(f"Modified: {changes.modified}")  # [('debug=False', 'debug=True')]
Pattern 4: Similarity Ranking
from difflib import SequenceMatcher
def rank_by_similarity(query: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Rank candidates by similarity to query."""
    scored = []
    for candidate in candidates:
        ratio = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
        scored.append((candidate, ratio))
    return sorted(scored, key=lambda x: x[1], reverse=True)
# Finding the best product match
products = ["iPhone 15 Pro", "iPhone 15", "iPhone 14", "iPad Pro", "MacBook Pro"]
results = rank_by_similarity("iphone 15 por", products)
for product, score in results[:3]:
    print(f"{product}: {score:.1%}")
# iPhone 15 Pro: 92.3%
# iPhone 15: 81.8%
# iPhone 14: 72.7%
Pattern 5: Ignoring Whitespace
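A junk function does not remove characters from the comparison. It only tells SequenceMatcher not to anchor matches on them, so the ratio still counts every character. To ignore whitespace differences, normalize the strings before comparing:
from difflib import SequenceMatcher
# Treat whitespace as junk
def whitespace_junk(char):
    return char in " \t\n"
s1 = "hello" + " " * 2 + "world"
s2 = "hello" + " " * 4 + "world"
# Without junk filtering
matcher1 = SequenceMatcher(None, s1, s2)
print(f"With spaces: {matcher1.ratio():.3f}")  # 0.923
# With junk filtering: the extra spaces still count against the score
matcher2 = SequenceMatcher(whitespace_junk, s1, s2)
print(f"Junk filter: {matcher2.ratio():.3f}")  # 0.923
# Collapse whitespace runs first to really ignore them
matcher3 = SequenceMatcher(None, " ".join(s1.split()), " ".join(s2.split()))
print(f"Normalized: {matcher3.ratio():.3f}")  # 1.000
Performance Considerations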
SequenceMatcher has O(n²) worst-case complexity. For very long sequences:
- Use the tiered ratio approach shown earlier
- Disable autojunk for accuracy: By default, when the second sequence has at least 200 elements, SequenceMatcher treats any element that makes up more than 1% of it as "junk" to speed things up. This can cause weird results:
# Long sequences made of heavily repeated characters
s1 = "b" * 200 + "a" * 200
s2 = "a" * 200 + "c" * 200
# Default behavior: every character in s2 counts as "popular", so the shared run of a's is missed
matcher1 = SequenceMatcher(None, s1, s2)
print(matcher1.ratio())  # 0.0
# Disable autojunk for accuracy
matcher2 = SequenceMatcher(None, s1, s2, autojunk=False)
print(matcher2.ratio())  # 0.5
- For huge files, consider line-by-line comparison instead of character-by-character.
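Here is a rough sketch of that last tip. The helper name line_similarity and the sample documents are mine; the idea is simply to hand SequenceMatcher lists of lines instead of raw strings.

```python
from difflib import SequenceMatcher

def line_similarity(text_a: str, text_b: str) -> float:
    """Similarity computed over whole lines rather than characters."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    # autojunk=False so that frequently repeated lines (blank lines, braces) still count
    return SequenceMatcher(None, lines_a, lines_b, autojunk=False).ratio()

doc_a = "alpha\nbeta\ngamma\ndelta\n"
doc_b = "alpha\nbeta\nGAMMA\ndelta\n"
print(line_similarity(doc_a, doc_b))  # 0.75
```

Three of the four lines match, so the ratio is 2 * 3 / 8. The trade-off is that a one-character edit makes a whole line count as different.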
Quick Reference
from difflib import (
    SequenceMatcher,    # Core comparison engine
    Differ,             # Line-by-line with markers
    get_close_matches,  # Find similar strings
    unified_diff,       # Git-style patches
    context_diff,       # Context-style patches
    HtmlDiff,           # Visual HTML tables
    ndiff,              # Differ-style delta as a plain function
)
# Similarity ratio (0 to 1)
SequenceMatcher(None, a, b).ratio()
# What operations to transform a->b
SequenceMatcher(None, a, b).get_opcodes()
# Where sequences match
SequenceMatcher(None, a, b).get_matching_blocks()
# Find similar strings
get_close_matches(word, candidates, n=3, cutoff=0.6)
# Generate patch
''.join(unified_diff(old_lines, new_lines, fromfile='a', tofile='b'))
Wrapping Up
I started this exploration because I needed to compare two config files. What I found was a complete toolkit for sequence comparison. The key insight: SequenceMatcher is the engine, everything else is convenience wrappers.
For simple "are these similar?" questions, use ratio(). For "what changed?", use get_opcodes(). For patches, use unified_diff(). And for fuzzy search, get_close_matches() saves you from writing loops.
The next time you need to compare anything sequential in Python, skip the manual approach. difflib has been solving this problem for decades.