I've processed a lot of CSV files in Python. Most of the time, it's straightforward. But every few months, I hit some edge case that costs me an hour of debugging. This post is the reference I wish I'd had—covering the basics plus all the gotchas that aren't obvious until they break your code.
The Basics: reader and writer
The csv module is simple at its core:
```python
import csv

# Reading
with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)  # Each row is a list of strings

# Writing
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'age', 'city'])
    writer.writerow(['Alice', '30', 'New York'])

    # Or write multiple rows at once
    writer.writerows([
        ['Bob', '25', 'Chicago'],
        ['Carol', '35', 'Boston'],
    ])
```

This handles quoting, escaping, and edge cases automatically. Don't be tempted to `split(',')` manually; you'll break on the first field that contains a comma.
DictReader and DictWriter: Work with Names, Not Indices
When you have headers, using column names beats counting indices:
```python
import csv

# Reading as dictionaries
with open('users.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # row is a dict: {'name': 'Alice', 'email': 'alice@example.com'}
        print(f"Emailing {row['name']} at {row['email']}")

# Writing from dictionaries
users = [
    {'name': 'Alice', 'email': 'alice@example.com'},
    {'name': 'Bob', 'email': 'bob@example.com'},
]

with open('users.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'email'])
    writer.writeheader()  # Writes the column names
    writer.writerows(users)
```

The advantage is maintainability. When someone adds a new column to the CSV, code that accesses `row['email']` still works. Code that accesses `row[1]` might silently break.
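Here's a quick sketch of that failure mode using in-memory files (the `signup_date` column is a made-up example of a schema change):

```python
import csv
import io

before = "name,email\nAlice,alice@example.com\n"
# Someone inserts a 'signup_date' column in the middle
after = "name,signup_date,email\nAlice,2024-01-01,alice@example.com\n"

# Name-based access survives the schema change
for data in (before, after):
    row = next(csv.DictReader(io.StringIO(data)))
    print(row['email'])  # alice@example.com both times

# Index-based access silently picks up the wrong column
header, record = list(csv.reader(io.StringIO(after)))
print(record[1])  # 2024-01-01, not the email
```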
Handling Missing or Extra Fields
DictWriter has options for fields that don't match:
```python
import csv

# Ignore extra keys in the dict
writer = csv.DictWriter(f, fieldnames=['name'], extrasaction='ignore')

# Missing keys get a default value
writer = csv.DictWriter(f, fieldnames=['name', 'email'], restval='N/A')
```

Delimiters and Quoting: Beyond Commas
Not everything is comma-separated. European Excel uses semicolons. Tab-separated (TSV) is common in data science. Pipes appear in legacy systems.
```python
import csv

# Tab-separated
with open('data.tsv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        print(row)

# Semicolon-separated (common in European locales)
reader = csv.reader(f, delimiter=';')

# Pipe-separated
reader = csv.reader(f, delimiter='|')
```

Quoting Modes
Control how fields get quoted:
```python
import csv

# QUOTE_MINIMAL (default): only quote when necessary
writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)

# QUOTE_ALL: quote every field
writer = csv.writer(f, quoting=csv.QUOTE_ALL)

# QUOTE_NONNUMERIC: quote strings, leave numbers bare
# Note: when reading, this converts unquoted fields to floats
writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)

# QUOTE_NONE: never quote (must set escapechar)
writer = csv.writer(f, quoting=csv.QUOTE_NONE, escapechar='\\')
```

Custom Dialects
If you're working with the same weird format repeatedly, register a dialect:
```python
import csv

csv.register_dialect('european',
    delimiter=';',
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL,
)

# Now use it by name
with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f, dialect='european')
```

Built-in dialects: `'excel'`, `'excel-tab'`, and `'unix'`.
The Encoding Gotcha: UTF-8 BOM
This one cost me hours. Excel on Windows saves CSV files with a UTF-8 BOM (Byte Order Mark)—three invisible bytes at the start of the file. When you read it with open() using encoding='utf-8', the BOM gets prepended to your first column name.
```python
import csv

# This might give you '\ufeffname' as the first header!
with open('excel-export.csv', 'r', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row.keys())  # dict_keys with '\ufeffname' as the first key
```

The fix: use `encoding='utf-8-sig'`, which handles the BOM automatically:
```python
import csv

# utf-8-sig strips the BOM when reading
with open('excel-export.csv', 'r', newline='', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'])  # Works correctly now

# When writing for Excel, add the BOM
with open('for-excel.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'email'])
    writer.writerow(['Müller', 'mueller@example.de'])
```

Detecting Encoding
When you don't know the encoding:
```python
import csv
import chardet  # pip install chardet

# Detect encoding from a sample of the raw bytes
with open('mystery.csv', 'rb') as f:
    raw = f.read(10000)  # Read a sample
result = chardet.detect(raw)
encoding = result['encoding']

# Now read with the detected encoding
with open('mystery.csv', 'r', newline='', encoding=encoding) as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
```

Common Encodings You'll Hit
- `utf-8`: Modern standard
- `utf-8-sig`: UTF-8 with BOM (Excel on Windows)
- `cp1252`: Windows "ANSI" (Western European)
- `iso-8859-1` / `latin-1`: Older Western European
- `cp437`: DOS legacy
When in doubt, try utf-8-sig first, then fall back to latin-1 (which accepts any byte sequence).
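That fallback strategy can be sketched as a small helper (`read_rows` is a made-up name for illustration):

```python
import csv


def read_rows(path):
    """Try utf-8-sig first, then fall back to latin-1.

    latin-1 maps every byte to a character, so it never raises
    UnicodeDecodeError -- though accented characters may come out
    wrong if the file was actually in some other encoding.
    """
    for encoding in ('utf-8-sig', 'latin-1'):
        try:
            with open(path, 'r', newline='', encoding=encoding) as f:
                return list(csv.reader(f))
        except UnicodeDecodeError:
            continue
```

Note that decoding errors surface during iteration, which is why the `list()` call has to happen inside the `try` block.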
Streaming Large CSV Files
CSV readers are iterators—they don't load the entire file into memory. But there are patterns that accidentally kill this:
```python
import csv

# BAD: loads entire file into memory
with open('huge.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    all_rows = list(reader)  # Don't do this with large files

# GOOD: process row by row
with open('huge.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        process_row(row)  # Constant memory usage
```

Generator Pattern for Pipelines
Chain processing steps with generators:
```python
import csv

def read_csv(path):
    with open(path, 'r', newline='', encoding='utf-8-sig') as f:
        reader = csv.DictReader(f)
        yield from reader

def filter_active(rows):
    for row in rows:
        if row['status'] == 'active':
            yield row

def transform(rows):
    for row in rows:
        yield {
            'name': row['name'].upper(),
            'email': row['email'].lower(),
        }

# Process a 10GB file with constant memory
pipeline = transform(filter_active(read_csv('huge.csv')))

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'email'])
    writer.writeheader()
    for row in pipeline:
        writer.writerow(row)
```

Chunked Processing
When you need to batch rows (e.g., for database inserts):
```python
import csv
from itertools import islice

def chunked(iterable, size):
    """Yield chunks of `size` items."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, size)):
        yield chunk

with open('huge.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for batch in chunked(reader, 1000):
        insert_into_database(batch)  # Insert 1000 rows at a time
```

Common Gotchas
The newline Parameter
Always use newline='' when opening CSV files:
```python
import csv

# WRONG: can cause blank lines on Windows or corrupt data
with open('data.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['a', 'b', 'c'])

# RIGHT: let the csv module handle line endings
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['a', 'b', 'c'])
```

Why? The csv module does its own newline handling. Without `newline=''`, Python's default newline translation can double up `\r\n` on Windows, creating blank lines between rows.
Excel Compatibility
Excel has quirks:
```python
import csv

# For Excel compatibility, use the excel dialect (the default)
with open('for-excel.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(['name', 'amount'])
    writer.writerow(['Product', '€1,234.56'])
```

Watch out for:
- UTF-8 BOM: Excel needs `utf-8-sig` to recognize UTF-8 encoding
- Numbers with commas: Excel may misinterpret `"1,234"` as two columns
- Leading zeros: Excel strips them from "numbers" like `00123`
- Long numbers: Excel converts long numbers to scientific notation
For problematic fields, one workaround is to write the value as an Excel formula (`="…"`), which forces Excel to treat it as text:

```python
# Force text interpretation in Excel
row = ['Product', '="00123"', '="1234567890123456"']
```

Type Conversion
CSV is all strings. Convert explicitly:
```python
import csv

with open('data.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        age = int(row['age'])
        salary = float(row['salary'])
        active = row['active'].lower() in ('true', '1', 'yes')
```

Empty vs Missing Fields
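The distinction matters: an empty field comes back as the empty string, while a field that's absent from a short row gets filled with `restval`. A quick check with an in-memory file:

```python
import csv
import io

# Alice has an empty middle field; Bob's row is missing two fields entirely
data = "name,age,city\nAlice,,NYC\nBob\n"

reader = csv.DictReader(io.StringIO(data), restval='MISSING')
rows = list(reader)

print(rows[0])  # {'name': 'Alice', 'age': '', 'city': 'NYC'}
print(rows[1])  # {'name': 'Bob', 'age': 'MISSING', 'city': 'MISSING'}
```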
```python
import csv

# An empty field is the empty string, not None
row = ['Alice', '', 'NYC']  # Middle field is the empty string

# With DictReader, columns missing from a short row get restval
reader = csv.DictReader(f, restval='MISSING')
```

Quotes in Fields
The csv module handles this, but know the rules:
```python
import csv
import io

# Fields with quotes get double-quoted
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(['She said "hello"'])
print(output.getvalue())
# Output: "She said ""hello"""
```

Newlines in Fields
Fields can contain newlines, and they'll be quoted:
```python
import csv
import io

output = io.StringIO()
writer = csv.writer(output)
writer.writerow(['Line 1\nLine 2', 'normal'])
print(output.getvalue())
# Output: "Line 1
# Line 2",normal
```

This trips up shell tools that assume one row per line. The csv module handles it correctly.
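The reading side round-trips cleanly too: a quoted field spanning two physical lines still parses as a single logical row.

```python
import csv
import io

# One logical row whose first field spans two physical lines
data = '"Line 1\nLine 2",normal\n'
rows = list(csv.reader(io.StringIO(data)))

print(len(rows))  # 1
print(rows[0])    # ['Line 1\nLine 2', 'normal']
```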
Quick Reference
```python
import csv

# Basic reading
reader = csv.reader(file)
reader = csv.DictReader(file)
reader = csv.DictReader(file, fieldnames=['a', 'b', 'c'])

# Basic writing
writer = csv.writer(file)
writer = csv.DictWriter(file, fieldnames=['a', 'b', 'c'])
writer.writeheader()  # DictWriter only
writer.writerow(row)
writer.writerows(rows)

# Common options
csv.reader(f,
    delimiter=',',              # Field separator
    quotechar='"',              # Quote character
    quoting=csv.QUOTE_MINIMAL,  # When to quote
    skipinitialspace=True,      # Ignore space after delimiter
)

# File opening pattern
with open('file.csv', 'r', newline='', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
```

The csv module is one of Python's most useful standard library modules. Learn the gotchas once, and CSV handling becomes trivial.