I've processed a lot of CSV files in Python. Most of the time, it's straightforward. But every few months, I hit some edge case that costs me an hour of debugging. This post is the reference I wish I'd had—covering the basics plus all the gotchas that aren't obvious until they break your code.
The Basics: reader and writer
The csv module is simple at its core:
```python
import csv

# Reading
with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)  # Each row is a list of strings

# Writing
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'age', 'city'])
    writer.writerow(['Alice', '30', 'New York'])

    # Or write multiple rows at once
    writer.writerows([
        ['Bob', '25', 'Chicago'],
        ['Carol', '35', 'Boston'],
    ])
```

This handles quoting, escaping, and edge cases automatically. Don't be tempted to `split(',')` manually; you'll break on the first field that contains a comma.
DictReader and DictWriter: Work with Names, Not Indices
When you have headers, using column names beats counting indices:
```python
import csv

# Reading as dictionaries
with open('users.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # row is a dict: {'name': 'Alice', 'email': 'alice@example.com'}
        print(f"Emailing {row['name']} at {row['email']}")

# Writing from dictionaries
users = [
    {'name': 'Alice', 'email': 'alice@example.com'},
    {'name': 'Bob', 'email': 'bob@example.com'},
]

with open('users.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'email'])
    writer.writeheader()  # Writes the column names
    writer.writerows(users)
```

The advantage is maintainability. When someone adds a new column to the CSV, code that accesses `row['email']` still works. Code that accesses `row[1]` might silently break.
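Here's a quick sketch of that failure mode using in-memory files (the `signup_date` column is a made-up example of a schema change):

```python
import csv
import io

before = "name,email\nAlice,alice@example.com\n"
# Someone inserts a 'signup_date' column in the middle
after = "name,signup_date,email\nAlice,2024-01-01,alice@example.com\n"

# Name-based access survives the schema change
for data in (before, after):
    row = next(csv.DictReader(io.StringIO(data)))
    print(row['email'])  # alice@example.com both times

# Index-based access silently picks up the wrong column
header, record = list(csv.reader(io.StringIO(after)))
print(record[1])  # 2024-01-01, not the email
```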
Handling Missing or Extra Fields
DictWriter has options for fields that don't match:
```python
import csv

# Ignore extra keys in the dict
writer = csv.DictWriter(f, fieldnames=['name'], extrasaction='ignore')

# Missing keys get a default value
writer = csv.DictWriter(f, fieldnames=['name', 'email'], restval='N/A')
```

Delimiters and Quoting: Beyond Commas
Not everything is comma-separated. European Excel uses semicolons. Tab-separated (TSV) is common in data science. Pipes appear in legacy systems.
```python
import csv

# Tab-separated
with open('data.tsv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        print(row)

# Semicolon-separated (common in European locales)
reader = csv.reader(f, delimiter=';')

# Pipe-separated
reader = csv.reader(f, delimiter='|')
```

Quoting Modes
Control how fields get quoted:
```python
import csv

# QUOTE_MINIMAL (default): only quote when necessary
writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)

# QUOTE_ALL: quote every field
writer = csv.writer(f, quoting=csv.QUOTE_ALL)

# QUOTE_NONNUMERIC: quote strings, leave numbers bare
# Note: when reading, this converts unquoted fields to floats
writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)

# QUOTE_NONE: never quote (must set escapechar)
writer = csv.writer(f, quoting=csv.QUOTE_NONE, escapechar='\\')
```

Custom Dialects
If you're working with the same weird format repeatedly, register a dialect:
```python
import csv

csv.register_dialect('european',
    delimiter=';',
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL,
)

# Now use it by name
with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f, dialect='european')
```

Built-in dialects: `'excel'`, `'excel-tab'`, and `'unix'`.
The Encoding Gotcha: UTF-8 BOM
This one cost me hours. Excel on Windows saves CSV files with a UTF-8 BOM (Byte Order Mark)—three invisible bytes at the start of the file. When you read it with open() using encoding='utf-8', the BOM gets prepended to your first column name.
```python
import csv

# This might give you '\ufeffname' as the first header!
with open('excel-export.csv', 'r', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row.keys())  # dict_keys with '\ufeffname' as the first key
```

The fix: use `encoding='utf-8-sig'`, which handles the BOM automatically:
```python
import csv

# utf-8-sig strips the BOM when reading
with open('excel-export.csv', 'r', newline='', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'])  # Works correctly now

# When writing for Excel, add the BOM
with open('for-excel.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'email'])
    writer.writerow(['Müller', 'mueller@example.de'])
```

Detecting Encoding
When you don't know the encoding:
```python
import csv
import chardet  # pip install chardet

# Detect encoding from a sample of the raw bytes
with open('mystery.csv', 'rb') as f:
    raw = f.read(10000)  # Read a sample
result = chardet.detect(raw)
encoding = result['encoding']

# Now read with the detected encoding
with open('mystery.csv', 'r', newline='', encoding=encoding) as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
```

Common Encodings You'll Hit
- `utf-8`: Modern standard
- `utf-8-sig`: UTF-8 with BOM (Excel on Windows)
- `cp1252`: Windows "ANSI" (Western European)
- `iso-8859-1` / `latin-1`: Older Western European
- `cp437`: DOS legacy
When in doubt, try utf-8-sig first, then fall back to latin-1 (which accepts any byte sequence).
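That fallback strategy can be sketched as a small helper (`read_rows` is a made-up name for illustration):

```python
import csv


def read_rows(path):
    """Try utf-8-sig first, then fall back to latin-1.

    latin-1 maps every byte to a character, so it never raises
    UnicodeDecodeError -- though accented characters may come out
    wrong if the file was actually in some other encoding.
    """
    for encoding in ('utf-8-sig', 'latin-1'):
        try:
            with open(path, 'r', newline='', encoding=encoding) as f:
                return list(csv.reader(f))
        except UnicodeDecodeError:
            continue
```

Note that decoding errors surface during iteration, which is why the `list()` call has to happen inside the `try` block.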
Streaming Large CSV Files
CSV readers are iterators—they don't load the entire file into memory. But there are patterns that accidentally kill this:
```python
import csv

# BAD: loads entire file into memory
with open('huge.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    all_rows = list(reader)  # Don't do this with large files

# GOOD: process row by row
with open('huge.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        process_row(row)  # Constant memory usage
```

Generator Pattern for Pipelines
Chain processing steps with generators:
```python
import csv

def read_csv(path):
    with open(path, 'r', newline='', encoding='utf-8-sig') as f:
        reader = csv.DictReader(f)
        yield from reader

def filter_active(rows):
    for row in rows:
        if row['status'] == 'active':
            yield row

def transform(rows):
    for row in rows:
        yield {
            'name': row['name'].upper(),
            'email': row['email'].lower(),
        }

# Process a 10GB file with constant memory
pipeline = transform(filter_active(read_csv('huge.csv')))

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'email'])
    writer.writeheader()
    for row in pipeline:
        writer.writerow(row)
```

Chunked Processing
When you need to batch rows (e.g., for database inserts):
```python
import csv
from itertools import islice

def chunked(iterable, size):
    """Yield chunks of `size` items."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, size)):
        yield chunk

with open('huge.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for batch in chunked(reader, 1000):
        insert_into_database(batch)  # Insert 1000 rows at a time
```

Common Gotchas
The newline Parameter
Always use newline='' when opening CSV files:
```python
import csv

# WRONG: can cause blank lines on Windows or corrupt data
with open('data.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['a', 'b', 'c'])

# RIGHT: let the csv module handle line endings
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['a', 'b', 'c'])
```

Why? The csv module does its own newline handling. Without `newline=''`, Python's default newline translation can double up `\r\n` on Windows, creating blank lines between rows.
Excel Compatibility
Excel has quirks:
```python
import csv

# For Excel compatibility, use the excel dialect (the default)
with open('for-excel.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(['name', 'amount'])
    writer.writerow(['Product', '€1,234.56'])
```

Watch out for:
- UTF-8 BOM: Excel needs `utf-8-sig` to recognize UTF-8 encoding
- Numbers with commas: Excel may misinterpret `"1,234"` as two columns
- Leading zeros: Excel strips them from "numbers" like `00123`
- Long numbers: Excel converts long numbers to scientific notation
For problematic fields, one workaround is to write the value as an Excel formula (`="…"`), which forces Excel to treat it as text:

```python
# Force text interpretation in Excel
row = ['Product', '="00123"', '="1234567890123456"']
```

Type Conversion
CSV is all strings. Convert explicitly:
```python
import csv

with open('data.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        age = int(row['age'])
        salary = float(row['salary'])
        active = row['active'].lower() in ('true', '1', 'yes')
```

Empty vs Missing Fields
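The distinction matters: an empty field comes back as the empty string, while a field that's absent from a short row gets filled with `restval`. A quick check with an in-memory file:

```python
import csv
import io

# Alice has an empty middle field; Bob's row is missing two fields entirely
data = "name,age,city\nAlice,,NYC\nBob\n"

reader = csv.DictReader(io.StringIO(data), restval='MISSING')
rows = list(reader)

print(rows[0])  # {'name': 'Alice', 'age': '', 'city': 'NYC'}
print(rows[1])  # {'name': 'Bob', 'age': 'MISSING', 'city': 'MISSING'}
```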
```python
import csv

# An empty field is the empty string, not None
row = ['Alice', '', 'NYC']  # Middle field is the empty string

# With DictReader, columns missing from a short row get restval
reader = csv.DictReader(f, restval='MISSING')
```

Quotes in Fields
The csv module handles this, but know the rules:
```python
import csv
import io

# Fields with quotes get double-quoted
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(['She said "hello"'])
print(output.getvalue())
# Output: "She said ""hello"""
```

Newlines in Fields
Fields can contain newlines, and they'll be quoted:
```python
import csv
import io

output = io.StringIO()
writer = csv.writer(output)
writer.writerow(['Line 1\nLine 2', 'normal'])
print(output.getvalue())
# Output: "Line 1
# Line 2",normal
```

This trips up shell tools that assume one row per line. The csv module handles it correctly.
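The reading side round-trips cleanly too: a quoted field spanning two physical lines still parses as a single logical row.

```python
import csv
import io

# One logical row whose first field spans two physical lines
data = '"Line 1\nLine 2",normal\n'
rows = list(csv.reader(io.StringIO(data)))

print(len(rows))  # 1
print(rows[0])    # ['Line 1\nLine 2', 'normal']
```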
Quick Reference
```python
import csv

# Basic reading
reader = csv.reader(file)
reader = csv.DictReader(file)
reader = csv.DictReader(file, fieldnames=['a', 'b', 'c'])

# Basic writing
writer = csv.writer(file)
writer = csv.DictWriter(file, fieldnames=['a', 'b', 'c'])
writer.writeheader()  # DictWriter only
writer.writerow(row)
writer.writerows(rows)

# Common options
csv.reader(f,
    delimiter=',',              # Field separator
    quotechar='"',              # Quote character
    quoting=csv.QUOTE_MINIMAL,  # When to quote
    skipinitialspace=True,      # Ignore space after delimiter
)

# File opening pattern
with open('file.csv', 'r', newline='', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
```

The csv module is one of Python's most useful standard library modules. Learn the gotchas once, and CSV handling becomes trivial.