I've been building Owen, a task automation system that coordinates AI coding agents. It started with zero tests. Now it has 360. Here's what I learned along the way.
The Numbers
- 360 tests across 17 test files
- 5,512 lines of test code
- ~0.1 seconds to collect all tests
- 8 packages in the monorepo
Not bragging—just context. More tests isn't always better, but coverage gives me confidence to ship fast.
What Actually Helped
1. Fixtures Over Setup Methods
Early on, I had setUp() methods everywhere. They got messy fast. Pytest fixtures are better:
```python
@pytest.fixture
def mock_gmail():
    with patch('owen.integrations.gmail.GmailClient') as mock:
        mock.return_value.list_messages.return_value = []
        yield mock
```

Now I can compose fixtures. A test that needs Gmail and GitHub just lists both:
```python
def test_full_workflow(mock_gmail, mock_github, tmp_path):
    # Both mocks are ready; tmp_path gives me a scratch directory
    ...
```

2. Mark the Slow Tests
Some tests hit real APIs. I mark them:

```python
@pytest.mark.live
def test_live_current_weather():
    result = weather.get_current("New York")
    assert "temperature" in result
```

Then I can skip them in CI:

```shell
pytest -m "not live"
```

Local development runs the full suite. CI runs the fast subset. Everyone's happy.
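To keep pytest from warning about an unknown marker, the `live` marker should also be registered. A minimal sketch, assuming a `pyproject.toml`-based pytest config (Owen's actual config file may differ):

```toml
[tool.pytest.ini_options]
markers = [
    "live: hits real external APIs; deselect with -m 'not live'",
]
```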
3. Test the Boundaries, Not the Glue
My decision engine has a lot of logic. I don't test every internal function. I test:
- Inputs: Does it parse config correctly?
- Outputs: Does it produce the right action?
- Edge cases: What happens with empty state?
The internal wiring changes. The contract doesn't.
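As a sketch of that contract-level style (the `Engine`/`Decision` shapes here are illustrative stand-ins, not Owen's real API):

```python
from dataclasses import dataclass

@dataclass
class Decision:
    action_id: str
    reason: str

class Engine:
    """Hypothetical decision engine: tests pin the contract, not the wiring."""

    def __init__(self, state):
        self.state = state

    def decide(self):
        # Internal logic is free to change; only input -> output is promised
        if not self.state:
            return Decision(action_id="noop", reason="empty_state")
        return Decision(action_id="check_email", reason="cooldown_expired")

def test_empty_state_yields_noop():
    # Edge case: empty state must still produce a well-formed decision
    assert Engine({}).decide().action_id == "noop"
```

The internal helpers inside `decide` can be renamed, split, or inlined without touching this test.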
4. Parametrize When You See Patterns
This was repetitive:

```python
def test_parse_duration_30m():
    assert parse_duration("30m") == timedelta(minutes=30)

def test_parse_duration_2h():
    assert parse_duration("2h") == timedelta(hours=2)
```

This is better:
```python
@pytest.mark.parametrize("input,expected", [
    ("30m", timedelta(minutes=30)),
    ("2h", timedelta(hours=2)),
    ("1d", timedelta(days=1)),
])
def test_parse_duration(input, expected):
    assert parse_duration(input) == expected
```

Same coverage, less code, easier to extend.
5. One Assert Per Concept
Not literally one assert statement—one concept being tested. This is fine:

```python
def test_action_result():
    result = engine.decide()
    assert result.action_id == "check_email"
    assert result.reason == "cooldown_expired"
```

Both asserts verify the same concept: "the engine picked the right action."
This is bad:

```python
def test_everything():
    result = engine.decide()
    assert result.action_id == "check_email"
    assert config.load() is not None  # unrelated
    assert len(history) > 0  # also unrelated
```

If this fails, I don't know which concept broke.
What I Got Wrong
Mocking Too Deep
I mocked internal methods instead of external boundaries. When I refactored, tests broke even though behavior was unchanged. Now I mock at the integration layer—HTTP clients, file systems, external APIs.
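A minimal sketch of what boundary-level mocking looks like, using a hypothetical `WeatherService` where the injected HTTP client is the only external dependency:

```python
from unittest.mock import MagicMock

class WeatherService:
    """Hypothetical service; the HTTP client is the external boundary."""

    def __init__(self, http_client):
        self.http = http_client  # injected, so tests can swap in a fake

    def current(self, city):
        # Everything below the boundary is real code, exercised by the test
        payload = self.http.get_json(f"/weather?q={city}")
        return {"temperature": payload["temp_c"]}

def test_current_maps_fields():
    fake_http = MagicMock()
    fake_http.get_json.return_value = {"temp_c": 21}
    svc = WeatherService(fake_http)
    assert svc.current("New York") == {"temperature": 21}
```

Refactoring anything inside `current` leaves this test green, because only the boundary is faked.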
Not Running Tests Before Push
See my post on CI catching what I missed. Local test runs aren't optional.
Testing Implementation Details
I wrote tests like "check that _internal_helper() is called with X." Then I renamed the helper. Test broke. Behavior was fine.
Test what the code does, not how it does it.
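A small illustration of the difference (the `slugify` helper is hypothetical, not from Owen):

```python
def _normalize(text):
    # Internal helper: an implementation detail, free to be renamed or inlined
    return text.strip().lower()

def slugify(title):
    # Public behavior: the thing worth pinning in a test
    return _normalize(title).replace(" ", "-")

# Brittle (don't): patch _normalize and assert it was called with X.
# Robust (do): assert only on observable output.
def test_slugify_behavior():
    assert slugify("  Hello World  ") == "hello-world"
```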
The Payoff
I ship multiple times a day. Most pushes don't break anything. When they do, I usually know within seconds—not minutes, not hours.
360 tests sound like a lot. Spread across 8 packages, it's about 45 per package. That's not crazy. That's just... covering the important stuff.
Write the tests. Run the tests. Trust the tests.