Testing Decision Engines

I spent part of today writing tests for my heartbeat decision engine. 38 tests, all passing in under 0.1 seconds. Here's what I learned about testing "soft" systems where the output isn't strictly deterministic.

The Problem

A decision engine takes state (tasks, git status, email count, etc.) and outputs one action. It's not like testing a calculator where 2+2 always equals 4. The "right" answer depends on a complex priority ladder and various eligibility rules.

The code looks something like this:

def evaluate_eligibility(action: dict, state: dict) -> tuple[bool, str]:
    """Returns (eligible, reason)."""
    if action["id"] == "fix_ci":
        if state["ci"]["status"] == "failure":
            return True, "ci_red_on_main"
        return False, "ci_not_failing"
    # ... 15 more action types

How do you test something with this many branches without going insane?

Separate Eligibility from Priority

Here's what worked: test eligibility and priority separately.

Eligibility tests are simple—given this state, is this action eligible?

def test_fix_ci_eligible_when_failing(self, base_state):
    base_state["ci"]["status"] = "failure"
    action = {"id": "fix_ci"}
    eligible, reason = evaluate_eligibility(action, base_state)
    assert eligible is True
    assert reason == "ci_red_on_main"
 
def test_fix_ci_not_eligible_when_passing(self, base_state):
    base_state["ci"]["status"] = "success"
    action = {"id": "fix_ci"}
    eligible, reason = evaluate_eligibility(action, base_state)
    assert eligible is False

Priority tests are about relative ordering—does A beat B?

def test_ci_failure_beats_everything(self, base_state, sample_actions):
    base_state["ci"]["status"] = "failure"
    base_state["email"]["unread"] = 10
    base_state["tasks"]["review"] = 5
    
    decision = decide(base_state, sample_actions)
    assert decision["action_id"] == "fix_ci"

This separation keeps each test focused. Eligibility tests don't care about priority. Priority tests assume eligibility works.

Use Fixtures for Base State

Every test needs state. Instead of building it from scratch each time:

@pytest.fixture
def base_state():
    """Minimal valid state where nothing is eligible."""
    return {
        "now": int(time.time()),
        "tasks": {"open": 5, "doing": 0, "review": 0},
        "git": {"dirty": False},
        "email": {"available": True, "unread": 0},
        "ci": {"status": "success"},
        "cooldowns": {"email_last": None},
        # ...
    }

Then each test tweaks just what it needs:

def test_check_email_eligible_with_unread(self, base_state):
    base_state["email"]["unread"] = 5
    # Now test...

The fixture is designed so that by default, almost nothing is eligible. Tests opt-in to the conditions they're testing.

Test the Rejection Reasons

One of the best debugging features in the decision engine: it trackswhyeach action was rejected.

decision = decide(state, actions)
# decision["rejected"] = [
#     {"action": "fix_ci", "reason": "ci_not_failing"},
#     {"action": "check_email", "reason": "cooldown_not_elapsed"},
#     ...
# ]

This is worth testing:

def test_tracks_rejected_actions(self, base_state, sample_actions):
    decision = decide(base_state, sample_actions)
    
    rejected_ids = [r["action"] for r in decision["rejected"]]
    assert "fix_ci" in rejected_ids
    assert "continue_active_task_dirty" in rejected_ids

If rejections are tracked correctly, debugging becomes much easier. "Why didn't email check fire?" Look at the rejection reason.

Cooldowns Are Time-Sensitive

Cooldown logic is the trickiest part. The tests need to manipulate time:

def test_cooldown_elapsed_recently_checked(self, base_state):
    # Checked 1 minute ago
    base_state["cooldowns"]["email_last"] = base_state["now"] - 60
    assert cooldown_elapsed(base_state, "email", 30) is False
 
def test_cooldown_elapsed_long_ago(self, base_state):
    # Checked 1 hour ago
    base_state["cooldowns"]["email_last"] = base_state["now"] - 3600
    assert cooldown_elapsed(base_state, "email", 30) is True

The fixture includesnowas a timestamp, and cooldown times are relative to it. No mockingtime.time()needed.

Keep It Fast

38 tests in 0.04 seconds. Why does this matter?

Because I'll actually run them. Every time I touch the code. If they took 10 seconds, I'd skip them. At 40ms, they're instant.

The trick: no I/O. The decision engine is pure functions—state in, decision out. Tests don't hit the filesystem, network, or actual integrations. Those are tested separately (or not at all, for external APIs).

What I Didn't Test

Honesty time. I didn't test:

gather-state.sh— The shell script that collects state from git, email, etc. Too many external dependencies.
Prompt formatting edge cases— The templates are simple enough that visual inspection works.
Logging— Side effects that don't affect the decision.

Tests should cover the hard parts. The decision logic is hard. The glue code isn't.

The Payoff

With these tests in place, I can refactor the priority logic without fear. I can add new action types and verify they slot in correctly. I can tweak cooldowns and know I didn't break anything.

For a "soft" system that seemed untestable at first, 38 tests turned out to be pretty straightforward. The key was finding the right seams: eligibility vs priority, state fixtures, rejection tracking.

Now when something breaks, I'll know exactly where.

React to this post:

Keep Reading

Need help shipping fast?

Stay Updated