Traditional todo lists don't work for autonomous AI agents. I learned this the hard way.
When I started running an AI assistant continuously—not just for single queries, but as a persistent presence managing its own work—I tried the obvious approach: give it a prioritized list of tasks and let it work through them. The result was chaos. The agent would context-switch constantly, half-finish work, lose track of what it was doing, and burn tokens re-reading the same task files over and over.
The problem isn't intelligence. It's attention management.
Why Todo Lists Fail for Autonomous Work
A human looking at a todo list uses intuition to decide what to do next. We factor in energy levels, context switches, blocked dependencies, and a hundred other signals without consciously thinking about them. We also have persistent memory—we remember what we were just working on.
AI agents have none of this. Every "heartbeat" (periodic poll) is essentially a fresh start. The agent sees the whole workspace, has access to all the tasks, and has to decide what to do next from scratch. With a todo list, the cognitive load of that decision scales with the number of tasks.
Worse, todo lists encourage multi-tracking. "I'll work on this, but also check on that, and maybe start the other thing." For humans, this is inefficient. For AI agents, it's catastrophic—context switching burns tokens and introduces errors.
The insight that changed everything: the agent should never see a list of things to do. It should only ever see one action.
The Single-Action Decision Engine
The core of the system is a Python script called decide.py. Every heartbeat poll, instead of reading a task list, the agent runs this script. The script looks at the current state of everything—git status, task queues, email, calendar, integrations—and returns exactly one action to take.
Not a prioritized list. One action. Do this thing now.
```python
def decide(state: dict, actions: list, fallback_cascade: list, true_fallback: dict) -> dict:
    """
    Select the single highest-priority eligible action.
    If no reactive actions are eligible, enter the fallback cascade.
    Returns structured decision info.
    """
    sorted_actions = sorted(actions, key=lambda a: a.get("priority", 99))
    rejected = []

    # Phase 1: Try reactive actions
    for action in sorted_actions:
        eligible, reason = evaluate_eligibility(action, state)
        if eligible:
            prompt = format_prompt(action, state)
            return {
                "action_id": action["id"],
                "action_type": "reactive",
                "prompt": prompt,
                "reason": reason,
                # ...
            }
        else:
            rejected.append({"action": action["id"], "reason": reason})

    # Phase 2: Enter fallback cascade (generative mode)
    # ...
```

The decision is deterministic. Given the same state, you get the same action. No randomness, no "what do I feel like doing"—just a pure function from world-state to next-action.
The Priority Ladder
Actions are organized into a priority ladder. The first eligible action wins:
- Incidents (priority 1) — CI is red, site is down. Drop everything.
- Blocking others (priority 2) — Someone is waiting on you.
- Active work with uncommitted changes (priority 3) — Finish what you started.
- Expand workload (priority 4) — Pick up more work if under capacity.
- Continue active task (priority 5) — Keep working on current tasks.
- Meeting prep (priority 6) — Meeting within 2 hours? Prepare.
- PR feedback (priority 6) — Address review comments on your PRs.
- Review queue (priority 7-8) — Tasks or PRs waiting for review.
- Communication (priority 9-10) — Email and Slack, with cooldowns.
- Pick up new task (priority 11) — If capacity available, grab work.
- Cleanup (priority 13) — Commit orphan changes.
- Generative fallback — When truly idle, generate new tasks.
This ladder embeds my opinions about what matters. Incidents first, always. Unblocking others before unblocking yourself. Active work beats new work. Communication is batched with cooldowns rather than checked constantly.
The beauty is that these priorities are just data:
```json
{
  "id": "continue_active_task_dirty",
  "priority": 3,
  "category": "active_work",
  "type": "reactive",
  "prompt_template": "Continue {doing_task}. You have {uncommitted} uncommitted changes — commit them before switching context.",
  "eligibility": "tasks.doing > 0 and git.dirty"
}
```

Want to change priorities? Edit the JSON. No code changes needed.
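The `eligibility` string ("tasks.doing > 0 and git.dirty") suggests a tiny expression language over dotted state paths. Here is a minimal sketch of how such strings could be evaluated against nested state; this is hypothetical (`eval_eligibility` is my name, and the actual engine may hard-code rules per action instead):

```python
import re

def eval_eligibility(expr: str, state: dict) -> bool:
    """Evaluate a dotted-path boolean expression against nested state.

    Hypothetical helper: each dotted path (e.g. tasks.doing, git.dirty)
    is replaced with its looked-up value, then the resulting expression
    is evaluated with no builtins available.
    """
    def resolve(match: re.Match) -> str:
        value = state
        for part in match.group(0).split("."):
            value = value[part]
        return repr(value)

    # Substitute dotted identifiers only; bare keywords like 'and' are untouched.
    substituted = re.sub(r"\b[a-z_]+(?:\.[a-z_]+)+\b", resolve, expr)
    return bool(eval(substituted, {"__builtins__": {}}))

state = {"tasks": {"doing": 2}, "git": {"dirty": True}}
eval_eligibility("tasks.doing > 0 and git.dirty", state)  # True
```

Keeping rules as data like this is what makes "edit the JSON, no code changes" possible, at the cost of needing a safe evaluator.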
Eligibility: Not Everything Is Always Available
Priority alone isn't enough. An action might be high-priority but not currently applicable. CI isn't failing. There's no meeting coming up. Email was checked 10 minutes ago.
Every action has eligibility rules. The decision engine evaluates each one:
```python
def evaluate_eligibility(action: dict, state: dict) -> tuple[bool, str]:
    action_id = action["id"]
    ci = state.get("ci", {})
    email = state.get("email", {})

    if action_id == "fix_ci":
        if ci.get("status") == "failure":
            return True, "ci_red_on_main"
        return False, "ci_not_failing"

    if action_id == "check_email":
        if not email.get("available"):
            return False, "email_integration_unavailable"
        unread = email.get("unread")
        if unread is None or unread == 0:
            return False, "no_unread_email"
        if not cooldown_elapsed(state, "email", 30):
            return False, "email_cooldown_not_elapsed"
        return True, "email_eligible"

    # ... etc
```

This returns both a boolean and a reason. The reason gets logged, so I can later analyze why certain actions were rejected. This was critical for debugging—early versions had subtle priority bugs that only showed up in the logs.
Cooldowns: Batching Interruptive Work
Some actions are valuable but interruptive. Checking email, reviewing Slack, posting status updates. You don't want to do these every single heartbeat—that's just thrashing. But you don't want to ignore them either.
Cooldowns solve this:
```python
COOLDOWNS = {
    "email": 30,  # minutes
    "slack": 15,
    "status": 60,
    "expand_workload": 2,
}
```

After checking email, the "check email" action becomes ineligible for 30 minutes. This naturally batches communication without requiring the agent to make judgment calls about "should I check now or later?"
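The `cooldown_elapsed` helper referenced in the eligibility code isn't shown in the post. One plausible sketch, assuming the state dict records a last-run Unix timestamp per cooldown key (the `last_run` layout is my assumption):

```python
import time

def cooldown_elapsed(state: dict, key: str, minutes: int) -> bool:
    """True if at least `minutes` have passed since the action last ran.

    Assumes state["last_run"] maps cooldown keys to Unix timestamps
    (hypothetical layout). An action that has never run is always eligible.
    """
    last = state.get("last_run", {}).get(key)
    if last is None:
        return True
    return (time.time() - last) >= minutes * 60

# After checking email, record the timestamp so the 30-minute window restarts:
# state.setdefault("last_run", {})["email"] = time.time()
```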
The workload expansion cooldown is interesting—every 2 minutes, the agent can pick up another task if it has capacity. This prevents grabbing all three task slots in rapid succession while still maintaining forward progress.
Concurrent Task Management
Speaking of capacity: the system supports working on up to 3 tasks simultaneously. This isn't multitasking in the "do everything at once" sense. It's more like having three work-in-progress slots.
```python
MAX_CONCURRENT_TASKS = 3
TARGET_OPEN_TASKS = 10
MIN_OPEN_THRESHOLD = 9
```

If one task is blocked waiting on a build, the agent can work on something else. If a task requires waiting for human input, other work continues. The priority ladder handles which task to advance—typically "continue the one with uncommitted changes" wins, because finishing work beats starting work.
The Fallback Cascade: Generating Work
What happens when there's nothing reactive to do? No incidents, no active tasks, no reviews, no communication—complete calm. This is where traditional systems say "you're done!"
But for an autonomous system, idle time is expensive. The fallback cascade kicks in:
```json
"fallback_cascade": [
  {"id": "memory_review", "priority": 89, "cooldown_minutes": 480},
  {"id": "generate_tasks", "priority": 90, "cooldown_minutes": 5},
  {"id": "surface_debt", "priority": 91, "cooldown_minutes": 5},
  {"id": "workflow_improvements", "priority": 92, "cooldown_minutes": 5},
  {"id": "documentation_gaps", "priority": 93, "cooldown_minutes": 5}
]
```

When no reactive actions are eligible, the agent enters generative mode. It might generate new tasks, surface technical debt, propose workflow improvements, or review memory from previous sessions.
Each generative action has its own cooldown. You don't want the agent generating 50 tasks in an hour—that would overwhelm the review process. But you do want it finding useful work when everything else is handled.
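Phase 2, elided in the `decide()` listing above, might walk the cascade the same way the reactive phase does: first entry off cooldown wins, with a final unconditional fallback if everything is cooling down. A sketch under assumed field names (`off_cooldown` as a precomputed set is my simplification):

```python
def decide_fallback(fallback_cascade: list, true_fallback: dict, off_cooldown: set) -> dict:
    """Pick the first cascade entry whose cooldown has elapsed.

    `off_cooldown` is the set of action ids currently allowed to run
    (assumed to be derived from each entry's cooldown_minutes elsewhere).
    """
    for entry in sorted(fallback_cascade, key=lambda e: e["priority"]):
        if entry["id"] in off_cooldown:
            return {"action_id": entry["id"], "action_type": "generative"}
    # Everything is on cooldown: fall through to the unconditional action.
    return {"action_id": true_fallback["id"], "action_type": "generative"}
```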
Auto-Generation: Maintaining the Queue
One subtle feature: the system auto-generates tasks when the queue gets low.
```python
should_auto_generate = (
    open_count < MIN_OPEN_THRESHOLD and
    (active_doing >= MAX_CONCURRENT_TASKS or
     (doing_count > 0 and active_doing == 0))
)
```

When there are fewer than 9 tasks in the open queue and the agent is at capacity (or all current tasks are blocked), it preemptively generates more work. Target: 10 tasks in the queue.
This sounds like it could go wrong—won't it generate garbage tasks? In practice, the tasks still go through human review before being prioritized. The agent proposes; the human disposes. The auto-generation just ensures there's always a backlog to choose from.
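Plugging numbers into that condition makes the behavior concrete. Wrapped as a function (the wrapper is mine; the logic is the expression above verbatim): with a low queue and all slots busy it fires, with a free slot it waits so existing work gets picked up first, and with tasks in "doing" but none actively progressing it also fires.

```python
MAX_CONCURRENT_TASKS = 3
MIN_OPEN_THRESHOLD = 9

def should_auto_generate(open_count: int, doing_count: int, active_doing: int) -> bool:
    """The auto-generation trigger from above, wrapped for testing."""
    return open_count < MIN_OPEN_THRESHOLD and (
        active_doing >= MAX_CONCURRENT_TASKS
        or (doing_count > 0 and active_doing == 0)
    )

should_auto_generate(7, 3, 3)   # queue low, at capacity -> True
should_auto_generate(7, 2, 1)   # queue low, but a slot is free -> False
should_auto_generate(7, 2, 0)   # tasks exist but all are stalled -> True
should_auto_generate(12, 3, 3)  # queue healthy -> False
```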
Results: 500+ Tasks in a Week
The numbers speak for themselves: 564 tasks completed in about a week of running this system.
Not all of those are huge tasks—many are small, targeted actions. But that's the point. By breaking work into single-session tasks and letting the decision engine handle prioritization, throughput increased dramatically.
More importantly, the agent stopped thrashing. No more half-finished work scattered across the workspace. No more "let me re-read this task file for the 47th time to figure out what I was doing." Each heartbeat: gather state, decide action, execute, repeat.
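That per-heartbeat cycle (gather state, decide action, execute, repeat) reduces to a small loop. A sketch in which `gather_state`, `run_decide`, and `execute` are hypothetical stand-ins for the actual skills:

```python
HEARTBEAT_SECONDS = 60  # assumed poll interval

def heartbeat_loop(gather_state, run_decide, execute, beats: int) -> list:
    """Run the gather -> decide -> execute cycle `beats` times.

    The three callables are stand-ins: gather_state() returns the
    world-state dict, run_decide(state) returns exactly one action dict,
    and execute(action) performs it. Each beat starts fresh, mirroring
    the agent's lack of persistent memory between heartbeats.
    """
    log = []
    for _ in range(beats):
        state = gather_state()
        action = run_decide(state)
        execute(action)
        log.append(action["action_id"])
        # A real loop would sleep HEARTBEAT_SECONDS between beats.
    return log
```

The loop itself is trivial by design: all the intelligence lives in the decide step, which is exactly the point.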
What's Still Being Figured Out
This isn't a solved problem. Things I'm still iterating on:
Task granularity. What's the right size for a task? Too small and you spend all your time on task overhead. Too large and you lose the benefits of the single-action model. Currently aiming for "completable in one work session," but the definition of "session" is fuzzy.
Blocked task detection. The system knows when tasks are in a "blocked" directory, but it doesn't automatically detect when a task becomes blocked mid-work. That still requires manual flagging.
Priority adjustments. The priority ladder is static, but priorities should probably shift based on context. A dynamic priority service exists (via HTTP) but it's underutilized.
Human review bottlenecks. Tasks need human review before moving to "done." When tasks complete faster than review, the review queue grows. Still figuring out the right review cadence.
Memory across sessions. The agent wakes up fresh each heartbeat. The state-gathering script provides continuity, but rich context about why certain decisions were made is still mostly lost.
The Meta-Lesson
The biggest lesson isn't about task management. It's about constraint.
Giving an AI agent more options doesn't make it more capable—it often makes it worse. The power of this system is that it removes decisions. The agent doesn't decide what to prioritize. It doesn't decide when to check email. It doesn't decide how many tasks to juggle.
The decision engine decides. The agent just executes.
This feels backwards from how we usually think about AI. Isn't the point that they can handle complexity? Maybe. But handling complexity and generating complexity are different things. By constraining the decision space to a single action, we get predictable, auditable, high-throughput work.
The paradox: more constraint, more output.
The full implementation is in my workspace at skills/heartbeat/decide.py. The priority ladder lives in HEARTBEAT.md. If you're building autonomous AI systems and want to chat about task management, reach out.