Agent Conductor Writeup

2026/05/11


Source Code!

https://github.com/england2/aws-demo/

Overview

The Agent Conductor is responsible for scheduling and managing agents in reaction to tickets and AWS platform incidents.

At a high level, here is how this project works:

Database Scheduling System

The conductor decides when to schedule an agent and which type of agent to run, e.g., an incident-response worker or a ticket worker.

So far, the conductor can perform two types of scheduling, ticket scheduling and incident scheduling, which spawn the two worker types above.

Here is how both work.

Ticket scheduling

Ticket scheduling is simpler than incident scheduling. Here, the database’s only job is to prevent repeated work.
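
To illustrate the "prevent repeating work" idea, here is a minimal sketch. It assumes the conductor records each ticket ID before spawning a worker; TicketStore and ClaimTicket are hypothetical names, and an in-memory map stands in for the real database table so the example runs on its own.

```go
package main

import "fmt"

// TicketStore is a hypothetical stand-in for the database table of tickets
// that already have a worker. A real version would query the database.
type TicketStore struct {
	scheduled map[string]bool
}

func NewTicketStore() *TicketStore {
	return &TicketStore{scheduled: make(map[string]bool)}
}

// ClaimTicket records the ticket and reports true only the first time a
// given ticket ID is seen, so a redelivered ticket is never worked twice.
func (s *TicketStore) ClaimTicket(ticketID string) bool {
	if s.scheduled[ticketID] {
		return false
	}
	s.scheduled[ticketID] = true
	return true
}

func main() {
	store := NewTicketStore()
	for _, id := range []string{"TICKET-1", "TICKET-2", "TICKET-1"} {
		if store.ClaimTicket(id) {
			fmt.Println("spawn ticket worker for", id)
		} else {
			fmt.Println("skip duplicate", id)
		}
	}
}
```

In a real database the check and the write would need to happen atomically (for example via a unique constraint on the ticket ID) so that two deliveries of the same ticket cannot both succeed.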

Incident scheduling

Incident scheduling relates to CloudWatch alarms, such as high CPU usage, elevated error rates, and cost anomalies. We’re trying to spawn agents in reaction to AWS platform events, so there is slightly more logic involved than in ticket scheduling.

This way, a single incident which produces many alarms will only spawn a single worker to investigate per account.
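
The one-worker-per-account behavior can be sketched roughly as follows. This assumes alarms are grouped by AWS account ID and that an alarm arriving while an incident is already open is chained onto it instead of spawning a new worker. Alarm, IncidentStore, and RecordAlarm are hypothetical names, and the map stands in for the database.

```go
package main

import "fmt"

// Alarm is a hypothetical, simplified view of a CloudWatch alarm notification.
type Alarm struct {
	AccountID string
	Name      string
}

// IncidentStore is a hypothetical stand-in for the table of open incidents,
// keyed by AWS account ID.
type IncidentStore struct {
	open map[string][]Alarm
}

func NewIncidentStore() *IncidentStore {
	return &IncidentStore{open: make(map[string][]Alarm)}
}

// RecordAlarm stores the alarm and reports true only when it opens a new
// incident for its account, i.e. when a worker should be spawned.
func (s *IncidentStore) RecordAlarm(a Alarm) bool {
	_, alreadyOpen := s.open[a.AccountID]
	s.open[a.AccountID] = append(s.open[a.AccountID], a)
	return !alreadyOpen
}

func main() {
	store := NewIncidentStore()
	alarms := []Alarm{
		{AccountID: "account-a", Name: "HighCPU"},
		{AccountID: "account-a", Name: "ErrorRate"},
		{AccountID: "account-b", Name: "CostAnomaly"},
	}
	for _, a := range alarms {
		if store.RecordAlarm(a) {
			fmt.Println("spawn incident worker for", a.AccountID)
		} else {
			fmt.Println("chain", a.Name, "onto open incident in", a.AccountID)
		}
	}
}
```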

Achieving Program Durability via Scheduling on a Database

The above describes simple scheduling logic that could easily be done in program memory without a database.

Of course, the issue with this is that program memory dies when the program dies.

The database is useful because it lets the conductor remember what has already been scheduled. If the conductor restarts, it can continue its chaining logic where it left off using only the database.

Improved Program Testability With Database Scheduling

Deciding when to spawn an agent, and what context, permissions, repos, prompts, etc. to provide it with, is (or will be) a very important part of the program.

It’s important to get scheduling right, as badly scheduled agents will likely produce many penalties:

Basing scheduling decisions on a database means that we can load many scenarios into different test databases and see exactly what scheduling and spawn-context decisions the scheduler would make in different situations.

For example, what scheduling decisions will be made if 100 alarms trigger in one hour across many accounts? Or if the system somehow gets sent 10 tickets per second? Will the conductor allocate agents to the correct repositories? Will it choose to solve tickets instead of reacting to the incident? Will it encounter an obscure logic error and not schedule anything at all? Will it spawn agents that are at risk of encountering merge issues?

These are all questions that can be easily answered when the correct test data is written, because the scheduler mostly looks at its internal database to schedule agents and not ephemeral, in-memory data structures.
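
As a rough illustration of scenario-based testing, the sketch below treats the scheduling decision as a pure function over database rows. Row and PlanSpawns are hypothetical names, and the rule shown (one worker per distinct ticket ID or alarm account) is a simplification assumed for the example, not the conductor's actual logic.

```go
package main

import "fmt"

// Row is a hypothetical record in a test scheduling database.
type Row struct {
	Kind string // "ticket" or "alarm"
	Key  string // ticket ID for tickets, AWS account ID for alarms
}

// PlanSpawns looks only at the rows it is given and returns how many workers
// of each kind would be spawned: one per distinct (Kind, Key) pair.
func PlanSpawns(rows []Row) map[string]int {
	seen := make(map[Row]bool)
	plan := make(map[string]int)
	for _, r := range rows {
		if !seen[r] {
			seen[r] = true
			plan[r.Kind]++
		}
	}
	return plan
}

func main() {
	// Scenario: 100 alarms spread over 4 accounts, plus one ticket delivered twice.
	var rows []Row
	for i := 0; i < 100; i++ {
		rows = append(rows, Row{Kind: "alarm", Key: fmt.Sprintf("account-%d", i%4)})
	}
	rows = append(rows,
		Row{Kind: "ticket", Key: "TICKET-1"},
		Row{Kind: "ticket", Key: "TICKET-1"},
	)

	plan := PlanSpawns(rows)
	fmt.Println("incident workers:", plan["alarm"])
	fmt.Println("ticket workers:", plan["ticket"])
}
```

Because the function depends only on its input rows, each scenario (an alarm storm, a ticket flood, replayed production data) becomes a fixture that can be loaded and checked without running any agents.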

Notably, one could use old, existing system data to test scheduling in addition to test data generated by scripts.

That said, the current scheduling logic is relatively simple, and it could still be tested even if the scheduler didn’t rely on a database. But scheduling based on a database makes testing easier, which matters if more complex scheduling behavior is needed later.

Note: Incident Scheduling vs Ticket Scheduling in Early Versions

The above penalties and potential complexity wouldn’t occur if the conductor only solved tickets instead of also reacting to alarms.

For this reason, I’m interested in getting a rock-solid base program with well-tested subsystems (e.g., git/platform permissions, multi-AWS account complexity sorted out, repo-claiming, etc.) before trying to implement potentially complex incident-based scheduling that may rely on OTel, CloudWatch, and other data to make decisions.

In short, the v1 production conductor would only respond to tickets. This would allow time to iron out the system’s fundamentals while it still has predictable ticket-only scheduling and the stakes/penalties are generally lower.

Finally, while the current scheduler may seem too simple to necessitate robust testing, note that the scheduling step could expand into these subsystems or features:

Conductor Deployment Process

We want to be able to update the conductor while it’s deployed without orphaning the workers it manages.

In other words, we can’t kill the server and restart it while there are still active jobs.

To solve this, the deployment pipeline and conductor do two things:

  1. After the pipeline activates, the conductor stops accepting new jobs.
  2. The conductor does not shut down until the agents it manages have finished their work.

We achieve this using a simple system described below.

Conductor Safe Shutdown Gate

The main logic units of the conductor all run in goroutines, which on their own do not prevent the program from returning from main and exiting.

The conductor process stays open by maintaining a shutdown gate, which is just a for loop with a few conditionals. The conductor watches a file called IS_CONDUCTOR_SHUTTING_DOWN, which starts as false. When we deploy a new version of the conductor, the deployment process flips this file to true. The conductor then stops scheduling new jobs, allowing messages to safely pool in the SQS queue.

After the file is set to true, the shutdown gate counts the number of active workers, and the program exits when it reaches zero. Before exiting, the conductor writes the file CONDUCTOR_READY_FOR_SAFE_SHUTDOWN, informing the deploy script that the shutdown gate has concluded and a new version can be deployed.
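
A minimal sketch of such a gate is shown below. It assumes the flag file literally contains the text "true" or "false", that the ready file's content does not matter, and that the active-worker count is available through a callback. isShuttingDown, shouldExit, and runShutdownGate are hypothetical names, not taken from the repository.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"sync/atomic"
	"time"
)

// isShuttingDown reports whether the flag file contains "true".
// Assumption: a missing or unreadable file means "keep running".
func isShuttingDown(path string) bool {
	data, err := os.ReadFile(path)
	if err != nil {
		return false
	}
	return strings.TrimSpace(string(data)) == "true"
}

// shouldExit is the gate condition: shutdown requested and no active workers.
func shouldExit(shuttingDown bool, activeWorkers int) bool {
	return shuttingDown && activeWorkers == 0
}

// runShutdownGate blocks until the gate condition holds, then writes the
// ready file so the deploy script knows a new version can be started.
func runShutdownGate(flagPath, readyPath string, activeWorkers func() int, poll time.Duration) error {
	for {
		if shouldExit(isShuttingDown(flagPath), activeWorkers()) {
			return os.WriteFile(readyPath, []byte("true\n"), 0o644)
		}
		time.Sleep(poll)
	}
}

func main() {
	dir, err := os.MkdirTemp("", "conductor")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	flagPath := filepath.Join(dir, "IS_CONDUCTOR_SHUTTING_DOWN")
	readyPath := filepath.Join(dir, "CONDUCTOR_READY_FOR_SAFE_SHUTDOWN")
	// Simulate the deploy pipeline flipping the flag to true.
	if err := os.WriteFile(flagPath, []byte("true"), 0o644); err != nil {
		panic(err)
	}

	// Simulate two workers that finish a short time apart.
	var active atomic.Int64
	active.Store(2)
	go func() {
		for active.Load() > 0 {
			time.Sleep(10 * time.Millisecond)
			active.Add(-1)
		}
	}()

	count := func() int { return int(active.Load()) }
	if err := runShutdownGate(flagPath, readyPath, count, 5*time.Millisecond); err != nil {
		panic(err)
	}
	fmt.Println("all workers finished; ready file written")
}
```

The scheduling goroutines would consult the same isShuttingDown check before accepting a new job, which is what lets messages pool in the queue during a deploy.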

Miscellaneous Thoughts, Improvements, Etc.

This section holds some thoughts not related to the primary writeup.


Agents Benefit From Existing DevOps Practices Such as Vagrant

Agents benefit from living in a runtime container that has all of the tools they need to work. For instance, agents won’t be able to do their job if they lack rustc while working on Rust code. Additionally, they benefit from having the exact same tool versions that devs have, as well as miscellaneous scripts.

Therefore, if each environment/repo the agent may alter uses tools like Vagrant to normalize developer workflows, the scheduler could consume a map relating AWS accounts to their associated dev containers.

Instead of running an agent in a general-purpose container, an extra step could drop the worker binary into an existing developer environment container, which would then serve as the agent’s execution environment.
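
One way the account-to-environment map might look is sketched below. The account IDs, image names, and the ExecutionImage function are all hypothetical placeholders, and falling back to a general-purpose image for unmapped accounts is an assumption.

```go
package main

import "fmt"

// generalPurposeImage is a hypothetical default used when an account has no
// dedicated developer environment.
const generalPurposeImage = "agent-general:latest"

// devContainers maps AWS account IDs to developer environment images.
// All entries are placeholders for illustration.
var devContainers = map[string]string{
	"111111111111": "payments-dev:1.4",
	"222222222222": "data-platform-dev:2.0",
}

// ExecutionImage picks the environment an agent should run in for an account.
func ExecutionImage(accountID string) string {
	if image, ok := devContainers[accountID]; ok {
		return image
	}
	return generalPurposeImage
}

func main() {
	for _, account := range []string{"111111111111", "999999999999"} {
		fmt.Println(account, "->", ExecutionImage(account))
	}
}
```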


Areas of Improvement + Possible Features (Non-exhaustive)