How We Built a Continuous Integration System Using AI Coding Agents


Date: 2026-03-18 Platform: Nostr (long-form note) Series: Introducing BAO Markets Author: Mr BAO


What This Is

A development system where AI coding agents run in parallel on a VPS, write code on isolated branches, review each other’s work, and report status through structured meetings. A human decides what to build and when to ship. The agents handle the repetitive execution.

Nothing here is autonomous in the way that word gets used in AI marketing. A person configured every process, wrote every prompt template, chose every tool, and monitors the output. The system amplifies one developer’s capacity. It does not replace the developer.

The Setup

One Hetzner VPS (16GB RAM). Eight tmux sessions, each running Claude Code in pipe mode (claude -p --model sonnet). A Node.js scheduler checks system health every 5 minutes and assigns tasks from a JSON queue. PM2 manages all processes — someone had to write the ecosystem config, set restart policies, configure log rotation. That someone was a human.
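An ecosystem config along these lines is what PM2 needs — a minimal sketch; the process names, script paths, and thresholds are illustrative, not the factory's actual config:

```javascript
// ecosystem.config.js — hypothetical example of the kind of config a human writes once.
module.exports = {
  apps: [
    {
      name: "scheduler",              // 5-minute task-assignment loop
      script: "scheduler.js",
      autorestart: true,
      max_restarts: 10,               // stop a flapping process instead of looping forever
      restart_delay: 30000,
      max_memory_restart: "512M",     // recycle on memory creep
      error_file: "logs/scheduler-err.log",
      out_file: "logs/scheduler-out.log",
    },
    {
      name: "moltbot",                // Telegram interface
      script: "moltbot.js",
      autorestart: true,
      restart_delay: 10000,
    },
  ],
};
```

Restart policies, log paths, memory limits: none of it is exotic, but all of it is deliberate.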

The agents use Sonnet 4.6. Not the most capable model. Fast enough for focused, scoped tasks. Cheap enough to run 24/7.

How Code Gets Written

Each coding session runs a loop called the Ralph Loop. Every ~15 minutes:

  1. Read — Check the project context file and the shared experiment log. See what other sessions did recently. Avoid duplicating work.
  2. Analyze — Run tsc --noEmit and vitest. Record the baseline error and failure counts.
  3. Plan — Pick one small task. Not a rewrite. One function, one bug, one test.
  4. Hack — Create a git branch (brl/<session>/<timestamp>). Make the change. Commit.
  5. Evaluate — Run the compiler and tests again. If errors increased or tests broke, delete the branch. If everything passes, push the branch.

Pushed branches go into a review queue. They do not merge to main.
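The core of the Evaluate step can be sketched as two small functions — the branch-naming scheme from step 4 and the keep/discard rule from step 5. This is a sketch: the real loop shells out to git, tsc, and vitest, and "didn't get worse on either count" is my reading of the push rule.

```typescript
// One Ralph Loop cycle's keep/discard decision, with command execution stubbed out.

interface Baseline {
  tscErrors: number;    // from `tsc --noEmit`
  failingTests: number; // from `vitest`
}

// brl/<session>/<timestamp>, as described above.
function branchName(session: string, timestamp: number): string {
  return `brl/${session}/${timestamp}`;
}

// Push only if TypeScript errors did not increase and no tests broke
// relative to the baseline recorded in the Analyze step.
function shouldPush(before: Baseline, after: Baseline): boolean {
  return after.tscErrors <= before.tscErrors && after.failingTests <= before.failingTests;
}
```

If `shouldPush` returns false, the branch is deleted and the cycle's work is simply gone — which is the point.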

How Code Gets Reviewed

Every 2 hours, a separate process reads the review queue and examines each pending branch. It fetches the diff against main and evaluates it against specific criteria: does it introduce bugs? Does it increase TypeScript errors? Does it follow project conventions? Is it focused on one thing?

Approved branches get merged to main. Rejected branches get deleted. The decision and reasoning are logged.

No code reaches main without review. This is not optional. The review process is a cron job — it runs whether anyone is watching or not.
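The verdict logic reduces to a checklist over the criteria above — a sketch with illustrative field names; the real reviewer derives these answers from the diff via a model call.

```typescript
// One pending branch's review outcome, assuming the criteria have already
// been evaluated against the diff.

interface ReviewResult {
  branch: string;
  introducesBugs: boolean;
  tscErrorDelta: number;       // change in TypeScript error count vs main
  followsConventions: boolean;
  singleFocus: boolean;        // one function, one bug, one test
}

type Verdict = "approved" | "rejected";

function review(r: ReviewResult): Verdict {
  const ok =
    !r.introducesBugs &&
    r.tscErrorDelta <= 0 &&
    r.followsConventions &&
    r.singleFocus;
  return ok ? "approved" : "rejected";
  // Approved → merge to main. Rejected → delete branch. Both → log verdict + reasoning.
}
```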

How Agents Coordinate

Every 2 hours, all 8 agents participate in a structured meeting. Three rounds.

Round 1: Each agent reports what it did, what’s stuck, and what it needs. The reports are based on actual git history and terminal output — not generated summaries of imagined work.

Round 2: Each agent reads the other reports and responds. “I can help with that.” “I’m also hitting that error.” “We should not both fix the same file.” This round catches coordination problems before they become merge conflicts.

Round 3: Each agent states one specific task it will work on next and one proposal that would benefit the whole team. The CTO agent summarizes priorities and blockers.

The meeting takes about 5 minutes. 25 API calls. About a dollar. The transcript is saved as a markdown file and a summary goes to Telegram.
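The arithmetic works out because every round fans out to every agent, plus one closing summary: 8 agents × 3 rounds + 1 CTO call = 25. A sketch of the orchestration, with the API call reduced to a plain function for clarity (the real system makes asynchronous model calls; prompts here are paraphrased):

```typescript
// Three-round standup. Each round's prompt includes the transcript so far,
// so Round 2 agents can respond to Round 1 reports, and so on.

function standup(
  agents: string[],
  ask: (agent: string, prompt: string) => string, // a model API call in the real system
): string[] {
  const transcript: string[] = [];
  const rounds = [
    "Report: what you did, what is stuck, what you need.",
    "Read the other reports and respond.",
    "State your next task and one team-wide proposal.",
  ];
  for (const prompt of rounds) {
    const replies = agents.map(
      (a) => `${a}: ${ask(a, `${prompt}\n\n${transcript.join("\n")}`)}`,
    );
    transcript.push(...replies);
  }
  // One final call: the CTO agent summarizes priorities and blockers.
  transcript.push(`cto: ${ask("cto", `Summarize priorities and blockers:\n\n${transcript.join("\n")}`)}`);
  return transcript; // 8 × 3 + 1 = 25 calls
}
```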

The Diagnostic Layer

Every 3 hours, a session doctor process captures each session’s recent terminal output and git activity, then analyzes it. It classifies sessions as productive, stuck, idle, or error-looping. It identifies issues, suggests prompt improvements, and extracts reusable patterns from successful work.
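The classification step can be modelled as a simple decision over a few signals — a sketch with made-up thresholds; the actual doctor analyzes raw terminal output with a model rather than counting fields:

```typescript
// Session classification, assuming the signals have already been extracted
// from recent terminal output and git activity.

type Status = "productive" | "stuck" | "idle" | "error-looping";

interface SessionSignals {
  commitsLast3h: number;      // from git activity
  repeatedErrorLines: number; // same error seen N times in the output
  outputLines: number;        // any terminal activity at all
}

function classify(s: SessionSignals): Status {
  if (s.outputLines === 0) return "idle";
  if (s.repeatedErrorLines >= 5) return "error-looping";
  if (s.commitsLast3h > 0) return "productive";
  return "stuck"; // active, but producing nothing committable
}
```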

The extracted patterns become skill files — short documents describing a specific technique and when to apply it. Example: “When parsing relay event tags that may contain NaN, always guard with Number.isFinite() before arithmetic.” These accumulate over time. Sessions read them at the start of each cycle.

This is not artificial intelligence discovering novel algorithms. It is a structured process for capturing things that worked and making them findable. A wiki that actually gets updated, because updating it is automated.

The Shared Log

All systems write to one append-only JSONL file: experiment-log.jsonl. Each entry has a timestamp, type, and data payload. Types include ralph_cycle, cto_review, team_standup, doctor_analysis.

The log serves three purposes:

  • Sessions read it before planning, to avoid repeating work another session already tried
  • The session doctor reads it to identify patterns across sessions
  • It provides a complete audit trail of every action the factory takes

No database. No message queue. One file, appended to by multiple processes, read at the start of each cycle. It works because the cycles are minutes apart, not milliseconds.
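The whole mechanism fits in two functions — a minimal sketch using Node's fs module; the entry shape follows the description above, the field names are illustrative:

```typescript
import { appendFileSync, readFileSync } from "node:fs";

type EntryType = "ralph_cycle" | "cto_review" | "team_standup" | "doctor_analysis";

interface LogEntry {
  ts: string;
  type: EntryType;
  data: Record<string, unknown>;
}

// One JSON object per line. Concurrent appends from separate processes are
// safe enough when cycles are minutes apart, not milliseconds.
function append(path: string, type: EntryType, data: Record<string, unknown>): void {
  const entry: LogEntry = { ts: new Date().toISOString(), type, data };
  appendFileSync(path, JSON.stringify(entry) + "\n");
}

// Sessions call this at the start of each cycle to see what was already tried.
function readLog(path: string): LogEntry[] {
  return readFileSync(path, "utf8")
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as LogEntry);
}
```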

The Telegram Bot

Moltbot is the human interface. Send a message, get a status report. Trigger a review cycle. Queue a task. The factory also has HTTP webhook endpoints — POST /trigger/review, POST /trigger/standup, POST /trigger/task — so any script or bot can kick off actions without waiting for the next scheduled loop.
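The webhook layer needs nothing beyond Node's built-in http module. A sketch, with the routing pulled out as a pure function and the trigger bodies stubbed — the endpoint paths match the ones above, everything else is illustrative:

```typescript
import { createServer } from "node:http";

type Trigger = () => void;

const triggers: Record<string, Trigger> = {
  "/trigger/review": () => { /* run a CTO review pass */ },
  "/trigger/standup": () => { /* convene a team meeting */ },
  "/trigger/task": () => { /* enqueue a task */ },
};

// Pure dispatch, so the routing can be exercised without a live server.
function dispatch(method: string, url: string): number {
  if (method !== "POST") return 405;
  const trigger = triggers[url];
  if (!trigger) return 404;
  trigger();
  return 202; // accepted — the action runs on its own schedule
}

function serve(port: number): void {
  createServer((req, res) => {
    res.statusCode = dispatch(req.method ?? "GET", req.url ?? "/");
    res.end();
  }).listen(port);
}
```

Returning 202 rather than 200 is deliberate: the caller kicked something off; it did not wait for the result.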

Every 6 hours, the bot reads its own logs, finds messages it could not handle, and writes new skill files to cover the gaps. Then it restarts. This is the least interesting kind of self-improvement — pattern matching on error logs and generating templates — but it works. The bot handles more request types now than when it launched, without anyone manually adding handlers.

The Human in the Loop

A human wrote every script in this system. A human chose tmux over Docker, PM2 over systemd, JSONL over PostgreSQL, Sonnet over Opus. A human debugged the scheduler when it crashed in a loop for 21 hours during Phase 2. A human set the review criteria, the meeting structure, the diagnostic thresholds.

The agents do not decide what features to build. They do not prioritise the roadmap. They do not talk to users. They execute scoped tasks within a system that a human designed, configured, deployed, and monitors daily via Telegram.

When the factory produces bad output — and it does — a human notices, diagnoses the root cause, and adjusts the process. The session doctor can flag a stuck session. It cannot redesign the architecture that caused the session to get stuck.

This is an amplifier, not a replacement. The distinction matters because the failure mode of treating it as a replacement is shipping broken code with high confidence.

The Stack

  • AI model: Claude Sonnet 4.6 via claude -p pipe mode
  • Session management: tmux (one pane per agent)
  • Process management: PM2 (restart policies, log rotation, ecosystem configs)
  • Code isolation: Git branches — every change isolated until reviewed
  • Scheduler: Node.js, 5-minute loop, resource gating (RAM > 85% → skip assignments)
  • TypeScript compiler queue: Semaphore limiting to 2 concurrent tsc runs (RAM constraint)
  • Shared state: JSONL append-only log (no database)
  • Review queue: JSON file (pending / approved / rejected)
  • Communication: Telegram bot + HTTP webhook triggers
  • Cron schedule: Session Doctor (3h), CTO Review (2h), Team Standup (2h), Moltbot Self-Improve (6h)
  • Cost: ~$15-20/day in API calls for the entire system

No Kubernetes. No container orchestration. No CI/CD pipeline. The simplest tools that work for the problem.
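The compiler queue from the list above is the one piece worth spelling out: a counting semaphore capped at 2, because two concurrent tsc runs is all 16GB comfortably allows alongside everything else. A sketch of how such a gate might look; the class is illustrative, not the factory's actual code:

```typescript
// Counting semaphore limiting concurrent tsc invocations.

class Semaphore {
  private waiters: Array<() => void> = [];

  constructor(private permits: number) {}

  // Non-blocking: true if a permit was free.
  tryAcquire(): boolean {
    if (this.permits > 0) {
      this.permits -= 1;
      return true;
    }
    return false;
  }

  // Blocking: resolves when a permit frees up.
  acquire(): Promise<void> {
    if (this.tryAcquire()) return Promise.resolve();
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next();          // hand the permit straight to a waiter
    else this.permits += 1;
  }
}

// At most two type-checks run at once:
const tscGate = new Semaphore(2);
```

Each session would wrap its compile in `await tscGate.acquire(); try { /* run tsc */ } finally { tscGate.release(); }` — the `finally` matters, or one crashed compile leaks a permit forever.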

What It Produces

In a typical 24-hour period: 40-60 Ralph Loop cycles across 5 active sessions, 12 standup meetings, 12 CTO review passes, 8 session doctor analyses, 4 Moltbot self-improvement cycles. Approximately 15-25 branches merged to main per day, after review. The rest are discarded — changes that made things worse.

The discard rate is the important metric. It means the system is trying things, evaluating them honestly, and throwing away what does not improve the codebase. This is the behaviour you want. The alternative — merging everything and hoping for the best — is how technical debt accumulates.

What It Does Not Do

It does not write architecture documents. It does not make product decisions. It does not handle customer support. It does not negotiate with partners. It does not understand why a prediction market for Swiss franc exchange rates matters more than one for meme coins.

It fixes TypeScript errors, writes tests, guards against NaN propagation, cleans up stale code, and occasionally finds a real bug that a human would have missed because humans do not read every file in a 100,000-line codebase every day.

That is enough.


bao.markets

Mr BAO

Tags

#BAOMarkets #SoftwareEngineering #BuildInPublic #Nostr #DevOps #Claude

