Ralph Loop: A Practical Guide to Running AI Coding Agents in Continuous Loops


Date: 2026-03-19
Platform: Nostr (long-form note)
Series: Introducing BAO Markets
Author: Mr BAO
Companion to: The AI Software Factory

What This Is

A technical companion to our previous article on the AI software factory. This one focuses specifically on the Ralph Loop — how it works, how we implemented it, the architectural decision that saves us 25x in token costs, and how you can replicate it on your own VPS.

The Ralph Loop technique was pioneered by Geoffrey Huntley. The core idea is simple: feed the same prompt to an AI coding agent repeatedly. The agent sees its own previous work in the files and git history. Each iteration builds on the last. Failures are deterministic, enabling systematic improvement through prompt tuning.

while :; do
  cat PROMPT.md | claude-code --continue
done

There are two ways to implement this with Claude Code. We tried both. One costs 25x more than the other.

The Two Approaches

Approach 1: Interactive Session with Stop Hook (Plugin)

Claude Code has a plugin system. The official ralph-loop plugin works like this:

  1. You start an interactive Claude Code session
  2. Run /ralph-loop "Fix all TypeScript errors" --max-iterations 20
  3. The plugin creates a state file (.claude/ralph-loop.local.md)
  4. Claude works on the task
  5. When Claude tries to exit, a stop hook intercepts it
  6. The hook reads the conversation transcript, checks for a completion signal
  7. If not complete, it feeds the same prompt back
  8. Claude sees its previous conversation plus its changes in the files
  9. Loop continues until done or max iterations reached

Completion is signalled with an XML tag:

<promise>TASK COMPLETE</promise>

The hook validates this against the expected promise string. No false exits.
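A minimal sketch of that validation, assuming the hook can grep the transcript for the exact promise string. TRANSCRIPT and EXPECTED_PROMISE are names invented for this sketch, not the plugin's internals:

```shell
# Illustrative promise check, roughly what a stop hook does before allowing exit.
TRANSCRIPT="transcript.txt"
EXPECTED_PROMISE="TASK COMPLETE"
printf 'working...\n<promise>TASK COMPLETE</promise>\n' > "$TRANSCRIPT"

if grep -q "<promise>${EXPECTED_PROMISE}</promise>" "$TRANSCRIPT"; then
  DECISION="stop"       # exact promise found: allow the session to exit
else
  DECISION="continue"   # not found: feed the same prompt back
fi
echo "$DECISION"
```

Matching on the full `<promise>…</promise>` string, rather than any mention of "complete", is what prevents false exits.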

This is elegant for interactive work. You run it in your terminal, watch it iterate, and it stops when done. Good for bug fixes, feature prototyping, focused tasks with 5-10 iterations.

Approach 2: One-Shot Pipe Mode in a Bash Loop (Factory)

Our factory on the VPS does something different:

  1. A bash script runs in a for loop, managed by PM2
  2. Each iteration invokes claude -p --model sonnet (pipe mode, non-interactive)
  3. The BRL (Build-Review-Loop) prompt is piped via stdin
  4. Claude executes the cycle: read context, analyze codebase, pick a task, make a change, evaluate
  5. Claude exits. The bash script parses the output for a RESULT: line
  6. If the change passed evaluation, the branch gets pushed and queued for review
  7. The script sleeps 10 seconds, then starts the next iteration with a fresh Claude invocation

Each invocation is a clean slate. No conversation history. Claude sees the codebase and git history, not its previous conversation.

Why Token Cost Decides the Architecture

This is the critical insight. The two approaches have fundamentally different cost curves.

Interactive Session: O(n²) Token Growth

In an interactive session, the conversation accumulates. Every iteration, Claude reads everything that came before:

Iteration   Input Tokens (approx)   Cumulative
 1              25,000                 25,000
 2              50,000                 75,000
 3              75,000                150,000
10             250,000              1,375,000
20             500,000              5,250,000
50           1,250,000             31,875,000

50 iterations ≈ 32 million tokens.

Claude Code has an autoCompact feature that compresses older messages when approaching context limits. This helps, but compaction loses detail and still processes significant context per iteration. Realistically, you save maybe 2-3x with compaction. Still 10-15 million tokens for 50 cycles.

One-Shot Pipe Mode: O(n) Token Growth

In pipe mode, each invocation starts fresh. The cost per cycle is constant:

Iteration   Input Tokens (approx)   Cumulative
 1              25,000                 25,000
 2              25,000                 50,000
 3              25,000                 75,000
10              25,000                250,000
20              25,000                500,000
50              25,000              1,250,000

50 iterations ≈ 1.25 million tokens.

That is a 25x reduction for the same number of cycles.
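The arithmetic behind both tables, as a quick sanity check (25K per cycle is our observed average, not a constant):

```shell
# Cumulative token cost after n cycles, at ~25K tokens per fresh cycle.
per_cycle=25000
n=50
one_shot=$(( per_cycle * n ))                    # O(n): every cycle costs the same
interactive=$(( per_cycle * n * (n + 1) / 2 ))  # O(n^2): cycle k re-reads k-1 cycles of context
echo "one-shot=${one_shot} interactive=${interactive} ratio=$(( interactive / one_shot ))"
```

At n=50 this gives 1.25M versus 31.875M tokens, a ratio of about 25.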

The Tradeoff

The one-shot approach loses conversational context. Claude does not remember what it tried last cycle. It only sees:

  • The current state of files on disk
  • Git history (branches, commits, diffs)
  • A shared experiment log (last 15 entries from all sessions)

In practice, this is enough. The BRL prompt tells Claude to check the experiment log for recent work, avoiding duplicates. The git branch names encode timestamps and session names. The codebase itself is the memory.

For autonomous 24/7 factory work, this is the correct architecture. The agent does not need to remember its thought process. It needs to see the current state of the code, pick something small, and either improve it or discard the attempt. Stateless cycles with shared persistent state (files + git + log).

For interactive development — where a human is watching and the task needs 5-10 iterations — the plugin approach is fine. The token cost is manageable at that scale, and conversational context helps the agent build on its reasoning, not just its file changes.

The Implementation

Prerequisites

  • A VPS (we use Hetzner, 16GB RAM, ~€15/month)
  • Claude Code CLI installed (npm install -g @anthropic-ai/claude-code)
  • An Anthropic API key with sufficient credits
  • PM2 for process management (npm install -g pm2)
  • A git repository for your project

Directory Structure

/root/bao-factory/
├── scripts/
│   ├── ralph-loop.sh          # The main loop script
│   └── cto-review-brl.sh      # Automated code review
├── logs/                       # Per-session logs
├── ralph-ecosystem.config.cjs  # PM2 configuration
├── experiment-log.jsonl        # Shared knowledge base
├── review-queue.json           # Branches awaiting review
└── .ralph-state-*.json         # Per-session state

The Loop Script

Here is the core of ralph-loop.sh, stripped to essentials:

#!/usr/bin/env bash
set -euo pipefail

SESSION_NAME="${1:?Usage: ralph-loop.sh SESSION_NAME [MAX_ITERATIONS]}"
MAX_ITERATIONS="${2:-50}"

PROJECT_DIR="/root/bao.markets"
FACTORY_DIR="/root/bao-factory"
EXPERIMENT_LOG="${FACTORY_DIR}/experiment-log.jsonl"

cd "$PROJECT_DIR"

for ((ITERATION=1; ITERATION<=MAX_ITERATIONS; ITERATION++)); do
  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Cycle ${ITERATION}/${MAX_ITERATIONS}"

  # Always start from main, up to date
  # '|| true' so a transient git failure doesn't kill the loop under set -e
  git checkout main 2>/dev/null || true
  git pull origin main 2>/dev/null || true

  # Baseline metrics
  # grep -c already prints 0 on no match; '|| true' only absorbs the non-zero
  # exit under pipefail (a trailing 'echo 0' would append a second "0" line)
  TSC_ERRORS=$(npx tsc --noEmit 2>&1 | grep -c "error TS" || true)

  # Recent work by all sessions (avoid duplicates)
  RECENT_LOG=$(tail -15 "$EXPERIMENT_LOG" 2>/dev/null || true)

  # Branch for this cycle
  BRANCH_NAME="brl/${SESSION_NAME}/$(date +%Y%m%d-%H%M)"

  # Build the prompt
  BRL_PROMPT="## BRL Cycle — Session: ${SESSION_NAME}

You are improving the codebase. Working directory: ${PROJECT_DIR}

You do NOT merge to main. Create a branch, make changes, push the branch.

### Recent work by other sessions (avoid duplicates):
${RECENT_LOG}

### Steps:
1. Read CLAUDE.md for project rules
2. Current TSC errors: ${TSC_ERRORS}. Run tests: npx vitest run 2>&1 | tail -20
3. Pick ONE small task (TS error, failing test, bug, code smell)
4. git checkout -b ${BRANCH_NAME}
5. Make your change. One file if possible. Commit.
6. Evaluate: npx tsc --noEmit, npx vitest run
7. If no regressions: git push origin ${BRANCH_NAME}
8. If regressions: git checkout main && git branch -D ${BRANCH_NAME}

Your LAST line must be:
RESULT: pass | <description> | tsc_before=${TSC_ERRORS} tsc_after=<N>
or:
RESULT: discard | <description> | tsc_before=${TSC_ERRORS} tsc_after=<N>

Do all steps. Do not ask questions."

  # Run Claude in pipe mode (one-shot, no conversation state)
  RESULT=$(echo "$BRL_PROMPT" | claude -p --model sonnet 2>&1 || true)

  # Parse the RESULT line
  RESULT_LINE=$(echo "$RESULT" | grep -i "RESULT:" | tail -1 || true)

  if echo "$RESULT_LINE" | grep -qi "pass"; then
    STATUS="pending-review"
    # Branch was pushed — add to review queue (see below)
  elif echo "$RESULT_LINE" | grep -qi "discard"; then
    STATUS="discard"
    git checkout main 2>/dev/null || true
  else
    STATUS="no-result"
    # Fallback: check if branch was pushed anyway
    git checkout main 2>/dev/null || true
  fi

  # Log to shared experiment log
  echo "{\"ts\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"type\":\"ralph_cycle\",\"data\":{\"session\":\"${SESSION_NAME}\",\"iteration\":${ITERATION},\"status\":\"${STATUS}\",\"branch\":\"${BRANCH_NAME}\",\"tsc_baseline\":${TSC_ERRORS}}}" >> "$EXPERIMENT_LOG"

  # Cooldown
  sleep 10
done

Key Design Decisions

1. claude -p --model sonnet — pipe mode with Sonnet

Pipe mode (-p) means: read stdin, execute, print output, exit. No interactive session. No conversation state. This is what gives us O(n) token cost.

We use Sonnet, not Opus. Sonnet is fast enough for focused 15-minute cycles and cheap enough to run 24/7. Opus is better for complex architectural decisions, but a Ralph Loop cycle is intentionally scoped to one small change. Sonnet handles that well.

2. Branch isolation — never touch main

Every cycle creates a branch: brl/<session>/<timestamp>. Multiple sessions run in parallel without conflicts. No session ever merges to main directly. All merges go through a review gate.

3. The experiment log — cross-session memory

experiment-log.jsonl is an append-only JSONL file. Every system writes to it: Ralph loops, CTO review, standup meetings, session doctor. Each entry is one JSON line with a timestamp, type, and data payload.

Before planning its task, each Ralph cycle reads the last 15 entries. This prevents Session A from fixing the same TypeScript error that Session B already fixed 10 minutes ago. The log is the shared memory that replaces conversational context.

No database. No message queue. One file. It works because cycles are minutes apart, not milliseconds.
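Reading the log back is just as simple. A sketch of how a cycle could summarize recent entries for its prompt, assuming jq is available (the entry shape matches what the loop script writes):

```shell
# Append one example entry in the shape the loop script writes, then read it back.
echo '{"ts":"2026-03-19T12:45:00Z","type":"ralph_cycle","data":{"session":"bao-fixer","iteration":3,"status":"pending-review","branch":"brl/bao-fixer/20260319-1245","tsc_baseline":12}}' >> experiment-log.jsonl

# Summarize the last 15 entries as tab-separated lines for inclusion in the prompt.
tail -15 experiment-log.jsonl \
  | jq -r '[.ts, .type, (.data.session // "-"), (.data.status // "-")] | @tsv'
```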

4. Fallback detection — resilience against output parsing failures

Sometimes Claude does not output the RESULT: line in the expected format. It wraps it in markdown code fences, or puts text after it, or forgets entirely. The script handles this:

  • Strip markdown fences from the result line
  • Case-insensitive matching
  • If no RESULT: line found: check if the branch was pushed to origin anyway
  • If branch exists remotely: queue it for review regardless
  • If branch exists locally but was not pushed: push it and queue it

This fallback catches about 15% of cycles where the agent is creative with its output format but still did useful work.
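Sketched in shell, using the same variable names as the loop script (the branch name is illustrative; with no matching remote or local branch, the status stays no-result):

```shell
# Hypothetical fallback when no RESULT: line could be parsed.
STATUS="no-result"
BRANCH_NAME="brl/bao-fixer/20260319-1245"

if [ "$STATUS" = "no-result" ]; then
  if git ls-remote --exit-code --heads origin "$BRANCH_NAME" >/dev/null 2>&1; then
    STATUS="pending-review"            # pushed despite malformed output: review it anyway
  elif git show-ref --verify --quiet "refs/heads/$BRANCH_NAME" 2>/dev/null; then
    # branch exists locally but was never pushed: push it, then queue it
    git push origin "$BRANCH_NAME" >/dev/null 2>&1 && STATUS="pending-review"
  fi
fi
echo "status=${STATUS}"
```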

5. TSC error gating — regressions are automatically discarded

The prompt tells Claude to check TypeScript errors before and after its change. If errors increased, discard the branch. The script also checks via the RESULT: line parsing. This is the quality gate that prevents the codebase from degrading over time.
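A hypothetical version of that script-side check, parsing tsc_after out of the RESULT line (the sample values are invented for illustration):

```shell
# Downgrade regressions to discard even if the agent claimed a pass.
TSC_ERRORS=12
RESULT_LINE="RESULT: pass | tighten parsePrice guard | tsc_before=12 tsc_after=14"

TSC_AFTER=$(echo "$RESULT_LINE" | grep -o 'tsc_after=[0-9]*' | cut -d= -f2)
if [ -n "$TSC_AFTER" ] && [ "$TSC_AFTER" -gt "$TSC_ERRORS" ]; then
  STATUS="discard"   # error count went up: throw the branch away
fi
echo "tsc_after=${TSC_AFTER} status=${STATUS:-pass}"
```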

The discard rate (branches thrown away) runs about 30-40%. That is healthy. It means the system is trying things and honestly evaluating them.

PM2 Configuration

Five parallel Ralph sessions:

// ralph-ecosystem.config.cjs
module.exports = {
  apps: [
    {
      name: 'ralph-fixer',
      script: '/root/bao-factory/scripts/ralph-loop.sh',
      args: 'bao-fixer 50',
      cwd: '/root/bao.markets',
      max_memory_restart: '800M',
      error_file: '/root/bao-factory/logs/ralph-fixer-error.log',
      out_file: '/root/bao-factory/logs/ralph-fixer-out.log',
    },
    {
      name: 'ralph-builder',
      script: '/root/bao-factory/scripts/ralph-loop.sh',
      args: 'bao-builder 50',
      cwd: '/root/bao.markets',
      max_memory_restart: '800M',
    },
    {
      name: 'ralph-tester',
      script: '/root/bao-factory/scripts/ralph-loop.sh',
      args: 'bao-tester 50',
      cwd: '/root/bao.markets',
      max_memory_restart: '800M',
    },
    // ... more sessions
  ]
};

Start:    pm2 start ralph-ecosystem.config.cjs
Monitor:  pm2 logs ralph-fixer --lines 50
Stop all: pm2 stop ralph-fixer ralph-builder ralph-tester

The Review Queue

Pushed branches land in review-queue.json:

{
  "pending": [
    {
      "source": "ralph-loop",
      "session": "bao-fixer",
      "branch": "brl/bao-fixer/20260319-1245",
      "summary": "Fix NaN guard in parsePrice()",
      "tsc_before": 12,
      "tsc_after": 12,
      "added_at": "2026-03-19T12:45:00Z",
      "status": "pending"
    }
  ],
  "completed": []
}

A separate CTO review script runs every 2 hours. It fetches each pending branch, diffs it against main, evaluates the change, and either merges or deletes. No code reaches main without review.

The review queue uses a lockfile for safe concurrent writes — multiple Ralph sessions may try to add entries simultaneously:

(
  flock -w 10 200 || exit 1
  # ... add entry to JSON ...
) 200>"${FACTORY_DIR}/.review-queue.lock"
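The elided "add entry" step might look like this with jq; the queue is seeded empty here for illustration, and the fields mirror the queue format above:

```shell
# Seed an empty queue (illustrative) and append one pending entry under the lock.
echo '{"pending":[],"completed":[]}' > review-queue.json
BRANCH_NAME="brl/bao-fixer/20260319-1245"
SESSION_NAME="bao-fixer"

(
  flock -w 10 200 || exit 1
  jq --arg branch "$BRANCH_NAME" --arg session "$SESSION_NAME" \
     --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
     '.pending += [{source:"ralph-loop", session:$session, branch:$branch, added_at:$ts, status:"pending"}]' \
     review-queue.json > review-queue.json.tmp \
    && mv review-queue.json.tmp review-queue.json
) 200>".review-queue.lock"
```

Writing to a temp file and renaming keeps readers from ever seeing a half-written queue.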

The Review Script

# Simplified cto-review-brl.sh

# Safety: abort any stuck merge from a previous failed run
git merge --abort 2>/dev/null || true
git reset --hard origin/main 2>/dev/null || true

for branch in $(get_pending_branches); do
  # Reset to clean state before each review
  git merge --abort 2>/dev/null || true
  git reset --hard origin/main 2>/dev/null || true

  git fetch origin "$branch"
  DIFF=$(git diff main...origin/"$branch")

  REVIEW_PROMPT="Review this change. Approve or reject with reasoning.
  DIFF: ${DIFF}"

  VERDICT=$(echo "$REVIEW_PROMPT" | claude -p --model sonnet 2>&1)

  if echo "$VERDICT" | grep -qi "approve"; then
    # Use 'if git merge' — bare 'git merge' under set -e kills the script on conflict
    if git merge --no-edit origin/"$branch"; then
      git push origin main
      git push origin --delete "$branch"
      mark_completed "$branch" "approved"
    else
      # Merge conflict — reject and move on, don't block remaining reviews
      git merge --abort 2>/dev/null || true
      git reset --hard origin/main 2>/dev/null || true
      git push origin --delete "$branch"
      mark_completed "$branch" "rejected" "merge conflict"
    fi
  else
    git push origin --delete "$branch"
    mark_completed "$branch" "rejected"
  fi
done

Token Cost Breakdown (Real Numbers)

Our factory runs 5 Ralph sessions doing ~10 cycles each per day (50 total), plus 12 CTO reviews, 12 standup rounds, and 8 session doctor analyses.

Ralph Loop (pipe mode, O(n))

Per cycle:
  Input:  ~2K (prompt) + ~15K (file reads, tsc output) = ~17K input tokens
  Output: ~5K-8K (reasoning, commands, code changes)
  Total:  ~25K tokens per cycle

Daily (50 cycles): 50 × 25K = 1.25M tokens
Monthly:           ~37.5M tokens

If we used interactive sessions instead (O(n²))

Per session of 10 cycles:
  Cycle 1:  25K
  Cycle 2:  50K
  Cycle 3:  75K
  ...
  Cycle 10: 250K
  Total per session: sum(1..10) × 25K = 1.375M tokens

Daily (5 sessions × 10 cycles): 5 × 1.375M = 6.875M tokens
Monthly:                         ~206M tokens

The pipe mode approach costs $15-20/day. The interactive approach would cost $75-100/day for the same output. At scale (more sessions, more iterations), the gap widens further because O(n²) compounds.

When Interactive Is Worth It

For a developer running /ralph-loop locally on a focused task — say, “fix these 5 TypeScript errors, here is the plan” with --max-iterations 8 — the total cost is:

sum(1..8) × 25K = 36 × 25K = 900K tokens ≈ $2-3

Completely reasonable. The conversational context helps the agent remember what it already tried and refine its approach. At 8 iterations, the quadratic cost is barely noticeable.

The breakpoint is around 15-20 iterations. Beyond that, pipe mode wins on cost. Below that, interactive mode wins on quality per iteration.

Replicating This

Minimum Viable Setup

  1. A VPS with 4GB+ RAM (8GB recommended for concurrent tsc runs)
  2. Node.js 20+, Claude Code CLI, PM2, git
  3. A project with a CLAUDE.md that describes conventions
  4. One ralph-loop.sh script
  5. One PM2 ecosystem config

You can start with a single session. Add more when you are comfortable with the output quality and review process.

What You Need to Customize

The BRL prompt. Ours is tailored to a TypeScript/React codebase with tsc --noEmit and vitest as quality gates. Your project might use cargo check, pytest, go vet, or something else entirely. The structure stays the same: baseline → plan → change → evaluate → report.

The result parsing. Our RESULT: pass | description | tsc_before=N tsc_after=N format is arbitrary. Pick whatever format works for your project. The fallback detection (checking if a branch was pushed regardless of output format) is important — do not skip it.
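A hypothetical hardened parser along those lines (the sample RESULT line is invented):

```shell
# Drop backticks (markdown fences) and keep the last line mentioning RESULT:,
# matched case-insensitively.
parse_result_line() {
  tr -d '`' | grep -i 'RESULT:' | tail -1
}

LINE=$(printf 'chatter\n```\nRESULT: pass | fix NaN guard | tsc_before=12 tsc_after=11\n```\n' | parse_result_line)
echo "$LINE"
```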

The review criteria. Our CTO review checks TypeScript errors, test regressions, and project conventions. Yours might check different things. The principle is: automated review with clear criteria, no self-merging.

What You Should Not Customize

The branch isolation. Every cycle gets its own branch. Every session has its own namespace. No session merges to main. This is non-negotiable for safety.

The experiment log. Cross-session awareness prevents duplicate work. Without it, three sessions will all fix the same obvious bug and create three conflicting branches.

The cooldown. 10 seconds between cycles prevents API rate limiting and gives git operations time to propagate. Do not remove it.

What We Learned

Small changes compound. A single Ralph cycle fixes one TypeScript error or one failing test. Fifteen of those per day, five days a week, and you have cleaned up 75 issues that no human would have prioritized individually.

Discard rate is a feature. 30-40% of attempts get thrown away. This is not waste. This is the system being honest about what works. A 0% discard rate means the quality gate is too loose.

Conversational context is overrated for autonomous work. We assumed the agent would need to remember what it tried. It does not. The codebase is the memory. The experiment log handles cross-session coordination. Saving 25x on tokens by dropping conversation history was the best architectural decision we made.

Sonnet is the right model for loops. Opus is better at complex reasoning. But a Ralph cycle is not complex reasoning. It is: read code, find a small issue, fix it, verify. Sonnet handles this at 3x lower cost and faster response times.

PM2 is enough. We considered Docker, Kubernetes, custom orchestrators. PM2 with restart policies and log rotation handles everything we need. When a session crashes (out of memory, API timeout), PM2 restarts it. The script picks up from the next iteration. Simple tools for simple problems.

Merge conflicts will block everything if you do not handle them. We learned this the hard way. Five Ralph sessions working in parallel will occasionally produce branches that touch the same file. When the review script tries to merge one of these, git returns a non-zero exit code. If your script uses set -e (which it should), a bare git merge kills the entire process. The fix is to always wrap merges in if git merge ...; then and always reset to a clean state between iterations. We had 17 reviews stuck for 24 hours because one merge conflict left the repo in a MERGING state, and every subsequent cron run hit "needs merge" and silently exited. The safety pattern: git merge --abort; git reset --hard origin/main at the start of each iteration. Defensive, ugly, necessary.

The Honest Assessment

This system produces incremental improvements. It does not produce architecture. It does not make product decisions. It does not write the prompts that drive it. A human does all of that.

The Ralph Loop is a janitor, not an architect. It sweeps up TypeScript errors, guards against NaN propagation, adds missing null checks, fixes off-by-one errors in date calculations. These are real improvements that make the codebase more reliable. They are not exciting. They are not the kind of work that gets featured in AI demos.

But they are the kind of work that prevents production incidents at 3am. And having five janitors working 24/7 for $15/day is a good deal.


Links:

bao.markets


Tags

#BAOMarkets #SoftwareEngineering #BuildInPublic #Nostr #RalphLoop #ClaudeCode #DevOps #AIFactory

