Ralph Loop: A Practical Guide to Running AI Coding Agents in Continuous Loops
- What This Is
- The Two Approaches
- Why Token Cost Decides the Architecture
- The Implementation
- Token Cost Breakdown (Real Numbers)
- Replicating This
- What We Learned
- The Honest Assessment
- Author: Mr BAO
- Tags
Date: 2026-03-19 Platform: Nostr (long-form note) Series: Introducing BAO Markets Author: Mr BAO Companion to: The AI Software Factory
What This Is
A technical companion to our previous article on the AI software factory. This one focuses specifically on the Ralph Loop — how it works, how we implemented it, the architectural decision that saves us 25x in token costs, and how you can replicate it on your own VPS.
The Ralph Loop technique was pioneered by Geoffrey Huntley. The core idea is simple: feed the same prompt to an AI coding agent repeatedly. The agent sees its own previous work in the files and git history. Each iteration builds on the last. Failures are deterministic, enabling systematic improvement through prompt tuning.
while :; do
  cat PROMPT.md | claude-code --continue
done
There are two ways to implement this with Claude Code. We tried both. One costs 25x more than the other.
The Two Approaches
Approach 1: Interactive Session with Stop Hook (Plugin)
Claude Code has a plugin system. The official ralph-loop plugin works like this:
- You start an interactive Claude Code session
- Run `/ralph-loop "Fix all TypeScript errors" --max-iterations 20`
- The plugin creates a state file (`.claude/ralph-loop.local.md`)
- Claude works on the task
- When Claude tries to exit, a stop hook intercepts it
- The hook reads the conversation transcript, checks for a completion signal
- If not complete, it feeds the same prompt back
- Claude sees its previous conversation plus its changes in the files
- Loop continues until done or max iterations reached
Completion is signalled with an XML tag:
<promise>TASK COMPLETE</promise>
The hook validates this against the expected promise string. No false exits.
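The check itself is simple string matching. A minimal sketch of what such a hook validation might look like — the actual plugin's internals may differ, and `check_promise` is an illustrative name:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the stop hook's completion check.
# Returns "complete" if the transcript's last <promise> tag matches
# the expected string, else "continue" (feed the prompt back in).
check_promise() {
  local transcript="$1" expected="$2"
  local found
  found=$(grep -o '<promise>[^<]*</promise>' "$transcript" 2>/dev/null \
    | tail -1 | sed -e 's/<promise>//' -e 's|</promise>||')
  if [ "$found" = "$expected" ]; then
    echo "complete"
  else
    echo "continue"
  fi
}
```

Exact-match validation is what prevents false exits: the agent cannot end the loop by merely mentioning completion in prose, only by emitting the agreed tag verbatim.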
This is elegant for interactive work. You run it in your terminal, watch it iterate, and it stops when done. Good for bug fixes, feature prototyping, and focused tasks that need 5-10 iterations.
Approach 2: One-Shot Pipe Mode in a Bash Loop (Factory)
Our factory on the VPS does something different:
- A bash script runs in a `for` loop, managed by PM2
- Each iteration invokes `claude -p --model sonnet` (pipe mode, non-interactive)
- The BRL (Build-Review-Loop) prompt is piped via stdin
- Claude executes the cycle: read context, analyze codebase, pick a task, make a change, evaluate
- Claude exits. The bash script parses the output for a `RESULT:` line
- If the change passed evaluation, the branch gets pushed and queued for review
- The script sleeps 10 seconds, then starts the next iteration with a fresh Claude invocation
Each invocation is a clean slate. No conversation history. Claude sees the codebase and git history, not its previous conversation.
Why Token Cost Decides the Architecture
This is the critical insight. The two approaches have fundamentally different cost curves.
Interactive Session: O(n²) Token Growth
In an interactive session, the conversation accumulates. Every iteration, Claude reads everything that came before:
| Iteration | Input Tokens (approx) | Cumulative |
|---|---|---|
| 1 | 25,000 | 25,000 |
| 2 | 50,000 | 75,000 |
| 3 | 75,000 | 150,000 |
| 10 | 250,000 | 1,375,000 |
| 20 | 500,000 | 5,250,000 |
| 50 | 1,250,000 | 31,875,000 |
50 iterations ≈ 32 million tokens.
Claude Code has an autoCompact feature that compresses older messages when approaching context limits. This helps, but compaction loses detail and still processes significant context per iteration. Realistically, you save maybe 2-3x with compaction. Still 10-15 million tokens for 50 cycles.
One-Shot Pipe Mode: O(n) Token Growth
In pipe mode, each invocation starts fresh. The cost per cycle is constant:
| Iteration | Input Tokens (approx) | Cumulative |
|---|---|---|
| 1 | 25,000 | 25,000 |
| 2 | 25,000 | 50,000 |
| 3 | 25,000 | 75,000 |
| 10 | 25,000 | 250,000 |
| 20 | 25,000 | 500,000 |
| 50 | 25,000 | 1,250,000 |
50 iterations ≈ 1.25 million tokens.
That is a 25x reduction for the same number of cycles.
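Both tables follow from closed forms: interactive cost is 25K · n(n+1)/2, pipe mode is 25K · n. The 50-iteration row checks out in a few lines of arithmetic:

```shell
# Verify the 50-iteration row of both tables, assuming a constant
# 25K-token working set per cycle (the approximation used above).
PER_CYCLE=25000
N=50

# Interactive: iteration i re-reads all prior context -> 25K * N(N+1)/2
INTERACTIVE=$(( PER_CYCLE * N * (N + 1) / 2 ))   # 25000 * 1275 = 31875000
# Pipe mode: constant cost per cycle -> 25K * N
PIPE=$(( PER_CYCLE * N ))                        # 1250000

echo "interactive: ${INTERACTIVE}"
echo "pipe mode:   ${PIPE}"
echo "ratio:       $(( INTERACTIVE / PIPE ))x"   # (N+1)/2, integer-rounded
```

The ratio (N+1)/2 is why the gap widens with scale: at 50 iterations it is 25x; at 100 it would be 50x.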
The Tradeoff
The one-shot approach loses conversational context. Claude does not remember what it tried last cycle. It only sees:
- The current state of files on disk
- Git history (branches, commits, diffs)
- A shared experiment log (last 15 entries from all sessions)
In practice, this is enough. The BRL prompt tells Claude to check the experiment log for recent work, avoiding duplicates. The git branch names encode timestamps and session names. The codebase itself is the memory.
For autonomous 24/7 factory work, this is the correct architecture. The agent does not need to remember its thought process. It needs to see the current state of the code, pick something small, and either improve it or discard the attempt. Stateless cycles with shared persistent state (files + git + log).
For interactive development — where a human is watching and the task needs 5-10 iterations — the plugin approach is fine. The token cost is manageable at that scale, and conversational context helps the agent build on its reasoning, not just its file changes.
The Implementation
Prerequisites
- A VPS (we use Hetzner, 16GB RAM, ~€15/month)
- Claude Code CLI installed (`npm install -g @anthropic-ai/claude-code`)
- An Anthropic API key with sufficient credits
- PM2 for process management (`npm install -g pm2`)
- A git repository for your project
Directory Structure
/root/bao-factory/
├── scripts/
│ ├── ralph-loop.sh # The main loop script
│ └── cto-review-brl.sh # Automated code review
├── logs/ # Per-session logs
├── ralph-ecosystem.config.cjs # PM2 configuration
├── experiment-log.jsonl # Shared knowledge base
├── review-queue.json # Branches awaiting review
└── .ralph-state-*.json # Per-session state
The Loop Script
Here is the core of `ralph-loop.sh`, stripped to essentials:
#!/usr/bin/env bash
set -euo pipefail

SESSION_NAME="${1:?Usage: ralph-loop.sh SESSION_NAME [MAX_ITERATIONS]}"
MAX_ITERATIONS="${2:-50}"
PROJECT_DIR="/root/bao.markets"
FACTORY_DIR="/root/bao-factory"
EXPERIMENT_LOG="${FACTORY_DIR}/experiment-log.jsonl"

cd "$PROJECT_DIR"

for ((ITERATION=1; ITERATION<=MAX_ITERATIONS; ITERATION++)); do
  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Cycle ${ITERATION}/${MAX_ITERATIONS}"

  # Always start from main, up to date. "|| true" keeps a transient
  # network failure from killing the loop under set -e.
  git checkout main 2>/dev/null || true
  git pull origin main 2>/dev/null || true

  # Baseline metrics. grep -c prints the count even when it is 0,
  # so "|| true" only absorbs the nonzero exit on zero matches.
  TSC_ERRORS=$(npx tsc --noEmit 2>&1 | grep -c "error TS" || true)

  # Recent work by all sessions (avoid duplicates)
  RECENT_LOG=$(tail -15 "$EXPERIMENT_LOG" 2>/dev/null || true)

  # Branch for this cycle
  BRANCH_NAME="brl/${SESSION_NAME}/$(date +%Y%m%d-%H%M)"

  # Build the prompt (flush-left so indentation does not leak into the string)
  BRL_PROMPT="## BRL Cycle — Session: ${SESSION_NAME}

You are improving the codebase. Working directory: ${PROJECT_DIR}
You do NOT merge to main. Create a branch, make changes, push the branch.

### Recent work by other sessions (avoid duplicates):
${RECENT_LOG}

### Steps:
1. Read CLAUDE.md for project rules
2. Current TSC errors: ${TSC_ERRORS}. Run tests: npx vitest run 2>&1 | tail -20
3. Pick ONE small task (TS error, failing test, bug, code smell)
4. git checkout -b ${BRANCH_NAME}
5. Make your change. One file if possible. Commit.
6. Evaluate: npx tsc --noEmit, npx vitest run
7. If no regressions: git push origin ${BRANCH_NAME}
8. If regressions: git checkout main && git branch -D ${BRANCH_NAME}

Your LAST line must be:
RESULT: pass | <description> | tsc_before=${TSC_ERRORS} tsc_after=<N>
or:
RESULT: discard | <description> | tsc_before=${TSC_ERRORS} tsc_after=<N>

Do all steps. Do not ask questions."

  # Run Claude in pipe mode (one-shot, no conversation state).
  # "|| true" so an API error does not abort the whole loop.
  RESULT=$(echo "$BRL_PROMPT" | claude -p --model sonnet 2>&1 || true)

  # Parse the RESULT line ("|| true": no match must not kill the loop)
  RESULT_LINE=$(echo "$RESULT" | grep -i "RESULT:" | tail -1 || true)

  # Match the status token after RESULT:, not any "pass"/"discard"
  # that happens to appear in the description.
  if echo "$RESULT_LINE" | grep -qiE 'RESULT:[[:space:]]*pass'; then
    STATUS="pending-review"
    # Branch was pushed — add to review queue (see below)
  elif echo "$RESULT_LINE" | grep -qiE 'RESULT:[[:space:]]*discard'; then
    STATUS="discard"
    git checkout main 2>/dev/null || true
  else
    STATUS="no-result"
    # Fallback: check if branch was pushed anyway
    git checkout main 2>/dev/null || true
  fi

  # Log to shared experiment log
  echo "{\"ts\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"type\":\"ralph_cycle\",\"data\":{\"session\":\"${SESSION_NAME}\",\"iteration\":${ITERATION},\"status\":\"${STATUS}\",\"branch\":\"${BRANCH_NAME}\",\"tsc_baseline\":${TSC_ERRORS}}}" >> "$EXPERIMENT_LOG"

  # Cooldown
  sleep 10
done
Key Design Decisions
1. claude -p --model sonnet — pipe mode with Sonnet
Pipe mode (-p) means: read stdin, execute, print output, exit. No interactive session. No conversation state. This is what gives us O(n) token cost.
We use Sonnet, not Opus. Sonnet is fast enough for focused 15-minute cycles and cheap enough to run 24/7. Opus is better for complex architectural decisions, but a Ralph Loop cycle is intentionally scoped to one small change. Sonnet handles that well.
2. Branch isolation — never touch main
Every cycle creates a branch: brl/<session>/<timestamp>. Multiple sessions run in parallel without conflicts. No session ever merges to main directly. All merges go through a review gate.
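Keeping the naming mechanical makes a session's output greppable. A sketch (the `git branch` listing assumes you are inside the project repo):

```shell
# Branch names encode owner and time: brl/<session>/<timestamp>
SESSION="bao-fixer"
BRANCH="brl/${SESSION}/$(date +%Y%m%d-%H%M)"
echo "$BRANCH"   # e.g. brl/bao-fixer/20260319-1245

# List every remote branch created by one session:
git branch -r --list "origin/brl/${SESSION}/*" 2>/dev/null || true
```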
3. The experiment log — cross-session memory
experiment-log.jsonl is an append-only JSONL file. Every system writes to it: Ralph loops, CTO review, standup meetings, session doctor. Each entry is one JSON line with a timestamp, type, and data payload.
Before planning its task, each Ralph cycle reads the last 15 entries. This prevents Session A from fixing the same TypeScript error that Session B already fixed 10 minutes ago. The log is the shared memory that replaces conversational context.
No database. No message queue. One file. It works because cycles are minutes apart, not milliseconds.
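Because each entry is one JSON object per line, ordinary line tools can query the log. For example, a rough discard rate needs nothing more than `grep` — a sketch, where `discard_rate` is an illustrative helper rather than part of the factory scripts:

```shell
# Sketch: compute a rough discard rate straight from the JSONL log.
# No database, no jq; substring matching on the serialized entries.
discard_rate() {
  local log="$1"
  local discards total
  discards=$(tail -200 "$log" 2>/dev/null | grep -c '"status":"discard"' || true)
  total=$(tail -200 "$log" 2>/dev/null | grep -c '"type":"ralph_cycle"' || true)
  echo "${discards}/${total}"
}

# e.g. discard_rate /root/bao-factory/experiment-log.jsonl
```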
4. Fallback detection — resilience against output parsing failures
Sometimes Claude does not output the RESULT: line in the expected format. It wraps it in markdown code fences, or puts text after it, or forgets entirely. The script handles this:
- Strip markdown fences from the result line
- Case-insensitive matching
- If no `RESULT:` line is found: check whether the branch was pushed to origin anyway
- If the branch exists remotely: queue it for review regardless
- If branch exists locally but was not pushed: push it and queue it
This fallback catches about 15% of cycles where the agent is creative with its output format but still did useful work.
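A sketch of that fallback, using branch existence as the source of truth — `fallback_status` is an illustrative name; the real script inlines this logic:

```shell
# Sketch of the fallback when the agent omits the RESULT: line.
# Decide the cycle's status from where the branch actually exists.
fallback_status() {
  local branch="$1"
  if git ls-remote --exit-code --heads origin "$branch" >/dev/null 2>&1; then
    # Pushed remotely: useful work happened, queue it for review
    echo "pending-review"
  elif git show-ref --verify --quiet "refs/heads/${branch}"; then
    # Exists locally only: push it, then queue it
    git push origin "$branch" >/dev/null 2>&1 && echo "pending-review"
  else
    echo "no-result"
  fi
}
```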
5. TSC error gating — regressions are automatically discarded
The prompt tells Claude to check TypeScript errors before and after its change. If errors increased, discard the branch. The script also checks via the RESULT: line parsing. This is the quality gate that prevents the codebase from degrading over time.
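The gate reduces to comparing two integers pulled out of the RESULT line — a sketch; the real script folds this into its parsing:

```shell
# Sketch: extract tsc_before / tsc_after from a RESULT line and
# flag regressions.
RESULT_LINE='RESULT: pass | Fix NaN guard | tsc_before=12 tsc_after=11'
BEFORE=$(echo "$RESULT_LINE" | grep -o 'tsc_before=[0-9]*' | cut -d= -f2)
AFTER=$(echo "$RESULT_LINE" | grep -o 'tsc_after=[0-9]*' | cut -d= -f2)

# Missing values default pessimistically: treat as a regression.
if [ "${AFTER:-999}" -gt "${BEFORE:-0}" ]; then
  echo "regression"   # error count went up: discard the branch
else
  echo "ok"           # no regression: eligible for review
fi
```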
The discard rate (branches thrown away) runs about 30-40%. That is healthy. It means the system is trying things and honestly evaluating them.
PM2 Configuration
Five parallel Ralph sessions:
// ralph-ecosystem.config.cjs
module.exports = {
  apps: [
    {
      name: 'ralph-fixer',
      script: '/root/bao-factory/scripts/ralph-loop.sh',
      args: 'bao-fixer 50',
      cwd: '/root/bao.markets',
      max_memory_restart: '800M',
      error_file: '/root/bao-factory/logs/ralph-fixer-error.log',
      out_file: '/root/bao-factory/logs/ralph-fixer-out.log',
    },
    {
      name: 'ralph-builder',
      script: '/root/bao-factory/scripts/ralph-loop.sh',
      args: 'bao-builder 50',
      cwd: '/root/bao.markets',
      max_memory_restart: '800M',
    },
    {
      name: 'ralph-tester',
      script: '/root/bao-factory/scripts/ralph-loop.sh',
      args: 'bao-tester 50',
      cwd: '/root/bao.markets',
      max_memory_restart: '800M',
    },
    // ... more sessions
  ]
};
Start: pm2 start ralph-ecosystem.config.cjs
Monitor: pm2 logs ralph-fixer --lines 50
Stop all: pm2 stop all
The Review Queue
Pushed branches land in review-queue.json:
{
  "pending": [
    {
      "source": "ralph-loop",
      "session": "bao-fixer",
      "branch": "brl/bao-fixer/20260319-1245",
      "summary": "Fix NaN guard in parsePrice()",
      "tsc_before": 12,
      "tsc_after": 12,
      "added_at": "2026-03-19T12:45:00Z",
      "status": "pending"
    }
  ],
  "completed": []
}
A separate CTO review script runs every 2 hours. It fetches each pending branch, diffs it against main, evaluates the change, and either merges or deletes. No code reaches main without review.
The review queue uses a lockfile for thread safety — multiple Ralph sessions may try to add entries simultaneously:
(
  flock -w 10 200 || exit 1
  # ... add entry to JSON ...
) 200>"${FACTORY_DIR}/.review-queue.lock"
The Review Script
# Simplified cto-review-brl.sh

# Safety: abort any stuck merge from a previous failed run
git merge --abort 2>/dev/null || true
git reset --hard origin/main 2>/dev/null || true

for branch in $(get_pending_branches); do
  # Reset to clean state before each review
  git merge --abort 2>/dev/null || true
  git reset --hard origin/main 2>/dev/null || true
  git fetch origin "$branch"

  DIFF=$(git diff main...origin/"$branch")
  REVIEW_PROMPT="Review this change. Approve or reject with reasoning.
DIFF: ${DIFF}"

  VERDICT=$(echo "$REVIEW_PROMPT" | claude -p --model sonnet 2>&1)

  if echo "$VERDICT" | grep -qi "approve"; then
    # Use 'if git merge' — a bare 'git merge' under set -e kills the script on conflict
    if git merge --no-edit origin/"$branch"; then
      git push origin main
      git push origin --delete "$branch"
      mark_completed "$branch" "approved"
    else
      # Merge conflict — reject and move on, don't block remaining reviews
      git merge --abort 2>/dev/null || true
      git reset --hard origin/main 2>/dev/null || true
      git push origin --delete "$branch"
      mark_completed "$branch" "rejected" "merge conflict"
    fi
  else
    git push origin --delete "$branch"
    mark_completed "$branch" "rejected"
  fi
done
Token Cost Breakdown (Real Numbers)
Our factory runs 5 Ralph sessions doing ~10 cycles each per day (50 total), plus 12 CTO reviews, 12 standup rounds, and 8 session doctor analyses.
Ralph Loop (pipe mode, O(n))
Per cycle:
Input: ~2K (prompt) + ~15K (file reads, tsc output) = ~17K input tokens
Output: ~5K-8K (reasoning, commands, code changes)
Total: ~25K tokens per cycle
Daily (50 cycles): 50 × 25K = 1.25M tokens
Monthly: ~37.5M tokens
If we used interactive sessions instead (O(n²))
Per session of 10 cycles:
Cycle 1: 25K
Cycle 2: 50K
Cycle 3: 75K
...
Cycle 10: 250K
Total per session: sum(1..10) × 25K = 1.375M tokens
Daily (5 sessions × 10 cycles): 5 × 1.375M = 6.875M tokens
Monthly: ~206M tokens
The pipe mode approach costs $15-20/day. The interactive approach would cost $75-100/day for the same output. At scale (more sessions, more iterations), the gap widens further because O(n²) compounds.
When Interactive Is Worth It
For a developer running /ralph-loop locally on a focused task — say, “fix these 5 TypeScript errors, here is the plan” with --max-iterations 8 — the total cost is:
sum(1..8) × 25K = 36 × 25K = 900K tokens ≈ $2-3
Completely reasonable. The conversational context helps the agent remember what it already tried and refine its approach. At 8 iterations, the quadratic cost is barely noticeable.
The breakpoint is around 15-20 iterations. Beyond that, pipe mode wins on cost. Below that, interactive mode wins on quality per iteration.
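The crossover follows from the cost ratio, which grows linearly with iteration count: at n iterations an interactive session costs (n+1)/2 times as much as pipe mode, so by n=20 you are already paying roughly 10x for the conversational memory:

```shell
# Interactive-to-pipe token ratio: (25K * n(n+1)/2) / (25K * n) = (n+1)/2
for n in 8 20 50; do
  echo "n=$n ratio=$(( (n + 1) / 2 ))x"   # integer-rounded
done
# n=8 -> 4x, n=20 -> 10x, n=50 -> 25x
```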
Replicating This
Minimum Viable Setup
- A VPS with 4GB+ RAM (8GB recommended for concurrent `tsc` runs)
- Node.js 20+, Claude Code CLI, PM2, git
- A project with a `CLAUDE.md` that describes conventions
- One `ralph-loop.sh` script
- One PM2 ecosystem config
You can start with a single session. Add more when you are comfortable with the output quality and review process.
What You Need to Customize
The BRL prompt. Ours is tailored to a TypeScript/React codebase with tsc --noEmit and vitest as quality gates. Your project might use cargo check, pytest, go vet, or something else entirely. The structure stays the same: baseline → plan → change → evaluate → report.
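The baseline step generalizes to any toolchain that prints one line per error. A sketch — `count_errors` is an illustrative helper, not part of our scripts; adjust the pattern per tool:

```shell
# Hypothetical generalized baseline: count error lines from any build
# tool, mirroring the tsc gate in the TypeScript version.
count_errors() {
  local cmd="$1" pattern="$2"
  # $cmd is intentionally unquoted so "npx tsc --noEmit" word-splits
  # into a command plus arguments.
  $cmd 2>&1 | grep -c "$pattern" || true
}

# e.g. count_errors "npx tsc --noEmit" "error TS"
#      count_errors "cargo check"      "^error"
#      count_errors "go vet ./..."     "^vet:"
```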
The result parsing. Our RESULT: pass | description | tsc_before=N tsc_after=N format is arbitrary. Pick whatever format works for your project. The fallback detection (checking if a branch was pushed regardless of output format) is important — do not skip it.
The review criteria. Our CTO review checks TypeScript errors, test regressions, and project conventions. Yours might check different things. The principle is: automated review with clear criteria, no self-merging.
What You Should Not Customize
The branch isolation. Every cycle gets its own branch. Every session has its own namespace. No session merges to main. This is non-negotiable for safety.
The experiment log. Cross-session awareness prevents duplicate work. Without it, three sessions will all fix the same obvious bug and create three conflicting branches.
The cooldown. 10 seconds between cycles prevents API rate limiting and gives git operations time to propagate. Do not remove it.
What We Learned
Small changes compound. A single Ralph cycle fixes one TypeScript error or one failing test. Fifteen of those per day, five days a week, and you have cleaned up 75 issues that no human would have prioritized individually.
Discard rate is a feature. 30-40% of attempts get thrown away. This is not waste. This is the system being honest about what works. A 0% discard rate means the quality gate is too loose.
Conversational context is overrated for autonomous work. We assumed the agent would need to remember what it tried. It does not. The codebase is the memory. The experiment log handles cross-session coordination. Saving 25x on tokens by dropping conversation history was the best architectural decision we made.
Sonnet is the right model for loops. Opus is better at complex reasoning. But a Ralph cycle is not complex reasoning. It is: read code, find a small issue, fix it, verify. Sonnet handles this at 3x lower cost and faster response times.
PM2 is enough. We considered Docker, Kubernetes, custom orchestrators. PM2 with restart policies and log rotation handles everything we need. When a session crashes (out of memory, API timeout), PM2 restarts it. The script picks up from the next iteration. Simple tools for simple problems.
Merge conflicts will block everything if you do not handle them. We learned this the hard way. Five Ralph sessions working in parallel will occasionally produce branches that touch the same file. When the review script tries to merge one of these, git returns a non-zero exit code. If your script uses set -e (which it should), a bare git merge kills the entire process. The fix is to always wrap merges in if git merge ...; then and always reset to clean state between iterations. We had 17 reviews stuck for 24 hours because one merge conflict left the git repo in a MERGING state, and every subsequent cron run hit needs merge and silently exited. The safety pattern: git merge --abort; git reset --hard origin/main at the start of each iteration. Defensive, ugly, necessary.
The Honest Assessment
This system produces incremental improvements. It does not produce architecture. It does not make product decisions. It does not write the prompts that drive it. A human does all of that.
The Ralph Loop is a janitor, not an architect. It sweeps up TypeScript errors, guards against NaN propagation, adds missing null checks, fixes off-by-one errors in date calculations. These are real improvements that make the codebase more reliable. They are not exciting. They are not the kind of work that gets featured in AI demos.
But they are the kind of work that prevents production incidents at 3am. And having five janitors working 24/7 for $15/day is a good deal.
Links:
- Original Ralph technique: https://ghuntley.com/ralph/
- Ralph Orchestrator: https://github.com/mikeyobrien/ralph-orchestrator
- Claude Code: https://docs.anthropic.com/en/docs/claude-code
bao.markets
Author: Mr BAO
Tags
#BAOMarkets #SoftwareEngineering #BuildInPublic #Nostr #RalphLoop #ClaudeCode #DevOps #AIFactory