Original Reddit post
I just wrapped up a 9 h 27 min session where Claude Code chained 4 self-paced /goal commands and produced 45 commits, 14 259 lines of code/docs, 4.16 million rows of data ingested from public registries, and one fairly long retex (lessons-learned write-up). Here's what happened, how I structured it, and what surprised me.
What /goal actually is
Claude Code has a slash command, /goal. It sets a session-scoped "Stop hook condition": Claude can't end its turn until an LLM evaluator decides the condition is met. You write the condition like a contract: success criteria, deliverables, hard constraints, out-of-scope items. Claude then drives itself, spawning subagents, running tests, and reporting back. You can interrupt anytime.
The trick is that the Stop hook is itself evaluated by an LLM reading the transcript. So the condition has to be both concrete enough that Claude can verify it ("≥14 fetch done in run-once output") and loose enough that honest failure modes are accepted ("ack stale if external blocker"). Get either wrong and you either loop forever or you get a fake "done".
The task
Project: horos55, a Go data orchestrator with ~40 adapters pulling open data from data.gouv.fr, INSEE, EBA, GLEIF, GeoNames, etc. About 22 were failing in production. Yesterday I had Claude audit them all, classify them into 6 categories (including network, parser, structural, secrets, and license), and queue 22 tracking Jobs in the project's SQLite ledger.
Today's /goal was strict:
"14 fix code + 3 ack stale + 1 abandon. 0 Job queued left."
That line summarizes a 4000-character contract. The Stop hook refused to clear until that exact taxonomy was met.
How the run unfolded
The session structured itself into 5 successive passes:
Each pass spawned 1 subagent on average (horos55-coder-go, a custom profile I have). The 5th pass found:
- SSA Baby Names blocked by WAF → switched to the hadley/data-baby-names GitHub mirror
- INSEE NAF resource ID expired → pivoted to the data.grandlyon.com CSV (same INSEE source upstream)
- EBA Credit Institutions auth-walled → switched to the ECB MFI list (Monetary Financial Institutions, equivalent dataset, public domain)
The big lesson: "audit URL ≠ audit parser ≠ fix runtime"
In an earlier session I had Claude do a "deep audit" of all 22 broken adapters: WebFetch each candidate URL, verify HTTP 200, recommend a fix. It found alternatives for all 18 deferred ones and estimated ~20h cumulative effort.
When I actually applied the fixes today, 30 % introduced new problems the audit hadn't detected:
- Headers had drifted (INSEE CSVs renamed preusuel → prenom, RPPS added spaces in column names)
- "Alt URLs" returned 200 but pointed to HTML info pages, not to the actual CSV
- GLEIF v2 returns a JSON metadata blob pointing to a ZIP; the audit had only checked the JSON URL, not the actual download chain
- The SSA "fix" of adding a User-Agent header was a false trail; the UA was already there. The actual cause was geoblocking.
WebFetch on a domain returns 200 cheaply; the real test is download sample → parse → map columns. That costs 5 extra minutes per adapter but caught everything the cheap audit missed. The 2nd and 3rd passes were doing exactly that retroactively.
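A minimal sketch of that deeper check, in Python for brevity (horos55 itself is Go). Everything here is an assumption for illustration: the function name deep_check, the expected column set, and the idea of running it on an already-downloaded sample. It only shows why "HTTP 200" is not enough: the body has to not be HTML, has to parse as CSV, and has to carry the columns the adapter maps.

```python
import csv
import io
from typing import List, Set


def deep_check(body: bytes, expected_columns: Set[str],
               delimiter: str = ",", encoding: str = "utf-8") -> List[str]:
    """Return the problems found in a downloaded sample; an empty list means OK.

    Hypothetical helper: the name and signature are illustrative only.
    """
    text = body.decode(encoding, errors="replace")

    # A 200 response can still be an HTML info page instead of the dataset.
    head = text.lstrip()[:200].lower()
    if head.startswith("<!doctype html") or head.startswith("<html"):
        return ["body is an HTML page, not a CSV"]

    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return ["empty body"]

    problems = []

    # Header drift, case 1: stray spaces around column names.
    padded = [c for c in header if c != c.strip()]
    if padded:
        problems.append(f"columns with stray whitespace: {padded}")

    # Header drift, case 2: renamed or dropped columns.
    missing = expected_columns - {c.strip() for c in header}
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")

    # A header with no rows is not a usable sample either.
    if next(reader, None) is None:
        problems.append("no data rows")

    return problems
```

Run against a sample whose header still says preusuel while the adapter expects prenom, this reports the missing column instead of a green check. It does not cover the GLEIF case, where the first URL returns JSON pointing to a ZIP; that needs the same check applied at the end of the download chain.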
What worked
Iterative auditing, not exhaustive auditing. The progression 29 → 64 → 79 → 100 % is non-trivial. Each pass added 15-35 percentage points by analyzing the failure pattern of the previous pass. Three short audits beat one long audit.
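The per-pass gains can be recomputed from the progression figures quoted above. This is plain arithmetic on the four percentages from the post; the variable names are mine.

```python
# Completion percentages quoted in the post, in order.
progression = [29, 64, 79, 100]

# Gain of each step, in percentage points.
gains = [after - before for before, after in zip(progression, progression[1:])]

smallest_gain = min(gains)
largest_gain = max(gains)
```

The gains come out as 35, 15 and 21 points, which matches the "15-35 percentage points" range.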
Subagents that say "no". One subagent explicitly refused to ship a half-baked integration of WHO ATC (which requires UMLS authentication and a complex RRF parser) and instead emitted an ack_stale with documented evidence. That saved a runtime timeout I would have had to debug later.
Strict taxonomy in the /goal. The condition 14 + 3 + 1 = 18 matched exactly 18 Jobs in the ledger. Every Job had to terminate in one bucket. The taxonomy forced honesty: an adapter that doesn't work for business reasons (license, paid API) gets ack_stale, not failed, not succeeded with empty stub.
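The bucket rule is mechanical enough to sketch. Below is a hedged Python illustration of "every Job lands in exactly one accepted bucket and the counts match the contract". The names check_goal and TARGET and the dict-based input are assumptions of mine, not how /goal or the ledger actually represent it.

```python
from collections import Counter
from typing import Dict, List, Optional

# Bucket counts taken from the quoted condition: 14 + 3 + 1 = 18.
TARGET = {"fix_code": 14, "ack_stale": 3, "abandon": 1}


def check_goal(job_buckets: Dict[str, Optional[str]],
               target: Dict[str, int] = TARGET) -> List[str]:
    """Return unmet parts of the condition; an empty list means it is met.

    job_buckets maps a job id to its final bucket (None if still queued).
    Hypothetical helper for illustration only.
    """
    problems = []

    # Any job outside the accepted buckets (or still queued) is a violation.
    for job, bucket in sorted(job_buckets.items()):
        if bucket not in target:
            problems.append(f"job {job}: not in an accepted bucket ({bucket!r})")

    # The per-bucket counts must match the contract exactly.
    counts = Counter(b for b in job_buckets.values() if b in target)
    for bucket, want in target.items():
        if counts[bucket] != want:
            problems.append(f"{bucket}: have {counts[bucket]}, want {want}")

    return problems
```

A check like this returns nothing once the ledger is at 14 + 3 + 1, and keeps listing offending Jobs for as long as any of them sits outside the three buckets.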
Persistent SQLite ledger as source of truth. Live retest hit the file every minute. The DB knew which adapter had a successful fetch and how many rows. No "trust me bro" — the data was on disk.
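To make "the DB knew" concrete, here is a small Python sketch using the standard sqlite3 module. The jobs table, its columns, and the status values are all invented for illustration; the real horos55 ledger schema is not shown in this post.

```python
import sqlite3

# Hypothetical schema: the real ledger layout is not described in the post.
SCHEMA = """
CREATE TABLE jobs (
    id            INTEGER PRIMARY KEY,
    adapter       TEXT    NOT NULL,
    status        TEXT    NOT NULL,           -- e.g. queued, succeeded, failed
    rows_ingested INTEGER NOT NULL DEFAULT 0
);
"""


def revived_adapters(conn: sqlite3.Connection):
    """Adapters with at least one succeeded job that actually wrote rows."""
    cur = conn.execute(
        "SELECT adapter, MAX(rows_ingested) FROM jobs "
        "WHERE status = 'succeeded' AND rows_ingested > 0 "
        "GROUP BY adapter ORDER BY adapter"
    )
    return cur.fetchall()


def queued_count(conn: sqlite3.Connection) -> int:
    """Number of jobs still waiting; the goal wanted this at 0."""
    return conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE status = 'queued'"
    ).fetchone()[0]
```

The point of the design is that "succeeded" alone is not trusted: a succeeded job with zero rows does not count as a revived adapter, which is what rules out the empty-stub outcome.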
What broke
Stop hook strictness vs reality. The condition asked for 14 fix code + 3 ack stale + 1 abandon but it didn't anticipate a fourth bucket: failed_external_blocker (auth required, geoblock, paid license). After 4 passes I had 11 + 3 + 1 + 3. The Stop hook bounced 4 times asking why I wasn't at 14. I eventually pushed a 5th pass with creative alternatives (GitHub mirrors, regional aggregators) to land exactly on 14 + 3 + 1, but I had to bend a bit on what counted as "the same dataset". The taxonomy was useful but slightly too narrow.
Audit overhead is real. 11 899 lines of audit markdown for 14 259 total LOC added. That's 83 % docs. Half is genuinely useful lessons-learned material for next time; half is documentation theater. Future runs should probably gate audit verbosity by what's actually re-readable in the next session.
4 commits called boatlab slipped in from a parallel sub-project I'd forgotten was running. Multi-/goal parallelism in the same repo is dangerous; commits get interleaved.
Numbers, if you like numbers
9 h 27 min wall clock (including breaks, eating, the user replying)
45 commits (41 on this work + 4 from the parallel boatlab project)
41 subagent invocations across 5 different agent profiles
14 259 lines added, 2 362 removed (net +11 897)
67 Jobs created in the ledger (51 succeeded, 15 failed, 1 left queued)
23 catalog Objects, 3 new actions seeded
26 audit directories, 94 markdown files
4 156 914 rows ingested live across 14 revived adapters (top: GLEIF 3.3M LEIs, FINESS 242k French health facilities, INSEE 48k French first names)
0 regressions on the 17 pre-existing healthy adapters
What I'd do differently
Test live before audit. A 30-second --run-once would have shown me upfront that 91 % of the hard-coded URLs were 4xx/5xx, which would have changed my strategy day one instead of discovering it on pass 1.
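A rough Python sketch of that kind of upfront liveness pass, using only the standard library. This is not what --run-once does internally (the post doesn't show that); http_status and failing_share are hypothetical helpers, and the HEAD request is my own choice, which some servers reject even when GET works.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for url, or 0 if the request never got a response."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; keep the status code.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def failing_share(urls: Iterable[str],
                  fetch_status: Callable[[str], int] = http_status) -> float:
    """Fraction of URLs answering 4xx/5xx or not answering at all."""
    statuses = [fetch_status(u) for u in urls]
    if not statuses:
        return 0.0
    bad = sum(1 for s in statuses if s == 0 or s >= 400)
    return bad / len(statuses)
```

The fetch function is injectable so the share can be computed from recorded statuses without touching the network. A low failing share still says nothing about parsers or column mappings; it only tells you early how many hard-coded URLs are dead.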
Encode "external blocker" in the goal taxonomy. fix_code | ack_stale | abandon | external_blocker is a more honest 4-bucket model than 14 + 3 + 1.
Set a Stop hook ceiling. I should cap it at max 3 retries on the same finding category, so that 4 Stop hook re-fires don't force 4 extra passes I might not have needed.
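The counting logic behind such a ceiling is small. Here is a Python sketch of just that part; the class name RetryCeiling is made up, and whether a /goal condition can actually express or enforce this is not something I've verified.

```python
from collections import Counter


class RetryCeiling:
    """Track bounces per finding category and cut off after max_retries.

    Hypothetical helper: it only counts; it does not hook into Claude Code.
    """

    def __init__(self, max_retries: int = 3):
        self.max_retries = max_retries
        self.seen = Counter()

    def should_retry(self, category: str) -> bool:
        """Record one more bounce for category; False once past the ceiling."""
        self.seen[category] += 1
        return self.seen[category] <= self.max_retries
```

With max_retries=3, the fourth bounce on the same category returns False, which is the point where the run should stop and record the finding as an honest failure instead of starting another pass.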
Smaller goals. A single 4000-char /goal chained 5 passes. Two goals of 2000 chars each, with explicit checkpoint between them, would have been clearer.
TL;DR
Claude Code's /goal with a strict Stop hook is the most autonomy-friendly setup I've used. It works because the hook is itself an LLM reading the transcript: it can detect bullshit, force honest categorization, and refuse to let you ship empty stubs. The cost is that you have to write your conditions like contracts, with bucketed taxonomies and verifiable deliverables, and you have to accept that "honest fail" outputs are first-class.
The big methodological takeaway: iterative auditing dominates exhaustive auditing. Three 10-minute audits where each reads the failures of the previous one beat one 30-minute audit. Same total cost, much higher precision.
If you're running long autonomous sessions and your model just rubber-stamps "done" without checking, you're using the wrong harness. Put a strict Stop hook on it. It will refuse to lie.
Counter-questions welcome. Repo is private but the metrics, lessons-learned write-up, and commit log are reproducible; happy to share the redacted JSON if anyone's curious about the actual numbers.
Originally posted by u/hazyhaar on r/ClaudeCode