Commit 8d0cfcd

Try to improve CI monitoring protocol
1 parent 9c628cb commit 8d0cfcd

2 files changed: 181 additions, 43 deletions

.claude/ci/ci-watch

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
+#!/usr/bin/env -S uv run --script
+"""Watch a check-ci output file and exit when actionable.
+
+Usage: ci-watch OUTPUT_FILE [--start-offset N] [--stale-timeout SECS] [--poll-interval SECS]
+
+Exits when:
+  - FAILED: line(s) detected (exit 1)
+  - All pipelines completed (exit 0)
+  - check-ci timed out (exit 3)
+  - No new output for N secs (stale) (exit 2)
+
+Always prints "RESUME_OFFSET: <N>" before exiting so the caller can
+restart with --start-offset to skip already-processed content.
+
+Streams new lines as progress while waiting.
+"""
+
+import argparse
+import os
+import re
+import sys
+import time
+
+FAILURE_PATTERN = re.compile(r"^FAILED:", re.MULTILINE)
+SUCCESS_PATTERN = re.compile(
+    r"^(All pipelines completed|Stopping script after maximum)",
+    re.MULTILINE,
+)
+TIMEOUT_PATTERN = re.compile(r"^Timeout after \d+s", re.MULTILINE)
+
+
+def parse_args():
+    p = argparse.ArgumentParser(description=__doc__)
+    p.add_argument("output_file", help="Path to check-ci output file")
+    p.add_argument(
+        "--start-offset",
+        type=int,
+        default=0,
+        help="Start reading from this byte offset (default: 0)",
+    )
+    p.add_argument(
+        "--stale-timeout",
+        type=int,
+        default=300,
+        help="Exit after this many seconds without new output (default: 300)",
+    )
+    p.add_argument(
+        "--poll-interval",
+        type=int,
+        default=5,
+        help="Seconds between polls (default: 5)",
+    )
+    return p.parse_args()
+
+
+def read_new_content(path, offset):
+    """Read bytes from offset to EOF. Returns (new_text, new_offset)."""
+    try:
+        size = os.path.getsize(path)
+    except OSError:
+        return "", offset
+    if size <= offset:
+        return "", offset
+    with open(path, "r", errors="replace") as f:
+        f.seek(offset)
+        text = f.read()
+    return text, size
+
+
+def find_matching_lines(text, pattern):
+    return [line for line in text.splitlines() if pattern.search(line)]
+
+
+def watch(output_file, start_offset, stale_timeout, poll_interval):
+    offset = start_offset
+    last_activity = time.monotonic()
+    full_text = ""
+
+    while True:
+        new_content, offset = read_new_content(output_file, offset)
+
+        if new_content:
+            last_activity = time.monotonic()
+            sys.stdout.write(new_content)
+            if not new_content.endswith("\n"):
+                sys.stdout.write("\n")
+            sys.stdout.flush()
+            full_text += new_content
+
+            failures = find_matching_lines(full_text, FAILURE_PATTERN)
+            if failures:
+                print("\n=== FAILURES DETECTED ===")
+                print("\n".join(failures))
+                completions = (
+                    find_matching_lines(full_text, SUCCESS_PATTERN)
+                    + find_matching_lines(full_text, TIMEOUT_PATTERN)
+                )
+                if completions:
+                    print("=== FINAL STATUS ===")
+                    print("\n".join(completions))
+                else:
+                    print("=== check-ci still running ===")
+                print(f"RESUME_OFFSET: {offset}")
+                return 1
+
+            timeouts = find_matching_lines(full_text, TIMEOUT_PATTERN)
+            if timeouts:
+                print("\n=== FINAL STATUS (check-ci timed out) ===")
+                print("\n".join(timeouts))
+                print(f"RESUME_OFFSET: {offset}")
+                return 3
+
+            completions = find_matching_lines(full_text, SUCCESS_PATTERN)
+            if completions:
+                print("\n=== FINAL STATUS ===")
+                print("\n".join(completions))
+                print(f"RESUME_OFFSET: {offset}")
+                return 0
+
+        elapsed = time.monotonic() - last_activity
+        if elapsed >= stale_timeout:
+            print(f"\n=== STALE: no new output for {stale_timeout}s ===")
+            tail = full_text.rstrip("\n").splitlines()[-5:]
+            if tail:
+                print("Last lines:")
+                print("\n".join(tail))
+            print(f"RESUME_OFFSET: {offset}")
+            return 2
+
+        time.sleep(poll_interval)
+
+
+def main():
+    args = parse_args()
+    sys.exit(watch(args.output_file, args.start_offset, args.stale_timeout, args.poll_interval))
+
+
+if __name__ == "__main__":
+    main()
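
Review note on `read_new_content`: it measures the file size first, then reads to EOF in text mode, and returns the earlier size as the new offset. If check-ci appends between those two steps, the extra text may be read again on the next poll. The sketch below is one possible hardening, not part of the commit: it reads in binary and advances the offset by the bytes actually read. The name `read_new_bytes` and the UTF-8 assumption are hypothetical.

```python
import os
import tempfile


def read_new_bytes(path, offset):
    """Return (new_text, new_offset) for everything after byte `offset`.

    Hypothetical variant of read_new_content; assumes the log is UTF-8.
    """
    try:
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read()
    except OSError:
        # Missing or unreadable file: report nothing new, keep the offset.
        return "", offset
    # Advance by the number of bytes actually read, so content appended
    # while we were reading is neither skipped nor read twice.
    return data.decode("utf-8", errors="replace"), offset + len(data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "check-ci.output")
        with open(path, "wb") as f:
            f.write(b"job build passed\n")
        text, offset = read_new_bytes(path, 0)
        print(repr(text), offset)
        # A second call from the returned offset yields nothing new.
        print(read_new_bytes(path, offset))
```

Because the returned offset is still a byte count, it would remain compatible with the `--start-offset` and `RESUME_OFFSET` convention described in the docstring.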

.claude/ci/index.md

Lines changed: 42 additions & 43 deletions
@@ -179,7 +179,7 @@ curl -s -H "PRIVATE-TOKEN: $GITLAB_PERSONAL_ACCESS_TOKEN" \
 "https://gitlab.ddbuild.io/api/v4/projects/355/jobs/<JOB_ID>/trace"
 ```

-### Monitoring a pipeline
+### Checking CI (GitLab)

 Use `.claude/ci/check-ci` to follow a pipeline until all jobs complete.

@@ -192,7 +192,7 @@ Exit codes: 0 = all passed, 1 = failures or threshold reached.

 #### Invocation pattern

-Available options: `--commit <ref>`, `--pipeline <id>`,
+Available options: `--commit <ref>` OR `--pipeline <id>`,
 `--discovery-timeout <s>` (default 60), `--poll-interval <s>` (default 60),
 `--max-failures <n>` (default 50), `--timeout <s>` (default 7200 = 2 h),
 `--list-jobs` (see below).
@@ -204,9 +204,7 @@ immediately — does not monitor or download logs. Useful for a quick
 snapshot of what ran and what failed:

 ```bash
-export GITLAB_PERSONAL_ACCESS_TOKEN=$(jq -r \
-  '.mcpServers.gitlab.env.GITLAB_PERSONAL_ACCESS_TOKEN' ~/.claude.json)
-PYTHONUNBUFFERED=1 .claude/ci/check-ci --pipeline <id> --list-jobs
+.claude/ci/check-ci --commit HEAD --list-jobs
 ```

 Output format:
@@ -218,6 +216,11 @@ Pipeline 105413994 (status: failed):
 ...
 ```

+#### Monitor CI
+
+If --list-jobs is not passed, check-ci will run until all monitored pipelines
+finish, until a timeout, or until the maximum number of failures is reached.
+
 **Step 1 — Start check-ci in the background (Bash tool,
 `run_in_background: true`):**

@@ -234,54 +237,50 @@ Output is being written to: /path/to/tasks/<id>.output
 ```
 Note that path — it is the output file for the next step.

-**Step 2 — Launch a Haiku agent IN THE FOREGROUND (not backgrounded) with
-this prompt, substituting the actual output-file path.**
-
-**Do NOT read or poll the output file yourself — all monitoring must happen
-inside the Haiku subagent.**
+**Step 2 — Run ci-watch in the background (Bash tool,
+`run_in_background: true`):**

+```bash
+.claude/ci/ci-watch [--start-offset N] OUTPUT_FILE
 ```
-Monitor the CI output file OUTPUT_FILE and report when there is
-something to act on.
-
-Loop with ~60 s sleeps. Each iteration, run TWO Bash commands:

-1. grep "FAILED:" OUTPUT_FILE
-2. grep -E "All pipelines completed|Stopping script after maximum" OUTPUT_FILE
+`ci-watch` tails the output file and exits when there is something to
+act on. Run it with `run_in_background: true` — you will be notified
+when it completes. While it runs, you can do other work.

-After EACH iteration, check the grep results (do NOT skip to the
-next sleep). Exit as soon as ONE condition is met:
+Exit codes:
+- 0 — all pipelines completed (no failures)
+- 1 — one or more FAILED: lines detected
+- 2 — stale: no new output for 5 minutes
+- 3 — check-ci timed out

-- If grep 1 found matches: exit immediately. Return ALL the
-  FAILED: lines. Do NOT wait for more failures.
-- If grep 2 found a match: return that line (it has the final
-  passed/failed counts).
+On exit, ci-watch always prints `RESUME_OFFSET: <N>`. Record this
+value — pass it as `--start-offset N` when re-running ci-watch to
+skip already-processed content and wait for further failures.

-Before exiting, call the speak_when_done MCP tool:
-- "All CI jobs passed" if the final-status line shows failed=0.
-- "<N> CI jobs failed" otherwise.
-
-If you exit due to FAILED: lines and grep 2 had no match, say
-the script is still running so the main agent knows to come back.
-```
+When ci-watch completes, immediately call the `speak_when_done` MCP tool:
+- "All CI jobs passed" if exit 0.
+- "<N> CI jobs failed" if exit 1 (count is
+  `grep "^FAILED:" OUTPUT_FILE | wc -l`).
+- "CI monitor timed out" if exit 2 or 3.

-**Step 3 — Main agent acts on the result**
+**Step 3 — Act on the result**

-When the Haiku agent returns, you (the main agent) decide what to do:
+Choose among these actions, as appropriate:

 - **Just report:** summarise the result to the user and stop.
-- **Investigate failures:** read `fail_logs/<job_id>.log` under the output
-  directory for each failed job and diagnose the root cause. Once done,
-  check again the file with the list of failures -- maybe new ones were
-  reported in the interim and investigate them or just move to the next step
-  (kill the script).
-- **Kill the script:** if `check-ci` is still running and you want to stop
-  monitoring, kill it by its PID (noted from Step 1).
-- **Push fixes**: if a) the user asked you to (NOT OTHERWISE), AND b) you have
-  made changes to fix the CI failures AND c) the current branch has an upstream
-  branch, then commit and push. Then go back to step one. If any of the three
-  preconditions don't match, stop and report the results (and your findings, if
-  any).
+- **Investigate failures:** read `fail_logs/<job_id>.log` under the
+  output directory for each failed job and diagnose the root cause.
+- **Wait for more failures:** if check-ci is still running and you want
+  to keep watching after investigating, re-run ci-watch with
+  `--start-offset <RESUME_OFFSET>` (back to Step 2).
+- **Kill check-ci:** if you want to stop monitoring entirely, kill it
+  by its task ID or PID (noted from Step 1).
+- **Push fixes**: if a) the user asked you to (NOT OTHERWISE), AND b)
+  you have made changes to fix the CI failures AND c) the current
+  branch has an upstream branch, then commit and push. Then go back to
+  Step 1. If any of the three preconditions is not met, stop and
+  report the results (and your findings, if any).

 ### Downloading artifacts

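
Review note on Step 2 of `index.md`: the caller has to do two things with a finished ci-watch run. It must recover the `RESUME_OFFSET: <N>` value and turn the exit code into the phrase passed to `speak_when_done`. A minimal sketch of both follows, assuming the caller has ci-watch's captured stdout and the text of the check-ci output file as strings. The helper names `parse_resume_offset` and `spoken_summary` are hypothetical and not part of the commit.

```python
import re

# Both patterns mirror strings that appear in the commit.
RESUME_RE = re.compile(r"^RESUME_OFFSET: (\d+)$", re.MULTILINE)
FAILED_RE = re.compile(r"^FAILED:", re.MULTILINE)


def parse_resume_offset(watch_stdout):
    """Return the last RESUME_OFFSET value printed by ci-watch, or None."""
    matches = RESUME_RE.findall(watch_stdout)
    return int(matches[-1]) if matches else None


def spoken_summary(exit_code, output_file_text):
    """Map a ci-watch exit code to the phrase listed in Step 2."""
    if exit_code == 0:
        return "All CI jobs passed"
    if exit_code == 1:
        # Same count as: grep "^FAILED:" OUTPUT_FILE | wc -l
        count = len(FAILED_RE.findall(output_file_text))
        return f"{count} CI jobs failed"
    if exit_code in (2, 3):
        return "CI monitor timed out"
    raise ValueError(f"unexpected ci-watch exit code: {exit_code}")


if __name__ == "__main__":
    stdout = "FAILED: job a\n=== check-ci still running ===\nRESUME_OFFSET: 128\n"
    print(parse_resume_offset(stdout))
    print(spoken_summary(1, "FAILED: job a\nok\nFAILED: job b\n"))
```

The failure count is taken from the whole output file rather than from ci-watch's stdout, because a run resumed with `--start-offset` only sees content after that offset.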