Monitoring with Claude Code

Most of the now-playing kiosk was built with my autopilot workflow, which I've written about before. I describe a feature, it gets planned, built, and reviewed, and it ships. For most of the project that's the whole story. But the kiosk's job is to recognize records playing on a real turntable through a real audio chain, and the autopilot has no way to drop a needle and check that what it built actually works. For anything that touches the hardware, "the autopilot shipped it" isn't the same as "it works," and the only way to tell the difference is to play a record and watch.

So that's what I do, and have done since the first day I wired the turntable in: deploy the change to the Pi, put a record on, and watch the live log while it plays.

Watching it run on the Pi

The watching happens inside the same Claude Code session I'm already working in. I have Claude stream the Pi's log through Claude Code's Monitor tool, available since Claude Code v2.1.98. The docs describe it plainly: "Claude writes a small script for the watch, runs it in the background, and receives each output line as it arrives." From one of my sessions, the actual tool input looks like this:

{
  "description": "vinyl recognitions on Pi",
  "persistent": true,
  "command": "ssh nowplaying-pi 'journalctl -u nowplaying-orchestrator -f --no-pager -n 0 -o cat' | grep --line-buffered -E \"recognize: method=shazam|recognize: method=unmatched|publish: clients.*title=\""
}

The tool takes a description for the progress display, the command to run, and one of two lifecycle options: a timeout_ms for runs that should end on their own, or persistent: true for runs that should stay alive until I stop them. Mine uses persistent because I want the stream open while I work, and I stop it explicitly with TaskStop when the record ends. Monitor inherits Bash's permission rules, so anything you've allowed for Bash is allowed in a monitor too.
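
To see what that filter lets through, the same grep expression can be tried locally without the Pi. This is a minimal sketch: the pattern is copied from the Monitor command above, but the sample lines are made up for illustration (the real orchestrator log format isn't shown here beyond the fragments in the pattern), and filter_events and sample_log are hypothetical helper names.

```shell
# Same extended-regex pattern as the Monitor command above.
PATTERN='recognize: method=shazam|recognize: method=unmatched|publish: clients.*title='

# filter_events: hypothetical helper; reads log lines on stdin, prints matches.
# --line-buffered makes grep flush after every line, so each event is passed
# along as it happens instead of waiting for a full output buffer.
filter_events() {
  grep --line-buffered -E "$PATTERN"
}

# Made-up sample lines, for illustration only.
sample_log() {
  printf '%s\n' \
    'recognize: method=shazam' \
    'capture: clip written' \
    'recognize: method=unmatched' \
    'publish: clients=1 title=Example'
}

# Prints the three matching lines and drops the capture: line.
sample_log | filter_events
```

In the real command, journalctl's -f follows the journal as it grows, -n 0 skips the backlog so only new lines arrive, and -o cat prints just the message text without timestamps or other metadata.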

When the monitor starts, Claude Code tells the model not to sit and wait on it: "You will be notified on each event. Keep working — do not poll or sleep. Events may arrive while you are waiting for the user — an event is not their reply." So the stream runs underneath the conversation. I put a record on, narrate what I'm hearing — "dropping the needle now," "we just flipped to side B" — and each matching log line comes back as its own event:

<task-notification>
<summary>Monitor event: "vinyl recognitions on Pi"</summary>
<event>… nowplaying.main recognize: method=shazam …</event>
</task-notification>

Claude didn't reach for the Monitor tool on its own at first. It tried to poll with sleep 30 && ssh …, and the harness blocked it: "To wait for a condition, use Monitor with an until-loop… Do not chain shorter sleeps to work around this block." That block is what turned a polling habit into a real stream.
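
For reference, an until-loop in shell is a loop that re-runs a check until it succeeds. The sketch below shows only the general shape, not what the harness or Claude actually ran: condition_met is a hypothetical check, stubbed with a counter so the example finishes on its own. In a real watch it would be something like a grep against a log.

```shell
# condition_met: hypothetical check, stubbed so it succeeds on the third call.
attempts=0
condition_met() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}

# Re-run the check until it succeeds, pausing briefly between tries.
until condition_met; do
  sleep 1
done

echo "condition met after $attempts checks"
```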

Watching means lining up three things: the log stream, which is the orchestrator's reasoning; what the kiosk screen is showing; and the record that's actually playing, which is the one signal only I can supply. The bug is usually in the gap between them: the log says one thing and the screen shows another, or both look fine and neither matches what's on the platter. The grep filter at the end of the pipeline grew as I learned which lines mattered, starting at recognize: and publish: and picking up fingerprint:, promotion:, pin applied, and the side-flip and anchor events as the debugging moved around. It isn't a clean feed: quiet passages and runout grooves produce stretches of nothing useful, and when two records play close together the notifications cross.
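
One way to keep a growing filter manageable is to hold the terms as a list and join them into a single alternation. This is a sketch of that idea, not the command I actually ran: the terms are the ones named above, the side-flip and anchor events are left out because their exact log text isn't given here, and watch_filter is a hypothetical helper name.

```shell
# Terms named in the text, one per line.
terms='recognize:
publish:
fingerprint:
promotion:
pin applied'

# Join the lines with | to form one extended-regex alternation.
pattern=$(printf '%s\n' "$terms" | paste -sd '|' -)
echo "$pattern"

# watch_filter: hypothetical helper that applies the joined pattern to stdin.
watch_filter() {
  grep --line-buffered -E "$pattern"
}
```

Adding a new kind of event is then a one-line change to the list rather than an edit inside a long quoted regex.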

What it catches

The kiosk got debugged this way from the very first day. The first time I wired the turntable into the Pi I didn't write a test — I told Claude "let's wire in the vinyl, I've got the browser open and am watching," started the stream, and dropped a needle. Within a few clips Claude had it: "Found the bug — clip is 11 seconds of silence + 1 second of music. The rolling buffer captures retroactively from before the needle drop. ShazamIO needs ~3-5s of music to recognize." The capture window was grabbing the wrong stretch of audio. A unit test wouldn't have found it, because a test feeds in a clean clip — the whole problem was when the real clip got taken.
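
The arithmetic behind that diagnosis is simple enough to sketch. This is a simplified model, not the kiosk's capture code: it assumes a 12-second clip (the 11 + 1 seconds from the quote) that ends at the moment of capture, and uses 3 seconds, the low end of the quoted "~3-5s", as the minimum. music_seconds and MIN_MUSIC are hypothetical names.

```shell
# music_seconds CLIP_LEN SINCE_DROP
# Seconds of music in a clip that ends now and reaches CLIP_LEN seconds back,
# when the needle dropped SINCE_DROP seconds ago.
music_seconds() {
  clip_len=$1
  since_drop=$2
  if [ "$since_drop" -lt "$clip_len" ]; then
    echo "$since_drop"   # the clip still reaches back before the needle drop
  else
    echo "$clip_len"     # the whole clip is music
  fi
}

MIN_MUSIC=3   # assumed threshold: low end of the quoted "~3-5s"

m=$(music_seconds 12 1)
echo "music in clip: ${m}s"
if [ "$m" -lt "$MIN_MUSIC" ]; then
  echo "too little music to recognize"
fi
```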

Later I had the autopilot ship a fingerprint-coverage feature, where the kiosk learns a record's tracks from confirmed plays. It passed its tests and the review signed off. Then I played a record for seven minutes and watched heartbeat after heartbeat come back method=unmatched with not one promotion: line. The feature that worked in every test was never firing on the Pi. Claude's read: "It demos perfectly in tests but doesn't fire on the real Pi."
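
The same pattern is easy to confirm from a saved stretch of log by counting lines. This is a sketch with made-up sample lines standing in for the real session; count_matches and sample_session are hypothetical helper names.

```shell
# count_matches TEXT: count the stdin lines that contain TEXT.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
count_matches() {
  grep -c -- "$1" || true
}

# Made-up stand-in for a few heartbeats from the session.
sample_session() {
  printf '%s\n' \
    'recognize: method=unmatched' \
    'recognize: method=unmatched' \
    'recognize: method=unmatched'
}

unmatched=$(sample_session | count_matches 'method=unmatched')
promotions=$(sample_session | count_matches 'promotion:')
echo "unmatched=$unmatched promotions=$promotions"
```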

The stream also lets me catch the kiosk being wrong before it settles. Testing recognition on J Dilla's Donuts — "let's go with donuts. i'll start it now. let's capture this" — the first publish came back as the wrong track, and I caught it the moment it hit the screen: "that's the wrong song though." The log line that produced it was already in front of us, so the fix started from evidence instead of a guess.

What the stream can't give you

The stream doesn't catch the agent being confidently wrong. During one session Claude narrated every heartbeat as "still gated correctly" while the kiosk kept matching the wrong artist. It wasn't until I looked up at the screen, saw the wrong record displayed, and sent a photo that it turned around: "You're right — I was wrong. I kept saying those were 'gated correctly' but they clearly aren't." It could reason confidently from the log stream, but it couldn't see the screen, and the screen was what settled it.

The same limit showed up in the timing. Claude was tracking elapsed time to predict the next track and announced "Approaching end of Leo (2:15 of 3:06)." I had to correct it: "the issue is that you thought we were 2:15 into leo when we were 3:06 into leo because we didn't start counting leo until we matched. but we had already been in leo for at least 30 seconds." The clock in the log started when recognition fired, not when the needle actually landed — a gap only someone in the room could see. The stream gives the agent everything the software knows; what's actually happening in the room is the part I have to supply.
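
Putting numbers on that gap, using only the two timestamps quoted above (to_seconds is a hypothetical helper name):

```shell
# to_seconds M:SS: convert a minutes:seconds timestamp to total seconds.
to_seconds() {
  min=${1%%:*}
  sec=${1##*:}
  sec=${sec#0}      # drop a leading zero so "06" isn't read as octal
  sec=${sec:-0}
  echo $((min * 60 + sec))
}

claimed=$(to_seconds 2:15)   # where the agent thought the track was
actual=$(to_seconds 3:06)    # where the track actually was
echo "estimate is $((actual - claimed))s behind"
```

That works out to 135 seconds against 186, a 51-second shortfall, which fits "at least 30 seconds" of music having played before the first match.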

Where the autopilot hands off

The autopilot does the capture, the plan, the implementation, and the review, and hands me a feature that passes its tests. For most of the project that's the same as done. For anything that depends on the hardware it isn't — the feature is done when I've put a record on and watched it work, and that's the one step the autopilot can't run. So I run it myself: deploy the change, start the stream, and drop the needle.

The now-playing source is at github.com/schuettc/now-playing.