RAXXO Studios 9 min read

GitHub Actions Crons That Actually Stay Green

Automation
TLDR
  • 7 daily crons, 2 starvation incidents that triggered the rewrite
  • Health checks before work, not after, catch silent failures
  • Queue-low alarm fires at 5 items, not at zero
  • A cron is ignorable for 3 weeks only when failures are loud

I run 7 GitHub Actions crons every day, and for two months I never looked at them. Then a content queue starved silently and I posted nothing for 4 days before noticing. Here is what I changed so a cron can stay green and be ignorable for 3 weeks straight.

The Two Incidents That Forced The Rewrite

The first starvation happened on a Tuesday. My image generation cron pulled prompts from a queue, made the assets, and pushed them to a publish queue. The image API returned a 429 (rate limited) and the job exited cleanly with a green checkmark. GitHub Actions reported success. The workflow logs said "0 prompts processed" in a line I never read. For 4 days the publish queue drained and nothing refilled it. I found out because a follower asked why I went quiet.

The second incident was sneakier. A cron that calls an external API hit an auth token that had expired. The script caught the error, logged it, and returned exit code 0 because I had wrapped the whole thing in a try/except that swallowed everything. Green check, no work done. This one ran for 6 days before I caught it during an unrelated debug session.

Both failures shared one root cause: a green checkmark in GitHub Actions means the process exited zero, not that the work happened. Those are completely different claims. A cron that catches its own errors and exits clean is lying to you in the most polite way possible.

After the second incident I sat down and wrote out what I actually wanted. I wanted to never look at these workflows unless something was wrong. I wanted "wrong" to be loud. And I wanted the loudness to arrive before the damage, not after.

That meant three changes. First, the exit code had to reflect real work, so swallowed exceptions had to re-raise or set a failure flag. Second, the queue itself needed a low-water alarm that fired while there was still time to react. Third, every cron needed a health check that ran before the real work, so a broken token or a dead API surfaced as a failed job rather than a quiet no-op.
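A minimal sketch of the first change, the failure-flag variant, in Python. The `run_batch` helper and the `process` callable are hypothetical names for illustration, not taken from the real workflow. The point is only that the returned exit code depends on work actually done, not on the script reaching its last line.

```python
import sys
from typing import Callable, Iterable


def run_batch(prompts: Iterable[str], process: Callable[[str], object]) -> int:
    """Process every prompt and return an exit code that reflects real work.

    `process` is a hypothetical stand-in for the real per-item call.
    """
    processed = 0
    failed = 0
    for prompt in prompts:
        try:
            process(prompt)
            processed += 1
        except Exception as exc:
            # Record the failure instead of swallowing it.
            failed += 1
            print(f"FAILED {prompt!r}: {exc}", file=sys.stderr)

    print(f"{processed} prompts processed, {failed} failed")

    # "0 prompts processed" or any recorded failure is not a success.
    if processed == 0 or failed > 0:
        return 1
    return 0
```

The script's entry point would then end with `sys.exit(run_batch(prompts, process))`, so a run that did nothing shows up red instead of green.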

If you want the deeper context on how I structure these automations, see Claude Blueprint, which covers the whole agent setup I lean on for this kind of work.

Health Checks Run First, Not Last

The biggest pattern shift was moving health checks to the front of every workflow. The old order was: do the work, then check if it worked. The new order is: prove you can do the work, then do it.

A health check at the top of the job is a 5-second preflight. For an API-dependent cron it pings the endpoint with a cheap read-only call and checks for a 200. For a token-dependent cron it validates the token against a whoami endpoint. For a queue-dependent cron it counts the queue depth and confirms there is something to process. If any of these fail, the job exits non-zero immediately, GitHub Actions paints it red, and I get an email.
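The preflight idea can be sketched roughly like this. The check names and the `run_job` wrapper are assumptions for illustration. Each check is any callable returning True or False, so the real API ping, token validation, and queue count can be plugged in without changing the structure.

```python
import sys
from typing import Callable, Dict, List, Tuple


def preflight(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every named check and return (all_ok, names_of_failed_checks)."""
    failed: List[str] = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception as exc:
            # A check that crashes counts as a failed check, and says why.
            print(f"preflight {name}: {exc}", file=sys.stderr)
            ok = False
        if not ok:
            failed.append(name)
    return (not failed, failed)


def run_job(checks: Dict[str, Callable[[], bool]], work: Callable[[], None]) -> int:
    """Prove the work can be done, then do it. Returns the process exit code."""
    ok, failed = preflight(checks)
    if not ok:
        print("preflight failed: " + ", ".join(failed), file=sys.stderr)
        return 1  # exit before the work starts, so no state has changed
    work()
    return 0
```

A hypothetical call would look like `run_job({"api": api_ok, "token": token_valid, "queue": queue_has_items}, do_work)`, with the result passed to `sys.exit`.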

Here is why front-loading matters. If the health check runs last, a broken API means the work already half-ran. You get partial state, a half-drained queue, three assets generated and seven missing. Cleaning that up is worse than the original failure. If the health check runs first and fails, nothing happened. The queue is untouched. You fix the token, re-run, done.

I also added a hard rule: never wrap the whole job in a catch-all that swallows exit codes. Each step gets its own targeted error handling. If something I did not anticipate breaks, I want it to crash loudly. A crash is information. A swallowed exception is silence, and silence is what cost me 4 days the first time.

The health check also writes a one-line summary to the job log that I can scan in 2 seconds: "API ok, token valid, queue depth 14, processing 5." If I ever do glance at the runs, that line tells me everything. No scrolling through 200 lines of step output.
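A small sketch of that summary line, assuming the preflight already produced the four values. The function name and the `batch_size` parameter are hypothetical.

```python
def summary_line(api_ok: bool, token_valid: bool, queue_depth: int, batch_size: int) -> str:
    """Build the one-line preflight summary for the job log."""
    # Never claim to process more items than the queue holds.
    processing = min(queue_depth, batch_size)
    return (
        f"API {'ok' if api_ok else 'DOWN'}, "
        f"token {'valid' if token_valid else 'INVALID'}, "
        f"queue depth {queue_depth}, processing {processing}"
    )


print(summary_line(True, True, 14, 5))
# prints: API ok, token valid, queue depth 14, processing 5
```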

One small detail that paid off: I made the health check itself a separate job step with a clear name like "preflight." In the GitHub Actions UI the failed step shows red with that exact label, so the email subject and the run page both tell me what broke before I open anything. Naming steps for humans is free and it saves the panic of opening a run and not knowing where it died.

Queue-Low Alarms Beat Queue-Empty Failures

The starvation incidents taught me that an empty queue is too late. By the time the queue hits zero, the damage is already scheduled to land. I needed an alarm that fired while there was still slack.

So I set a low-water mark. My publish queue alarms at 5 items remaining, not at zero. At my posting cadence, 5 items is roughly 2 days of runway. That gives me a full weekend of warning before anything goes dark. The alarm is just a step in the cron that checks queue depth and, if it is at or under the threshold, opens a GitHub issue with a title like "Queue low: 5 items, refill within 48h."
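A rough sketch of that check. The drain rate of 2.5 items per day is an assumption derived from "5 items is roughly 2 days", and firing at or below the mark is an assumption chosen to match the example title. The function only decides whether to alarm and builds the title; opening the issue is left to whatever GitHub client the cron uses.

```python
from typing import Optional


def queue_low_title(
    depth: int,
    low_water_mark: int = 5,      # from the article
    items_per_day: float = 2.5,   # assumption: 5 items lasts about 2 days
) -> Optional[str]:
    """Return an issue title if the queue is at or below the mark, else None."""
    if depth > low_water_mark:
        return None
    hours_left = int(depth / items_per_day * 24)
    return f"Queue low: {depth} items, refill within {hours_left}h"


print(queue_low_title(5))
# prints: Queue low: 5 items, refill within 48h
```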

The issue is the key part. I did not want another email I would ignore. I wanted something that sits in my face until I resolve it. A GitHub issue stays open. It shows up as a number in my repo. It nags. When I refill the queue, I close the issue and the count goes back to zero. The open-issue count became my single dashboard for "is anything starving."

I also made the alarm idempotent. If the queue is low for 3 days in a row, I do not want 3 duplicate issues. The cron checks for an existing open issue with the same label before creating a new one. One alarm, one issue, until resolved. This kept the signal clean. When I see an open queue-low issue I know it is real and current.
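The idempotency check can be sketched as follows. The issue shape (a dict with a `labels` list of strings) is a simplification, and `create_issue` is a hypothetical callable wrapping the real API call. The list of open issues is assumed to have been fetched beforehand.

```python
from typing import Callable, List, Optional


def ensure_alarm_issue(
    open_issues: List[dict],
    label: str,
    title: str,
    create_issue: Callable[[str, str], dict],
) -> Optional[dict]:
    """Create the alarm issue only if no open issue already carries the label.

    Returns the new issue, or None when an unresolved alarm already exists.
    """
    for issue in open_issues:
        if label in issue.get("labels", []):
            return None  # one alarm, one issue, until resolved
    return create_issue(title, label)
```

Keying on the label rather than the title means the check still matches when the item count in the title changes from one day to the next.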

The thresholds matter more than they look. I tuned mine by working backward from reaction time. I ask: if this alarms on a Friday night, can I fix it by Sunday without stress? For the publish queue, 5 items gives me that. For a faster-draining queue I set the threshold higher. The number is not magic, it is just "enough runway to stay calm."
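Working backward from reaction time comes down to one multiplication. A sketch, where the function name is hypothetical and the 2.5 items per day figure is again an assumption derived from "5 items is roughly 2 days":

```python
import math


def low_water_mark(items_per_day: float, reaction_days: float) -> int:
    """Smallest queue depth that still leaves `reaction_days` of runway."""
    return math.ceil(items_per_day * reaction_days)


print(low_water_mark(2.5, 2))  # prints: 5
print(low_water_mark(6, 2))    # prints: 12
```

A queue that drains faster gets a proportionally higher threshold for the same calm weekend.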

If you run social scheduling on top of your queues, Buffer handles the actual posting cadence so the cron only has to keep the content pipeline full. Splitting "generate" from "post" meant a generation failure no longer froze my posting, because Buffer kept publishing from what was already scheduled. That separation alone removed an entire class of starvation. I covered the broader queue-and-cache thinking around this in Claude Blueprint.

Dead-Letter Behavior And Failure-Mode Issues

The last piece was deciding what happens to work that fails mid-flight. In message-queue terms this is a dead-letter pattern, and crons need their own version of it.

When my image cron pulls a prompt and the generation fails, the old code dropped the prompt on the floor. The prompt was gone, the queue moved on, and I never knew a specific item failed. Now, a failed item goes to a dead-letter list instead of being discarded. The cron tries it once, and on failure it moves the item to a separate "failed" queue and logs the reason. The main queue keeps flowing. The failed item waits for me.

This solved the partial-failure problem. One bad prompt no longer poisons the whole run. The cron processes the other 4 items, sets aside the broken one, and stays green for the work it could do while flagging the work it could not. At the end of the run, if the dead-letter list grew, the cron opens a failure-mode issue: "2 items dead-lettered, reasons: 1 rate limit, 1 malformed prompt."
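A minimal sketch of the dead-letter step and the failure-mode title. All names here are hypothetical, and the dead-letter list is a plain in-memory list standing in for whatever storage the real "failed" queue uses.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def process_with_dead_letter(
    items: Iterable[str],
    process: Callable[[str], object],
    dead_letters: List[Dict[str, str]],
) -> int:
    """Try each item once; set failures aside with a reason and keep going."""
    done = 0
    for item in items:
        try:
            process(item)
            done += 1
        except Exception as exc:
            # The item is kept, not dropped, along with why it failed.
            dead_letters.append({"item": item, "reason": str(exc)})
    return done


def failure_issue_title(dead_letters: List[Dict[str, str]]) -> str:
    """Summarize dead-lettered items by reason for the failure-mode issue."""
    reasons = Counter(entry["reason"] for entry in dead_letters)
    parts = ", ".join(f"{count} {reason}" for reason, count in reasons.most_common())
    return f"{len(dead_letters)} items dead-lettered, reasons: {parts}"


example = [
    {"item": "prompt-1", "reason": "rate limit"},
    {"item": "prompt-2", "reason": "malformed prompt"},
]
print(failure_issue_title(example))
# prints: 2 items dead-lettered, reasons: 1 rate limit, 1 malformed prompt
```

The per-item catch here is deliberate and narrow in scope: it wraps one item, records the reason, and surfaces it in an issue, which is different from a job-wide catch-all that hides the failure.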

Failure-mode issues are different from queue-low issues. Queue-low says "feed me." Failure-mode says "something specific broke and here is what." I label them differently so my open-issue dashboard distinguishes "starving" from "choking." Both are loud, both nag, neither is silent.

I also added a delayed retry on the dead-letter queue itself. Once a day a separate small cron retries dead-lettered items, because most of them failed on transient errors like rate limits. Roughly 80 percent of dead-lettered items succeed on the retry pass. The 20 percent that keep failing are real bugs, and those are the only ones that reach my eyes. That ratio is what makes the whole thing ignorable. I am not reviewing 50 transient blips, I am reviewing the 2 genuine problems a week.
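The daily retry pass might look roughly like this. The entry shape and the `attempts` counter are assumptions for illustration, and `process` is the same hypothetical per-item call used by the main cron.

```python
from typing import Callable, Dict, List, Tuple


def retry_dead_letters(
    dead_letters: List[Dict[str, object]],
    process: Callable[[str], object],
) -> Tuple[List[Dict[str, object]], List[Dict[str, object]]]:
    """Retry each dead-lettered item once; return (recovered, still_failing)."""
    recovered: List[Dict[str, object]] = []
    still_failing: List[Dict[str, object]] = []
    for entry in dead_letters:
        try:
            process(str(entry["item"]))
            recovered.append(entry)
        except Exception as exc:
            # Keep the latest reason and count the attempt, so repeat
            # offenders stand out from transient blips.
            attempts = int(entry.get("attempts", 1)) + 1
            still_failing.append({**entry, "reason": str(exc), "attempts": attempts})
    return recovered, still_failing
```

Only the `still_failing` list needs human attention. Everything in `recovered` was a transient error that resolved on its own.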

For the image side specifically, Magnific handles the upscaling step in that pipeline, and putting it behind the dead-letter pattern meant an upscale timeout retried cleanly instead of killing the batch. The combination of dead-lettering plus daily retry turned my flakiest cron into my most boring one, which is exactly the goal.

Bottom Line

A green checkmark is not a success signal, it is an exit-code signal, and the gap between those two cost me 10 days of silent failure across two incidents. The fix was three patterns. Health checks run first so a broken token fails loud before any work starts. Queue-low alarms fire at 5 items with runway to spare, opening a nagging GitHub issue instead of a forgettable email. Dead-letter handling sets aside failed items, retries them once a day, and only surfaces the 20 percent that are real bugs.

After these changes I genuinely did not open my Actions tab for 3 weeks. Nothing starved, nothing choked silently, and the two real problems that came up arrived as open issues I could not miss. That is what ignorable means: not that nothing breaks, but that when it breaks, it tells you in time to stay calm.

If you want the full setup behind these automations, Claude Blueprint walks through how I wire the agents, queues, and alarms together. Start with one cron, add the health check first, and build out from there.

This article contains affiliate links. If you sign up through them, I may earn a small commission at no extra cost to you. (Ad)
