add option to exit runner if docker isn't available #3733

Open
wants to merge 3 commits into
base: main


@felixlut felixlut commented Mar 5, 2025

This PR adds an option, RUNNER_WAIT_FOR_DOCKER_EXIT_ON_FAILURE, that makes the runner exit if Docker is not available. The runner already has an option to wait a set amount of time for Docker to become available (RUNNER_WAIT_FOR_DOCKER_IN_SECONDS), but if Docker is still not ready after that time, the runner simply ignores it and carries on. For my use case that is not the desired behavior; I'd rather the runner exit with an error instead.
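As a rough illustration of the behavior described above, the wait-then-decide logic could be sketched like this. It is a minimal Python sketch, not the runner's actual code: the function names (`docker_is_ready`, `wait_for_docker`, `main`), the use of `docker ps` as the readiness probe, the one-second polling interval, and the assumption that the new option is switched on with the value `true` are all made up for the example.

```python
import os
import subprocess
import sys
import time


def docker_is_ready() -> bool:
    """Assumed probe: Docker counts as ready if `docker ps` succeeds."""
    try:
        result = subprocess.run(
            ["docker", "ps"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # The docker CLI is missing or cannot be executed.
        return False
    return result.returncode == 0


def wait_for_docker(timeout_seconds, probe=docker_is_ready, sleep=time.sleep):
    """Poll `probe` up to `timeout_seconds` times, sleeping 1s between tries."""
    for _ in range(timeout_seconds):
        if probe():
            return True
        sleep(1)
    return False


def main(env=os.environ, probe=docker_is_ready, sleep=time.sleep) -> int:
    """Return the exit code a hypothetical runner wrapper would use."""
    timeout = int(env.get("RUNNER_WAIT_FOR_DOCKER_IN_SECONDS", "0") or "0")
    # Assumption: the new option is enabled by the literal value "true".
    exit_on_failure = (
        env.get("RUNNER_WAIT_FOR_DOCKER_EXIT_ON_FAILURE", "").lower() == "true"
    )

    if timeout > 0 and not wait_for_docker(timeout, probe, sleep):
        if exit_on_failure:
            print("Docker is not available; exiting.", file=sys.stderr)
            return 1
        # Existing behavior: give up waiting and start anyway.
        print("Docker is not available; continuing without it.", file=sys.stderr)
    return 0
```

A wrapper script could call `sys.exit(main())` so that a non-zero code is visible to whatever supervises the runner. The probe and sleep function are passed in as parameters only so the logic can be exercised without a real Docker daemon.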

I'd argue that a runner starting without Docker is faulty, given the many GitHub Actions features that depend on it (container jobs, Docker container actions, service containers, ...), or at the very least that there should be an option to prevent it from starting in such a state. The runner already has a similar mechanism for sudo, so it's not a stretch to do the same here.

My use case - Actions Runner Controller in AKS

While running ARC in an AKS cluster, I've noticed intermittent failures when starting the docker:dind sidecar container on new nodes during the first few minutes of a node's lifecycle. The issue resolves itself after a couple of minutes, but not before the initial set of runners has started without Docker, which crashes any workflow that depends on it. I'd rather have the runner exit with an error, which in the Kubernetes world would mean a retry of the pod, which (eventually) resolves the issue. This is the current timeline of events:

  1. New node starting up
  2. New runner starting on the new node
    1. Error starting docker:dind. It is not retried
    2. Runner waits for Docker for RUNNER_WAIT_FOR_DOCKER_IN_SECONDS seconds, but when the timer runs out it continues without it
  3. Workflows depending on Docker start crashing (container job, Docker actions, ...)

Notably, I've also tried raising RUNNER_WAIT_FOR_DOCKER_IN_SECONDS, but the creation of the docker:dind container is not retried automatically, so once a runner has hit this error it will eventually start without Docker available, no matter how long it waits. It might be possible to configure AKS to do the retry, but in either case I believe simply killing the runner when it's faulty should be a supported use case.
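To make the difference between the two outcomes concrete, here is a toy model of the scenario. It is only a sketch under stated assumptions: `simulate_pod_starts` is a hypothetical helper, and the model assumes that each pod start gets a fresh chance to bring up docker:dind and that Docker first comes up on a known attempt. It does not reproduce how Kubernetes or ARC actually schedule restarts.

```python
def simulate_pod_starts(first_ready_attempt, exit_on_failure, max_starts=5):
    """Toy model: return (pod starts used, whether the serving runner has Docker).

    Assumption: Docker is unavailable on every pod start before
    `first_ready_attempt` and available from that start onward.
    """
    for attempt in range(1, max_starts + 1):
        docker_ok = attempt >= first_ready_attempt
        if docker_ok:
            # The runner starts normally with Docker available.
            return attempt, True
        if not exit_on_failure:
            # Current behavior: the runner starts anyway and accepts jobs
            # that will fail as soon as they need Docker.
            return attempt, False
        # Proposed behavior: the runner exits with an error, and the
        # orchestrator is assumed to start the pod again.
    return max_starts, False


# Docker first comes up on the third pod start in this made-up scenario.
print(simulate_pod_starts(3, exit_on_failure=False))
print(simulate_pod_starts(3, exit_on_failure=True))
```

In this model, raising the wait timeout corresponds to staying longer inside a single attempt, which does not help when the sidecar start itself is never retried; only a new pod start gives Docker another chance.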

@felixlut felixlut requested a review from a team as a code owner March 5, 2025 19:52
@felixlut felixlut marked this pull request as draft March 5, 2025 19:54
@felixlut felixlut marked this pull request as ready for review March 6, 2025 08:09