add option to exit runner if docker isn't available #3733
This PR implements an option, `RUNNER_WAIT_FOR_DOCKER_EXIT_ON_FAILURE`, to exit the runner if Docker is not available. The runner already has an option to wait a set amount of time for Docker to become available (`RUNNER_WAIT_FOR_DOCKER_IN_SECONDS`), but if Docker is still not ready after that time the runner simply ignores it and trucks along. For my use-case that is not the desired behavior, and I'd rather the runner exit with an error instead.

I'd argue that a runner starting without Docker is faulty given the many GitHub Actions features depending on it (container jobs, Docker container actions, service containers, ...), or at the very least that there should be an option to prevent it from starting in such a state. The runner already has a similar mechanism for sudo, so it's not a stretch to do the same here.
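
To make the intended behavior concrete, here is a minimal sketch in Python of the wait-then-exit logic described above. It is not the runner's actual code. The function names, the use of `docker ps` as the readiness probe, the one-second polling interval, and the way the two environment variables are parsed (an integer number of seconds, and the string `true` to enable the exit) are all assumptions made for illustration.

```python
import os
import subprocess
import time


def docker_is_ready() -> bool:
    """Return True if `docker ps` succeeds, i.e. the daemon is reachable."""
    try:
        result = subprocess.run(
            ["docker", "ps"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # The docker CLI is missing or cannot be executed.
        return False
    return result.returncode == 0


def wait_for_docker(wait_seconds, exit_on_failure,
                    probe=docker_is_ready, sleep=time.sleep) -> int:
    """Poll for Docker and return an exit code: 0 to continue, 1 to abort.

    Assumption: a wait time of zero or less means the wait is disabled,
    so the runner starts as usual.
    """
    if wait_seconds <= 0:
        return 0
    for _ in range(wait_seconds):
        if probe():
            return 0
        sleep(1)
    # Timer ran out. Without the new option the runner continues anyway;
    # with it, the caller should exit with an error.
    return 1 if exit_on_failure else 0


def main(env=os.environ) -> int:
    # Assumption: integer seconds, and "true" enables the exit behavior.
    wait_seconds = int(env.get("RUNNER_WAIT_FOR_DOCKER_IN_SECONDS") or "0")
    flag = env.get("RUNNER_WAIT_FOR_DOCKER_EXIT_ON_FAILURE", "")
    return wait_for_docker(wait_seconds, flag.lower() == "true")
```

A startup wrapper could call `sys.exit(main())` before launching the runner, so a non-zero code stops the process and lets the orchestrator restart it. The `probe` and `sleep` parameters exist only so the logic can be exercised without a real Docker daemon.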
My use-case - Actions Runner Controller in AKS
While running ARC in an AKS cluster I've noticed intermittent issues with starting the `docker:dind` sidecar container for new nodes during the first few minutes of a node's lifecycle. The issue resolves itself given a couple of minutes, but not before causing problems: the initial set of runners start without Docker, resulting in crashes for workflows depending on it. I'd rather have the runner exit with an error, which in the Kubernetes world would mean a retry of the pod, which (eventually) resolves the issue. This is the timeline of events as of now:

1. The `docker:dind` sidecar container fails to start. It is not retried
2. The runner waits for Docker for `RUNNER_WAIT_FOR_DOCKER_IN_SECONDS` seconds, but when the timer runs out it continues without it

Notably I've also tried bumping `RUNNER_WAIT_FOR_DOCKER_IN_SECONDS` to a higher number, but the creation of the `docker:dind` container is not retried automatically, meaning that once a runner has encountered this error it will eventually start without Docker available. It might be possible to configure AKS to do the retry, but in either case I believe it should be a supported use-case to simply kill the runner if it's faulty.