As a data scientist / ML engineer, I do not want to go through the overhead of downloading all sorts of dependencies and uploading them back to the internet as a container image before I can submit my job for execution (since I am on a limited-bandwidth connection).
Instead, I want these container images to be built on a build server, according to the image specification given as part of my job metadata.
This approach yields additional benefits:
Improved caching (esp. in teams), since images don't need to be built multiple times on each team member's local machine
No architecture gap (e.g., macOS/Linux, ARM/x64) between clients and compute server
No need for data scientists to install container tooling on their machine
Potential for security measures on the build server (e.g., vulnerability scanning, image signing, SBOM, ...)
High-level Design
Overview
Server offers an asynchronous POST /builds API endpoint that takes the following information:
Image spec / Dockerfile
Build context archive (as .tar.gz)
Image tag
Target platform (optionally)
Actual image building happens as a background task; the client may poll the status through a GET /builds/<id> API endpoint
Build logs can be made available to the client through a GET /builds/<id>/logs endpoint (both historic and streaming logs)
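Independent of the web framework, the behavior behind these endpoints can be sketched as a small in-memory build registry. All names here are illustrative, and the in-memory store deliberately sidesteps the persistence question raised under Open Questions below:

```python
import threading
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Build:
    build_id: str
    tag: str
    platform: Optional[str]
    status: str = "pending"  # pending -> running -> completed | failed
    logs: list[str] = field(default_factory=list)

class BuildRegistry:
    """In-memory backing store for POST /builds and GET /builds/<id>."""

    def __init__(self) -> None:
        self._builds: dict[str, Build] = {}
        self._lock = threading.Lock()

    def submit(self, tag: str, platform: Optional[str],
               build_fn: Callable[[Build], None]) -> str:
        """Register a build and run it in the background; returns the build id
        immediately, as the asynchronous POST /builds endpoint would."""
        build = Build(uuid.uuid4().hex, tag, platform)
        with self._lock:
            self._builds[build.build_id] = build

        def _run() -> None:
            build.status = "running"
            try:
                build_fn(build)  # the actual BuildKit/Docker build would go here
                build.status = "completed"
            except Exception as exc:
                build.logs.append(str(exc))
                build.status = "failed"

        threading.Thread(target=_run, daemon=True).start()
        return build.build_id

    def status(self, build_id: str) -> Build:
        """Lookup for GET /builds/<id>; raises KeyError for unknown builds."""
        with self._lock:
            return self._builds[build_id]
```

A FastAPI app would then wrap `submit` in the POST handler (returning HTTP 202 with the build id) and `status` in the GET handler; the `logs` list is what GET /builds/<id>/logs would serve.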
Server-side
Need to provide the above APIs
Actual image build process should be decoupled from the API
Simplest approach: FastAPI BackgroundTasks
Could be a completely separate service
Decide whether to build images through the Docker daemon (as we have before) or BuildKit (more flexible, especially when it comes to output options and the separation between multiple projects)
Image publishing needs to happen here (see open questions below)
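If BuildKit is chosen, the server could drive it through the `buildctl` CLI. A sketch of assembling such an invocation; the image reference is a placeholder, and error handling and registry credential passing are omitted:

```python
from typing import Optional

def buildctl_args(context_dir: str, dockerfile_dir: str, image_ref: str,
                  platform: Optional[str] = None, push: bool = True) -> list[str]:
    """Assemble a `buildctl build` command line using the dockerfile frontend.

    The --output option both names the image and (optionally) pushes it to a
    registry, which is where the image publishing step would hook in.
    """
    output = f"type=image,name={image_ref},push={str(push).lower()}"
    args = [
        "buildctl", "build",
        "--frontend", "dockerfile.v0",
        "--local", f"context={context_dir}",
        "--local", f"dockerfile={dockerfile_dir}",
        "--output", output,
    ]
    if platform:
        args += ["--opt", f"platform={platform}"]
    return args
```

The background task would then hand this list to `subprocess.run(...)`, capturing stdout/stderr for the logs endpoint.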
Client-side
Putting together the build context in a similar fashion to the Docker CLI (unfortunately, this functionality is not exposed for programmatic reuse)
this means respecting the ignore patterns given in a .dockerignore file or deriving sensible defaults (e.g., from .gitignore)
Removing all image build functionality and replacing it with API calls
Moving all image spec functionality into the backend (basically the entire jobq.assembler package)
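A minimal sketch of the client-side context assembly, using `fnmatch`-style matching as a simplification of Docker's full pattern semantics (no `!` negation or `**` support) and an illustrative set of default ignore patterns:

```python
import fnmatch
import tarfile
from pathlib import Path

# Illustrative fallback patterns when no .dockerignore exists; the real
# defaults would be derived more carefully (e.g., from .gitignore).
DEFAULT_IGNORES = [".git", "__pycache__", "*.pyc", ".venv"]

def load_ignore_patterns(root: Path) -> list[str]:
    ignore_file = root / ".dockerignore"
    if ignore_file.is_file():
        return [line.strip() for line in ignore_file.read_text().splitlines()
                if line.strip() and not line.startswith("#")]
    return DEFAULT_IGNORES

def create_build_context(root: Path, out: Path) -> None:
    """Archive `root` as a .tar.gz build context, skipping ignored paths."""
    patterns = load_ignore_patterns(root)

    def excluded(rel: str) -> bool:
        parts = rel.split("/")
        return any(fnmatch.fnmatch(rel, p) or p in parts for p in patterns)

    with tarfile.open(out, "w:gz") as tar:
        for path in sorted(root.rglob("*")):
            rel = path.relative_to(root).as_posix()
            if not excluded(rel):
                tar.add(path, arcname=rel, recursive=False)
```

Mirroring `docker build` exactly (see Correctness below) would mean replacing the `fnmatch` logic with Docker's pattern-matching rules.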
Additional Considerations
Security
Extracting a user-supplied build context on the build server exposes a DoS risk due to disk space exhaustion (c.f. zipbomb)
On Linux, cgroups might come in handy to set resource quotas for the extraction process.
Cross-team layer sharing could leak information (but it is also desirable to speed up the build process). A possible mitigation is separate buildx builders or buildkitd instances per jobq team/project (see #130).
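To bound the disk-exhaustion risk, the server can check the declared member sizes against a quota and reject path traversal before extracting anything. A sketch with an illustrative limit:

```python
import tarfile
from pathlib import Path

class ContextTooLarge(Exception):
    """Raised when the declared size of a build context exceeds the quota."""

def safe_extract(archive: Path, dest: Path,
                 max_bytes: int = 512 * 1024 * 1024) -> None:
    dest.mkdir(parents=True, exist_ok=True)
    dest = dest.resolve()
    total = 0
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar:
            # reject path traversal (e.g. "../../etc/passwd" entries)
            if not (dest / member.name).resolve().is_relative_to(dest):
                raise ValueError(f"unsafe path in archive: {member.name}")
            # enforce the quota on declared (uncompressed) sizes before
            # writing any further data to disk; tarfile extracts exactly
            # the number of bytes stated in each member header
            total += member.size
            if total > max_bytes:
                raise ContextTooLarge(f"build context exceeds {max_bytes} bytes")
            tar.extract(member, dest)
```

This only caps disk usage; CPU and memory of the extraction (and build) processes would still need process-level quotas such as the cgroups mentioned above.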
Correctness
Since the assembly process for the build context will have to be re-implemented, we should make sure that we closely mirror the behavior of docker build.
Choosing sensible defaults for the context ignore patterns in the absence of a .dockerignore file
Open Questions
Control flow between the POST /jobs and image build API?
Does the client need to orchestrate the build process, or will the job submission endpoint change to take in all the required data (i.e., build context and image build spec) and abstract the image build process internally?
If the client needs to orchestrate the API calls, it will have to periodically poll the image build status before calling the job submission endpoint (ergo, it blocks the user for the duration of the build).
Combining image builds with job submission probably means extending the job state machine, with extra lifecycle states for the image build process (so that the submission endpoint can return a valid job ID right away while the job has not been submitted yet to the cluster). Problem: These are not coupled to Kueue since the job has not yet been submitted to the cluster. This directly raises the question of persistent state storage in the backend (see below).
Storing persistent state in the backend (see also Define Architecture and Basic Concepts for Project Mechanism #130): Eventually, the backend will have to persist some state (e.g., build job status, or even a full database of jobs and related entities). Do we bite the bullet now and design a persistent state management component, or can we get away with something ephemeral for now?
E.g., if the client orchestrates the build process, it could be okay if the server returns an HTTP 404 error for the build status after being restarted during the image build - the client would have to resubmit the image build request in this case (Q: does it need to keep the build context around until then, or is it acceptable to just re-create the archive, possibly with modified contents?).
Image naming: Who decides how images are named and tagged (user, CLI, API)?
Image destination: Where will images be published? Should we bundle an image registry with the default setup or delegate to external ones (which might complicate the topic of passing credentials)?
Should image build resource usage be monitored and used for accounting purposes?
Should it be possible to set resource limits for image builds, similar to the workflow execution? If yes, could these be unified (i.e., implement image build tasks as Kueue workloads)?
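If image builds and job submission are combined, the extended state machine mentioned above might look roughly like this (state names are purely illustrative, not a proposal):

```python
from enum import Enum

class JobState(str, Enum):
    """One possible extension of the job state machine, with image-build
    phases preceding cluster submission."""
    BUILD_PENDING = "build_pending"
    BUILD_RUNNING = "build_running"
    BUILD_FAILED = "build_failed"
    SUBMITTED = "submitted"    # handed over to the cluster / Kueue
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

# States the backend manages on its own, before Kueue is involved; these
# are exactly the states that require persistent storage in the backend.
PRE_SUBMISSION = {
    JobState.BUILD_PENDING,
    JobState.BUILD_RUNNING,
    JobState.BUILD_FAILED,
}
```

This makes the coupling problem explicit: everything in `PRE_SUBMISSION` exists only in the backend's own state store, while the later states can be derived from the cluster.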
Client-based Orchestration
```mermaid
sequenceDiagram
    participant Job API
    actor CLI as Jobq User
    participant Build API
    participant Background Tasks
    participant Builder
    CLI->>Build API: POST /builds
    activate CLI
    activate Build API
    Build API -) Background Tasks: Trigger image build
    Background Tasks ->> Builder: Build image
    activate Builder
    Build API-->>CLI: HTTP 202, build_id, image_ref
    deactivate Build API
    par
        loop status != "completed"
            CLI ->> Build API: GET /builds/<build_id>
            activate Build API
            Build API -->> CLI: Build status
            deactivate Build API
        end
    and
        Builder -->> Background Tasks: image_ref
        deactivate Builder
        Note right of Background Tasks: Publish image here?
    end
    CLI->>Job API: POST /jobs (image_ref)
    activate Job API
    Job API -->> CLI: HTTP 200, job_id
    deactivate Job API
    deactivate CLI
```
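The client-side polling loop in this flow could be sketched as follows, where `get_status` stands in for an HTTP call to GET /builds/<build_id> (the 404-after-server-restart case from the open questions would also surface here):

```python
import time
from typing import Callable

def wait_for_build(get_status: Callable[[], str],
                   poll_interval: float = 2.0,
                   timeout: float = 1800.0) -> str:
    """Poll the build status until it reaches a terminal state.

    Illustrative sketch: the real client would perform HTTP requests,
    back off between polls, and resubmit the build on a 404.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("image build did not finish in time")
```

This also illustrates the drawback noted above: with client-based orchestration, the CLI blocks the user for the entire duration of the build before it can call POST /jobs.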