Server-side container image builds #43

Open
AdrianoKF opened this issue Aug 7, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@AdrianoKF
Collaborator

AdrianoKF commented Aug 7, 2024

As a data scientist / ML engineer, I do not want to go through the overhead of downloading all sorts of dependencies and uploading them back to the internet as a container image before I can submit my job for execution (since I am on a limited-bandwidth connection).

Instead, I want these container images to be built on a build server, according to the image specification given as part of my job metadata.

This approach yields additional benefits:

  • Improved caching (esp. in teams), since images don't need to be built separately on each team member's local machine
  • No architecture gap (e.g., macOS/Linux, ARM/x64) between clients and compute server
  • No need for data scientists to install container tooling on their machine
  • Potential for security measures on the build server (e.g., vulnerability scanning, image signing, SBOM, ...)

High-level Design

Overview

  • Server offers an asynchronous POST /builds API endpoint that takes the following information
    • Image spec / Dockerfile
    • Build context archive (as .tar.gz)
    • Image tag
    • Target platform (optionally)
  • Actual image building happens as a background task, the client may poll the status through a GET /builds/<id> API endpoint
  • Build logs can be made available to the client through a GET /builds/<id>/logs endpoint (both historic and streaming logs)
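
The state behind the three endpoints above can be sketched as follows. This is a minimal in-memory stand-in, not a proposed implementation: the names `BuildRequest`, `BuildRecord`, `BuildRegistry`, and the status values are assumptions made for illustration, and a real server would expose these operations over HTTP.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class BuildStatus(str, Enum):
    # Status names are assumptions; the issue does not fix them.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class BuildRequest:
    dockerfile: str                 # image spec / Dockerfile contents
    context_archive: bytes          # build context as .tar.gz bytes
    image_tag: str
    platform: Optional[str] = None  # optional target platform


@dataclass
class BuildRecord:
    request: BuildRequest
    status: BuildStatus = BuildStatus.PENDING
    logs: list = field(default_factory=list)


class BuildRegistry:
    """Hypothetical in-memory store backing the build endpoints."""

    def __init__(self) -> None:
        self._builds = {}

    def create(self, request: BuildRequest) -> str:
        # POST /builds: register the build and return its ID right away.
        build_id = uuid.uuid4().hex
        self._builds[build_id] = BuildRecord(request)
        return build_id

    def status(self, build_id: str) -> BuildStatus:
        # GET /builds/<id>: a KeyError here would map to HTTP 404.
        return self._builds[build_id].status

    def logs(self, build_id: str, offset: int = 0) -> list:
        # GET /builds/<id>/logs: offset lets a client fetch only new lines.
        return self._builds[build_id].logs[offset:]
```

The `offset` parameter is one possible way to serve both historic logs (offset 0) and incremental polling; true streaming would need a different transport.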

Server-side

  • Need to provide the above APIs
  • Actual image build process should be decoupled from the API
    • Simplest approach: FastAPI BackgroundTasks
    • Could be a completely separate service
    • Decide whether to build the image with the Docker daemon (as we have before) or with BuildKit (more flexible, especially regarding output options and separation between multiple projects)
  • Image publishing needs to happen here (see open questions below)
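
To illustrate the decoupling between API and build process, here is a rough sketch that uses a standard-library thread pool in place of FastAPI BackgroundTasks. The `BuildQueue` class and the `Builder` callable signature are hypothetical; the actual builder (Docker daemon or BuildKit) is deliberately left abstract.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

# Hypothetical builder signature: takes an image tag, returns an image reference.
Builder = Callable[[str], str]


class BuildQueue:
    def __init__(self, builder: Builder, max_workers: int = 2) -> None:
        self._builder = builder
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def submit(self, image_tag: str) -> str:
        """Schedule a build and return a build ID without waiting for it."""
        build_id = uuid.uuid4().hex
        self._futures[build_id] = self._pool.submit(self._builder, image_tag)
        return build_id

    def status(self, build_id: str) -> str:
        future = self._futures[build_id]
        if future.done():
            return "failed" if future.exception() else "completed"
        return "running" if future.running() else "pending"

    def result(self, build_id: str, timeout: Optional[float] = None) -> str:
        """Block until the build finishes and return the image reference."""
        return self._futures[build_id].result(timeout=timeout)
```

Because the API layer only talks to `submit` and `status`, the pool could later be swapped for a separate service without changing the endpoints.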

Client-side

  • Putting together the build context in a similar fashion to the Docker CLI (unfortunately, this functionality is not exposed to the user)
    • this means respecting the ignore patterns given in a .dockerignore or deriving sensible defaults (e.g., from .gitignore)
  • Removing all image build functionality and replacing it with API calls
  • Moving all image spec functionality into the backend (basically the entire jobq.assembler package)
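
Assembling the build context on the client could look roughly like the sketch below. Note that it relies on `fnmatch`, which only approximates ignore-file matching: `*` also matches across `/`, and negated patterns (`!`) are not handled, so this does not reproduce the Docker CLI's behavior. The function names are assumptions.

```python
import fnmatch
import io
import tarfile
from pathlib import Path


def is_ignored(rel_path: str, patterns: list) -> bool:
    """Simplified check: a pattern may match the path or any parent directory."""
    parts = rel_path.split("/")
    candidates = ["/".join(parts[: i + 1]) for i in range(len(parts))]
    return any(
        fnmatch.fnmatch(candidate, pattern.rstrip("/"))
        for pattern in patterns
        for candidate in candidates
    )


def build_context_archive(root: Path, patterns: list) -> bytes:
    """Pack all non-ignored files under `root` into an in-memory .tar.gz."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path in sorted(root.rglob("*")):
            rel = path.relative_to(root).as_posix()
            if path.is_file() and not is_ignored(rel, patterns):
                archive.add(path, arcname=rel)
    return buffer.getvalue()
```

The resulting bytes could be uploaded as the build context archive of the build request.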

Additional Considerations

Security

  • Extracting a user-supplied build context on the build server exposes a DoS risk due to disk space exhaustion (cf. zip bomb)
    • On Linux, cgroups might come in handy to set resource quotas for the extraction process.
  • Cross-team layer sharing could leak information (but it is also desirable to speed up the build process)
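
As a complement to OS-level quotas, the extraction itself can enforce a budget. The following is a sketch under assumptions: the limits, the `ContextTooLargeError` name, and the decision to skip links and special files are illustrative choices, not settled design.

```python
import io
import shutil
import tarfile
from pathlib import Path


class ContextTooLargeError(Exception):
    """Raised when the archive exceeds the configured extraction budget."""


def extract_with_limits(
    archive: bytes,
    dest: Path,
    max_total_bytes: int = 500 * 1024 * 1024,
    max_members: int = 10_000,
) -> int:
    """Extract a .tar.gz into `dest`, aborting once the budget is exceeded."""
    dest = dest.resolve()
    total = 0
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for count, member in enumerate(tar, start=1):
            if count > max_members:
                raise ContextTooLargeError("too many entries in build context")
            if not (member.isfile() or member.isdir()):
                continue  # skip links and special files in this sketch
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
            if member.isdir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            total += member.size
            if total > max_total_bytes:
                raise ContextTooLargeError("uncompressed size over budget")
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
    return total
```

The size check runs before any bytes of the offending member are written, so a highly compressed archive is rejected before it can fill the disk. The path check also rejects entries that would escape the destination directory.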

Correctness

  • Since the assembly process for the build context will have to be re-implemented, we should make sure that we closely mirror the behavior of docker build.
  • Choosing sensible defaults for the context ignore patterns in the absence of a .dockerignore file
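
One possible fallback order for the ignore patterns is sketched below. The contents of `DEFAULT_PATTERNS` and the function names are assumptions; also keep in mind that .gitignore and .dockerignore do not share exact matching semantics, so reusing .gitignore lines unchanged is only an approximation.

```python
from pathlib import Path

# Hypothetical fallback when neither ignore file exists.
DEFAULT_PATTERNS = [".git", "__pycache__", "*.pyc", ".venv"]


def read_patterns(path: Path) -> list:
    """Return the non-empty, non-comment lines of an ignore file."""
    lines = (line.strip() for line in path.read_text().splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def resolve_ignore_patterns(root: Path) -> list:
    """Prefer .dockerignore, fall back to .gitignore, then to defaults."""
    for name in (".dockerignore", ".gitignore"):
        candidate = root / name
        if candidate.is_file():
            return read_patterns(candidate)
    return list(DEFAULT_PATTERNS)
```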

Open Questions

  • Control flow between the POST /jobs and image build API?
    • Does the client need to orchestrate the build process, or will the job submission endpoint change to take in all the required data (i.e., build context and image build spec) and abstract the image build process internally?
      • When the client needs to orchestrate the API calls, it also implies that it will have to periodically poll the image build status before calling the job submission (ergo, it needs to block the user indefinitely).
      • Combining image builds with job submission probably means extending the job state machine, with extra lifecycle states for the image build process (so that the submission endpoint can return a valid job ID right away while the job has not been submitted yet to the cluster). Problem: These are not coupled to Kueue since the job has not yet been submitted to the cluster. This directly raises the question of persistent state storage in the backend (see below).
  • Storing persistent state in the backend (see also Define Architecture and Basic Concepts for Project Mechanism #130): Eventually, the backend will have to persist some state (e.g., build job status, or even a full database of jobs and related entities). Do we bite the bullet now and design a persistent state management component, or can we get away with something ephemeral for now?
    • E.g., if the client orchestrates the build process, it could be okay if the server returns an HTTP 404 error for the build status if it was restarted during the image build - the client would have to resubmit the image build request in this case (Q: does it need to keep the build context around until then, or is it acceptable to just re-create the archive, possibly with modified contents?).
  • Image naming: Who decides how images are named and tagged (user, CLI, API)?
  • Image destination: Where will images be published? Should we bundle an image registry with the default setup or delegate to external ones (which might complicate the topic of passing credentials)?
  • Should image build resource usage be monitored and used for accounting purposes?
  • Should it be possible to set resource limits for image builds, similar to the workflow execution? If yes, could these be unified (i.e., implement image build tasks as Kueue workloads)?
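
To make the "extended job state machine" option more concrete, here is a small sketch of pre-submission lifecycle states. All state names and transitions are hypothetical and only illustrate the idea that a job ID can exist before anything is handed to the cluster; they do not reflect Kueue's own states.

```python
from enum import Enum


class JobState(str, Enum):
    # Hypothetical states that exist only in the backend, before cluster submission.
    BUILD_PENDING = "build_pending"
    BUILDING = "building"
    BUILD_FAILED = "build_failed"
    # Hypothetical hand-over point; from here on the cluster would own the lifecycle.
    SUBMITTED = "submitted"


ALLOWED_TRANSITIONS = {
    JobState.BUILD_PENDING: {JobState.BUILDING},
    JobState.BUILDING: {JobState.BUILD_FAILED, JobState.SUBMITTED},
    JobState.BUILD_FAILED: set(),
    JobState.SUBMITTED: set(),
}


def advance(current: JobState, target: JobState) -> JobState:
    """Return the new state, rejecting transitions the table does not allow."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Since the first three states are not backed by the cluster, they would have to live in whatever state storage the backend ends up with.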

Client-based Orchestration

sequenceDiagram
    participant Job API
    actor CLI as Jobq User
    participant Build API
    participant Background Tasks
    participant Builder

    CLI->>Build API: POST /builds
    activate CLI
    activate Build API

    Build API -) Background Tasks: Trigger image build
    Background Tasks ->> Builder: Build image
    activate Builder
    Build API-->>CLI: HTTP 202, build_id, image_ref
    deactivate Build API

    par
        loop status != "completed"
        CLI ->> Build API: GET /builds/<build_id>
        activate Build API
        Build API -->> CLI: Build status
        deactivate Build API
        end
    and
        Builder -->> Background Tasks: image_ref
        deactivate Builder
        Note right of Background Tasks: Publish image here?
    end
    
    CLI->>Job API: POST /jobs (image_ref)
    activate Job API
    Job API -->> CLI: HTTP 200, job_id
    deactivate Job API

    deactivate CLI
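
The polling loop in the diagram above could be bounded by a timeout so that the client does not block the user indefinitely. This is a sketch only: `get_status` is a hypothetical callable wrapping GET /builds/<build_id>, the "failed" status and the `BuildLostError` exception (e.g., for an HTTP 404 after a server restart) are assumptions, and the backoff values are arbitrary.

```python
import time
from typing import Callable


class BuildLostError(Exception):
    """The server no longer knows the build, so the client may need to resubmit."""


def wait_for_build(
    get_status: Callable[[], str],
    timeout: float = 600.0,
    interval: float = 1.0,
    max_interval: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal status is reached, doubling the delay each round."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()  # may raise BuildLostError
        if status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("image build did not finish in time")
        sleep(interval)
        interval = min(interval * 2, max_interval)
```

Only after this returns "completed" would the client go on to call POST /jobs with the image reference.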
@AdrianoKF AdrianoKF added the enhancement New feature or request label Aug 7, 2024
@maxmynter maxmynter assigned maxmynter and unassigned maxmynter Sep 2, 2024
@AdrianoKF
Collaborator Author

I've hacked together a trie-based path matching class that can be used to match a file path against .gitignore (or .dockerignore) patterns: https://gist.github.com/AdrianoKF/d5bf77f200592c2cab2b8633b85f8a97

This can serve as the starting point when building an archive of the build context locally for upload to the server.
