As a data scientist / ML engineer, I do not want to go through the overhead of downloading all sorts of dependencies and uploading them back to the internet as a container image before I can submit my job for execution (since I am on a limited-bandwidth connection).
Instead, I want these container images to be built on a build server, according to the image specification given as part of my job metadata.
This approach yields additional benefits:
Improved caching (esp. in teams), since images don't need to be built multiple times on each team member's local machine
No architecture gap (e.g., macOS/Linux, ARM/x64) between clients and compute server
No need for data scientists to install container tooling on their machine
Potential for security measures on the build server (e.g., vulnerability scanning, image signing, SBOM, ...)
High-level Design
Overview
Server offers an asynchronous POST /builds API endpoint that takes the following information:
Image spec / Dockerfile
Build context archive (as .tar.gz)
Image tag
Target platform (optionally)
Actual image building happens as a background task; the client may poll the status through a GET /builds/<id> API endpoint
Build logs can be made available to the client through a GET /builds/<id>/logs endpoint (both historic and streaming logs)
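Independent of the web framework, the behavior behind these endpoints can be sketched as a small in-memory build registry. All names here are illustrative, and the in-memory store deliberately sidesteps the persistence question raised under Open Questions below:

```python
import threading
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Build:
    build_id: str
    tag: str
    platform: Optional[str]
    status: str = "pending"  # pending -> running -> completed | failed
    logs: list[str] = field(default_factory=list)

class BuildRegistry:
    """In-memory backing store for POST /builds and GET /builds/<id>."""

    def __init__(self) -> None:
        self._builds: dict[str, Build] = {}
        self._lock = threading.Lock()

    def submit(self, tag: str, platform: Optional[str],
               build_fn: Callable[[Build], None]) -> str:
        """Register a build and run it in the background; returns the build id
        immediately, as the asynchronous POST /builds endpoint would."""
        build = Build(uuid.uuid4().hex, tag, platform)
        with self._lock:
            self._builds[build.build_id] = build

        def _run() -> None:
            build.status = "running"
            try:
                build_fn(build)  # the actual BuildKit/Docker build would go here
                build.status = "completed"
            except Exception as exc:
                build.logs.append(str(exc))
                build.status = "failed"

        threading.Thread(target=_run, daemon=True).start()
        return build.build_id

    def status(self, build_id: str) -> Build:
        """Lookup for GET /builds/<id>; raises KeyError for unknown builds."""
        with self._lock:
            return self._builds[build_id]
```

A FastAPI app would then wrap `submit` in the POST handler (returning HTTP 202 with the build id) and `status` in the GET handler; the `logs` list is what GET /builds/<id>/logs would serve.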
Server-side
Need to provide the above APIs
Actual image build process should be decoupled from the API
Simplest approach: FastAPI BackgroundTasks
Could be a completely separate service
Decide whether to build images through the Docker daemon (as we have before) or BuildKit (more flexible, especially when it comes to output options and the separation between multiple projects)
Image publishing needs to happen here (see open questions below)
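If BuildKit is chosen, the server could drive it through the `buildctl` CLI. A sketch of assembling such an invocation; the image reference is a placeholder, and error handling and registry credential passing are omitted:

```python
from typing import Optional

def buildctl_args(context_dir: str, dockerfile_dir: str, image_ref: str,
                  platform: Optional[str] = None, push: bool = True) -> list[str]:
    """Assemble a `buildctl build` command line using the dockerfile frontend.

    The --output option both names the image and (optionally) pushes it to a
    registry, which is where the image publishing step would hook in.
    """
    output = f"type=image,name={image_ref},push={str(push).lower()}"
    args = [
        "buildctl", "build",
        "--frontend", "dockerfile.v0",
        "--local", f"context={context_dir}",
        "--local", f"dockerfile={dockerfile_dir}",
        "--output", output,
    ]
    if platform:
        args += ["--opt", f"platform={platform}"]
    return args
```

The background task would then hand this list to `subprocess.run(...)`, capturing stdout/stderr for the logs endpoint.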
Client-side
Putting together the build context in a similar fashion to the Docker CLI (unfortunately, this functionality is not exposed for programmatic reuse)
this means respecting the ignore patterns given in a .dockerignore file or deriving sensible defaults (e.g., from .gitignore)
Removing all image build functionality and replacing it with API calls
Moving all image spec functionality into the backend (basically the entire jobq.assembler package)
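A minimal sketch of the client-side context assembly, using `fnmatch`-style matching as a simplification of Docker's full pattern semantics (no `!` negation or `**` support) and an illustrative set of default ignore patterns:

```python
import fnmatch
import tarfile
from pathlib import Path

# Illustrative fallback patterns when no .dockerignore exists; the real
# defaults would be derived more carefully (e.g., from .gitignore).
DEFAULT_IGNORES = [".git", "__pycache__", "*.pyc", ".venv"]

def load_ignore_patterns(root: Path) -> list[str]:
    ignore_file = root / ".dockerignore"
    if ignore_file.is_file():
        return [line.strip() for line in ignore_file.read_text().splitlines()
                if line.strip() and not line.startswith("#")]
    return DEFAULT_IGNORES

def create_build_context(root: Path, out: Path) -> None:
    """Archive `root` as a .tar.gz build context, skipping ignored paths."""
    patterns = load_ignore_patterns(root)

    def excluded(rel: str) -> bool:
        parts = rel.split("/")
        return any(fnmatch.fnmatch(rel, p) or p in parts for p in patterns)

    with tarfile.open(out, "w:gz") as tar:
        for path in sorted(root.rglob("*")):
            rel = path.relative_to(root).as_posix()
            if not excluded(rel):
                tar.add(path, arcname=rel, recursive=False)
```

Mirroring `docker build` exactly (see Correctness below) would mean replacing the `fnmatch` logic with Docker's pattern-matching rules.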
Additional Considerations
Security
Extracting a user-supplied build context on the build server exposes a DoS risk due to disk space exhaustion (c.f. zipbomb)
On Linux, cgroups might come in handy to set resource quotas for the extraction process.
Cross-team layer sharing could leak information (but it is also desirable to speed up the build process). A possible mitigation is separate buildx builders or buildkitd instances per jobq team/project (see #130).
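To bound the disk-exhaustion risk, the server can check the declared member sizes against a quota and reject path traversal before extracting anything. A sketch with an illustrative limit:

```python
import tarfile
from pathlib import Path

class ContextTooLarge(Exception):
    """Raised when the declared size of a build context exceeds the quota."""

def safe_extract(archive: Path, dest: Path,
                 max_bytes: int = 512 * 1024 * 1024) -> None:
    dest.mkdir(parents=True, exist_ok=True)
    dest = dest.resolve()
    total = 0
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar:
            # reject path traversal (e.g. "../../etc/passwd" entries)
            if not (dest / member.name).resolve().is_relative_to(dest):
                raise ValueError(f"unsafe path in archive: {member.name}")
            # enforce the quota on declared (uncompressed) sizes before
            # writing any further data to disk; tarfile extracts exactly
            # the number of bytes stated in each member header
            total += member.size
            if total > max_bytes:
                raise ContextTooLarge(f"build context exceeds {max_bytes} bytes")
            tar.extract(member, dest)
```

This only caps disk usage; CPU and memory of the extraction (and build) processes would still need process-level quotas such as the cgroups mentioned above.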
Correctness
Since the assembly process for the build context will have to be re-implemented, we should make sure that we closely mirror the behavior of docker build.
Choosing sensible defaults for the context ignore patterns in the absence of a .dockerignore file
Open Questions
Control flow between the POST /jobs and image build API?
Does the client need to orchestrate the build process, or will the job submission endpoint change to take in all the required data (i.e., build context and image build spec) and abstract the image build process internally?
If the client needs to orchestrate the API calls, it will have to periodically poll the image build status before calling the job submission endpoint (ergo, it blocks the user for the duration of the build).
Combining image builds with job submission probably means extending the job state machine, with extra lifecycle states for the image build process (so that the submission endpoint can return a valid job ID right away while the job has not been submitted yet to the cluster). Problem: These are not coupled to Kueue since the job has not yet been submitted to the cluster. This directly raises the question of persistent state storage in the backend (see below).
Storing persistent state in the backend (see also Define Architecture and Basic Concepts for Project Mechanism #130): Eventually, the backend will have to persist some state (e.g., build job status, or even a full database of jobs and related entities). Do we bite the bullet now and design a persistent state management component, or can we get away with something ephemeral for now?
E.g., if the client orchestrates the build process, it could be okay if the server returns an HTTP 404 error for the build status after being restarted during the image build - the client would have to resubmit the image build request in this case (Q: does it need to keep the build context around until then, or is it acceptable to just re-create the archive, possibly with modified contents?).
Image naming: Who decides how images are named and tagged (user, CLI, API)?
Image destination: Where will images be published? Should we bundle an image registry with the default setup or delegate to external ones (which might complicate the topic of passing credentials)?
Should image build resource usage be monitored and used for accounting purposes?
Should it be possible to set resource limits for image builds, similar to the workflow execution? If yes, could these be unified (i.e., implement image build tasks as Kueue workloads)?
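If image builds and job submission are combined, the extended state machine mentioned above might look roughly like this (state names are purely illustrative, not a proposal):

```python
from enum import Enum

class JobState(str, Enum):
    """One possible extension of the job state machine, with image-build
    phases preceding cluster submission."""
    BUILD_PENDING = "build_pending"
    BUILD_RUNNING = "build_running"
    BUILD_FAILED = "build_failed"
    SUBMITTED = "submitted"    # handed over to the cluster / Kueue
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

# States the backend manages on its own, before Kueue is involved; these
# are exactly the states that require persistent storage in the backend.
PRE_SUBMISSION = {
    JobState.BUILD_PENDING,
    JobState.BUILD_RUNNING,
    JobState.BUILD_FAILED,
}
```

This makes the coupling problem explicit: everything in `PRE_SUBMISSION` exists only in the backend's own state store, while the later states can be derived from the cluster.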
Client-based Orchestration
```mermaid
sequenceDiagram
    participant Job API
    actor CLI as Jobq User
    participant Build API
    participant Background Tasks
    participant Builder
    CLI->>Build API: POST /builds
    activate CLI
    activate Build API
    Build API -) Background Tasks: Trigger image build
    Background Tasks ->> Builder: Build image
    activate Builder
    Build API-->>CLI: HTTP 202, build_id, image_ref
    deactivate Build API
    par
        loop status != "completed"
            CLI ->> Build API: GET /builds/<build_id>
            activate Build API
            Build API -->> CLI: Build status
            deactivate Build API
        end
    and
        Builder -->> Background Tasks: image_ref
        deactivate Builder
        Note right of Background Tasks: Publish image here?
    end
    CLI->>Job API: POST /jobs (image_ref)
    activate Job API
    Job API -->> CLI: HTTP 200, job_id
    deactivate Job API
    deactivate CLI
```
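The client-side polling loop in this flow could be sketched as follows, where `get_status` stands in for an HTTP call to GET /builds/<build_id> (the 404-after-server-restart case from the open questions would also surface here):

```python
import time
from typing import Callable

def wait_for_build(get_status: Callable[[], str],
                   poll_interval: float = 2.0,
                   timeout: float = 1800.0) -> str:
    """Poll the build status until it reaches a terminal state.

    Illustrative sketch: the real client would perform HTTP requests,
    back off between polls, and resubmit the build on a 404.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("image build did not finish in time")
```

This also illustrates the drawback noted above: with client-based orchestration, the CLI blocks the user for the entire duration of the build before it can call POST /jobs.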