Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn in jobby list if there are failed pods to a Kueue workload #89

Merged
merged 4 commits into from
Sep 17, 2024

Conversation

nicholasjng
Copy link
Collaborator

As per title. We only warn if the job hasn't already failed, in which case you can inspect the job directly to debug.

Addresses the final point for #86, at least for Kueue jobs.

@nicholasjng nicholasjng added enhancement New feature or request backend Related to the backend / server component. client Related to the client component. labels Sep 16, 2024
@nicholasjng nicholasjng self-assigned this Sep 16, 2024
Copy link

codecov bot commented Sep 16, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 56.42%. Comparing base (af9a3cd) to head (4accad5).
Report is 3 commits behind head on main.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
client/src/cli/commands/list.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #89      +/-   ##
==========================================
+ Coverage   56.06%   56.42%   +0.35%     
==========================================
  Files          61       61              
  Lines        3018     2997      -21     
==========================================
- Hits         1692     1691       -1     
+ Misses       1326     1306      -20     
Flag Coverage Δ
backend 88.27% <100.00%> (+0.09%) ⬆️
client 47.92% <33.33%> (+0.27%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nicholasjng
Copy link
Collaborator Author

nicholasjng commented Sep 16, 2024

Summary: I list all pods for a workload, any failed pod associated with it will have status.phase=='Failed' attached to it.

Technically, this is portable across job types in the sense that any failed pod associated with a cluster resource should trigger an alarm (if, for example, the kuberay operator were to fail, we would get a notifier here as well.)

This requires a working connection to the k8s API server, which makes it fail in CI if not mocked away. (I'm not sure why that is, but maybe you can chime in here.)

@nicholasjng nicholasjng linked an issue Sep 16, 2024 that may be closed by this pull request
3 tasks
@AdrianoKF
Copy link
Collaborator

This requires a working connection to the k8s API server, which makes it fail in CI if not mocked away. (I'm not sure why that is, but maybe you can chime in here.)

I'm not surprised it needs a mock, since the endpoint accesses the pods property of the workload, which calls out to the k8s API. I guess one way is to mock the has_failed_pods property as you did, or you could reach for mocking the pods property, so that derived attributes can be computed from it.

In the long run, we would benefit from introducing a few test fixtures to produce mock workloads, and not repeat that logic across the individual tests.

Copy link
Collaborator

@AdrianoKF AdrianoKF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few small assertions to the E2E test case, but everything looks good!

@AdrianoKF AdrianoKF merged commit 4b383b3 into main Sep 17, 2024
6 checks passed
@AdrianoKF AdrianoKF deleted the find-failed-pods branch September 17, 2024 07:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend Related to the backend / server component. client Related to the client component. enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Render additional workload information in jobby list
2 participants