Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition when initializing HealthCheckedEndpointGroup #6188

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

hyunw9
Copy link

@hyunw9 hyunw9 commented Mar 31, 2025

Motivation

Fixes #6018

A race condition can occur when adding a listener immediately after creating a HealthCheckedEndpointGroup.
Because the endpoint group initialization is asynchronous, the listener may receive endpoints in an incomplete or intermediate state.

Modification

Explicitly await the completion of HealthCheckedEndpointGroup initialization using whenReady().join() before adding a listener.
This ensures the listener will always receive a fully initialized endpoint list, eliminating any race conditions.

Result
• Race condition is fixed.
• Listener always receives a fully initialized EndpointGroup, ensuring reliable and predictable behavior.

@hyunw9 hyunw9 closed this Mar 31, 2025
@hyunw9 hyunw9 reopened this Mar 31, 2025
@hyunw9 hyunw9 closed this Mar 31, 2025
@hyunw9 hyunw9 reopened this Mar 31, 2025
Copy link
Contributor

@jrhee17 jrhee17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the test in question has failed again: https://github.com/line/armeria/actions/runs/14173631131/job/39703141791?pr=6188

@hyunw9
Copy link
Author

hyunw9 commented Apr 1, 2025

I noticed that the test failure in HealthCheckedEndpointGroupTest#cacheReflectsAttributeChanges seemed to stem from an unexpected ordering issue during initialization.

In this test, registering the endpoint is essentially part of the setup phase. However, the internal initialization of HealthCheckedEndpointGroup is asynchronous, and addListener(..., true) can synchronously invoke listener.accept(latest) before the system is ready, which introduces a race condition.

Initially, I thought calling whenReady().join() would fully synchronize the group’s initialization. However, after digging into the code, I realized that although join() ensures the group is initialized, it doesn’t control the timing of the listener callback.

Internally, addListener() acquires a lock to update the listener list, but the actual call to listener.accept(latest) (triggered by notifyLatestValue = true) occurs outside the lock. That means a race condition can still occur between listener.accept(...) and the external setup logic that expects a stable state.

To address this, I changed the approach:
1. I registered the listener using addListener(endpointListener, false) to avoid immediate invocation.
2. Then, after the group was fully initialized via whenReady().join(), I manually called listener.accept(...) to reflect the initial state in a safe and deterministic manner.

This ensures that the state propagation is explicitly controlled and prevents any premature callback from breaking the test setup.

@hyunw9 hyunw9 requested a review from jrhee17 April 1, 2025 14:15
Copy link
Contributor

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines 567 to 568
endpointGroup.addListener(endpointsListener, false);
endpointsListener.accept(endpointGroup.endpoints());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we do this?

Suggested change
endpointGroup.addListener(endpointsListener, false);
endpointsListener.accept(endpointGroup.endpoints());
endpointGroup.addListener(endpointsListener, true);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it works ! I think Fix duplicate endpoint handling in HealthCheckedEndpointGroup
this change helps the new group include endpoints from the previous group temporarily, which avoids any brief disappearance during the transition.

Copy link

codecov bot commented Apr 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.53%. Comparing base (8150425) to head (64a9e31).
Report is 52 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6188      +/-   ##
============================================
+ Coverage     74.46%   74.53%   +0.06%     
- Complexity    22234    22364     +130     
============================================
  Files          1963     1969       +6     
  Lines         82437    82767     +330     
  Branches      10764    10777      +13     
============================================
+ Hits          61385    61688     +303     
- Misses        15918    15949      +31     
+ Partials       5134     5130       -4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants