Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release-3.5] [Solution 2] Auto sync members in v3store if Islearner is the only field that differs between v2store and v3store #19606

Open
wants to merge 1 commit into
base: release-3.5
Choose a base branch
from

Conversation

ahrtr
Copy link
Member

@ahrtr ahrtr commented Mar 14, 2025

Second solution to provide a fix for the users who have already been affected by #19557. It's based on the comment #19586 (comment),

  • good side
    • Sync the v3store before the raft loop getting started, and also in appy workflow. So no any potential conflict
  • bad side
    • created two exceptions
      • Forcibly apply member promotion to v3store even when shouldApplyV3 is false
      • Use LookOutsideApply inside the apply workflow, so we have to disable the verification in e2e test case.
    • updated the apply workflow...

The first solution is #19586

  • good side
    • did not change the apply workflow
    • it's only executed on boostrap, it's an one-time operation
  • bad side
    • It updates the backend outside apply workflow, so has potential conflict. But actually it's protected by lock, so it's still safe.

Both solutions work. If there is no strong objection or preference, let's go for solution 2 (this PR)

cc @fuweid @ivanvc @jmhbnz @serathius @siyuanfoundation

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ahrtr
Copy link
Member Author

ahrtr commented Mar 17, 2025

cc @fuweid @ivanvc @jmhbnz @serathius @siyuanfoundation

Hi folks, can you spend at least 1 hour each day to move this forward?

@ivanvc ivanvc mentioned this pull request Mar 17, 2025
1 task
@@ -277,6 +277,9 @@ func StartEtcd(inCfg *Config) (e *Etcd, err error) {
return e, err
}
}

e.Server.SyncLearnerPromotionIfNeeded()
Copy link
Member

@serathius serathius Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% sure about consequences of calling sync here. I don't know if we really don't run any goroutines in background at this point, and proving so might be very hard.

By proposing to run it during bootstrap I meant within bootstrap.go

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is release-3.5 isntead of main or release-3.6, there is no bootstrap.go. Note that we need to do this after corruption check,

https://github.com/ahrtr/etcd/blob/2f55a7c5144d084541114340eb058fb3a37ad9b5/server/embed/etcd.go#L260-L281

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you think this change will interact with corrupt check? Even if we do the sync before corrupt check on one member, next member will start with corrupt check before sync.

Copy link
Member Author

@ahrtr ahrtr Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we really don't run any goroutines in background at this point, and proving so might be very hard.

We call SyncLearnerPromotionIfNeeded before starting the raft loop, before starting all the [periodically] jobs (i.e. monitorVersions, linearizableReadLoop, monitorKVHash, etc), also before starting to serve client & peer requests. It's safe. What's your concern here?

How do you think this change will interact with corrupt check?

If there is data corrupt, we should leave it as it is.

Copy link
Member

@fuweid fuweid Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking if we still need this migration.

I found there is publish API call during starting. Since we already allow to override member information, this publish can help us migration (if I understand it correctly).

Path: membership.MemberAttributesStorePath(s.id),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@serathius please let me know if you still have any concern on my previous comment. I think we should be all good now.

I am thinking if we still need this migration.

I found there is publish API call during starting. Since we already allow to override member information, this publish can help us migration (if I understand it correctly).

@fuweid I don't think we are on the same page. The publish logic is about attribute, what we are resolving/talking about in this PR (and issue 19557) is about IsLearner, which is part of RaftAttribute.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

override member information

Yes, this is true. thx for pointing out this.

Although the publish logic is only supposed to update attribute, but the implementation just overwrites the whole member info into v3store (bbolt), so it can automatically fix the issues which are affected by #19557 on top of the patch #19563

Just raised another PR with only e2e test cases #19629, which I believe is better than #19627

Copy link
Member Author

@ahrtr ahrtr Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For release-3.6, we don't need any further validation (i.e. requiring 3.5.20+ before upgrading to 3.6) any more, reasons:

  • If there are 2 or more (already promoted) learners, then etcd will crash anyway.
  • If there is only 1 (already promoted) learner, then etcd will automatically fix it due to the same reason as mentioned above.

So for release-3.6, we only need to add two e2e test cases:

Note that we still need to document the requirement of requiring 3.5.20+ before upgrading to 3.6, to avoid the potential upgrading failure (as mentioned above, etcd will crash if there are 2 or more learners).

@ahrtr ahrtr force-pushed the learner_sync_20250313_s2 branch from 3c99280 to 2f55a7c Compare March 18, 2025 11:39
…ers between v2store and v3store

Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants