
Cannot Add a member during network partition with StrictReconfigCheck enabled #10114

Closed
andrewjstone opened this issue Sep 21, 2018 · 13 comments

@andrewjstone

As implemented, StrictReconfigCheck is very valuable in that it prevents adding unhealthy nodes to the cluster and prevents removing nodes when doing so would result in quorum loss. However, there is a third check that I believe is counterproductive, and I was wondering if we could remove it.

With StrictReconfigCheck enabled, it is impossible to add new nodes to the cluster when one node is down or partitioned, even if there is quorum. See

if !isConnectedFullySince(s.r.transport, time.Now().Add(-HealthInterval), s.ID(), s.cluster.Members()) {

This requirement means the procedure for replacing a "healthy" node differs from the procedure for replacing an "unhealthy" node. For the former, the recommendation is to add the new node first and then remove the old one; for the latter, the old node must be removed first. This inconsistency makes automating cluster changes slightly more complicated.

More importantly, the other node may be only temporarily partitioned, yet perfectly healthy. Forcing its removal before the cluster can be grown seems overly restrictive.

I am failing to see any practical benefit to this rule, and therefore am requesting that it be removed. Removal would make the node replacement procedure consistent, and allow increased resiliency in the face of partitions.
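The alternative being argued for could be expressed as a quorum check rather than a full-connectivity check. The following is a sketch of that idea under the issue's own assumptions (it is not code from etcd, and the function name is hypothetical):

```go
package main

import "fmt"

// quorumSafeToAdd sketches the quorum-based rule proposed in this issue:
// allow a member add as long as the currently reachable members would
// still form a majority of the cluster *after* the new member joins,
// i.e. even if the new member turns out to be unhealthy.
func quorumSafeToAdd(reachable, clusterSize int) bool {
	newSize := clusterSize + 1
	return reachable >= newSize/2+1
}

func main() {
	// 5-member cluster with one member partitioned: 4 reachable.
	// After the add there are 6 members and quorum is 4, so the add
	// is safe even if the new member never comes up.
	fmt.Println(quorumSafeToAdd(4, 5)) // true
	// 3-member cluster with one member down: 2 reachable. After the
	// add, quorum of 4 is 3, so a broken new member would cost quorum.
	fmt.Println(quorumSafeToAdd(2, 3)) // false
}
```

Under this rule the add described in the issue (quorum intact, one member partitioned) would be permitted in clusters of five or more.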

@xiang90
Contributor

xiang90 commented Sep 21, 2018

I am failing to see any practical benefit to this rule, and therefore am requesting that it be removed.

If the newly joined node itself is misconfigured, there will be no quorum anymore.

@andrewjstone
Author

Wouldn't it be better to check that the newly added node is healthy before allowing it to join the cluster? I thought that was already being done. Are you saying that there is no check for that, so instead you rely on full connectivity as a substitute?

@xiang90
Contributor

xiang90 commented Sep 21, 2018

Wouldn't it be better to check that the newly added node is healthy before allowing it to join the cluster?

That requires the learner feature that @gyuho and @jpbetz are working on. Basically, we need to test whether the newly added node is able to catch up with the cluster before promoting it to a full participant in the raft group.

@xiang90
Contributor

xiang90 commented Sep 21, 2018

Node health cannot simply be inferred from network connectivity.

@andrewjstone
Author

If the newly joined node itself is misconfigured, there will be no quorum anymore.

This is only true in 3 node clusters.

Node health cannot simply be inferred from network connectivity.

Yes, of course. But that is exactly what this check does. I'm simply recommending replacing this check with the same health check applied to the joining node, which I had assumed already existed. Since it doesn't, I suppose this requires larger changes, such as the upcoming learner feature.

I understand you are unwilling to make this change, and I will happily wait for the learner feature. It was quite a shock, though, to discover tests failing on our system because we couldn't add a node even though we had quorum. More specific documentation on that front would be helpful.

Thanks.

@andrewjstone
Author

This is only true in 3 node clusters.

3 and 4 node I should have said. I was only considering odd numbered cluster sizes above 2.
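The arithmetic behind that qualification can be spelled out. Assuming the scenario from this thread (one existing member partitioned, and the newly added member misconfigured), with the usual majority rule quorum(n) = n/2 + 1:

```go
package main

import "fmt"

// quorum returns the majority threshold for a cluster of n voting members.
func quorum(n int) int { return n/2 + 1 }

func main() {
	// Scenario from this thread: one existing member is partitioned,
	// and the newly added member turns out to be misconfigured.
	for _, n := range []int{3, 4, 5} {
		healthy := n - 1 // reachable members before the add
		total := n + 1   // cluster size after the broken member joins
		fmt.Printf("%d-member cluster: %d healthy of %d, need %d, quorum held: %v\n",
			n, healthy, total, quorum(total), healthy >= quorum(total))
	}
}
```

By this count only the 3-member case loses quorum outright (2 healthy of 4, needing 3); the 4-member case holds quorum exactly (3 of 5) with no room for any further failure, which may be why it is grouped with the 3-member case here.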

@xiang90
Contributor

xiang90 commented Sep 21, 2018

3 and 4 node I should have said. I was only considering odd numbered cluster sizes above 2.

We can probably lose that check for larger clusters.

@andrewjstone
Author

Great. Thanks.

@wenjiaswe
Contributor

cc @jpbetz

@stale

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2020
@purpleidea
Contributor

hi bot! please stop pinging here

@stale stale bot removed the stale label Apr 7, 2020
@andrewjstone
Author

What a silly thing it is to have a bot closing open bugs.

@stale

stale bot commented Jul 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 7, 2020
@stale stale bot closed this as completed Jul 28, 2020