Cannot Add a member during network partition with StrictReconfigCheck enabled #10114

Closed
@andrewjstone

Description

As implemented, StrictReconfigCheck is very valuable in that it prevents adding unhealthy nodes to the cluster and prevents removing nodes when doing so would result in quorum loss. However, there is a third check that I believe is counterproductive, and I was wondering whether we could remove it.

With StrictReconfigCheck enabled, it is impossible to add new nodes to the cluster when one node is down or partitioned, even if there is quorum. See

if !isConnectedFullySince(s.r.transport, time.Now().Add(-HealthInterval), s.ID(), s.cluster.Members()) {

This requirement makes it such that the procedure for replacing a "healthy" node is different than the procedure for replacing an "unhealthy" node. For the former, it is recommended that new nodes should be added first, and then old nodes removed. For the latter the old node should be removed first. This is an inconsistency, and it makes automating cluster changes slightly more complicated.
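To illustrate the branching this forces on automation, here is a small sketch. The function name and step strings are hypothetical; it only encodes the two orderings described above.

```go
package main

import "fmt"

// replacementPlan returns the order of operations for replacing a member,
// following the two procedures described in this issue. The step names are
// illustrative labels, not etcd commands.
func replacementPlan(oldNodeHealthy bool) []string {
	if oldNodeHealthy {
		// Healthy node: grow the cluster first, then shrink it.
		return []string{"add new member", "remove old member"}
	}
	// Unhealthy or partitioned node: the strict check rejects the add,
	// so the old member has to be removed first.
	return []string{"remove old member", "add new member"}
}

func main() {
	fmt.Println(replacementPlan(true))
	fmt.Println(replacementPlan(false))
}
```

Without the full-connectivity check, the second branch would be unnecessary and both cases could use the add-first ordering.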

More importantly however, the other node may just be temporarily partitioned, yet perfectly healthy. Forcing removal in order to increase fault tolerance seems to be a very restrictive requirement.

I am failing to see any practical benefit to this rule, and therefore am requesting that it be removed. Removal would make the node replacement procedure consistent, and allow increased resiliency in the face of partitions.

Activity

xiang90 commented on Sep 21, 2018
Contributor

I am failing to see any practical benefit to this rule, and therefore am requesting that it be removed.

If the newly joined node itself is misconfigured, there will be no quorum anymore.
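The arithmetic behind this concern can be sketched as follows, assuming the usual majority quorum of n/2 + 1 and a new member that never becomes healthy. The function names are hypothetical.

```go
package main

import "fmt"

// quorum returns the majority size for a cluster of n voting members.
func quorum(n int) int { return n/2 + 1 }

// hasQuorumAfterBadAdd reports whether a cluster with `members` voting
// members, `down` of which are unreachable, still has quorum after one more
// member is added and that new member never becomes healthy.
func hasQuorumAfterBadAdd(members, down int) bool {
	healthy := members - down
	return healthy >= quorum(members+1)
}

func main() {
	// 3 members, 1 down: 2 healthy, but a 4-member cluster needs 3.
	fmt.Println(hasQuorumAfterBadAdd(3, 1))
	// 5 members, 1 down: 4 healthy, and a 6-member cluster needs 4.
	fmt.Println(hasQuorumAfterBadAdd(5, 1))
}
```

For a 3-member cluster with one member down, a bad addition leaves 2 healthy members out of 4, which is below the required 3.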

andrewjstone commented on Sep 21, 2018
Author

Wouldn't it be better to check that the newly added node is healthy before allowing it to join the cluster? I thought that was already being done. Are you saying that there is no check for that, so instead you rely on full connectivity as a substitute?

xiang90 commented on Sep 21, 2018
Contributor

Wouldn't it be better to check that the newly added node is healthy before allowing it to join the cluster?

That requires the learner feature that @gyuho and @jpbetz are working on. Basically, we need to test whether the newly added node is able to catch up with the cluster before promoting it to a participant in the raft group.
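A rough sketch of the "catch up before promotion" idea. The rule below (promote once the learner's log index is within `maxLag` entries of the leader's) and all names are assumptions for illustration; the actual criterion is up to the learner design and is not specified in this thread.

```go
package main

import "fmt"

// readyToPromote is a hypothetical promotion rule: a learner may become a
// voting member once its replicated log index is within maxLag entries of
// the leader's index.
func readyToPromote(leaderIndex, learnerIndex, maxLag uint64) bool {
	if learnerIndex >= leaderIndex {
		return true // fully caught up; also avoids unsigned underflow below
	}
	return leaderIndex-learnerIndex <= maxLag
}

func main() {
	// A freshly added learner that is far behind is not promoted.
	fmt.Println(readyToPromote(10000, 120, 100))
	// Once it has nearly caught up, promotion is allowed.
	fmt.Println(readyToPromote(10000, 9950, 100))
}
```

The point of such a gate is that a misconfigured node never reaches the voting set, so it cannot affect quorum.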

xiang90 commented on Sep 21, 2018
Contributor

Node health cannot simply be inferred from network connectivity.

andrewjstone commented on Sep 21, 2018
Author

If the newly joined node itself is misconfigured, there will be no quorum anymore.

This is only true in 3 node clusters.

Node health cannot simply be inferred from network connectivity.

Yes, of course. But that's also exactly what this check does. I'm simply recommending replacing this check with the same one applied to the joining node, which I thought already existed. Since it doesn't, I suppose it requires larger changes, such as the coming learner change.

I understand you are unwilling to make this change, and will happily wait for the learner change. It was quite a shock though to discover tests failing on our system because we couldn't add a node even though we had quorum. More specific documentation on that front may be helpful.

Thanks.

andrewjstone commented on Sep 21, 2018
Author

This is only true in 3 node clusters.

3 and 4 node clusters, I should have said. I was only considering odd-numbered cluster sizes above 2.

xiang90 commented on Sep 21, 2018
Contributor

3 and 4 node clusters, I should have said. I was only considering odd-numbered cluster sizes above 2.

We can probably loosen that check for larger clusters.

andrewjstone commented on Sep 21, 2018
Author

Great. Thanks.

wenjiaswe commented on Sep 24, 2018
Contributor

stale commented on Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

purpleidea commented on Apr 7, 2020
Contributor

Hi bot! Please stop pinging here.

andrewjstone commented on Apr 8, 2020
Author

What a silly thing it is to have a bot closing open bugs.

stale commented on Jul 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

          Cannot Add a member during network partition with StrictReconfigCheck enabled · Issue #10114 · etcd-io/etcd