Description
G'day,
I'm using ElasticSearch 0.19.11 with the unicast Zen discovery protocol.
With this setup, I can easily split a 3-node cluster into two 'hemispheres' (continuing with the brain metaphor) with one node acting as a participant in both hemispheres. I believe this to be a significant problem, because minimum_master_nodes is now incapable of preventing certain split-brain scenarios.
Here's what my 3-node test cluster looked like before I broke it:
Here's what the cluster looked like after simulating a communications failure between nodes (2) and (3):
Here's what seems to have happened immediately after the split:
- Node (2) and (3) lose contact with one another (zen-disco-node_failed ... reason failed to ping).
- Node (2), still master of the left hemisphere, notes the disappearance of node (3) and broadcasts an advisory message to all of its followers. Node (1) takes note of the advisory.
- Node (3) has now lost contact with its old master and decides to hold an election. It declares itself the winner, assumes the master role for the right hemisphere, then broadcasts an advisory message to all of its followers. Node (1) takes note of this advisory, too.
At this point, I can't say I know what to expect to find on node (1). If I query both masters for a list of nodes, I see node (1) in both clusters.
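To make that concrete, here is a minimal sketch (not part of the original report) of how each node's view could be compared over the HTTP API. It assumes each node listens on port 9200 and that the /_cluster/state response carries the master_node and nodes fields; the hostnames are illustrative.

```python
# Sketch: ask every node for its own view of the cluster and compare.
# Hostnames and the port are assumptions; adjust for your environment.
import json
from urllib.request import urlopen

for host in ("node1", "node2", "node3"):
    state = json.load(urlopen(f"http://{host}:9200/_cluster/state"))
    members = sorted(state["nodes"])          # node IDs this host can see
    print(f"{host}: master={state['master_node']} members={members}")
```

In the split described above, node (2) and node (3) each report themselves as master, while node (1) appears in both membership lists.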
Let's look at minimum_master_nodes as it applies to this test cluster. Assume I had set minimum_master_nodes to 2. Had node (3) been completely isolated from nodes (1) and (2), I would not have run into this problem. The left hemisphere would have enough nodes to satisfy the constraint; the right hemisphere would not. This would continue to work for larger clusters (with an appropriately larger value for minimum_master_nodes).
The problem with minimum_master_nodes is that it does not work when the split brains are intersecting, as in my example above. Even on a larger cluster of, say, 7 nodes with minimum_master_nodes set to 4, all that needs to happen is for the 'right' two nodes to lose contact with one another (forcing a master election) for the cluster to split.
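A tiny sketch of the arithmetic (illustrative node names, not Elasticsearch code) shows why the overlap defeats the setting: the shared node is counted toward the quorum on both sides.

```python
# Why overlapping 'hemispheres' defeat minimum_master_nodes: the shared node
# is counted toward the quorum on both sides. Node names are illustrative.

def satisfies_minimum(visible_nodes, minimum_master_nodes):
    return len(visible_nodes) >= minimum_master_nodes

left  = {"node1", "node2"}   # node (2)'s view after the partial split
right = {"node1", "node3"}   # node (3)'s view after the partial split

print(satisfies_minimum(left, 2), satisfies_minimum(right, 2))   # True True

# A clean (non-overlapping) split would have starved the smaller side instead:
print(satisfies_minimum({"node3"}, 2))                           # False
```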
Is there anything that can be done to detect the intersecting split on node (1)?
Would #1057 help?
Am I missing something obvious? :)
Activity
moscht commented on Dec 18, 2012
We also had at some point a similar issue, where minimum_master_nodes did not prevent the cluster from having two different views of the nodes at the same time.
As our indices were created automatically, some of the indices were created twice, once in each half of the cluster, with the two masters broadcasting different states, and after a full cluster restart some shards could not be allocated because the state had been mixed up. This was on 0.17, so I am not sure whether data would still be lost, as the state is now saved with the shards. But the other question is what happens when an index exists twice in the cluster (having been created on each master).
I think we should have a method to recover from such a situation. As I don't know exactly how Zen discovery works, I cannot say how to solve it, but IMHO a node should only ever be in one cluster: in your second image, node (1) should either be with node (2), preventing node (3) from becoming master, or with node (3), preventing node (2) from staying master.
tallpsmith commented on Dec 18, 2012
See issue #2117 as well. I'm not sure whether unicast discovery is making it worse for you, but I think we captured the underlying problem over on that issue; I'd like your thoughts too.
saj commented on Dec 20, 2012
From #2117:
Ditto.
I see a split on the first partial isolation. To me, these bug reports look like two different problems.
trollybaz commented on Apr 3, 2013
I believe I ran into this issue yesterday in a 3-node cluster: a node elects itself master when the current master is disconnected from it. The remaining participant node toggles between the other two nodes as its master before settling on one. Is this what you saw, @saj?
saj commented on Apr 3, 2013
Yes, @trollybaz.
I ended up working around the problem (in testing) by using elasticsearch-zookeeper in place of Zen discovery. We already had reliable Zookeeper infrastructure up for other applications, so this approach made a whole lot of sense to me. I was unable to reproduce the problem with the Zookeeper discovery module.
tallpsmith commented on Apr 4, 2013
I'm pretty sure we're suffering from this in certain situations, and I don't think that it's limited to unicast discovery.
We've had some bad networking and some virtual machine stalls (the result of SAN issues, or VMware doing weird stuff), and even heavy GC activity can cause enough of a pause for aspects of the split brain to occur.
We were originally running a pre-0.19.5 release; 0.19.5 contained an important fix for an edge case I thought we were suffering from, but since moving to 0.19.10 we've had at least one split brain (VMware/SAN related) that caused one of the three ES nodes to lose touch with the master and declare itself master, while still maintaining links back to the other nodes.
I'm going to be tweaking our ES logging config to output DEBUG-level discovery logs to a separate file so that I can properly trace these cases, but there have simply been too many of these incidents not to conclude that ES is mishandling these adversarial environments.
I believe #2117 is still an issue and is an interesting edge case, but I think this issue here best represents the majority of the problems people are having. My gut/intuition is that the probability of this issue occurring drops with a larger cluster, so the 3-node, minimum_master_nodes=2 configuration is the most prevalent case.
It seems like when the 'split brain' new master connects to its known child nodes, any node that already has an upstream connection to an existing master should probably flag it as a problem and tell the newly connected master node, "hey, I don't think you fully understand the cluster situation".
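Purely as an illustration of that idea (this is not how Zen discovery is actually implemented), a follower that already has a master could refuse the second announcement rather than quietly joining both sides:

```python
# Illustrative sketch of the check proposed above, not actual Elasticsearch
# code: a node that already follows a master rejects announcements from a
# second would-be master instead of silently accepting both.

class Node:
    def __init__(self, name):
        self.name = name
        self.current_master = None

    def handle_master_announcement(self, master_name):
        if self.current_master is None or self.current_master == master_name:
            self.current_master = master_name
            return "accepted"
        # Already following a different master: flag the conflict.
        return (f"rejected: already following {self.current_master}; "
                f"{master_name} may not see the whole cluster")

node1 = Node("node1")
print(node1.handle_master_announcement("node2"))   # accepted
print(node1.handle_master_announcement("node3"))   # rejected: already following node2 ...
```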
brusic commented on Apr 5, 2013
I believe there are two issues at hand. One is the set of possible culprits for a node being disconnected from the cluster: network issues, long GC pauses, a discovery bug, etc. The other issue, and the more important one IMHO, is the failure of the master election process to detect that a node belongs to two separate clusters (with different masters). Clusters should embrace node failures for whatever reason, but master election needs to be rock solid. That's a tough problem in systems without an authoritative process such as ZooKeeper.
To add more data to the issue: I have seen the problem on two different 0.20RC1 clusters, one with eight nodes, the other with four.
tallpsmith commented on Apr 5, 2013
I'm not sure the former is really something ES should be actively dealing with; the latter I agree with, and it is the main point here: how ES detects and recovers from cases where two masters have been elected.
There was supposed to have been some code in, I think, 0.19.5 that 'recovers' from this state by choosing the side that has the most recent ClusterStatus object (see issue #2042), but it doesn't appear to be working as expected in practice, because we get these child nodes accepting connections from multiple masters.
I think gathering the discovery-level DEBUG logging from the multiple nodes and presenting it here is the only way to get further traction on this case.
It's possible going through the steps in Issue #2117 may uncover edge cases related to this one (even though the source conditions are different); at least it might be a reproducible case to explore.
@s1monw nudge - have you had a chance to look into #2117 at all... ? :)
brusic commented on Apr 5, 2013
Paul, I agree that the former is not something to focus on. Should have stated that. :) The beauty of many of the new big data systems is that they embrace failure. Nodes will come and go, either due to errors or just simple maintenance. #2117 might have a different source condition, but the recovery process after the fact should be identical.
I have enabled DEBUG logging at the discovery level and I can pinpoint when a node has left/joined a cluster, but I still have no insights on the election process.
tallpsmith commented on May 24, 2013
We suffered from this the other day when an accidental provisioning error left a 4 GB ES heap running on a host with only 4 GB of OS memory, which was always going to end in trouble. The node swapped, the process hung, and the intersection issue described here happened.
Yes, the provisioning error could have been avoided, and yes, use of mlockall probably would have prevented the destined-to-die-a-horrible-swap-death, but there are other scenarios that can cause a hung process (bad I/O causing stalls, for example) where the way ES handles the cluster state is poor and leads to this problem.
We very much hope someone is looking hard into ways to make ES a bit more resilient in these situations, to improve data integrity... (goes down on bended knee while pleading)
otisg commented on May 24, 2013
Btw, why not adopt ZK, which I believe would make this situation impossible(?). I don't love the extra process/management that the use of ZK would imply... though maybe it could be embedded, like in SolrCloud, to work around that?
brusic commented on May 24, 2013
From my understanding, the single embedded Zookeeper model is not ideal for production, and a full Zookeeper cluster is preferred. I have never tried it myself, so I cannot personally comment.
s1monw commented on May 24, 2013
FYI - there is a zookeeper plugin for ES
otisg commented on May 24, 2013
Oh, I didn't mean to imply a single embedded ZK; I meant N of them in different ES processes. Right, Simon, there is the plugin, but I suspect people are afraid of using it because it's not clear whether it's 100% maintained, whether it works with the latest ES, and so on. So my question is really about adopting something like that and supporting it officially. Is that a possibility?
mpalmer commented on May 24, 2013
@otisg: The problem with the ZK plugin is that, with clients being part of the cluster, they need to know about ZK in order to discover the servers in the cluster. Some client libraries (such as the one used by the application that started this bug report -- I'm a colleague of Saj's) don't support ZK discovery. In order for ZK to be a useful alternative in general, there either needs to be universal support for ZK in client libraries, or a backwards-compatible way for non-ZK-aware client libraries to discover the servers (perhaps a ZK-to-Zen translator or something... I don't know, I've got bugger-all knowledge of how ES actually works under the hood).
[89 remaining items not shown]
aphyr commented on Apr 4, 2015
I'm not sure why this issue was closed--people keep citing it and saying the problem is solved, but the Jepsen test from earlier in this thread still fails. Partial network partitions (and, for that matter, clean network partitions, and single-node partitions, and single-node pauses) continue to result in split-brain and lost data, for both compare-and-set and document-creation tests. I don't think the changes from #7493 were sufficient to solve the problem, though they may have improved the odds of successfully retaining data.
For instance, here's a test in which we induce randomized 120-second-long intersecting partitions, for 600 seconds, with 10 seconds of complete connectivity in between each failure. This pattern resulted in 22/897 acknowledged documents being lost due to concurrent, conflicting primary nodes. You can reproduce this in Jepsen 7d0a718 by going to the elasticsearch directory and running lein test :only elasticsearch.core-test/create-bridge -- it may take a couple of runs to actually trigger the race, though.
bleskes commented on Apr 4, 2015
This issue, as it is stated, relates to having two master nodes elected during a partial network split, despite min_master_nodes. This issue should be solved now. The thinking is that we will open issues for different scenarios as they are discovered. An example is #7572, as well as your recent tickets (#10407 & #10426). Once we figure out the root cause of those failures (and the one mentioned in your previous comment), and if it turns out to be similar to this issue, it will of course be re-opened.
speedplane commented on Feb 26, 2016
Not directly on topic for this issue, but why is it so difficult to avoid/prevent this split-brain problem? If there are two master nodes on a network (i.e., a split-brain configuration), why can't there be some protocol for the two masters to figure out which one should become a slave?
I imagine some mechanism would need to detect that the system is in a split-brain state, and then a heuristic would be applied to choose the real master (e.g., the oldest running server, the one with the most docs, a random choice, etc.). This probably takes work to do, but it does not seem too difficult.
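For illustration only, a tie-break along those lines might look like the sketch below; the fields and their ordering are hypothetical, not Elasticsearch's actual election logic, and, as the next comment points out, it presupposes the two masters can still exchange this information at all.

```python
# Hypothetical tie-break between two nodes that both believe they are master:
# prefer the longest-running node, then the one holding more docs, then fall
# back to a deterministic ID comparison. Field names are illustrative.

def pick_surviving_master(a, b):
    def key(n):
        return (-n["uptime_seconds"], -n["doc_count"], n["node_id"])
    return min((a, b), key=key)

winner = pick_surviving_master(
    {"node_id": "A", "uptime_seconds": 86400, "doc_count": 1_000_000},
    {"node_id": "B", "uptime_seconds": 3600,  "doc_count": 1_000_500},
)
print(winner["node_id"])   # "A": the longer-running node wins this tie-break
```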
hamiltop commented on Feb 26, 2016
Michael: Split brain occurs precisely because the two masters can't communicate. If they could, they would resolve it.
speedplane commented on Feb 26, 2016
Got it. Earlier this week two nodes in my cluster appeared to be fighting for who was the master of the cluster. They were both on the same network and I believe were in communication with each other, but they went back and forth over which was the master. I shut down one of the nodes, gave it five minutes, restarted that node, and everything was fine. I thought that this was a split brain issue, but I guess it may be something else.
jasontedor commented on Feb 26, 2016
@speedplane Do you have exactly two master-eligible nodes? Do you have minimum master nodes set to two (if you're going to run with exactly two master-eligible nodes you should, although this means that your cluster becomes semi-unavailable if one of the masters faults; ideally if you have multiple master-eligible nodes you'll have at least three and have minimum master nodes set to a quorum of them)?
Split brain is when two nodes in a cluster are simultaneously acting as masters for that cluster.
speedplane commented on Feb 26, 2016
@jasontedor Yes, I had exactly two nodes, and minimum master nodes was set to one. I did this intentionally for the exact reason you described. It appeared that the two nodes were simultaneously acting as a master, but they were both in communication with each other, so shouldn't they be able to resolve it, as @hamiltop suggests?
jasontedor commented on Feb 26, 2016
@speedplane This is bad because it does subject you to split brain.
That's not what I recommend. Either drop to one (and lose high-availability), or increase to three (and set minimum master nodes to two).
What evidence do you have that they were simultaneously acting as master? How do you know that they were in communication with each other? What version of Elasticsearch?
speedplane commented on Feb 26, 2016
In the Big Desk plugin, the little star next to the node name kept bouncing back and forth between my two nodes (see screenshot).
I don't think I explicitly tested whether one could contact the other, but I was able to ssh into both, they were on the same network, and there did not appear to be any network issues.
1.7.3
jasontedor commented on Feb 26, 2016
@speedplane I'm not familiar with the Big Desk plugin, sorry. Let's just assume that it's correct and as you say. Have you checked the logs or any other monitoring for repeated long-running garbage collection pauses on both of these nodes?
Networks are fickle things but I do suspect something else here.
Thanks.
XANi commented on Feb 26, 2016
@speedplane The "2-node situation" is inherently hard to deal with because there is no single metric by which it could be decided which node should be shot down.
"Most written to" or "last written to" doesn't really mean much, and in most cases alerting that something is wrong is preferable to "just throw away whatever the other node had".
That is why a lot of distributed software recommends at least 3 nodes: with 3 there is always a majority, so you can set it up to only allow requests if at least n/2 + 1 nodes are up.
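For reference, a quick check of that n/2 + 1 rule (integer division), showing why two nodes can never out-vote each other while three or more tolerate failures:

```python
# Quorum sizes for small clusters using the n/2 + 1 rule mentioned above.
for n in (2, 3, 5, 7):
    quorum = n // 2 + 1
    print(f"{n} nodes: quorum {quorum}, tolerates {n - quorum} failure(s)")
# 2 nodes: quorum 2, tolerates 0 failure(s)
# 3 nodes: quorum 2, tolerates 1 failure(s)
# 5 nodes: quorum 3, tolerates 2 failure(s)
# 7 nodes: quorum 4, tolerates 3 failure(s)
```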