Understanding the Split-Brain Problem in Distributed Systems

Distributed architectures have become increasingly popular in recent years. Many of us enjoy—or aspire to—working with these systems. As exciting as it sounds, choosing a distributed architecture introduces a completely different set of challenges. In this article, we’ll explore one of the challenges with distributed systems — the split-brain problem. It occurs when two or more nodes or groups of nodes stop communicating with each other due to a network partition or some other reason. This can potentially cause data inconsistencies or incorrect system behaviours.

This article delves into the split-brain problem in distributed systems and examines strategies to mitigate its occurrence.

Understanding the Split-Brain Problem

The term is borrowed from medicine. A split-brain condition is a medical phenomenon where communication between the left and right hemispheres of the brain is disrupted. Each hemisphere then operates independently, resulting in two distinct “brains” within the same body, each with its own perceptions, thoughts, and impulses.

Split-Brain Problem in the Real World

Before diving into the technicalities, let’s look at a real-world analogy.

Imagine a team of software developers, Amit, Bala, and Charu, discussing a pressing production defect. Amit suggests, “This should be quick. I can fix the issue; anyone can.”

However, Amit got disconnected from the meeting shortly afterwards because of internet connectivity problems. Since Amit was unreachable, Charu decided to fix the defect and informed Bala.

Meanwhile, Amit assumed he would take care of the defect as he had already taken the lead. Consequently, both Amit and Charu ended up doing duplicate work, leading to confusion and potential conflict within the team.

All this confusion happened because, in the absence of proper communication, both Amit and Charu assumed leadership and started working on the same thing.

Split-Brain Problem in Distributed Systems

Interestingly, something similar can happen to the nodes of a distributed system in a split-brain scenario.

So, instead of Amit, Bala, and Charu, let’s say we have an application running on three nodes: Node A, Node B, and Node C.

Assume Node A is the leader, whereas Node B and Node C are the replicas.

In this setup, Node A alone accepts write requests, which keeps the application consistent.

Then, a network issue occurs, and Node A goes down.

This leaves the application running on Node B and Node C only.

As a result, like Charu in our real-life example, Node C takes over the leadership of the cluster and starts accepting write requests.

There have been no issues so far; all is well, except that the cluster has one fewer node.

Things take an interesting turn when Node A comes back up but cannot rejoin the cluster, for example because it still cannot reach Node B and Node C.

As Node A was the leader before going down, it continues to act as the leader and starts accepting write requests again.

This is the split-brain problem: two leaders, Node A and Node C, are now accepting write requests at the same time.

Implications of the Split-Brain Problem

So, what can go wrong in the above example? As we have seen, both Node A and Node C started receiving write requests, unaware of the writes happening on the other. This can lead to data inconsistencies.

For example, Node A receives a write request and updates the bank balance to 5000. Simultaneously, Node C receives a write request to update the same balance to 3000. Which value should the system treat as correct?

From this example, it is clear that the split-brain problem can potentially leave the system in an inconsistent state. Consequently, it can cause the application to behave incorrectly, making it less reliable.
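
The divergence is easy to reproduce in a few lines. What follows is a minimal sketch, not a real replication protocol: the Node class and its write method are hypothetical names invented for this illustration, and each node simply keeps its own in-memory copy of the data.

```python
class Node:
    """A toy node that keeps its own in-memory copy of the data (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.is_leader = False
        self.data = {}

    def write(self, key, value):
        # Only a node that believes it is the leader accepts writes.
        if not self.is_leader:
            raise RuntimeError(f"{self.name} is not a leader")
        self.data[key] = value


node_a, node_c = Node("A"), Node("C")

# After the partition, both nodes believe they are the leader.
node_a.is_leader = True
node_c.is_leader = True

# Each leader accepts a conflicting write for the same key.
node_a.write("balance", 5000)
node_c.write("balance", 3000)

# The two copies now disagree, and neither node knows about the other write.
diverged = node_a.data["balance"] != node_c.data["balance"]
print(node_a.data, node_c.data, diverged)
```

Nothing in this toy model stops the second leader from accepting writes, which is exactly the gap that the mitigation strategies later in the article try to close.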

Causes of the Split-Brain Problem

As discussed earlier, the split-brain problem in distributed systems is primarily caused by disruptions in communication and coordination between the nodes or clusters.

Common causes:

Network Partitioning: The most common one, where the connectivity between the nodes is lost. This results in nodes or groups of nodes working in isolation.
Erroneous Fault Detection Mechanism: A faulty failure detector may mistakenly conclude that a healthy leader has failed, triggering unwanted corrective actions by the system.
Absence of Quorum Mechanism: Without a quorum mechanism, isolated nodes may continue operating even when they cannot reach a majority of the cluster.
Faulty Leader Election Algorithm: This may cause the cluster to elect multiple leaders, resulting in the split-brain issue.
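
To make the fault-detection cause concrete, here is a minimal sketch of a timeout-based failure detector. The function name suspects_leader, the timestamps, and the timeout values are all assumptions chosen for illustration; real systems use more elaborate detectors.

```python
def suspects_leader(last_heartbeat, now, timeout):
    """Return True if the leader is suspected to have failed.

    The detector only knows that no heartbeat arrived within `timeout`
    seconds; it cannot tell a crashed leader from a slow network.
    """
    return (now - last_heartbeat) > timeout


# The leader is alive, but its last heartbeat was delayed by 3 seconds.
last_heartbeat = 100.0
now = 103.0

# A tight timeout wrongly suspects the leader (a false positive)...
aggressive = suspects_leader(last_heartbeat, now, timeout=2.0)
# ...while a more lenient timeout tolerates the delay.
lenient = suspects_leader(last_heartbeat, now, timeout=5.0)
print(aggressive, lenient)
```

With the tight timeout, the replicas would start electing a new leader while the old one is still running, which is how a detection error can turn into two leaders.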

Dealing With the Split-Brain Problem

According to the Wikipedia article on split-brain in computing, there are two approaches for dealing with the split-brain problem: optimistic and pessimistic.

The optimistic approach prioritises availability over strong consistency. It allows isolated nodes to continue functioning during a network partition. Once the network issue is resolved, it relies on an automatic or manual reconciliation mechanism to restore the system to a consistent state.
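
As an illustration of automatic reconciliation, the sketch below merges two diverged copies using a last-write-wins rule. This is only one possible policy, picked here because it is short; the function name merge_last_write_wins and the timestamps are hypothetical, and the sketch assumes every write carries a comparable timestamp.

```python
def merge_last_write_wins(side_a, side_b):
    """Merge two diverged copies, keeping the newest value for each key.

    Each copy maps key -> (timestamp, value). Ties are resolved in favour
    of side_a. The losing write is silently discarded.
    """
    merged = dict(side_a)
    for key, (ts, value) in side_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged


# Diverged state after a partition, with hypothetical timestamps.
node_a = {"balance": (10, 5000)}
node_c = {"balance": (12, 3000), "status": (11, "active")}

merged = merge_last_write_wins(node_a, node_c)
print(merged)
```

Note that the write of 5000 is simply lost here. For data such as account balances that may be unacceptable, which is why some systems fall back to manual reconciliation or an application-specific merge.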

In contrast, the pessimistic approach emphasizes consistency over availability. A common solution in this scenario is the quorum-based consensus mechanism. This technique ensures that only the side of a partition holding a majority of the nodes remains operational. For instance, in the earlier example, Node A on its own is a minority (one node out of three), while Node B and Node C together form a majority. A quorum-based method would therefore stop Node A from accepting write requests and leave Node C as the only leader.
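
The majority check itself is tiny. Below is a minimal sketch applied to the three-node example; has_quorum is a hypothetical helper name, and the sketch assumes each node knows the configured cluster size and how many members (including itself) it can currently reach.

```python
def has_quorum(reachable_nodes, cluster_size):
    """Return True if this side of a partition holds a strict majority."""
    return reachable_nodes > cluster_size // 2


CLUSTER_SIZE = 3

# Node A is isolated and can only see itself.
node_a_can_write = has_quorum(1, CLUSTER_SIZE)
# Nodes B and C can still see each other.
node_bc_can_write = has_quorum(2, CLUSTER_SIZE)
print(node_a_can_write, node_bc_can_write)
```

Because two disjoint groups cannot both hold a strict majority, at most one side of any partition passes this check.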

Split-Brain Protection in Distributed Technologies

Many distributed technologies have built-in mechanisms to address the split-brain problem. My first encounter with the split-brain scenario was while working with Hazelcast. In this section, we’ll explore how some well-known distributed technologies tackle the challenges posed by the split-brain issue.

Hazelcast

Hazelcast provides split-brain protection by requiring a minimum cluster size for operation. It also supports split-brain recovery (the optimistic approach) for some distributed data structures.

MongoDB

MongoDB’s replica-set architecture avoids split-brain issues through majority-based elections, optional arbiter nodes (which vote in elections but hold no data), automatic step-down (a primary steps down to the secondary role if it can no longer reach a majority of voting members), and member priorities.

Elasticsearch

Elasticsearch also uses quorum-based consensus and a master election protocol to avoid split-brain scenarios. It ensures that only one master node is active at a time.

Etcd

According to the etcd documentation:

A network partition divides the etcd cluster into two parts; one with a member majority and the other with a member minority. The majority side becomes the available cluster and the minority side is unavailable; there is no “split-brain” in etcd.
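
The majority rule in this quote comes down to simple arithmetic. The sketch below is generic majority arithmetic rather than etcd code, and the function names are hypothetical; it shows how many members form a majority and how many failures a cluster of a given size can tolerate.

```python
def majority(cluster_size):
    """Smallest number of members that forms a strict majority."""
    return cluster_size // 2 + 1


def failure_tolerance(cluster_size):
    """How many members can be lost while a majority remains."""
    return cluster_size - majority(cluster_size)


# Columns: cluster size, majority, failure tolerance.
for size in range(1, 6):
    print(size, majority(size), failure_tolerance(size))
```

Going from three members to four raises the majority from two to three without raising the failure tolerance, which is one reason odd cluster sizes are the usual choice.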

Conclusion

To summarise, the split-brain problem is a complex challenge we must address in distributed systems.

In this article, we’ve discussed the problems that can occur if split-brain is left unchecked. We also saw that by understanding its causes and impacts, and by applying strategies such as quorum-based decision-making, robust leader election protocols, and fault-tolerant technologies, we can safeguard against these issues.

If you enjoyed this deep dive into the split-brain problem, don’t forget to check out my other articles on architecture.