Tangaroa, is a Byzantine tolerant Raft implementation in Haskell, developed by Christopher Copeland and Hongxia Zhong for a Distributed Systems class at Stanford. The authors apply many of the approaches in PBFT to Raft, allowing for the Byzantine failure of nodes. My interest in this work is how can you stop unnecessary leader elections in Raft algorithm using more conservative election approaches, given that election are too common in some environments (see s4.2 in ARC) .
Copeland and Zhong quickly identify this is one of the key barriers to developing a byzantine raft and describe their lazy voters approach. As with the original Raft protocol, a node time out, becomes a candidate, increments its terms and dispatches RequestVotes RPC’s. In addition to the normal restrictions on granting a vote, Tangaroa adds two extra conditions
- new term must be < current_term + H, where H is a statistically unlikely number of split votes.
- the current leader must have failed to dispatch AppendEntries within the heartbeat.
I am not convinced that the first condition will not lead the system to become permanently unavailable. For example, a network partition could separate the nodes into two groups thus one group will have a leader in a low term and the other group constantly trying to re-elect a new leader as it cannot get a majority. When network connectivity is restored, the nodes in will observe a sudden unbounded term increase. Thus nodes will not give out their votes and the system is stalled.
The author’s highlight how the second condition could be implemented. Nodes who are in a position to grant a vote except that the second condition isn’t meet, will record the vote locally and then dispatch the votes only when they observe a leader failure for themselves. This means that a node may only replace the leader when a majority of nodes detect a failure (or the client does) instead of when a single node detects a failure. The paper doesn’t go into much detail about this and how this would impact performance, I suspect that this approach is a little too conservative and many in some cases make it significantly more difficult to replace a fault leader. Various implementation details need to be consider here, for example, if a node times out and dispatches its vote, how long until the node steps up to candidate itself? or after how long does it “forget” its local vote, the election timeout maybe.
Writing a Byzantine Fault Tolerant Raft was always going to be hard, Tangaroa is a good first step in the right direction.
I’m implementing this protocol myself and encountered this problem.
Besides some inconsistency between what paper says about lazy voting and what reference implementation does (the latter only records one lazy vote, and replaces it each time when a lower term vote arrives, within the permitted window), there is the obvious problem with a ‘run-away candidate(s)’ you stated (it can even happen without network splitting because of election time-out randomization).
What I was thinking about, is a mechanism of telling those run-aways: “Hey, dude, catch with us!”. I’m still in the process of thinking about details, because it can easily open a vector for an attack.
The idea is to record last term with the leader and allowing looking back in the interval [lastLeaderTerm, myTerm] under certain conditions, e.g. when myTerm > lastLeaderTerm + H.
Two cases:
1. The cluster has a leader. Catch an AppendEntries from a server. Then request it’s quorum, as in protocol. Possible problem is replay attack from a byzantine server that was a leader in [lastLeaderTerm, myTerm]. This shouldn’t be a problem if committed entries are not overwritten (it should be taken care of). Other than that, it reproduces historical advancement of the protocol. If a newer leader is alive and reachable, the process will repeat itself.
2. The cluster does not have a leader. Then RequestVote RPCs are flying around, and the node to adjust itself to the median term among them after it has a quorum of messages from different senders. Only requests with term higher than lastLeaderTerm are considered.
This should work both forward and backward, and should retain log prefix property since any quorum will be a real quorum for a term higher than node’s thus the node will make progress adjusting its lastLeaderTerm. Under regular conditions the protocol is unchanged.
What do you think?