Category Archives: Consensus

Release of “ARC: Analysis of Raft Consensus”

 “ARC: Analysis of Raft Consensus” is now available online as a UCAM technical report. 
http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-857.pdf

Abstract

The Paxos algorithm, despite being synonymous with distributed consensus for a decade, is famously difficult to reason about and implement due to its non-intuitive approach and underspecification. In response, this project implemented and evaluated a framework for constructing fault-tolerant applications, utilising the recently proposed Raft algorithm for distributed consensus. Constructing a simulation framework for our implementation enabled us to evaluate the protocol on everything from understandability and efficiency to correctness and performance in diverse network environments. We propose a range of optimisations to the protocol and released to the community a testbed for developing further optimisations and investigating optimal protocol parameters for real-world deployments.

Thank you everyone for your feedback.

Seeking Feedback on “ARC: Analysis of Raft Consensus”

My undergraduate dissertation “ARC: Analysis of Raft Consensus” will be submitted as a UCAM tech report. A draft is available here and I would be very grateful of any feedback.

Title: ARC: Analysis of Raft Consensus

Abstract:
The Paxos algorithm, despite being synonymous with distributed consensus for a decade, is famously difficult to reason about and implement due to its non-intuitive approach and underspecification. In response, this project implemented and evaluated a framework for constructing fault-tolerant applications, utilising the recently proposed Raft algorithm for distributed consensus. Constructing a simulation framework for our implementation enabled us to evaluate the protocol on everything from understandability and efficiency to correctness and performance in diverse network environments. We propose a range of optimisations to the protocol and released to the community a testbed for developing further optimisations and investigating optimal protocol parameters for real-world deployments.

EDIT 1: Regarding the difference between this tech report and my dissertation. I have cut out material i didn’t believe would be of general interest, such as how i used VC or lessons learned. If you would like a copy of the original dissertation (probably because your a Part 2 student yourself), just email me and I’ll be happy to provide you a copy.

EDIT 2: I’m pretty much happy to take feedback by any format, Comment below or email me at hh360 @ cam . ac . uk

EDIT 3: A massive thankyou to everyone who has provided feedback and help to disseminate this draft (by retweeting it)

EDIT 4: The code is open source (MIT licence) and available on GitHub. I’ve not linked to as its currently undergoing a refactoring / documenting process ready for release of v0.1. My plan is split the code base into two separate libraries, one will be a event-based simulator for distributed system and the other will be a standalone Raft implementation. I’ll update this blog (& twitter) when the code is ready

EDIT 5:  Wow. The response to this draft has been much greater than I expected (300+ downloads so far). Thank you so much to everyone in the community and of course Diego Ongaro. Diego’s Raft paper is online here and the Raft consensus site is here.

 

The problem with consensus

A distributed system is collection of nodes, each which there own local memory, which are able to communicate via message passing, cooperate to perform a computation. CAP theorem [1,2] argues that its not possible to achieve consistency, availability and partition tolerance. But machines will fail and there failures are tolerated (to varying extents) by replication. These replicates now need to agree on consistent worldviews, leading to the problem of consensus, originally proposed by [6], Consensus is when these nodes agree on a value, applications for this include:

  • mutable exclusion locks
  • committing a transaction to a database
  • distributed storage such as NFS
  • Implementing reliable broadcast
  • leader election

A protocol for consensus must provide the following to be “correct”:

  • Agreement: all correct nodes arrive at the same value, (the safety property)
  • Validity: the value chosen is one that was proposed by a correct node, (the non-triviality property)
  • Termination: all correct nodes eventually decide on a value, (the liveness property)

A correct node is a node that will eventency make progress so its hasn’t yet and will not experience any of the failures listed below:

QUESTION: Some papers [7] , list a 4th condition called unanimity, stating that if all nodes propose the same value, this will be the value chosen. It appears to me that this is just a case of validity, since the value chosen must have proposed by a node, so if all nodes propose the same value then there is only one possible value to be chosen, according to the validity condition. But this used in a few paper so I can’t be right, so why ? 

The following failures are possible:

  • Fail-stop – nodes may stop but will not restart
  • Fail-recover – nodes may stop and restart
  • Byzantine – nodes behavior unexpectedly, from fault or malicious

Assuming synchronous communication (reliable unicast with known bounded message delay and execution of nodes)  and less than 2/3ds of nodes experience Byzantine failures , we can achieve consensus [3] but a completely asynchronous consensus protocol cannot guarantee consensus with just a single fail-stop node [4], the intuition behind this is that you can’t detect a fail-stopped node, as it my just be slow or its messages not yet delivered. This can be masked by trying to detect failures with waiting on unresponsive nodes, comprising liveness or timeouts/heartbeats comprising accuracy. All consensus algorithms provide consistency but for this they comprises partition tolerance and/or availability.

Henceforth, I would like to consider processes communicating over TCP/IP, making the following assumptions:

  • Nodes have local state, no shared memory and local clocks (though they may be shared when processes are running on the same host)
  • If a path is available, message passing is reliable unicast with unbounded delay (thanks to TCP’s reliable delivery, data integrity and in order delivery) but network partitions are possible
  • Failure detection methods don’t need to be accurate, thus its ok to use timeouts and assume a node is dead when it is infact live
  • A node can tell which node sent a message (e.g. MACs/IPs) so a Byzantine node cannot forge a message which appears to come from an honest host

Approaches:

Two phase commit is depend on a single node, whereas three phase commit [5] can tolerant half its nodes failing but not network partitions or unbounded network delays.

My next post will consider some more complex approaches meeting the above requirements

 

[1] Brewer, E.  Rowards Robust Distributed System. Symposium on Principles of Distributed Computing (PODC).(2000).

[2] Gilbert, Seth, and Nancy Lynch. “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services.” ACM SIGACT News 33, no. 2 (2002): 51-59.

[3 Lamport, Leslie, Robert Shostak, and Marshall Pease. "The Byzantine generals problem." ACM Transactions on Programming Languages and Systems (TOPLAS) 4, no. 3 (1982): 382-401.

[4] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (April 1985), 374-382.

[5] Skeen, Dale; Stonebraker, M. (May 1983). “A Formal Model of Crash Recovery in a Distributed System”. IEEE Transactions on Software Engineering: 219–228

[6] Pease, M., Shostak, R., Lamport, L.: Reaching agreement in the presence of faults. J. ACM 27(2) (April 1980) 228–234