Category Archives: Consensus

Paper Notes: The Network is Reliable [ACMQ July’14]

The Network is Reliable is an excellent article which attempts to formalise the discussion on real world failures for distributed systems. There is currently great debate on whether the assumption that network partitions are rare is too strong or too weak, for modern networks. Much of the data which we could use to answer this question is not published, instead is takes the form of anecdotes shared between sysadmins over a beer. This article lets the reader sit at the table and share the stories.

It is worth noting that network partition is used here in the broadest possible sense. We are looking at end-to-end communications and anything which can hinder these. This goes beyond simple link and switch failures to distributed GC, bugs in the NIC, and slow IO.

Here 5 examples of the anecdotes covered:

split brain caused by non-transitive reachability on EC2
redundancy doesn’t always prevent link failure
asymmetric reachability due to bugs in the NIC
GC and blocking for I/O can cause huge runtime latencies
short transient failures become long term problem with existing algorithms

So, what does this mean for Unanimous and my research on reliable consensus for real world networks?

My original inspiration for working on consensus algorithms was the need for reliable consensus at the internet edge. With consensus comes fault tolerance, from this we can construct new systems for preserving privacy and handling personal data, offering users a viable alternative to 3rd party centralised services.

However, it has become apparent that the issues of reliable consensus is much more general. This article illustrates that even within the constrained setting of a datacenter, distributed systems are failing to tolerance the wide range of possible failures.

EuroSysDW’15 Poster on Unanimous

Leave a reply

Here is the draft of my EuroSys doctoral workshop workshop poster. I would appreciate any feedback on the poster, I hope to sent it to the printers in 3/4 hours.

Unanimous: System Research Group talklet

Leave a reply

I’m looking forward to sharing my thoughts on consensus for the edge network with the SRG today, abstract below

Many projects in the SRG at the moment (HAT, UCN, contacts app, MirageOS for ARM, Jitsu, databox, signposts) are trying to give individuals an viable alternative to 3rd party centralised services and put them back in control of their personal data. However developing applications for the hostile edge network, with its heterogeneous hosts and networks, trust issues and poorly understood middle boxes is tricky. This is made worse by the fact that consensus algorithms are famously difficult to use, underspecified and based on decade old assumption about the internet. In this talklet, I will motivate Unanimous, a new consensus algorithm for the modern internet.

NB: this is a practice talk for EuroSys doctoral workshop next Tuesday, thus this 5 min talk will simply motivate a research direction instead of presenting a complete solution.

EDIT (17/4): these slides are now online

Conservative election in Byzantine Fault Tolerant Raft

1 Reply

Tangaroa, is a Byzantine tolerant Raft implementation in Haskell, developed by Christopher Copeland and Hongxia Zhong for a Distributed Systems class at Stanford. The authors apply many of the approaches in PBFT to Raft, allowing for the Byzantine failure of nodes. My interest in this work is how can you stop unnecessary leader elections in Raft algorithm using more conservative election approaches, given that election are too common in some environments (see s4.2 in ARC) .

Copeland and Zhong quickly identify this is one of the key barriers to developing a byzantine raft and describe their lazy voters approach. As with the original Raft protocol, a node time out, becomes a candidate, increments its terms and dispatches RequestVotes RPC’s. In addition to the normal restrictions on granting a vote, Tangaroa adds two extra conditions

new term must be < current_term + H, where H is a statistically unlikely number of split votes.
the current leader must have failed to dispatch AppendEntries within the heartbeat.

I am not convinced that the first condition will not lead the system to become permanently unavailable. For example, a network partition could separate the nodes into two groups thus one group will have a leader in a low term and the other group constantly trying to re-elect a new leader as it cannot get a majority. When network connectivity is restored, the nodes in will observe a sudden unbounded term increase. Thus nodes will not give out their votes and the system is stalled.

The author’s highlight how the second condition could be implemented. Nodes who are in a position to grant a vote except that the second condition isn’t meet, will record the vote locally and then dispatch the votes only when they observe a leader failure for themselves. This means that a node may only replace the leader when a majority of nodes detect a failure (or the client does) instead of when a single node detects a failure. The paper doesn’t go into much detail about this and how this would impact performance, I suspect that this approach is a little too conservative and many in some cases make it significantly more difficult to replace a fault leader. Various implementation details need to be consider here, for example, if a node times out and dispatches its vote, how long until the node steps up to candidate itself? or after how long does it “forget” its local vote, the election timeout maybe.

Writing a Byzantine Fault Tolerant Raft was always going to be hard, Tangaroa is a good first step in the right direction.

Poster for “Life on the Edge: New Abstractions for Edge Network Distributed Computing”

Leave a reply

Here is a draft copy of the A1 poster I’ll be presenting at the 2nd Annual Oxbridge Women in Computer Science Conference in Oxford. The poster abstract is in a previous post. Any feedback would be greatly appreciated.

version 1 (9:20 11/3)

version 2 (10:50 11/3)

now with left alignment of text on the left and right alignment of text on the right, gateway text on black router removed

VERSION 3 (11:00 11/3)

now with bolded keywords

FINAL VERSION (11:23 11/3)

new text for the aim box

Raft Refloated in the morning paper

I am incredibly excited to announce that Raft Refloated will be featured in The Morning Paper, by Adrian, as part of a series of 10 posts on distributed consensus. Last month, the morning paper also featured our work on the Databox.

Will the raft paper sink or swin? Let’s wait and see.

We’re coming to letter box near you

The January edition of the Operating System Review is now out, look out for “Raft Refloated: Do We Have Consensus?”. Also available without dead trees.

Raft Refloated: Do We Have Consensus?

Leave a reply

The January edition of SIGOPS Operating Systems Review is out now and thus is the aptly named “Raft Refloated: Do We Have Consensus?”. This is my first journal paper and I’m really existed to see what the community makes of it.

Title: Raft Refloated: Do We Have Consensus?
Authors: Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy and Jon Crowcroft
Paper: acm dl, open access link
Abstract: The Paxos algorithm is famously difficult to reason about and even more so to implement, despite having been synonymous with distributed consensus for over a decade. The recently proposed Raft protocol lays claim to being a new, understandable consensus algorithm, improving on Paxos without making compromises in performance or correctness.

In this study, we repeat the Raft authors’ performance analysis. We developed a clean-slate implementation of the Raft protocol and built an event-driven simulation framework for prototyping it on experimental topologies. We propose several optimizations to the Raft protocol and demonstrate their effectiveness under contention. Finally, we empirically validate the correctness of the Raft protocol invariants and evaluate Raft’s understandability claims.

Below is the key figure of the paper, showing a side-by-side comparison of the simulation results next to the authors’ original results.

Link

My undergraduate dissertation has made it to the front page of Hackernews with 80 points, https://news.ycombinator.com/item?id=8736868.

Release of “ARC: Analysis of Raft Consensus”

Leave a reply

“ARC: Analysis of Raft Consensus” is now available online as a UCAM technical report.
http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-857.pdf

Abstract

The Paxos algorithm, despite being synonymous with distributed consensus for a decade, is famously difficult to reason about and implement due to its non-intuitive approach and underspecification. In response, this project implemented and evaluated a framework for constructing fault-tolerant applications, utilising the recently proposed Raft algorithm for distributed consensus. Constructing a simulation framework for our implementation enabled us to evaluate the protocol on everything from understandability and efficiency to correctness and performance in diverse network environments. We propose a range of optimisations to the protocol and released to the community a testbed for developing further optimisations and investigating optimal protocol parameters for real-world deployments.

Thank you everyone for your feedback.

Read, Write, Execute

Notebook of a researcher in distributed systems.

Category Archives: Consensus

Paper Notes: The Network is Reliable [ACMQ July’14]

EuroSysDW’15 Poster on Unanimous

Unanimous: System Research Group talklet

Conservative election in Byzantine Fault Tolerant Raft

Poster for “Life on the Edge: New Abstractions for Edge Network Distributed Computing”

version 1 (9:20 11/3)

version 2 (10:50 11/3)

VERSION 3 (11:00 11/3)

FINAL VERSION (11:23 11/3)

Raft Refloated in the morning paper

We’re coming to letter box near you

Raft Refloated: Do We Have Consensus?

Release of “ARC: Analysis of Raft Consensus”

Abstract