OSCON & Code Mesh

I’m pleased to announce that I’ll be speaking at OSCON and Code Mesh in London this winter. Having heard great things about both events and looking at the lineup, I am very much looking forward to it. Drop me a line if your around in London during either event.

Title: Distributed Consensus: Making Impossible Possible


In this talk, we explore how to construct resilient distributed systems on top of unreliable components.

Starting, almost two decades ago, with Leslie Lamport’s work on organising parliament for a Greek island. We will take a journey to today’s datacenters and the systems powering companies like Google, Amazon and Microsoft. Along the way, we will face interesting impossibility results, machines acting maliciously and the complexity of today’s networks.

Ultimately, we will discover how to reach agreement between many parties and from this, how to construct new fault-tolerance systems that we can depend upon everyday.

Paper notes on S-Paxos [SRDS’12]

The following is a paper notes for “S-Paxos: Offloading the Leader for High Throughput State Machine Replication”. This paper was recommended to me as a example of high-throughput consensus, achieved by offloading responsibilities from the leader.

The paper starts by demonstrates that JPaxos is throughput limited by leader CPU, peaking at 70 kreqs/sec, where as the throughput of S-Paxos can reach 500 kreqs/sec.


The (normal case) algorithms works as follows:
  • Any node is able to receive a request for a client, lets call this coordinator
  • Coordinator sends request and its ID to all nodes
  • All nodes send ack with ID to all other nodes
  • When leader receives f+1 asks then sends phase 2a for ID
  • When leader receives f+1 successful phase 2B for ID then send commit for ID to all
  • When coordinator receives commit for ID, then execute request and reply to client
Path of request:
client -> all -> all -> all -> leader -> all -> client
1 + n + n^2 + n + n + n + 1 = n^2 + 4 + 1
client -> all -> all -> all -> all -> client
1 + n + n^2 + n + n^2 + 1 = 2n^2 + 2 + 1
depending on if message 2b is sent to all nodes or just leader
for comparison multi-paxos is:
client -> leader -> all -> leader -> client
1 + n + n + 1 = 2n + 2
client -> leader -> all -> all -> client
1 + n + n^2 + 1
depending on if message 2b is sent to all nodes or just leader
The paper proposes various optimizations such as batching and pipelining and message piggybacking to reduce network load


The evaluation demonstrated that under the right conditions S-Paxos can achieve 5x the throughput of JPaxos. Throughput the evaluation, graph x-axis shows number of closed loop clients, which are client who send the next request when the previous response is received. Without any indication of client latency, this did not tell us much about the rate of incoming requests. For example, the graphs in Figure 4 do need seem to be a fair comparison as # of client for Paxos and S-Paxos represents quite at different workloads.


I like the basic idea of this paper and its is interesting to see that latency was increased only by 1/3. However, the system places a substantially load on the network when compared to Paxos and the leader is still required to execute phase 2 paxos for each client request.

Paper Notes on PBFT [OSDI’99]

Practical Byzantine Fault Tolerance (PBFT) is the foundational paper in Byzantine consensus algorithm. Typically, distributed consensus algorithms assume nodes may fail (and possible recovery), but BFT extends this to tolerate nodes acting arbitrarily.


BPFT makes strong assumption about these arbitrary failures. Notability that at most f failures will occur in a system of 3f+1 nodes, this bound applies to both safety and progress guarantees. In contrast, in typical paxos style consensus algorithms the bound of f failures in 2f+1 node only applies to progress guarantee, safety will always be guaranteed even if the system is not able to make progress. BPFT makes the usual asynchronous assumption, common to many consensus algorithms.


Section 4.2 describes the algorithm in details, the core idea is that once the nodes have been made aware of the request for a given log index, they under go a two step process where each node notifies every one of their agreement and proceed when at least f+1 nodes in the 2f+1 responses also agree. At end of the second stage, the client can be notified. Likewise, the view change protocol is a similarly enhanced version of view change from VRR.


BPFT takes 1.5 RRTs to commit a request in the common case, all requests must reach the system via a master node and after this time, all nodes have knowledge that the request was committed. In contract, typical multi-paxos algorithms (e.g., Raft and VRR) take 1 RTT to commit a request in the common case, again all requests must go via a master node but after this time only the master node has knowledge that the request was committed. Typical leaderless consensus algorithm (e.g., single-valued paxos) take at least 2 RTT to commit a request but requests can be submitted to any node. All cases we exclude the RTT from the client to the system node and back.

BPFT differs from typical consensus algorithms in the number of messages exchanged. Whilst there exists many optimizations to piggyback messages and reduce message load, we can approximate estimate of number of messages for algorithms (this time we are including client communication). A typical multi-paxos algorithms, might send 2n+2 messages for a commit in a system of n nodes, whereas BPFT might send 1 + 2n + 2n^2, a significant difference.


PBFT in an interesting read and an excellent contribution to consensus. However, I am not sure how common it is that a modern datacenter might meet assumptions laid out by the paper, in particular that at most f nodes might be faulty in a system 3f+1 nodes. In this environment, nodes are usually running a same OS, configurations, software stacks and consensus implementation. Thus if an adversary is able compromise a machine then it’s likely they can compromise many, thus failures are very much dependent on each other. Lets see if the 17 years of research since has been able to improve on this.