OSCON & Code Mesh

I’m pleased to announce that I’ll be speaking at OSCON and Code Mesh in London this winter. Having heard great things about both events and looking at the lineup, I am very much looking forward to it. Drop me a line if your around in London during either event.

Title: Distributed Consensus: Making Impossible Possible


In this talk, we explore how to construct resilient distributed systems on top of unreliable components.

Starting, almost two decades ago, with Leslie Lamport’s work on organising parliament for a Greek island. We will take a journey to today’s datacenters and the systems powering companies like Google, Amazon and Microsoft. Along the way, we will face interesting impossibility results, machines acting maliciously and the complexity of today’s networks.

Ultimately, we will discover how to reach agreement between many parties and from this, how to construct new fault-tolerance systems that we can depend upon everyday.

Paper notes on S-Paxos [SRDS’12]

The following is a paper notes for “S-Paxos: Offloading the Leader for High Throughput State Machine Replication”. This paper was recommended to me as a example of high-throughput consensus, achieved by offloading responsibilities from the leader.

The paper starts by demonstrates that JPaxos is throughput limited by leader CPU, peaking at 70 kreqs/sec, where as the throughput of S-Paxos can reach 500 kreqs/sec.


The (normal case) algorithms works as follows:
  • Any node is able to receive a request for a client, lets call this coordinator
  • Coordinator sends request and its ID to all nodes
  • All nodes send ack with ID to all other nodes
  • When leader receives f+1 asks then sends phase 2a for ID
  • When leader receives f+1 successful phase 2B for ID then send commit for ID to all
  • When coordinator receives commit for ID, then execute request and reply to client
Path of request:
client -> all -> all -> all -> leader -> all -> client
1 + n + n^2 + n + n + n + 1 = n^2 + 4 + 1
client -> all -> all -> all -> all -> client
1 + n + n^2 + n + n^2 + 1 = 2n^2 + 2 + 1
depending on if message 2b is sent to all nodes or just leader
for comparison multi-paxos is:
client -> leader -> all -> leader -> client
1 + n + n + 1 = 2n + 2
client -> leader -> all -> all -> client
1 + n + n^2 + 1
depending on if message 2b is sent to all nodes or just leader
The paper proposes various optimizations such as batching and pipelining and message piggybacking to reduce network load


The evaluation demonstrated that under the right conditions S-Paxos can achieve 5x the throughput of JPaxos. Throughput the evaluation, graph x-axis shows number of closed loop clients, which are client who send the next request when the previous response is received. Without any indication of client latency, this did not tell us much about the rate of incoming requests. For example, the graphs in Figure 4 do need seem to be a fair comparison as # of client for Paxos and S-Paxos represents quite at different workloads.


I like the basic idea of this paper and its is interesting to see that latency was increased only by 1/3. However, the system places a substantially load on the network when compared to Paxos and the leader is still required to execute phase 2 paxos for each client request.