Author Archives: Heidi-ann

Fast Flexible Paxos

It’s official! Fast Flexible Paxos: Relaxing Quorum Intersection for Fast Paxos, my latest collaboration with Aleksey Charapko and Richard Mortier will appear in next year’s edition of the International Conference on Distributed Computing and Networking (ICDCN). If you don’t want to wait, then the paper is available right now on Arxiv and in the next few paragraphs, I’ll explain why I think this short paper opens the door to some exciting new directions for the future of high-performance state machine replication protocols. 

State machine replication

State machine replication (aka SMR) is a widely used technique for building strongly consistent, fault-tolerant services by replicating an application’s state machine across a set of servers and ensuring that each copy of the state machine receives the same sequence of operations. This is the primary responsibility of an SMR protocol such as Paxos, Chubby, Zookeeper[1] or Raft. SMR protocols are often criticized for performing poorly. Typically distributed systems opt for weaker consistency guarantees and reserve state machine replication only for the most critical aspects of a system such as managing metadata, distributed locking or group membership. 

Most SMR protocols take a similar approach to deciding the order that operations are applied to the state machines. They elect one server to be a leader and all operations for the state machines are sent first to the leader. The leader decides what order operations will be applied. Before the leader can apply an operation to its state machine, it first replicates it to a majority of servers to back up the operation. The leader can then safely apply the operation to its state machine. If the leader fails, then another server is chosen to become the new leader. A new leader must ensure that it knows about all operations which were applied by the failed leader. Since the previous leader replicated all applied operations to a majority of servers, the next leader can learn of all previously applied operations by communicating with the majority of servers. This is because any two majorities will overlap by at least one server.

In practice, it takes a while for an SMR protocol to apply an operation to the state machine. There are two main issues: firstly, all operations must be sent to the leader and secondly, all operations must then be replicated on a majority of servers. 

Fast Paxos

Various SMR protocols have been proposed to overcome the leader bottleneck. Some allow any server to directly replicate operations, without first sending them to the leader. I will refer to this as fast replication. The first protocol to support fast replication was Fast Paxos, which was proposed by Leslie Lamport in 2005. 

One downside of fast replication is that when a server replicates an operation, it now needs to make more copies before it is safe to apply the operation. Fast Paxos requires approximately 3/4 of servers to have a copy of an operation before it can be applied [2]. Fast Paxos might have handled one issue, the leader bottleneck, but in doing so, we have made the other issue even worse, we now need around 1/4 more copies of an operation before it can be applied.[3]

Flexible Paxos

Regarding the issue of majority replication, Flexible Paxos showed that leaders could safely replicate operations to fewer servers, provided that more servers were used during leader election. The equation below describes the tradeoff between the number of servers which must participate in replication (R), the number of servers which must participate in leader election (L) and the total number of servers (N).

L + R > N 

Since leader election is rare, it benefits performance to reduce the number of servers which must replicate operations (R)  at the expense of increasing the number of servers which must participate in leader election (L).

For example, if N=10 then we might use R=5 and L=6.

Or we might choose to use R=3 and L=8.

Note that we could not use R=3 and L=7. Our equation ensures that L and R always overlap by at least one server.

Fast Paxos + Flexible Paxos = Fast Flexible Paxos

Unfortunately, Flexible Paxos does not directly apply to Fast Paxos. State machine replication protocols thus need to choose between using a leader and requiring fewer copies of operations (Flexible Paxos) or allowing servers to bypass the leader but requiring more copies of operations (Fast Paxos). In this short paper, we prove that the approach of Flexible Paxos can be applied to Fast Paxos. Fast Paxos can safely replicate operations to fewer servers. We call this Fast Flexible Paxos.

Like Flexible Paxos, Fast Flexible Paxos observes that we can tradeoff the number of servers which replicate operations and participate in leader elections. The equations below describe this tradeoff in terms of R, L, N and the number of servers which must participate in fast replication (F).

L + R > N 

L + 2F > 2N

Consider the previous example where N=10, R=5 and L=6, our second equation requires that F=8.

Or consider the example where N=10, R=3 and L=8, this requires that F=7.

Note that it would not be ok to use N=10, L=8 and F=6, regardless of the value of R, as this does not satisfy our second equation. The second equation ensures that L always overlaps with any pair of Fs, as shown in the diagrams below.

As always with distributed systems, it is a matter of tradeoffs. In the most extreme case, we can use majorities for fast replication in Fast Flexible Paxos but this will not be able to tolerate any failures as it requires that L=N if N is odd.

Our paper on Fast Flexible Paxos describes exactly how we calculated these equations. The paper also includes a proof of Fast Flexible Paxos, a formal specification of Fast Flexible Paxos written in TLA+ and a prototype using Paxi.

State machine replication is a powerful primitive for distributed systems, which has the power to provide the strongest possible consistency guarantees even in the face of message delays and failures. However, the performance limitations of state machine replication have long limited its adoption. I believe that Fast Flexible Paxos is the first step towards building more performant state machine replication protocols which can support fast replication, thus addressing the leader bottleneck, without requiring most servers to participate in replication.

[1] Strictly speaking, Zookeeper uses atomic broadcast and primary backup replication but the same high-level ideas apply to both.

[2] ¾ is only required to tolerate the same number of failures as Paxos with majorities.

[3] There is another issue with fast replication which is that conflicts can occur when multiple servers try to replicate different operations at the same time. Subsequent protocols like Generalized Paxos and Egalitarian Paxos mitigate this.

Towards an intuitive high-performance consensus algorithm

Distributed consensus is the problem of how to reach decisions in asynchronous and unreliable distributed systems. The two most widely known algorithms are: Paxos, the traditional yet poorly understood solution; and Raft, the newcomer which democratised the understanding of distributed consensus.

Raft is one of the most influential papers of this decade in distributed systems and it has been tremendously successful in industry, with 50+ open source implementations including CoreOS’s etcd and HashiCorp’s Consul. Over the years, we have increasingly seen many efforts to optimise the performance of Raft (which I myself am also guilty of). The Raft paper makes clear that it prioritises understandability over performance. Fundamentally, what makes Raft so understandable – its strong leadership and focus on state machine replication – also makes it a poor basis for high-performance consensus. The vast existing literature on distributed consensus proposes many more efficient algorithms than Raft though, like Paxos, their presentation is often less understandable to the broader community. It is sad to see that such insightful research (EPaxos, Vertical Paxos, VR and Ring Paxos to name but a few) is being largely ignored in practice.
The question, therefore, arises: can we have a consensus algorithm which is both high-performance and intuitive to reason about?

Our latest draft, available now on arXiv, tries to take the first steps towards these goals. We present a new approach to reaching distributed consensus. We aim to improve performance and understandability by utilising two key weapons: generality and immutability.

Weapon 1: Generality

Instead of proposing a single concrete algorithm optimised for a particular scenario, we provide a set of rules which, if followed, allow any algorithm to safely implement consensus. It is well understood that compromise is inevitable in distributed consensus, our aim is to enable engineers to choose their tradeoffs by customising the consensus algorithm to meet their particular needs.

In the paper, we prove that Paxos implements these rule, as does our earlier work, Flexible Paxos and an interesting decentralised consensus algorithm called Fast Paxos. In moving to this generalised model, we also observe that algorithms such as Paxos and Fast Paxos are conservative in their approach and can be optimised accordingly.

Weapon 2: Immutability

The confusion around Paxos largely lies in its subtlety. We have been inspired by the recent trend to use immutable state to tame the complexity of systems and so we have applied this idea to consensus. We build our algorithm from persistent write-once registers and the guarantee that if a value is read from a register then that response is valid forever. From this simple primitive, we can construct consensus algorithms with the aim to make reasoning about safety so intuitive that it is clear, without formal proofs, why an algorithm is correct.

Summary

Our new draft, available here, proposes an immutable and generalised approach to reaching distributed consensus over a single value. We prove that our approach is sufficiently general to encode Paxos, Flexible Paxos and Fast Paxos as well as opening the door to a board spectrum of new solutions to distributed consensus. Our hope is that in the future this new perspective on consensus may be used to express more algorithms from the academic literature and thus effectively communicate their insights to the broader community.

My Reactive Summit 2018 talk is now online

My talk titled “Liberating Distributed Consensus” from last year’s Reactive Summit is now available on youtube.

Abstract: The ability to reach consensus between hosts, whether for addressing, primary election, locking, or coordination, is a fundamental necessity of modern distributed systems. The Paxos algorithm is at the heart of how we achieve distributed consensus today and as such has been the subject of extensive research to extend and optimise the algorithm for practical distributed systems.

In the talk, we revisit the underlying theory behind Paxos, weakening its original requirements and generalising the algorithm. We demonstrate that Paxos is, in fact, a single point on a broad spectrum of approaches to consensus and conclude by arguing that other points on this spectrum offer a much-improved foundation for constructing scalable, resilient and high performance distributed systems.

Majority agreement is not necessary for consensus

Did you know that majority agreement isn’t required by Paxos?

In fact, most of the time, the sets of nodes required to participate in agreement (known as quorums) do not even need to intersect with each other.

This is the observation which is made, proven and realised in our latest paper draft, Flexible Paxos: Quorum Intersection Revisited. A draft of which is freely available on ArXiv.

The paper provides the theoretical foundation of the result, as well as the formal proof and a discussion of the wide reaching implications. However, in this post I hope to give a brief overview of the key finding from our work and how it relates to modern distributed systems. I’d also recommend taking a look at A More Flexible Paxos by Sugu Sougoumarane.

 

Background

Paxos is an algorithm for reaching distributed agreement. Paxos has been widely utilized in production systems and (arguably) forms the basis for many consensus systems such as Raft, Chubby and VRR. In this post, I will assume that the reader has at least some familiarity with at least one consensus algorithm (if not, I would suggest reading either Paxos made moderately complex or the Raft paper).  

For this discussion, we will consider Paxos in the context of committing commands into a log which is replicated across a system of nodes.  

The basic idea behind Paxos is that we need two phases to commit commands:

  1. Leader election – The phase where one node essentially takes charge of the system, we call this node the “leader”. When this node fails, then the system detects this and we choose another node to take the lead.
  2. Replication – The phase where the leader replicates commands onto other nodes and decides when it is safe to call them “committed”.

The purpose of the leader election phase is twofold. Firstly, we need to stop past leaders from changing the state of the system. Typically, this is done by getting nodes to promise to stop listening to old leaders (e.g. in Raft, a follower updates its term when it receives a RequestVotes for a higher term).  Secondly, the leader needs to learn all of the commands that have been committed in the past (e.g. in Raft, the leader election mechanism ensures that only nodes with all committed entries can be elected).

The purpose of the replication phase is to copy commands onto other nodes. When sufficient copies of a command have been made, the leader considers the request to be committed and notifies the interested parties (e.g. in Raft, the leader applies the command locally and updates its commit index to notify the followers in the next AppendEntries).

We refer to the nodes who are required to participate in each phase as the “quorum”. Paxos (and thus Raft) uses majorities for both leader election and replication phases.

 

The key observation

The guarantee that we make is that once a command is committed, it will never be overwritten by another command. To satisfy this, we must require that the quorum used by the leader election phase will overlap with the quorums used by previous replication phase(s). This is important as it ensures that the leader will learn all past commands and past leaders will not be able to commit new ones.

Paxos uses majorities but there are many other ways to form quorums for these two phase and still meet this requirement. Previously, it was believed that all quorums (regardless of which phase they are from) needed to intersect to guarantee safety. Now we know that this need not be the case. It is sufficient to ensure that a leader election quorum will overlap with replication phase quorums.

In the rest of this post, we will explore a few of the implication of this observation. We will focus on two dimensions: (1) how can we improve the steady state performance of Paxos? (2) how can we improve the availability of Paxos?

 

Improving the performance of Paxos

We can now safely tradeoff quorum size in the leader election phase for the quorum size in the replication phase. For example, in a system of 6 nodes, it is sufficient to get agreement from only 3 nodes in the replication phase when using majorities for leader election. Likewise, it is sufficient to get agreement from only 2 nodes in the replication phase if you require that 5 nodes participate in leader election.

Reducing the number of nodes required to participate in the replication phase will improve performance in the steady state as we have fewer nodes to wait upon and to communicate with.

But wait a second, isn’t that less fault tolerant?

Firstly, we never compromise safety, this is a question of availability. It is here that things start to get really interesting.

Let’s start by splitting availability into two types: The ability to learn committed commands and elect a new leader (leader election phase) and the ability for the current leader to replicate commands (replication phase).

Why split them? Because both of these are useful in their own right. If the current leader can commit commands but we cannot elect a new leader then the system is available until the current leader fails. If we can elect a new leader but not commit new commands then we can still safely learn all previously committed values and then use reconfiguration to get the system up and running again.

In the first example, we used replication quorums of size n/2 for a system of n nodes when n was even. This is actually more fault-tolerant than Paxos. If exactly n/2 failures occur, we can now continue to make progress in the replication phase until the current leader fails.

For different quorum sizes, we have a trade off to make. By decreasing the number of nodes in the replication phase, we are making it more likely that a quorum for the replication phase will be available. However, if the current leader fails then it is less likely that we will be able to elect a new one.

The story does not end here however.

We can be more specific about which nodes can form replication quorums so that it is easier to intersect with them. For example, if we have 12 nodes we can split them into 4 groups of 3. We could then say that a replication quorum must have one node from each group. Then when electing a new leader, we need only require any one group to agree. This is shown in the picture below, on the left we simply count the number of nodes in a quorum and on the right we use the groups as described.

Untitled drawing-2

It is the case in both examples that 4 failures could be sufficient to make the system unavailable if the leader also fails. However, with groups is not the case that any 4 failures would suffice, now only some combinations of node failures are sufficient (e.g. one failure per group).

There is a host of variants on this idea. There are also many other possible constructions. The key idea is that if we have more information about which nodes have participated in the replication phase then it is easier for the leader election quorums to intersect with replication quorums.

We can take this idea of being more specific about which nodes participate in replication quorums even further. We could extend the consensus protocol to have the leader notify the system of its choice of replication quorum(s). Then, if the leader fails, the new leader need only get agreement from one node in each possible replication quorum from the last leader to continue.

No matter which scheme we use for constructing our quorums and even if we extend our protocol to recall the leader’s choice of quorum(s), we always have a fundamental limit on availability. If all nodes in the replication quorum fail and so does the leader then the system will be unavailable (until a node recovers) as no one will know for sure what the committed command was.

Improving the availability of Paxos

So far, we have focused on using the observation to reduce the number of nodes required to participate in the replication phase. This might be desirable as it improves performance in the steady state and makes Paxos more scalable across a larger number of nodes.

However, it is often the case that availability in the face of failure is more important than steady state performance. It may also be the case that a deployment only has a few nodes to utilise.

We can apply this observation about Paxos the other way around. We can increase the size of the replication quorum and reduce the size of the quorum for leader election.  In the previous example, we could use 2 nodes per group for replication then only requiring 2 nodes for any group for leader election.

Returning to the trade off from before. By increasing the number of nodes in the replication phase, we are making it more likely that we will be able recover when failures occurs. However, we increase the chance of a situation where we have enough nodes to elect a leader but we do not have enough nodes to replicate. This is still useful as if we can elect a leader then we can reconfigure to remove/replace the nodes.

At the extreme end, we can require that all nodes participate in replication and then only one node needs to participate in leader election. Assuming we can handle the reconfiguration,  we can now handle F failures using only F+1 nodes.

 

Conclusion

Paxos is a single point on a broad spectrum of possibilities. Paxos and its majority quorums is not the only way to safely reach consensus.  Furthermore, the tradeoff between availability and performance provided by Paxos might not be optimal for most practical systems.

 

Caveats

There are as always many caveats and we refer to the paper for a more formal and detailed discussion.

  1. If this result is applied to Raft consensus, we do still need leader election quorums to intersect. This is because Raft uses leader election intersect to ensure that each term is used by at most one leader. This is specific to Raft and is not true more generally.
  2. This is far from the first time that it has been observed that Paxos can operate without majority. However, previously it was believed that all quorums (regardless of which of the phases they were from) needed to intersect. This fundamentally limits the performance and availability of any scheme for choosing quorums and is probably why we do not really see them used in practice.

Acknowledgments

During the preparation of our paper, Sugu Sougoumarane released a blog post, which explains for a broad audience some of the observations on which this work is based. 

 

`The most useful piece of learning for the uses of life is to unlearn what is untrue. ”  Antisthenes

OSCON & Code Mesh

I’m pleased to announce that I’ll be speaking at OSCON and Code Mesh in London this winter. Having heard great things about both events and looking at the lineup, I am very much looking forward to it. Drop me a line if your around in London during either event.

Title: Distributed Consensus: Making Impossible Possible

Abstract:

In this talk, we explore how to construct resilient distributed systems on top of unreliable components.

Starting, almost two decades ago, with Leslie Lamport’s work on organising parliament for a Greek island. We will take a journey to today’s datacenters and the systems powering companies like Google, Amazon and Microsoft. Along the way, we will face interesting impossibility results, machines acting maliciously and the complexity of today’s networks.

Ultimately, we will discover how to reach agreement between many parties and from this, how to construct new fault-tolerance systems that we can depend upon everyday.

Paper notes on S-Paxos [SRDS’12]

The following is a paper notes for “S-Paxos: Offloading the Leader for High Throughput State Machine Replication”. This paper was recommended to me as a example of high-throughput consensus, achieved by offloading responsibilities from the leader.

The paper starts by demonstrates that JPaxos is throughput limited by leader CPU, peaking at 70 kreqs/sec, where as the throughput of S-Paxos can reach 500 kreqs/sec.

Algorithm

The (normal case) algorithms works as follows:
  • Any node is able to receive a request for a client, lets call this coordinator
  • Coordinator sends request and its ID to all nodes
  • All nodes send ack with ID to all other nodes
  • When leader receives f+1 asks then sends phase 2a for ID
  • When leader receives f+1 successful phase 2B for ID then send commit for ID to all
  • When coordinator receives commit for ID, then execute request and reply to client
Path of request:
client -> all -> all -> all -> leader -> all -> client
1 + n + n^2 + n + n + n + 1 = n^2 + 4 + 1
or
client -> all -> all -> all -> all -> client
1 + n + n^2 + n + n^2 + 1 = 2n^2 + 2 + 1
depending on if message 2b is sent to all nodes or just leader
for comparison multi-paxos is:
client -> leader -> all -> leader -> client
1 + n + n + 1 = 2n + 2
or
client -> leader -> all -> all -> client
1 + n + n^2 + 1
depending on if message 2b is sent to all nodes or just leader
The paper proposes various optimizations such as batching and pipelining and message piggybacking to reduce network load

Evaluation

The evaluation demonstrated that under the right conditions S-Paxos can achieve 5x the throughput of JPaxos. Throughput the evaluation, graph x-axis shows number of closed loop clients, which are client who send the next request when the previous response is received. Without any indication of client latency, this did not tell us much about the rate of incoming requests. For example, the graphs in Figure 4 do need seem to be a fair comparison as # of client for Paxos and S-Paxos represents quite at different workloads.

Conclusion

I like the basic idea of this paper and its is interesting to see that latency was increased only by 1/3. However, the system places a substantially load on the network when compared to Paxos and the leader is still required to execute phase 2 paxos for each client request.

Paper Notes on PBFT [OSDI’99]

Practical Byzantine Fault Tolerance (PBFT) is the foundational paper in Byzantine consensus algorithm. Typically, distributed consensus algorithms assume nodes may fail (and possible recovery), but BFT extends this to tolerate nodes acting arbitrarily.

Assumptions

BPFT makes strong assumption about these arbitrary failures. Notability that at most f failures will occur in a system of 3f+1 nodes, this bound applies to both safety and progress guarantees. In contrast, in typical paxos style consensus algorithms the bound of f failures in 2f+1 node only applies to progress guarantee, safety will always be guaranteed even if the system is not able to make progress. BPFT makes the usual asynchronous assumption, common to many consensus algorithms.

Algorithm

Section 4.2 describes the algorithm in details, the core idea is that once the nodes have been made aware of the request for a given log index, they under go a two step process where each node notifies every one of their agreement and proceed when at least f+1 nodes in the 2f+1 responses also agree. At end of the second stage, the client can be notified. Likewise, the view change protocol is a similarly enhanced version of view change from VRR.

Performance

BPFT takes 1.5 RRTs to commit a request in the common case, all requests must reach the system via a master node and after this time, all nodes have knowledge that the request was committed. In contract, typical multi-paxos algorithms (e.g., Raft and VRR) take 1 RTT to commit a request in the common case, again all requests must go via a master node but after this time only the master node has knowledge that the request was committed. Typical leaderless consensus algorithm (e.g., single-valued paxos) take at least 2 RTT to commit a request but requests can be submitted to any node. All cases we exclude the RTT from the client to the system node and back.

BPFT differs from typical consensus algorithms in the number of messages exchanged. Whilst there exists many optimizations to piggyback messages and reduce message load, we can approximate estimate of number of messages for algorithms (this time we are including client communication). A typical multi-paxos algorithms, might send 2n+2 messages for a commit in a system of n nodes, whereas BPFT might send 1 + 2n + 2n^2, a significant difference.

Conclusions

PBFT in an interesting read and an excellent contribution to consensus. However, I am not sure how common it is that a modern datacenter might meet assumptions laid out by the paper, in particular that at most f nodes might be faulty in a system 3f+1 nodes. In this environment, nodes are usually running a same OS, configurations, software stacks and consensus implementation. Thus if an adversary is able compromise a machine then it’s likely they can compromise many, thus failures are very much dependent on each other. Lets see if the 17 years of research since has been able to improve on this.