Majority agreement is not necessary for consensus

Did you know that majority agreement isn’t required by Paxos?

In fact, most of the time, the sets of nodes required to participate in agreement (known as quorums) do not even need to intersect with each other.

This is the observation which is made, proven and realised in our latest paper draft, Flexible Paxos: Quorum Intersection Revisited. A draft of which is freely available on ArXiv.

The paper provides the theoretical foundation of the result, as well as the formal proof and a discussion of the wide reaching implications. However, in this post I hope to give a brief overview of the key finding from our work and how it relates to modern distributed systems. I’d also recommend taking a look at A More Flexible Paxos by Sugu Sougoumarane.

 

Background

Paxos is an algorithm for reaching distributed agreement. Paxos has been widely utilized in production systems and (arguably) forms the basis for many consensus systems such as Raft, Chubby and VRR. In this post, I will assume that the reader has at least some familiarity with at least one consensus algorithm (if not, I would suggest reading either Paxos made moderately complex or the Raft paper).  

For this discussion, we will consider Paxos in the context of committing commands into a log which is replicated across a system of nodes.  

The basic idea behind Paxos is that we need two phases to commit commands:

  1. Leader election – The phase where one node essentially takes charge of the system, we call this node the “leader”. When this node fails, then the system detects this and we choose another node to take the lead.
  2. Replication – The phase where the leader replicates commands onto other nodes and decides when it is safe to call them “committed”.

The purpose of the leader election phase is twofold. Firstly, we need to stop past leaders from changing the state of the system. Typically, this is done by getting nodes to promise to stop listening to old leaders (e.g. in Raft, a follower updates its term when it receives a RequestVotes for a higher term).  Secondly, the leader needs to learn all of the commands that have been committed in the past (e.g. in Raft, the leader election mechanism ensures that only nodes with all committed entries can be elected).

The purpose of the replication phase is to copy commands onto other nodes. When sufficient copies of a command have been made, the leader considers the request to be committed and notifies the interested parties (e.g. in Raft, the leader applies the command locally and updates its commit index to notify the followers in the next AppendEntries).

We refer to the nodes who are required to participate in each phase as the “quorum”. Paxos (and thus Raft) uses majorities for both leader election and replication phases.

 

The key observation

The guarantee that we make is that once a command is committed, it will never be overwritten by another command. To satisfy this, we must require that the quorum used by the leader election phase will overlap with the quorums used by previous replication phase(s). This is important as it ensures that the leader will learn all past commands and past leaders will not be able to commit new ones.

Paxos uses majorities but there are many other ways to form quorums for these two phase and still meet this requirement. Previously, it was believed that all quorums (regardless of which phase they are from) needed to intersect to guarantee safety. Now we know that this need not be the case. It is sufficient to ensure that a leader election quorum will overlap with replication phase quorums.

In the rest of this post, we will explore a few of the implication of this observation. We will focus on two dimensions: (1) how can we improve the steady state performance of Paxos? (2) how can we improve the availability of Paxos?

 

Improving the performance of Paxos

We can now safely tradeoff quorum size in the leader election phase for the quorum size in the replication phase. For example, in a system of 6 nodes, it is sufficient to get agreement from only 3 nodes in the replication phase when using majorities for leader election. Likewise, it is sufficient to get agreement from only 2 nodes in the replication phase if you require that 5 nodes participate in leader election.

Reducing the number of nodes required to participate in the replication phase will improve performance in the steady state as we have fewer nodes to wait upon and to communicate with.

But wait a second, isn’t that less fault tolerant?

Firstly, we never compromise safety, this is a question of availability. It is here that things start to get really interesting.

Let’s start by splitting availability into two types: The ability to learn committed commands and elect a new leader (leader election phase) and the ability for the current leader to replicate commands (replication phase).

Why split them? Because both of these are useful in their own right. If the current leader can commit commands but we cannot elect a new leader then the system is available until the current leader fails. If we can elect a new leader but not commit new commands then we can still safely learn all previously committed values and then use reconfiguration to get the system up and running again.

In the first example, we used replication quorums of size n/2 for a system of n nodes when n was even. This is actually more fault-tolerant than Paxos. If exactly n/2 failures occur, we can now continue to make progress in the replication phase until the current leader fails.

For different quorum sizes, we have a trade off to make. By decreasing the number of nodes in the replication phase, we are making it more likely that a quorum for the replication phase will be available. However, if the current leader fails then it is less likely that we will be able to elect a new one.

The story does not end here however.

We can be more specific about which nodes can form replication quorums so that it is easier to intersect with them. For example, if we have 12 nodes we can split them into 4 groups of 3. We could then say that a replication quorum must have one node from each group. Then when electing a new leader, we need only require any one group to agree. This is shown in the picture below, on the left we simply count the number of nodes in a quorum and on the right we use the groups as described.

Untitled drawing-2

It is the case in both examples that 4 failures could be sufficient to make the system unavailable if the leader also fails. However, with groups is not the case that any 4 failures would suffice, now only some combinations of node failures are sufficient (e.g. one failure per group).

There is a host of variants on this idea. There are also many other possible constructions. The key idea is that if we have more information about which nodes have participated in the replication phase then it is easier for the leader election quorums to intersect with replication quorums.

We can take this idea of being more specific about which nodes participate in replication quorums even further. We could extend the consensus protocol to have the leader notify the system of its choice of replication quorum(s). Then, if the leader fails, the new leader need only get agreement from one node in each possible replication quorum from the last leader to continue.

No matter which scheme we use for constructing our quorums and even if we extend our protocol to recall the leader’s choice of quorum(s), we always have a fundamental limit on availability. If all nodes in the replication quorum fail and so does the leader then the system will be unavailable (until a node recovers) as no one will know for sure what the committed command was.

Improving the availability of Paxos

So far, we have focused on using the observation to reduce the number of nodes required to participate in the replication phase. This might be desirable as it improves performance in the steady state and makes Paxos more scalable across a larger number of nodes.

However, it is often the case that availability in the face of failure is more important than steady state performance. It may also be the case that a deployment only has a few nodes to utilise.

We can apply this observation about Paxos the other way around. We can increase the size of the replication quorum and reduce the size of the quorum for leader election.  In the previous example, we could use 2 nodes per group for replication then only requiring 2 nodes for any group for leader election.

Returning to the trade off from before. By increasing the number of nodes in the replication phase, we are making it more likely that we will be able recover when failures occurs. However, we increase the chance of a situation where we have enough nodes to elect a leader but we do not have enough nodes to replicate. This is still useful as if we can elect a leader then we can reconfigure to remove/replace the nodes.

At the extreme end, we can require that all nodes participate in replication and then only one node needs to participate in leader election. Assuming we can handle the reconfiguration,  we can now handle F failures using only F+1 nodes.

 

Conclusion

Paxos is a single point on a broad spectrum of possibilities. Paxos and its majority quorums is not the only way to safely reach consensus.  Furthermore, the tradeoff between availability and performance provided by Paxos might not be optimal for most practical systems.

 

Caveats

There are as always many caveats and we refer to the paper for a more formal and detailed discussion.

  1. If this result is applied to Raft consensus, we do still need leader election quorums to intersect. This is because Raft uses leader election intersect to ensure that each term is used by at most one leader. This is specific to Raft and is not true more generally.
  2. This is far from the first time that it has been observed that Paxos can operate without majority. However, previously it was believed that all quorums (regardless of which of the phases they were from) needed to intersect. This fundamentally limits the performance and availability of any scheme for choosing quorums and is probably why we do not really see them used in practice.

Acknowledgments

During the preparation of our paper, Sugu Sougoumarane released a blog post, which explains for a broad audience some of the observations on which this work is based. 

 

`The most useful piece of learning for the uses of life is to unlearn what is untrue. ”  Antisthenes

Paper Notes on “CORFU: A distributed shared log” [TOCS’13]

Last Monday we looked at Tango, a system for replicating a data structure to provide linearizable semantics and fault-tolerance. Tango is built up on CORFU, a replicated log, built over storage nodes. This paper notes article covers “CORFU: A distributed shared log” also by Balakrishnan et al. from TOCS December 2013. I believe that this paper elaborates on “CORFU: A Shared Log Design for Flash Clusters” by Balakrishnan et al from NSDI 2012 and that the NSDI paper is not a prerequisite for the TOCS paper.

What we already know?

From the Tango paper, we learned that CORFU provides a replicated log with support for the following operations: append, check, read, trim and fill.

Its main two components are: a sequencer, which hands out addresses in the log and storage nodes (such as SSDs) which store log entries. Storage nodes are divided into clusters and a variant of chain replication is used between them. Each cluster is responsible for a subset of log addresses.

Summary

This paper presents CORFU, a shared log, distributed over storage nodes. It main advantage over existing systems is its scalability as its not bottlenecked by the I/O of a single host.

Local addresses -> physical addresses

Each storage node exposes a infinite write-once logical address space with read, write, delete and fill operations for each address. Delete is used when the data at a particular logical address is not longer required, its physical address can be reused but its logical address cannot. Fill is used to mark that a logical address will not be used in the future. These are implemented using a hash table with various optimizations. A seal operations is also provided which locks a storage node to operates with equal or higher epoch number.

Global addresses -> local addresses

Each client resolves a global address into a set of nodes and a local address in two stages. Firstly, the client uses a local copy of a mapping from ranges in the global addresses space to disjoint subsets of hosts. For example, addresses 0-100K map to replica sets A-C and addresses 100K-200K map to replica sets D-F. A deterministic function (like mod and div) maps the specific global address to a local address a specific replica set like A or C.

Replication

Replica are written to using client-driven chain replication, this means the client writes to each replica in a deterministic order and waits for successful acknowledgment for each storage node before continuing. As a result, write latency scales linearly with the number of replicas. In contrast, majority-based protocol like Raft, Paxos and VRR can replicate a write in as little as 1 RTT, regardless of the number of nodes, in the right conditions. The downside of such protocols is that to tolerant f failure we need 2f+1 nodes instead of f+1.

If a client fails to complete this process then it may be filled in by the next client. Like Raft, this means that clients may be given false negatives, and thus the application utilizing the log must be able to handle this. For example, Raft uses operation id’s and caches to prevent multiple application of the same operation to a state machine in SMR.

A more important failure case is where a client fails to see a committed write since the replica it is querying was removed from the replica set due to network partition/failure and the client is not aware of this change. This is address by issuing leases to the storage units from the sequencer.

Changing projections

Corfu’s sequencer is in some ways analogous to the coordinator/master in tradition protocols. Likewise, Corfu’s changing projection has many parallels with VRR’s view changes or Raft’s term changes. In all cases, a monotonically increasing value (known as the epoch number/view number/term) separates different perspectives on system configuration. Example prospectives include a period of leadership in Raft or a set of projections in CORFU. Each node stores its prospective of this value and each message between nodes includes it. Projection change is initiated by a client, then agreed by storage nodes using a Paxos-like consensus protocol and then each (involved) storage node is sealed in the process. Clients learn of the projection change when they contact storage nodes (since it includes their outdated epoch) and they retrieve the new projection information from a networked storage drive.

My interpretation of changing projections is that each projection change can include a completely new configuration. This mechanism provides us with dynamic membership, in addition to a mechanism for dealing with network failures and partitions.

Evaluation

The authors’ experimental evaluation seems very promising. In some ways, it is difficult for me to determine which gains are from the hardware and which are from CORFU’s design. All of the experiments use two replicas per cluster, thus just two failures are capability of bringing the system down. The use of client-driven chain replication means that I would really like to see how the system scales (particularly its latency) with more replicas.

Conclusions

Corfu is a very interesting system and seems to be a novel point on the solution space of distributed log solutions. Next time, we will take a look at the open source implementation, CorfuDB.

Paper Notes on “Tango: Distributed Data Structures over a Shared Log” [SOSP’13]

The following is a paper notes article on “Tango: Distributed Data Structures over a Shared Log” by Balakrishnan et al. from SOSP 2013. The article focus on the main body of the paper and I will cover Tango’s streaming and transaction support in a separate article, sometime in the future.

I have covered this paper before covering its companion paper “CORFU: A Shared Log Design for Flash Clusters” also by Balakrishnan et al. from NSDI 2012. I expect its paper notes article will answer many of the open questions in this article.

Summary

Tango is system for replicating a data structure to provide linearizable semantics and fault tolerance. Tango is framed as system for application metadata, whose requirement include a need for fault tolerance, high availability, fairly strong consistency/ordering requirements.

[quick note on terminology: unlike the usual terminology to describe systems like Raft and VRR, Tango uses the term “client” to refer to what similar systems call a server and “external client” to what similar systems call a client. We shall use the same terminology as Tango. ]

The basics

At its heart, Tango, takes SMR and separates the replication of the log from the hosts running the application. In traditional SMR, each host runs the application itself (an in-memory deterministic state machine) and stores a local copy of the log. With the help of a consensus protocol, the local copy of the log kept upto date and clients are provided with linearizable semantics. In Tango, only the application runs on these host. The log is instead distributed (using chain replication) across an a set of storage nodes (e.g. an SSD cluster). Tango clients do not communicate directly with message passing (as is the case of SMR), instead they interact via the shared log.

SMR is often used with leader driven consensus, this means that only the leader may commit an operation to replicated log and if the leader was to fail the system is unavailable to clients until a new leader is chosen. In contrast, Tango’s approach allows clients to directly replicate commands to the log. The leader is replaced by a dedicated sequencer and the sequencer state is soft so systems can continue to operate (at higher latency) if the sequencer was to fail.

Tango uses a stream abstraction to shard operations by the data structure on which they operate. This is used to clients to efficiently “ignore” operations which do not apply their data structures.

Shared log (aka Corfu)

The shared log (aka Corfu) is composed of clusters of storage nodes. Within each cluster, each storage node is a replica and operates on a read many/write once basis. Global address x maps to cluster (x mod n) and local address (x div n), where n is the number of clusters. Clients read and write to clusters using chain replication, this is also how failures are handled.

The sequencer stores the global address of the log tail. If this fails, it can reconstructed by querying storage nodes for local address of their log tail. The paper also mentions that each cluster also has its own dedicated sequencer, I could not understand when this is used.

Multiple clients writing to one address is resolved safely and Corfu exposes an operation for filling in unused addresses

Application interface

In many classic protocols, the external client must locate a master/coordinator to handle the request on their behalf. In Tango (like some other leaderless systems like EPaxos) any external client can communication with any client. Often SMR uses master leases or a similar mechanism to handle read requests without fully replicating them (unlike write requests) or having to pass them to the master. Corfu supports a check operation so Tango can check (and update if necessary) the local view before applying the update. Transactions are supported with speculation, object version numbers and a commit/abort stage before an up call to application is made

The API for interfacing with Tango seems similar to that of SMR. An application must provide a callback to apply a operation and has access to method to add read or write operations. An application may provide a checkpointing mechanism and may call discard on any log prefix.

Related systems

Alternatives to Tango are systems such as Zookeeper, Raft and Chubby. In the introduction, the authors highlight that such system are not sufficient since they only support specific data structures and/or do not support operations over multiple data structures. Whilst I support this idea, I would argue that in at least some existing systems, operations over multiple data structure can be supported by considering the multiple data structures as one. This allows operations over multiple data structure but comes at a high cost of adding a total ordering between operations when only a small partial ordering is required. Since Tango support streaming, it already has access to a partial ordering on operations instead of just using a total ordering. P-SMR by Parisa Marandi demonstrates the efficiently improvements possible by utilizing this information. Another argument given in the paper is that adopting SMR after the development of an application “often requires a drastic rewrite of the code”.

Evaluation

The experimental evaluation of Tango focuses on measuring throughput and/or latency whilst varying the number of clients and ratio of write to read requests. I would be interested to see some results on the recovery time for various patterns of storage node or sequencer failures. The initial evaluation considers three scenarios: (a) 18 storage nodes and 1 client, (b) 18 storage nodes and 2 clients, (c) 2 or 18 storage nodes and 1 to 18 clients. Generally speaking, the results are very much as you might expect: increasing request load increases request latency, writes place a much high burden on the system than reads and read only clients scale linearly until storage nodes are saturated.

The paper does not mention any formal proof of correctness (only that the sequencer is not required for correctness). Log compaction is supported with the forget call and checkpointing. Dynamic membership support for clients is trivial due to the design of Tango but I would be interested how Corfu and its variant of chain replication handles storage node failures.

It is not fair to directly compare Tango to systems using SMR with a majority quorum consensus algorithm (such as Raft). Tango’s scalability comes (in part) from the sharding of the replication log into clusters. This technique can be applied to SMR (see S-SMR) too. It would be interesting to see how Tango performs with the storage node, application node and sequencer co-located on same the host. However (particularly with current trend towards containers, docker and unikernels) separating systems like Tango into function specific instances/hosts has some significant benefits such as decoupling the number of storage and application hosts or fine gained provisioning for the sequencer compared to the storage/application hosts.

Conclusion

Each new paper that presents a high availability, fault-tolerance systems seems to build upon many of the old concepts (such as Paxos, SMR and timeouts for failure detection), in combination with a few novel ideas. In this case, Tango builds upon chain replication, for replicating log entries and the typical API for SMR. This is first time I have seen a soft state sequencer used in such as system as well as the divide between the storage nodes and application nodes. Overall, I really like this paper and its nice to see some novelty in a space crowded with variants of Paxos + SMR.

Paper Notes: The Network is Reliable [ACMQ July’14]

The Network is Reliable is an excellent article which attempts to formalise the discussion on real world failures for distributed systems. There is currently great debate on whether the assumption that network partitions are rare is too strong or too weak, for modern networks. Much of the data which we could use to answer this question is not published, instead is takes the form of anecdotes shared between sysadmins over a beer. This article lets the reader sit at the table and share the stories.

It is worth noting that network partition is used here in the broadest possible sense. We are looking at end-to-end communications and anything which can hinder these. This goes beyond simple link and switch failures to distributed GC, bugs in the NIC, and slow IO.

Here 5 examples of the anecdotes covered:

  • split brain caused by non-transitive reachability on EC2
  • redundancy doesn’t always prevent link failure
  • asymmetric reachability due to bugs in the NIC
  • GC and blocking for  I/O can cause huge runtime latencies
  • short transient failures become long term problem with existing algorithms

So, what does this mean for Unanimous and my research on reliable consensus for real world networks?

My original inspiration for working on consensus algorithms was the need for reliable consensus at the internet edge. With consensus comes fault tolerance,  from this we can construct new systems for preserving privacy and handling personal data, offering users a viable alternative to 3rd party centralised services.

However, it has become apparent that the issues of reliable consensus is much more general. This article illustrates that even within the constrained setting of a datacenter, distributed systems are failing to tolerance the wide range of possible failures.