Paper Notes on PBFT [OSDI’99]

Practical Byzantine Fault Tolerance (PBFT) is the foundational paper in Byzantine consensus algorithm. Typically, distributed consensus algorithms assume nodes may fail (and possible recovery), but BFT extends this to tolerate nodes acting arbitrarily.

Assumptions

BPFT makes strong assumption about these arbitrary failures. Notability that at most f failures will occur in a system of 3f+1 nodes, this bound applies to both safety and progress guarantees. In contrast, in typical paxos style consensus algorithms the bound of f failures in 2f+1 node only applies to progress guarantee, safety will always be guaranteed even if the system is not able to make progress. BPFT makes the usual asynchronous assumption, common to many consensus algorithms.

Algorithm

Section 4.2 describes the algorithm in details, the core idea is that once the nodes have been made aware of the request for a given log index, they under go a two step process where each node notifies every one of their agreement and proceed when at least f+1 nodes in the 2f+1 responses also agree. At end of the second stage, the client can be notified. Likewise, the view change protocol is a similarly enhanced version of view change from VRR.

Performance

BPFT takes 1.5 RRTs to commit a request in the common case, all requests must reach the system via a master node and after this time, all nodes have knowledge that the request was committed. In contract, typical multi-paxos algorithms (e.g., Raft and VRR) take 1 RTT to commit a request in the common case, again all requests must go via a master node but after this time only the master node has knowledge that the request was committed. Typical leaderless consensus algorithm (e.g., single-valued paxos) take at least 2 RTT to commit a request but requests can be submitted to any node. All cases we exclude the RTT from the client to the system node and back.

BPFT differs from typical consensus algorithms in the number of messages exchanged. Whilst there exists many optimizations to piggyback messages and reduce message load, we can approximate estimate of number of messages for algorithms (this time we are including client communication). A typical multi-paxos algorithms, might send 2n+2 messages for a commit in a system of n nodes, whereas BPFT might send 1 + 2n + 2n^2, a significant difference.

Conclusions

PBFT in an interesting read and an excellent contribution to consensus. However, I am not sure how common it is that a modern datacenter might meet assumptions laid out by the paper, in particular that at most f nodes might be faulty in a system 3f+1 nodes. In this environment, nodes are usually running a same OS, configurations, software stacks and consensus implementation. Thus if an adversary is able compromise a machine then it’s likely they can compromise many, thus failures are very much dependent on each other. Lets see if the 17 years of research since has been able to improve on this.