Author Archives: Heidi-ann

Adding machines can decrease availability of a distributed system

TL;DR: Run distributed systems using majority quorum on an odd number of machines.

Paxos, Raft, Viewstamped Replication and many other consensus algorithms operate on majority quorums. A majority quorum is any subset of nodes containing at least a strict majority of nodes. These algorithms all require a majority quorum to agree to a value before its considered to be committed. This ensures that any agreement will be aware of all past agreements, as there will always be in intersection of at least one mode between any two majority quorums.

It is a prerequisite that at least a strict majority of nodes are up for the system using these consensus protocols to be operational. From this, we can establish an upper bound on the uptime of any system using majority quorum consensus. For this calculation, we assume node failures are independent. Whilst this is a fair assumption, it is of course not the case in practice.

We can calculate the probability that a strict majority of node are up as follows:

majority_bound_fig2

p is the probably that a node is live, n is number of nodes and underline means floor.

This equation is also the CDF of the Binomial distribution. It works by summing up the probability what no nodes are down, only one node is down, two nodes are down etc.. until you reach the case where the largest possible minority of nodes are down.

The following graph shows the relationship between node uptime p and majority quorum uptime (each line represent the number of nodes from 1 to 8):

majority_bound_fig1

A system with two nodes is more likely to fail than a system with one. It’s easy to see why this is. If either node in a system of two machines fails then the system is down (as its unable to get a strict majority). Since there are two node instead of one, a failure is likely to occur. In other words, the probably of a system of one node having a majority is p, whilst for a system of two nodes its p^2.

If node uptime is less than half, your best off with only one node. In practice, your best off with not using majority quorum consensus.

Adding a node to an odd set of nodes doesn’t improve uptime. Adding this extra node doesn’t increase the number of failures that can we handled, e.g. 5 nodes can handle upto two failures and 6 nodes can also only handle upto two failures. The summation above has no extra terms in this case but the existing terms are smaller as p is raised to a higher power.

This approach is of course very limited. We are not accounting for how extra nodes aid in the handling of read-only commands, the duration and pattern of failures and that machine failures are not independent.

I assume this is well known result in the theory of distributed system but I’ve not come across it before. If anyone has seen it before, I can you send me a pointer to the work.

Chronicle of Coracle: Parsing Protocol Specific Parameters

This is the first post in the “Chronicle of Coracle” series to document daily life using OCaml for systems programming.

Today, I’ll be adding protocol specific parameters to Coracle, this mostly involves working with the Yojson and Cmdliner libraries. Once completed, Coracle will be able to accept JSON files like:

{"nodes": 50,"loss": 0.0, "termination":1000, "consensus": {"protocol":"raft","timeout":300}}

Stage 1 – Protocol selection

The value associated with “protocol” key determines how which consensus protocol is ran by the simulator. Though this value seems like an unusual starting place, the simulator is a functor over the specific consensus implementation, thus we much choose our consensus implementation first before anything else. We extract the protocol name as follows:

let get_protocol filename = try (
  Safe.from_file filename
  |> function `Assoc config -> config
  |> List.assoc "consensus"
  |> function `Assoc proto -> proto
  |> List.assoc "protocol" 
  |> function `String str -> str
  |> function "raft" -> `Raft | "vrr" -> `VRR
  ) with _ -> raise JSON_parsing_failure

This approach gets the job done but could be improved using some kinda option/error monad instead of exceptions and providing more detailed information to the user on why the JSON parsing failed. We added various other JSON parsing functions, all of follow a similar pattern one above. We needed to do a bit more work to get these to type correctly, as described in Stage 3.

Stage 2 – The dummy module

Based on the protocol given in the JSON configuration file, we apply a different consensus module to the simulation functor. At the moment we only have support for the Raft protocol, so here is the type sig for consensus modules and a new dummy consensus module we can use in testing.

module type CONSENSUS = sig
  
type state
type config
val parse_config: json -> config
val init: id list -> config -> state
val add_peers: id list -> state -> state
val state_to_string: state -> string 

type msg
val msg_serialize: msg -> string
val msg_deserialize: string -> msg
val msg_to_string: msg -> string
val eval: msg input -> state -> state option * msg output list  

end 

And the dummy module:

open Common
open Io
open Yojson.Safe

type state = {
  peers: id list;
  counter: int;
  say_hello_to: int;
}

type config = int
let parse_config json = 
  function `Assoc config -> config
  |> List.assoc "id"
  |> function `Int i -> i

let init peers id = {
  peers; counter=0; say_hello_to=id}
let add_peers new_peers s = 
  {s with peers=s.peers@new_peers}
let state_to_string s = string_of_int s.counter

type msg = Hello | HelloBack
let msg_to_string = function
  | Hello -> "hello"
  | HelloBack -> "hello back"

let msg_serialize = msg_to_string

let msg_deserialize = function
  | "hello" -> Hello
  | "hello back" -> HelloBack

let eval event state =
  match event with
  | PacketArrival (id,Hello) -> 
(None, [PacketDispatch (id, HelloBack)])
  | PacketArrival (_,HelloBack) -> 
(Some {state with counter=state.counter +1}, [])
  | Startup _ -> 
(None, [PacketDispatch (state.say_hello_to, Hello)])

This was just a quick hack to test that the simulator will generalise over consensus implementations.

Stage 3 – Using mli files to generalise polymorphic variants

Yojson uses polymorphic variants for describing the structure of JSON files. Polymorphic variants (the ones which start with the backtick) can be declared on the fly and don’t need type definitions beforehand. Initially, we got type errors like this for many of our JSON parsing functions:

Error: Signature mismatch:
       ...
       Values do not match:
         val parse_config :
           [< `Assoc of (string * [> `Int of int ]) list ] -> config
       is not included in
         val parse_config : Yojson.Safe.json -> config

These issues where quickly address by adding mli files or :json type annotations

 

Stage 4 – Knowing when to give up

The simulation command line interface (without using config files) had become exponentially more complex, both in terms of implementation and usage. As of this PR, we on longer support command line simulation without the use of a config file. In hindsight, its unlikely that many would use this interface due to the unmanageable number of parameters (around 2 dozen for ocaml-raft)

Stage 5 – Protocol Selector

Now that we are able to parse the JSON config files and we have multiple protocols to choice between, we can write our protocol selector:

let protocol_selector config_file trace output_file no_sanity =
  match Json_handler.get_protocol config_file with
  | `Raft -> 
    let module R = Simulator.Simulate(Raft) in 
    R.start config_file trace output_file no_sanity
  | `Dummy -> 
    let module R = Simulator.Simulate(Dummy) in 
    R.start config_file trace output_file no_sanity

Since we only have two protocols to choose between, I’ve opted to duplicate the R.start line instead of using first class modules (fcm) as fcm requires packing and unpacking. Interestingly, I can’t directly call start from Simulator.Simulate(Raft), you need to assign it to an intermediate module in between. I’m not sure why this is case.

Conclusion

You can now pass protocol specific parameter to Coracle using JSON configuration files. This will be live on the demo server by tonight. Tomorrow, we will look at adding clients to the simulator.

#C3S2015 Trip Report – Day 1

Good Morning everyone!

I am here in Estonia with Zafar, Steffen and Jon at the First Cyber Security Summer School. It’s 8am and we have a packed schedule ahead of us today.

Good morning from a forest in Estonia

A photo posted by heidi (@heidiann360) on

Opening Remarks

Session 1

Our first session is lead by Steven M. Bellovin (Columbia University).

Session 2

Our second session is lead by Jaan Priisalu (Senior Fellow at CCDCOE), with the help of Andres Kütt, Heiko Vainsalu, Mark Erlich (Estonian Information System Authority) and Kristjan Vassil (University of Tartu)

Panel Discussion

The panel discuss is lead by Lauri Almann (BHC Laboratory) and includes Parag Pruthi, Jon Crowcroft, Konstantina Papagiannaki, Jaan Priisalu, Andres Kütt.

#C3S2015 Trip Report – Day 0

Hello from a lovely little spa hotel in a forest in Estonia.

After an excellent dinner and many coffees, our first session started at 9pm (yes! they are working us hard already). We received a warm welcome from Olaf Meaneel (TUT) to the first ever cyber security summer school (C3S2015). Dr Parag Pruthi kicked off proceedings with his talk titled “Advancing big data analytics – cyber defence solutions”.

Parag asked “When was the first cyber war?” The answer: in 1982, during the cold war, the CIA attacked the flow control software for soviet serbian gas pipeline. Our networks are even more fragile. Example of Iran hijacking US drone and some excellent clips from the IT crowd. Breaching systems is fast, discovery is slow and recovery is very slow. We always blame ‘dave’, we aren’t good at protecting against human error. Intrusion detection systems are not reliable, 1% false positive rate gives a trust levels of .19%.

We researchers are disconnected from the real world, we make simplifying assumptions, design a solution and test in simulation against the assumptions. Parag motivates engineering from real world network. He details the challenges in collecting petabytes of data, storage, compression, retrieval, processing and evaluating.

Parag key message was that big data provides us with near perfect information for intrusion detection.

Q: Is you approach limited in time, we must collect data and anaylsis before we can react?
A: Correct, we still have real people watching data visualisation, like a security guard watch CCTV, but they are not an order of magnitude faster then they where before.

Paper Notes: Ethical Privacy Guidelines for Mobile Connectivity Measurements [Report Nov’13]

Ethical Privacy Guidelines for Mobile Connectivity Measurements is the first item on the C3S reading list, below is my brief notes on this November 2013 report by the Oxford Internet Institute.

The stated purpose of this report is to inform networking researchers about the best practices for preserving data subject privacy when performing active measurement of mobile networks.

Researcher must make a comprise between the privacy of data subjects and dissemination of research artefacts for reproducibility. To aid in reasoning about this comprise, the report presents a risk assessments format covering: the contributions of the research, risk of de-identification, impact of re-identification, unforeseen risk (such as data theft), methods to dissemination artefacts, informed consent and transparency. The report goes onto discuss a few legal implications, in particular, the ongoing debate on whether IP addresses and communication metadata are personally identifiable information.

The authors focus on a guiding principle: collect the minimal data possible to conducted the stated research. Data should n0t be used for a secondary purpose unless explained at the consent stage. This includes open dissemination of the data collected. The report suggests some methods of fuzzing the data including: perturbation, truncation, permutation, quantisation, pseudonymization, k-anonymity and differential privacy.

Overall, I would recommend the report to someone new to the domain of data privacy, as its a nice introduction to the topic.  The authors raise awareness of the necessary compromise between reproducible research and data privacy. Though they do not provide concrete advise to researchers on how to make the best compromise (other than telling them to be conservative). The report claims to focus on active mobile measurements, in practice its contribution is much more general than this. I would love to see this report with real-world examples of measurement studies that have been conducted, the comprise between reproducible research and data privacy that was chosen and how it was chosen.

Paper Notes: The Network is Reliable [ACMQ July’14]

The Network is Reliable is an excellent article which attempts to formalise the discussion on real world failures for distributed systems. There is currently great debate on whether the assumption that network partitions are rare is too strong or too weak, for modern networks. Much of the data which we could use to answer this question is not published, instead is takes the form of anecdotes shared between sysadmins over a beer. This article lets the reader sit at the table and share the stories.

It is worth noting that network partition is used here in the broadest possible sense. We are looking at end-to-end communications and anything which can hinder these. This goes beyond simple link and switch failures to distributed GC, bugs in the NIC, and slow IO.

Here 5 examples of the anecdotes covered:

  • split brain caused by non-transitive reachability on EC2
  • redundancy doesn’t always prevent link failure
  • asymmetric reachability due to bugs in the NIC
  • GC and blocking for  I/O can cause huge runtime latencies
  • short transient failures become long term problem with existing algorithms

So, what does this mean for Unanimous and my research on reliable consensus for real world networks?

My original inspiration for working on consensus algorithms was the need for reliable consensus at the internet edge. With consensus comes fault tolerance,  from this we can construct new systems for preserving privacy and handling personal data, offering users a viable alternative to 3rd party centralised services.

However, it has become apparent that the issues of reliable consensus is much more general. This article illustrates that even within the constrained setting of a datacenter, distributed systems are failing to tolerance the wide range of possible failures.

 

 

Converting from Tikz to PNG

Here’s how I did it:

\documentclass[preview,border=4mm,convert={density=600,outext=.png}]{standalone}
 
\usepackage{url}
\usepackage{tikz}
\usepackage{color}
\begin{document}
 
\begin{tikzpicture}
add figure source here
\end{tikzpicture}
 
\end{document}

Then compile using:

latexmk  smr.tex -shell-escape

Unanimous: System Research Group talklet

I’m looking forward to sharing my thoughts on consensus for the edge network with the SRG today, abstract below

Many projects in the SRG at the moment (HAT, UCN, contacts app, MirageOS for ARM, Jitsu, databox, signposts) are trying to give individuals an viable alternative to 3rd party centralised services and put them back in control of their personal data. However developing applications for the hostile edge network, with its heterogeneous hosts and networks, trust issues and poorly understood middle boxes is tricky. This is made worse by the fact that consensus algorithms are famously difficult to use, underspecified and based on decade old assumption about the internet. In this talklet, I will motivate Unanimous, a new consensus algorithm for the modern internet.

NB: this is a practice talk for EuroSys doctoral workshop next Tuesday, thus this 5 min talk will simply motivate a research direction instead of presenting a complete solution.
EDIT (17/4): these slides are now online

Conservative election in Byzantine Fault Tolerant Raft

Tangaroa, is a Byzantine tolerant Raft implementation in Haskell, developed by Christopher Copeland and Hongxia Zhong for a Distributed Systems class at Stanford. The authors apply many of the approaches in PBFT to Raft, allowing for the Byzantine failure of nodes.  My interest in this work is how can you stop unnecessary leader elections in Raft algorithm using more conservative election approaches, given that election are too common in some environments (see s4.2 in ARC) .

Copeland and Zhong quickly identify this is one of the key barriers to developing a byzantine raft and describe their lazy voters approach. As with the original Raft protocol, a node time out, becomes a candidate, increments its terms and dispatches RequestVotes RPC’s. In addition to the normal restrictions on granting a vote, Tangaroa adds two extra conditions

  • new term must be < current_term + H, where H is a statistically unlikely number of split votes.
  • the current leader must have failed to dispatch AppendEntries within the heartbeat.

I am not convinced that the first condition will not lead the system to become permanently unavailable. For example, a network partition could separate the nodes into two groups thus one group will have a leader in a low term and the other group constantly trying to re-elect a new leader as it cannot get a majority. When network connectivity is restored, the nodes in will observe a sudden unbounded term increase. Thus nodes will not give out their votes and the system is stalled.

The author’s highlight how the second condition could be implemented. Nodes who are in a position to grant a vote except that the second condition isn’t meet, will record the vote locally and then dispatch the votes only when they observe a leader failure for themselves. This means that a node may only replace the leader when a majority of nodes detect a failure (or the client does) instead of when a single node detects a failure. The paper doesn’t go into much detail about this and how this would impact performance, I suspect that this approach is a little too conservative and many in some cases make it significantly more difficult to replace a fault leader. Various implementation details need to be consider here, for example, if a node times out and dispatches its vote, how long until the node steps up to candidate itself? or after how long does it “forget” its local vote, the election timeout maybe.

Writing a Byzantine Fault Tolerant Raft was always going to be hard, Tangaroa is a good first step in the right direction.