I asked a question on today’s BBC radio 4 show “Any questions?”, http://www.bbc.co.uk/programmes/b06b3ny4, skip to 42:07 to hear me nervously ask “How can we protect the rights of citizens in an increasingly digital world?” and hear the panel’s response. The responses where fairly disappointing but it helps to keep the debate alive.
In this post we will be looking at the results for the Azure latency Pilot study described last week. Yesterday, we started by looking at the aggregated results and found that the measured RTT was larger then expected. Today, we will look at how the results vary depending on which VMs the measurements where taken between. It may be the case that we can infer something about topology between VMs, for example whether VM’s are in the same physical host and the same rack.
The table below shows the RTT between each pair of VMs. The first server in the pair, labelled src is the one which initialised the ping. The table includes each machine pinging itself for comparison.
All VMs other than VM 2 have high latency to VM 1. In fact, we see an average 65 ms RTT from VM 1 to itself. This warrants further investigation into how hping3 is measuring latency. Removing VM 1 from the equation, we observe reasonable uniformity in the RRTs between VMs 2 to 5. Between these the min, 25th, 50th and 75th percentile are all similar and the maximum varies highly, which is to be expected.
I would like to take a close look at how the distribution of RTT measurements varies between VMs 2 to 5. The table below shows the RTT between each pair of VMs between 2 to 5, at various percentile points.
Yesterday, we saw that the 90th percentile for dataset as a whole was 61.4 ms, this is not representative of the RTT between VMs 2-5. We can see this information more clearly using the following 5 CDF, each graph representing the round trip time from each machine to each of the others (and itself).
Machine 1 is a clear outlier from the perspective for machines 1, 3, 4 and 5. The observed RTT doesn’t seems to be symmetric. Again this asymmetry warrants further investigation. The stepping in the CDFs is because the RTT is recorded to the nearest 1 decimal place.
Next time, we will look at how the observed RTT varies with time.
This is post we will be looking at the results for the Azure latency Pilot study described last week. We will starting by looking at the aggregate results, disregarding the time a measurement was taken and which machines the measurement was taken between.
The 22332 data points have been processed in Python3, in particular using the matplotlib and numpy libraries. The scripts are available in azure-measurements repository.
They currently only use the average round trip time, as reported by hping3, average over 10 pings.
The results in the table above are much larger than I expected. Given the large standard deviation and very large maximum value, it is likely that a few large measures have skewed the results. Let’s take a look at the cumulative distribution function (CDF) and percentile points to see if this is the case.
As expect, some large measurements have skewed the results. However, the proportion of measurement which are considered large is much greater then I expected. This warrants further investigation.
Before that, lets take a closer look at how the majority of value are distributed. Due to the central limit theorem and the sufficiently large sample size, we would expect to see a normal distribution. In simulators such as Raft Refloated, we simulate latency as normally distributed with given parameters, discarding values below a threshold value. We can take a closer look at the probability density function (PDF) and see if this is a reasonable approximation.
The green bars represent the probability of each RTT. We see an approximate normal distribution, although it is clear that this distribution doesn’t have the same parameters as the data set as a whole. The red lines shows a normal distributed with mean 3.6 and standard deviation of 1. This red line appears to be a reasonable approximation and could be used in simulation.
Next post, we will looks at how the measured RTT differs depending which of the 5 machines the measurement was taken between.
I’ll just leave this here, my all time favourite youtube video