Azure Latency Pilot Study: Part 3 – Machine specific results

In this post we will be looking at the results of the Azure latency pilot study described last week. Yesterday, we started by looking at the aggregated results and found that the measured RTT was larger than expected. Today, we will look at how the results vary depending on which VMs the measurements were taken between. It may be the case that we can infer something about the topology between VMs, for example whether VMs are on the same physical host or in the same rack.

The table below shows the RTT between each pair of VMs. The first server in the pair, labelled src, is the one which initiated the ping. The table includes each machine pinging itself for comparison. All values are in milliseconds.

src	dst	mean	s.d	min	25th	50th	75th	max
1	1	51.0	65.1	1.7	4.3	65.25	71.6	1004.0
1	2	12.1	89.7	1.7	2.9	3.6	4.3	1004.7
1	3	9.5	66.5	1.6	3.4	4.1	4.8	1004.7
1	4	7.3	52.3	1.3	3.0	3.5	4.0	1004.1
1	5	11.8	84.9	1.6	3.0	3.6	4.4	1242.2
2	1	6.2	12.7	1.6	3.0	3.7	4.425	88.4
2	2	16.7	65.8	1.1	3.0	3.7	4.6	1002.9
2	3	15.0	95.6	1.4	3.0	3.7	4.4	1004.9
2	4	9.6	62.0	1.4	2.9	3.5	4.3	1003.6
2	5	15.3	80.0	1.3	2.9	3.5	4.3	1002.5
3	1	48.5	44.9	2.0	4.5	64.2	70.6	754.9
3	2	7.0	48.9	1.4	2.9	3.6	4.4	1005.0
3	3	3.9	4.2	1.6	3.1	3.8	4.6	124.9
3	4	5.8	32.0	1.3	2.9	3.4	4.0	628.6
3	5	6.7	52.5	1.5	2.9	3.8	4.4	1229.3
4	1	48.8	48.3	1.9	4.7	62.1	67.3	1003.0
4	2	7.7	59.5	1.3	2.8	3.6	4.4	1003.8
4	3	9.0	60.1	1.6	3.3	4.1	4.8	1004.0
4	4	5.2	29.3	1.2	2.7	3.3	3.8	754.6
4	5	15.6	104.7	1.5	2.8	3.6	4.3	1035.0
5	1	50.9	69.8	2.0	4.5	63.8	69.6	1005.3
5	2	9.4	70.9	1.3	2.9	3.7	4.3	1003.8
5	3	8.6	57.6	2.0	3.3	4.1	4.8	1003.7
5	4	6.5	47.7	1.3	2.7	3.4	4.0	1003.0
5	5	6.8	49.2	1.2	2.5	3.2	4.1	1023.0
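As a rough illustration of how a row of this table could be derived from raw samples, here is a minimal Python sketch. It assumes linear interpolation between ranks for the percentiles and the sample standard deviation; the post does not say which definitions were actually used. The function names and the example samples are made up for illustration and are not taken from the study.

```python
import math
import statistics


def percentile(sorted_values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = (len(sorted_values) - 1) * p / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def summarise(rtts_ms):
    """Summarise the RTT samples (in ms) for one (src, dst) pair."""
    ordered = sorted(rtts_ms)
    return {
        "mean": statistics.mean(ordered),
        # Sample standard deviation; an assumption, see the lead-in.
        "s.d": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "min": ordered[0],
        "25th": percentile(ordered, 25),
        "50th": percentile(ordered, 50),
        "75th": percentile(ordered, 75),
        "max": ordered[-1],
    }


# Made-up samples for illustration only; not taken from the study.
example_rtts_ms = [3.0, 3.5, 3.6, 4.0, 4.4, 1004.0]
summary = summarise(example_rtts_ms)
```

With these made-up samples a single measurement of about one second pulls the mean above the 75th percentile, which is the same pattern seen in many rows of the table (for example 1 to 2, with a mean of 12.1 and a 75th percentile of 4.3).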

All VMs other than VM 2 have high latency to VM 1. In fact, we see a median RTT of around 65 ms (mean 51.0 ms) from VM 1 to itself. This warrants further investigation into how hping3 is measuring latency. Removing VM 1 from the equation, we observe reasonable uniformity in the RTTs between VMs 2 to 5. Across these pairs the minimum and the 25th, 50th and 75th percentiles are all similar, while the maximum varies widely, which is to be expected.

I would like to take a closer look at how the distribution of RTT measurements varies between VMs 2 to 5. The table below shows the RTT, in milliseconds, between each pair of VMs from 2 to 5 at various percentile points.

src	dst	80th	90th	95th	98th	99th
2	2	5.1	41.8	47.2	124.6	150.7
2	3	4.6	5.0	5.5	34.5	429.5
2	4	4.5	4.9	5.5	47.4	78.5
2	5	4.5	5.4	60.4	91.8	207.8
3	2	4.5	4.8	5.0	5.5	8.9
3	3	4.7	5.0	5.1	5.4	5.8
3	4	4.2	4.5	4.8	5.2	6.1
3	5	4.6	4.9	5.2	5.9	6.6
4	2	4.5	4.8	5.0	5.2	5.8
4	3	5.0	5.2	5.4	5.7	6.4
4	4	4.0	4.3	4.7	5.1	5.8
4	5	4.5	4.8	5.2	6.1	762.9
5	2	4.5	4.8	4.9	5.1	6.1
5	3	4.9	5.2	5.4	5.6	6.8
5	4	4.1	4.5	4.9	5.3	6.3
5	5	4.3	4.6	4.8	5.8	7.2

Yesterday, we saw that the 90th percentile for the dataset as a whole was 61.4 ms. This is not representative of the RTT between VMs 2-5, where the 90th percentile is around 5 ms for every pair except VM 2 to itself (41.8 ms). We can see this information more clearly using the following five CDFs, one per source machine, each showing the round trip time from that machine to each of the others (and itself).

[Figure: five CDF plots of RTT, one per source VM]
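For readers who want to build a similar curve from their own measurements, an empirical CDF can be sketched as follows. This is a minimal example under assumptions: the function name and the sample values are made up for illustration and are not taken from the study, and the samples are assumed to be already rounded to 0.1 ms.

```python
from collections import Counter


def empirical_cdf(rtts_ms):
    """Return sorted (value, cumulative fraction) points for the samples."""
    counts = Counter(rtts_ms)
    total = len(rtts_ms)
    points = []
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        points.append((value, seen / total))
    return points


# Made-up samples, already rounded to 0.1 ms; not taken from the study.
example_rtts_ms = [3.4, 3.6, 3.6, 3.6, 4.1, 4.1, 4.3, 65.2]
cdf_points = empirical_cdf(example_rtts_ms)
```

Repeated values collapse into a single point with a larger jump in the cumulative fraction, so coarsely rounded measurements give a curve with visible steps.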

Machine 1 is a clear outlier from the perspective of machines 1, 3, 4 and 5. The observed RTT doesn't seem to be symmetric: the median RTT from VM 3 to VM 1 is 64.2 ms, for example, while from VM 1 to VM 3 it is 4.1 ms. Again, this asymmetry warrants further investigation. The stepping in the CDFs is because the RTT is recorded to one decimal place.
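The asymmetry can be checked directly from the 50th percentile column of the first table. The sketch below compares the two directions of each pair involving VM 1. The median values are copied from the table; the function name and the 10 ms threshold are arbitrary choices made for illustration.

```python
# Median (50th percentile) RTT in ms, copied from the first table,
# keyed by (src, dst). Only pairs involving VM 1 are included.
median_rtt_ms = {
    (1, 2): 3.6, (2, 1): 3.7,
    (1, 3): 4.1, (3, 1): 64.2,
    (1, 4): 3.5, (4, 1): 62.1,
    (1, 5): 3.6, (5, 1): 63.8,
}


def asymmetric_pairs(medians, threshold_ms):
    """Return (src, dst, difference) for pairs whose two directions
    differ by more than threshold_ms."""
    flagged = []
    for (src, dst), forward in sorted(medians.items()):
        # Visit each unordered pair once, from its lower-numbered VM.
        if src < dst and (dst, src) in medians:
            difference = abs(forward - medians[(dst, src)])
            if difference > threshold_ms:
                flagged.append((src, dst, round(difference, 1)))
    return flagged


# The 10 ms threshold is an arbitrary illustrative choice.
flagged = asymmetric_pairs(median_rtt_ms, threshold_ms=10.0)
```

With these values the pairs (1, 3), (1, 4) and (1, 5) are flagged, each differing by roughly 60 ms between directions, while (1, 2) is not.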

Next time, we will look at how the observed RTT varies with time.
