Brendon Jones's blog

21 Aug 2015

Wrote some more unit tests to check that AMP tests were correctly
reporting data using protocol buffers, and that the data coming out
matched what was put in.

Updated the build system to properly reflect the new requirements for
protocol buffers and Debian packaging dependencies.

Did some initial testing with individual tests to make sure that NNTSC
would accept the data, and fixed a couple of issues that I found (mostly
signed vs unsigned mismatches). Ran a proper client with a full test
schedule and checked the results against existing data to make sure that
everything was working as expected.
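
To illustrate the kind of signed versus unsigned mismatch mentioned above, the following Python sketch packs a value as a signed 32-bit integer and reads the same bytes back as unsigned. The field width, byte order and the -1 value are assumptions chosen for the example, not the actual AMP report format.

```python
import struct

# Assumed for illustration: a 32-bit field in network byte order.
raw = struct.pack("!i", -1)

# The same four bytes give very different values depending on
# whether the reader treats the field as signed or unsigned.
as_signed = struct.unpack("!i", raw)[0]
as_unsigned = struct.unpack("!I", raw)[0]

print(as_signed)    # -1
print(as_unsigned)  # 4294967295
```

If the writer and the reader disagree about signedness, a "missing" marker such as -1 can arrive as a very large but plausible-looking measurement.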

12 Aug 2015

Converted the HTTP test to report data using protocol buffers.

Wrote a simple unit test for the DNS test to check that the data coming
out was the same as the data going in, and did some testing with NNTSC
to make sure that the data was in the appropriate format to be inserted
into the database. Found and fixed a few errors where things weren't
being set appropriately.

Spent some time looking into the slow HTTP test data I have been
collecting. Around 60% of the objects that were slow to fetch were on
new connections (usually the first one where we fetch the initial HTML)
and the delay was normally between sending the request and receiving any
response bytes. However, there are enough delays in different places and
for different reasons that there is no obvious single cause, so more
investigation is required.

06 Aug 2015

Converted the DNS, TCPPing, traceroute and throughput tests to report
data using protocol buffers, and updated the scripts used by NNTSC to
extract/parse the report messages. Updated the build system to
automatically build all the appropriate files from the .proto definition
files.

Wrote some unit tests to make sure that the data being put into the
protocol buffers was the same as the data coming out and that optional
fields were appropriately present/absent.

Started collecting some more data on slow HTTP tests, dumping full
result data to try to see if there are any patterns around what objects
are slow to fetch, and which part of the transfer process is slow.

30 Jul 2015

Built and tested new amplet client packages for the wheezy portion of
the New Zealand mesh, including the ASN lookup fixes from last week.
Deployed them on the local test amplet to run.

Chased up a few issues that had come to light recently, including using
the correct credentials to sign the cert used by Apache when serving
client keys, the HTTP test incorrectly reporting "-1" data rather than
None/missing, and a few minor compiler warnings.

Started to move the amplet test reporting away from handcrafted
structures to Google protocol buffers. This will take care of a lot of
the boring bits around encoding variable length data and makes it a lot
easier to report only the data required to describe a test result
(rather than including unused fields). So far I have updated the ICMP
test to use protocol buffers and it has been a pleasantly easy experience.

22 Jul 2015

Rewrote some of the code around ASN lookups for traceroute tests to make it more robust when the server is unreachable. Failure to connect is now detected much more quickly (using non-blocking sockets), and if anything goes wrong with a remote lookup then the thread will stop trying and respond only with cached data. The flow of control is now also a lot simpler, which means sockets always get read from in a timely manner and the code is a lot clearer.
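
The general pattern described above (a non-blocking connect with a short timeout, then falling back to cached answers once the remote side has failed) can be sketched in Python as follows. This is not the amplet client code: `connect_with_timeout`, `AsnLookup` and the `query` callable are hypothetical names for the example, and the server is assumed to be given as a numeric IPv4 address so that no name resolution is needed.

```python
import errno
import select
import socket


def connect_with_timeout(ip, port, timeout=2.0):
    """Return a connected TCP socket, or None on failure or timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex((ip, port))
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        return None
    # The socket becomes writable once the connect attempt has finished,
    # successfully or not; SO_ERROR tells us which.
    _, writable, _ = select.select([], [sock], [], timeout)
    if not writable or sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR):
        sock.close()
        return None
    return sock


class AsnLookup:
    """Hypothetical lookup helper that gives up on the server after one failure."""

    def __init__(self, server, query, cache=None, timeout=2.0):
        self.server = server          # (ip, port) tuple, assumed numeric
        self.query = query            # hypothetical callable: query(sock, address)
        self.cache = dict(cache or {})
        self.timeout = timeout
        self.remote_ok = True

    def lookup(self, address):
        if address in self.cache:
            return self.cache[address]
        if not self.remote_ok:
            return None               # cached data only from now on
        sock = connect_with_timeout(self.server[0], self.server[1], self.timeout)
        if sock is None:
            self.remote_ok = False
            return None
        try:
            asn = self.query(sock, address)
        except OSError:
            self.remote_ok = False
            return None
        finally:
            sock.close()
        self.cache[address] = asn
        return asn
```

Once `remote_ok` is cleared, later lookups never touch the network again, so an unreachable server costs at most one timeout.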

Spent longer than I would have liked trying to diagnose the problems around ASN lookups due to not realising rsyslog was throttling all the extra debug output I had added.

Had a bit of a deeper look into some of the long duration HTTP tests to see if there might be any patterns of interest. Trademe stands out as the site that has the most slow (>10 second) page fetches, but the majority of them come from the same connection, with other connections being perfectly fine. Most of the other targets are pretty consistent with each other.

16 Jul 2015

Spent some time investigating unusual data to make sure it wasn't
occurring in the amplet tests. Monitoring of management connections
found one that was sharing a physical link with a test connection. Some
HTTP tests were having unusually long run times, which appear to be
caused by the server infrastructure and not our own DNS lookups or
connections.

Started testing a new version of the amplet client for deployment on the
NZ mesh. Ran into an issue with our large schedule files where a count
variable was too small and overflowing. Results were collected fine, but
most of them were being thrown away when reported. Split all report
messages into smaller chunks as a short term solution that doesn't
require updating the server side code (still aim to move to something
smarter like protocol buffers).
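
A minimal Python sketch of this failure mode and the chunking workaround follows. The width of the count field (one unsigned byte here) and the wire layout are assumptions for illustration only; the real report format is not shown in this post.

```python
import struct

MAX_COUNT = 255  # assumed: the count field is a single unsigned byte


def truncated_count(n):
    """What a one-byte count field would hold if n results were reported."""
    return n & 0xFF


def chunk_results(results, max_count=MAX_COUNT):
    """Split results into pieces small enough for the count field."""
    return [results[i:i + max_count] for i in range(0, len(results), max_count)]


def pack_report(chunk):
    """Hypothetical layout: one-byte count, then 32-bit unsigned values."""
    return struct.pack("!B%dI" % len(chunk), len(chunk), *chunk)


results = list(range(300))
print(truncated_count(len(results)))           # 44, so 256 results would be dropped
messages = [pack_report(c) for c in chunk_results(results)]
print([len(c) for c in chunk_results(results)])  # [255, 45]
```

Each message now carries a count that fits its field, so the receiving side can parse it without any changes.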

Made no useful progress on getting Chromium to fetch/modify headers
without crashing. There are newer versions I need to try, but they
require more recent versions of libraries than I have.

06 Jul 2015

Kept working with Chromium to try to get complete information on object
fetch timings. It looks like I should be able to get full timing
information for every object if I can set the Timing-Allow-Origin
header. Currently stymied by the library crashing in its memory freelist
implementation when I try to modify HTTP response headers.

Had a closer look into the behaviour of wget to try to confirm some test
methodology used by some other data sources I'm looking at. Turns out
that wget actually measures only the amount of time spent reading from
the socket and ignores everything else, reporting a very misleading time
and throughput.
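
To show why timing only the socket reads is misleading, here is a small Python calculation. All of the phase durations and the object size are invented numbers for the example, not measurements from wget or from the AMP HTTP test.

```python
def throughput(bytes_received, seconds):
    """Bytes per second over the given interval."""
    return bytes_received / seconds


# Hypothetical timings (seconds) for fetching a single 1 MB object.
phases = {"dns": 0.2, "connect": 0.3, "wait_first_byte": 2.0, "read": 0.5}
size = 1_000_000

read_only = throughput(size, phases["read"])
end_to_end = throughput(size, sum(phases.values()))

print(round(read_only))   # 2000000 bytes/s if only read time is counted
print(round(end_to_end))  # 333333 bytes/s over the whole fetch
```

With these made-up numbers the read-only figure is six times higher than what was actually experienced over the whole fetch.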

Investigated further into some MTU issues we were seeing to confirm the
behaviour we were reporting. Something in the path only has a 1400 byte
MTU but doesn't always send Packet Too Big messages, which is causing
lots of connection failures.

Spent some time proofreading reports.

06 Jul 2015

Made a new dump of up to date data for analysis, including all test
types this time. Spent some time talking to Ray about it.

Generated some graphs to try to show the comparison in latency between
two connections. Some connections that should be quite similar are a few
milliseconds different to the same target at the same time, but quite
consistent across all targets and all times. Connections that are known
to be different also exhibited a lot of similar latency to some targets,
but across multiple targets and over time there are clear differences.

Spent some more time trying to get to grips with the embedded Chromium
library and how to implement my own versions of the URL fetching functions.

24 Jun 2015

Spent some more time looking into using embedded Chromium as part of the
HTTP test. I've managed to successfully extract all of the
navigation/resource timing information from the browser after the page
has loaded, which is very useful. Getting access to headers looks like it
will require implementing my own resource handlers and processing
requests manually, but should be doable. Also, I still haven't managed
to completely decouple the browser from GTK - something is still trying
to initialise it even though there is no need for it and nothing is ever
drawn to the screen.

Tidied up some more configuration parsing and parts of the main loop in
the amplet client, removing the need for a few more global variables
that were convenient at the time of writing.

Helped Brad configure a new measurement machine to be sent out.
Reconfigured some of the existing machines to swap the management ports
around so we can test them without the reporting traffic interfering.

15 Jun 2015

Short week as I was off sick on Monday and Tuesday.

Spent some time looking into using a headless web testing environment
as an alternative to the current HTTP test. This would give us
JavaScript support and allow us to fetch the page items we currently
don't (due to them being generated programmatically or obfuscated). Not
all of the headless testing software appears to give full access to the
events that I'm interested in, while some are written such that they
will be awkward to integrate into an AMP test. Currently looking at
embedded Chromium as most likely to be useful.

Started refactoring some of the configuration parsing code in amplet to
remove some unnecessary globals and remove some cruft from the main loop
that didn't really need to be there.