
The aim of this project is to develop a system whereby network measurements from a variety of sources can be used to detect and report on events occurring on the network in a timely and useful fashion. The project can be broken down into four major components:

Measurement: The development and evaluation of software to collect the network measurements. Some software used will be pre-existing, e.g. SmokePing, but most of the collection will use our own software, such as AMP, libprotoident and maji. This component is mostly complete.

Collection: The collection, storage and conversion to a standardised format of the network measurements. Measurements will come from multiple locations within or around the network, so we will need a system for receiving measurements from monitor hosts. Raw measurement values will need to be stored in a way that allows querying, particularly for later presentation. Finally, each measurement technology is likely to use a different output format, so its output will need to be converted to a standard format that is suitable for the next component.

Eventing: Analysis of the measurements to determine whether network events have occurred. Because we are using multiple measurement sources, this component will need to aggregate events that are detected by multiple sources into a single event. This component also covers alerting, i.e. deciding how serious an event is and alerting network operators appropriately.

Presentation: Allowing network operators to inspect the measurements being reported for their network and see the context of the events that they are being alerted on. The general plan here is for web-based zoomable graphs with a flexible querying system.




Developed another detector for netevmon based on the Binary Segmentation algorithm for detecting changepoints. The detector appears to work very well and outperforms most of our existing detectors in terms of detection latency, i.e. the time between an event beginning and the event being reported.
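
As a rough illustration of the general technique (not the netevmon detector itself), binary segmentation can be sketched in a few lines of Python. The sketch assumes a simple cost function (sum of squared deviations from the segment mean) and a fixed penalty; the names `best_split`, `binary_segmentation`, `penalty` and `min_size` are all made up for this example.

```python
from statistics import fmean


def _cost(values):
    """Sum of squared deviations from the segment mean."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(values, min_size):
    """Return (gain, index) for the single split that most reduces the cost."""
    total = _cost(values)
    best_gain, best_index = 0.0, None
    for i in range(min_size, len(values) - min_size + 1):
        gain = total - _cost(values[:i]) - _cost(values[i:])
        if gain > best_gain:
            best_gain, best_index = gain, i
    return best_gain, best_index


def binary_segmentation(values, penalty, min_size=3, offset=0):
    """Recursively split the series wherever a split beats the penalty.

    Returns the sorted indices at which a new segment starts.
    The penalty and min_size values are assumptions for this sketch.
    """
    gain, index = best_split(values, min_size)
    if index is None or gain <= penalty:
        return []
    left = binary_segmentation(values[:index], penalty, min_size, offset)
    right = binary_segmentation(values[index:], penalty, min_size, offset + index)
    return left + [offset + index] + right
```

For example, `binary_segmentation([10.0] * 20 + [30.0] * 20, penalty=50.0)` reports a single changepoint at index 20. A live detector would have to run something like this over a sliding window of recent measurements, which is where detection latency comes in.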

Finally was able to migrate prophet's database over to our new faster schema and upgrade NNTSC accordingly. Aside from a couple of minor glitches that were easily fixed, the upgrade went pretty well and our database now performs somewhat better than before, although I'm not convinced it will be fast enough for the full production AMP mesh.

Experimented with a few other event detection approaches for our latency time series, but unfortunately these didn't really go anywhere useful.




Finished translating the mode detection over to C++ and managed to get it producing the same results as my original python prototype. Started running it against all of our AMP latency streams which was mostly successful but it looks like there are one or two very rare edge cases that can cause it to fall over entirely. Unfortunately, the problems are difficult to replicate, especially as the failures can occur at a point where I have no idea which time series I'm looking at, so debugging looks like it might be painful.

Wrote a new detector that uses the modes reported by my new code to identify mode changes or the appearance of new modes. It would possibly be more effective if the mode detection was performed more often (currently I look for new modes every 10 minutes), but I'm concerned about the performance impact of doing it more frequently.

Started investigating other potential anomaly detection methods. Had a look at Twitter's recent breakout detection R module, but it didn't perform very well with our latency data. Found another changepoint module in R which appears to work much better, so will start looking at developing our own version of this algorithm.




Continued the painful process of migrating my python prototype for mode detection over to C++ for inclusion in netevmon. Managed to get the embedded R portion working correctly, which should be the trickiest part.

Spent a bit of time with our new libtrace testbed, getting the DAG 7.5G2s configured and capturing correctly. Ran into some problems getting the card to steer packets captured on each interface into separate stream buffers, as the firmware we are currently running doesn't appear to support steering.




Managed to get my python prototype doing a reasonable job of finding modes in a selection of time series from the current prophet database. Added a new system for determining the 'width' of a detected mode -- wide modes cover a large range of values in the probability density function and are therefore more likely to indicate a noisy data series. Width is calculated using both the relative standard deviation and the quartile coefficient of dispersion.
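
The two width measures can be sketched with Python's statistics module. This is only an illustration of the standard definitions; how the prototype combines the two values is not shown here, and the function names are hypothetical.

```python
from statistics import fmean, pstdev, quantiles


def relative_std_dev(samples):
    """Standard deviation divided by the mean."""
    return pstdev(samples) / fmean(samples)


def quartile_dispersion(samples):
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    q1, _median, q3 = quantiles(samples, n=4, method="inclusive")
    return (q3 - q1) / (q3 + q1)


def mode_width(samples):
    """Return both measures for the samples that fall within one mode."""
    return relative_std_dev(samples), quartile_dispersion(samples)
```

A tight mode such as `[48, 49, 50, 51, 52]` gives small values for both measures, while a spread-out set such as `[10, 40, 50, 60, 90]` gives much larger ones.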

Started converting the python prototype into C++ code so it can be incorporated into netevmon.

Spent the remainder of my week reading over Richard and Craig's Honours reports and making plenty of little suggestions as to how to improve the language and make sure the important points come across clearly to the reader.




Modified the amp-web matrix to add a dropdown selector for the type of latency to show on the latency matrix (TCP, ICMP or DNS). Removed the tabs for absolute and relative DNS latency, as this is now incorporated into the generic latency tabs.

My heuristics for identifying multimodal series were not quite as effective as I had hoped, so I spent the remainder of my week investigating methods used by real statisticians to find modes in a sample set. The approach I have taken involves estimating the probability density from the observed measurements using a kernel function. This results in a smoothed line graph where the peaks represent likely modes.

By examining the differences between consecutive values on the line graph, I find the local maxima and minima in the density function. The maxima are, of course, the modes themselves while the minima are required for the following step. I then use Fisher and Marron's method to eliminate or merge "minor" modes in my set of maxima. This seems to work reasonably well in the limited test cases I have provided so far, although much of the math is too complicated for me to implement entirely within netevmon. Instead, it looks like we will be calling out to R to generate the density function, but it seems likely that R will be able to do this much faster than any naive implementation I write anyway.
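
A minimal sketch of the first two steps (density estimation and finding the extrema) might look like the following. It assumes a Gaussian kernel with a hand-picked bandwidth, uses pure Python rather than R, and leaves out the Fisher and Marron merging step entirely; the function names are made up for the example.

```python
import math


def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


def find_extrema(density):
    """Return (maxima, minima) as grid indices.

    Works on the differences between consecutive density values: a sign
    change from positive to non-positive marks a local maximum, and a
    change from negative to non-negative marks a local minimum.
    """
    maxima, minima = [], []
    for i in range(1, len(density) - 1):
        before = density[i] - density[i - 1]
        after = density[i + 1] - density[i]
        if before > 0 and after <= 0:
            maxima.append(i)
        elif before < 0 and after >= 0:
            minima.append(i)
    return maxima, minima
```

With samples clustered at 20 ms and 80 ms, a grid of `range(0, 101)` and a bandwidth of 5, this finds maxima at grid indices 20 and 80 and a single minimum at 50. The bandwidth has a large effect on how many peaks appear, so a real implementation would use a proper bandwidth selector.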




Finished and submitted my PAM paper, after incorporating some feedback from Richard.

Fixed a minor libwandio bug where it was not giving any indication that a gzipped file was truncated early and content was missing.

Managed to get a new version of the amplet code from Brendon installed on my test amplet. Set up a full schedule of tests and found a few bugs that I reported back to the developer. By the end of the week, we were getting closer to having a full set of tests working properly -- just one or two outstanding bugs in the traceroute test.

Got netevmon running again on the test NNTSC. Noticed that we are getting a lot of false positives for the changepoint and mode detectors for test targets that are hosted on Akamai. This is because the series is fluctuating between two latency values and the detectors get confused as to which of the values is "normal" -- whenever it switches between them, we get an erroneous event. Added a new time series type to combat this: multimodal, where the series has 2 or 3 clear modes that it is always switching between. Multimodal series will not run the changepoint or mode detectors, but I hope to add a special multimode detector that alerts if a new and different mode appears (or an old mode disappears).




Finished developing and testing stream / collection selection in netevmon.

Added support for the HTTP test back into NNTSC. We only store basic statistics from the test, i.e. number of objects, bytes, servers and the time taken to fetch everything, as opposed to the previous schema which tried to store detailed information about each individual fetched object. Managed to get my own amplet VM to do some testing and have been happily running HTTP tests for most of the week.

Replaced the pika code in NNTSC to use asynchronous connections rather than blocking connections. This should make our rabbit queue publishing and consuming code a bit more robust, especially if a TCP connection breaks down, and it also appears to have made our backlog processing much faster.

Spent a decent chunk of time chasing down a bug in the AMP HTTP test that would cause it to segfault if you tested to certain sites. After delving deep into the flex code that parses the HTML on the fetched pages looking for other objects to fetch, we eventually found that the buffer being provided to store the URL of the found object was not big enough to fit all the URLs we were seeing.




Released a new version of libtrace on Tuesday that contains the most recent batch of bug fixes. Started moving the libtrace wiki from trac to github; only the tool pages are left to migrate.

Updated netevmon to support the new family-based streams in NNTSC. Since this new approach results in one time series per stream (as opposed to multiple streams having to be aggregated into each time series), this greatly simplified the anomalyfeed script. Added event detection for changes in AS paths which operates in much the same way as the old IP path event detection.

Started adding the ability to specify a subset of streams / collections for event detection in netevmon, rather than automatically running against all streams. The streams / collections of interest are provided via a config file and a SIGHUP will cause the file to be re-read and any necessary changes made. This also meant I had to add unsubscribe support to the NNTSC exporter, so that it would stop sending live updates for streams that had been removed from the config file.




Finished up a draft of the PAM paper, eventually managing to squeeze it into the 12 page limit.

Spent a bit of time learning about DPDK while investigating a build bug reported by someone trying to use libtrace's DPDK support. Turns out we were a little way behind current DPDK releases, but Richard S has managed to bring us more up-to-date over the past few days. Spent my Friday afternoon fixing up the last outstanding known issue in libtrace (trace_interrupt not working for most live formats) in preparation for a release in the next week or two.




Spent most of my week writing up a paper for PAM on the event detectors we've implemented in netevmon.

Wrote and tested a script to ease the transition from the current per-address stream format to a per-family stream format. We've already accepted that we're not going to try and migrate any existing collected data for the affected collections, so it is mostly a case of making sure we drop all the right tables (and don't drop any wrong ones).

Spent Wednesday at the student Honours conference. Our students did fairly well and were much improved on their practice talks.




Wrote a script to query prophet's database to extract the Smokeping time series used to generate the event ground truth data used in Meena's masters project, with an eye towards releasing the time series and the associated events that we have identified as a dataset for the anomaly detection community to use to validate and compare new techniques.

Went over all of the events that we had found and updated them to match the current output of our event detection software, which had changed quite a bit since we originally collected the events. There were also quite a few errors and inconsistencies in the significance ratings for the events, so I ended up spending most of my week working on this. Many of the changes were made to events that I had originally classified, so I can't blame the students entirely :)

Spent a decent chunk of Wednesday listening to our students give their Honours practice talks. The good thing is that they all appear to have done some useful work so far, but there's a bit of work to do in terms of making that work accessible to a general CS audience.




Brendon deployed the new amp-traceroute test on a VM early in the week, so I was finally able to test the new amp-traceroute database schema. After a few minor glitches, we were able to get both AS paths and IP paths going into and coming out of a NNTSC database.

Updated the existing traceroute graphs to use the new data formats. Hop count and rainbow graphs are both now based on AS paths, which we will be measuring much more frequently than IP paths. In particular, using AS paths should make our rainbow graphs a bit more useful rather than looking like a bad patchwork quilt.

Merged Brad Christensen's traceroute map graph into my current amp-web branch and updated it to work with the IP path data that we are now collecting. The map graph now "works" but there are a lot of improvements to make in the future. Sizing nodes and edges based on the frequency that the hop was hit is the main goal, but we also need to figure out what to display on the summary graph.




Added support for the new amp-tcpping test to ampy and amp-web.

Started on yet another major database schema change. This time, we're getting rid of address-based streams for amp collections and instead having one stream per address family per target. For example, instead of having an amp-icmp stream for every google address we observed, we'll just have two: one for ipv4 and one for ipv6.

This will hopefully result in some performance improvements. Firstly, we'll be doing a maximum of 2 inserts per test/source/dest combination, rather than anywhere up to 20 for some targets. We'll also have far fewer streams to search and process when starting up a NNTSC client. Finally, we should save a lot of time when querying for data, as almost all of our use cases were taking the old stream data and aggregating it based on address family anyway. Now our data is effectively pre-aggregated -- we will also have far fewer joins and unions across multiple tables.
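
To illustrate the difference, here is a small Python sketch that maps per-address results onto per-family stream keys. The key layout, the monitor and target names and the function names are all hypothetical; the real NNTSC stream properties are richer than this.

```python
import ipaddress
from collections import defaultdict


def family_stream_key(source, target, address):
    """Map a tested address onto its per-family stream."""
    version = ipaddress.ip_address(address).version
    return (source, target, "ipv4" if version == 4 else "ipv6")


def group_by_family(source, target, results):
    """Group (address, rtt_ms) results so each family gets one entry."""
    grouped = defaultdict(list)
    for address, rtt in results:
        grouped[family_stream_key(source, target, address)].append(rtt)
    return dict(grouped)
```

Five tested addresses for one target (three IPv4, two IPv6) collapse into just two streams, so a single test round needs at most two inserts for that source and target.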

By the end of the week, my test NNTSC was successfully collecting and storing data using this new schema. I also had ampy fetching data for amp-icmp and amp-tcpping, with amp-traceroute most of the way towards working. The main complexity with amp-traceroute is that we should be deploying Brendon's AS path traceroute next week, so I'm changing the rainbow graph to fetch AS path data and adding a method to query the IP path data that will support the monitor map graph that was implemented last summer.

Spent a day working on libtrace following some bug reports from Mike Schiffman at Farsight Security. Fixed some tricky bugs that popped up when using BPF filters with the event API.

Deployed the update-less version of NNTSC on skeptic finally. Unfortunately this initially made the performance even worse, as we were trying to keep the last timestamp cache up to date after every message. Changed it so that NNTSC only writes to the cache once every 5 mins of realtime, which seems to have solved the problem. In fact, we are now finally starting to (slowly) catch up on the message queue on skeptic.
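
The throttling idea is simple enough to sketch in Python: remember the latest timestamps in memory and only push them to the cache once the interval has elapsed. The class below is a hypothetical stand-in (a plain dict plays the role of the cache and the clock is injectable for testing), not the NNTSC code.

```python
import time


class ThrottledTimestampCache:
    """Buffer 'last timestamp' updates and write them out periodically."""

    def __init__(self, store, interval=300, clock=time.time):
        self.store = store        # dict-like stand-in for the real cache
        self.interval = interval  # seconds of realtime between writes
        self.clock = clock
        self.pending = {}
        self.last_flush = clock()

    def update(self, stream_id, timestamp):
        """Record a new timestamp; flush only if the interval has passed."""
        self.pending[stream_id] = timestamp
        if self.clock() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        """Write all buffered timestamps to the cache in one go."""
        self.store.update(self.pending)
        self.pending.clear()
        self.last_flush = self.clock()
```

The trade-off is that the cached timestamps can be up to one interval out of date, and anything still buffered is lost if the process dies before the next flush.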




Made a few minor tidyups to the TCPPing test. The main change was to pad IPv4 SYNs with 20 bytes of TCP NOOP options to ensure IPv4 and IPv6 tests to the same target will have the same packet size. Otherwise this could get confusing for users when they choose a packet size on the graph modal and find that they can't see IPv6 (or IPv4) results.
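
The arithmetic behind the padding: an IPv4 header without options is 20 bytes and a fixed IPv6 header is 40 bytes, so 20 bytes of single-byte NOP options (kind 1) on the IPv4 SYN make up the difference. A small Python sketch, assuming no other TCP options, IP options or IPv6 extension headers are present:

```python
IPV4_HEADER = 20  # bytes, no IP options
IPV6_HEADER = 40  # bytes, no extension headers
TCP_HEADER = 20   # bytes, before any options
TCP_NOP = 0x01    # TCP "no operation" option kind, one byte long


def syn_options(family):
    """Options appended to the SYN: 20 NOP bytes for IPv4, none for IPv6."""
    if family == 4:
        return bytes([TCP_NOP]) * (IPV6_HEADER - IPV4_HEADER)
    return b""


def syn_packet_size(family):
    """Total IP packet size of the SYN for the given address family."""
    ip_header = IPV4_HEADER if family == 4 else IPV6_HEADER
    return ip_header + TCP_HEADER + len(syn_options(family))
```

Both `syn_packet_size(4)` and `syn_packet_size(6)` come out at 60 bytes, and the 20 bytes of options stay within the 40 byte limit that TCP allows for options.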

Now that we have three AMP tests that measure latency, we decided that it would be best if all of the latency tests could be viewed on the same graph, rather than there being a separate graph for each of DNS, ICMP and TCPPing. This required a fair amount of re-architecting of ampy to support views that span multiple collections -- we now have an 'amp-latency' view that can contain groups from any of the 'amp-dns', 'amp-icmp' and 'amp-tcpping' collections.

Added support for the amp-latency view to the website. The most time-consuming changes were re-designing the modal dialog for choosing which test results to add to an amp-latency graph, as now it needed to support all three latency collections (which all have quite different test options) on the same dialog. It gets quite complicated when you consider that we won't necessarily run all three tests to every target, e.g. there is no point in running a DNS test to a target that isn't a DNS server, so the dialog must ensure that all valid selections and no invalid selections are presented to the user. As a result, there's a lot of hiding and showing of modal components required based on what option the user has just changed.

Managed to get amp-latency views working on the website for the existing amp-icmp and amp-dns collections, and it should be a straightforward task to add amp-tcpping as well.




Wrote a script to update an existing NNTSC database to add the necessary tables and columns for storing AS path data. Tested it on my existing test database and will roll it out to prophet once we're collecting AS path data and are sure that our database schema covers everything we want to store.

Added a TCP ping test to AMP. This turned out to be a lot more complicated than I had first anticipated, but I'm reasonably confident that we've got something working now. The test works by sending a TCP SYN to a predefined port on the target and measures how long it takes to get a TCP response (either a SYN ACK or a RST). We can also get an ICMP response, so we need to listen for that and report a failed result in that case. The complications arise in that the operating system typically handles the TCP handshake, so we have to pull a number of tricks to be able to send and receive TCP SYN and SYN ACK packets inside our test code.

Sending a SYN is easy enough using a raw socket, although we have to make sure we bind the source port using a separate socket to prevent the OS from allowing other applications to use it, which would screw with our responses. Getting the response is a lot harder -- we have to work out which interface our SYN is going to use and attach a pcap live capture to that interface (filtering on traffic for our known source and dest ports + icmp). We find the interface by creating a UDP socket to our target and seeing which source address it binds to, then check the list of addresses returned by getifaddrs() to find a match. The match will tell us the name of the interface that the address belongs to.
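
The first half of the interface lookup (finding the source address the OS would pick) can be shown in Python. This is a sketch of the trick rather than the AMP test code, which is written in C. The port number is arbitrary, and the step that matches the address against getifaddrs() output is left out because the Python standard library has no direct equivalent.

```python
import socket


def source_address_for(target, port=53):
    """Return the local address the OS would use to reach the target.

    Connecting a UDP socket sends no packets; it only makes the kernel
    do a route lookup and pick a source address for the socket.
    """
    family = socket.AF_INET6 if ":" in target else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.connect((target, port))
        return sock.getsockname()[0]
```

For a loopback target such as `127.0.0.1` the returned source address is also on the loopback interface; for a remote target it is the address of whichever interface the route to that target uses.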

Any packets received on our pcap capture are checked to see if they match any of the SYNs that we had sent out. This is done by parsing the packet headers -- I felt dirty writing non-libtrace packet parsing code -- and looking to see if the ACK matched the sequence number of the packet we had sent (in the case of TCP) or if the embedded TCP header matched our original SYN (in the case of ICMP).
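
The matching rules can be sketched in Python with the struct module. This is an illustration under simplifying assumptions: the functions are handed the start of the TCP header directly (the real test also has to step over the link layer and IP headers), and the `sent` dictionary layout is made up for the example.

```python
import struct

TCP_ACK = 0x10


def parse_tcp_header(segment):
    """Extract ports, sequence numbers and flags from a raw TCP header."""
    sport, dport, seq, ack, offset_flags = struct.unpack("!HHIIH", segment[:14])
    return {"sport": sport, "dport": dport, "seq": seq, "ack": ack,
            "flags": offset_flags & 0x01FF}


def matches_syn(response, sent):
    """True if a TCP response (SYN ACK or RST) acknowledges our SYN."""
    hdr = parse_tcp_header(response)
    if (hdr["sport"], hdr["dport"]) != (sent["dport"], sent["sport"]):
        return False
    if not hdr["flags"] & TCP_ACK:
        return False
    # A SYN consumes one sequence number, so the ACK should be seq + 1
    # (modulo 2^32, as sequence numbers wrap).
    return hdr["ack"] == (sent["seq"] + 1) & 0xFFFFFFFF


def matches_icmp_embedded(embedded, sent):
    """True if the TCP header embedded in an ICMP error is our SYN.

    ICMP errors are only guaranteed to carry the first 8 bytes of the
    original TCP header: source port, dest port and sequence number.
    """
    sport, dport, seq = struct.unpack("!HHI", embedded[:8])
    return (sport, dport, seq) == (sent["sport"], sent["dport"], sent["seq"])
```

Each SYN that was sent would be kept in a list of dictionaries like `{"sport": 40000, "dport": 80, "seq": 1000}` along with its send time, and every captured packet checked against that list.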

The test still has a few annoying limitations due to the nature of firewalls on the Internet these days. I had originally intended to allow the test to vary the packet size by adding payload to the SYN, which is technically legal TCP behaviour, but in testing I found that SYNs with extra payload will often be dropped and we'll get no response. Transparent proxies on the monitor side are also problematic, in that they will pre-emptively respond to SYNs on port 80 and therefore mess with our latency measurement. The Fortigate here at Waikato does this, which initially made me think I had a bug in my timestamping since I was getting sub-1 ms results for targets I knew were hundreds of ms away.

Deployed the TCP ping test on our Centos VM successfully and was able to collect some test data in a NNTSC database. Also updated netevmon to be able to process TCP ping latency measurements.




Spent most of the week tidying up the NNTSC codebase. Replaced the dataparsers with proper OO classes, which removed a lot of repeated code and should make it easier to both write and maintain dataparsers. Also replaced most of the error reporting with exceptions rather than returning error codes everywhere.

Added support for storing AS paths alongside IP paths for amp-traceroute in NNTSC. The new improved traceroute test will eventually be able to report the AS for each detected hop, which is often more interesting and useful than the specific IP address. For instance, event detection can look for changes in the AS path rather than alerting when a new IP address is observed -- which can happen a lot when your path takes you through Google or Amazon EC2.

We can also colour our rainbow graph based on the AS rather than using a new colour for each address, which should hopefully reduce the patchwork-quilt effect while still being useful to look at.




Carrying on from last week, storing a cache entry per stream turned out to be a bad idea. Some matrix meshes consist of 100s of streams so we spend a lot of time looking up cache entries. As a result, I rewrote the caching code to store one dictionary per collection, mapping stream ids to tuples containing the timestamps. This gets looked up once per query, so only one cache operation is required to generate a matrix.

Updating the cache when we have to query for missing values is a bit annoying, as we cannot simply update the dictionary and put it back in the cache once the query is complete as the data inserting process may have updated other cache entries with new 'most recent data' timestamps while we were fulfilling our query. Instead, we have to re-fetch the dictionary, update the one stream we're changing and then immediately store the dictionary again.
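
The update pattern looks roughly like this in Python. `DictCache` is a made-up stand-in for memcache that hands out copies, the way a networked cache would, and the function names are hypothetical. Note that re-fetching only narrows the window in which an update from the inserting process can be overwritten; closing it completely would need something like a compare-and-set operation.

```python
class DictCache:
    """In-memory stand-in for memcache; get and set work on copies."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        value = self._data.get(key)
        return dict(value) if value is not None else None

    def set(self, key, value):
        self._data[key] = dict(value)


def lookup_collection(cache, collection):
    """One cache operation fetches the timestamps for every stream."""
    return cache.get(collection) or {}


def store_queried_timestamps(cache, collection, stream_id, first, last):
    """Re-fetch the dictionary, change one stream, store it immediately."""
    latest = cache.get(collection) or {}
    latest[stream_id] = (first, last)
    cache.set(collection, latest)
```

Writing back the dictionary that was fetched at the start of the query, instead of re-fetching it, would silently discard any timestamps the inserting process had stored in the meantime.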

Updated ampy to no longer keep track of active streams and removed support for ACTIVE_STREAMS queries from the NNTSC protocol.

Merged Perry's lzma support into libwandio. Started working towards a new libtrace release -- managed to build and pass all tests on our various development boxes so should be able to push out a release next week.

Spent a day reading over Meenakshee's thesis. Suggested a series of mostly minor edits and changes but overall it is looking pretty good.




Found and fixed a large memory leak in netevmon that had caused prophet to run out of memory over the weekend. The problem was that I was allocating space for storing IPv6 address strings in the Traceroute detector but not freeing it properly if the address was already in our LRU. Also took the opportunity to make our memory use for Traceroute much more efficient, i.e. having a global hop LRU across all traceroute streams rather than one per stream, which was leading to a lot of duplication.

Started looking into our insertion speed problems. One obvious source of slowdowns is the UPDATE that we use to remember when we last inserted data for a stream. This update is being called once per measurement interval for each collection and becomes quite onerous when the streams table gets very large. Implemented a solution where the timestamps of the first and last insertion for each stream are stored in memcache instead of the database. If there is no entry in memcache when a query comes in for the stream, we can query the data table for that stream for the min and max timestamp instead, although this is a slightly expensive operation.
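
A sketch of the lookup-with-fallback logic in Python, using a plain dict in place of memcache and an in-memory sqlite3 database in place of the real database. The table and column names (`data`, `stream_id`, `timestamp`) and the function names are assumptions for the example only.

```python
import sqlite3


def stream_time_range(cache, db, stream_id):
    """Return (first, last) timestamps, hitting the database only on a miss."""
    if stream_id in cache:
        return cache[stream_id]
    # Cache miss: fall back to the (more expensive) min/max query.
    row = db.execute(
        "SELECT MIN(timestamp), MAX(timestamp) FROM data WHERE stream_id = ?",
        (stream_id,),
    ).fetchone()
    if row[0] is not None:  # don't cache the result for an empty stream
        cache[stream_id] = row
    return row


def record_insert(cache, stream_id, timestamp):
    """Takes the place of the per-insert UPDATE: only the cache is touched."""
    entry = cache.get(stream_id)
    if entry is not None:
        cache[stream_id] = (entry[0], max(entry[1], timestamp))
```

Streams with no cache entry are left alone by `record_insert`; the next query for that stream repopulates the entry from the data table.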

Once I had that working, I removed the 'streams' table from NNTSC entirely as it was no longer needed (each collection has its own stream table with specific details about each stream; the streams table was mainly for storing common properties across all collections like lasttimestamp). This meant I had to remove or change all references in the NNTSC database code to the streams table but was otherwise straightforward.

Spent Friday fixing a bug in libtrace where trace_get_source_port and trace_get_destination_port would return bogus values if called on fragmented packets. Added a new API function for getting the fragment offset and more fragment flag from a packet. I needed this anyway for fixing the bug and given the amount of bit-shifting, masking, multiplying and header parsing (for v6) involved, it would probably be useful to other people as well.
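
For IPv4, the bit-twiddling involved looks roughly like this. The sketch is in Python rather than libtrace's C, covers only the IPv4 case (IPv6 needs a walk through the extension headers to find the Fragment header), and the function names are made up.

```python
import struct

IP_MF = 0x2000       # "more fragments" flag
IP_OFFMASK = 0x1FFF  # low 13 bits hold the offset in 8-byte units


def ipv4_fragment_info(ip_header):
    """Return (offset_in_bytes, more_fragments) from a raw IPv4 header."""
    (flags_and_offset,) = struct.unpack("!H", ip_header[6:8])
    offset = (flags_and_offset & IP_OFFMASK) * 8
    more = bool(flags_and_offset & IP_MF)
    return offset, more


def has_transport_header(ip_header):
    """Only the first fragment (offset 0) carries the TCP/UDP header."""
    offset, _more = ipv4_fragment_info(ip_header)
    return offset == 0
```

Any fragment with a non-zero offset starts partway through the payload, so reading port numbers from where the transport header would normally be just returns whatever payload bytes happen to be there.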




Short week last week, after being sick on Monday and recovering from a spot of minor surgery on Thursday and Friday.

Finished adding support for the amp-throughput test to amp-web, so we can now browse and view graphs for amp-throughput data. Once again, some fiddling with the modal code was required to ensure the modal dialog for the new collection supported all the required selection options.

Lines for basic time series graphs (i.e. non-smokeping style graphs) will now highlight and show tooltips if moused over on the detail graph, just like the smoke graphs do. This was an annoying inconsistency that had been ignored because all of the existing amp collections before now used the smoke line style. Also fixed the hit detection code for the highlighting so that it would work if a vertical line segment was moused over -- previously we were only matching on the horizontal segments.




Reworked how aggregation binsizes are calculated for the graphs. There is now a fixed set of aggregation levels that can be chosen, based on the time period being shown on the graph. This means that we should hit cached data a lot more often rather than choosing a new binsize every few zoom levels. Increased the minimum binsize to 300 seconds for all non-amp graphs and 60 seconds for amp graphs. This will help avoid problems where the binsize was smaller than the measurement frequency, resulting in empty bins that we had to recognise were not gaps in the data.
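
In Python, the selection logic might look something like the sketch below. The 60 and 300 second minimums come from the paragraph above, but the table of levels and the function name are invented for illustration and do not reflect the values actually used.

```python
DAY = 86400

# Hypothetical levels: (longest time period shown in seconds, binsize in seconds)
AGGREGATION_LEVELS = [
    (1 * DAY, 60),
    (7 * DAY, 300),
    (31 * DAY, 1800),
    (365 * DAY, 14400),
]
LARGEST_BINSIZE = 86400


def choose_binsize(start, end, is_amp):
    """Pick a binsize from the fixed set based on the period being shown."""
    minimum = 60 if is_amp else 300
    duration = end - start
    for longest, binsize in AGGREGATION_LEVELS:
        if duration <= longest:
            return max(binsize, minimum)
    return LARGEST_BINSIZE
```

Because nearby zoom levels map onto the same binsize, two slightly different views of the same data request identically binned results, which is what makes cache hits more likely.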

Added new matrices for DNS data, one showing relative latency and the other showing absolute latency. These act much like the existing latency matrices, except we have to be a lot smarter about which streams we use for colouring the matrix cell. If there are any non-recursive tests, we will use the streams for those tests as these are presumably cases where we are querying an authoritative server. Otherwise, we assume we are testing a public DNS server and use the results from querying for '', as this is a name that is most likely to be cached. This will require us to always schedule a '' test for any non-authoritative servers that we test, but that's probably not a bad idea anyway.

Wrote a script to more easily update the amp-meta databases to add new targets and update mesh memberships. Used this script to completely replace the meshes on prophet to better reflect the test schedules that we are running on hosts that report to prophet.

Merged the new ampy/amp-web into the develop branch, so hopefully Brad and I will be able to push out these changes to the main website soon.

Started working on adding support for the throughput test to ampy. Hopefully all the changes I have made over the past few weeks will make this a lot easier.