The aim of this project is to develop a system whereby network measurements from a variety of sources can be used to detect and report on events occurring on the network in a timely and useful fashion. The project can be broken down into four major components:

Measurement: The development and evaluation of software to collect the network measurements. Some software used will be pre-existing, e.g. SmokePing, but most of the collection will use our own software, such as AMP, libprotoident and maji. This component is mostly complete.

Collection: The collection, storage and conversion to a standardised format of the network measurements. Measurements will come from multiple locations within or around the network, so we will need a system for receiving measurements from monitor hosts. Raw measurement values will need to be stored in a way that allows querying, particularly for later presentation. Finally, each measurement technology is likely to use a different output format, so measurements will need to be converted to a standard format that is suitable for the next component.

Eventing: Analysis of the measurements to determine whether network events have occurred. Because we are using multiple measurement sources, this component will need to aggregate events that are detected by multiple sources into a single event. This component also covers alerting, i.e. deciding how serious an event is and alerting network operators appropriately.

Presentation: Allowing network operators to inspect the measurements being reported for their network and see the context of the events that they are being alerted on. The general plan here is for web-based zoomable graphs with a flexible querying system.




Updated the event tooltips to better describe the group that the event belongs to, as it was previously difficult to tell which line the event corresponded to when multiple lines were drawn on the graph.

Brad's rainbow graph is now used whenever an AMP traceroute event is clicked on in the dashboard. Fixed a couple of bugs with the rainbow graph: the main one being that it was rendering the heavily aggregated summary data in the detail graph instead of the detailed data.

Replaced the old hop count event detection for traceroute data with a detector that reports when a hop in the path has changed.

Fixed a tricky little bug in NNTSC where large aggregate data queries were being broken up into time periods that did not align with the requested binsize, so a bin would straddle two queries. This would produce two results for the same bin and was causing the summary graph to stop several hours short of the right hand edge.
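
The fix amounts to making every sub-query boundary fall on a multiple of the requested binsize. A minimal sketch of the idea (the helper name and chunk length are hypothetical, not the actual NNTSC code, and it assumes bins are measured from the requested start time):

    # Break a long query range into chunks whose length is a whole number of
    # bins, so no bin can ever straddle two sub-queries.
    QUERY_CHUNK = 7 * 24 * 60 * 60   # illustrative target chunk length (secs)

    def aligned_query_ranges(start, end, binsize, chunk=QUERY_CHUNK):
        # Round the chunk length down to a whole number of bins (at least one).
        chunk = max(binsize, (chunk // binsize) * binsize)
        ranges = []
        current = start
        while current < end:
            ranges.append((current, min(current + chunk, end)))
            current += chunk
        return ranges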

Started working on making the tabs that allow access to "similar" graphs operational again. Have got this working for LPI, which is the most complicated case, so it shouldn't be too hard to get tabs going for everything else again before the end of the year.




Spent most of the week adding view support to all of the existing collections within ampy. Much of the work involved generalising the code, which Brendon had originally written as an AMP-specific proof of concept.

Added a new API to amp-web called eventview that will generate a suitable view for a given event, e.g. an AMP ICMP event will produce a view showing a single line for the address family where the event was detected.

Updated the legend generation code for views to work for all collections as well. Added a short label for each line so it will be possible to display a pop-up that distinguishes between the different colours within the same line group.




Finished the re-implementation of anomalyfeed to support grouping of streams into a single time series. Now our AMP ICMP tests are considered as one time series despite being spread across multiple addresses (and therefore multiple streams).

Brendon changed the way that we store AMP traceroute test results to improve the query performance, so this required a further update to anomalyfeed to be able to parse the new row format.

Updated NNTSC to always use labels rather than stream ids when querying the database. Eventually, all incoming queries will use labels, but ampy still uses stream ids for many collections so we have to support both methods for now. Any queries that use stream ids are converted to labels by the NNTSC client API.

Updated Brendon's view / stream group management code in ampy to not be so AMP-specific. The collection-specific code has now moved into the parser code for each collection so it should be much easier to implement views for the remaining collections now.




Spent the first part of the week fixing various bugs and less-than-ideal behaviours in netevmon and NNTSC. Some examples include:
* Prevented an event from being triggered when an amp-traceroute stream reactivates after a long idle time.
* Fixed a crash bug in anomalyfeed due to an incorrect field name being used.
* Fixed a problem in NNTSC where the HTTP dataparser would fall over if a path contained a ' character.
* Added a rounding threshold to the Mode detector so that it can be used with AMP ICMP streams, as these measure in usec rather than msec. Now we can round to the nearest msec (see the sketch below).
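
As a rough illustration of the rounding threshold (the helper and threshold value here are hypothetical, not the netevmon implementation), the idea is simply to snap each measurement to the nearest millisecond before it reaches the Mode detector:

    def round_measurement(value_usec, threshold_usec=1000):
        # Snap a latency measurement (in usec) to the nearest multiple of the
        # threshold, so microsecond-level jitter collapses into a single mode.
        return int(round(value_usec / float(threshold_usec))) * threshold_usec

    # e.g. 20437 and 20296 usec both become 20000, rather than being treated
    # as two distinct values by the Mode detector.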

Brendon finally merged his view changes back into the development branches of our software. This caused a number of problems with netevmon, as netevmon had been overlooked when the changes were originally tested. Managed to patch up all the problems in a rather hurried session on Tuesday afternoon and got everything back up and running.

Restarted netevmon with the TEntropy detectors running. They seem to be performing very well so far and are a useful addition.

Started working on adding the ability to group streams into a single time series within anomalyfeed. The main reason for this is to be able to cope better with the variety of addresses that AMP ICMP typically tests to. It makes more sense to consider all of these streams as a single aggregated stream rather than trying to run the event detectors against each stream individually, especially considering many addresses are only tested to intermittently. Grouping them will ensure there should be a result at every measurement interval. So far I've got this working for AMP ICMP, AMP traceroute and AMP DNS and will need to reimplement the other collections using the new system.
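
As a rough sketch of the grouping idea (the data layout and names here are assumptions, not the anomalyfeed code), measurements from every stream in a group are binned by interval and reduced to a single value before being handed to the detectors:

    from collections import defaultdict

    def aggregate_group(measurements, binsize=60):
        # measurements: iterable of (timestamp, stream_id, latency) tuples
        # drawn from all streams in the group.
        bins = defaultdict(list)
        for ts, stream_id, latency in measurements:
            if latency is not None:
                bins[ts - (ts % binsize)].append(latency)
        # One aggregated value per interval, even if only some of the
        # addresses were actually tested during that interval.
        return sorted((ts, sum(vals) / len(vals)) for ts, vals in bins.items())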

Spent a fair chunk of time reading up on belief theory and Dempster-Shafer so that I could give Meena some pointers on what she will need to be able to apply them to our event data. Managed to come up with some rough ideas that seem to work, but not sure if the theory is being applied 100% correctly.




Spent some time tweaking the new TEntropy-based detectors in netevmon to reduce the number of false positives and insignificant events that they were reporting. Mostly this involved tuning the various thresholds used by the Plateau detector that is run over the TEntropy values rather than the TEntropy methodology itself.

As I was doing this, I started putting together a gigantic spreadsheet of the events observed, their significance, which detectors were picking them up, and the delay between the event starting and the detector reporting it. This is useful for two main reasons:
* As I adjust and tweak the existing detectors I can easily compare the events I used to detect with what I am detecting now (and what I think I should be getting).
* We will need to calculate the probability that a given detector is right for the next major phase of Meena's project. This spreadsheet will form the basis for estimating these probabilities.

Added support to NNTSC for collecting and storing AMP HTTP test results. Seems to work reasonably well (after fixing a bug or two in the test itself!) but it'll be interesting to see how query performance pans out once the table starts to get large, given our travails with the traceroute data.




Managed to write libprotoident rules for a couple of new applications, WeChat and Funshion. Released a new version of libprotoident (2.0.7).

Added support for the AMP DNS test to NNTSC, netevmon and amp-web. Wrote a new detector that looks for changes in response codes, e.g. the DNS response going from NOERROR to REFUSED or some other error state. This should also be useful for the HTTP test in the future.

Fixed a bug in the ChangepointDetector where it wasn't dealing well with streams that featured large values (i.e. >100,000). Also spent a bit more time tweaking the Plateau detector, mainly dealing with problems that show up when either the mean or the standard deviation are very small.




Short week due to remaining in Australia for a holiday after LCN.

Upon my return, I spent a bit of time trying to capture traffic for WhatsApp and other mobile messaging services. I had earlier found some flows that were possibly WhatsApp in traffic captured before going away and wanted to confirm this.

It turned out to be a bit trickier to capture this traffic than originally anticipated. WhatsApp requires a mobile phone number to register an account, so we needed to acquire a couple of new 2degrees SIM cards and receive the confirmation text messages on them. Also, the Android VM that we had created for this purpose wouldn't install WhatsApp because the image was intended for a tablet rather than a phone, so we had to use BlueStacks instead.

I also captured traffic for Kik, another similar application, and found that we were erroneously classifying Kik traffic as Apple Push notifications as they both use SSL on port 5223. Fortunately, some very subtle differences in the SSL handshake allowed me to write a rule that could reliably identify Kik traffic. Also tried to capture GroupMe traffic but could not reliably receive the text message required to register an account.

Spent most of Friday going over events reported by the Plateau detector in netevmon and made a number of tweaks that should make it both quicker to pick up on obvious changes in latency time series and more reliable than before.




Spent most of the week preparing for my Sydney trip. Wrote the talk I will be presenting this coming Thursday and gave a practice rendition on Friday.

The rest of my time was spent fixing minor issues in Cuz -- trying not to break anything major before I go away for a week. Replaced the bad SQLAlchemy code in the ampy netevmon engine with some psycopg2 code, which should make us slightly more secure. Also tweaked some of the event display stuff on the dashboard so that useful information is displayed in a sensible format, i.e. fewer '|' characters all over the place.

Had a useful meeting with Lightwire on Wednesday. Was pleased to hear that their general impression of our software is good; we will start working towards making it more useful to them over the summer.




The new psycopg2-based query system was generally working well but using significant amounts of memory. This turned out to be due to the default cursor being client-side, which meant that the entire result was being sent to the querier at once and stored in memory. I changed the large data queries to use a server-side cursor which immediately solved the memory problem. Instead, results are now shipped to the client in small chunks as needed -- since the NNTSC database and exporter process are typically located on the same host, this is not likely to be problematic.
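
In psycopg2, a cursor becomes server-side simply by giving it a name, so the change amounts to something like the following (the connection string, query, and the stream_id/start/end/process names are placeholders for illustration):

    import psycopg2

    conn = psycopg2.connect("dbname=nntsc")
    cur = conn.cursor(name="large_data_query")   # named cursor => server-side
    cur.itersize = 10000                         # rows fetched per round trip
    cur.execute("SELECT timestamp, value FROM data "
                "WHERE stream_id = %s AND timestamp BETWEEN %s AND %s",
                (stream_id, start, end))
    for row in cur:        # results arrive in chunks of itersize
        process(row)       # placeholder per-row handler
    cur.close()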

Netevmon now tries to use the measurement frequency reported by NNTSC for the historical data wherever possible, rather than trying to guesstimate the frequency based on the time difference between the first two measurements. The previous approach was failing badly with our new one-stream-per-tested-address approach for AMP, as individual addresses were often tested intermittently. If there is no historical data, then a new algorithm is used that simply finds the smallest difference in the first N measurements and uses that.
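
The fallback algorithm is roughly the following (a sketch with hypothetical names; N and the default frequency are illustrative):

    def estimate_frequency(timestamps, n=10, default=60):
        # Use the smallest gap between the first n timestamps, rather than
        # just the gap between the first two measurements.
        recent = sorted(timestamps[:n])
        gaps = [b - a for a, b in zip(recent, recent[1:]) if b > a]
        return min(gaps) if gaps else default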

Changed the table structure for storing AMP traceroute data. The previous method was causing too many problems and required too much special treatment to query efficiently. In the end, we decided to bite the bullet and re-design the whole thing, at the cost of all of the traceroute data we had collected over the past few months (actually, it is still there but would be painful to convert over to the new format).

Had a long but fruitful meeting with Brendon and Brad where we worked out a 'view' system for describing what streams should be displayed on a graph. Users will be able to create and customise their own views and share them easily with other users. Stream selections will be described using expressions rather than explicitly listing stream ids as it is now (although listing specific streams will still be possible).

This will allow us to create a graph showing a single line aggregating all streams that match an expression such as "collection=amp-icmp AND ... AND family=ipv4". Our view could also include a second line for IPv6. By using expressions, we can have the view automatically update to include new streams that match the criteria after the view was created, e.g. new Google addresses.




Finished migrating our database query code in NNTSC from SQLAlchemy to psycopg2.

Released libwandevent 3.0 and updated netevmon to use it instead of the deprecated libwandevent 2 API.

Continued to be stymied by performance bottlenecks when querying large amounts of historical data from NNTSC using netevmon. The problems all stem from attempts to export live data at the same time breaking down, which eventually caused the data collection to block while waiting to write live data to the exporter. Because the data collection was blocked, no new data was being collected or written to the database.

The first new problem I found was (surprisingly) caused by our trigger function that writes new data into the right partition. Because there is no "CREATE IF NOT EXISTS" for triggers in postgres, we were dropping the trigger and then re-creating it whenever we switched to a new partition. However, you can't drop a trigger from a table without having an exclusive lock on the table. If the table is under heavy query load (e.g. from netevmon) then the DROP TRIGGER command will block until the querying ends. The solution was reasonably straightforward -- check the metadata tables for the existence of the trigger and only create it if it doesn't exist.
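
The existence check itself can be done against the pg_trigger catalogue; a minimal sketch of the approach (table, trigger and function names are illustrative, not the NNTSC code):

    def ensure_insert_trigger(cursor, table, trigger, function):
        # Only create the trigger if it is not already present, so we never
        # have to DROP it (which would need an exclusive lock on the table).
        cursor.execute(
            "SELECT 1 FROM pg_trigger "
            "WHERE tgname = %s AND tgrelid = %s::regclass",
            (trigger, table))
        if cursor.fetchone() is None:
            # Identifiers cannot be parameterised, hence the string formatting.
            cursor.execute(
                "CREATE TRIGGER %s BEFORE INSERT ON %s "
                "FOR EACH ROW EXECUTE PROCEDURE %s()" % (trigger, table, function))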

The other problem was that our select queries were happening in the same thread as the reading of live data from the exporter. Despite last week's improvements, the queries can still take a little while and live data was building up while the query was taking place. Furthermore, we were only reading one message from the live data queue before returning to querying so we would never catch up once we fell behind. To fix this, I've implemented a worker thread pool for performing the select queries instead so we can export live data while a query is ongoing.
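
Structurally, the change looks something like the sketch below (pool size and names are assumptions): query requests go onto a queue that a handful of worker threads drain, while the main thread keeps reading live data.

    import threading
    import queue

    query_jobs = queue.Queue()

    def query_worker():
        while True:
            job = query_jobs.get()
            if job is None:                # shutdown sentinel
                break
            run_select_query(job)          # placeholder for the actual query
            query_jobs.task_done()

    workers = [threading.Thread(target=query_worker) for _ in range(4)]
    for w in workers:
        w.start()

    # The live-data loop now only has to do query_jobs.put(request) and can
    # immediately go back to reading from the live data queue.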




Attempting to run netevmon against a decent quantity of historical data has been causing significant performance problems and even preventing NNTSC from processing and storing new measurements. After a bit of hackish profiling, I realised that the biggest problem was the time taken to query for traceroute data. Unlike most of the other existing data tables, the traceroute data is spread across three tables which are joined to create a view that we query from.

Unfortunately, the join was not smart enough to recognise that the traceroute test ids it was looking for all fell within a certain set of table partitions. Instead, it would sequentially scan millions of rows across all of the test tables. After a lot of messing around with the SQL used to create the view, I found that the best approach was to instead use a procedure that figured out the test ids that fell within the time period being queried for and returned a table constructed using constraints on the test ids as well as timestamp and stream ids.

This managed to get the query time for several weeks' worth of data down from 12 seconds to 2 seconds. The next problem was using the procedure within SQLAlchemy in place of a "data table", as SQLAlchemy treats the returned table as a Result object rather than a Table object. This meant that there weren't any Column objects available for us to operate on, e.g. to apply aggregation functions for generating graph data.

At this point, it became apparent that SQLAlchemy was more of a hindrance than a benefit and I decided we would be better off replacing it with the much simpler and more intuitive psycopg2, at least for the database querying side of NNTSC. Spent the remainder of my week writing and testing the new query code.




Changed the stream definition for both the AMP ICMP and AMP traceroute collections in NNTSC to include the address that was tested to. This means that we can more easily analyse the behaviour of specific paths and show each one as a separate line on our graphs.

Also added support for multiple streams into ampy and amp-web. Previously, a graph URL would contain a single stream id which described the stream to be shown on the graph -- now the URL contains a series of stream ids separated by hyphens (although we only plot the first right now). Various ampy functions now return a list of streams rather than just one. Streams within the amp-web javascript are represented as objects rather than just the id number -- this allows us to store additional information with the stream such as the colour to use when plotting the stream and whether the stream should be plotted or not.

Added an LRU-based detector to netevmon, mainly for use with the traceroute data. The detector maintains an LRU of values that it has seen recently (e.g. hop counts) and creates an event anytime it has to add a new value to the LRU. This will also be used to check for changes in the full path returned by the traceroute test.
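
The core of such a detector is very small; a sketch of the idea (the size and names are illustrative, not the netevmon implementation):

    from collections import OrderedDict

    class LRUDetector(object):
        def __init__(self, maxsize=5):
            self.maxsize = maxsize
            self.seen = OrderedDict()

        def process(self, value):
            # Return True (i.e. raise an event) whenever a value arrives
            # that is not already in the LRU of recently seen values.
            if value in self.seen:
                self.seen.move_to_end(value)    # refresh recency, no event
                return False
            self.seen[value] = True
            if len(self.seen) > self.maxsize:
                self.seen.popitem(last=False)   # evict least recently used
            return True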




Updated ampy to cache stream information as well as data measurements. I had noticed that multiple requests for the same stream information were being generated when loading a graph, which seemed a little wasteful. Now we cache the details of what streams are available for each collection and the description of each stream (source, dest, metric etc.). The one downside is that newly-added streams won't be obvious until the cached stream list for the collection has expired.

Added support in NNTSC for table partitioning of traceroute data. This was much more complicated than anticipated for several reasons:
* the trigger function that inserts the data must return NULL to avoid a duplicate insertion into the parent table as well as the partitioned table.
* our traceroute test table had a "test id" column that was defined as a primary key based on an auto-incremented sequence, which meant SQLAlchemy would try to return the newly inserted row by default.
* we needed the value of the test id for subsequent inserts into other tables relating to the traceroute test.
* SQLAlchemy had no error handling for the case where an insert operation that was meant to return a row returned null, resulting in a crash with little to no useful error message.

Once I'd figured all this out, I implemented a (somewhat hackish) solution: disable the implicit return, so we could keep our trigger function returning NULL without crashing SQLAlchemy. Then, following our insert operation, immediately perform a SELECT to find the row we just inserted and grab the test id from that.

There was also the problem of the traceroute path table which I also wanted to partition but did not have a timestamp column. The partitioning code I had written was only designed to partition based on timestamp, so I had to re-engineer that to support any numeric column (although it defaults to using timestamp).

Finally, I had to then go and manually move all of the existing traceroute data into suitable partitions.

I also spent some time fixing up the Constant to Noisy algorithm in netevmon. Mostly this just involved refining some of the thresholds for the change detection, but I also now avoid moving from Constant to Noisy unless the most recent N measurements have all demonstrated a reasonable amount of noise, i.e. the differences between consecutive measurements are significant relative to the mean.
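
The extra condition amounts to something like the sketch below (N and the ratio are illustrative thresholds, not the values used in netevmon):

    def looks_noisy(recent, n=5, ratio=0.2):
        # Only report Constant -> Noisy if the last n consecutive differences
        # are all significant relative to the mean of the recent measurements.
        if len(recent) < n + 1:
            return False
        window = recent[-(n + 1):]
        mean = sum(window) / float(len(window))
        if mean == 0:
            return False
        diffs = [abs(b - a) for a, b in zip(window, window[1:])]
        return all(d >= ratio * mean for d in diffs)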

One last thing: added timer events to the python version of libwandevent. Used this to ensure that anomalyfeed would request historical information at a sensible rate when first starting up, rather than asking for it all at once and completely hosing NNTSC with data requests.




Spent most of the week on leave, so not much got done this week.

In the time I was here, I fixed a number of bugs with the auto-scaling summary graph that occurred when there was no data to plot in the detail view.

I implemented yet another new algorithm for trying to determine if a time series is constant or noisy, as the previous one was pretty awful at recognising that the time series had moved from constant to noisy. The new one is better at that, but still appears to have problems for some of our streams -- it now tends to flick between constant and noisy a little too frequently -- so it will be back to the drawing board somewhat on that one.




Tidied up a lot of the javascript within amp-web. Moved all of the external scripts (i.e. stuff not developed by us) into a separate lib directory and ensured that everything used consistent and specific terminology.

Added config options to amp-web for specifying the location of the netevmon and amp meta-data databases. Previously we had assumed these were on the local machine, which proved troublesome when Brad tried to get Cuz running on warlock.

Capped the maximum range of the summary graph to prevent users from zooming out into empty space.

Fixed some byte-ordering bugs in libpacketdump's RadioTap and 802.11 header parsing on big endian architectures.




Added a smarter method of generating tick labels on the X axis to amp-web. Previously, if you were zoomed in far enough, the labels simply showed a time with no indication of what day you were looking at. Now, we show the date as well as the time.

Reworked zoom behaviour for the summary graph. The zoom level is now determined dynamically based on the selected range, e.g. selecting more than 75% of the current summary range will cause it to zoom out to the next level. Selecting a small area will cause it to zoom back in.

To support arbitrary changes to the summary graph range without having to re-fetch and re-draw both graphs, I decided to rewrite our graph management scripts to operate on an instance of a class rather than just being a function that gets called whenever we want to render the graphs. The class has methods that update just the summary graph or just the detail graph, so we only end up changing the graph that we need to. Also, the class can be subclassed to support different graph styles easily, e.g. our Smokeping style. While I was rewriting, I used jQuery.when to make all of the AJAX requests for graph data simultaneously rather than sequentially as we were previously.

Unfortunately, this was a pretty painful re-write as Javascript scoping behaviour was a constant thorn in my side. Turned out that there was a reason we did everything inside of one big function, as I frequently found that I could no longer access my parent object inside of callback functions that I had defined within the new class. Often the method that was used to setup the callback did not support passing in arbitrary parameters either, so ensuring I had all the information I needed inside my callback functions took a lot longer than anticipated.




Implemented a new data caching scheme within ampy to try and limit the number of queries that are made to the NNTSC database. Previously, data was cached based on the start and end time given in the original query, which meant that we would only get a cache hit if the exact same query was made. Instead, caching is now done based on time "blocks", where each block includes 12 individual datapoints, so we can more easily re-use the results from old queries that overlap with the current one.
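
The key change is that cache keys are now derived from block-aligned timestamps rather than the raw query range; a sketch of the idea (names are hypothetical, not the ampy code):

    def blocks_for_query(start, end, binsize, points_per_block=12):
        # Each block covers 12 datapoints, aligned to a fixed grid, so two
        # overlapping queries will map onto (mostly) the same set of blocks.
        blocksize = binsize * points_per_block
        first = start - (start % blocksize)
        return [first + i * blocksize
                for i in range(((end - first) // blocksize) + 1)]

    # Each block start then forms part of a cache key, e.g.
    # (collection, stream_id, binsize, block_start); only blocks missing from
    # the cache need to be fetched from NNTSC.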

Re-worked the JitterVariance detector in netevmon, as it had been producing some unimpressive results of late. Instead of looking at the standard deviation of the individual measurements, I now look at the standard deviation as a percentage of the mean latency. Also started running a Plateau detector against these values, which has been surprisingly effective at picking up on increases in "smoke" quickly.
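
The new metric is essentially the standard deviation expressed as a percentage of the mean; a minimal sketch (names are illustrative):

    import math

    def jitter_percentage(latencies):
        # Standard deviation as a percentage of the mean latency, so "smoke"
        # is judged relative to the baseline rather than in absolute terms.
        if not latencies:
            return 0.0
        mean = sum(latencies) / float(len(latencies))
        if mean == 0:
            return 0.0
        variance = sum((x - mean) ** 2 for x in latencies) / len(latencies)
        return (math.sqrt(variance) / mean) * 100.0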

Fixed the issue in amp-web where the y-axis on the detail graph was autoscaling to the largest value in the summary graph. Also tweaked some of the behaviour of the selection area in the summary graph: single-clicking is now a null operation (i.e. it won't reset the detail graph to show the full summary graph) and you can now click and drag on the shaded area to move the selection (previously, you could only use the tiny handle for this).

Tidied up the _get_data function in the core of ampy, as this was getting messy and disorganised. ampy parsers must now implement a request_data function which will form and make the request to NNTSC for data -- however, the clunky get_aggregate_columns, get_group_columns and get_aggregate_functions functions have all gone away.




Spent another couple of days moving code around in amp-web to make it tidier and easier to work with. Hopefully, Brendon will still be able to find things inside the codebase...

Added support for the amp-traceroute collection to amp-web. The graph is just a placeholder at the moment (a line graph of hop counts) until we get around to implementing the more useful stacked hop count graph using envision.

Re-enabled the tabs on the right-hand side of the graphs that allowed switching between related graphs, albeit without the preview graphs that used to be on them. The original tabs were very AMP-specific and hard-coded to appear on every graph. Now, the tabs are generated dynamically by an AJAX request that asks ampy for a list of "related" streams to the one currently being displayed. For example, an LPI byte count stream would have tabs showing flow, packet and user counts for the same source and application protocol whereas AMP streams will have tabs showing latency and traceroute for the same source-destination pair.

To avoid page reloads when using the tabs to switch between collections, I changed the dropdowns to be generated dynamically via an AJAX request rather than being placed and populated by the Python code that runs when the page is loaded.




Added support for the AMP ICMP collection to ampy and amp-web, so we are now able to plot graphs of the test data Brendon has been collecting.

Spent a decent chunk of an afternoon working through the DPDK build system with Richard S., trying to make the DPDK libraries build as position-independent code so that we can link libtrace against them nicely.

Reworked a large amount of code in amp-web to move the collection-specific code out of the core source files and into separate little modules for each collection. This means that the core code should be much easier to follow and work on. Adding support for new collections should also be simpler and require less inside knowledge of how the whole system works.




Table partitioning is now up and running inside of NNTSC. Migrated all the existing data over to partitioned tables.

Enabled per-user tracking in the LPI collector and updated Cuz to deal with multiple users sensibly. Changed the LPI collector to not export counters that have a value of zero -- the client now detects which protocols were missing counters and inserts zeroes accordingly. Also changed NNTSC to only create LPI streams when the time series has a non-zero value occur, which avoids the problem of creating hundreds of streams per user which are entirely zero because the user never uses that protocol.
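
The zero-filling on the client side is straightforward; a sketch (names are hypothetical, not the actual client code):

    def fill_missing_counters(reported, known_protocols):
        # Any protocol the collector did not report for this interval is
        # assumed to have a count of zero.
        counters = dict(reported)
        for proto in known_protocols:
            counters.setdefault(proto, 0)
        return counters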

Added the ability to query NNTSC for a list of streams that have been added since a given stream was created. This is needed to allow ampy to keep up to date with streams that have been added since the connection to NNTSC was first made. This is not an ideal solution, as it adds an extra database query to many ampy operations, but I'm hoping to come up with something better soon.

Revisited and thoroughly documented the ShewhartS-based event detection code in netevmon. In the process, I made a couple of tweaks that should reduce the number of 'unimportant' events that we have been getting.