Blogs

28 Sep 2017

Spent some time chasing down issues in my BIRD configuration in my BGP resiliency testbed that meant routes were being shared inappropriately between peers (missing filters, which wouldn't have been a problem except I'm also messing with settings allowing the local AS to appear). Added further peers and edge devices in different configurations to make sure that they are all properly isolated. Everything looks to be working pretty well, and enabling/disabling specific peering sessions causes the appropriate route updates. Adding a second controller to the test correctly keeps the best routes available even when one of them is unavailable.

Noticed that as the number of peers increased, the number of full route recalculations was getting large, so tried to remove some extraneous causes of updates being sent. Often we already had enough information in a peer process to do the work without asking the table to do work as well (and possibly triggering it to send unnecessary updates to other peers). Also added a very short dampening period to updates so that many consecutive messages in a short time period cause only a single route recalculation.
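The dampening idea can be sketched roughly as follows. This is only an illustration, not the router's actual code: the class name `RecalcDampener`, the 0.1 second period and the explicit `now` argument are all assumptions made so the example is easy to follow and test.

```python
class RecalcDampener:
    """Coalesce a burst of updates into a single route recalculation."""

    def __init__(self, recalc, delay=0.1):
        self.recalc = recalc    # callback that performs the full recalculation
        self.delay = delay      # dampening period in seconds (hypothetical value)
        self.deadline = None    # time at which the pending recalculation fires

    def notify(self, now):
        """Record that an update arrived; start the timer if none is pending."""
        if self.deadline is None:
            self.deadline = now + self.delay

    def poll(self, now):
        """Run the recalculation once if the dampening period has elapsed."""
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            self.recalc()
            return True
        return False


calls = []
dampener = RecalcDampener(lambda: calls.append("recalc"), delay=0.1)
for t in (0.00, 0.01, 0.02):      # three updates in quick succession
    dampener.notify(t)
early = dampener.poll(0.05)       # still inside the dampening period
fired = dampener.poll(0.10)       # period elapsed, one recalculation runs
```

Only the first update in a burst starts the timer, so later updates inside the period ride along with the recalculation that is already pending.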

Fixed a few bugs that would allow saved routes to be modified by filters, meaning the next time these routes ran through filters the results would be cumulative. Hopefully the saved raw/original routes and filtered routes for distribution are now quite separate.
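To illustrate the kind of bug described above, here is a minimal sketch in which filters always run on a copy of the saved route. The `Route` class, `prepend_filter` and the AS numbers are hypothetical and are not taken from the router code.

```python
import copy


class Route:
    """Hypothetical stand-in for a saved route entry."""

    def __init__(self, prefix, as_path):
        self.prefix = prefix
        self.as_path = list(as_path)


def prepend_filter(route, asn=65001):
    """Example filter that mutates the route it is given."""
    route.as_path.insert(0, asn)
    return route


def apply_filters(saved_route, filters):
    """Run filters on a deep copy so the saved route is never modified."""
    route = copy.deepcopy(saved_route)
    for f in filters:
        route = f(route)
        if route is None:       # a filter may reject the route entirely
            return None
    return route


original = Route("10.0.0.0/24", [65010])
first = apply_filters(original, [prepend_filter])
second = apply_filters(original, [prepend_filter])
```

Without the copy, the second pass would see an AS path that had already been prepended once, which is the cumulative effect mentioned above.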

28 Sep 2017

Had a very interesting chat with Perry about the BGP router project, and how I was going about trying to make it more resilient. He suggested a few ways to go about it that were much simpler than what I was planning, and also did away with the nastiness of shared state between the redundant controllers. Each controller can independently do its own thing and use BGP route selection on the managed devices to settle any differences arising.

Started setting up a test environment so that I can trial the changes made to make the BGP router more resilient to failures. It's currently a simple network using docker containers running BIRD to act as my peers/routers, with another couple running redundant instances of my code. Most of the work so far has gone into getting my edge devices running BIRD to do the right thing with the routes, importing and exporting using the correct tables to make sure they don't get inadvertently modified or shared at the wrong location.

Updated the router to better track which peers are in an active state, and to add communities to exported routes when peers are missing in order to flag the degraded state to the recipient.
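A rough sketch of flagging the degraded state with a community might look like this. The community value `(65000, 999)` and the function name are made up for illustration; the values the router really uses are not given here.

```python
# Hypothetical community used to signal "this controller is missing peers".
DEGRADED = (65000, 999)


def export_communities(communities, expected_peers, active_peers):
    """Return the communities to attach to an exported route.

    The DEGRADED community is added whenever at least one expected peer
    is not currently in an active state.
    """
    missing = set(expected_peers) - set(active_peers)
    result = set(communities)
    if missing:
        result.add(DEGRADED)
    return result


healthy = export_communities({(65000, 1)}, ["p1", "p2"], ["p1", "p2"])
degraded = export_communities({(65000, 1)}, ["p1", "p2"], ["p1"])
```

The recipient can then match on the community in its own import policy, for example to prefer routes from a controller that has a full view.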

25 Sep 2017

Back at work after a couple of weeks disrupted by illness. Spent most of the week working on my application protocol paper. Managed to produce a few interesting looking graphs and am now starting to get a rough idea of how my narrative is going to come together. Essentially, modern application protocols are vague and therefore require a lot more work and expertise to identify. However, they are still possible to identify and there are still plenty of new protocols appearing every year, so DPI hasn't outlived its usefulness entirely yet.

Had a meeting with Alistair from CAIDA about the first steps on the STARDUST project, which is essentially a redevelopment of their telescope to support 10G capture and multiple live clients. Obviously, this is going to build a lot on our experiences so far with parallel libtrace / wdcap -- one of my key jobs will be to develop a new parallel, multicast RT protocol as the old RT protocol simply won't be fast enough anymore.

20 Sep 2017

This week I finished off modifying the BGP performance testing tool. I ran the test several times with various prefix sizes on Quagga, BIRD and the disaggregator to gather stats and compare performance. The tool samples CPU usage, memory consumption, prefixes processed and time. After cleaning up and processing the results I then looked at ways to improve performance. I focused on memory and CPU profiling the code to see where the bottlenecks are and where we can improve.

Memory usage seems to be quite high due to the prefix and route entry objects, which we have a lot of. After looking into potential solutions I found no way to reduce this in Python. I then started looking at implementing the prefix module in C and replacing the existing Python-based implementation to reduce memory usage and improve performance.

11 Sep 2017

This week I finished off implementation of the dynamic topology module. Fixed up issues with the OSPF dynamic topology module and with the processing of topology information in the network module. I also performed more tests to make sure that the network changes are received by the disaggregator and interpreted correctly. I have also fixed some bugs relating to the unit test runner and modified the disaggregator to use a logging module, in preparation for performance testing. The log level for the output can now be specified as an argument when running the disaggregator.
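Taking the log level from the command line can be sketched as below. The flag name `--log-level` and the format string are assumptions; the disaggregator's real argument may be named differently.

```python
import argparse
import logging

LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def parse_args(argv=None):
    """Parse command line arguments; --log-level is a hypothetical flag name."""
    parser = argparse.ArgumentParser(description="disaggregator (sketch)")
    parser.add_argument("--log-level", default="INFO", choices=LEVELS,
                        help="minimum severity of messages to output")
    return parser.parse_args(argv)


def configure_logging(level_name):
    """Map the level name onto the logging module's numeric level."""
    level = getattr(logging, level_name)
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    return level


args = parse_args(["--log-level", "DEBUG"])
level = configure_logging(args.log_level)
logging.getLogger("disaggregator").debug("logging configured")
```

Restricting the argument with `choices` means a mistyped level is rejected by argparse before any logging is set up.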

At the end of the week, I started to work on performance testing the code. I am currently modifying the testing tool to allow benchmarking of the disaggregator. There are several other BGP implementations that can be benchmarked with the tool. We can use these to compare our results.

04 Sep 2017

Most of this week I spent on getting dynamic topology information from the network. I mostly focused on implementing support for OSPF. I extended the simulation tools to allow starting Quagga ospfd router instances and extended the connection tools to work as well. I then created a test OSPF network and spent some time trying to connect to it to receive link state updates. After some experimentation, I managed to integrate a tool that will establish an OSPF connection to a router and receive link state updates, which it stores and processes. Network information is built from the link state database. This info is sent to the disaggregator, which will process it and create a topology object from it. When a network change occurs, either due to something expiring, a link going down or something new being advertised, the internal topology of the disaggregator will be modified accordingly.
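As a simplified picture of how a topology object can follow the link state database, consider the sketch below. It is not the disaggregator's topology module: the `Topology` class, its method names and the router IDs are invented, and real OSPF LSAs carry far more detail than a neighbour-to-cost mapping.

```python
class Topology:
    """Toy topology built from per-router adjacency advertisements."""

    def __init__(self):
        self.links = {}   # router id -> {neighbour id: cost}

    def update_router(self, router_id, neighbours):
        """Replace the adjacencies advertised by one router."""
        self.links[router_id] = dict(neighbours)

    def expire_router(self, router_id):
        """Drop a router whose advertisement has aged out."""
        self.links.pop(router_id, None)

    def edges(self):
        """Return links that both ends advertise (a two-way check)."""
        result = set()
        for a, neighbours in self.links.items():
            for b in neighbours:
                if a in self.links.get(b, {}):
                    result.add(tuple(sorted((a, b))))
        return result


topo = Topology()
topo.update_router("1.1.1.1", {"2.2.2.2": 10})
topo.update_router("2.2.2.2", {"1.1.1.1": 10, "3.3.3.3": 5})
topo.update_router("3.3.3.3", {"2.2.2.2": 5})
before = topo.edges()

topo.update_router("3.3.3.3", {})   # the link to 2.2.2.2 goes down
after = topo.edges()
```

A link only counts once both routers advertise it, so a one-sided withdrawal is enough to remove it from the computed topology.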

I have also spent some time modifying the configuration file in preparation for support of more protocols to get dynamic topology information of our network. The config file now accepts multiple protocols as well as multiple static files to load topology information from. The dynamic protocol probe tools are automatically started by the network module based on the config file protocol type and configuration attributes.

At the end of the week, I spent some time working on implementing better filtering of prefixes to peers based on the negotiated MP-BGP AFI/SAFI attributes.
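Filtering on the negotiated address families could look something like this sketch. The AFI and SAFI numbers are the standard IANA values (AFI 1 for IPv4, AFI 2 for IPv6, SAFI 1 for unicast), but the function names are hypothetical and only unicast prefixes are considered.

```python
import ipaddress

AFI_IPV4, AFI_IPV6 = 1, 2
SAFI_UNICAST = 1


def prefix_family(prefix):
    """Return the (AFI, SAFI) pair for a unicast prefix string."""
    network = ipaddress.ip_network(prefix)
    afi = AFI_IPV4 if network.version == 4 else AFI_IPV6
    return (afi, SAFI_UNICAST)


def filter_for_peer(prefixes, negotiated):
    """Keep only prefixes whose family was negotiated with the peer."""
    return [p for p in prefixes if prefix_family(p) in negotiated]


prefixes = ["10.0.0.0/8", "2001:db8::/32", "192.0.2.0/24"]
v4_only = filter_for_peer(prefixes, {(AFI_IPV4, SAFI_UNICAST)})
dual = filter_for_peer(prefixes, {(AFI_IPV4, SAFI_UNICAST),
                                  (AFI_IPV6, SAFI_UNICAST)})
```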

04 Sep 2017

Spent most of the week working with the TCP throughput test to investigate what I can actually do with the new information, and integrating it into the test. Retransmit counters, RTT, etc. are easy to extract and explain. I also get information about time spent blocked due to the receive window on the remote end, or the send buffer on the near end of the connection, but I'm not convinced about how accurate these are (or I'm not understanding what they mean correctly). Drastically limiting my send buffer size will only sometimes report any time spent limited by the send buffer, and querying when I know for certain that there is no outstanding data doesn't always report as being application limited. It's a starting point at least, so I'll keep looking at the data and see what can be done about it.

Kept working on making the BGP disaggregated router more resilient. Implemented a few new messages to communicate the state of peers between different processes within a single instance so that they can be compared with other instances. Got some simple logic working that will disable any instance that is known to have an incomplete view of the peers.
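The "disable any instance with an incomplete view" logic can be approximated as below. This is a sketch under the assumption that each instance reports the set of peers it currently sees as established; the function name and instance labels are hypothetical.

```python
def instances_to_disable(views):
    """Pick instances whose view of the peers is incomplete.

    views maps an instance name to the set of peers that instance
    currently sees as established. An instance is incomplete if some
    other instance can see a peer that it cannot.
    """
    all_peers = set().union(*views.values())
    return {name for name, peers in views.items() if peers < all_peers}


views = {
    "controller-a": {"peer1", "peer2", "peer3"},
    "controller-b": {"peer1", "peer3"},
}
disabled = instances_to_disable(views)
```

If a peer is down for every instance, nobody is disabled, since no instance has a better view than the others.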

04 Sep 2017

Wrote up a sample program to test out some heartbeat ideas, and some fairly simple ideas look like they should work ok, while sharing minimum state between instances of the routing engine. Started to implement that inside the actual BGP disaggregated router code to see how it will behave with real data. Started setting up my test environment to allow multiple instances to be running and connected to the same BIRD process so that they all get the same routes.
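One very simple heartbeat scheme, of the sort that shares almost no state, is sketched here. It is not the sample program mentioned above; the `HeartbeatMonitor` class and the 3 second timeout are assumptions.

```python
class HeartbeatMonitor:
    """Track when each instance was last heard from."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout    # seconds of silence before an instance is dead
        self.last_seen = {}       # instance name -> time of last heartbeat

    def beat(self, instance, now):
        """Record a heartbeat from an instance."""
        self.last_seen[instance] = now

    def alive(self, now):
        """Return the instances heard from within the timeout."""
        return {name for name, seen in self.last_seen.items()
                if now - seen <= self.timeout}


monitor = HeartbeatMonitor(timeout=3.0)
monitor.beat("instance-a", 0.0)
monitor.beat("instance-b", 2.0)
alive_now = monitor.alive(4.0)    # instance-a has been silent for 4 seconds
```

The only state exchanged is the heartbeat itself, so each instance can decide locally which of its peers it believes are still running.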

Had another look at the new TCP info available to processes now that I'm running a 4.10 kernel. My userspace hasn't been updated, but all the new information is available to me if I use the updated version of the struct. Looks like I should be able to get timing information about what is causing send to block (which end is at fault), as well as retransmit counts, RTT, etc. that can be used to try to determine why the throughput test reported a particular result.
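For illustration, here is one way to read the newer fields without updated userspace headers, by unpacking the raw bytes with a hand-written layout. The format string reflects my understanding of `struct tcp_info` in `linux/tcp.h` as of 4.10 (8 single-byte fields, 24 32-bit fields, 4 64-bit fields, 6 32-bit fields, then 4 64-bit fields ending in `tcpi_busy_time`, `tcpi_rwnd_limited` and `tcpi_sndbuf_limited`); treat it as an assumption and check it against the kernel headers in use. The dictionary keys are my own names.

```python
import socket
import struct

# Assumed layout of struct tcp_info on a 4.10 kernel (verify against linux/tcp.h).
TCP_INFO_FMT = "=8B24I4Q6I4Q"
TCP_INFO_SIZE = struct.calcsize(TCP_INFO_FMT)


def parse_tcp_info(buf):
    """Extract a few fields of interest from a raw tcp_info buffer."""
    if len(buf) < TCP_INFO_SIZE:
        raise ValueError("kernel returned a shorter tcp_info than expected")
    f = struct.unpack(TCP_INFO_FMT, buf[:TCP_INFO_SIZE])
    return {
        "rtt_us": f[23],             # tcpi_rtt
        "rttvar_us": f[24],          # tcpi_rttvar
        "total_retrans": f[31],      # tcpi_total_retrans
        "busy_time_us": f[43],       # tcpi_busy_time
        "rwnd_limited_us": f[44],    # tcpi_rwnd_limited
        "sndbuf_limited_us": f[45],  # tcpi_sndbuf_limited
    }


def get_tcp_info(sock):
    """Query TCP_INFO on a connected TCP socket (Linux only)."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, TCP_INFO_SIZE)
    return parse_tcp_info(raw)
```

An older kernel returns fewer bytes than requested, which is why the parser checks the length before unpacking.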

04 Sep 2017

Implemented basic route refresh functionality in the BGP disaggregated router, and wrestled with exabgp to find out how to pass through the messages I required to do so. Also spent some time chasing down what looked like bugs in the topology module, but turned out to be a broken data file that didn't correctly describe the layout of the network.

Had another attempt at getting my chromium/youtube test working. It works fine when I build it within the chromium source tree alongside their example headless applications, but otherwise fails. It appears to be linking against a lot of object files deep inside chromium, as well as the headless static library (which I thought should contain everything needed to build a headless application?), as well as all the normal shared libraries. Back into the too-hard basket until they sort their stuff out or I have some more time to push through this.

Spent a lot of time reading about different approaches to HA/resilience, what sort of information nodes often pass around and how they go about sharing state (or avoiding sharing state).

04 Sep 2017

Finished up writing and testing the new address family selection options in AMP and making sure that all of the tests work properly when they are set. Changing the way the config files worked to allow globally setting options (but able to be overridden at test level) meant there were a few more edge cases than anticipated.
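The global-with-override behaviour amounts to a layered lookup, roughly as in this sketch. The option names `ipv4` and `ipv6` and the function name are placeholders, not AMP's real configuration keys.

```python
def effective_options(global_opts, test_opts):
    """Combine global options with per-test overrides.

    Options set at test level win; anything the test does not set
    falls back to the global value.
    """
    merged = dict(global_opts)
    merged.update(test_opts)
    return merged


global_opts = {"ipv4": True, "ipv6": True}      # hypothetical option names
test_opts = {"ipv6": False}                     # this test disables IPv6 only
options = effective_options(global_opts, test_opts)
```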

Started thinking and writing about how we might go about making the BGP disaggregated router more resilient, and what situations may arise that it will need to deal with.