User login

Weekly Report 16/11/15

16

Nov

2015

I have been working on investigating time series databases for the AMP system. Some of the more promising systems included KairosDB (http://kairosdb.github.io/), Prometheus (http://prometheus.io/) and InfluxDB (https://influxdb.com/index.html). I also looked at other systems including Druid, OpenTSDB and ElasticSearch, but for a number of reasons these didn't seem to be ideal for our use case.

A few things that were important to consider for AMP include
- The server may not necessarily be clustered, so systems that rely on clustering aren't particularly useful
- The data taken from each test (http, tcpping etc.) is reasonably multi-dimensional, so time series that rely on univariate data aren't useful either
- Not all data from tests is numeric (eg AS numbers in traceroute), so systems that use numeric typing wouldn't be ideal either
- We would prefer not to use a java based system if possible
- We want queries that are used on the website to be as fast as possible, so a database that supports fast querying and aggregation over time would be great.

After reviewing these options, I decided the best place to start would be InfluxDB. A few useful features of InfluxDB include:
- Aggregation is done on the fly through the use of continuous queries. We can set up queries to aggregate data that will be queried often into smaller tables. For example, one might make a table that contains just the mean rtt for tcpping between hosts in 5 minute intervals. This speeds up querying a lot.
- It has a very fast write speed
- It is written in Go
- It has an HTTP API
- One series per measurement, so the database schema is nice and simple
- There is a python API available (http://influxdb-python.readthedocs.org/en/latest/)

A couple of issues with InfluxDB are that:
- Custom functions are not yet supported (https://github.com/influxdb/influxdb/issues/68). It would be great if we could create a 'mode' function to help with traceroute measurements
- The system is based on what are called tags and fields (sort of both a feature and an issue). This means that each column in the table needs to be labelled as either an independent or a dependent variable. You cannot construct queries that select the values of all tags, and likewise you cannot have a 'group by' clause that is based on a particular value

I have successfully filled this database with data from the RabbitMQ dev queue, and have been testing times for standard queries, such as the mean rtts of tcpping over the last couple of days in 5m intervals. Some goals for next week include:
- compare query times for InfluxDB with our current system
- test InfluxDB on the prod stream of data, and backfill a longer period of time with random data, to test how it works with larger volumes of data
- demonstrate and test how a traceroute query may work to find the mode path over a period of time

A couple of results from initial tests on a few queries:

Testing query: 'Select * from "icmp_rtt.means.5m" where time > now() - 2d' for 5 clients (query on aggregated table):
Rows returned: 5243
Average Speed: 0.1007314 seconds

Testing query: 'Select mean(rtt) from icmp where time > now() - 2d group by time(5m)' for 5 clients (query on non-aggregated table):
Rows returned: 577
Average Speed: 6.2435292

So querying a pre-aggregated table is around 60 times faster, which is a great speed gain. It will be interesting to see how these queries do compared to SQL queries.