
Machine Learning Log File Analysis

The purpose of this project is to take a different approach from the usual regular-expression analysis of log files, which requires constant tuning and adaptation, and instead to look at machine learning techniques to see what patterns and events can be identified. It is particularly aimed at data centres and cloud environments.




Started examining connections between multiple files. In other words, if an event happens in one file and we look into other files around the same time, are there any relations?

Firstly, to be able to check a time window, all the log file timestamps needed to be standardised, so I made a script to process each timestamp and convert it into a unix timestamp. This allows for easy comparisons: subtracting one timestamp from the other gives the difference in seconds between two events.
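The conversion script itself isn't shown in the post; a minimal sketch of the idea might look like this (the syslog-style format string and the hard-coded year are assumptions, since syslog timestamps omit the year):

```python
from datetime import datetime

def to_unix(stamp, year=2015, fmt="%b %d %H:%M:%S"):
    """Convert a syslog-style timestamp like "May 22 14:03:01" into a
    unix timestamp. The year is supplied separately because syslog
    lines omit it; the format string can be swapped per log type."""
    dt = datetime.strptime(stamp, fmt).replace(year=year)
    return int(dt.timestamp())

# The difference in seconds is then a plain subtraction:
delta = to_unix("May 22 14:04:01") - to_unix("May 22 14:03:01")
```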

Once I had usable timestamps I created a program to compare two files against one another. It goes through the first file event by event, then scans the second file for events within a sixty second time window of each one. When such an event is found, a similarity between the two events is calculated and the matching tokens are stored.
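A rough sketch of that pairing loop (the function names and the Jaccard-style token similarity are illustrative; the post doesn't specify which similarity measure was used):

```python
def jaccard(a, b):
    """One possible similarity: fraction of shared whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def events_within_window(file_a, file_b, window=60):
    """For each (timestamp, event) pair in file_a, yield the events in
    file_b that fall within `window` seconds of it, with a similarity
    score. Both inputs are lists of (unix_timestamp, event_text)."""
    for ts_a, ev_a in file_a:
        for ts_b, ev_b in file_b:
            if abs(ts_a - ts_b) <= window:
                yield ev_a, ev_b, jaccard(ev_a, ev_b)

pairs = list(events_within_window([(100, "a b")], [(130, "b c"), (300, "x")]))
```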

A lot of the lines need to be stripped of characters like =, [, < etc. so that any useful information can be compared. I have also found that, since a lot of the IP addresses start with 130.217, there are a lot of matches on just that, so it may be useful to implement a blacklist for that network prefix, because IP addresses are assigned more weight than normal words.
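The stripping and blacklist could be combined in one small tokenizer; the character set and prefix handling below are a sketch of the idea, not the actual code:

```python
import re

# Characters to strip out before comparing tokens.
STRIP = re.compile(r"[=\[\]<>(),:'\"]")
# Ubiquitous local network prefix; drop tokens starting with it so
# matches on it alone don't dominate the similarity score.
BLACKLIST_PREFIX = "130.217."

def tokens(line):
    """Strip noise characters, split on whitespace, and drop tokens
    from the blacklisted network prefix."""
    cleaned = STRIP.sub(" ", line)
    return [t for t in cleaned.split()
            if not t.startswith(BLACKLIST_PREFIX)]
```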




Firstly I made the MultinomialNaiveBayes sort by confidence and print its predictions, in order of confidence, to standard out. The application allows multiple files to be used for training, and once trained another file can be specified for testing.
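The Weka code isn't included in the post; as an illustration of the confidence ranking, here is a tiny hand-rolled multinomial naive Bayes (training lines and labels are made up for the example):

```python
from collections import Counter
from math import exp, log

def train_mnb(docs, labels, alpha=1.0):
    """Train a minimal multinomial naive Bayes: per-class log-priors
    plus per-class token log-likelihoods with add-alpha smoothing."""
    counts = {c: Counter() for c in set(labels)}
    for doc, lab in zip(docs, labels):
        counts[lab].update(doc.split())
    vocab = {t for c in counts.values() for t in c}
    n_docs = Counter(labels)
    model = {}
    for c, tc in counts.items():
        denom = sum(tc.values()) + alpha * len(vocab)
        model[c] = (log(n_docs[c] / len(docs)),
                    {t: log((tc[t] + alpha) / denom) for t in vocab},
                    log(alpha / denom))       # fallback for unseen tokens
    return model

def predict(model, doc):
    """Return (label, confidence): the most likely class and its
    normalised posterior probability, usable for ranking output."""
    scores = {c: prior + sum(likes.get(t, unseen) for t in doc.split())
              for c, (prior, likes, unseen) in model.items()}
    m = max(scores.values())
    probs = {c: exp(s - m) for c, s in scores.items()}
    best = max(probs, key=probs.get)
    return best, probs[best] / sum(probs.values())

model = train_mnb(["failed password for root", "session opened for root"],
                  ["unsafe", "safe"])
# Print the most confident predictions first.
for line in sorted(["failed password attempt", "session opened ok"],
                   key=lambda l: -predict(model, l)[1]):
    print(predict(model, line), line)
```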

Next, to allow the user to correct the program's mistakes, I made a GUI which displays the output in a tabular format. The user is then able to go through each event and update whether or not it is safe. Finally, once the user is happy, they can save the instances to an .arff file so they can be fed back in for training, allowing the classifier to improve from its previous mistakes.

Now I'm looking into grouping events that occur within some time period (e.g. 60 seconds) of one another across multiple files, to see how prevalent the connections between multiple file types are. Some measure of similarity between events will be needed, so each "word" will be treated as a token when comparing two events. This can be adjusted to weight certain parts of events; e.g. treat an IP address as 4 separate parts, so instead of a weight of one when the whole thing matches it would now have a weight of four, or a weight of two if just the network section matches.
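The weighting idea could be sketched like this. The weights of 4 (whole address) and 2 (network section) come from the text; splitting into just a network half and a host half is a simplification of "4 separate parts":

```python
import re

IP_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")

def weighted_tokens(event):
    """Map each token of an event to a weight. An IP address is split so
    that a full match is worth 4 and a network-only match worth 2."""
    weights = {}
    for tok in event.split():
        if IP_RE.match(tok):
            net = ".".join(tok.split(".")[:2])
            weights[tok] = 2   # host half: with the network half below,
            weights[net] = 2   # a full-address match totals 4
        else:
            weights[tok] = weights.get(tok, 0) + 1
    return weights

def similarity(ev_a, ev_b):
    """Sum the weights of the tokens the two events share."""
    wa, wb = weighted_tokens(ev_a), weighted_tokens(ev_b)
    return sum(min(wa[t], wb[t]) for t in wa.keys() & wb.keys())
```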

To be able to find the time between events I will need to convert the timestamps into a usable format, which will need to be customised for log files with different timestamp formats.




Found a paper where they cluster event logs with word vector pairs; the approach compares each pair to every other pair in the supplied logs, allowing it to cluster lines with similar parts. There is also a toolkit associated with the paper that lets you specify the input files and the support threshold for a pair, then outputs the clusters where the support is reached; the outlier clusters can also be output. This will need to be investigated further to see if it is a good possible solution.

Had a meeting with Antti Puurula about possible approaches, where we discussed outputting a ranking into lists of safe and unsafe events. We discussed how this could be evaluated with a Mean Average Precision measurement, and then a few algorithms that could be used for scoring events: clustering if the feature space can be separated, supervised learning if all the data is tagged, or his recommendation of supervised learning where the user manually updates the list of safe/unsafe events and the classifier updates iteratively.
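For reference, the Mean Average Precision measure mentioned above is straightforward to compute; here "relevant" would be the set of genuinely unsafe events in a ranked list:

```python
def average_precision(ranked, relevant):
    """AP for one ranked list: the mean of precision@k taken at every
    rank k where a relevant item appears."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over several (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```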

We then discussed how to integrate non-language features like timestamps, by having another algorithm such as naive Bayes handle continuous features. This way we could identify events happening within a certain time period of one another to tie events between files.
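The standard way naive Bayes handles a continuous feature is to fit a per-class Gaussian to it; a sketch (what feature to model, e.g. seconds to the nearest event in another file, is an assumption):

```python
from math import exp, pi, sqrt

def fit_gaussian(values):
    """Per-class mean and variance estimated from training values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def gaussian_pdf(x, mean, var):
    """Likelihood of a continuous feature value under the fitted
    Gaussian; this term is multiplied into the naive Bayes score
    alongside the word likelihoods."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)
```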




Finished off the word count program, which outputs a comma separated format: word, occurrences, frequency. This helps give a better understanding of the documents being worked with. Next I think it will be useful to get counts of how many words occur only once, how many occur more than 10 times, etc.
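The core of such a word count program is small; a sketch of the counting and CSV output (function names are illustrative):

```python
import csv
import sys
from collections import Counter

def word_frequencies(lines):
    """Occurrence counts and relative frequencies for every word in the
    input lines, most common first."""
    counts = Counter(w for line in lines for w in line.split())
    total = sum(counts.values())
    return [(w, n, n / total) for w, n in counts.most_common()]

def write_csv(rows, out=sys.stdout):
    """Emit the word, occurrences, frequency table as CSV."""
    writer = csv.writer(out)
    writer.writerow(["word", "occurrences", "frequency"])
    writer.writerows(rows)
```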

On Friday 22/5 I had a meeting with Bob Durrant from the Statistics department to get his opinion on possible approaches. I now have a better idea of what I'm going to do next: ignoring topic modelling for now and looking at clustering, starting with only the bearwall firewall logs, then looking into comparing multiple types of log files later.

Firstly I will start with a simple version of K-Means and add functionality as I go along, e.g. seeing if there is more meaning in an IP address by separating the network and host portions, among many other possibilities. I will also need to look into the languages and libraries I have found for this type of work to decide what I will be using.
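A simple K-Means of the kind described could start from something like this (a minimal sketch on plain feature vectors, meant to be grown, not a production implementation):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(xs) / len(points) for xs in zip(*points)]

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-Means: random initial centroids, then alternate
    assignment and centroid update for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[i].append(p)
        # Keep the old centroid if a cluster happens to empty out.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters
```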

I have also been working on my presentation, which I will give on Wednesday the 27th.




I am going to organise a meeting with Bob Durrant from Statistics to discuss my project and get his opinion on approaches, and Bob has suggested an expert on text mining who would be glad to discuss my project.

For these upcoming meetings, and in general, it was suggested that I make brief summaries of the information I am working on, so I created a program to process my log files and output a word frequency count for each set of input files belonging to a process. The output is in .csv format so it can easily be loaded into Excel to be sorted, plotted etc. It can also take a specific date as input so the logs for one specific day can be filtered.

Next I will use my program to make frequency tables for each of the processes in my log files (where the process would be considered the author in text mining terms), then I will take a few lines from each process to give a short summary of each process's output, which will give an overview of the type of files I'm working with.




Researched machine learning algorithms for processing small documents. Since social media is ever more popular, there is a lot of work going into how to learn useful things from posts, for example learning about an event being posted on Twitter. I was especially interested in the techniques used for Twitter, since a message cannot be more than 140 characters, is posted by one user and has an interesting makeup of #hashtags etc. A log entry is somewhat similar in structure: each event is a short line, is made by one specific program and contains port numbers and IP addresses.

Some of the studies found better results by aggregating the "tweets" into one document, which is already done with log files; others try to address the short nature of the documents by looking at biterm pairs, among many other techniques. Mostly the algorithms are based on topic modelling, which infers topics (groups of words) that could plausibly generate the document, though I did find other clustering algorithms like spherical K-Means which I will look into further.

Looked further into Mallet's API and at how the importer works that creates the .mallet file passed as input. I was able to change the regex used to parse tokens so that it includes IP addresses and other numbers, and found that it converts the token strings to integers for storage. After getting the IP addresses etc. included in the input I tried topic modelling, but it failed and the output was all weird characters, so I need to find out what effect the numeric and punctuation characters have on both the input generation and modelling steps.
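The modified regex isn't quoted in the post, but the idea of a token pattern that keeps IP addresses and numbers together (rather than splitting or dropping them) can be illustrated in plain Python:

```python
import re

# Keep dotted-quad IP addresses as single tokens, and otherwise accept
# word characters and dots; purely illustrative, not Mallet's pattern.
TOKEN_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}|[\w.]+")

tokens = TOKEN_RE.findall("DROP src=130.217.1.5 dpt=22")
```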




Started using Mallet on the files that I have collected. Tested it on the entire directory at once: it created a .mallet file quickly and a topic model in 1 hour 30 mins. I then used it on the Bearwall logs, which I had unzipped, and it took a lot longer (over 3 hours), which makes me believe it ignores the zipped files.

Then I looked at the topic keys it had generated: it had stripped out all the numbers and kept only words, so looking at the topics didn't really show anything useful.

So the next step is to look into other programs and methods better suited to log files, because it would be more useful to see events grouped together from multiple files of different applications. That's not to say the topic modelling Mallet is suited towards won't be helpful; I will also need to look into whether there is an option to retain numbers etc. and, if so, re-test.




Spent this week working on other assignments to get them finished before the end of teaching recess, so I didn't get to work on the project.

The next step is still to research popular document learning algorithms and, if they exist within Mallet, see how well they adapt to log files, adjusting them if needed for a better fit to this text format.




Went and saw Brad and got the /var/log log files from ns1 zipped up to start testing with, so I can decide at the next stage whether or not I need Syslog logs. It is simpler to use the logs already available than to set up Syslog on a machine, so these will be a great start.

Unzipped them and am now trying to find the best way to combine them into a .mallet file. Tried to run it on the whole folder, but after a couple of hours it was still going; when I have time I will leave it running for a while, because it may just take a while to get through 400MB of logs. For the moment I'll use subfolders while waiting. The examples I went through all had their files in .txt format, but when testing on a single folder Mallet seems to be able to decompress and read files in the plain file format.

I ran Mallet on the single subfolder /var/log/kernel to create a simple topic model, which worked well but didn't really have enough info to show anything interesting, so I will look for some bigger subsets to test on until I can get the entire log to combine.

Now I will start researching and testing different document clustering algorithms for finding patterns within these logs.




First to catch up on the previous weeks, I created my brief and proposal (see attached for more info).

This week my goal was to become familiar with using the tool Mallet which I will use to analyse log files.

There aren't many tutorials on Mallet, but I found some good ones on using the command line interface (CLI), and I should be able to accomplish most of what I want just with the CLI; I will use it for now to start my testing and will possibly move on to the Java API if I need to do something more complicated than what is possible with the CLI.

I then looked further into Mallet's Java API. There aren't really any tutorials on how to use it, but if I need to accomplish something more complicated, a combination of the example files and docs should be sufficient to make a useful program.

The next step is to get some log files from Syslog or /var/log and start converting them to Mallet's file format, ".mallet" (from a single file or an entire directory), and then run the data through some machine learning methods.