Weekly Report - 8/5/15




Researched machine learning algorithms for processing small documents. Since social media is becoming ever more popular, a lot of work is going into how to learn useful things from posts, for example learning about an event being discussed on Twitter. I was especially interested in the techniques used for Twitter, since a message cannot be more than 140 characters, is posted by a single user and has an interesting makeup of #hashtags etc. A log entry is somewhat similar in structure, i.e. each event is a short line, is produced by one specific program and contains fields such as port numbers and IP addresses.

Some of the studies found better results by aggregating the "tweets" into one document, which is already the case with log files. Others try to address the short nature of the individual messages by looking at biterm pairs (pairs of words that occur together in the same short text), among many other techniques. Most of the algorithms are based on topic modeling, which infers topics (groups of words) that could plausibly have generated the document, though I did find other clustering algorithms, such as spherical K-Means, which I will look into further.
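
To make the biterm idea concrete for log data, here is a minimal sketch of how biterms could be extracted from short, already tokenized log lines. It only shows the pair extraction and counting step, not a full biterm topic model, and the function names and sample log lines are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def biterms(tokens):
    """Return every unordered pair of tokens taken from one short document."""
    # Sorting each pair makes ("sshd", "password") and ("password", "sshd") identical.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def biterm_counts(documents):
    """Count biterms over a whole collection of short documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(biterms(tokens))
    return counts


# Hypothetical, already tokenized log lines (not taken from real data).
log_lines = [
    ["sshd", "failed", "password", "root"],
    ["sshd", "accepted", "password", "admin"],
]

counts = biterm_counts(log_lines)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Because the pairs are counted across the whole collection rather than per line, word co-occurrence information is pooled even though each individual line is very short.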

Looked further into Mallet's API and into how the importer works that creates the .mallet file passed as input. I was able to change the regex used to parse tokens so that it includes IP addresses and other numbers etc., and found that the importer converts the token strings to integers for storage. After getting the IP addresses etc. included in the input I tried topic modeling on it, but it failed and the output was all weird characters, so I need to find out what effect the numeric and punctuation characters have on both the input generation and modeling steps.
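
As a rough illustration of the two steps described above (a token regex that keeps IP addresses and numbers, then mapping token strings to integers), here is a small Python sketch. It is not Mallet's code or API: the pattern, the `Vocabulary` class and the sample log line are all assumptions made for this example, and Mallet itself uses Java regex syntax.

```python
import re

# Hypothetical token pattern. Order matters: the IPv4 alternative must come
# before the plain-number one, otherwise "10.0.0.5" would be split into 4 numbers.
TOKEN_PATTERN = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}"  # IPv4-like address, e.g. 10.0.0.5 (octet range not checked)
    r"|\d+"                      # standalone number, e.g. a port
    r"|[A-Za-z_]+"               # alphabetic word
)


def tokenize(line):
    """Split one log line into word, number and IP address tokens."""
    return TOKEN_PATTERN.findall(line)


class Vocabulary:
    """Assigns each distinct token string a stable integer id."""

    def __init__(self):
        self.token_to_id = {}
        self.id_to_token = []

    def lookup(self, token):
        # New tokens get the next free id; known tokens reuse their id.
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.id_to_token)
            self.id_to_token.append(token)
        return self.token_to_id[token]

    def encode(self, tokens):
        return [self.lookup(t) for t in tokens]


# Made-up log line for illustration.
line = "sshd: Failed password for root from 10.0.0.5 port 22"
tokens = tokenize(line)
vocab = Vocabulary()
ids = vocab.encode(tokens)
print(tokens)
print(ids)
```

Decoding the ids back through `id_to_token` is a quick way to check that numeric tokens survive the round trip, which may help narrow down whether garbled output comes from the input generation step or from the modeling step.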