I recently got access to a huge amount of server log data at my new job. I have some experience with machine learning from college. The data includes server logs, database access logs, etc. I was wondering what kind of learning can be done on such data.

One small thing I tried was predicting the number of requests in a given hour of the day from the past week's data. That seemed to work OK, but it is kind of trivial. So:

  • What kind of learning can be done from such data?
    • Maybe predicting the probability that an IP is doing spam clicks on ads (yes, the company is into that), based on the usage patterns of previous spammers?
    • Maybe predicting when the traffic is likely to shoot up?
  • Are there any existing tools/projects that specifically leverage such data?
  • Any interesting resources/papers that talk about similar stuff?
  • I also have data on process activity on the servers over time. Can this be useful for learning?

1 Answer

Here are some techniques you can apply to this kind of log data:

  1. Extract logging templates from the source code and use them to pull identifiers out of the logs (the value in a log line that corresponds to a %s in the template is an identifier). Heuristics are then used to distinguish identifiers from non-identifiers.
  2. Use ratios between values instead of raw numbers.
  3. Use Principal Component Analysis to discover anomalies in vectors of such features.
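
As a rough illustration of how the three steps fit together, here is a minimal sketch in plain Python. Everything concrete in it is an assumption made for the example: the template string, the sample log line, the per-window message counts, and all function names are hypothetical. It treats every %s as a whitespace-free token, skips the identifier heuristics entirely, and keeps only the first principal component, found by power iteration; in practice you would more likely use a numerical library for the PCA step.

```python
import math
import random
import re

# Hypothetical logging template, as it might appear in the source code.
TEMPLATE = "Received block %s of size %s from %s"

def template_to_regex(template):
    """Step 1: turn a printf-style template into a regex, one group per %s."""
    parts = [re.escape(p) for p in template.split("%s")]
    return re.compile("^" + r"(\S+)".join(parts) + "$")

def extract_fields(pattern, line):
    """Return the values that filled the %s slots, or None if no match."""
    m = pattern.match(line)
    return list(m.groups()) if m else None

def ratio_features(counts):
    """Step 2: replace raw counts with each count's share of the row total."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def first_component(centered, iters=100, seed=0):
    """Dominant principal direction of centered rows, via power iteration."""
    rng = random.Random(seed)
    dim = len(centered[0])
    v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    start_norm = math.sqrt(dot(v, v))
    v = [x / start_norm for x in v]
    for _ in range(iters):
        # w = X^T (X v), which is proportional to (covariance matrix) * v
        proj = [dot(row, v) for row in centered]
        w = [sum(p * row[j] for p, row in zip(proj, centered))
             for j in range(dim)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def anomaly_scores(rows):
    """Step 3: squared distance of each row from the first principal axis."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    v = first_component(centered)
    scores = []
    for row in centered:
        t = dot(row, v)                       # coordinate along the component
        residual = [a - t * b for a, b in zip(row, v)]
        scores.append(dot(residual, residual))
    return scores

# Step 1 on a made-up log line.
line = "Received block blk_42 of size 1024 from 10.0.0.5"
fields = extract_fields(template_to_regex(TEMPLATE), line)

# Made-up counts of three message types per time window. The first eight
# windows trade off type 1 against type 2; the last has unusually many type 3.
WINDOWS = [
    [90, 8, 2], [80, 18, 2], [70, 28, 2], [60, 38, 2],
    [50, 48, 2], [40, 58, 2], [30, 68, 2], [20, 78, 2],
    [55, 33, 12],
]
features = [ratio_features(w) for w in WINDOWS]
scores = anomaly_scores(features)
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
print(fields, most_anomalous)
```

In this toy data the normal windows vary mostly along one direction, so the first component captures that variation and the last window ends up with the largest residual. On real logs you would probably keep more than one component and pick a threshold on the score rather than just taking the maximum.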

