The present invention relates to analysis of log files, and more specifically, to analysis of log files to locate anomalies.
In the running of a very busy cloud service, the very large volume of transactions causes system logs on servers to grow very rapidly and to very large sizes.
If an incident occurs and it is necessary to diagnose an issue where a server is crashing, a request may be made to examine log files to “find anything unusual”. Taking one example, a log file may have approximately 9 million lines of text and be 380 MB in size. Clearly, eyeballing a file of this size for unusual activity is not humanly possible.
Spam detection algorithms work on the principle that spam follows some regular pattern, and anything that is not spam looks unusual to the spam detector and is therefore considered not-spam. The same principle applied in reverse to log files could be used to filter out the common activity, leaving the unusual behind.
Similarity algorithms commonly used for spam detection, such as the Sørensen index (also known as Dice's coefficient), were found to be quite inefficient and not very effective in achieving this goal. This was largely due to the relatively small amount of content in a typical log message.
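For illustration only, Dice's coefficient between two sets A and B is 2|A∩B| / (|A| + |B|). The following is a minimal sketch of one common way to compute it for two text lines, assuming each line is reduced to its set of lowercase character bigrams. The function names and the choice of bigrams are assumptions made for this example and are not taken from any particular spam detector.

```python
def bigrams(text):
    """Return the set of adjacent character pairs in text (lowercased).

    Using character bigrams is an assumption made for this sketch;
    word tokens are another common choice.
    """
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_coefficient(a, b):
    """Dice's coefficient of the bigram sets of two strings.

    Returns a value from 0.0 (no shared bigrams) to 1.0 (identical sets).
    """
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Both strings are too short to yield a bigram; treat as identical.
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


# "night" and "nacht" each have four bigrams and share only "ht".
score = dice_coefficient("night", "nacht")  # 2 * 1 / (4 + 4) = 0.25
```

Because the score is built from only a handful of elements when the inputs are short, a change to a single field of a brief log message can move the result substantially, which is consistent with the difficulty described above. Comparing every line against every other line in this way also requires a number of comparisons that grows quadratically with the number of lines.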
Therefore, there is a need in the art to address the aforementioned problems.