One of the most challenging problems in data manipulation in the future is to efficiently handle not only very large amounts of data but also multiple induced properties or generalizations in that data. Monitoring and analysis of communication networks are increasingly based not only on event counters and key performance indicators derived from them but also on event logs continuously collected by the monitoring system. For example, security analysis is almost entirely based on such event logs, either by sampling and summarizing them or by analysing detailed logs in order to find the courses of action that lead to security breaches.
However, such logs may easily grow to an overwhelming amount of data. The reason for this is that the monitoring system has to log a lot of information so that any kind of incident can be analysed afterwards. This problem becomes worse every day because network operators add new and more powerful tools for logging and security monitoring. In most of these tools, data are stored in the form of event logs.
In conventional 2nd generation mobile communication networks, event logs may consist of alarms, disturbances and notifications, real time traffic (RTT) reports generated by e.g. a mobile switching center (MSC), billing tickets, performance logs generated by a network management system (NMS) and network elements, operating system logs, and the like. The network operators are starting to utilize information hidden in these logs. The utilization of such log or log record information becomes more and more important because network operators wish to analyse not only the behaviour of network elements but also the quality of service provided by the network to the customers or customer groups. Moreover, network operators are more and more interested in analysing their own processes and the operation of their networks.
In order to achieve proper monitoring and prevent problems as well as security breaches, maintenance staff should analyse these logs regularly and pay attention to any abnormal log entries or early signs of trouble. In practice, however, this is laborious. Logs are huge, at worst millions of lines per day, and filled with many uninteresting entries, which makes it difficult to recognize the interesting lines in between.
FIG. 1 shows an example of an extract of a log record produced by a firewall application. Most of the lines report quite normal network activity, which is not very interesting. Therefore, logs are used basically for ‘after-the-fact’ analysis, where reasons and chains of events preceding problems are analysed. Even for such tasks, the present logs are very difficult to use due to the huge amount of data, which hides relevant facts.
The problem is associated not only with the size of the logs but also with their number. Currently, in network management systems, tens of different types of log files are generated per day, all of which should be analysed in the course of troubleshooting or security monitoring. Examples include event logs from network element operating systems (Unix, NT, etc.), system logs (NMS system, etc.), network monitoring logs (alarms, RTT logs, billing logs, etc.), system operation logs (MML, Unix accounting, connections to network elements, etc.), security logs (intrusion detection systems, authentication, firewalls, etc.) and so on. In addition, new tools and new logs are introduced gradually. The analysis of these huge amounts of logs becomes even harder as there is no common structure or template for these logs. The logs contain a lot of semi-structured data meant to be readable and informative for experts in order to debug or follow up the system behaviour. There are, however, also plenty of structured data where field values describe the event as well as the time and origin of its occurrence. Even a common naming of fields can be provided, but unfortunately the meaning of the fields varies from one application to another.
So far, the main solutions presented have been sorting and string-based searches for events. In many applications, it is also possible to select some field value combinations to be either searched for or removed from a display. Furthermore, in alarm correlation, there is a rule-based approach where predefined alarm sets or patterns are either suppressed from or combined in the display. In intrusion detection applications, systems can be arranged to count combinations of values of predefined log fields. When a preset threshold is exceeded, an alarm is signaled. However, these systems are powerless against actions such as distributed attacks that originate from many sites slowly over longer periods of time.
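The threshold-based counting described above, and its blind spot for distributed attacks, can be illustrated with a minimal Python sketch. The function name `combos_over_threshold`, the field names `src` and `dst_port`, the sample records and the threshold value are all hypothetical and chosen only for illustration; they are not taken from any particular intrusion detection system.

```python
from collections import Counter


def combos_over_threshold(records, fields, threshold):
    """Count each combination of the given field values and return
    the combinations whose count exceeds the threshold."""
    counts = Counter(tuple(rec[f] for f in fields) for rec in records)
    return {combo: n for combo, n in counts.items() if n > threshold}


# Hypothetical records: one source contacting port 22 six times ...
single_source = [{"src": "10.0.0.5", "dst_port": 22} for _ in range(6)]
# ... versus six different sources contacting it once each.
distributed = [{"src": f"10.0.1.{i}", "dst_port": 22} for i in range(6)]

# Assumed alarm threshold of 5 events per (src, dst_port) combination.
alerts_single = combos_over_threshold(
    single_source, ("src", "dst_port"), threshold=5)
alerts_distributed = combos_over_threshold(
    distributed, ("src", "dst_port"), threshold=5)
```

Under these assumptions the single-source activity raises an alarm, whereas the same number of events spread over many sources stays below the threshold for every combination and raises none.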
Another problem arises from the storage requirements of these huge amounts of logs. For several reasons, these event logs have to be stored for longer periods of time. Security related logs, for example, might be needed several months or even years after their creation in order to properly analyse security breaches which have taken place over a long period of time or a long time ago. In practice, however, this is difficult since the log files are so huge, at worst millions of lines per day. Archiving of such files requires a lot of machine resources. An even harder problem is to find in these archives the data needed for solving a specific analysis task and to load it from the archive into the current analysis environment.
In practice, these event logs are normally filled with repeated entries sharing a lot of structure but still containing some fields whose values vary. In the example shown in FIG. 1, two repeating event record field value combinations are shown. However, in the middle of all the lines, there is a time field which changes from line to line.
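As a rough illustration of this observation, the following Python sketch masks the varying time field and counts the remaining constant parts of each line. The log lines are invented for the example and do not reproduce FIG. 1; the `hh:mm:ss` time format, the placeholder `<time>` and the function name `constant_part_counts` are likewise assumptions.

```python
import re
from collections import Counter

# Assumed time format hh:mm:ss; real logs may use a different one.
TIME_FIELD = re.compile(r"\b\d{1,2}:\d{2}:\d{2}\b")


def constant_part_counts(lines):
    """Replace the time field with a placeholder and count how often
    each remaining field value combination occurs."""
    return Counter(TIME_FIELD.sub("<time>", line) for line in lines)


# Hypothetical firewall-style lines with a time field in the middle.
lines = [
    "fw1 11:02:31 accept tcp a-host b-host",
    "fw1 11:02:35 accept tcp a-host b-host",
    "fw1 11:02:36 drop udp c-host d-host",
    "fw1 11:02:40 drop udp c-host d-host",
]
patterns = constant_part_counts(lines)
```

With the time field masked, the four hypothetical lines collapse into two repeating combinations, each occurring twice.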
The log files are typically archived in compressed form. In many cases, compression is done with the well-known Lempel-Ziv compression algorithm (LZ) or with some other corresponding algorithm. When the log files are restored and a query or a regular expression search for relevant lines is made, the whole archive must be decompressed.
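The cost of searching such an archive can be sketched in Python with the standard `gzip` module, whose DEFLATE format combines LZ77 with Huffman coding. The sample log content and the function name `search_compressed_log` are hypothetical. The sketch shows that even when only one line matches, every line of the archive has to be decompressed and scanned.

```python
import gzip
import io
import re


def search_compressed_log(compressed, pattern):
    """Search a gzip-compressed log for lines matching a regular
    expression. Returns the matching lines and the number of lines
    that had to be decompressed to find them."""
    regex = re.compile(pattern)
    matches = []
    lines_decompressed = 0
    with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as stream:
        for line in stream:
            lines_decompressed += 1
            if regex.search(line):
                matches.append(line.rstrip("\n"))
    return matches, lines_decompressed


# Hypothetical archive: 50 uninteresting lines and one relevant line.
log_text = "".join(f"12:00:{i:02d} fw1 accept tcp\n" for i in range(50))
log_text += "12:00:50 fw1 drop udp\n"
archive = gzip.compress(log_text.encode("utf-8"))

matches, lines_decompressed = search_compressed_log(archive, r"drop")
```

Here a single matching line is found only after all 51 lines have been decompressed, since the compressed stream offers no way to jump directly to the relevant entries.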
Another possibility for archiving the logs is to first insert them into a database managed by a database management system and then, after a certain period, compress the whole database into a file that is written to a tape or another mass storage medium. The problem becomes real when there is a need to analyse old backups. An expert has to find the correct media, decompress the whole database and load it into the database management system. This can be a problem because the size of the data added to the database per day might be some gigabytes. Thus its decompression and uploading take a lot of time. If compression is not done, the problem is to fit the database onto mass storage media and still be able to manage the rapidly growing number of mass storage media. Selecting carefully what is stored can reduce the problem, but still there tends to be quite a lot of data written into archives.
Concepts for obtaining a condensed or concise representation for data mining are known, e.g., from J.-F. Boulicaut et al. “Modeling KDD Processes within the Inductive Database Framework”, Proceedings of the First International Conference on Data Warehousing and Knowledge Discovery DaWaK '99, Florence, Italy, Aug. 30 to Sep. 1, 1999, Springer-Verlag, LNCS 1676, pp. 293-302, “Frequent closures as a concise representation for binary data mining”, Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining PAKDD '00, Kyoto, Japan, Apr. 18 to 20, 2000, Springer-Verlag LNAI Vol. 1805, pp. 62-73, and “Constraint-based discovery of a condensed representation for frequent patterns”, Proceedings of the Workshop Database Support for KDD co-located with the 5th European Conference on Principles and Practice of Knowledge Discovery in Databases PKDD '01, Freiburg, Germany, Sep. 7, 2001, pp. 3-13. In these publications, patterns are used to define association rules which are screened. The aim is to obtain a condensed set of association rules. However, the disclosed procedures do not serve to reduce the space consumed or to increase readability for human beings.