A modern ICT system has sufficient processing power, network bandwidth and memory to capture and store large amounts of data relating to one or more components of the system. For example, in telecommunications, Network Operators are able to capture log data relating to many subscriber lines in their network (such as data rates, error rates, etc.), which is reported back to a Network Management System. As a security-related example in the telecommunications sector, Network Operators are also able to capture large amounts of data relating to the traffic flowing through the core network, which may be analysed to detect adverse behaviour in the event of a security breach. Generally, the amount of data which may be captured in modern ICT systems is now so vast and/or complex that its analysis is beyond the capabilities of traditional processing techniques. A new field, known as big data analytics, has emerged to tackle such data sets.
It is well known that a large data set (such as a log data file of multiple subscriber lines in a network or traffic data flowing in the network) often has latent within it information that can be utilised for valuable purposes such as network optimisation, network failure detection or security breach analysis. However, due to the inherent difficulties in extracting the relevant data from such large data sets, such analysis is often only performed on limited samples of data after an event has occurred. For example, following a network failure event, a Network Operator may extract the relevant log data for the subscriber lines that suffered the fault with the aim of determining the failure's cause. It is of course preferable to pre-empt such events rather than react to them. Accordingly, there is a desire to collect and analyse the data in real time so that such problems can be detected and anticipated.
One of the problems faced in data analytics is that input data is collected in a variety of structured and unstructured formats. When the data set increases in volume (for example, because the amount of data being collected from a single source increases or due to the data being collected from an increasing number of sources), then the lack of uniform structure in the data set makes it almost impossible to analyse. Thus, an initial step of data analytics is to organise the data into a suitable format for post-processing. This may be achieved by a data extraction method, in which a subset of semantically-significant data from the data set is identified and re-expressed in a format such that it may be analysed or stored for later analysis.
One prior art approach to this problem was the use of regular expressions to identify and extract data entries or a part thereof from a data set. For example, a data set collected at a Digital Subscriber Line Access Multiplexer (DSLAM) may contain a data entry string including substrings for: an IP address associated with a subscriber line connected to the DSLAM, an event code relating to an event on the subscriber line, a date and time of the event, and text relating to the event. From this data set, a regular expression may be used to identify and extract substrings from the data entry string. For example, the regular expression (\d+\.\d+\.\d+\.\d+) may be used to identify and extract all dotted decimal IPv4 addresses from the data set. More complex regular expressions may be used to identify particular patterns within substrings of a data entry string, such that an IP address and an event code are identified and extracted. This data can then be sorted into a structured format for further analysis.
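The regex-based extraction described above can be sketched in Python's `re` module. The DSLAM log entry below and its field layout are hypothetical, chosen only to illustrate the technique; the first pattern is the dotted-decimal IPv4 expression from the text, and the second is a more complex expression that captures several named fields at once:

```python
import re

# Hypothetical DSLAM log entry; the field layout is assumed for illustration.
entry = "2023-06-01 14:32:07 192.168.10.45 ERR-1042 line retrain after loss of sync"

# Extract every dotted-decimal IPv4 address, as in the example above.
ip_pattern = re.compile(r"(\d+\.\d+\.\d+\.\d+)")
print(ip_pattern.findall(entry))  # ['192.168.10.45']

# A more complex expression identifying a particular pattern of substrings,
# capturing the date, time, IP address, event code and free text together.
record_pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<ip>\d+\.\d+\.\d+\.\d+) (?P<event>[A-Z]+-\d+) (?P<text>.*)"
)
match = record_pattern.match(entry)
if match:
    print(match.groupdict())
```

The named groups give each extracted substring a label, so the result can be sorted directly into a structured format (e.g. a table keyed by field name) for further analysis.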
Thus, creation of suitable regular expressions for data extraction has been a useful tool for automating data extraction from semi-structured data, such as log files. The process of creating suitable regular expressions typically involves an operator manually reviewing log files, recognising patterns of substrings that are of significance, and creating a regular expression that can be run to identify that pattern of substrings in future log files. Of course, such a process is time consuming and prone to human error. In an effort to automate this process, programs were developed to scan through homogeneously-structured log data, identify patterns of substrings and automatically create a corresponding regular expression.
An example program which illustrates such a prior art process is known as “RecordBreaker”, details of which can be found at http://cloudera.github.io/RecordBreaker/. RecordBreaker first generates a parser that breaks each record into typed fields. It then guesses appropriate labels for the fields making use of a large dictionary of data formats and types. This approach is good for data sets containing structured records all of which conform to the same format. It cannot deal with heterogeneous formats within a given data set, or extract data that is embedded in human-readable text fields.
There are also programs that use regular expressions to recognise and extract potentially significant patterns of substrings (e.g. dates) embedded within unstructured data sets. However, such programs often misinterpret the significance of a pattern of substrings because the context of the data entry is lost (for example, a regular expression for identifying an event code and associated date and time within a data set cannot recognise whether the date and time is when the event occurred or when the event was recorded).
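The loss of context can be demonstrated with two hypothetical log entries (the formats are assumptions, chosen for illustration). In the first, the timestamp records when the event occurred; in the second, when it was recorded. A context-free regular expression extracts an identical-looking (code, timestamp) pair from both and cannot distinguish the two meanings:

```python
import re

# Two hypothetical log entries. In the first, the timestamp is when the
# event occurred; in the second, it is when the event was recorded.
entries = [
    "EVT-200 occurred at 2023-06-01 14:32:07",
    "EVT-200 logged at 2023-06-01 14:35:19 (occurrence time unknown)",
]

# A context-free pattern: an event code followed by a date and time.
pattern = re.compile(r"(EVT-\d+).*?(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

for entry in entries:
    code, timestamp = pattern.search(entry).groups()
    # Both entries yield a (code, timestamp) pair, but the extraction
    # cannot tell which timestamp semantics applies: the surrounding
    # words ("occurred at" vs "logged at") are discarded.
    print(code, timestamp)
```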
It is therefore desirable to alleviate some or all of the above problems.