This relates to data files and, more particularly, to large files.
At times one is faced with some large file whose structure is known only to the extent that it consists of records, that the records are separated by some (unknown) record delimiter, that each record comprises fields, and that the fields are separated by some (unknown) field delimiter. These files, which are often produced by databases and information processing systems, are called relational tables.
Modern information systems routinely generate and store massive relational tables. In the context of IP networks, for example, this includes a wealth of different types of collected data, such as traffic data (e.g., packet or flow level traces), control data (e.g., router forwarding tables, BGP and OSPF updates), and management data (e.g., fault reports, SNMP traps). It is beneficial to process this type of data into forms that enhance data storage, access, and transmission. In particular, good compression can help to significantly reduce both storage and transmission costs.
Relational data files are typically presented in record-major order, meaning that data appears as a sequence of bytes in the order of records, each of which consists of fields ordered from left to right. On the other hand, in applications such as data compression, faster access to field data, and so on, it is beneficial to think of the data in field-major order, that is, to reorganize the data so that all values of the first field come first, followed by all values of the second field, and so on.
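The reorganization from record-major to field-major order can be illustrated with a minimal Python sketch. It assumes that both delimiters are already known (a newline and a comma are used here purely as examples) and that every record has the same number of fields. The function name `to_field_major` and the sample data are hypothetical and are not taken from any particular system.

```python
def to_field_major(data: bytes,
                   record_delim: bytes = b"\n",
                   field_delim: bytes = b",") -> list[list[bytes]]:
    """Regroup record-major bytes into one list of values per field.

    Assumption: the delimiters are known in advance and every record
    has the same number of fields.
    """
    records = data.split(record_delim)
    # A trailing record delimiter yields one empty final piece; drop it.
    if records and records[-1] == b"":
        records.pop()
    rows = [record.split(field_delim) for record in records]
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("records do not all have the same number of fields")
    # Transpose: element i of the result holds field i of every record.
    return [list(column) for column in zip(*rows)]


sample = b"10.0.0.1,80,tcp\n10.0.0.2,53,udp\n"
columns = to_field_major(sample)
# columns[0] holds every value of the first field, columns[1] the second, etc.
```

Grouping values of the same field together tends to place similar data side by side, which is one reason field-major order can benefit compression.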
To perform such a reorganization of data, it is required to know what information unit (e.g., character) constitutes the record delimiter, and what information unit constitutes the field delimiter. Unfortunately, in applications such as data compression and data structure discovery, one is often presented with a data file without any extra information. Thus, there is a need to develop techniques for identifying the record and field delimiters when a given data file is believed to be relational in nature.
Current techniques to detect delimiters are predominantly manual, requiring human scanning of the raw data for patterns, and combining the scanning with knowledge of what is typically used as delimiters. Such approaches are not effective for handling large volumes of data, so there is a pressing need for tools and techniques to automate the structure-extraction process.