As computing devices have become ubiquitous, the volume of data produced by such computing devices has continuously increased. Organizations often wish to obtain insights about their processes, products, etc., based upon data generated by numerous data sources, wherein the data from different data sources may have different formats. To allow for these insights to be extracted from data, the data must first be “cleaned” such that a client application (such as an application that is configured to generate visualizations of the data) can consume the data and produce abstractions over the data.
In an example, server computing devices of an enterprise can be configured to output log files. These log files have a “flat” structure, in that a log file does not contain a (hierarchical) presentation of the data included in the log file (unlike a JSON document or an XML document). Further, log files tend to comprise unstructured or semi-structured data, rendering it difficult to analyze such data in its native form. For instance, an application executing on a server computing device can generate a log file that indicates the times at which particular actions were undertaken by the server computing device when executing the application. Data lines in such a log file, however, may include semi-structured data, such that executing a query over the log file is problematic. Hence, it is often desirable to extract certain data from a log file and place the data in tabular form, such that a client application can then further process the data using standard tabular analysis tools.
Conventionally, it is cumbersome to extract data from log files and place it in tabular form. One exemplary approach is for a user (e.g., a data cleaner) to manually extract desired data from a log file and place the extracted data in appropriate cells of a table. Log files, however, may include thousands to millions of lines of information and, therefore, this manual approach is often impractical. Another exemplary approach is for a programmer to write a script that extracts data from the log file and populates cells of a table based upon the data extracted from the log file. This approach, however, requires programming expertise. Further, different applications generate log files with different data structures; therefore, writing such a script is often a one-off project, which is an inefficient use of programmer time.
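For illustration only, the scripting approach described above can be sketched as follows. The log format, the field names, and the helper names (`SAMPLE_LOG`, `DATA_LINE`, `extract_rows`, `to_csv`) are all hypothetical and are assumed solely for this example; they are not taken from any particular application. The sketch shows why such a script tends to be a one-off: the regular expression is tied to a single log layout and must be rewritten for any log file having a different structure.

```python
import csv
import io
import re

# Hypothetical log contents, assumed purely for illustration. The log mixes
# comment lines, a header line, and data lines in a flat structure.
SAMPLE_LOG = """\
# server log, format version 1
timestamp level action
2020-01-01T00:00:01 INFO start job=17
2020-01-01T00:00:05 WARN retry job=17
# end of segment
2020-01-01T00:00:09 INFO finish job=17
"""

# Hand-written pattern that matches only the data lines of this one format.
DATA_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<action>\w+) job=(?P<job>\d+)$"
)

FIELDS = ["timestamp", "level", "action", "job"]


def extract_rows(log_text):
    """Return one dict per data line; comment and header lines are skipped."""
    rows = []
    for line in log_text.splitlines():
        match = DATA_LINE.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


def to_csv(rows):
    """Serialize the extracted rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_LOG)
table = to_csv(rows)
```

In this sketch, the comment lines and the header line are excluded only because the programmer encoded the shape of a data line by hand, which is the programming expertise the approach presupposes.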
Relatively recently, programming by example (PBE) technologies have been developed, where programs are synthesized based upon examples provided by end users. The structure of most log files, however, is not well-suited for PBE technologies. More specifically, log files tend to have several different types of lines therein, including but not limited to header lines, comment lines, and data lines. Thus, conventionally, an end user may be required to explicitly mark, as negative examples, lines (such as comment lines and header lines) that do not include data that is of interest to the end user. Further, for conventional PBE technologies to be employed to synthesize a program that is configured to extract data from a log file and place it appropriately in a table, the end user must explicitly identify boundaries of records in the log file. This may be burdensome for the end user, as the task of identifying record boundaries may not match the mental model of the user, who may simply care to extract certain fields.