The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Data processing systems often need to process complex data input files, such as comma separated value (CSV) files, extensible markup language (XML) files, text files, JavaScript Object Notation (JSON) files, and other types of files containing data that is to be stored and accessed through a data repository. Processing the data input files usually involves reading entries of the data input file into a data repository, such as a columnar data store or a database.
To correctly read the data input file into the columnar data store or database, the server computer needs to apply a schema to the file. The schema identifies an encoding for the data in the file, which symbols in the data file are used to delimit lines or rows, which symbols in the data file are used to delimit columns, which information in the file is header information and is excluded from the columns of the data file, and which data format types apply to different columns. Without at least the delimiter information, a server computer is unable to extract data from the data input file for storing in a structured manner. Without the data format type information, the server computer is unable to store the data in a manner that allows the data to be queried efficiently. For example, if the values of a numeric column are stored only as strings, a user is unable to perform a search against the column for values above or below a certain number.
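By way of illustration only, the elements of a schema listed above can be sketched in Python as a small record together with a function that applies it to the raw bytes of a file. The names `Schema`, `apply_schema`, and every field name are hypothetical and are not drawn from any particular system. The sketch also assumes that a simple split on the delimiters is sufficient, which would not hold for files that contain quoted or escaped delimiters.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Schema:
    """Hypothetical schema record; all field names are illustrative."""
    encoding: str          # character encoding of the file, e.g. "utf-8"
    row_delimiter: str     # symbol that delimits lines or rows
    column_delimiter: str  # symbol that delimits columns
    header_rows: int       # leading rows treated as header information
    column_types: List[Callable[[str], object]]  # one converter per column


def apply_schema(raw: bytes, schema: Schema) -> List[list]:
    """Decode the raw bytes and return typed rows, excluding header rows."""
    text = raw.decode(schema.encoding)
    # Drop empty lines, such as the one left after a trailing row delimiter.
    lines = [line for line in text.split(schema.row_delimiter) if line]
    rows = []
    for line in lines[schema.header_rows:]:
        fields = line.split(schema.column_delimiter)
        rows.append([convert(field)
                     for convert, field in zip(schema.column_types, fields)])
    return rows


# Assumed example input: one header row, ";" between columns.
raw = b"name;age\nalice;30\nbob;9\n"
schema = Schema("utf-8", "\n", ";", 1, [str, int])
rows = apply_schema(raw, schema)

# Because the second column is typed as an integer, a numeric range
# search behaves as expected.
older_than_10 = [row[0] for row in rows if row[1] > 10]
```

In this sketch, `older_than_10` contains only `"alice"`. Had the second column been left as strings, the comparison `"9" > "30"` would evaluate as true under character-by-character ordering, which is the kind of incorrect range search that the data format type information is meant to prevent.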
Generally, a schema is identified by the user when a data input file is uploaded. This requires the user uploading the file to be aware of the schema that should be applied to the file and to convey that information to a server computer system. While some users may have adequate technical experience to understand the schema for an uploaded file and convey that information to a server computer, a system that requires users to understand the schema for an uploaded file inherently reduces the usability of the system, such that only experts in file structure can upload files into a data repository system.
Alternative structures would require some type of uniformity in the creation of data input files. For instance, a system may require all files being uploaded to conform to a uniform schema or to include schema information in a header portion of the file. These alternative structures only work for files created going forward. Thus, the usability of a system that requires uniformity is greatly reduced, as it would be unable to handle older files, files created through different applications, or files created by different users for different purposes.
Thus, there is a need for a system that infers a schema for a data input file using only the information stored in the data input file.