The field of the present invention relates to electronic data search and retrieval. In particular, systems and methods are disclosed herein for high-speed searching and filtering of large datasets.
This application is related to subject matter disclosed in (i) U.S. provisional App. No. 61/424,063 filed Dec. 17, 2010 in the name of Roy W. Ward (the '063 application), (ii) U.S. provisional App. No. 61/431,654 filed Jan. 11, 2011 in the names of Roy W. Ward and David S. Alavi (the '654 application), and (iii) U.S. non-provisional application Ser. No. 13/326,326 filed Dec. 15, 2011 in the name of Roy W. Ward (the '326 application). Each of said applications is hereby incorporated by reference as if fully set forth herein, and said applications are hereinafter referred to collectively as the “inline tree applications.”
Many situations exist in which very large amounts of data are generated or collected (e.g., 10⁴, 10⁶, 10⁸, or more data records, each comprising multiple data fields). For data in a dataset to be of any practical use, indicia representing the dataset are stored according to a data structure arranged so that particular pieces of information can be located and retrieved from the dataset. In the pre-digital past, such data structures often comprised printed alphanumeric indicia on suitable media (often including an accompanying printed index), and data search and retrieval were manual functions performed by humans. The introduction of electronic data storage and search capabilities around the middle of the last century revolutionized the ability to store large datasets, and to search for and retrieve specific information from those stored datasets.
Today, alphanumeric indicia representative of a dataset are typically stored according to digital, electronic data structures such as an electronic spreadsheet or an electronic relational database. A spreadsheet (also referred to as a flat file database) can be thought of as a single table with rows and columns, with each row corresponding to a specific data record, and with each column corresponding to a specific data field of that data record. In a simple example (one that will be used repeatedly within the instant specification), each data record can correspond to a registered voter in a dataset of all registered voters in a particular state, e.g., Oregon. The data fields in each data record can include, e.g., last name, first name, middle name or initial, age, gender, marital status, race, ethnicity, religion, other demographic information, street address (likely divided into multiple data fields for street number, street name, and so on), city, state, zip code, party affiliation, voting history, county, U.S. house district, state senate or house district, school district, other administrative districts, and so on.
A relational database typically comprises multiple tables, each comprising multiple records with multiple fields, and relations defined among various fields in differing tables. In the registered voter example given above, a “voter” table might include voter records with name and demographic information in corresponding fields, and an “address” table might include address records that include street address and district information in corresponding fields. A field in the voter table can include a pointer to the corresponding address in the address table, defining a one-to-many relationship between each address and one or more corresponding voters. Other tables and relationships can be defined (including many-to-many relationships and so-called pivot tables to define them).
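The voter and address tables described above can be sketched as follows. This is only an illustrative sketch using Python's standard sqlite3 module; the table names, column names, helper functions (build_example_db, voters_at), and sample rows are hypothetical and are not taken from any actual voter dataset.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with hypothetical voter/address tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE address (
            address_id INTEGER PRIMARY KEY,
            street     TEXT,
            city       TEXT,
            county     TEXT
        );
        CREATE TABLE voter (
            voter_id   INTEGER PRIMARY KEY,
            last_name  TEXT,
            first_name TEXT,
            -- pointer into the address table (one address, many voters)
            address_id INTEGER REFERENCES address(address_id)
        );
        """
    )
    # Made-up sample data: two voters sharing a single address record.
    conn.execute(
        "INSERT INTO address VALUES (1, '12 Elm St', 'Salem', 'Marion')"
    )
    conn.executemany(
        "INSERT INTO voter VALUES (?, ?, ?, ?)",
        [(1, "Doe", "Jane", 1), (2, "Doe", "John", 1)],
    )
    return conn


def voters_at(conn, address_id):
    """Return the names of all voters that point to the given address."""
    rows = conn.execute(
        "SELECT first_name, last_name FROM voter "
        "WHERE address_id = ? ORDER BY voter_id",
        (address_id,),
    ).fetchall()
    return [f"{first} {last}" for first, last in rows]


if __name__ == "__main__":
    db = build_example_db()
    print(voters_at(db, 1))
```

Because each voter record stores only an address_id, the street address and district information are stored once per address rather than once per voter, which is the one-to-many relationship described above.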
Electronic spreadsheets and electronic relational databases have become standard methods for storing digital datasets. They offer nearly unlimited flexibility in arranging the data, for updating the data, for adding new data, and for sorting, searching, filtering, or retrieving data. However, it has been observed that for a very large dataset (e.g., >10⁶ records, or even as few as >10⁴ or >10⁵ records), spreadsheets and databases tend to become unwieldy to store, access, and search. In particular, search and retrieval of information from such a large electronic dataset can become so slow as to render it essentially useless for certain data retrieval applications.
The inline tree applications cited above disclose alternative systems and methods for high-speed searching and filtering of large datasets. In contrast to conventional spreadsheets and relational databases, the dataset is stored as a specialized, highly compressed binary data structure that is generated from a more conventional data structure using a dedicated, specifically adapted conversion program, and that is searched and filtered using a dedicated, specifically adapted search and filter program. The inline tree data structure typically can be stored in a binary file that occupies less than about 1 to 2 bytes per field per record on a digital storage medium (e.g., a dataset of one million records having 100 fields each can be stored in less than about 100 to 200 MB). The significant size reduction relative to a spreadsheet or a relational database (often greater than 10× reduction) can often enable the entire dataset to be loaded into random access memory for searching and filtering, significantly increasing the speed of those operations. The small size and contiguous arrangement of the inline tree data structure also speeds search and filter processes, so that a large dataset (e.g., 10⁶, 10⁸, or more data records each including over 100 data fields) can be searched and filtered in less than about 150 to 500 nanoseconds per record per processor core.
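The storage and timing figures quoted above can be checked with simple arithmetic. The following is a rough sketch only: the function names are hypothetical, 1 MB is assumed to mean 10^6 bytes, and the per-field and per-record figures are the approximate upper bounds stated above, not measured results.

```python
def inline_tree_size_bytes(records, fields, bytes_per_field):
    """Approximate file size, given an assumed bytes-per-field-per-record."""
    return records * fields * bytes_per_field


def scan_seconds(records, ns_per_record, cores=1):
    """Approximate time to search/filter every record once.

    Assumes the work divides evenly across processor cores.
    """
    return records * ns_per_record / 1e9 / cores


if __name__ == "__main__":
    # One million records with 100 fields each, at 1 and 2 bytes per field.
    for bytes_per_field in (1, 2):
        size = inline_tree_size_bytes(1_000_000, 100, bytes_per_field)
        print(f"{bytes_per_field} byte(s)/field: {size / 1e6:.0f} MB")

    # Full pass over 10^6 and 10^8 records on one core, at 150 and 500 ns.
    for records in (10**6, 10**8):
        for ns in (150, 500):
            print(f"{records} records at {ns} ns: {scan_seconds(records, ns):.2f} s")
```

Under these assumptions, one million records of 100 fields each occupy 100 to 200 MB, and a full single-core pass over 10^8 records takes roughly 15 to 50 seconds; a dataset of that size would occupy roughly 10 to 20 GB at the same per-field rate.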
As noted above, inline tree data structures have a highly specialized structure that must be generated by a dedicated, specially adapted conversion program, and must be searched and filtered by a dedicated, specially adapted search and filter program. Unlike a spreadsheet or a relational database, an inline tree data structure cannot be readily modified to include new or updated data. For new or replacement data to be inserted into existing data fields, or for entire new records to be added to the dataset, the conversion program must be executed to generate an entirely new inline tree structure. For new data fields to be added to the dataset, the conversion program must be adapted to accommodate those new fields before generating a new inline tree structure, and the search and filter program must be adapted to accommodate the new inline tree data structure. As noted in the inline tree applications, this loss of flexibility and updateability is the price paid to obtain the small size and speedy searching of the inline tree data structure.
It would be desirable to provide systems and methods that enable high-speed search and retrieval of information from large electronic datasets, at speeds that substantially exceed those attainable with conventional electronic data structures (e.g., conventional spreadsheets and databases), so as to enable data search and retrieval applications that are too slow for practicable use with those conventional data structures, while also enabling alteration or updating of data strings in certain existing data fields or enabling addition of new data fields.