This application is related to subject matter disclosed in (i) U.S. non-provisional application Ser. No. 13/326,326 filed Dec. 15, 2011 in the name of Roy W. Ward (now U.S. Pat. No. 9,002,859), (ii) U.S. non-provisional application Ser. No. 13/347,646 filed Jan. 10, 2012 in the names of Roy W. Ward and David S. Alavi (now U.S. Pat. No. 8,977,656 issued to Ward), and (iii) U.S. non-provisional application Ser. No. 13/733,890 filed Jan. 4, 2013 in the name of Roy W. Ward (now U.S. Pat. No. 9,171,054). Each of said applications and patents is hereby incorporated by reference as if fully set forth herein, and said applications and patents are hereinafter referred to collectively as the “inline tree patents.”
Many situations exist in which very large amounts of data are generated or collected (e.g., 10⁴, 10⁶, 10⁸, or more data records, each comprising a handful, dozens, or a hundred or more data fields). For data in a dataset to be of any practical use, indicia representing the dataset are stored according to a data structure arranged so that information in the dataset can be searched, filtered, listed, enumerated, located, or retrieved. In the pre-digital past, such data structures often comprised printed alphanumeric indicia on suitable media (often including an accompanying printed index), and data search and retrieval were manual functions performed by humans. The introduction of electronic data storage and search capabilities around the middle of the last century revolutionized the ability to store large datasets, and to search, filter, list, enumerate, locate, or retrieve information in the stored dataset.
Today, alphanumeric indicia representative of a dataset are typically stored according to digital, electronic data structures such as an electronic spreadsheet or an electronic relational database. A spreadsheet (also referred to as a flat file database) can be thought of as a single table with rows and columns, with each row corresponding to a specific data record, and with each column corresponding to a specific data field of that data record. In a simple example (one that will be used repeatedly within the instant specification), each data record can correspond to a registered voter in a dataset of all registered voters in a particular state, e.g., Oregon. The data fields in each data record can include, e.g., last name, first name, middle name or initial, age, gender, marital status, race, ethnicity, religion, other demographic information, street address (likely divided into multiple data fields for street number, street name, and so on), city, state, zip code, party affiliation, voting history, county, U.S. house district, state senate or house district, school district, other administrative districts, and so on.
A relational database typically comprises multiple tables, each comprising multiple records with multiple fields, and relations defined among various fields in different tables. In the registered voter example given above, a “voter” table might include voter records with name and demographic information in corresponding fields, and an “address” table might include address records that include street address and district information in corresponding fields. A field in the voter table can include a pointer to the corresponding address in the address table, defining a one-to-many relationship between each address and one or more corresponding voters. Other tables and relationships can be defined (including many-to-many relationships and so-called pivot tables to define them).
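The voter and address tables described above can be sketched with Python's built-in `sqlite3` module. This is only an illustrative sketch: the table layouts, column names (`address_id`, `county`, and so on), sample rows, and the helper names `build_voter_db` and `voters_in_county` are all assumptions made for the example and do not come from any actual voter dataset.

```python
import sqlite3


def build_voter_db():
    """Create a tiny in-memory database with hypothetical voter/address tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE address (
            id INTEGER PRIMARY KEY,
            street TEXT,
            city TEXT,
            county TEXT
        );
        CREATE TABLE voter (
            id INTEGER PRIMARY KEY,
            last_name TEXT,
            first_name TEXT,
            -- pointer from each voter to one address (one address, many voters)
            address_id INTEGER REFERENCES address(id)
        );
        """
    )
    # Made-up sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO address VALUES (?, ?, ?, ?)",
        [(1, "12 Oak St", "Eugene", "Lane"), (2, "7 Elm Ave", "Salem", "Marion")],
    )
    conn.executemany(
        "INSERT INTO voter VALUES (?, ?, ?, ?)",
        [(1, "Smith", "Ann", 1), (2, "Smith", "Bob", 1), (3, "Jones", "Cal", 2)],
    )
    return conn


def voters_in_county(conn, county):
    """Follow the voter -> address pointer to filter voters by county."""
    rows = conn.execute(
        "SELECT v.first_name, v.last_name "
        "FROM voter v JOIN address a ON v.address_id = a.id "
        "WHERE a.county = ? ORDER BY v.id",
        (county,),
    )
    return rows.fetchall()
```

In this sketch two voters share a single address row, which is the one-to-many relationship described above; a query on a district-level field such as county is answered by joining through the pointer field.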
Electronic spreadsheets and electronic relational databases have become standard methods for storing digital datasets. They offer nearly unlimited flexibility in arranging the data, for updating the data, for adding new data, and for sorting, searching, filtering, or retrieving data. However, it has been observed that for a very large dataset (e.g., >10⁶ or more records, or even as few as >10⁴ or >10⁵ records), spreadsheets and databases tend to become unwieldy to store, access, and search. In particular, search and retrieval of information from such a large electronic dataset can become so slow as to render it essentially useless for certain data retrieval applications.
The inline tree patents cited above disclose alternative systems and methods for high-speed searching and filtering of large datasets. As disclosed in those patents, and in contrast to conventional spreadsheets and relational databases, the dataset is stored as a specialized, highly compressed binary data structure that is generated from a more conventional data structure using a dedicated, specifically adapted conversion program; that binary data structure is searched and filtered using a dedicated, specifically adapted search and filter program. The inline tree data structure typically can be stored in a binary file that occupies less than about 1 to 2 bytes per field per record on a digital storage medium (e.g., a dataset of one million records having 100 fields each can be stored in less than about 100 to 200 MB). The significant size reduction relative to a spreadsheet or a relational database (often greater than 10× reduction) can often enable the entire dataset to be loaded into random access memory for searching and filtering, significantly increasing the speed of those operations. The small size and contiguous arrangement of the inline tree data structure also speed search and filter processes, so that a large dataset (e.g., 10⁶, 10⁸, or more data records each including over 100 data fields) can be searched and filtered in less than about 150 to 500 nanoseconds per record per processor core.
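The storage and timing figures quoted above follow from simple multiplication, which the following sketch makes explicit. The function names are hypothetical, and the division of scan time by the number of cores assumes ideal parallel scaling, which is an assumption of this example rather than a statement from the inline tree patents.

```python
def estimate_size_bytes(records, fields, bytes_per_field):
    """Upper-bound file size for a given bytes-per-field-per-record figure."""
    return records * fields * bytes_per_field


def estimate_scan_seconds(records, ns_per_record, cores=1):
    """Time to scan every record once, assuming ideal scaling across cores."""
    return records * ns_per_record * 1e-9 / cores


# One million records of 100 fields each, at 1 and 2 bytes per field per record.
low_mb = estimate_size_bytes(1_000_000, 100, 1) / 1e6
high_mb = estimate_size_bytes(1_000_000, 100, 2) / 1e6

# Full scan of the same dataset on one core at 150 ns and 500 ns per record.
fast_s = estimate_scan_seconds(1_000_000, 150)
slow_s = estimate_scan_seconds(1_000_000, 500)
```

With these inputs the size bounds work out to 100 MB and 200 MB, matching the example in the text, and a single-core pass over one million records takes roughly 0.15 to 0.5 seconds.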
In an additional modification (disclosed in the second and third inline tree applications), a so-called clump header table can be employed to indicate groups of data records that share a large number of data field values (e.g., geographically constrained field values such as country, city, congressional district, school district, and so on, that cannot appear in arbitrary combinations) and to direct the search and filter program to only those portions of the inline tree data structure for which the clumped data field values match the search or filter criteria. In a further modification (disclosed in the third of the inline tree applications), an auxiliary, parallel data structure can be employed along with the inline tree data structure to store additional or replacement data field values. The search and filter program can be adapted to interrogate the inline tree data structure and the auxiliary data structure in parallel. The auxiliary data structure can be employed for enabling modifications to certain data field values without regenerating the entire inline tree data structure, to enable different users of the inline tree data structure to append their own additional data fields, to facilitate aggregation of certain data records for licensing or purchase, or for other purposes.
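The two ideas in the preceding paragraph, skipping whole groups of records by checking shared field values first and consulting a parallel structure for additional or replacement values, can be illustrated with a toy sketch. It uses ordinary Python lists and dictionaries and is not the binary layout or the search program of the inline tree patents; the names `Clump`, `search`, and `aux`, and the choice to key auxiliary values by record index, are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Clump:
    shared: dict  # field values common to every record in the clump
    start: int    # index of the clump's first record in the flat record list
    count: int    # number of records in the clump


def search(records, clumps, clump_criteria, record_criteria, aux=None):
    """Return indices of matching records, skipping non-matching clumps."""
    aux = aux or {}
    hits = []
    for clump in clumps:
        # Check the shared values once; on a mismatch, skip every record in the clump.
        if any(clump.shared.get(k) != v for k, v in clump_criteria.items()):
            continue
        for i in range(clump.start, clump.start + clump.count):
            # Auxiliary values, where present, replace or extend the stored ones.
            merged = {**records[i], **aux.get(i, {})}
            if all(merged.get(k) == v for k, v in record_criteria.items()):
                hits.append(i)
    return hits


# Made-up data: four records in two clumps, plus one auxiliary replacement value.
records = [{"party": "D"}, {"party": "R"}, {"party": "D"}, {"party": "D"}]
clumps = [Clump({"city": "Eugene"}, 0, 2), Clump({"city": "Salem"}, 2, 2)]
aux = {3: {"party": "R"}}
```

Here a search for `{"city": "Salem"}` never examines the first two records, and the replacement value for record 3 takes effect without rebuilding `records`, which mirrors, in miniature, the purposes described above.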
As noted above, inline tree data structures of the inline tree patents have a highly specialized structure that must be generated by a dedicated, specially adapted conversion program, and must be searched and filtered by a dedicated, specially adapted search and filter program. Unlike a spreadsheet or a relational database, an inline tree data structure is unwieldy to modify to include new or updated data. For new or replacement data to be inserted into existing data fields, or to add entire new records to the dataset, often the conversion program is executed to generate an entirely new inline tree structure. For new data fields to be added to the dataset, the conversion program must be adapted to accommodate those new fields before generating a new inline tree structure, and the search and filter program must be adapted to accommodate the new inline tree data structure. As noted in the inline tree patents, this loss of flexibility and updateability is the price paid to obtain the small size and speedy searching of the inline tree data structure.