The field of the present invention relates to electronic data search and retrieval. In particular, systems and methods are disclosed herein for high-speed searching and filtering of large datasets.
Many situations exist in which very large amounts of data are generated or collected (e.g., 104, 105, 106, or more data records, each comprising multiple data fields). For data in a dataset to be of any practical use, indicia representing the dataset are stored according to a data structure arranged so that particular pieces of information can be located and retrieved from the dataset. In the pre-digital past, such data structures often comprised printed alphanumeric indicia on suitable media (often including an accompanying printed index), and data search and retrieval were manual functions performed by humans. The introduction of electronic data storage and search capabilities around the middle of the last century revolutionized the ability to store large datasets, and to search for and retrieve specific information from those stored datasets.
Today, alphanumeric indicia representative of a dataset are typically stored according to digital, electronic data structures such as an electronic spreadsheet or an electronic relational database. A spreadsheet (also referred to as a flat file database) can be thought of as a single table with rows and columns, with each row corresponding to a specific data record, and with each column corresponding to a specific data field of that data record. In a simple example (one that will be used repeatedly within the instant specification), each data record can correspond to a registered voter in a dataset of all registered voters in a particular state, e.g., Oregon. The data fields in each data record can include, e.g., last name, first name, middle name or initial, age, gender, marital status, race, ethnicity, religion, other demographic information, street address (likely divided into multiple data fields for street number, street name, and so on), city, state, zip code, party affiliation, voting history, county, U.S. house district, state senate/house district, school district, other administrative districts, and so on.
A relational database typically comprises multiple tables, each comprising multiple records with multiple fields, and relations defined among various fields in differing tables. In the registered voter example given above, a “voter” table might include voter records with name and demographic information in corresponding fields, and an “address” table might include address records that includes street address and district information in corresponding fields. A field in the voter table can include a pointer to the corresponding address in the address table, defining a one-to-many relationship between each address and one or more corresponding voters. Other tables and relationships can be defined (including many-to-many relationships and so-called pivot tables to define them).
Electronic spreadsheets and electronic relational databases have become standard methods for storage of digital datasets. They offer nearly unlimited flexibility in arranging the data, for updating the data, for adding new data, and for sorting, searching, filtering, or retrieving data. However, it has been observed that for a very large dataset (e.g., >106 or more records, or even as few as >104 or >105 records), spreadsheets and databases tend to become unwieldy to store, access, and search. In particular, search and retrieval of information from such a large electronic dataset can become so slow as to render it essentially useless for certain data retrieval applications.
It would be desirable to provide systems and methods that enable high-speed search and retrieval of information from large electronic datasets that substantially exceed search and retrieval speeds from conventional electronic data structures (e.g., conventional spreadsheets and databases), so as to enable data search and retrieval applications that are too slow for practicable use with those conventional data structures.