Electronic data is being created and recorded in staggering amounts as our world becomes increasingly computerized. Unfortunately, finding particular data within discrete data sets becomes increasingly difficult as the amount of data grows. Efficiently searching for relevant data, whether in databases or in distributed environments such as the World Wide Web (the “Web”), typically includes accessing one or more electronic indexes of the data within the data set. In generalized computing environments, the index is created and maintained by various commercially available database products. In the context of the Web, indexes are created and maintained by a variety of search engines accessible via the Internet. The challenge in most environments is keeping the indexes current, so that they reflect the data as it is added, removed, and updated in the environment.
Indexes are commonly used to provide access to that electronic information. Inverted indexes are a type of index used in databases and search engines for indexing many-to-many relationships. An inverted index typically consists of a plurality of records, with each record having a key and one or more associated references. Each reference indicates the presence of the key in the referenced material. For example, an index of Web pages may contain many records, each with a word identifier as the key and a reference to the Uniform Resource Locator (“URL”) of the Web document that contains the word.
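By way of illustration only, the key-and-references structure described above can be sketched in a few lines of Python. The function name `build_inverted_index`, the whitespace tokenization, and the example URLs are assumptions made for this sketch and are not drawn from any particular database product or search engine.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word (the key) to the sorted list of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Assumption: words are separated by whitespace and compared
        # case-insensitively.
        for word in text.lower().split():
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}

# Hypothetical Web pages, for illustration only.
pages = {
    "http://example.com/a": "inverted index basics",
    "http://example.com/b": "search engine index",
}
index = build_inverted_index(pages)
```

In this sketch, the record for the key `index` references both URLs, while the record for the key `search` references only the second URL, which reflects the many-to-many relationship between words and documents.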
The process of generating an inverted index is referred to as “inverting the index.” The inversion process is resource intensive and can often require large amounts of memory, disk space, or time. Typical inversion methods partition the data into postings written to a posting file, divide the postings into intermediate files of a manageable size, and then progressively read, sort, divide, recombine, and write new intermediate files until all the postings are sorted. Once the latest version of the intermediate files represents all the postings properly sorted, the intermediate files are merged into an inverted index.
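The first two steps of such a method, partitioning the data into postings and dividing the postings into pieces of a manageable size, might be sketched as follows. This is a minimal in-memory illustration: the names `make_postings` and `split_into_runs`, the `max_postings` limit, and the sample documents are hypothetical, and Python lists stand in for the posting file and the intermediate files.

```python
def make_postings(documents):
    """Emit one (term, doc_id) posting per term occurrence, in document order."""
    postings = []
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings.append((term, doc_id))
    return postings

def split_into_runs(postings, max_postings):
    """Divide the postings into consecutive chunks of at most max_postings."""
    return [postings[i:i + max_postings]
            for i in range(0, len(postings), max_postings)]

# Hypothetical documents keyed by document identifier.
documents = {1: "b a", 2: "c a b"}
postings = make_postings(documents)
# Assumption: only two postings fit in memory at a time.
runs = split_into_runs(postings, max_postings=2)
```

Each chunk in `runs` is still unsorted at this point; the sorting and recombining steps operate on these chunks.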
By way of example, an inversion method is illustrated in FIG. 15. The inversion occurs in multiple passes, with an earlier pass 1502 and a later pass 1504 illustrated. Two intermediate files 1506 and 1508 are created by the earlier pass 1502, and two new intermediate files 1510 and 1512 are created by the later pass 1504. While only two intermediate files are discussed so as to simplify the example, in most cases the number of intermediate files required to perform the inversion method is determined by the number and size of the postings to invert, the amount of available memory, and the processing power of the computer system, among other possible considerations.
Each intermediate file 1506, 1508 in the earlier pass 1502 includes four postings, 1520-1526 and 1528-1534, respectively. The example discussed assumes that the inversion creates an index for the search of documents (or document identifiers) by terms (or term identifiers) and that each intermediate file consumes all of the available memory.
Each intermediate file 1506, 1508 begins with unsorted postings 1520-1526 and 1528-1534, respectively. The first intermediate file 1506 is read into memory and sorted, resulting in unordered postings 1520-1526 being reordered by termID as posting 1526 (termID2), posting 1522 (termID3), posting 1520 (termID4) and posting 1524 (termID4). The first intermediate file 1506 is then written to storage memory. Similarly, the second intermediate file 1508 is then read into memory and sorted, resulting in unordered postings 1528-1534 being reordered by termID as posting 1528 (termID1), posting 1530 (termID2), posting 1534 (termID2) and posting 1532 (termID3).
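The per-file sort of the earlier pass can be reproduced with a short Python sketch. Each posting is modeled as a `(termID, posting)` pair, where the second element is simply the reference numeral of the posting in FIG. 15, used here as a stand-in for a document identifier. The sketch assumes that the unsorted postings appear in each file in reference-numeral order and that the sort is stable, so postings with equal termIDs keep their original relative order.

```python
def sort_by_term_id(postings):
    """Return the postings ordered by termID; sorted() is a stable sort."""
    return sorted(postings, key=lambda posting: posting[0])

# (termID, posting reference numeral); initial order is an assumption.
file_1506 = [(4, 1520), (3, 1522), (4, 1524), (2, 1526)]
file_1508 = [(1, 1528), (2, 1530), (3, 1532), (2, 1534)]

sorted_1506 = sort_by_term_id(file_1506)
sorted_1508 = sort_by_term_id(file_1508)

print(sorted_1506)  # [(2, 1526), (3, 1522), (4, 1520), (4, 1524)]
print(sorted_1508)  # [(1, 1528), (2, 1530), (2, 1534), (3, 1532)]
```

The printed orders match the reordering described for the intermediate files 1506 and 1508.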
In the example, the postings in intermediate files 1506 and 1508 remain unordered with respect to each other. To continue the sort of the entire set of postings 1520-1534, the intermediate files 1506 and 1508 must be merged 1540 and sorted again. This step occurs in the later pass 1504. The merge operation is much like shuffling a deck of cards. Some of the records 1552 from the intermediate file 1506 are combined with some of the records 1554 from intermediate file 1508 to form a new intermediate file 1510. Usually, this step involves reading the postings 1552 from storage, reading the postings 1554 from storage, and then writing the new intermediate file 1510 to storage. New intermediate file 1512 is similarly created from the combination of the remaining postings 1556 from intermediate file 1506 and the remaining postings 1558 from the intermediate file 1508.
Once the new intermediate files 1510 and 1512 are defined, the later pass 1504 sorts the individual intermediate files 1510 and 1512 in the same manner as in the earlier pass 1502. If the entire set of postings 1520-1534 remains unordered, the intermediate files 1510 and 1512 are combined in another pass (not shown), and the process repeats, with each later pass operating on the intermediate files of the pass before it, until the entire set of postings 1520-1534 is ordered. An index 1570 is then created by merging the intermediate files from the final pass.
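The final step, merging sorted intermediate files into an index, might be sketched as follows. This sketch swaps in a standard k-way merge (Python's `heapq.merge`) and is not necessarily the merge used in FIG. 15; it also works entirely in memory, whereas the method described above reads and writes intermediate files on storage. The function name `merge_runs_into_index` is hypothetical, and the posting reference numerals again stand in for document identifiers.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_runs_into_index(sorted_runs):
    """Merge runs already sorted by termID and group postings per termID."""
    # heapq.merge lazily combines sorted inputs into one sorted stream.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    return {term_id: [ref for _, ref in group]
            for term_id, group in groupby(merged, key=itemgetter(0))}

# Two runs, each sorted by termID: (termID, posting reference numeral).
run_a = [(2, 1526), (3, 1522), (4, 1520), (4, 1524)]
run_b = [(1, 1528), (2, 1530), (2, 1534), (3, 1532)]

index_1570 = merge_runs_into_index([run_a, run_b])
```

Here `index_1570` holds one record per termID (1 through 4), with every posting from both runs attached to its termID. A streaming merge of this kind needs each run to be sorted internally but does not need the runs to be ordered with respect to each other.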
Conventional creation and sorting of multiple layers of intermediate files is especially expensive in terms of time. Reading intermediate files from a disk drive, for example, is believed to be much slower than moving postings in memory, and writing intermediate files is believed to be even slower. The repeated use of intermediate files to sort large numbers of postings slows the creation of indexes, potentially making them less inclusive and current. There is an unmet need for improved systems and methods for generating inverted indexes and sorting large amounts of ordered pairs of data, particularly by eliminating or reducing reliance on time sinks, such as intermediate files.