An inverted list <Kj, {p0, ..., pi, ..., pN−1}> is a data structure that stores a mapping from a key Kj (for instance, a database key or a query term) to a list of one or more pointers pi to objects in a structured, semi-structured or unstructured database. In structured databases, where the key is a database key composed of one or more attributes, the list contains the logical or physical pointers to the records containing the database key. As an example, let the database key be the attribute Color for a relation Car. The inverted list for Color=red contains the logical or physical pointers to all the records in the Car relation for which the attribute Color has the value red. A physical pointer is an address of a specific location, whereas a logical pointer is a unique object identifier that can be mapped to a physical location by some translation mechanism. As an example, a logical pointer to a database record can be a unique identifier assigned by the system (typically an increasing integer) that is mapped to a physical location by a translation table.
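The Color=red example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy Car relation, the function name build_inverted_lists, and the modeling of a physical location as a (page, slot) pair are all hypothetical and do not come from the text.

```python
# Hypothetical translation table: logical pointer -> physical location.
# Modeling a physical location as a (page, slot) pair is an assumption.
translation_table = {
    0: (0, 0),
    1: (0, 1),
    2: (1, 0),
    3: (1, 1),
}

# Toy Car relation, keyed by a system-assigned increasing integer
# that plays the role of the logical pointer.
cars = {
    0: {"Model": "A", "Color": "red"},
    1: {"Model": "B", "Color": "blue"},
    2: {"Model": "C", "Color": "red"},
    3: {"Model": "D", "Color": "green"},
}


def build_inverted_lists(relation, attribute):
    """Map each value of `attribute` to an ordered list of logical pointers."""
    lists = {}
    for record_id in sorted(relation):  # ascending ids keep each list ordered
        key = relation[record_id][attribute]
        lists.setdefault(key, []).append(record_id)
    return lists


color_index = build_inverted_lists(cars, "Color")

# Inverted list for Color=red, as logical pointers.
red_pointers = color_index["red"]  # [0, 2]

# The same list resolved to physical pointers through the translation table.
red_locations = [translation_table[p] for p in red_pointers]  # [(0, 0), (1, 0)]
```

Keeping logical pointers in the list and resolving them only on access means that a record can be moved by updating the translation table alone, without touching any inverted list that refers to it.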
Inverted lists are used in many information technology applications. In addition to structured databases, an important application area is text retrieval. In this context, the key represents a term (or, usually, a unique term identifier) and the list enumerates all the documents (usually represented by unique identifiers) in the collection containing that term. Another important area is represented by dynamic taxonomies (Sacco, G., Dynamic taxonomy process for browsing and retrieving information in large heterogeneous data bases, U.S. Pat. No. 6,763,349, U.S. Pat. No. 7,340,451) where the deep extension of concepts (i.e. the set of objects classified under a concept or one of its descendants) can be represented by inverted lists, where the key is a concept identifier and the list includes all the identifiers for the objects in the deep extension of that concept.
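Both uses described above can be illustrated with a minimal Python sketch. The document collection, the concept hierarchy, and the names build_term_index and deep_extension are hypothetical and chosen only for illustration; terms are used directly as keys here, whereas a real system would usually use term identifiers.

```python
def build_term_index(documents):
    """Text retrieval: map each term to the ordered list of document ids."""
    index = {}
    for doc_id in sorted(documents):
        # set() ensures a document appears once per term, however often
        # the term occurs in it.
        for term in set(documents[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index


docs = {1: "red car", 2: "blue car", 3: "red bicycle"}
term_index = build_term_index(docs)
# term_index["red"] is [1, 3]; term_index["car"] is [1, 2]

# Dynamic taxonomy (assumed toy hierarchy): each concept lists its children
# and the objects classified directly under it.
children = {"vehicle": ["car", "bicycle"], "car": [], "bicycle": []}
classified = {"vehicle": [4], "car": [1, 2], "bicycle": [3]}


def deep_extension(concept):
    """Objects classified under `concept` or any of its descendants."""
    objects = set(classified.get(concept, []))
    for child in children.get(concept, []):
        objects.update(deep_extension(child))
    return sorted(objects)


# One inverted list per concept: key = concept identifier,
# list = object identifiers in the deep extension of that concept.
concept_index = {c: deep_extension(c) for c in children}
# concept_index["vehicle"] is [1, 2, 3, 4]
```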
The key in an inverted list may be of variable size. The list is often kept ordered because ordering allows list operations such as intersection, union and subtraction (which implement boolean operations on keys) to be performed in linear time.
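To make the linear-time claim concrete, the following Python sketch implements the three operations as a single merge-style pass over two ordered lists. It assumes that each list is strictly increasing (no duplicate pointers); the function names and sample lists are illustrative only.

```python
def intersect(a, b):
    """Pointers present in both lists (boolean AND)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """Pointers present in either list (boolean OR)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


def subtract(a, b):
    """Pointers present in a but not in b (boolean AND NOT)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            j += 1
    out.extend(a[i:])  # whatever remains in a cannot occur in b
    return out


red = [1, 3, 5, 8]
sedan = [2, 3, 8, 9]
print(intersect(red, sedan))  # [3, 8]
print(union(red, sedan))      # [1, 2, 3, 5, 8, 9]
print(subtract(red, sedan))   # [1, 5]
```

Each loop iteration advances at least one of the two cursors, so every operation makes at most len(a) + len(b) comparisons. On unordered lists, the same operations would require either sorting first or an auxiliary hash structure.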
Inverted lists are stored in inverted indices, which allow quick access to the inverted list corresponding to a search key. Such inverted indices are stored in computer memory, and generally on secondary storage. They are usually organized as B-trees or variations (Comer, D. The Ubiquitous B-Tree. ACM Computing Surveys, 11, 2 (June 1979), 121-137; Bayer, R. and Unterauer, K. 1977. Prefix B-trees. ACM Trans. Database Syst. 2, 1 (March 1977), 11-26). These structures order records in such a way that the physical order of records in the file is the same as the logical order of the index keys. In the present case, the records inserted in the index are the inverted lists, and the index keys are the inverted list keys. Indices are usually organized into fixed-size pages, a page being the minimum access unit.
Inverted lists and inverted indices present several problems. First, for large information bases or information bases with a non-uniform distribution of keys, many inverted lists may contain a large number of pointers and, in general, span several pages. In these cases, insertions and deletions are difficult and expensive if the pointers in the inverted list are to be kept ordered so that list operations can be performed in linear time.
Another problem of inverted lists and inverted indices is the space required by these structures. In naïve implementations, the overhead of these structures can be so large as to be impractical for large applications. For instance, in text-retrieval applications, the size of the inverted index can be as large as or larger than the size of the indexed text corpus.
Finally, even though list operations on sorted lists require only linear time, they can still be too expensive in very large applications.