Modern search engines, from those that power internet search sites such as Google and MSN to open-source engines such as Lucene, have become extremely useful tools for rapidly locating information, documents, and multimedia content from a variety of sources. A typical modern search engine builds an index representation of the terms in a document in order to locate relevant documents. This index representation can be thought of as a lookup table that maps a particular search term to the set of documents relevant to it. The lookup table is ordered sequentially by search term, and each entry consists of one search term and all documents relevant to that search term. Given a search term, locating its entry in the lookup table returns a list of relevant documents. Similarly, combinations of search terms can be handled by taking the set union or set intersection of the individual entry lookups. This index representation is generally known as an inverted index.
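The lookup table described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and lower-casing, and the names build_inverted_index, search_all, and search_any, as well as the sample documents, are hypothetical and chosen only for illustration; production engines such as Lucene use considerably more elaborate structures.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: terms are lower-cased, whitespace-separated words.
        for term in text.lower().split():
            index[term].add(doc_id)
    # Order the entries by term, mirroring the sequentially ordered table.
    return dict(sorted(index.items()))


def search_all(index, terms):
    """Documents containing every term (set intersection of the entries)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_any(index, terms):
    """Documents containing at least one term (set union of the entries)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.union(*postings) if postings else set()


# Hypothetical sample collection.
docs = {
    "d1": "the match ended in a draw",
    "d2": "weather report and match highlights",
    "d3": "weather report for the weekend",
}
index = build_inverted_index(docs)
both = search_all(index, ["weather", "match"])
either = search_any(index, ["draw", "weekend"])
```

Note that each entry records only which documents contain a term, not where in the document the term occurs.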
In the case of web searches and homepage searches, locating a document is all that is desired, and there is no need to additionally locate the search terms within the document itself. However, in the case of multi-page or voluminous documents such as user manuals and programming guides, or multimedia content (video and audio files) that spans several minutes, it becomes important to locate not only the document relevant to a given search query, but also the appropriate location of the search term within the document itself. For example, if a user wants to locate news broadcasts on a particular sporting event, the user would like to access not only the relevant broadcasts, but also the precise time slot within each broadcast where the sporting event was mentioned. Similarly, for user manuals, the relevant search terms might be located deep within the document, and it would be ideal to be able to jump directly to the exact location of the term in the document.
The typical solution to this problem is either to split the document into many sub-documents and index each of these sub-documents individually, or to scan the document linearly to locate the search terms within it once the document has been identified as relevant. However, splitting the document results in a significant loss of contextual information, because the document is chunked arbitrarily into sub-documents. Further, the cost of a linear scan is prohibitive, especially when there are multiple matching documents and each document is long (e.g., 1000-page PDF documents are not uncommon these days).
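Both drawbacks can be seen in a small sketch. It assumes fixed-size chunking by word count and a word-level scan; the helper names chunk, linear_scan, and phrase_in, and the sample text, are hypothetical and serve only to illustrate the two approaches described above.

```python
def chunk(words, size):
    """Split a document into fixed-size sub-documents (arbitrary chunking)."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def linear_scan(words, term):
    """Return every word offset at which term occurs.

    Every word is visited once, so the cost grows with document length.
    """
    return [i for i, w in enumerate(words) if w == term]


def phrase_in(words, phrase):
    """True if the words of phrase appear consecutively in words."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


# Hypothetical sample document.
words = "press the reset button then hold the power button".split()
phrase = ["reset", "button"]

offsets = linear_scan(words, "button")
found_in_document = phrase_in(words, phrase)
# With three-word chunks, "reset" ends one chunk and "button" starts the next.
found_in_any_chunk = any(phrase_in(c, phrase) for c in chunk(words, 3))
```

In this sample the phrase is present in the whole document but in none of the three-word chunks, which is the kind of context loss that arbitrary chunking can cause, while the scan must touch every word of every matching document to report offsets.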
An additional problem is that during indexing and re-indexing, a searching application typically scans the document and creates an inverted word index to represent the document internally. This process is fairly expensive, especially for applications where new documents are continually added, requiring regularly scheduled re-indexing of documents.
There is a need for a positional representation of data that makes possible efficient indexing of documents and retrieval of searched information.