Throughout the present specification, the following definitions are assumed.
Archives refer to a collection of records, and may also refer to the location in which these records are kept. Archives are made up of records which, in general, have been created continuously, e.g. during the course of an organization's life. Usually, an archive consists of records which have been selected for permanent or long-term preservation. In computer science, creating archives can be a cumbersome process wherein billions of data items are parsed, selected and stored. In addition, said archives may need to be updated.
Automatic indexing is also known. Automatic indexing begins with texts, and leads to inverted index term lists or document vectors, together with a dictionary.
A document vector is, e.g., for a given document, a list of all words comprised therein along with the number of times each word appears. This may take the form ([list,5],[vector,3]).
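By way of illustration only, a document vector of the above form may be sketched as follows. The function name document_vector and the tokenization rule (lower-casing, then keeping runs of letters and digits) are assumptions made for this example and are not prescribed by the present specification.

```python
from collections import Counter
import re


def document_vector(text):
    """Return (word, count) pairs for one document, most frequent first."""
    # Assumed tokenization: lower-case the text, keep runs of letters/digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words).most_common()


# A document in which "list" occurs five times and "vector" three times.
sample = "list vector list vector list vector list list"
print(document_vector(sample))
```

For the sample text, the result corresponds to the form ([list,5],[vector,3]) given above.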
A dictionary is, e.g., a list of all unique words and their identifiers. Words can furthermore be conflated in the index by stemming or simple plural removal. The steps in automatic indexing are typically the following. First, documents (e.g., an article in an encyclopedia) are identified. Second, fields (e.g., title, author, abstract) are identified. Finally, terms such as names, dates, compounds, words, abbreviations, acronyms, numbers and other special characters are parsed and, if necessary, transformed to standard forms.
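A dictionary with simple plural removal may, for example, be sketched as below. This is a minimal sketch under stated assumptions: the names strip_plural and build_dictionary are hypothetical, identifiers are assumed to be integers assigned in order of first appearance, and the plural rule is deliberately naive (it is not a stemmer). Field identification is not covered by this example.

```python
import re


def strip_plural(word):
    """Naive plural removal (illustrative assumption, not a real stemmer)."""
    # Drop a trailing "s" from words longer than three letters,
    # leaving words that end in "ss" (e.g. "class") unchanged.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def build_dictionary(documents):
    """Map each unique conflated word to an integer identifier."""
    dictionary = {}
    for text in documents:
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            term = strip_plural(token)
            if term not in dictionary:
                # Identifiers are assigned in order of first appearance.
                dictionary[term] = len(dictionary)
    return dictionary


docs = ["Vectors list words", "A word list"]
print(build_dictionary(docs))
```

In this sketch, "Vectors" and "words" are conflated to "vector" and "word", so that "word" in the second document reuses the identifier already assigned in the first.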
An inverted index is an index structure storing a mapping from words to their locations in a document or a set of documents, thereby enabling full-text search. An inverted index is regarded as one of the most important data structures used in search engines. Such an associative array is a multimap (more than one value may be associated with a given key), and can be implemented in many ways. It could be a hash table, where the keys are words (strings), and the values are arrays of locations. There are two main variants of inverted indexes. An inverted file index contains, for each word, a list of references to all the documents in which it occurs. A full inverted index additionally contains information about where in the documents the words appear. This could be implemented in several ways. The simplest may be a list of all pairs of document identifiers and local positions. An inverted file index needs less space, but also offers less functionality. It allows for searching terms (as in a search engine), but not phrases.
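The two variants described above may be sketched as follows, purely as an illustration. The function names, the whitespace tokenization and the use of word offsets as local positions are assumptions made for this example. The hash table is a Python dict whose keys are words and whose values are lists of locations.

```python
from collections import defaultdict


def tokenize(text):
    # Assumed tokenization: lower-case, split on whitespace.
    return text.lower().split()


def build_inverted_file_index(docs):
    """word -> sorted list of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def build_full_inverted_index(docs):
    """word -> list of (document identifier, local position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index[word].append((doc_id, position))
    return dict(index)


def phrase_search(full_index, phrase):
    """Return identifiers of documents containing the phrase's words consecutively."""
    words = tokenize(phrase)
    if not words:
        return []
    # Candidate phrase starts are the locations of the first word.
    candidates = set(full_index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        locations = set(full_index.get(word, []))
        # Keep a start only if this word occurs exactly `offset` positions later.
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in locations}
    return sorted({d for d, _ in candidates})


docs = {1: "full text search", 2: "search full text index", 3: "text full search"}
file_index = build_inverted_file_index(docs)
full_index = build_full_inverted_index(docs)
print(file_index["full"], file_index["text"])
print(phrase_search(full_index, "full text"))
```

In this example the inverted file index lists all three documents for both "full" and "text", and so cannot tell which of them contain the phrase "full text". The positions kept in the full inverted index allow document 3, where the two words appear in the reverse order, to be excluded.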
With the current development of computer-implemented indexing, it has become a common task for software to build and/or update several indexes based on one or more documents (or, more generally, a set of data). Typically, the creation of such indexes requires indexing the relevant reference data. However, the above operations are usually not optimized, leading to excessive computational effort and loss of time.
There is therefore a need for a method, a computer program product and a system allowing for optimizing the creation of such indexes. Preferably, said method should further optimize the update of said indexes.