A typical way of structuring large volumes of data so that they can be easily accessed is to index the documents. This means that a document or a group of documents is referenced by an indexing term. A collection of such indexing terms then forms an index. An example is shown in FIG. 4.
In FIG. 4, address documents 410, 420, 430 contain address data on individual persons. The documents may be characterized by the individual elements which they contain, one of which is the family name of the person to whom the document relates.
This information may then be used for building an index 440, shown on the left-hand side of FIG. 4. This index contains a list of the family names which are contained in the documents, and each of the elements of the index references an individual document, as can be seen from FIG. 4.
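The construction of such an index can be sketched in a few lines of Python. This is only an illustrative sketch: the document contents, the field names and the helper `build_index` are assumptions made for the example, and only the reference numerals 410, 420, 430 are taken from FIG. 4.

```python
from collections import defaultdict

# Hypothetical address documents keyed by the reference numerals of FIG. 4.
# The names and field labels are invented for illustration only.
documents = {
    410: {"family_name": "Miller", "first_name": "Anna", "street": "Main Street"},
    420: {"family_name": "Smith", "first_name": "John", "street": "High Street"},
    430: {"family_name": "Jones", "first_name": "Mary", "street": "Park Lane"},
}


def build_index(docs, field):
    """Map each value of `field` to the list of documents that contain it."""
    index = defaultdict(list)
    for doc_id, doc in docs.items():
        index[doc[field]].append(doc_id)
    return dict(index)


# Each index element (a family name) references the matching document(s).
family_name_index = build_index(documents, "family_name")
```

The same helper could be called with `"first_name"` or `"street"` to obtain an index over a different element of the documents.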
This is a classical way of organizing information in a structured manner such that the documents containing a desired piece of information may be retrieved and accessed from a large volume of documents.
Indexes can be built for several elements such as the family name, the first name, the street name, etc. What indexes have in common is that the elements of an index all in some sense have the same “meaning”, such as “family name”, “first name”, or the like.
Therefore, the individual elements which are used to build an index are consistent with respect to the information they contain when viewed from a more abstract level. In other words, all elements of the index have the same “meaning”.
Another, more general approach for ordering documents is simply to characterize one or more documents by a certain term, and then to build an index from the individual terms used in this way. In such a case the index elements do not have to have a consistent “meaning”, although one may consider that they all have the same meaning in the sense that each of the elements characterizes or describes the one or more documents which it references.
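A minimal sketch of this more general approach is given below. The document identifiers, the characterizing terms and the helper `build_term_index` are hypothetical and serve only to show that the index elements need not share a common “meaning” and that one term may reference several documents.

```python
from collections import defaultdict

# Hypothetical documents, each characterized by freely chosen terms.
# The terms mix document types, names, years and places on purpose.
characterizations = {
    "doc_a": ["invoice", "Miller", "2003"],
    "doc_b": ["contract", "Miller"],
    "doc_c": ["invoice", "Berlin"],
}


def build_term_index(chars):
    """Map each characterizing term to the documents it describes."""
    index = defaultdict(set)
    for doc_id, terms in chars.items():
        for term in terms:
            index[term].add(doc_id)
    # Sort the document identifiers so that the result is deterministic.
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


term_index = build_term_index(characterizations)
```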
After an index has been built, it can be used for querying and accessing the set of documents ordered or structured by that index. One possibility is to input a search term directly; if it is contained in the index, then the document or the documents referenced by the index term are retrieved. Another possibility is to “browse” the index, which means to display the individual index elements in some (typically alphabetical) order, as shown in element 440 of FIG. 4. This has the advantage that a user can quickly get an overview of which index elements are used in total for organizing or “indexing” the set of documents.
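Both ways of using an index can be illustrated with a short Python sketch. The index contents and the function names `lookup` and `browse` are assumptions made for the example; only the reference numerals are taken from FIG. 4.

```python
# Hypothetical family-name index; the names are invented for illustration.
index = {"Smith": [420], "Miller": [410], "Jones": [430]}


def lookup(idx, term):
    """Direct query: return the documents referenced by an exact index term."""
    return idx.get(term, [])


def browse(idx):
    """Browsing: return all index elements in alphabetical order."""
    return sorted(idx)


hits = lookup(index, "Miller")    # documents referenced by "Miller"
misses = lookup(index, "Meier")   # term not contained in the index
overview = browse(index)          # alphabetical list of all index elements
```

A direct query only succeeds if the search term is identical to an index element, whereas browsing lets the user see every element that is available.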
Another, somewhat more sophisticated approach is to use a so-called fault-tolerant search, which means that a search term is entered and those documents are retrieved where the corresponding index value is identical or at least similar (to some extent, depending on the fault-tolerant search algorithm used) to the search term.
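One possible form of such a fault-tolerant search is sketched below. The text does not name a particular algorithm, so the similarity measure of Python's standard `difflib` module is used here purely as a stand-in. The index contents, the function name `fault_tolerant_lookup` and the cutoff value of 0.8 are likewise assumptions.

```python
import difflib

# Hypothetical family-name index; the names are invented for illustration.
index = {"Smith": [420], "Miller": [410], "Jones": [430]}


def fault_tolerant_lookup(idx, term, cutoff=0.8):
    """Return documents whose index term is identical or similar to `term`.

    Similarity is judged by difflib.get_close_matches, which keeps index
    terms whose similarity ratio to `term` is at least `cutoff`.
    """
    matches = difflib.get_close_matches(term, list(idx), n=len(idx), cutoff=cutoff)
    doc_ids = []
    for match in matches:
        doc_ids.extend(idx[match])
    return doc_ids


misspelled = fault_tolerant_lookup(index, "Millar")  # similar to "Miller"
exact = fault_tolerant_lookup(index, "Smith")        # identical term
unrelated = fault_tolerant_lookup(index, "Xyz")      # nothing similar enough
```

Lowering the cutoff makes the search more tolerant of errors but also returns more unrelated documents, which corresponds to the "to some extent" qualification above.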
In any case, building an index is difficult and tedious work. It is the preparatory work which has to be done in order to make it feasible to access large sets of documents in an ordered and meaningful manner.
Typically, indexes are created “manually”, at least when the documents to be indexed are “unstructured”, such as plain text documents. If the documents to be indexed are “structured”, such as in the case of relational database tables, then it is relatively easy to build an index. However, if one does not know which individual “meaning” an element in an unstructured document has, then it is extremely difficult and tiresome to select elements which can be used for indexing this document.
Consequently, it is highly desirable to improve the process of indexing documents.