The present invention relates to a full-text search that performs high-speed retrieval of documents containing specified strings from the full text of large-scale document databases. The present invention is used in databases, document management systems, document filing systems, and DTP (Desktop Publishing) systems.
One method for performing high-speed retrieval of documents containing specified strings from the full text of large-scale document databases is to use an n-gram index.
In the n-gram indexing method, information about the position at which each n-gram (a string consisting of n consecutive characters) occurs in a document is indexed when a document is registered. Using this method, documents in which a search term appears are found as follows. When a search is performed, the n-grams contained in the search term are looked up in the index, and an evaluation is made to see whether the positional relations within the search term match the positional relations in the index (this is referred to hereinafter as an adjacency evaluation).
FIG. 2 shows an example of a 1-gram indexing method.
Referring to the figure, in the n-gram indexing method, information about the position at which each n-gram (n=1 in the example shown in FIG. 2) appears in a document is stored in an index when a document is registered.
For example, the 1-gram “ni” appears as the third character of document ‘001’. Thus, the document number ‘001’ and the character position ‘3’ are stored in an index 200 corresponding to “ni”.
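As a non-limiting illustration, the registration step described above might be sketched as follows. The function name `build_ngram_index`, the representation of a document as a list of romanized characters, and the sample documents are all assumptions introduced for this sketch; the sample documents are not the text shown in FIG. 2.

```python
from collections import defaultdict


def build_ngram_index(docs, n=1):
    """Return {n-gram tuple: [(doc_id, position), ...]} with 1-based positions.

    docs maps a document number to a list of characters; each list element
    stands for one character of the document (an assumption of this sketch).
    """
    index = defaultdict(list)
    for doc_id, chars in docs.items():
        for start in range(len(chars) - n + 1):
            ngram = tuple(chars[start:start + n])
            # Store the document number and the character position.
            index[ngram].append((doc_id, start + 1))
    return dict(index)


# Hypothetical sample documents (not the actual contents of FIG. 2).
sample_docs = {
    "001": ["ka", "ku", "ni", "bi", "sei", "butsu"],
    "056": ["bi", "sei", "butsu", "ni"],
}

one_gram_index = build_ngram_index(sample_docs, n=1)
two_gram_index = build_ngram_index(sample_docs, n=2)
print(one_gram_index[("ni",)])
```

In this sketch the entry for “ni” holds one (document number, character position) pair per occurrence, which corresponds to the role of the index 200 in the description above.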
When a search is performed, an occurrence position for a search term is obtained by performing an adjacency evaluation of the occurrence positions in the indexes of the n-grams (n=1 in the example shown in FIG. 2) extracted from the specified search term.
For example, if “bi|sei|butsu” is specified as the search term, the 1-grams “bi”, “sei”, and “butsu” are extracted from the search term.
Then, occurrence position information for “bi|sei|butsu” is obtained by performing an adjacency evaluation using an index 201 corresponding to “bi”, an index 202 corresponding to “sei”, and an index 203 corresponding to “butsu”.
In the example shown in the figure, “bi”, “sei”, and “butsu” are adjacent to each other starting with character ‘9’ in document number ‘001’. The characters are also adjacent to each other starting with character ‘5’ in document number ‘056’. This indicates that ‘bi|sei|butsu’ occurs at these positions.
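The adjacency evaluation described above can be illustrated, purely as a sketch, as follows. The names `build_1gram_index` and `search`, the set-based index layout, and the sample documents are assumptions made for this illustration and are not taken from FIG. 2.

```python
from collections import defaultdict


def build_1gram_index(docs):
    """Return {character: {(doc_id, position), ...}} with 1-based positions."""
    index = defaultdict(set)
    for doc_id, chars in docs.items():
        for pos, ch in enumerate(chars, start=1):
            index[ch].add((doc_id, pos))
    return index


def search(index, term):
    """Return sorted (doc_id, start) pairs at which the term occurs.

    term is a list of characters. A candidate start position survives only
    if the k-th character of the term is indexed at position start + k in
    the same document, i.e. the characters are adjacent in the document.
    """
    if not term:
        return []
    candidates = set(index.get(term[0], ()))
    for offset, ch in enumerate(term[1:], start=1):
        postings = index.get(ch, set())
        candidates = {(d, p) for (d, p) in candidates
                      if (d, p + offset) in postings}
    return sorted(candidates)


# Hypothetical sample documents (not the actual contents of FIG. 2).
sample_docs = {
    "001": ["sei", "ka", "bi", "sei", "butsu"],
    "056": ["bi", "butsu", "sei", "bi", "sei", "butsu"],
}

index = build_1gram_index(sample_docs)
hits = search(index, ["bi", "sei", "butsu"])
print(hits)
```

Note that the documents themselves are never scanned at search time; only the occurrence position information in the indexes is compared.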
As described above, the n-gram indexing method allows searching to be performed without scanning a document by simply loading indexes and performing adjacency evaluations based on occurrence position information. Thus, the method can be used to provide high-speed full-text searches even when implemented for large-scale document databases.
However, when the n-gram indexing method is used with an n value of 1, i.e., with 1-gram indexing, each 1-gram has a high frequency of occurrence, so the amount of occurrence position information for each 1-gram increases and the individual indexes become large.
This slows the loading of indexes and increases the number of adjacency evaluations that have to be performed based on the occurrence position information, thus making searches time-consuming.
To provide high-speed searching, smaller indexes must be created using a higher value of n. However, indexes for smaller values of n must also be created to allow searching when short search terms are specified.
As a result, the total index size is increased.
Also, in index-based document retrieval methods, such as the n-gram indexing method, strings (n-grams, in the case of the n-gram indexing method) must be managed in a tree structure, such as the tries described in “Information Retrieval”, by William B. Frakes, pp. 21–23.
Tries are tree structures created for sets of strings to be searched, i.e., key words (hereinafter referred to as key sets), in which key words (hereinafter referred to as keys) that begin with the same characters share the nodes and branches for their common front sections.
These tries are used when registering and retrieving documents. A string to be registered or a string contained in the search term is used as a key that is traversed in a trie to obtain pointer information indicating an index corresponding to the string.
Since the time required to search a trie depends on the length of the key being searched for rather than on the number of keys, tries can be used for large-scale databases to provide high-speed key word searches.
FIG. 3 shows a trie corresponding to a key set of {baby, badge, badger, jar}.
In this trie, a branch label b (302) is defined from a node 1 (300) to a node 2 (301). At the node corresponding to the end of the key, indicated by double circles, pointer information for the index corresponding to the key is set up.
For example, if the specified search term is “baby”, the trie in the figure is searched for the string “baby”, and pointer information Pt1 set up at a node 5 (303) is obtained. The pointer information Pt1 points to where an index corresponding to the search term “baby” is stored.
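By way of illustration only, a trie for the key set {baby, badge, badger, jar} might be sketched as follows. The class names `TrieNode` and `Trie`, the method names, and the pointer labels other than Pt1 are hypothetical and are introduced only for this sketch.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # branch label (one character) -> child node
        self.pointer = None  # set only at a node corresponding to the end of a key


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, pointer):
        """Register a key and the pointer information for its index."""
        node = self.root
        for ch in key:
            # Keys with a common front section reuse the same nodes.
            node = node.children.setdefault(ch, TrieNode())
        node.pointer = pointer

    def lookup(self, key):
        """Traverse the trie along the key; return its pointer or None."""
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.pointer


trie = Trie()
# "Pt1" follows the description of FIG. 3; "Pt2" to "Pt4" are hypothetical labels.
for key, pointer in [("baby", "Pt1"), ("badge", "Pt2"),
                     ("badger", "Pt3"), ("jar", "Pt4")]:
    trie.insert(key, pointer)

print(trie.lookup("baby"))
```

The lookup follows one branch per character of the key, so its cost in this sketch grows with the length of the key and not with the number of keys registered.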
When using these types of tries to manage n-grams for the n-gram indexing method, creating indexes with longer n-grams to keep individual indexes smaller and to make searches faster will result in an increased number of n-grams and trie nodes, leading to larger tries.
In order to overcome this problem of increased total index size and increased size of the tree structure that manages the indexes, Japanese laid-open patent publication number Hei 8-194718 (hereinafter referred to as conventional technology 1) discloses a method where, if the index for an n-gram exceeds a certain reference value (hereinafter referred to as the reference index size), the value of n for that n-gram is increased and smaller indexes are created. This provides a consistently light load for index loading and adjacency evaluations for occurrence position information, allowing high-speed searching, while also preventing increases in total index size and the size of the tree structures (hereinafter described for tries) used to manage the indexes.
FIG. 4 provides an overview of the incremental n-gram indexing method disclosed in conventional technology 1.
When a document is registered in this method, an index of n-grams is created and connection information for two characters in the document is registered in a trie 122.
If the index size exceeds a reference index size as documents are being registered, an index is created for n-grams having one more character than the original n-gram (hereinafter referred to as an extended n-gram).
The following is a more specific description of the method used to create indexes, with reference to FIG. 4.
To create an extended n-gram with one more character than the original n-gram, the trie 122 is looked up and an n-gram that may continue from the original n-gram is retrieved.
Then, an adjacency evaluation is performed for the occurrence position information of the index for the retrieved n-gram (hereinafter referred to as a connection n-gram) and the index of n-grams that exceeded the reference index size (hereinafter referred to as the reference index surplus n-gram). This is used to create an index for extended n-grams.
In the example shown in this figure, the index corresponding to the 1-gram “sei” exceeds the reference index size, so “sei” becomes a reference index surplus n-gram.
First, the trie 122 is searched for “sei”, and a connection n-gram following “sei” is obtained.
In the example shown in the figure, a search of the trie 122 determines that “butsu” and “soku” follow “sei”.
Then, an adjacency evaluation is performed between the occurrence positions in the “sei” index and those in the “butsu” index, and between the occurrence positions in the “sei” index and those in the “soku” index. This results in the creation of an extended n-gram index 400, where one character is added to “sei”, as in “sei|butsu” and “sei|soku”.
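As a rough sketch of the index creation walked through above, and not as the implementation disclosed in conventional technology 1, the creation of extended n-gram indexes might look as follows. The function name `extend_index`, the value of `REFERENCE_INDEX_SIZE`, the index layout, and the sample occurrence positions are all assumptions; the list of connection n-grams stands in for the result of looking up the trie 122.

```python
REFERENCE_INDEX_SIZE = 3  # hypothetical reference index size (number of entries)


def extend_index(index, surplus, connections):
    """Create indexes for extended n-grams having one more character.

    index:       {n-gram tuple: {(doc_id, position), ...}}, 1-based positions
    surplus:     the reference index surplus n-gram, as a tuple
    connections: characters that may follow it (assumed to come from the trie)
    """
    extended = {}
    n = len(surplus)
    for ch in connections:
        following = index.get((ch,), set())
        # Adjacency evaluation: keep occurrences of the surplus n-gram that
        # are immediately followed by the connection n-gram.
        hits = {(d, p) for (d, p) in index[surplus] if (d, p + n) in following}
        if hits:
            extended[surplus + (ch,)] = hits
    return extended


# Hypothetical occurrence position information (not taken from FIG. 4).
index = {
    ("sei",): {("001", 2), ("001", 7), ("002", 4), ("003", 1)},
    ("butsu",): {("001", 3), ("002", 5)},
    ("soku",): {("001", 8), ("003", 2)},
}

extended = {}
if len(index[("sei",)]) > REFERENCE_INDEX_SIZE:
    extended = extend_index(index, ("sei",), ["butsu", "soku"])
print(sorted(extended))
```

In this sample, the four entries of the “sei” index are divided between the “sei|butsu” and “sei|soku” indexes, so each extended n-gram index is smaller than the index it was derived from.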
As described above, by using conventional technology 1, extended n-grams having one more character than the original n-grams are created for large indexes that slow down searches. This provides a consistently low load for index loading and adjacency evaluations of occurrence position information, thus allowing high-speed searches to be performed.
For all other indexes, indexes for longer n-grams are not created, thus preventing increases in the total index size and the size of the tree structures (tries) used to manage n-grams.