The present invention relates to information retrieval techniques. More specifically, the present invention relates to performing a phrase search using exclusion tokens.
As high-speed, large-capacity information and communication infrastructure, such as computers and broadband environments, has become widespread, and as information technology has been increasingly introduced in organizations such as public offices, universities, and companies, an enormous number of unformatted documents are created daily. Accordingly, there is an increasing demand for a search system capable of rapidly and precisely retrieving a document desired by a searcher.
In a search system, the character string of a document to be searched is divided into units (hereinafter referred to as tokens), such as words and clauses, by using an appropriate character string division method. The resulting tokens are assigned position numbers in the order in which they appear in the original document and are then stored in an inverted index. An input query text is likewise divided into predetermined units (hereinafter referred to as search tokens), such as words and clauses. The search system determines whether or not to extract the document as a search result depending on whether or not the registered tokens of the document match the search tokens.
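The indexing and matching flow described above can be sketched as follows. This is a minimal illustration, not the method of any particular search system: the regular-expression tokenizer stands in for "an appropriate character string division method", and the names `tokenize`, `build_index`, and `search` are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase word tokens (assumed stand-in for a real division method)."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Build an inverted index: token -> {doc_id: [position numbers]}, positions from 0."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, tok in enumerate(tokenize(text)):
            index[tok][doc_id].append(pos)
    return index

def search(index, query):
    """Return the ids of documents whose registered tokens include every search token."""
    result = None
    for tok in tokenize(query):
        ids = set(index.get(tok, {}))
        result = ids if result is None else result & ids
    return result or set()

# Hypothetical sample documents.
docs = {
    1: "Search systems index tokens",
    2: "Tokens are stored in an inverted index",
}
index = build_index(docs)
matches = search(index, "inverted index")
```

Here the query is tokenized in the same way as the documents, so a document is returned only when all of its search tokens are found among the registered tokens.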
Various techniques for improving search precision are known. For example, a technique is known in which language identification and character string analysis, such as morphological analysis, are performed in order to accurately retrieve the document intended by a searcher from among many documents, thereby realizing a search with higher precision than a search based on simple character string matching. In addition, Japanese Patent Application Publication No. 2010-250389 discloses a technique in which indexing is performed by dividing a document into tokens using two analysis methods, i.e., morphological analysis and N-gram, in order to reduce search omissions and obtain appropriate search results.
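As a rough illustration of dividing the same text with two analysis methods, the sketch below produces both a word-level and a character N-gram token stream. It does not reproduce the method of the cited publication: whitespace splitting is only an assumed stand-in for morphological analysis, and the function names are hypothetical.

```python
def word_tokens(text):
    """Word-level tokens; whitespace splitting stands in for morphological analysis."""
    return text.lower().split()

def char_ngrams(text, n=2):
    """Overlapping character N-grams over the text with spaces removed."""
    s = text.lower().replace(" ", "")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def dual_tokens(text, n=2):
    """Return both token streams so that each could be indexed separately."""
    return {"word": word_tokens(text), "ngram": char_ngrams(text, n)}

# Hypothetical sample text.
streams = dual_tokens("Kyoto Tokyo")
```

The N-gram stream lets a query match inside a word that the word-level analysis treats as a single token, which is why combining the two can recover matches that one method alone would miss.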
However, when advanced character string analysis is introduced, documents containing only a partially matching character string are also included in the search results because of the influence of the analysis result. As a result, the search results may include documents that are not necessarily desired by the searcher.
Meanwhile, punctuation marks and symbols may be arbitrarily used by a document creator. Thus, a general search system adopts a method in which punctuation marks and symbols are not used as headwords and are not indexed so that a search can be performed without being affected by the punctuation marks and symbols. However, such a system is incapable of performing a search in consideration of punctuation marks and symbols.
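The limitation can be seen in a small sketch. The tokenizer below is an assumption made for illustration: it keeps only letters and digits, which is one simple way of not indexing punctuation marks and symbols.

```python
import re

def tokenize_dropping_symbols(text):
    """Keep only runs of ASCII letters and digits; punctuation and symbols are discarded."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical sample texts that differ only in a symbol.
with_symbol = tokenize_dropping_symbols("Hiring a C++ developer")
without_symbol = tokenize_dropping_symbols("Hiring a C developer")
```

Both texts yield the same token sequence, so an index built this way has no means of distinguishing a query that contains the symbol from one that does not.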
In a phrase search, the position numbers assigned to tokens are used to determine whether the tokens that match the search tokens contained in a phrase have consecutive position numbers. Accordingly, in order to support the phrase search, adjacent tokens in a document have to be indexed so that the difference between their position numbers is fixed (generally, one). This restriction makes it difficult to include punctuation marks and symbols in headwords in the phrase search.
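The consecutive-position check can be sketched as follows, assuming adjacent tokens always differ by exactly one in position number. The names `build_index` and `phrase_search` and the pre-tokenized sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Build token -> {doc_id: [position numbers]} from pre-tokenized documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            index[tok][doc_id].append(pos)
    return index

def phrase_search(index, phrase_tokens):
    """Return ids of documents where the phrase tokens occur at consecutive positions."""
    if not phrase_tokens:
        return set()
    hits = set()
    for doc_id, starts in index.get(phrase_tokens[0], {}).items():
        for start in starts:
            # The i-th search token must sit exactly i positions after the first one.
            if all(start + i in index.get(tok, {}).get(doc_id, [])
                   for i, tok in enumerate(phrase_tokens)):
                hits.add(doc_id)
                break
    return hits

# Hypothetical pre-tokenized documents.
docs = {
    1: ["new", "york", "city"],
    2: ["new", "city", "of", "york"],
}
index = build_index(docs)
hits = phrase_search(index, ["new", "york"])
```

Both documents contain "new" and "york", but only the first has them at adjacent positions. The check relies on the fixed difference of one, which is why a token inserted between two words, such as an indexed symbol, would shift the positions and break the match.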
However, companies often use symbols in proper nouns, such as company names, project names, and product names. When a search is performed with such symbols omitted, a document desired by a searcher is undesirably excluded from the search results. Although proper nouns can be registered as words in a dictionary, the registration work is troublesome and, furthermore, indices have to be re-created every time a dictionary registration occurs. Thus, dictionary registration is insufficient as a solution.