The present invention relates to a keyfact-based text retrieval method and a keyfact-based text indexing method. In particular, the methods represent the formalized concept of a document as a pair comprising an object, which is the head, and a property, which is the modifier, and use the information described by the pair as index information for efficient document retrieval.
A keyfact is an important fact contained in the sentences that constitute a document. A keyfact is represented by an object and its property information, which are obtained through syntactic analysis of the sentence.
Keyword-based text retrieval has been the mainstream among conventional text retrieval methods. However, its precision is poor for the following reasons. First, because a document is represented by keywords, which are nouns, the meaning of the document is not precisely represented and the representativeness of the document expression is low. This is a fundamental reason for poor retrieval precision. Second, when a query consists of a natural language phrase, a natural language sentence, or keywords, the intention of the user's query is not reflected precisely, because the query is reduced to keywords. The keyword-based text retrieval method therefore has a fundamental limitation in retrieval precision because it performs document retrieval by keywords. As a result, a keyword-based text retrieval system returns many unnecessary documents, and valuable resources such as time and effort are wasted.
Recently, a number of studies have been performed in the area of phrase-based text retrieval methods in order to compensate for these defects of the keyword-based retrieval method. The phrase-based text retrieval methods extract a precise phrase pattern through a morphological-syntactic normalization process and perform indexing and retrieval by the extracted phrases. The phrase-based retrieval method therefore performs more precise text retrieval than the keyword-based text retrieval method, but less precise text retrieval than a concept-based text retrieval method, which expresses text in concept units.
A new approach, the keyfact-based text retrieval method, has been proposed in order to overcome the shortcomings of the keyword-based text retrieval method and to generalize the phrase-based text retrieval method. In the keyfact-based text retrieval method, parts of text that represent the same meaning are described as a keyfact. Since the keyfact-based retrieval method is a kind of concept-based retrieval method, and indexing and retrieval are performed in units of keyfacts, the precision of retrieval is greatly improved.
In the keyfact-based retrieval method, it is desirable that phrases or words having the same meaning are indexed as the same indexing terms. For example, noun phrases that contain “the retrieval of information” as a subset, such as “the efficient retrieval of information”, “the retrieval of the distributed information”, and “the fast retrieval of the distributed information”, must share the common indices that can be generated from “the retrieval of information”. At the same time, they must be recognized as having different meanings, through additional indices that capture their subtle conceptual differences.
Since the keyword-based retrieval method does not recognize the conceptual difference between “the retrieval of the information” and “the efficient retrieval of the information”, users are not able to retrieve exactly the document that is desired.
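The requirement above, shared indices plus distinguishing indices, can be illustrated with a minimal sketch. It assumes that a keyfact is stored as an (object, property) pair with an empty property for a bare object, and that the noun phrase has already been parsed into a head, a complement, and their modifiers. The names `Keyfact` and `keyfacts_for` and the particular set of generated pairs are hypothetical and are not taken from the generation rules of the invention.

```python
from typing import List, NamedTuple, Optional, Set


class Keyfact(NamedTuple):
    obj: str              # the object (head)
    prop: Optional[str]   # the property (modifier); None for a bare object


def keyfacts_for(head: str, head_mods: List[str],
                 complement: str, complement_mods: List[str]) -> Set[Keyfact]:
    """Generate an illustrative keyfact set for 'MODS head of MODS complement'."""
    # Keyfacts shared by every phrase built on 'head of complement'.
    facts = {
        Keyfact(head, None),
        Keyfact(complement, None),
        Keyfact(head, complement),
    }
    # Keyfacts that distinguish the more specific phrases.
    facts |= {Keyfact(head, m) for m in head_mods}
    facts |= {Keyfact(complement, m) for m in complement_mods}
    return facts


base = keyfacts_for("retrieval", [], "information", [])
efficient = keyfacts_for("retrieval", ["efficient"], "information", [])
fast_distributed = keyfacts_for("retrieval", ["fast"], "information", ["distributed"])

# Every longer phrase carries the indices of "the retrieval of information" ...
shared = base & efficient & fast_distributed
# ... plus indices that are specific to it.
only_efficient = efficient - base
```

Under these assumptions, a query for “the retrieval of information” matches all three phrases through the shared keyfacts, while a query for “the efficient retrieval of information” additionally matches on the pair (retrieval, efficient).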
A keyfact-based retrieval method is provided that extracts precise keyfact patterns using natural language processing techniques and indexes documents in units of keyfacts.
In addition, a keyfact-based retrieval method is provided that extracts the precise keyfact patterns included in a natural language query of a user using natural language processing techniques and retrieves documents similar to the query from the keyfact-based index file.
In addition, a keyfact-based retrieval method is provided that indexes and retrieves documents in units of keyfacts.
A keyfact-based text retrieval system of the present invention includes keyfact extracting means, keyfact indexing means, and keyfact retrieving means. The keyfact extracting means analyzes a document collection and a user query, extracts keywords having no part-of-speech ambiguity from the document collection and the user query, and extracts the keyfacts of the document collection and of the user query, respectively, from the keywords. The keyfact indexing means calculates the frequency of the keyfacts of the document collection and generates a keyfact list of the document collection for a keyfact index structure. The keyfact retrieving means receives the keyfact of the user query and the keyfacts of the document collection, defines a keyfact retrieval model in consideration of weight factors according to a keyfact pattern, and generates a retrieval result.
The keyfact extracting means includes morphology analyzing means, part-of-speech tagging means, keyfact pattern extracting means, and keyfact generating means. The morphology analyzing means analyzes the morphology of an input sentence and obtains candidate part-of-speech tag sequences by attaching part-of-speech tags. The part-of-speech tagging means selects the precise part-of-speech tag sequence from among the candidate tag sequences. The keyfact pattern extracting means extracts a keyfact pattern by applying the selected part-of-speech tag sequence to a keyfact pattern rule. The keyfact generating means applies the keyfact pattern to a keyfact pattern generation rule and generates a keyfact list, which is a set of keyfact terms.
The keyfact indexing means includes frequency calculating means, table generating means, and keyfact indexing means. The frequency calculating means calculates a frequency of the various keyfacts included in the document collection and a document frequency of the keyfacts, where the document frequency is the number of documents containing a given keyfact. The table generating means generates a document index table, a document table, and a keyfact index table of the document collection. The keyfact indexing means forms a keyfact index structure. The keyfact index structure holds, for each keyfact, the document frequency, the document identifiers, and the keyfact frequency in each corresponding document.
The keyfact retrieving means includes the following means. A first means forms a document vector and a user query vector from an index file, which is generated by the keyfact indexing means, and the keyfact of the user query, which is generated by the keyfact extracting means. A second means determines keyfact weight constants in accordance with the keyfact pattern. A third means calculates keyfact weights for the document and the user query by applying the keyfact weight constants to the document vector and the user query vector. A retrieval result displaying means displays the retrieval result obtained by applying the keyfact weights to the keyfact retrieval model. The retrieval result indicates documents with a keyfact similar to the keyfact of the user query.
A keyfact-based text retrieving method of the present invention includes a keyfact extracting step, a keyfact indexing step, and a keyfact retrieving step. The keyfact extracting step analyzes a document collection and a user query, extracts keywords without part-of-speech ambiguity from the document collection and the user query, and extracts the keyfacts of the document collection and of the user query, respectively, from the keywords. The keyfact indexing step calculates the frequency of the keyfacts of the document collection and generates a keyfact list of the document collection for a keyfact index structure. The keyfact retrieving step receives the keyfact of the user query and the keyfacts of the document collection, defines a keyfact retrieval model in consideration of weight factors according to the keyfact pattern, and generates the retrieval result.
The step of keyfact extracting includes the following steps. The first step is to analyze the morphology of an input sentence and obtain candidate part-of-speech tag sequences by attaching part-of-speech tags. The second step is to select one part-of-speech tag sequence from among the candidate tag sequences. The third step is to extract a keyfact pattern by applying the selected part-of-speech tag sequence to a keyfact pattern rule. The fourth step is to apply the keyfact pattern to a keyfact pattern generation rule and generate a keyfact list.
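The third and fourth steps above can be sketched as a lookup of tag patterns followed by template expansion. The rule table below is purely hypothetical: the tag names (`N`, `P`, `A`), the three patterns, and the templates are assumptions made for illustration and are not the keyfact pattern rules or keyfact pattern generation rules of the invention. The input is assumed to be a single, already disambiguated sequence of (base form, tag) pairs with stop words removed.

```python
from typing import List, Optional, Tuple

Tagged = Tuple[str, str]                 # (base form, part-of-speech tag)
KeyfactPair = Tuple[str, Optional[str]]  # (object, property or None)

# Hypothetical rules: tag pattern -> templates of (object position, property position).
PATTERN_RULES = {
    ("N", "P", "N"): [(0, None), (2, None), (0, 2)],  # e.g. "retrieval of information"
    ("A", "N"): [(1, None), (1, 0)],                  # e.g. "efficient retrieval"
    ("N",): [(0, None)],                              # a single noun
}


def extract_keyfacts(tagged: List[Tagged]) -> List[KeyfactPair]:
    """Scan left to right, trying the longest pattern first at each position."""
    patterns = sorted(PATTERN_RULES, key=len, reverse=True)
    keyfacts: List[KeyfactPair] = []
    i = 0
    while i < len(tagged):
        for pattern in patterns:
            window = tagged[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                words = [word for word, _ in window]
                for obj, prop in PATTERN_RULES[pattern]:
                    keyfacts.append((words[obj], None if prop is None else words[prop]))
                i += len(pattern)
                break
        else:
            i += 1  # no rule matched at this position; skip the word
    return keyfacts


example = extract_keyfacts([("retrieval", "N"), ("of", "P"), ("information", "N")])
```

Trying longer patterns first is a design choice of this sketch, so that a noun inside a larger matched phrase is not also emitted by the single-noun rule at a later position.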
The step of analyzing morphology includes the following steps. The first step is to divide the input sentence into words. The second step is to perform morphological analysis on the words using part-of-speech dictionaries. The third step is to handle morphological variation and recover the base forms (prototypes) of the words. The fourth step is to obtain the part-of-speech tag sequences by attaching part-of-speech tags in accordance with the result of the morphological analysis.
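The four steps of morphological analysis can be sketched as follows. This is a toy illustration under stated assumptions: the dictionary `POS_DICT`, the tag names, the `UNK` tag for unknown words, and the crude suffix stripping in `base_form` are all hypothetical stand-ins for the part-of-speech dictionaries and the morphological processing of the invention. The output is every candidate tag sequence; choosing one of them is the job of the subsequent part-of-speech tagging step and is not shown.

```python
import itertools
from typing import List, Tuple

# Hypothetical toy dictionary: base form -> possible part-of-speech tags.
POS_DICT = {
    "the": ["STOP"],
    "of": ["P"],
    "retrieval": ["N"],
    "information": ["N"],
    "index": ["N", "V"],
    "document": ["N", "V"],
}


def base_form(word: str) -> str:
    """Recover a base form by stripping a few English suffixes (very rough)."""
    w = word.lower()
    if w in POS_DICT:
        return w
    for suffix in ("es", "s", "ed", "ing"):
        if w.endswith(suffix) and w[:-len(suffix)] in POS_DICT:
            return w[:-len(suffix)]
    return w


def analyze(sentence: str) -> List[List[Tuple[str, str]]]:
    """Return all candidate (base form, tag) sequences for the sentence."""
    per_word = []
    for word in sentence.split():                # step 1: divide into words
        base = base_form(word.strip(".,;"))      # step 3: recover the base form
        tags = POS_DICT.get(base, ["UNK"])       # step 2: dictionary lookup
        per_word.append([(base, tag) for tag in tags])
    # step 4: one candidate tag sequence per combination of word-level tags
    return [list(seq) for seq in itertools.product(*per_word)]


candidates = analyze("the retrieval of documents")
```

Because “document” is listed as both a noun and a verb in the toy dictionary, the example sentence yields two candidate tag sequences, which is the kind of part-of-speech ambiguity the tagging step has to resolve.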
The part-of-speech dictionaries include a noun dictionary, a verb dictionary, an adjective dictionary, an adverb dictionary, a preposition dictionary, a conjunction dictionary and a stop-word lexicon.
The step of keyfact indexing includes the following steps. The first step is to calculate a frequency of the various keyfacts and a document frequency of the keyfacts. The second step is to generate a document index table, a document table, and a keyfact index table of the document collection. The third step is to form a keyfact index structure that includes the document frequency, the document identifiers, and the keyfact frequency in each document.
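A keyfact index structure holding the three pieces of information named above can be sketched as an inverted index. The function name `build_keyfact_index` and the dictionary layout (`df`, `postings`) are assumptions made for illustration; the document index table and document table are omitted, and keyfacts are assumed to be (object, property) tuples that have already been extracted for each document.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List


def build_keyfact_index(doc_keyfacts: Dict[str, List[Hashable]]) -> Dict[Hashable, dict]:
    """Map each keyfact to its document frequency and per-document frequencies."""
    postings = defaultdict(dict)
    for doc_id, keyfacts in doc_keyfacts.items():
        # Keyfact frequency: how often each keyfact occurs in this document.
        for keyfact, freq in Counter(keyfacts).items():
            postings[keyfact][doc_id] = freq
    # Document frequency: the number of documents containing the keyfact.
    return {kf: {"df": len(p), "postings": p} for kf, p in postings.items()}


docs = {
    "d1": [("retrieval", "information"), ("retrieval", None), ("retrieval", None)],
    "d2": [("retrieval", None)],
}
index = build_keyfact_index(docs)
```

In this example, `index[("retrieval", None)]` records a document frequency of 2 with a keyfact frequency of 2 in `d1` and 1 in `d2`, while the pair `("retrieval", "information")` appears only in `d1`.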
The step of keyfact retrieving includes the following steps. The first step is to form a document vector and a user query vector from an index file and the keyfact of the user query. The second step is to determine keyfact weight constants in accordance with the keyfact pattern. The third step is to calculate keyfact weights for the document and the user query by applying the keyfact weight constants to the document vector and the user query vector. The fourth step is to display the retrieval result obtained by applying the keyfact weights to the keyfact retrieval model. The retrieval result indicates documents with a keyfact similar to the keyfact of the user query.
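The retrieval steps above can be sketched with a simple vector-space model. The text here does not specify the keyfact retrieval model or the values of the weight constants, so everything concrete below is an assumption: the constants in `PATTERN_WEIGHT`, the two pattern names, the frequency-times-log weighting, and the use of cosine similarity are swapped in only to make the flow of the four steps visible.

```python
import math
from typing import Dict, List, Optional, Tuple

KeyfactPair = Tuple[str, Optional[str]]

# Hypothetical weight constants per keyfact pattern (step 2).
PATTERN_WEIGHT = {"pair": 2.0, "single": 1.0}


def pattern_of(keyfact: KeyfactPair) -> str:
    return "single" if keyfact[1] is None else "pair"


def weight_vector(freqs: Dict[KeyfactPair, int], df: Dict[KeyfactPair, int],
                  n_docs: int) -> Dict[KeyfactPair, float]:
    """Step 3: frequency x rarity x pattern constant (an assumed weighting)."""
    return {
        kf: tf * math.log(1 + n_docs / df[kf]) * PATTERN_WEIGHT[pattern_of(kf)]
        for kf, tf in freqs.items() if kf in df
    }


def cosine(a: Dict[KeyfactPair, float], b: Dict[KeyfactPair, float]) -> float:
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query: Dict[KeyfactPair, int], docs: Dict[str, Dict[KeyfactPair, int]],
         df: Dict[KeyfactPair, int]) -> List[Tuple[str, float]]:
    """Steps 1 and 4: build the vectors, score each document, sort by score."""
    n_docs = len(docs)
    q = weight_vector(query, df, n_docs)
    scores = {d: cosine(q, weight_vector(f, df, n_docs)) for d, f in docs.items()}
    return sorted(scores.items(), key=lambda item: -item[1])


doc_freqs = {
    "d1": {("retrieval", "information"): 1, ("retrieval", None): 1},
    "d2": {("retrieval", None): 1, ("index", None): 1},
}
doc_frequency = {("retrieval", "information"): 1, ("retrieval", None): 2, ("index", None): 1}
query_freqs = {("retrieval", "information"): 1, ("retrieval", None): 1}
ranked = rank(query_freqs, doc_freqs, doc_frequency)
```

With these assumed constants, the document that shares the (retrieval, information) pair with the query ranks above the document that shares only the bare object, which is the effect that pattern-dependent weighting is meant to produce.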