This invention relates to trainable methods and apparatus for automatically identifying keywords in a document, by using stop words to delimit phrases.
After documents are prepared, there is often a need to generate a list of keywords and phrases that represent the main concepts described therein. For example, academic documents such as technical papers, journal articles and the like typically have an accompanying list of keywords and phrases that can be utilised by a reader as a simple summary of the document or as an aid in searching for and locating articles. More recently, with the increased popularity and use of the Internet, there is an even greater need to provide keyword lists for electronic documents to facilitate searching for a document.
Currently, the following four methods are used for generating keywords:
1. Keywords are generated manually, by the author of the document or by a person skilled in indexing documents.
2. Keywords are generated automatically by listing the most frequent words in a document, excluding stop words, that is, very common words such as “and”, “if”, and “have”.
3. Keywords are generated automatically by first automatically tagging the words in the document by their part-of-speech, such as noun, verb, adjective, etc., and then listing the most frequent noun phrases in the document.
4. Keywords are generated automatically by selecting those words from a document that belong to a predetermined set of indexing terms. This method requires a list of thousands of indexing terms specific to a particular field.
Of course, manual keyword or phrase generation is highly labour intensive. Moreover, a person skilled in indexing documents is likely to need some knowledge of the terms and an understanding of the particular subject matter being indexed.
Listing the most frequent words in the document, with the exception of stop words, usually results in a relatively low-quality list of keywords, especially in comparison with manual keyword or phrase generation. Single words are often less informative than two- or three-word phrases.
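The word-frequency approach described above can be illustrated with a minimal sketch. The stop-word list, the function name `frequent_words` and the sample text are assumptions made purely for illustration; a practical stop-word list would be considerably longer.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not taken from the source).
STOP_WORDS = {"a", "an", "and", "the", "if", "have", "of", "in", "is", "to", "are", "by"}

def frequent_words(text, top_n=3):
    """Return the top_n most frequent non-stop words, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

sample = ("Neural networks are trained by gradient descent. "
          "Neural networks have layers, and the layers of neural networks learn.")
top = frequent_words(sample)
print(top)
```

In this sample the phrase "neural networks" is reported only as two separate single words, which is the kind of loss of information noted above.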
Part-of-speech tagging requires a lexicon of usually several tens of thousands of words, and such lexicons have to be provided for each target language.
Most part-of-speech taggers also require a large body of training text, in which every word has been manually tagged. While the quality of the keyword list generated by this method is superior to that of the second method above, it remains inferior to that of the manual method of keyword and phrase generation.
A limitation of a lexicon of target keywords is that it requires a list of thousands of indexing terms. The list of indexing terms must be kept up-to-date and will be specific to a certain field (e.g., law, biology, chemistry, etc.). Building and maintaining such a list is very labour intensive.
Of the three methods that are currently used for automatically generating keywords, part-of-speech tagging tends to yield the best results. This method has two basic steps. First, potential keywords are identified by tagging words according to their part-of-speech and listing noun phrases. Second, keywords are determined by selecting the most frequent noun phrases. A limitation of this method is that it uses a strong method for identifying potential keywords, but a weak method for selecting keywords from the list of candidates.
In view of the limitations of the prior art methods of keyword generation, it is an object of this invention to provide a method and means for automatically generating keywords that overcomes many of these limitations.
It is a further object of this invention to provide a fast and relatively efficient method of generating keywords from an electronically stored document.
It is yet a further object of the invention to provide a method and system for generating a plurality of keywords from an electronically stored document, wherein the system is trainable by using a training data set independent of the document.
In accordance with the invention, there is provided a method of generating a plurality of keywords from an electronic, stored document including phrases, stop words delimiting the phrases, and punctuation. A computer is used to select from the document raw phrases of one or more contiguous words excluding stop words, by utilising stop words, or stop words and punctuation, to determine the raw phrases to be selected. The step of selecting raw phrases is performed in the absence of part-of-speech tagging and a lexicon of target keywords. The computer then uses a form of the raw phrases to generate the plurality of keywords.
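By way of illustration only, the selection of raw phrases using stop words and punctuation as delimiters might be sketched as follows. The stop-word list, the punctuation pattern and the name `raw_phrases` are assumptions for the purpose of the example and are not prescribed by the method.

```python
import re

# Illustrative stop-word list (an assumption; a practical list would be longer).
STOP_WORDS = {"a", "an", "and", "at", "from", "if", "have", "of", "the", "to"}

def raw_phrases(text):
    """Return runs of contiguous non-stop words, in document order.

    Punctuation first splits the text into fragments, and stop words then
    split each fragment. No part-of-speech tags or keyword lexicon are used.
    """
    phrases = []
    # Any character that is not a word character, whitespace, apostrophe
    # or hyphen is treated as punctuation that ends a phrase.
    for fragment in re.split(r"[^\w\s'-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases

sample = ("The method selects raw phrases from the document, "
          "and a phrase ends at punctuation.")
phrases = raw_phrases(sample)
print(phrases)
```

The only language-specific resource this sketch needs is the stop-word list, which is short in comparison with a part-of-speech lexicon.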
The features used for evaluating the raw phrases include a frequency of the raw phrase occurrence within the document; a measure of closeness to a starting portion of the document; and, a length of the raw phrase.
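The three features named above might be computed as in the following sketch. The exact measures are not given here, so the formulas below, in particular the definition of closeness as one minus the relative position of the first occurrence, are assumptions chosen for illustration, as is the name `phrase_features`.

```python
def phrase_features(phrase, phrases):
    """Compute illustrative features for one raw phrase.

    `phrases` is the list of all raw phrases of the document, in order.
    """
    first_index = phrases.index(phrase)  # position of the first occurrence
    return {
        # How often the raw phrase occurs within the document.
        "frequency": phrases.count(phrase),
        # 1.0 for the first phrase, falling towards 0 near the end (assumed measure).
        "closeness_to_start": 1.0 - first_index / len(phrases),
        # Length of the raw phrase, measured here in words (assumed unit).
        "length_in_words": len(phrase.split()),
    }

doc_phrases = ["keyword extraction", "document", "keyword extraction", "stop words"]
early = phrase_features("keyword extraction", doc_phrases)
late = phrase_features("stop words", doc_phrases)
print(early)
print(late)
```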
In accordance with the invention there is further provided a method of generating a plurality of keywords from an electronic, stored document including phrases, stop words delimiting the phrases, and punctuation. A computer is used to select from the document raw phrases comprised of one or more contiguous words excluding stop words. A form of the raw phrases is used to generate the plurality of keywords in dependence upon a plurality of weighted criteria, wherein weights for the criteria are determined by a step of training. For example, the step of training is performed by providing a training document; providing a set of keywords that are dependent upon the training document; providing a set of weights that are independent of the training document; performing keyword extraction on the training document; comparing the generated keywords with the provided keywords; and then modifying the weights for the criteria and repeating the step of training until the comparison is within predetermined limits. Training may, for instance, be performed with a genetic algorithm, and the weights may be stored in a decision tree.
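The training steps listed above can be sketched as a simple loop. For brevity this sketch substitutes random perturbation of the weights (a hill-climbing search) for the genetic algorithm mentioned as an example, and stores the weights in a plain list rather than a decision tree. The candidate phrases, their feature values, and the names `extract`, `agreement` and `train` are all hypothetical.

```python
import random

def extract(candidates, weights, top_n):
    """Rank (phrase, features) pairs by weighted feature sum; keep the top_n."""
    def score(item):
        _, features = item
        return sum(w * f for w, f in zip(weights, features))
    ranked = sorted(candidates, key=score, reverse=True)
    return [phrase for phrase, _ in ranked[:top_n]]

def agreement(generated, provided):
    """Fraction of the provided keywords that were also generated."""
    return len(set(generated) & set(provided)) / len(provided)

def train(candidates, provided, weights, target=1.0, max_rounds=200, seed=0):
    """Perturb the weights until agreement reaches the target or rounds run out."""
    rng = random.Random(seed)
    best = agreement(extract(candidates, weights, len(provided)), provided)
    for _ in range(max_rounds):
        if best >= target:  # comparison is within the predetermined limit
            break
        trial = [w + rng.uniform(-1.0, 1.0) for w in weights]
        result = agreement(extract(candidates, trial, len(provided)), provided)
        if result >= best:  # never accept weights that do worse
            weights, best = trial, result
    return weights, best

# Hypothetical training document, reduced to candidate phrases with
# (frequency, closeness to start, length in words) feature values.
candidates = [
    ("keyword extraction", (3, 0.9, 2)),
    ("document", (5, 0.2, 1)),
    ("stop words", (2, 0.8, 2)),
    ("method", (4, 0.1, 1)),
]
provided = ["keyword extraction", "stop words"]  # keywords supplied with the document
initial_weights = [1.0, 0.0, 0.0]                # chosen independently of the document
trained_weights, best_agreement = train(candidates, provided, initial_weights)
print(trained_weights, best_agreement)
```

The initial weights here rank candidates by frequency alone, which favours the single words; training can only keep or improve the agreement with the provided keywords, since trial weights that do worse are discarded.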
In accordance with the invention there is provided a method of generating a plurality of keywords from an electronic, stored document including phrases, stop words delimiting the phrases, and punctuation. A first list of the words within the document that are not stop words is generated. Each word in the list is evaluated to determine a score in dependence upon a plurality of indicators and a weight for each indicator, the scores for different words in the list being determined using the same indicators and the same weights. The list of words is ordered in dependence upon the scores. For each word in the list, all raw phrases of one or more words containing a word having a predetermined similarity to that word are selected, and a score for each selected phrase is determined. The word in the list is then replaced with the most desirable phrase comprising a word having the predetermined similarity.
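A minimal sketch of this word-first procedure follows. It makes several simplifying assumptions that are not drawn from the method itself: each word is scored by a single indicator (its frequency), two words are treated as similar when they share their first five characters, and the most desirable phrase is taken to be the most frequent one, with longer phrases preferred on ties. The names `similar` and `expand_words` are hypothetical.

```python
from collections import Counter

def similar(a, b, prefix_len=5):
    """Hypothetical similarity test: the words share their first prefix_len letters."""
    return a[:prefix_len] == b[:prefix_len]

def expand_words(phrases, top_n=2):
    """Score single words, then replace each top word with its best raw phrase."""
    # List of non-stop words (raw phrases already exclude stop words).
    words = [w for p in phrases for w in p.split()]
    # Score and order the words; frequency is the only indicator in this sketch.
    ranked_words = [w for w, _ in Counter(words).most_common(top_n)]
    phrase_counts = Counter(phrases)
    keywords = []
    for word in ranked_words:
        # All raw phrases containing a word similar to the ranked word.
        matches = [p for p in phrase_counts
                   if any(similar(word, t) for t in p.split())]
        # Most desirable phrase: highest count, then greatest length in words.
        best = max(matches, key=lambda p: (phrase_counts[p], len(p.split())),
                   default=word)
        if best not in keywords:
            keywords.append(best)
    return keywords

doc_phrases = ["keyword extraction", "document", "keyword list",
               "keyword extraction", "training document", "training document"]
keywords = expand_words(doc_phrases)
print(keywords)
```

In this sample the top-ranked single words "keyword" and "document" are replaced by the phrases "keyword extraction" and "training document" respectively.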
Advantageously, the invention provides a method and system wherein training data sets comprising documents and keywords are provided for analysis, so that training of the system may occur. Having analysed and learned from the preferred training data set, the system in accordance with this invention then performs similarly when generating keywords.