A keyword is a single word or a multi-word phrase present within documents that can characterize and summarize the topics covered by the documents. When documents are prepared, there is often a need to generate a list of keywords and phrases that represent the main concepts described in such documents. For example, a reader may utilize a list of keywords and phrases as a simple summary of a document for searching and locating articles among academic documents such as technical papers, journal articles, etc. Similarly, due to an increase in the usage of the Internet, there is a need to provide a keyword list for electronic documents to facilitate searching for a particular document. Keyword extraction from a document has many potential applications, such as creating metadata for a document, facilitating the skimming of documents by highlighting keywords, providing index terms for searching document collections, and analyzing usage patterns in Web server logs.
Keywords from a document can be generated manually by an author of the document or by a person skilled in indexing documents. The keywords may also be generated automatically by tagging words in documents by their part of speech, such as, for example, a noun, a verb, an adjective, etc. Similarly, the most frequent words in documents can be listed, excluding stop words such as “and”, “if”, “have”, etc. Stop words are commonly utilized insignificant words, such as “the”, which occur frequently in a document. Such prior art keyword extraction methods possess limited capabilities, which results in a relatively low-quality list of keywords. Such approaches are also usually highly labor intensive.
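By way of illustration only, the frequency-listing approach described above can be sketched as follows. This is a minimal sketch and not an implementation drawn from any particular prior art system; the function name `frequent_words`, the tokenizing pattern, and the very short `STOP_WORDS` set are assumptions introduced for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration;
# practical lists contain many more entries.
STOP_WORDS = {"the", "and", "if", "have", "a", "of", "to", "in", "is"}


def frequent_words(text, top_n=5):
    """Return the top_n most frequent non-stop words with their counts."""
    # Lowercase the text and split it into simple alphabetic tokens.
    words = re.findall(r"[a-z']+", text.lower())
    # Count every token that is not a stop word.
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)
```

Such a listing ranks words purely by how often they occur, so any common word missing from the stop-word list may still appear near the top of the output.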
One prior art keyword extraction approach collects word frequencies with respect to a corpus of documents to determine average word frequencies. The same frequency counting method can be utilized to determine the word frequencies of a page or a document in question. The problem associated with such prior art approaches is that common words may occur more frequently in a given page or document than in the corpus, and may be incorrectly output as keywords. Similarly, if the given page possesses a small word count, quantization causes the word frequencies to be inaccurate, thereby resulting in non-keywords appearing to be more frequent than in the corpus. One solution to this problem is to utilize a list of stop words composed of a predetermined set of common words. Hence, if a given word in the page or document is a stop word, it is not considered a keyword. Alternatively, the raw frequency in the given page or document can be compared against the raw frequency in the corpus to generate keywords. Such methods, however, suffer from frequency quantization problems due to small sample sizes.
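The comparison of page frequencies against corpus frequencies described above can be sketched, purely for illustration, as follows. The function names, the `min_ratio` threshold, and the `floor` value substituted for words absent from the corpus are all assumptions made for this example and are not taken from any specific prior art method.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Map each word in text to its share of the total word count."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {w: c / total for w, c in Counter(words).items()}


def keyword_candidates(page_text, corpus_freq, stop_words=frozenset(),
                       min_ratio=2.0, floor=1e-6):
    """Return (word, ratio) pairs for words over-represented in the page.

    corpus_freq maps words to their average relative frequency in the
    corpus. min_ratio and floor are hypothetical tuning values.
    """
    scored = []
    for word, freq in relative_frequencies(page_text).items():
        if word in stop_words:
            continue  # predetermined common words are never keywords
        # Ratio of page frequency to corpus frequency; unseen words
        # fall back to a small floor to avoid division by zero.
        ratio = freq / corpus_freq.get(word, floor)
        if ratio >= min_ratio:
            scored.append((word, ratio))
    # Highest ratio first; ties are broken alphabetically.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

The quantization effect is visible in the arithmetic: on a page of ten words, the smallest nonzero relative frequency is 0.1, so a word whose corpus frequency is 0.01 receives a ratio of 10 after a single occurrence, regardless of whether it is a meaningful keyword.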
Based on the foregoing, it is believed that a need exists for an improved automated method and system for simple keyword extraction, as described in greater detail herein.