Field of the Invention
This invention generally relates to keyword extraction techniques, and more particularly to a method of extracting keywords from a document based on a combination of natural language processing (NLP) information, frequency analysis, and co-occurrence analysis of the document data.
Description of Related Art
With the advent of the Internet, a massive amount of information is now available, along with a demand to be able to search through all of it. As a result, keywords have commonly been used by search engines and document databases to locate specific information, or to determine whether two pieces of text data are related to each other. Due to the complexities of natural language, however, it can be quite challenging to locate and define a set of words (i.e., the keywords) that accurately conveys the theme or describes the topics contained in the text. Various keyword extraction techniques have been developed over the years. Despite their differences, most methods attempt to do the same thing: they use some heuristic, such as the distance between words, the frequency of word use, or predetermined word relationships, to locate and define keywords in a document or a piece of text data.
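By way of illustration only, two of the heuristics mentioned above, frequency of word use and co-occurrence of words, can be sketched in a few lines of Python. This is a minimal sketch and not any particular prior-art method: the function names `tokenize` and `count_frequency_and_cooccurrence`, the small stopword list, and the choice of the sentence as the co-occurrence window are all assumptions made for the example.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for", "with"}


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def count_frequency_and_cooccurrence(text):
    """Return (word frequencies, sentence-level co-occurrence counts).

    Two words are counted as co-occurring once for every sentence in
    which both appear; each pair is stored in alphabetical order.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    freq = Counter()
    cooc = Counter()
    for sentence in sentences:
        words = tokenize(sentence)
        freq.update(words)
        for pair in combinations(sorted(set(words)), 2):
            cooc[pair] += 1
    return freq, cooc


if __name__ == "__main__":
    sample = (
        "The printer prints the report. "
        "The report lists the printer queue. "
        "Alice reviews the report."
    )
    freq, cooc = count_frequency_and_cooccurrence(sample)
    print(freq.most_common(2))
    print(cooc.most_common(1))
```

In this sketch a word that is both frequent and repeatedly paired with another frequent word would be a stronger keyword candidate than a word that is merely frequent; how the two counts are weighted against each other is left open.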
In some circumstances, these methods may not be sufficient or efficient, and additional measurements are therefore needed to help extract the keywords. For example, most content processed by printers in an organization comprises corporate documents, such as meeting notes or weekly reports, which contain important information regarding individuals or entities and their relationships within the organization. Such information may not be quickly captured using existing keyword extraction techniques. Therefore, a need exists to improve the performance of keyword extraction by combining various techniques, such as statistical, natural language, and positional techniques, with additional word measurements, such as relationships between the keywords.