The Internet has created a proliferation of searchable content by providing access to millions, if not billions, of Web pages, documents, and other similar published content. To augment the searchability of such content, conventional document processing algorithms have been developed to automatically determine relevant keywords embedded in digitized documents.
In their most basic form, conventional document processing algorithms measure the frequencies with which words or terms appear in a given document. Words or terms that appear more frequently can theoretically have, to a certain degree, greater relevance or significance when classifying the content of that document.
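By way of illustration only, the frequency measurement described above can be sketched in a few lines of Python. The function name `term_frequencies`, the simple regular-expression tokenizer, and the sample text are assumptions made for this sketch and are not drawn from any particular conventional algorithm.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercased word appears in the text.

    Hypothetical helper: the tokenizer is a deliberately simple
    assumption (runs of letters and apostrophes).
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample document.
doc = "The cat sat on the mat. The mat was red."
freqs = term_frequencies(doc)

# The two most frequent terms and their counts.
print(freqs.most_common(2))  # [('the', 3), ('mat', 2)]
```

In this sample the most frequent term is the function word "the", which suggests why raw frequency indicates relevance only to a certain degree.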
A more sophisticated method for keyword extraction uses conventional Part-Of-Speech (“POS”) taggers. Conventional POS taggers can identify multi-word phrases by determining the part of speech of each term (e.g., verb, noun, adjective, etc.) and then, based on grammatical and syntactical statistical models, determining which word groupings (e.g., adjective-noun, noun-noun, etc.) are grammatically correct. Those groupings can then be analyzed to calculate their frequencies of occurrence in a given document.
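The grouping step described above can be sketched as follows. This is a simplified illustration under stated assumptions: a small hand-written lexicon (`TOY_LEXICON`) stands in for a statistical POS tagger, only two-word groupings are considered, and all names and sample text are hypothetical.

```python
import re
from collections import Counter

# Assumption: a toy lexicon standing in for a real statistical POS tagger.
TOY_LEXICON = {
    "a": "DET",
    "search": "NOUN",
    "engine": "NOUN",
    "pages": "NOUN",
    "ranks": "VERB",
    "fast": "ADJ",
    "relevant": "ADJ",
}

# Two-word part-of-speech patterns treated as candidate phrases.
ALLOWED_PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN")}


def tag(tokens):
    """Attach a part-of-speech label to each token (unknown words get OTHER)."""
    return [(token, TOY_LEXICON.get(token, "OTHER")) for token in tokens]


def phrase_frequencies(text):
    """Count adjacent word pairs whose tags match an allowed pattern."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tagged = tag(tokens)
    counts = Counter()
    for (w1, p1), (w2, p2) in zip(tagged, tagged[1:]):
        if (p1, p2) in ALLOWED_PATTERNS:
            counts[f"{w1} {w2}"] += 1
    return counts


# Hypothetical sample document.
doc = "A search engine ranks pages. A fast search engine ranks relevant pages."
phrases = phrase_frequencies(doc)
print(phrases.most_common())
```

For this sample the noun-noun grouping "search engine" is counted twice, while the adjective-noun groupings "fast search" and "relevant pages" are each counted once. Because the sketch ignores sentence boundaries and phrases longer than two words, it is a rough approximation of what a conventional tagger-based approach would do.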
The extraction of keywords by conventional document processing algorithms has several applications, including targeted advertising, tagging of documents in online social environments, database development, and other similar document cataloguing or classification endeavors. For example, a document analyzer can identify the essence of a document and apply keywords to the document so that an advertiser can distribute appropriate advertisements along with the document when it is downloaded by a computer user.