As the Internet and electronic devices have become ubiquitous, an enormous number of documents is generated every day, including blogs, comments, news articles, and customer reviews of products. For example, WORDPRESS.COM, owned by AUTOMATTIC INC. of San Francisco, Calif., receives 347 user-published blogs every minute, and AMAZON.COM, owned by AMAZON.COM INC. of Seattle, Wash., receives on the order of three hundred thousand customer reviews of products every day. Many of these documents contain useful information. For example, news articles keep readers informed of events occurring around the world. Similarly, customer reviews of products not only help customers make purchase decisions, but also help stakeholders such as authors, sellers, product managers, and manufacturers analyze and improve the products.
A very large number of documents can, however, be technically challenging to analyze. A common way to tackle this problem is through keyword extraction. Keywords are significant expressions in a document. Extraction of keywords allows a reader of a document to quickly determine the relevance of the document without reading its entire content.
Extracting meaningful and representative keywords is a nontrivial computing task. Because the relevance of a keyword cannot be quantitatively defined, substantial background knowledge is often needed to extract a highly relevant set of keywords. Supervised machine learning, which requires that the documents be annotated, is often employed in order to achieve accurate keyword extraction. Corpus-level statistics can also be utilized to facilitate keyword extraction. Despite these various efforts, however, existing keyword extraction approaches still do not provide satisfactory results.
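By way of illustration only, one widely known use of corpus-level statistics is term frequency-inverse document frequency (TF-IDF) weighting, in which a term scores highly when it is frequent in one document but rare across the corpus. The following is a minimal sketch under stated assumptions and is not a description of any particular existing system or of the technologies disclosed herein; the function names `tokenize` and `tfidf_keywords`, the simple letters-only tokenizer, and the `top_k` parameter are hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase runs of ASCII letters are adequate tokens for a sketch.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(corpus, doc_index, top_k=3):
    """Return the top_k terms of corpus[doc_index] ranked by TF-IDF score."""
    tokenized = [tokenize(doc) for doc in corpus]
    n_docs = len(tokenized)

    # Document frequency: the number of documents in which each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Term frequency within the selected document.
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())

    # A term that appears in every document gets log(1) = 0 and drops out.
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]
```

A sketch of this kind needs no annotated training data, but it ranks terms only by how they are distributed across the corpus and has no notion of what a term means.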
The disclosure made herein is presented with respect to these and other considerations.