The present invention relates in general to semantic clustering of documents and in particular to semantic clustering using a combination of words and multi-word phrases that may appear in the documents.
With the proliferation of computing devices and communication networks such as the Internet, an ever increasing amount of information is stored in the form of electronic documents. Such documents might be generated using application software such as word processing programs, e-mail programs, web page development tools, etc. Electronic documents can also be generated by scanning paper documents and employing optical character recognition (“OCR”) or other techniques to create an electronic representation of the content.
It is often necessary to search through a large collection of electronic documents to find information relevant to a particular question. For example, a number of search services provide interfaces via which users can search electronic documents that are accessible via the World Wide Web. In another context, discovery in civil litigation usually involves the production of massive quantities of electronic documents that the producing and receiving parties must sift through.
To facilitate review of a large corpus of documents, a number of analysis techniques have been developed that automatically determine properties of the documents, e.g., by analyzing the patterns of occurrence of words. For example, semantic clustering attempts to group documents pertaining to the same topic, generally by identifying words or combinations of words that tend to occur in documents within the cluster but not in documents outside the cluster.
One difficulty in semantic clustering is that many languages (such as English) include multi-word groups (phrases) that convey a meaning to a user. The meaning of such a phrase can differ from the meanings of its individual words. For example, “New York” and “ice cream” are recognized phrases. Human readers recognize such phrases, but computers do not. Semantic clustering algorithms based on single words can thus miss important pieces of information, leading to less accurate results.
To address this, some efforts have been made to incorporate phrase identification into semantic clustering. For example, some clustering programs provide a list of phrases, and sequences of words from documents can be compared to the list to detect phrases. This form of phrase detection is limited to those phrases that happen to be on the list. Other clustering programs use typographic cues (e.g., capital letters) to identify phrases; this works well for proper nouns such as “New York” or “Frank Sinatra” but not for phrases such as “ice cream” that are not normally capitalized.
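The limitations of the two existing approaches described above can be illustrated with a minimal Python sketch. It is not drawn from any particular clustering program. The phrase list, the function names, and the restriction to two-word phrases are all assumptions made for illustration.

```python
import re

# Hypothetical phrase list (an assumption for illustration); a real
# list would be far larger but would still omit some phrases.
KNOWN_PHRASES = {("new", "york"), ("frank", "sinatra")}


def tokenize(text):
    """Split text into alphabetic word tokens, preserving case."""
    return re.findall(r"[A-Za-z]+", text)


def phrases_from_list(text, known=KNOWN_PHRASES):
    """Return adjacent word pairs that appear in the known-phrase list."""
    words = [w.lower() for w in tokenize(text)]
    return [pair for pair in zip(words, words[1:]) if pair in known]


def phrases_from_capitals(text):
    """Return adjacent word pairs in which both words are capitalized."""
    words = tokenize(text)
    return [
        (a, b)
        for a, b in zip(words, words[1:])
        if a[0].isupper() and b[0].isupper()
    ]


text = "Frank Sinatra ate ice cream in New York."
list_hits = phrases_from_list(text)
capital_hits = phrases_from_capitals(text)
```

In this sketch, both functions find “Frank Sinatra” and “New York” in the sample sentence, and neither finds “ice cream”: the phrase is absent from the hypothetical list and is not capitalized.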
It would therefore be desirable to automate the process of identifying meaningful phrases within documents or collections of documents.