Web-based applications have allowed users to be both producers and consumers of content. In an example, web sites have been developed to include web log (blog) applications that allow users to generate documents that include text and images, and share such documents with others. Additionally, social networking applications allow users to generate status messages (which can also be classified as documents) that include images and/or text and publish such information to defined contacts of the user and/or the general public, if desired. In yet another example, micro-blogging applications have been developed that facilitate user-publishing of micro-blogs (e.g., messages of a limited number of characters) that are accessible to subscribers and/or the general public. Much of this content that is produced by users is retained, at least temporarily, in data repositories that are accessible to web search engines.
For many applications, it is desirable to ascertain some semantic meaning of documents in a document corpus; for example, it may be desirable to identify trends in web-based documents for purposes of designing or marketing products. Further, it may be desirable to identify words or phrases that summarize a document, such that when a query that corresponds to one of such words or phrases is submitted by a user to a search engine, the search engine can consider the correspondence between the query and the word or phrase that summarizes the document when positioning the document in a ranked list of search results. Manually undertaking this task of analyzing a large corpus of text (e.g., millions of web-based documents), however, is not possible on a wide scale. For instance, several million micro-blogs are generated over the period of just a few days. Therefore, it is impractical to dedicate human resources to manually review each micro-blog for purposes of understanding and/or classifying a respective micro-blog.
Accordingly, computer-based algorithms have been developed to analyze large corpuses of text. An exemplary analysis undertaken over large corpuses of text is the identification of collocations therein. As used herein, a collocation refers to a sequence of terms that are co-located more often in a corpus of text than would be expected if terms in the corpus of text were arranged randomly. For example, the terms “President George Washington” in sequence may form a collocation, as such terms may appear together often in text.
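By way of illustration only, the definition of a collocation given above can be sketched for the two-term case as a comparison between how often an adjacent pair of terms is observed and how often the pair would be expected if terms were arranged randomly. The sketch below is an assumption-laden example and is not an algorithm described herein: the function name `bigram_lift`, the `min_count` threshold, the whitespace tokenization, and the observed-to-expected ratio used as a score are all hypothetical choices made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def bigram_lift(tokens: List[str], min_count: int = 2) -> Dict[Tuple[str, str], float]:
    """Score adjacent term pairs by observed count divided by expected count.

    Hypothetical helper for illustration. A score well above 1.0 suggests
    the pair co-occurs more often than random arrangement would predict.
    """
    n_tokens = len(tokens)
    if n_tokens < 2:
        return {}

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_pairs = n_tokens - 1

    scores: Dict[Tuple[str, str], float] = {}
    for (first, second), observed in bigrams.items():
        # Skip rare pairs; their ratios are unreliable (assumed threshold).
        if observed < min_count:
            continue
        # Expected count if each position were filled independently
        # according to the overall term frequencies.
        expected = n_pairs * (unigrams[first] / n_tokens) * (unigrams[second] / n_tokens)
        scores[(first, second)] = observed / expected
    return scores


# Toy corpus, tokenized on whitespace (an assumption for the example).
sample = (
    "president george washington met the press . "
    "the president george washington spoke . the press spoke"
).split()
sample_scores = bigram_lift(sample)

# List candidate collocations, highest score first.
for pair, score in sorted(sample_scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

In this sketch a pair qualifies as a candidate collocation when its score exceeds 1.0 by a margin chosen by the practitioner. Longer sequences, such as the three-term example above, would require extending the counting to longer n-grams, and a corpus of millions of documents would require distributing the counting across computing nodes.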
As described above, a significant amount of text is constantly being generated by web users. Relatively recently, algorithms have been developed for execution in distributed computing environments for the purpose of analyzing a large corpus of text, such that analysis of text is performed in parallel across multiple computing nodes. Conventional algorithms that have been developed for execution in distributed computing environments for identifying collocations in documents, however, are relatively inefficient and inflexible.