Although the origins of the Internet trace back to the late 1960s, the more recently developed World Wide Web (“Web”), together with the long-established Usenet, has revolutionized access to untold volumes of electronically stored information for a worldwide audience, including written, spoken (audio) and visual (imagery and video) information, in both archived and real-time formats. The Web provides information via interconnected Web pages that can be navigated through embedded hyperlinks. The Usenet provides information in a non-interactive bulletin board format consisting of static news messages posted and retrievable by readers. In short, the Web and Usenet provide desktop access to a virtually unlimited library of information in almost every language worldwide.
Information exchange on both the Web and Usenet operates under a client-server model. For the Web, individual clients typically execute Web browsers to retrieve and display Web pages in a graphical user environment. For the Usenet, individual clients generally execute news readers to retrieve, post and display news messages, usually in a textual user environment. Both Web browsers and news readers interface to centralized content servers, which function as data dissemination, storage and retrieval repositories.
News messages available via the Usenet are cataloged into specific news groups and finding relevant content involves a straightforward searching of news groups and message lists. Web content, however, is not organized in any structured manner and search engines have evolved to enable users to find and retrieve relevant Web content, as well as news messages and other types of content. As the amount and variety of Web content have increased, the sophistication and accuracy of search engines have likewise improved. Existing methods used by search engines are based on matching search query terms to terms indexed from Web pages. More advanced methods determine the importance of retrieved Web content using, for example, a hyperlink structure-based analysis, such as described in S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Search Engine,” (1998) and in U.S. Pat. No. 6,285,999, issued Sep. 4, 2001 to Page, the disclosures of which are incorporated by reference.
Compounds frequently occur in Web content, news messages, and other types of content. A compound, sometimes also referred to as a collocation, is defined as any sequence of words that co-occur more often than by mere chance. Compounds occur in text and speech as a natural language construct and can include proper nouns, such as “San Francisco,” compound nouns, such as “hot dog,” and other semantic and syntactic language constructs, which result in the co-occurrence of two or more words. Compounds occur with regularity and are relevant to a range of applications, including speech recognition, text classification, and search result scoring.
Recognizing compounds is difficult, especially when they occur in speech or live text. Moreover, most languages lack regular syntactic or semantic clues to enable easy identification of compounds. In German, for instance, the first letter of each noun is capitalized, which complicates the identification of proper nouns. Similarly, the types of potential compounds can depend on the subject matter. For instance, a scientific paper could include compounds wholly distinct from those found in a sports column.
Conventional approaches to finding compounds in a text corpus typically rely on n-gram analysis, such as described in C. D. Manning and H. Schütze, “Foundations of Statistical Natural Language Processing,” Ch. 5, MIT Press (1999), the disclosure of which is incorporated by reference. An n-gram is a multi-word occurrence. N-gram-based approaches therefore count the frequencies of individual words or tokens and the frequencies of word sequences of varying lengths. N-gram-based approaches suffer from three principal difficulties.
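By way of illustration only, the frequency counting performed by an n-gram-based approach can be sketched as follows. The function name `count_ngrams`, the whitespace tokenization, and the sample sentence are assumptions made for this example and are not taken from the cited reference.

```python
from collections import Counter


def count_ngrams(tokens, max_n):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n (hypothetical helper)."""
    counts = {}
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the sequence and tally each window.
        counts[n] = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


# Toy corpus, tokenized on whitespace (an assumption for this sketch).
tokens = (
    "new york city is larger than york but new york city is not the capital"
).split()
counts = count_ngrams(tokens, 3)

print(counts[1][("york",)])                 # frequency of the single word
print(counts[2][("new", "york")])           # frequency of the two-word sequence
print(counts[3][("new", "york", "city")])   # frequency of the three-word sequence
```

A separate table of counts is kept for every sequence length, which is the source of the storage cost of such approaches.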
First, n-gram-based approaches are storage inefficient. As the number of words occurring in each n-gram increases, the number of unique n-grams in a corpus approaches the number of words in a corpus. Storing the counts for long sequences of n-grams can require a prohibitively large amount of memory.
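The storage behavior described above can be observed on a small scale with the following sketch, which compares the number of unique n-grams to the total number of n-gram positions for increasing n. The helper name `ngram_stats` and the sample sentence are hypothetical and serve only to illustrate the trend.

```python
def ngram_stats(tokens, max_n):
    """Map each n to (unique n-grams, total n-gram positions). Hypothetical helper."""
    stats = {}
    for n in range(1, max_n + 1):
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        stats[n] = (len(set(grams)), len(grams))
    return stats


# Toy corpus with a repeated phrase, tokenized on whitespace (an assumption).
tokens = "the cat sat on the mat and the cat sat on the rug".split()
stats = ngram_stats(tokens, 6)

for n, (unique, total) in stats.items():
    # As n grows, fewer n-grams repeat, so unique / total moves toward 1.
    print(n, unique, total, round(unique / total, 2))
```

In this toy corpus, the fraction of n-grams that are unique rises with n until every six-word sequence is distinct, so a count must be stored for nearly every position in the text.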
Second, with compounds of varying lengths, the likelihood of spurious shorter compounds being included as substrings increases. Spurious substrings of longer compounds can occur, skewing compound identification results. For example, “New York City” is a three-word compound, where the words “New,” “York,” and “City” are highly correlated. As a side effect, “York City” is also highly correlated, but generally does not represent a meaningful compound. “York” and “City” are only correlated in the context of the larger compound, “New York City.”
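The spurious substring effect can be illustrated with a simple association measure. The sketch below uses pointwise mutual information (PMI), chosen here only as one common example of such a measure; the function names, the toy corpus, and the whitespace tokenization are assumptions made for this illustration.

```python
import math
from collections import Counter


def pmi(tokens, first, second):
    """Pointwise mutual information of an adjacent word pair (illustrative only)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(first, second)]
    if pair_count == 0:
        return float("-inf")  # the pair never occurs adjacently
    p_pair = pair_count / (len(tokens) - 1)
    p_first = unigrams[first] / len(tokens)
    p_second = unigrams[second] / len(tokens)
    return math.log2(p_pair / (p_first * p_second))


def count_sequence(tokens, words):
    """Count occurrences of a contiguous word sequence."""
    n = len(words)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words)


# Toy corpus (an assumption for this sketch).
tokens = (
    "i visited new york city . new york city is big . "
    "she left new york city . the city is big . the park is big ."
).split()

print(pmi(tokens, "new", "york"))   # strongly associated, as expected
print(pmi(tokens, "york", "city"))  # also strongly associated
# Every "york city" here sits inside "new york city", so its score is a side effect.
print(count_sequence(tokens, ["york", "city"]),
      count_sequence(tokens, ["new", "york", "city"]))
```

In this corpus, “york city” receives a high positive score even though it never appears outside of “new york city,” which is how a pairwise measure applied in isolation can report the substring as a compound.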
Similarly, with compounds consisting of three or more words, the likelihood that a longer compound will contain two-word or three-word compounds as substrings increases. Spurious long compounds that contain shorter, but significant, compounds as substrings can occur. For example, “San Francisco” is a two-word compound, but “San Francisco has” is not a three-word compound. Nevertheless, n-gram-based approaches, which assume all words are independent, would erroneously identify “San Francisco has” as a three-word compound.
Therefore, there is a need for an approach to efficiently identifying compounds in a text corpus based on a measure of association, such as a likelihood of co-occurrence between the words which constitute each compound.
There is a further need for an approach to forming a list of compounds through an analysis of a text corpus with minimal overlapping substrings, minimal overlapping compounds, and efficient memory utilization.