With the advent of the Internet, computer users can access a wealth of information with relative ease. Users may request information about a particular topic from sources connected to the Internet. The users seek to uncover relevant information.
For example, to obtain information, a user enters one or more search terms into an information retrieval system, such as a search engine, which then provides the user with locations of documents that include the search terms. To identify such documents, the search engine indexes all documents, and maintains a database including terms, such as words, and the locations of documents including those terms. The database is regularly updated by the search engine to include data about recently added or amended documents.
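The term-to-location database described above is commonly called an inverted index. The following is a minimal sketch of that idea, not the implementation of any particular search engine; the names `build_index` and `search`, the whitespace tokenization, and the sample document locations are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercased term to the set of document locations containing it.

    `documents` is a hypothetical dict of {location: text}.
    """
    index = defaultdict(set)
    for location, text in documents.items():
        for term in text.lower().split():
            index[term].add(location)
    return index


def search(index, query):
    """Return the locations of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the postings of the first term, then intersect with the rest.
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


documents = {
    "doc1": "the white house",
    "doc2": "a white fish",
}
index = build_index(documents)
matches = search(index, "white house")
```

Updating such an index for a recently added document amounts to running the inner loop of `build_index` over that one document.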
Information about one topic may be identified by many different words. For example, documents pertaining to the President of the United States of America may be uncovered by the search terms `President` or `Commander-In-Chief.` However, groups of words, such as `Executive branch,` may also identify the same topic. One type of group of words that may identify a topic is a collocation.
A collocation is a group of words whose meaning cannot be inferred from the individual meanings of its constituent words. For example, the term `White House` refers to the home and office of the President of the United States of America, and does not simply mean a house that is white. Some collocations may be used in a manner where the constituent words do not adjoin one another. For example, the term `a school of white and black fish` includes the collocation `school of fish`.
Collocations in documents must be accurately identified so that, generally, only documents including relevant information are uncovered. For example, because collocations are typically formed by a group of words in close proximity to one another, a user utilizing a search engine to uncover information associated with a specific collocation must use proximity operators when formulating a search query. This technique ensures that documents including collocations in which the constituent terms do not adjoin one another are uncovered. At the same time, it desirably precludes uncovering many irrelevant documents in which the constituent terms are far removed from one another and do not actually form a collocation.
The use of proximity operators is a burden for the average user. Moreover, permitting the use of proximity operators requires the database to perform the memory-intensive task of storing all occurrences of all terms encountered, along with their relative positions. Furthermore, this technique may also uncover documents that are not relevant, but coincidentally include the collocation terms in close proximity to one another. Thus, the precision of the search is diminished.
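To make the storage cost and the false-positive problem concrete, the sketch below records the position of every occurrence of every term and answers a simple proximity query. It is illustrative only: the function names `build_positional_index` and `within`, the word-distance threshold, and the sample documents are assumptions, not part of any cited system.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {location: [word positions]}; every occurrence is stored."""
    index = defaultdict(lambda: defaultdict(list))
    for location, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term][location].append(position)
    return index


def within(index, first, second, max_distance):
    """Locations where `first` and `second` occur at most `max_distance` words apart."""
    hits = set()
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    # Only documents containing both terms can satisfy the proximity query.
    for location in first_postings.keys() & second_postings.keys():
        if any(
            abs(p - q) <= max_distance
            for p in first_postings[location]
            for q in second_postings[location]
        ):
            hits.add(location)
    return hits


documents = {
    "a": "a school of white and black fish",
    "b": "the school bought a boat and years later the children ate fish",
    "c": "the fish swam past the school",
}
index = build_positional_index(documents)
hits = within(index, "school", "fish", 5)
```

In this example, document `a` is uncovered even though the constituent words do not adjoin, and document `b` is excluded because the words are far apart. Document `c` is also uncovered, although it does not contain the collocation, because the two words merely happen to fall close together.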
Precision is a relative measure of relevance. Precision may be calculated by dividing the number of uncovered documents that the user ascertains to be truly relevant by the total number of uncovered documents. As shown in Maarek, Yoelle and Smadja, Frank, "Full Text Indexing Based On Lexical Relations," Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 1989, hereby incorporated by reference, precision can be enhanced by identifying collocations in documents. Thus, there is a need for a high-precision technique for uncovering relevant documents by identifying collocations in the documents.
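Under the standard information-retrieval definition, precision is the fraction of uncovered documents that are relevant. As a worked illustration, with assumed counts chosen only for this example, suppose a query uncovers 20 documents, of which the user judges 15 to be relevant:

```latex
\text{precision}
  = \frac{\text{number of uncovered documents judged relevant}}
         {\text{total number of uncovered documents}}
  = \frac{15}{20}
  = 0.75
```

Each irrelevant document that is uncovered enlarges the denominator without enlarging the numerator, and so lowers precision.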
To accomplish this goal, groups of words that form collocations in a set of documents must be identified. Several methods, including n-squared statistical algorithms, have been developed to identify those groups of words based upon statistical significance. These methods, and related techniques, have been disclosed in Smadja, Frank, "Retrieving Collocations from Text: Xtract," Computational Linguistics, 19(1), 1993, pp. 142-177, hereby incorporated by reference.
These methods typically have a complexity on the order of n squared, where n is the number of words in the set of documents. Moreover, these methods are static. Thus, the collocations must be identified anew every time the set of documents is altered. The set of documents is altered whenever a new document is added to the set, or when a document in the set is modified. Thus, these methods are computationally and memory intensive.
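The quadratic cost can be illustrated with a deliberately naive pair counter. This is not the Xtract algorithm or any other cited method; it is a hypothetical sketch, with the assumed name `count_pairs`, that simply compares every word occurrence in the set with every other one and therefore has to be rerun in full whenever the set changes.

```python
from collections import Counter
from itertools import combinations


def count_pairs(documents):
    """Count co-occurring word pairs across the whole document set.

    Compares every pair of word occurrences, so n words cost n*(n-1)/2
    comparisons. Returns (pair counts, number of comparisons performed).
    """
    words = [w for text in documents for w in text.lower().split()]
    counts = Counter()
    comparisons = 0
    for a, b in combinations(words, 2):
        comparisons += 1
        if a != b:
            # Sort so that (a, b) and (b, a) are counted as the same pair.
            counts[tuple(sorted((a, b)))] += 1
    return counts, comparisons


documents = ["white house", "white fish"]
counts, comparisons = count_pairs(documents)

# Adding one document forces a full recount over the enlarged set.
counts_after, comparisons_after = count_pairs(documents + ["black fish"])
```

Here four words require 6 comparisons, and adding a two-word document raises the total to 15, since nothing from the earlier pass is reused.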
Therefore, there is a need for a method and apparatus that identifies documents with specified collocations and does so with high precision and in a manner that is less computationally or memory intensive.