Identifying information related to a given topic within large collections of documents is an ongoing challenge. The most common method is to use Boolean keyword searches to find documents that mention particular terms, but such searches have inherent limitations in identifying documents of potential interest. One limitation is that when a specific keyword is used, there is a risk that the search will not return a relevant document because the document does not use the same phrasing or nomenclature as that keyword. On the other hand, if a more general keyword is used, there is a risk that the search will return a set of documents too large for a searcher to analyze within a reasonable time. Thus, the limitations of using Boolean searches to gauge the relevancy of a document to a keyword reduce the efficiency with which information can be gleaned from large sets of documents. A human who manually searches documents for text relevant to a keyword can often address these shortcomings by employing intuition developed through years of familiarity with language and with a breadth of topics; when large document sets are to be reviewed, however, manual review is not practical.
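The two failure modes described above can be illustrated with a minimal sketch. The three-document corpus, the `tokenize` helper, and the `boolean_and_search` function are all hypothetical names invented for this illustration; they are not taken from any particular search system.

```python
# Hypothetical three-document corpus, invented for illustration only.
DOCS = {
    "doc1": "The automobile engine overheated during testing.",
    "doc2": "Car engine maintenance schedules vary by manufacturer.",
    "doc3": "The jet engine was inspected after the flight.",
}


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, and lowercase."""
    return {word.strip(".,;:!?").lower() for word in text.split()}


def boolean_and_search(docs, *keywords):
    """Return the sorted ids of documents that contain every keyword (Boolean AND)."""
    wanted = {keyword.lower() for keyword in keywords}
    return sorted(
        doc_id for doc_id, text in docs.items() if wanted <= tokenize(text)
    )


# A specific query misses doc1, which says "automobile" rather than "car".
specific = boolean_and_search(DOCS, "car", "engine")

# A general query matches every document, including the unrelated jet engine.
general = boolean_and_search(DOCS, "engine")
```

In this toy corpus the specific query returns only `doc2`, while the general query returns all three documents. With a realistic collection, the second result set could be far too large to review manually.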
In an effort to increase the efficiency with which sets of documents can be reviewed, other methods are used to assess the relevancy of documents identified by a search. Some internet search engines, for example, assess relevancy by prioritizing the documents (for example, web pages) that are returned to the user. More specifically, some search engines use crowd sourcing, which ranks the relevancy of documents returned from a Boolean search based upon the popularity or page-rank of those documents. Although priority or relevancy rankings based upon crowd sourcing work very well in instances where the search engine has sufficient users to generate the necessary statistics, they are poorly suited to more niche applications. For example, crowd sourcing is ill-suited to small intranets, or to searches within a single internet domain, because the volume of users may not be large enough to generate accurate relevancy rankings. Additionally, crowd sourcing may not generate accurate relevancy rankings when obscure search terms are used, because the yielded results have not been viewed a sufficient number of times to be prioritized by popularity.
Because many documents include common words (e.g., “the”, “a”, “that”, ...) that have no particular relevancy to the document or the keyword, some prior art methods for determining the relevancy of documents involve eliminating the effect of these common words on the results. Such methods, however, require identification of the common words and therefore require knowledge of the language used in the documents containing these words.
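The language dependence of common-word elimination can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list `ENGLISH_STOP_WORDS`, the `term_counts` helper, and the two sample sentences are hypothetical and do not reproduce any specific prior art method.

```python
from collections import Counter

# Assumed, hand-picked English stop-word list (illustrative, not exhaustive).
ENGLISH_STOP_WORDS = {"the", "a", "that", "of", "and", "is", "in", "to"}


def term_counts(text, stop_words):
    """Count the words in text, ignoring any word found in stop_words."""
    words = [word.strip(".,;:!?").lower() for word in text.split()]
    return Counter(word for word in words if word and word not in stop_words)


english = "The engine of the car is in the garage."
german = "Der Motor des Autos ist in der Garage."

# With the matching language, only content-bearing words remain.
english_counts = term_counts(english, ENGLISH_STOP_WORDS)

# Applied to German text, the English list leaves "der" as the most frequent term.
german_counts = term_counts(german, ENGLISH_STOP_WORDS)
```

Here the English sentence reduces to "engine", "car", and "garage", whereas the German sentence retains its own common words, with "der" counted twice. Filtering the German text correctly would require a separate list compiled with knowledge of German.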