With the advents of the printing press, typeset, typewriting machines, computer-implemented word processing and mass data storage, the amount of information generated by mankind has risen dramatically and with an ever quickening pace. As a result there is a continuing and growing need to collect and store, identify, track, classify and catalogue for retrieval and distribution this growing sea of information. One traditional form of cataloging and classifying information, e.g., books and other writings, is the Dewey Decimal System. In the area of patents, millions of patents have issued in the U.S. alone. Each patent is issued with a set of claims that define the property right granted by the U.S. and owned by the patentee. In addition to issued patents are the growing number of published patent applications that are now available for searching and reviewing. Each published patent application likewise contains one or more claims to the invention. The U.S. Patent Office uses a subject matter-based classification system to place submitted patent applications in technology centers, classes, and sub-classes of art to more efficiently handle the searching and granting, or denying, of patent claims. In addition a set of International Patent Codes further classifies patents and applications by subject matter. Historically, examiners assigned to examine patent applications would consult “shoes,” i.e., a box associated with a particular sub-class and containing collections of patents grouped together based on subject matter disclosed and claimed by previous inventors. Prior to electronic searching examiners would consult by hand the shoes in an effort to find prior art, this was very tedious, time-consuming, and inefficient. Electronic databases effectively place patent documents in electronic “shoes” for searching.
In many areas and industries, including the financial and legal sectors and areas of technology, for example, there are content and enhanced experience providers, such as The Thomson Reuters Corporation. Such providers identify, collect, analyze and process key data for use in generating content, such as law related reports, articles, etc., for consumption by professionals and others involved in the respective industries, e.g., lawyers. Providers in the various sectors and industries continually look for products and services to provide subscribers, clients and other customers and for ways to distinguish their firms over the competition. Such provides strive to create and provide enhance tools, including search and ranking tools, to enable clients to more efficiently and effectively process information and make informed decisions.
For example, with advancements in technology and sophisticated approaches to searching across vast amounts of data and documents, e.g., database of issued patents, published patent applications, etc., professionals and other users increasingly rely on mathematical models and algorithms in making professional and business determinations. Existing methods for applying search terms across large databases of patent documents, for example, have room for considerable improvement as they frequently do not adequately focus on the key information of interest to yield a focused and well ranked set of documents to most closely match the expressed searching terms and data. Although such computer-based systems have shortcomings, there has been significant advancement over searching, identifying, filtering and grouping IP documents by hand, which is prohibitively time-intensive, costly, inefficient, and inconsistent.
Search engines are used to retrieve documents in response to user defined queries or search terms. To this end, search engines may compare the frequency of terms that appear in one document against the frequency of those terms as they appear in other documents within a database or network of databases. This aids the search engine in determining respective “importance” of the different terms within the document, and thus determining the best matching documents to the given query. One method for comparing terms appearing in a document against a collection of documents is called Term Frequency-Inverse Document Frequency (TFIDF). In this method a percentage of term count as compared to all terms within a subject document is assigned (as a numerator) and that is divided by the logarithm of the percentage of documents in which that term appears in a corpus (as the denominator). More specifically, TFIDF assigns a weight as a statistical measure used to evaluate tile importance of a word to a document in a collection of documents or corpus. The relative “importance” of tile word increases proportionally to tile number of times or “frequency” such word appears in the document. The importance is offset or compared against the frequency of that word appearing in documents comprising the corpus. TFIDF is expressed as the log(N/n(q)) where q is the query term, N is the number of documents in the collection and N(q) is the number of documents containing q. TFIDF and variations of this weighting scheme are typically used by search engines, such as Google, as a way to score and rank a document's relevance given a user query. Generally for each term included in a user query, the document may be ranked in relevance based on summing the scores associated with each term. The documents responsive to the user query may be ranked and presented to the user based on relevancy as well as other determining factors.