With the advent of computer-implemented data capture and processing and of mass data storage, the amount of information generated by mankind has risen dramatically and at an ever-quickening pace. As a result, there is a continuing and growing need to collect, store, identify, track, classify, assimilate, transform, and redefine this growing sea of information for heightened use by humans.
One traditional form of cataloging and classifying information is the Dewey Decimal System. In the area of patents, millions of patents have issued in the U.S. alone. Each patent issues with a common set of features, e.g., claims, IPC code, title, cited references, abstract, specification, etc. In addition to issued patents, there is a growing number of published patent applications that are now available for searching and reviewing. Each published patent application likewise contains fields of interest. The U.S. Patent Office uses a subject matter-based classification system to place submitted patent applications in technology centers, classes, and sub-classes of art to more efficiently handle the searching and the granting, or denying, of patent claims. In addition, a set of International Patent Classification (IPC) codes further classifies patents and applications by subject matter; WIPO has established a set of 70,000 or so IPC codes. Historically, examiners assigned to examine patent applications would consult “shoes,” i.e., boxes each associated with a particular sub-class and containing collections of patents grouped together based on subject matter disclosed and claimed by previous inventors. Prior to electronic searching, examiners would search the shoes by hand in an effort to find prior art, a process that was tedious, time-consuming, and inefficient. Electronic databases effectively place patent documents in electronic “shoes” for searching. The electronic documents are now available for additional uses as well.
In many areas and industries, including the financial and legal sectors and areas of technology, for example, there are content and enhanced experience providers, such as The Thomson Reuters Corporation. Such providers identify, collect, analyze, and process key data for use in generating content for consumption by professionals and others involved in the respective industries. Providers in the various sectors and industries continually look for products and services to provide to subscribers, clients, and other customers and for ways to distinguish their firms from the competition. Such providers strive to create and provide enhanced tools, including search tools, to enable clients to more efficiently and effectively process information and make informed decisions.
For example, with advancements in technology and sophisticated approaches to searching across vast amounts of data and documents, e.g., databases of issued patents, published patent applications, etc., professionals and other users increasingly rely on mathematical models and algorithms in making professional and business determinations. Existing methods for applying search terms across large databases of patent documents, for example, have considerable room for improvement, as they frequently do not adequately focus on the key information of interest to yield a focused and well-ranked set of documents that most closely matches the expressed search terms and data. Although such computer-based systems have shortcomings, they represent a significant advancement over searching, identifying, filtering, and grouping IP documents by hand, which is prohibitively time-intensive, costly, inefficient, and inconsistent.
Search engines are used to retrieve documents in response to user-defined queries or search terms. To this end, search engines may compare the frequency of terms that appear in one document against the frequency of those terms as they appear in other documents within a database or network of databases. This aids the search engine in determining the relative “importance” of the different terms within the document, and thus in determining the documents that best match the given query. One method for comparing terms appearing in a document against a collection of documents is called Term Frequency-Inverse Document Frequency (TFIDF). TFIDF assigns a weight as a statistical measure used to evaluate the importance of a word to a document in a collection of documents, or corpus. The relative “importance” of the word increases proportionally to the number of times, or “frequency,” such word appears in the document. The importance is offset by the frequency of that word across the documents comprising the corpus. The inverse document frequency component of TFIDF is expressed as log (N/n(q)), where q is the query term, N is the number of documents in the collection, and n(q) is the number of documents containing q. TFIDF and variations of this weighting scheme are typically used by search engines, such as Google, as a way to score and rank a document's relevance given a user query. Generally, a document may be ranked in relevance by summing the scores associated with each term included in the user query. The documents responsive to the user query may be ranked and presented to the user based on relevancy as well as other determining factors.
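By way of illustration only, the TFIDF weighting and per-term summation described above may be sketched in Python as follows. The function names (idf, score, rank), the whitespace tokenization, and the use of a raw term count as the term frequency are assumptions made for this sketch; they are not taken from any particular search engine, which would typically apply additional normalization and other ranking factors.

```python
import math
from collections import Counter


def idf(term, docs):
    """Inverse document frequency: log(N / n(q)).

    N is the number of documents; n(q) is the number containing the term.
    Returns 0.0 when no document contains the term (an assumption made
    here to avoid division by zero).
    """
    n_q = sum(1 for doc in docs if term in doc)
    if n_q == 0:
        return 0.0
    return math.log(len(docs) / n_q)


def score(query_terms, doc, docs):
    """Sum, over the query terms, of raw term count in doc times IDF."""
    counts = Counter(doc)
    return sum(counts[t] * idf(t, docs) for t in query_terms)


def rank(query, corpus):
    """Return document indices ordered from most to least relevant."""
    # Hypothetical tokenizer: lowercase and split on whitespace.
    docs = [text.lower().split() for text in corpus]
    terms = query.lower().split()
    scored = [(score(terms, doc, docs), i) for i, doc in enumerate(docs)]
    # Sort by descending score; ties keep the original document order.
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


corpus = [
    "a valve assembly for a pump",
    "a pump housing with a pump seal",
    "a method of ranking documents",
]
print(rank("pump seal", corpus))
```

In this small corpus the word "a" appears in every document, so its inverse document frequency is log (3/3) = 0 and it contributes nothing to any score, whereas "seal" appears in only one document and is weighted most heavily.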
Incorporated by reference is U.S. Pat. Publ. 2011/0191310 (Liao et al.) entitled Method and System For Ranking Intellectual Property Documents Using Claim Analysis.