The volume of information available today in many domains precludes exhaustive inspection. Even when attempting to restrict attention to sub-domains of interest, academic and industrial researchers and developers cannot give attention to the constant deluge of new documents published. In this context, automated search services are essential.
Search systems typically perform two roles. One is the provision of information via the documents they present to users. Another is the demonstration that the presented documents are the documents that contain the desired information. The popular Google search system is used primarily in the first of these roles. Its users want certain information. Once that information has been delivered, through presentation of the “best” documents for the purpose as ranked by known and proprietary methods, the possible existence of other documents providing similar information, perhaps using different terminology or in different languages, drops to marginal importance. On the other hand, intellectual-property lawyers doing prior-art searches are not interested just in the information contained in patent documents. It is their job not to miss any document that is sufficiently related in its content to the concern at hand, even if its information is couched in different verbiage or uses nonstandard or erroneous spellings, and even if some documents of very similar content have already been identified. Whereas a Google user typically looks no further than the first ten or twenty returned results, a patent prior-art searcher may individually inspect (to some depth) hundreds of results from a single search.
When using a search system in the second of these roles, the user has had to balance two strategies, one favoring “recall,” i.e., minimizing the search misses, the documents of interest not identified in the search results; and the second favoring “precision,” i.e., minimizing the false hits, documents identified in the search results that are not actually of interest. Recall is essential in that there may be significant adverse repercussions to having missed a relevant document. On the other hand, precision is essential simply in that at some stage of the workflow human resources begin to be required to evaluate the documents obtained, and human resources are limited. It is not efficient to squander them on documents that are not relevant, if only the screening out of these irrelevant documents could be automated via the search system itself.
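By way of illustration only, the two measures described above can be computed as in the following minimal sketch. The function name `precision_recall` and the document identifiers are hypothetical and are not taken from any particular search system; the sketch assumes that the set of truly relevant documents is known, which in practice it generally is not.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one set of search results.

    retrieved: identifiers of documents returned by the search
    relevant:  identifiers of documents actually of interest (assumed known)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were found
    # Precision falls as false hits (retrieved but not relevant) accumulate.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall falls as misses (relevant but not retrieved) accumulate.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four results returned, three documents of interest.
p, r = precision_recall({"D1", "D2", "D3", "D4"}, {"D2", "D4", "D5"})
# Two of the four results are relevant (p == 0.5);
# two of the three relevant documents were found (r == 2/3).
```

In this example the miss (D5) lowers recall, while the false hits (D1 and D3) lower precision and would each consume reviewer time.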
The sophisticated search systems operating against patent, academic, and legal literature, and other such large corpora regularly accessed by the respective professionals, offer a host of operators including score-propagating versions of the Boolean (logical) sentential connectives. Professional users make extensive use of the Boolean operators as they navigate between the goals of recall and precision. To favor recall, the user amplifies search queries with additional clauses connected by the Boolean OR operator, these clauses attempting to account for different languages, terminologies within each language, grammatical forms, and variant spellings and frequent misspellings. Each such clause has the potential of pulling in its own set of unrelated results along with the otherwise unretrieved desired results it was intended to capture. That is, each OR-ed clause intended to improve recall threatens precision. Conversely, the user can favor precision by amplifying search queries with additional clauses connected by the Boolean AND (or, equivalently, BUTNOT) operator. Of course, such clauses, while enhancing precision, threaten recall.
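The tension described above can be seen in a toy example. The following sketch is illustrative only: the five-document corpus, the relevance judgments, and the helper names `term` and `evaluate` are all hypothetical, and plain set union and intersection stand in for the Boolean OR and AND operators without the score propagation that professional systems provide.

```python
# Hypothetical corpus; the searcher is interested in lithium-based batteries.
DOCS = {
    "D1": "lithium battery with graphite anode",
    "D2": "li-ion accumulator cell with graphite anode",
    "D3": "prison cell door lock",
    "D4": "battery of software tests",
    "D5": "cell biology textbook",
}
RELEVANT = {"D1", "D2"}  # assumed ground truth for the illustration


def term(word):
    """Return the ids of documents containing the exact word."""
    return {doc_id for doc_id, text in DOCS.items() if word in text.split()}


def evaluate(retrieved):
    """Return (precision, recall) of a result set against RELEVANT."""
    hits = retrieved & RELEVANT
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(RELEVANT)
    return precision, recall


# Starting query: battery
q1 = term("battery")                  # {D1, D4}
# Widen with OR to catch the variant terminology "cell".
q2 = term("battery") | term("cell")   # adds D2, but also D3 and D5
# Narrow with AND to screen out the unrelated results.
q3 = q2 & term("lithium")             # {D1}; D2 says "li-ion" and is lost

print(evaluate(q1))  # (0.5, 0.5)
print(evaluate(q2))  # (0.4, 1.0)  recall improves, precision drops
print(evaluate(q3))  # (1.0, 0.5)  precision improves, recall drops
```

Restoring the lost document would require yet another OR-ed clause (for example, one covering "li-ion"), which is how such queries come to be patched repeatedly.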
In fact, in iteratively applying patch after patch to their search queries to attend either to recall or to precision, patent searchers have tended to accrue queries of hundreds of search terms. It takes a long time to develop such queries, and they are exceedingly difficult to maintain. This presents a significant and persistent problem in need of a solution.
Moreover, as geographic and communication boundaries, both virtual and physical, become increasingly blurred or disappear altogether, people with different native languages become increasingly undifferentiated, at least in terms of goals, interests, and jurisdiction. One area of particular difficulty is enabling a wide, divergent, and multinational population of users to effectively identify and retrieve information of interest across an ever-expanding universe of documents that includes content in multiple languages. In the area of patents, for example, tens of millions of granted patents and patent applications have been published by the patent offices of the U.S., the European Patent Organization (EPO), Japan, France, Germany, the United Kingdom, and many other countries. In addition to issued patents and pending patent applications in numerous jurisdictions, the number of published research papers and technical and other journals that are available for searching and reviewing, and hence in need of effective search access, continues to grow. A growing problem with regard to patent searching, technical research paper searching, and the like is that many geographically and linguistically diverse people are brought together legally and by interest. While this is, of course, a benefit to society, the linguistic diversity of documents, in addition to their sheer aggregate volume, poses a problem for intelligent access to the documents and for the technologies intended to support such access.
In the context of the patent domain, the U.S. Patent Office uses a subject-matter-based classification system to place submitted patent applications in technology centers, classes, and sub-classes of art to more efficiently handle the searching and the granting, or denying, of patent claims. In addition, the International Patent Classification (“IPC”) further classifies patents and applications by subject matter. Historically, examiners assigned to examine patent applications would consult “shoes,” i.e., boxes each associated with a particular sub-class and containing collections of patents grouped together based on subject matter disclosed and claimed by previous inventors. Prior to electronic searching, examiners would consult the shoes by hand in an effort to find prior art, a process that was tedious, time-consuming, and inefficient. Electronic databases effectively place patent documents in electronic “shoes” for searching, and both governmental and proprietary systems attach keyword-dense fields to patents.
In many areas and industries, including the financial, accountancy, and legal sectors and scholarly, institutional, and corporate research and other areas of technology and development, for example, there are content and enhanced experience providers, such as The Thomson Reuters Corporation. Such providers provide repositories of content, and guidance materials and other resources to assist users in their respective field of interest. Such providers help identify, collect, analyze and process key data for use in generating content, such as law related reports, research papers, financial analysis and data products, articles, etc., for consumption by professionals and others involved in the respective industries, e.g., lawyers, accountants, researchers, professors, financial analysts, etc. Providers in the various sectors and industries continually look for products and services to provide subscribers, clients and other customers and for ways to distinguish their firms over the competition. Such providers strive to create enhanced tools, including search and ranking tools, to enable clients to more efficiently and effectively process information and make informed decisions.
For example, with advancements in technology and sophisticated approaches to searching across vast amounts of data and documents, e.g., databases of issued patents, published patent applications, etc., professionals and other users increasingly rely on mathematical models and algorithms to enhance the delivery of professional services, e.g., to enhance search and retrieval of documents of interest responsive to a user-entered set of query terms. Existing methods for applying search terms across large databases of documents, for example patent documents, have considerable room for improvement: they frequently fail to focus on the key information of interest and therefore do not yield a focused, well-ranked set of documents that closely matches the searcher's intent as expressed by the entered search terms.
Prior efforts to enhance searching include Thomson Reuters' Results Plus function, which is in part implemented in Westlaw-based services and is disclosed in U.S. patent application Ser. No. 11/028,476, the disclosure of which is incorporated herein in the entirety. In the intellectual property and patent area, Thomson Reuters' patent claims analyzer function, disclosed in U.S. application Ser. No. 12/658,165, the disclosure of which is incorporated herein in the entirety, applies natural language processing to patents and pending applications. In addition, concept searching techniques are disclosed in U.S. Pat. No. 8,321,425 (Custis et al.), the disclosure of which is incorporated herein in the entirety; in T. Custis and K. Al-Kofahi, “A new approach for evaluating query expansion: Query-document term mismatch,” in Proc. of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 575-582, ACM, 2007; and in T. Custis and K. Al-Kofahi, “Investigating external corpus and clickthrough statistics for query expansion in the legal domain,” in Proc. of the 17th Conference on Information and Knowledge Management (CIKM), pages 1363-1364, ACM, 2008 (referred to collectively herein as “Custis-Al-Kofahi”).
Compared to existing methods, what is needed are systems that provide: 1) easier expression of the searcher's interest, including automatic accommodation of different languages of search-term entry, the responsive documents to be found independent of language and of intra-language linguistic variants; 2) smarter determination of the searcher's narrower and broader area(s) of interest; and 3) improved relevance ranking to enable the searcher to decide how far afield to go from the documents most narrowly focused on the expressed area of interest—which documents should be accumulated right at the top of returned search results.