Field of the Invention
The present invention relates generally to the field of document searching, data mining and data visualization.
Description of the Related Art
The field of data searching and data/text mining is replete with various search methods and algorithms for helping determine the identity and/or location of documents that may have relevance to a particular subject matter of interest. The most basic search techniques involve locating specific words or word combinations within one or more of a quantity of documents contained in a database. This search methodology, while very simple to implement, suffers from a number of significant drawbacks, including slow search processing time, limited ability to construct and execute complex search queries, and other well-documented limitations inherent in the use of keywords as search criteria. Improvements to the basic keyword search include the use of structured queries (e.g., based on Boolean logic), word stemming, wildcards, fuzzy logic, contextual analysis and latent semantic analysis.
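By way of illustration only, the basic keyword search and the Boolean structured query mentioned above may be sketched as follows. The sketch assumes a small in-memory collection; the function names (build_index, search_all, search_any) and the sample documents are hypothetical and are not taken from any particular search product.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search_all(index, terms):
    """Boolean AND: documents containing every query term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_any(index, terms):
    """Boolean OR: documents containing at least one query term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


# Hypothetical sample collection, for illustration only.
docs = {
    1: "wireless sensor network",
    2: "sensor calibration method",
    3: "wireless charging pad",
}
index = build_index(docs)
both = search_all(index, ["wireless", "sensor"])
either = search_any(index, ["wireless", "sensor"])
```

Even in this small example, the query terms match documents 2 and 3 on a single shared word, although neither may concern the subject matter actually of interest.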
Despite its well-documented drawbacks, simple keyword-based searching is still a good entry point to quickly locate documents of general interest to a relevant subject matter. It is sufficient in many searching applications to locate a particular desired piece of information contained within one or more documents being searched. However, there are many specialized searching applications, particularly in the science, technology, academic and legal fields, where keyword searching (even with the various improvements to date) provides an unsatisfactory approach for locating some or all of the relevant documents that may be of interest to a researcher. The primary underlying difficulty is that words and word phrases are imprecise by their nature. The same words and word phrases can have completely different meanings in different associative contexts. As a result, keyword-based searching in these and other specialized searching applications tends to be a slow and tedious process, typically producing significant numbers of irrelevant documents or “false hits” and often failing to turn up one or more desired relevant documents.
More advanced searching techniques rely on contextual or bibliographical linkages between two or more documents. For example, U.S. Pat. No. 6,754,873 issued Jun. 22, 2004 to Law et al. describes a search technique for finding related hyperlinked documents located on the World Wide Web using link-based analysis. In this case, backlink and forwardlink sets are utilized to find web pages that are related to a particular selected web page of interest. The resulting list of related web pages is typically sorted in accordance with a calculated relevancy score, the intent being that the most relevant and/or highest quality hits would be listed toward the top of the search results page and the least relevant and/or lowest quality hits would be listed toward the bottom of the search results page.
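As a minimal sketch of the general idea of link-based relatedness, and not the specific method of the cited patent, pages may be scored by how many backlinking pages they share with a selected page. The function name related_pages, the co-citation scoring rule and the sample link graph are assumptions made for illustration.

```python
from collections import defaultdict


def related_pages(links, target):
    """Rank pages by how many pages link to both them and the target.

    links maps each page to the set of pages it links to (its forward links).
    The co-citation count used as the score is an illustrative assumption.
    """
    # Invert the forward links to obtain the backlink set of every page.
    backlinks = defaultdict(set)
    for src, dsts in links.items():
        for dst in dsts:
            backlinks[dst].add(src)

    scores = defaultdict(int)
    # Every page that links to the target votes for its other link targets.
    for src in backlinks[target]:
        for sibling in links[src]:
            if sibling != target:
                scores[sibling] += 1

    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical link graph, for illustration only.
links = {
    "a": {"x", "y"},
    "b": {"x", "y", "z"},
    "c": {"x", "z"},
    "d": {"x", "z"},
}
ranked = related_pages(links, "x")
```

In this example, page "z" is listed ahead of page "y" because three of the pages linking to "x" also link to "z", whereas only two also link to "y".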
Relevancy scores are typically calculated as an arbitrary score or metric based on one or more selected factors determined (or assumed) to be informative as to the quality or relevance of the search output relative to the search input. For example, the search engine may assign an arbitrary rank or score to each hit calculated according to the number or frequency of keyword occurrences in each document, the intent being that the total score would roughly correspond to the relevance or importance of the particular located document relative to the input search query. Another example, described in the article entitled “The Anatomy of a Large-Scale Hypertextual Search Engine,” by Sergey Brin and Lawrence Page, assigns a degree of importance to a web page based on the link structure of the web page. In this manner, the Brin and Page algorithm attempts to quantify the importance of a web page based not on its content, but on the number and quality of linkages to and from other web pages.
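The two kinds of score described above may be sketched, under simplifying assumptions, as follows. The first function counts keyword occurrences in a document. The second is a simplified, normalized iteration in the spirit of link-structure ranking and is not the exact formulation of the Brin and Page article. The function names, the damping value of 0.85, the iteration count and the handling of pages without outgoing links are all assumptions made for illustration.

```python
import re
from collections import Counter


def term_frequency_score(document, query_terms):
    """Score a document by the total number of query-term occurrences."""
    counts = Counter(re.findall(r"[a-z0-9]+", document.lower()))
    return sum(counts[t.lower()] for t in query_terms)


def link_importance(links, damping=0.85, iterations=50):
    """Estimate page importance from link structure alone.

    links maps each page to the list of pages it links to. A page passes
    a share of its own score to each page it links to, so a link from an
    important page counts for more than a link from an unimportant one.
    """
    pages = set(links) | {d for dsts in links.values() for d in dsts}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, ())
            if outs:
                share = damping * rank[p] / len(outs)
                for dst in outs:
                    new_rank[dst] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # score evenly over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


tf = term_frequency_score("Patent search: search the patent database",
                          ["search", "patent"])
rank = link_importance({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In the sample graph, page "c" receives the highest score because two pages link to it, and page "b" receives the lowest because no page links to it, regardless of the textual content of any page.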
U.S. Pat. No. 6,526,440 issued Feb. 25, 2003 to Bharat and assigned to Google, Inc. describes a similar search engine for searching a corpus of data and refining a standard relevancy score based on the interconnectivity of the initially returned set of documents. The search engine obtains an initial set of relevant documents by matching search terms to an index of a corpus. A re-ranking component in the search engine then refines the initially returned document rankings so that documents frequently cited in the initial set of relevant documents are preferred over documents that are less frequently cited within the initial set. The resulting hits in each case are typically displayed in a text-scrolled list, with the relative placement of each hit on the list being determined in accordance with the calculated relevancy score. This, in essence, is the primary search and relevance ranking algorithm behind the popular Google® search engine.
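The re-ranking step may be illustrated with the following minimal sketch, in which a document's initial relevancy score is boosted according to how many other documents in the initially returned set cite it. The function name rerank, the multiplicative boost and the weight of 0.5 are hypothetical choices made for illustration and are not the formula of the cited patent.

```python
def rerank(initial_scores, cites, weight=0.5):
    """Re-rank an initial result set by citations from within that set.

    initial_scores maps document id to its initial relevancy score.
    cites maps document id to the set of document ids it cites.
    Citations to or from documents outside the initial set are ignored.
    """
    initial = set(initial_scores)
    local_citations = {doc: 0 for doc in initial}
    for src in initial:
        for dst in cites.get(src, ()):
            if dst in initial and dst != src:
                local_citations[dst] += 1

    # Assumed combination rule: each local citation adds a fixed
    # fraction of the initial score.
    refined = {
        doc: initial_scores[doc] * (1.0 + weight * local_citations[doc])
        for doc in initial
    }
    return sorted(refined, key=lambda doc: (-refined[doc], doc))


# Hypothetical initial result set, for illustration only.
initial_scores = {"d1": 1.0, "d2": 0.9, "d3": 0.8}
cites = {"d1": {"d3"}, "d2": {"d3", "d9"}, "d3": set()}
order = rerank(initial_scores, cites)
```

Here document "d3" moves from last place to first because both other documents in the initial set cite it, while the citation to "d9" has no effect because "d9" is outside the initial set.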
As with the Google® search engine, many of the more sophisticated search engines today are primarily optimized toward the task of searching the World Wide Web for relevant documents of a general-content nature, typically focusing on a single item of information or a single concept. Most searches conducted using these types of search algorithms seek to find particular items of information that are essentially known to exist and that can be described with a few simple keywords. The probability that a user would be able to successfully use a search engine in this context to locate at least one source of information satisfying the user's need is fairly high. However, in certain specialized searching applications, particularly in the science, technology, academic and legal fields, conventional search engines provide an unsatisfactory approach for locating some or all of the relevant documents that may be of interest to a researcher.
For example, those skilled in the intellectual property arts and the patent legal field in general will readily appreciate the difficulty and challenge of searching through vast databases of case law, patents and related scientific documents looking for “prior art” documents relevant to a particular issued patent or pending application and/or cases relevant to a particular point of law. For patents, the difficulty and challenge stem from the confluence of several unique factors affecting patents and patent-related documents. These factors include the sheer volume of potentially relevant patent documents and related scientific literature (estimated at over 80 million documents worldwide), latent inaccuracies and inconsistencies in the technology classifications used by the various national and international patent offices, the complex scientific nature of patent disclosures, the ever-evolving lexicon for describing novel patented concepts and structures, language translation issues in the case of relevant foreign patent documents and scientific literature, and the proclivity of patent attorneys and agents to use complex legalese and coined lexicon to describe novel concepts. The purpose of the patent search is also quite different from that of the normal search context. The point is not so much to find useful information relevant to a concept of interest, but to establish and document legal evidence of the existence or non-existence of a particular concept or idea in combination with one or more other related concepts or ideas at a particular point in time.
Traditional search engines are not particularly adept at efficiently handling these and other types of specialized searching applications. The standard input/output text interface of most conventional search engines also does a poor job of displaying and communicating input/output search criteria and search results in a way that facilitates intuitive understanding and visualization of the logical relationships sought to be explored between two or more related concepts being searched. It would be of particular benefit to provide an improved search algorithm, database and user interface that would overcome or at least mitigate some or all of the above-noted problems and limitations.