The present invention, in some embodiments thereof, relates to a search engine and methodology and, more particularly, but not exclusively, to such a search engine and methodology applicable to patent literature, for carrying out patent searches.
Central to corporations' growth and prosperity is patenting of knowledge assets. The number of patent applications has risen drastically in the past decade reaching about 1 million new applications per year in the United States and Europe alone.
Prior-art search is a critical part of the patent application process and is a determinant of patent scope. When the patent applicant fails to identify all relevant prior-art, the application claims may be rejected by the examiner, or may be subject to costly litigation if the patent is granted.
To be granted, a patent claim has to satisfy two conditions in respect of the prior art: it has to be novel and non-obvious. Novelty means that the claim has to exclusively define a new piece of knowledge that has not been patented in the past and has not been published in public sources. Non-obviousness means that the inventive step, i.e., the technical advance over existing knowledge, has to be something more than just a straightforward change. To determine whether a new patent application is indeed novel and non-obvious, the patent examiner searches for related prior-art in other patent documents and public sources.
The market for prior-art search is developing rapidly following the exponential growth in patent filings. FIGS. 1 and 2 show the increase in the number of patent filings world-wide and across leading patent offices. In 2005, about 1,660,000 patent applications were filed worldwide. Patent application filings have risen at an annual rate of 4.7% since 1995.
Prior-art search occurs at every stage of the innovation process. The inventor conducts prior-art search to research the field and also to examine the novelty of her idea and its patentability; the venture-capitalist conducts prior-art search to assess commercial value; the patent attorney conducts prior-art search when filing patent applications; and the patent examiner searches for prior-art to determine patentability and patent scope. We (conservatively) estimate the annual market size for patent prior-art search at $4 billion (this number reflects two million prior-art searches at a cost of $2,000 per search).
The search for prior-art is also central to the wider market of technology licensing. The market for technology licensing is rapidly increasing and is estimated at billions of dollars per year.
Finally, the search for prior-art is integral to patent litigation, particularly infringement and invalidation suits. About 1,000 patent litigation cases are filed each year in the United States. There are no clear estimates of the monetary transfer associated with these litigations, which can vary from zero (the result of cross-licensing agreements) to hundreds of millions of US dollars (for example, in the Blackberry litigation case RIM paid NTP $612.5 million).
The search for prior-art spans millions of patent documents. The main challenge for automated prior-art search is how to identify scientific relations based on textual features for large scale datasets of patent documents. A common assumption by existing search engines is that semantic similarity of patent documents reflects scientific relatedness. This assumption performs poorly in practice, as scientific relatedness is usually not tied to semantic similarity. In practice, related scientific ideas usually include different scientific concepts. Determining the conceptual relatedness of words and technical phrases requires specialized professional knowledge and evaluations of hundreds of thousands of related technologies. Until recently, such systematic knowledge was almost impossible to obtain. The problem is particularly acute in the software field, where technical usage varies widely.
Current Market Solution
Several for-profit and not-for-profit patent search engines have emerged in the past few years. Leading prior-art search engines are: USPTO, EPO, Google Patent, Dialog and Delphion.
These search engines are mostly based on semantic similarity analysis, also known as the bag-of-words approach (BOW). The search process computes the relatedness of documents based on measures of textual overlap of words in each document or query. Essentially, the central assumption is that patents that represent related scientific ideas share common or similar semantics. To the extent this assumption is violated, the performance of existing search engines would not be satisfactory.
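By way of illustration only, the textual-overlap measure described above can be sketched as follows. This is a minimal sketch assuming a cosine measure over raw word counts; the function names `bag_of_words` and `cosine_similarity` and the sample texts are hypothetical and are not taken from any of the search engines named above, which may use different weighting schemes.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count each word, ignoring word order."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample texts, chosen only to show the behaviour.
query = "permanent magnet generator for wind turbine"
doc_same_words = "wind turbine with permanent magnet generator"
doc_other_words = "brushless motor controller"

# Five of six words are shared, so the score is high.
score_overlap = cosine_similarity(bag_of_words(query), bag_of_words(doc_same_words))
# No words are shared, so the score is zero regardless of technical relatedness.
score_disjoint = cosine_similarity(bag_of_words(query), bag_of_words(doc_other_words))
```

In this sketch a document that shares no words with the query scores zero, whatever its scientific relationship to the query may be.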
The main drawback of the semantic similarity approach, including its extensions (see below), is that it does not provide any information about the conceptual meaning of words and technical phrases. For example, the word x can represent exactly the same idea as the word y. Without external information, or a scientific ‘dictionary’, that informs us that x and y represent the same idea, information retrieval which is based on semantic similarity would fail.
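The role of such a dictionary can be sketched as follows. This is an illustrative sketch only: the mapping `CONCEPTS`, the Jaccard overlap measure and the sample texts are hypothetical assumptions introduced here, not a description of any particular system.

```python
def tokens(text):
    """Split a text into lowercase words."""
    return text.lower().split()


def overlap(a, b):
    """Jaccard overlap: shared distinct words over all distinct words."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical dictionary stating that two different words express one idea.
CONCEPTS = {"automobile": "vehicle", "car": "vehicle"}


def normalize(words):
    """Replace each word by its concept when the dictionary knows it."""
    return [CONCEPTS.get(w, w) for w in words]


doc_x = tokens("automobile component authentication")
doc_y = tokens("car component authentication")

# Without the dictionary, "automobile" and "car" count as unrelated words.
raw = overlap(doc_x, doc_y)
# With the dictionary, both words map to the same concept.
mapped = overlap(normalize(doc_x), normalize(doc_y))
```

Here the overlap rises from 0.5 to 1.0 once the two words are mapped to a common concept; without the external mapping, the textual measure alone has no way to detect that they express the same idea.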
There are four main reasons for the poor performance of semantic search engines in the field of patent prior-art search. First, inventors have an incentive to phrase their inventions in a manner that is as distant as possible from the text of the most related prior-art, hoping this will reduce the risk that the application is rejected by the examiner.
Second, the textual domain used to describe scientific concepts is typically large.
Third, in numerous cases the prior-art cited by the patent examiner is from different technological areas than the application itself, where there is very little textual overlap between the prior-art and the application. For example, U.S. Pat. No. 7,137,001, entitled “Authentication of Vehicle Components” (IPC H04L Transmission of Digital Information), shares very little semantic similarity with U.S. Pat. No. 5,220,604 (IPC G06F Electric Digital Data Processing), entitled “Method for Performing Group Exclusion in Hierarchical Group Structures”. Yet, during the application process of U.S. Pat. No. 7,137,001, the patent examiner cited U.S. Pat. No. 5,220,604 as related prior-art and as a reason to reject the initial application on the grounds of obviousness. Another example is U.S. Pat. No. 7,051,570, entitled, “System and Method for Monitoring a Pressurized System” (IPC G01L Measuring Force, Stress, Torque, Work, Mechanical Power, Mechanical Efficiency, or Fluid Pressure), which was rejected by the patent examiner over U.S. Pat. No. 5,454,024, entitled “Cellular Digital Packet Data Network Transmission System Incorporating Cellular Link Integrity Monitoring” (IPC G08B Signalling).
Fourth, patent documents usually include technical phrases and abbreviations, for example CMOS (Complementary metal-oxide semiconductor) and PMOS (Positive metal oxide semiconductor, or Portable media operating system). Semantic similarity fails to recognize relationships between different technical phrases, as they are likely to have little textual similarity. For example, based on patent examiner evaluations, we find that the technical phrases PMG (permanent magnet generator) and BLDC (brushless DC controller) are related scientifically, although they differ textually.
Additional background art includes U.S. Pat. No. 4,839,853 Computer information retrieval using latent semantic structure, which discloses a methodology for retrieving textual data objects. The information is treated in the statistical domain by presuming that there is an underlying, latent semantic structure in the usage of words in the data objects. Estimates of this latent structure are utilized to represent and retrieve objects. A user query is recouched in the new statistical domain and then processed in the computer system to extract the underlying meaning to respond to the query.
U.S. Pat. No. 5,297,039 Text search system for locating on the basis of keyword matching and keyword, teaches a text information extraction device that extracts analysis networks from texts and stores them in a database. The analysis networks consist of lines each including elements and relations extracted from the texts. The analysis networks are complemented via synonym/near synonym/thesaurus process and via complementary template and the lines thereof are weighted via concept template. A text similarity matching device judges similarity of input and database analysis networks on the basis of agreements of words, word pairs, and lines. A text search system stores texts and complementary term lists prepared therefrom in respective databases. Queries are inputted in the form of analysis networks from which sets of keywords and relations are extracted. After searching the texts and complementary term lists stored in databases with respect to the keywords extracted from each input query, agreements of the sets of keywords and relations are determined.
U.S. Pat. No. 5,963,965 Text processing and retrieval system and method teaches a content-based system and method for text processing and retrieval wherein a plurality of pieces of text are processed based on content to generate an index for each piece of text, the index comprising a list of phrases that represent the content of the piece of text. The phrases are grouped together to generate clusters based on a degree of relationship of the phrases, and a hierarchical structure is generated, the hierarchical structure comprising a plurality of maps, each map corresponding to a predetermined degree of relationship, the map graphically depicting the clusters at the predetermined degree of relationship, and comprising a plurality of nodes, each node representing a cluster, and a plurality of links connecting nodes that are related. The map is displayed to a user, a user selects a particular cluster on the map, and a portion of text is extracted from the pieces of text based on the cluster selected by the user.
U.S. Pat. No. 5,991,751 System, method, and computer program product for patent-centric teaches a system, method, and computer program product for processing data. The system maintains first databases of patents, and second databases of non-patent information of interest to a corporate entity. The system also maintains one or more groups. Each of the groups comprises any number of the patents from the first databases. The system, upon receiving appropriate operator commands, automatically processes the patents in one of the groups in conjunction with non-patent information from the second databases. Accordingly, the system performs patent-centric and group-oriented processing of data. A group can also include any number of non-patent documents. The groups may be product based, person based, corporate entity based, or user-defined. Other types of groups are also covered, such as temporary groups.
U.S. Pat. No. 6,298,327 Expert support system for authoring invention disclosures teaches a computer-implemented expert support system for authoring invention disclosures and for evaluating the probable patentability and marketability of a disclosed invention. The system comprises at least a computer, an input device, an output device, and a software program. The software program is developed with an object-oriented design process and is implemented in an object-oriented computer language such as C++. The system facilitates communication of invention characteristics and enables output of invention disclosures in a plurality of formats, including that of a patent application.
U.S. Pat. No. 6,363,378 Ranking of query feedback terms in an information retrieval system teaches an information retrieval system that processes user input queries and identifies query feedback, including ranking the query feedback, to assist the user in re-formatting a new query. A knowledge base, which comprises a plurality of nodes depicting terminological concepts, is arranged to reflect conceptual proximity among the nodes. The information retrieval system processes the queries, identifies topics related to the query as well as query feedback terms, and then links both the topics and feedback terms to nodes of the knowledge base with corresponding terminological concepts. At least one focal node is selected from the knowledge base based on the topics to determine a conceptual proximity between the focal node and the query feedback nodes. The query feedback terms are ranked based on conceptual proximity to the focal node. A content processing system that identifies themes from a corpus of documents for use in query feedback processing is also disclosed.
U.S. Pat. No. 6,452,613 System and method for an automated scoring tool for assessing new technologies teaches an apparatus and method for an automated invention submission and scoring tool for evaluating invention submissions. The system comprises a server system and a plurality of server systems. The server system presents submission questionnaires over a networked connection to submitters at user systems. The user completes the questionnaires, which are returned to the server system for processing. The server system processes the answers to provide a quantified evaluation of the submission based on patentability and at least one other parameter, such as impact or value. An evaluator at an evaluator system can view a presentation of the quantified assessment of the invention submission. The evaluator can also view the results of multiple invention submissions on a status overview page. Links between the status overview page, individual questionnaires, and individual assessment presentations are provided.
U.S. Pat. No. 6,542,889 Methods and apparatus for similarity text search based on conceptual indexing teaches a method of performing a conceptual similarity search comprising the steps of: generating one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search; building a conceptual index of documents with the one or more word-chains; and evaluating a similarity query using the conceptual index. The evaluating step preferably returns one or more of the closest documents resulting from the search; one or more matching word-chains in the one or more documents; and one or more matching topical words of the one or more documents.
U.S. Pat. No. 7,054,856 System for drawing patent map using technical field word and method discloses a system and a method for drawing a patent map using a technical field word. In the system and the method, a word to be used for drawing a patent map is extracted by calculating weight values of significant words which are obtained by removing unnecessary words from patent data, and this extracted word is matched with a patent to draw the patent map.
U.S. patent application Ser. No. 11/697,447 Enhanced Patent Prior Art Search Engine teaches a search engine configured to search a database of documents and provide search results to an end user. The search engine may be configured to provide the end user with a list of synonyms for terms in the search query submitted by the end user and allow the end user to identify those synonyms which should be included in the search engine. Alternatively or additionally, the search engine may be configured to provide the end user with survey questions, the answers to which may be used to further define the search query. The database may include notes and/or advertisements that are associated with specific documents in the database.
U.S. patent application Ser. No. 11/745,549 Systems and Methods for Analyzing Semantic Documents Over a Network teaches systems and methods for processing an intellectual property (IP) by providing an automated agent to execute one or more searches for a user to locate one or more documents relating to an IP interest, the agent accessing a user profile to determine the user's IP interest and identifying one or more IP documents each having a tag responsive to the IP interest; ranking one or more documents located by the automated agent; and displaying the one or more documents located by the automated agent.
U.S. patent application Ser. No. 11/809,455 Concept based cross media indexing and retrieval of speech teaches that indexing, searching, and retrieving the content of speech documents (including but not limited to recorded books, audio broadcasts, recorded conversations) is accomplished by finding and retrieving speech documents that are related to a query term at a conceptual level, even if the speech documents do not contain the spoken (or textual) query terms. Concept-based cross-media information retrieval is used. A term-phoneme/document matrix is constructed from a training set of documents. Documents are then added to the matrix constructed from the training data. Singular Value Decomposition is used to compute a vector space from the term-phoneme/document matrix. The result is a lower-dimensional numerical space where term-phoneme and document vectors are related conceptually as nearest neighbors. A query engine computes a cosine value between the query vector and all other vectors in the space and returns a list of those term-phonemes and/or documents with the highest cosine value.
U.S. patent application Ser. No. 11/812,135 System and method for analyzing patent value teaches, in at least one exemplary embodiment, a system, computer program product and method for evaluating the value of a legal document such as a patent-related document. In accordance with at least one exemplary embodiment, a Latent Semantic Analysis (“LSA”) search engine can search a database of patent-related documents to identify an “N” number of patent-related documents deemed thereby as relevant to a target document and indices of the target patent-related document can be compared and scored against the indices of the relevant identified patent-related documents. At least one exemplary embodiment evaluates a plurality of indices of patent-related document value using legal, commercial and/or technological factors.