There are many techniques for searching a database to retrieve relevant documents and publications in response to a query provided by a user. Searches are conducted for different reasons. Many searches are undertaken in an attempt to find material of interest for research and other purposes. A user conducting such a search may know of the existence of a desired document or publication, such as a book, and may be seeking to determine whether the database under investigation contains the desired document or publication, or other related documents. Alternatively, a user may construct a query of a database based on one or more keywords in an attempt to retrieve all records relating to an area of interest to the user.
The searching and retrieval of information from databases can also be used as a strategic tool to investigate and determine the activities of market competitors. Pharmaceutical companies are particularly interested in the activities of their competitors. There are large time and dollar costs associated with pharmaceutical research, so before committing resources to a particular area of interest, it is common for pharmaceutical companies to search industry and patent databases to determine what is presently known and understood in that particular area. Further, it is important to determine the nature and scope of technology in the field of interest that might be protected by patents or other intellectual property rights.
Patents provide a limited monopoly right to exploit an invention in a particular jurisdiction to the exclusion of all others, in exchange for providing an enabling disclosure of how the invention works. In the case of pharmaceutical companies, it is particularly important to determine which chemical compounds might be subject to patent protection before committing large resources to research in a given area. Without undertaking a relevant search of patent databases and the like, a company may invest large amounts of time and money to research a new drug, only to find that the drug is protected by a patent granted to a market competitor. However, searching industry and patent databases is difficult, as different publications may utilise different words or expressions in relation to the same subject matter. Thus, a query using a given keyword may not retrieve all relevant publications due to the variance that exists in technical jargon and terminology.
When investigating a particular field of interest, it is known to determine the similarity between two textual documents based on common keywords, as described, for example, in “A Vector Space Model for Automatic Indexing”, G. Salton, A. Wong and C. S. Yang, Communications of the ACM, 18(11), November 1975. Returning to the example of pharmaceutical companies searching patent databases, comparing two documents on the basis of common words may not be appropriate, as pharmaceutical patent documents typically contain many different chemical and biological terms, and the same subject matter may be described using different terms in different documents.
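By way of illustration only, keyword-based similarity of the kind referred to above can be sketched as the cosine of the angle between two term-frequency vectors. The sketch below is not taken from the cited paper: the tokeniser, the use of raw counts without weighting, the function names and the sample sentences are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map a document to raw term frequencies (lower-cased word tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # A Counter returns 0 for a missing term, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample texts, invented for this example.
doc_a = term_vector("A tablet comprising diazepam for treating anxiety")
doc_b = term_vector("A tablet comprising valium for treating anxiety")
doc_c = term_vector("A method of indexing documents in a database")

sim_ab = cosine_similarity(doc_a, doc_b)  # six of seven terms shared
sim_ac = cosine_similarity(doc_a, doc_c)  # only the word "a" is shared
```

In this sketch the terms "diazepam" and "valium" are treated as unrelated tokens, so the score for the first pair reflects only the surrounding words and not the fact that both texts concern the same compound.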
Many jurisdictions provide patent databases that can be accessed from a remote computer terminal, typically via an Internet-based interface. For example, the records of the United States Patent and Trademark Office can be accessed via the Internet at the Uniform Resource Locator (URL) uspto.gov/patft/. Other patent databases are provided by, for example, the European Patent Office, the Australian Patent Office, and the Japanese Patent Office. Online patent databases typically allow traditional keyword-based searches on various fields of a patent or patent application. The searchable fields can include, for example, the name of an inventor, the assignee, and the title. However, under some circumstances simple keyword-based searches are inadequate. For example, a scientist about to file a patent application for a new invention requires more complex retrieval techniques to identify existing patents and patent applications that are similar to the new invention. Further, a company seeking to identify relationships with a competitor based on their assigned patents also requires more complex retrieval techniques than those afforded by traditional keyword-based search techniques.
Research systems that utilise different techniques for retrieving information from patent databases have been studied. For example, “Evaluating Document Retrieval in Patent Database: a Preliminary Report”, M. Osborn et al., Proceedings of the ACM Conference on Information and Knowledge Management, Las Vegas, Nev., 1997 introduces a system that integrates a series of shallow natural language processing techniques into a vector-based document information retrieval system for searching a subset of U.S. patents. Another study, “A Patent Search and Classification System”, L. Larkey, Proceedings of the ACM Digital Library Conference, Berkeley, Calif., 1999 uses a probabilistic information retrieval system for searching and classifying U.S. patents. Another search system is described in “Knowledge Discovery in Patent Databases”, M. Marinescu et al., Proceedings of the ACM Conference on Information and Knowledge Management, McLean, Va., 2002, which attempts to utilise techniques such as correspondence analysis and cluster analysis for mining patents. Some of the challenges in the domain of patent retrieval are discussed in “Workshop on Patent Retrieval: SIGIR 2000 Workshop Report”, N. Kando et al., ACM SIGIR Forum, 34(1):28-30, Apr. 2000.
Traditionally, text-based documents are compared based on the number of similar terms among the documents under comparison. Such techniques may not be reliable, however, for some technical disciplines in which synonyms are frequently used or in emerging areas of technology for which standardised terms are yet to be determined. Such technical disciplines include, for example, the computer science and pharmaceutical domains. In the computer science domain, for example, EJB is a synonym for Enterprise Java Beans. In the pharmaceutical domain, many biomedical concepts are known by a variety of names. Further, biological concepts may be related as a result of belonging to the same class. For example, the terms Amylase and Somatostatin are related, because both are proteins.
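The two difficulties described above, synonymy and class membership, can be illustrated with a minimal sketch. The lookup tables and function names below are hypothetical and hand-written for the example, and contain only the terms mentioned above; a practical system would need a curated thesaurus or ontology in their place.

```python
# Hypothetical lookup tables, limited to the examples given in the text.
SYNONYMS = {"ejb": "enterprise java beans"}
CLASSES = {"amylase": "protein", "somatostatin": "protein"}


def canonical(term):
    """Reduce a term to a canonical form using the synonym table."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)


def same_concept(a, b):
    """True when two terms reduce to the same canonical form."""
    return canonical(a) == canonical(b)


def related_by_class(a, b):
    """True when both terms are known and belong to the same class."""
    class_a = CLASSES.get(canonical(a))
    class_b = CLASSES.get(canonical(b))
    return class_a is not None and class_a == class_b


# Plain string comparison treats the synonym pair as different terms.
plain_match = "EJB" == "Enterprise Java Beans"
synonym_match = same_concept("EJB", "Enterprise Java Beans")
class_match = related_by_class("Amylase", "Somatostatin")
```

Here a direct string comparison fails for the synonym pair, whereas the table-based comparison succeeds, and the two protein names are reported as related although they share no characters of significance.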
Another complication is that a group of molecules may be similar in respect of a nominal attribute or characteristic, even if the formulae for the respective molecules are different. In such circumstances, it is generally not possible to utilise string-based matching techniques on the formulae to identify those molecules that possess a desired attribute or characteristic. Further, a search of a database using a generic or commercial trade name for a chemical composition may not retrieve relevant documents in which the composition is only described with reference to its formulaic representation. For example, 7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE and 7-CHLORO-1-METHYL-5-PHENYL-3H-1,4-BENZODIAZEPIN-2(1H)-ONE are different formulations of Valium. Thus, a simple keyword search for the term “valium” might return documents relating to the first formulation, the second formulation, or neither formulation. One technique for querying protein patents is described in “A Protein Patent Query System Powered by Kleisli”, J. Chen et al., Proceedings of the ACM SIGMOD Conference, Seattle, Wash., 1998. Given a protein sequence, Chen uses patent and protein databases, as well as bioinformatics tools, to identify whether similar protein sequences have already been patented.
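The Valium example can be made concrete with a short sketch of a keyword search with and without query expansion. This is an illustration under stated assumptions only: the alias table, the function names and the four sample documents are hypothetical, and simple substring matching stands in for whatever matching a real patent database performs.

```python
# Hypothetical alias table built from the two names quoted in the text.
ALIASES = {
    "valium": [
        "7-chloro-1-methyl-5-phenyl-2h-1,4-benzodiazepin-2-one",
        "7-chloro-1-methyl-5-phenyl-3h-1,4-benzodiazepin-2(1h)-one",
    ],
}


def expand_query(term):
    """Return the query term followed by any known alternative names."""
    key = term.lower()
    return [key] + ALIASES.get(key, [])


def search(term, documents, expand=True):
    """Return indices of documents containing the term or, optionally, an alias."""
    needles = expand_query(term) if expand else [term.lower()]
    return [
        index
        for index, text in enumerate(documents)
        if any(needle in text.lower() for needle in needles)
    ]


# Hypothetical sample documents, invented for this example.
documents = [
    "Tablets of Valium for oral administration.",
    "A process for preparing 7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE.",
    "Use of 7-CHLORO-1-METHYL-5-PHENYL-3H-1,4-BENZODIAZEPIN-2(1H)-ONE as a sedative.",
    "A method of indexing documents in a database.",
]

keyword_only = search("valium", documents, expand=False)
with_aliases = search("valium", documents, expand=True)
```

In this sketch the unexpanded keyword search finds only the document that uses the trade name, while the expanded search also finds the two documents that refer to the compound by its chemical names. The result depends entirely on the alias table being complete.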
Due to the complexities described above that exist in the pharmaceutical domain, it is known for pharmaceutical companies to employ one or more patent analysts, or to engage an external agency, to manually examine the hundreds of patents retrieved by querying patent databases. This approach to searching patent databases and comparing the documents contained therein is expensive, time-consuming, and subject to human error.
Thus, a need exists for an improved method of comparing two or more publications to determine the similarity of those publications.