A technical problem presently exists in the attempt to use modern-day search engines for searching for documents on the world wide web (the “web”). Generally, the problem facing users is that almost all search engines search for key words in all or portions of the documents. The problem with key word searches is that an extremely large number of documents are usually returned by the search engine, all of which typically must be read or scanned to find those few documents or that one document that contains the desired information. Lexis™, Altavista™, and Yahoo™ are examples of such key-word based search systems. Some specialized databases, such as the database of U.S. issued patents contained at the site delphion.com and at the U.S. Patent Office web site, uspto.gov, permit customized searches with known parameters in lieu of key words, such as inventor name, assignee name, patent agent name, etc., but also include key-word searches. These searches suffer from the same malady: returning many documents which must generally be read to find the pertinent ones.
An article titled “The Search Engine as Cyborg” by Lisa Guernsey, The New York Times, Jun. 29, 2000, further describes the problem. The article explains that “To cope, many search engines have concluded that simply indexing more pages is not the answer. Instead, they have decided to rely on the one resource that was once considered a cop-out: human judgment. Search engines have become more like cyborgs, part human, part machine.” For example, a highly ranked search service is AskJeeves™, which prods people to narrow their queries by picking from a list of questions and answers written by the company's employees.
Both Google™ and Northern Light™ rely on computers and software to scan and index the Web, but human judgment is part of the mix. At Google, Web pages that are linked from authoritative Web sites are deemed most relevant. At Northern Light, librarians constantly fine-tune their directory structure and come up with names of categories used for sorting Web sites. Similarly, some music sites appear to have songs indexed with ratings by distributors or listeners as to genre and type (vocalist, instrumental, folk, jazz, hip-hop, etc.) so that selections by these criteria can be made. See, for example, listen.com.
Some other efforts have been made to solve this problem. For example, Manning & Napier Information Services Inc.™ of Rochester, N.Y. has several products whose technologies are based on research and development in information retrieval (IR) and artificial intelligence (AI), including natural language processing (NLP), information extraction, agents, link analysis, question-answering, data visualization, data fusion, knowledge discovery, knowledge management, genetic algorithms, neural nets, and cross-language information retrieval (CLIR). This system is built around a process whereby the searcher is requested to give the system much more data than just a few key words (a paragraph, for example, to attempt to describe the document contents). The system then constructs a linguistic vector based upon the paragraph given as the search argument and attempts to find equivalent vectors in its document databases. This is not a general Internet search engine system but rather a proprietary one that has its own databases of documents which have been previously processed to produce linguistic vectors which characterize the documents, based on the word contents of the documents.
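The general idea of matching a paragraph-length search argument against pre-computed document vectors can be illustrated with a minimal sketch. This is not the Manning & Napier implementation, which uses proprietary linguistic vectors; the sketch below assumes only simple word-frequency vectors compared by cosine similarity, with hypothetical document contents.

```python
from collections import Counter
from math import sqrt

def word_vector(text):
    """Build a simple bag-of-words frequency vector from free text."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    return Counter(w for w in words if w)

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse word-frequency vectors."""
    dot = sum(v1[w] * v2[w] for w in v1 if w in v2)
    norm1 = sqrt(sum(c * c for c in v1.values()))
    norm2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Pre-process a (hypothetical) document collection into vectors, then rank
# the collection against a paragraph-length query vector.
documents = {
    "doc1": "clinical trial of a new vaccine for influenza patients",
    "doc2": "stock market analysis of technology companies",
}
index = {name: word_vector(text) for name, text in documents.items()}
query = word_vector("results of vaccine trials in influenza patients")
ranked = sorted(index, key=lambda n: cosine_similarity(query, index[n]),
                reverse=True)
```

Note that this sketch still depends entirely on shared words between query and document; the two-step structure (index first, match vectors second) is the point of correspondence with the system described above.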
Another approach to solving the basic key word search problem has been developed by Dr. William Woods, at Sun Microsystems™, Inc. Laboratories. Dr. Woods has addressed the problem wherein the articulation of the desired subject matter is different from that used by the authors of the documents being searched. This is sometimes referred to as the “synonym problem,” although Dr. Woods characterizes the problem in a broader connotation by referring to it as the “paraphrase problem”; his general solution approach is called “conceptual indexing” and more specifically “subsumption technology.” Subsumption technology is used to automatically integrate syntactic, semantic, and morphological relationships among concepts that occur in the material, and to organize them into a structured conceptual taxonomy that is efficiently useable by retrieval algorithms and also effective for browsing. Dr. Woods' conceptual indexing approach is described in a number of papers including
“Natural Language Technology in Precision Content Retrieval” by Jacek Ambroziak and William A. Woods, Proceedings of the International Conference on Natural Language Processing and Industrial Applications, Aug. 18–21, 1998, Moncton, New Brunswick, Canada, and
“Knowledge Management Needs Effective Search Technology,” by William A. Woods, Sun Journal, March, 1998
both of which are incorporated fully herein by reference.
As these papers describe, the Sun Microsystems Laboratories' Conceptual Indexing Project was created to address the problems cited above and to improve the convenience and effectiveness of online-information access. A central focus of this project is the “paraphrase problem,” in which the words in a query are different from, but conceptually related to, those in the material one needs. This project developed techniques that use knowledge of word and phrase meanings and their inter-relationships to find correspondences between the words used in a request and concepts that occur in text passages.
In this solution to the problem, they use taxonomic subsumption algorithms that exploit generality, or subsumption, rather than synonymy. That is, when a concept is more general than another, the more general concept is said to subsume the more specific one and concepts are organized around the notion of conceptual subsumption rather than synonym classes. This relates more general concepts to more specific ones without losing information and enables a retrieval algorithm to automatically find subsumed concepts. The algorithms do not automatically explore more-general terms, so the level of generality is controlled by the searcher's choice of query terms. For example, if one asked for “motor vehicles,” he would get trucks, buses, cars, etc., whereas if he asked for “automobiles,” he would get cars and taxicabs but not trucks and buses. The algorithm can let one know about more-general concepts that subsume the searcher's query, in case he wants to generalize his request, but it does not make this decision without the user's knowledge and consent.
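The subsumption behavior described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Dr. Woods' subsumption technology: it assumes a hand-built taxonomy mapping each concept to the more specific concepts it subsumes, and documents pre-tagged with concepts, whereas the actual system derives such relationships automatically from syntax, semantics, and morphology.

```python
# Hypothetical taxonomy: each concept maps to the more specific
# concepts it subsumes (drawn from the motor-vehicle example above).
taxonomy = {
    "motor vehicle": ["truck", "bus", "automobile"],
    "automobile": ["car", "taxicab"],
}

def subsumed(concept):
    """Return the concept plus everything it subsumes, transitively."""
    result = {concept}
    for child in taxonomy.get(concept, []):
        result |= subsumed(child)
    return result

def search(query_concept, documents):
    """Match documents tagged with the query concept or any concept
    it subsumes; never silently generalize the query upward."""
    targets = subsumed(query_concept)
    return [name for name, tags in documents.items() if targets & tags]

docs = {
    "d1": {"truck"},
    "d2": {"taxicab"},
    "d3": {"bicycle"},
}
# A query for "motor vehicle" reaches both the truck and the taxicab
# documents; a query for "automobile" reaches only the taxicab document.
```

The key property, as the text notes, is that generality flows only downward from the query term: the searcher's choice of term fixes the level of generality, and moving upward in the taxonomy would require the user's explicit consent.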
This approach is further taught in U.S. Pat. No. 5,724,571 issued Mar. 3, 1998 (Woods) titled “Method and apparatus for generating query responses in a computer-based document retrieval system” which is also incorporated fully herein by reference.
The key concepts in the Woods and Manning & Napier approaches are that a two-step process is required: first, a linguistic vector or structured conceptual taxonomy must be constructed by the indexing engine when the material is indexed, and second, a special retrieval algorithm is used to find either equivalent linguistic vectors or combinations of morphological and semantic subsumption relationships that connect concepts in the request with concepts that occur in the indexed material. While both approaches appear to provide significant efficiency over key word searches, and while the Woods approach appears to be the more efficient of the two, both have the same disadvantages. Both systems require first a baseline database of target documents and second a powerful lexical computing engine to create the linguistic vectors or combinations of morphological and semantic subsumption relationships. Only then can the search technologies of the two be used.
However, these systems, as well as the earlier described databases containing popularity-based ratings, use fixed, pre-determined indexing algorithms to mathematically combine words and phrases into a description vector which can be matched with a similarly computed vector based on search criteria inputted by the user.
What is needed is a database system with individual document ratings from experts in the field, where these expert ratings are based on an accepted taxonomy of attributes for the specific field rather than an unrelated mathematical algorithm. It would be these expert ratings that would be the basis of a search, rather than an algorithmic computation built around the words in the document. And similarly needed is a search engine capable of mapping inputted search attributes to this expert-rated, attribute-indexed database.
Biomedicine is largely a knowledge industry. While a physical product, the medicine, does have to be developed, tested, manufactured and delivered, the knowledge of how to do so and the knowledge of which product works best in particular cases contributes most of the value.
A second characteristic of biomedical knowledge is that it is highly dynamic. At the research level, significant advances in our understanding of biomedical phenomena happen on a weekly basis. Therefore, biomedical professionals have an ongoing need to keep up with the advances relevant to their own specialty area. Such needs have become particularly acute in health-care, because patients can now use the Web to learn about the latest developments themselves; as a result, they demand increasingly detailed and timely information from health-care professionals.
Needs Relating to Centralization
There is as yet no centralized source of biomedical information on the web. The information one seeks may be available somewhere on the web. The hard part is finding it. There are thousands of biomedical Web pages, ranging from individual sites to corporate sites. These sites generally fall into the following categories:
    Government research center sites
    University biomedical sites
    Commercial firm sites (including vendor firms)
    Biomedical journal sites
    Individual researcher/professor sites (usually only a few pages with papers and links)
A list of the major Web sites can be found in an Appendix in the recently published book, “From Alchemy to IPO; The Business of Biotechnology,” by Cynthia Robbins-Roth, Perseus Books Group, 2000, ISBN 0-738202533, which is incorporated herein by reference.
Needs Relating to Search Strategies
Despite the availability of an enormous amount of information, this information is not indexed or summarized for easy consumption.
1. Existing human-edited directories, such as Yahoo, lack the skilled biomedical personnel and the time to adequately index biomedical pages. Because of the cost of having human workers look at each page, such directories generally index only a small fraction of the Web.
2. Existing search engines that mechanically index pages, such as Alta-Vista, also have limitations as indicated above: the number of irrelevant pages generated; and the poor quality of links generated.
Needs Relating to Contextualization
Another problem caused by specialized content is incomplete understanding. No individual is a specialist in all subsets within a particular discipline. Thus, there are always parts of the content that are more understandable than others. This is particularly so when the user is a non-specialist and the content is, say, a biomedical research paper. There is a need to provide information in a form such that the user can quickly grasp the essentials of concepts underlying the content.
Needs Relating to Personalization
An additional issue of importance to the effective dissemination of biomedical content is the manner in which content is served to the user. Virtually all content on the web today is served in a one-size-fits-all mode. Nevertheless, studies have shown that people learn better when content is presented in a manner more suited to their own individual cognitive style.
Needs Related to Multidimensional Taxonomies.
Another problem with presently known search approaches is that they address taxonomies which are, basically, hierarchical, i.e., one-dimensional. However, in many domains, in the biomedical arena for example, an n-dimensional taxonomy is more appropriate. That is, a biomedical development might be considered mundane from a technical standpoint, yet highly significant from a social or business viewpoint. While it is true that this “significance” issue might be expected to be handled by the way the query is structured (i.e., from the technical viewpoint or from the social or business viewpoint), systems such as the Sun and Manning & Napier systems cannot handle these issues because of the pre-defined mathematical indexing algorithms they use.
The solution to these technical problems, therefore, is to provide a method for analyzing a database of documents wherein a multi-dimensional taxonomy of attributes for a specific domain can be developed and used to tag the related documents with significance rating indicia, which can then be searched by a qualitative matching engine.
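The multi-dimensional rating and matching idea can be sketched as follows. This is a minimal illustration, not the claimed method: the attribute names, the 0-10 rating scale, and the threshold-based matching rule are all assumptions introduced for the example, standing in for a domain-specific expert taxonomy and a qualitative matching engine.

```python
# Hypothetical expert ratings: each document is scored along several
# independent taxonomy dimensions (assumed 0-10 scale).
ratings = {
    "paper_a": {"technical": 3, "social": 9, "business": 8},
    "paper_b": {"technical": 9, "social": 2, "business": 4},
}

def qualitative_match(query_dims, ratings, threshold=6):
    """Return documents whose expert rating meets the threshold on
    every dimension named in the query."""
    return [
        doc for doc, scores in ratings.items()
        if all(scores.get(dim, 0) >= threshold for dim in query_dims)
    ]

# The same collection answers differently depending on the viewpoint:
# a search on the "business" dimension and a search on the "technical"
# dimension retrieve different documents.
```

The point of the sketch is that significance is judged per dimension by experts at tagging time, so a development that is technically mundane but commercially important is retrievable from the business viewpoint without any word-based indexing algorithm intervening.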