This invention relates generally to electronic text and more particularly to searching and indexing of electronic text.
Current technologies in digital media storage allow text to be stored in electronic format on a magnetic medium or on an optical medium such as a compact disk. Storing text in electronic format has many advantages, including space savings and near-effortless mass distribution if required. Perhaps the biggest advantage is the ability to quickly search through the electronic text to retrieve the desired information. Two important factors in text searches are the speed and the accuracy of the search. With increasing computing power, speed is becoming less of a concern. However, accuracy is an area where significant improvements can still be made. Search accuracy is the ability to search for and locate relevant information on the subject of interest. Several criteria have been used to describe search accuracy. Search precision is the ratio of relevant results returned to all results returned, and search recall (also known as sensitivity) is the ratio of relevant results returned to all possible relevant results. Therefore, one goal of a search is to increase the search precision without severely reducing the search recall.
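By way of illustration only, the two ratios defined above can be computed as in the following minimal sketch. The function name `precision_recall` and the document identifiers are hypothetical and are not part of the described system.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for returned results versus all relevant items."""
    returned = set(returned)
    relevant = set(relevant)
    hits = returned & relevant  # relevant results that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 results returned, 2 of them relevant,
# while 5 relevant documents exist in the collection.
returned = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5", "d6", "d7"}
precision, recall = precision_recall(returned, relevant)
```

In this hypothetical case half of the returned results are relevant, while only two of the five relevant documents are found.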
The Internet is a gigantic set of databases linked together by a decentralized network. Because of this gigantic array of databases, there is a vast amount of data, or information, that can be searched for relevant information for a subject of interest. However, as the amount of data increases, the search accuracy decreases as there is more extraneous data.
Typical Internet search engines such as Lycos and Infoseek use keyword search methods. Keyword search methods involve parsing the documents in a database through a search engine and selecting the documents or sections that contain the keyword(s). With keyword searches, the search accuracy is usually very low. The keyword search returns many irrelevant results even though the results may contain the keywords. This low accuracy is caused by words having different meanings in different contexts and also by search words being in close proximity but not being used together semantically in the text. Even searching with multiple keywords combined in Boolean expressions does not yield a significant increase in accuracy. This lack of accuracy may be acceptable in the Internet environment, where a user may have ample time to sift through the irrelevant results. However, mission-critical users in other environments may not be as tolerant, as time is of the essence in obtaining the relevant information.
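The limitation described above can be seen in a minimal sketch of a Boolean AND keyword search. The function `keyword_search` and the two sample sentences are hypothetical and serve only to show how a document can contain every keyword without answering the question.

```python
import re


def keyword_search(documents, keywords):
    """Return ids of documents containing every keyword (Boolean AND), ignoring case."""
    wanted = {k.lower() for k in keywords}
    results = []
    for doc_id, text in documents.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        if wanted <= words:  # all keywords appear somewhere in the document
            results.append(doc_id)
    return results


# Hypothetical documents: only "a" states that penicillin treats an infection.
documents = {
    "a": "Penicillin is used to treat streptococcal infection.",
    "b": "The patient refused to treat the infection because of a penicillin allergy.",
}
matches = keyword_search(documents, ["penicillin", "treat", "infection"])
```

Both documents are returned, although in document "b" the keywords appear together without being related semantically in the way the searcher intends.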
Health-care professionals in clinical environments need precise and timely information if they are to provide optimal patient care. It has been shown that tertiary references, such as textbooks or edited reviews, could meet the majority of these information needs. However, precise and timely extraction of information from these tertiary sources calls for the development of a system to efficiently search and index these tertiary sources.
Researchers have developed a variety of systems to improve the indexing and searching of medical text sources, with the primary goal of increasing the search precision without severely reducing recall. For example, in an article titled "MYCIN II: design and implementation of a therapy reference with complex content-based indexing," Proc AMIA Symp 1998: 175-179, Kim and associates built MYCIN II, a prototype information retrieval (IR) system capable of searching content-based markup in an electronic textbook on infectious disease. Users select from a pre-determined set of query templates (the query model) a query that is passed to a search engine for processing.
In an article titled "Automated Text Markup for Information Retrieval from an Electronic Textbook of Infectious Disease," Proc AMIA Symp 1998: 975, Berrios and colleagues developed a markup tool that provided the HTML indexing required for the MYCIN II search engine. Because the tools in this system were developed independently with minimal integration, the domain expert must repeat a significant amount of work to generate both the ontology of concepts in the concept model, which is used during the markup process, and the set of questions for the search engine in the query model.
A need therefore exists for a method and a highly integrated system to search and index electronic text for precise information retrieval.
Accordingly, it is a primary object of the present invention to provide a method and a highly integrated system that will significantly increase the search precision while reducing the time necessary to prepare a file of electronic text for searching.
Accordingly, the present invention consists of an electronic text indexing and search system comprising a concept model, a markup tool, a query model, a query interface, and a search engine.
The concept model defines a set of concept-value pairs. The concept model is modified by a concept model tool and new concept-values can also be added by a query model tool.
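One possible in-memory representation of such a set of concept-value pairs is sketched below for illustration only. The class `ConceptModel`, its methods, and the sample concepts are hypothetical; the `add` method stands in for a modification made through either the concept model tool or the query model tool.

```python
class ConceptModel:
    """Hypothetical container for concept-value pairs (all names are illustrative)."""

    def __init__(self):
        self._values = {}  # concept name -> set of values defined for it

    def add(self, concept, value):
        """Add a concept-value pair, creating the concept if it is new."""
        self._values.setdefault(concept, set()).add(value)

    def values(self, concept):
        """Return the values defined for a concept, sorted for display."""
        return sorted(self._values.get(concept, ()))

    def pairs(self):
        """Return every concept-value pair in the model."""
        return {(c, v) for c, vs in self._values.items() for v in vs}


model = ConceptModel()
model.add("pathogen", "Streptococcus pneumoniae")
model.add("drug", "penicillin")
model.add("drug", "vancomycin")
```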
The query model defines a set of queries for submission to the search engine in terms of a first subset of concept-value pairs in the concept model. Each query in the query model is a template for a number of possible queries that are defined when a user selects concept-values from a menu.
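The relationship between a query template and the concrete queries derived from it may be sketched as follows. This is an assumption-laden illustration: the class `QueryTemplate`, its fields, and the sample question are hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryTemplate:
    """Hypothetical query template with one menu per named concept."""
    text: str        # question text with a named slot per concept
    concepts: tuple  # concepts whose values the user picks from a menu

    def menu(self, concept_values):
        """Return the selectable values for each concept in the template."""
        return {c: sorted(concept_values[c]) for c in self.concepts}

    def instantiate(self, choices):
        """Turn menu selections into a concrete question and its concept-value pairs."""
        missing = [c for c in self.concepts if c not in choices]
        if missing:
            raise ValueError(f"no value selected for: {missing}")
        pairs = {(c, choices[c]) for c in self.concepts}
        return self.text.format(**choices), pairs


concept_values = {"pathogen": {"Staphylococcus aureus", "Streptococcus pneumoniae"}}
template = QueryTemplate("What drugs treat {pathogen}?", ("pathogen",))
question, pairs = template.instantiate({"pathogen": "Streptococcus pneumoniae"})
```

In this sketch a single template yields one possible query per value on the menu, which mirrors the one-to-many relationship described above.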
The markup tool uses the first subset of concept-values used in the query model to create a set of allowable concept-values for assignment. The domain expert assigns the allowable set of concept-value pairs to the text. The markup tool also has the ability to suggest assignment of query and markup tags to the domain expert for marking up the electronic text.
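For illustration, the restriction of markup to the concept-values used by the query model might be sketched as shown below. All function names and sample data are hypothetical. The suggestion step here is a plain case-insensitive substring match, which is an assumption made for this sketch and not the suggestion method of the markup tool; the domain expert is assumed to accept every suggestion.

```python
def allowable_pairs(query_concepts, concept_values):
    """Concept-value pairs the expert may assign: only concepts used by the query model."""
    return {(c, v) for c in query_concepts for v in concept_values.get(c, ())}


def assign(markup, section_id, pair, allowed):
    """Record an assignment made by the expert, rejecting pairs outside the allowable set."""
    if pair not in allowed:
        raise ValueError(f"{pair} is not an allowable concept-value pair")
    markup.setdefault(section_id, set()).add(pair)


def suggest(text, allowed):
    """Suggest pairs whose value occurs in the text (assumed substring heuristic)."""
    lowered = text.lower()
    return sorted(pair for pair in allowed if pair[1].lower() in lowered)


concept_values = {
    "pathogen": {"Streptococcus pneumoniae"},
    "drug": {"penicillin", "vancomycin"},
    "author": {"Smith"},  # in the concept model but not used by any query
}
allowed = allowable_pairs(["pathogen", "drug"], concept_values)

text = "Penicillin remains effective against Streptococcus pneumoniae."
markup = {}
for pair in suggest(text, allowed):  # the expert accepts each suggestion here
    assign(markup, "sec-1", pair, allowed)
```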
The user query interface is generated automatically by using the query model. The user query interface allows the user to formulate a query to submit to the search engine.
The search engine tries to match the concept-values submitted by the query to the subset assigned by the markup tool. If there are any matches, the search engine presents a results page that shows an excerpt from the matching text and also gives the user an option to output the query to an external database.
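The matching step may be sketched as follows, under the assumption that a section matches when every concept-value pair of the query has been assigned to it. The function `search`, the section identifiers, and the sample text are hypothetical, and the excerpt is simply the leading characters of the section.

```python
def search(query_pairs, markup, sections, excerpt_len=40):
    """Return (section_id, excerpt) for sections whose markup contains every query pair."""
    query_pairs = set(query_pairs)
    results = []
    for section_id in sorted(markup):
        # Assumed matching rule: all submitted pairs must be assigned to the section.
        if query_pairs and query_pairs <= markup[section_id]:
            results.append((section_id, sections[section_id][:excerpt_len]))
    return results


sections = {
    "sec-1": "Penicillin remains effective against Streptococcus pneumoniae.",
    "sec-2": "Vancomycin is reserved for resistant Staphylococcus aureus.",
}
markup = {
    "sec-1": {("drug", "penicillin"), ("pathogen", "Streptococcus pneumoniae")},
    "sec-2": {("drug", "vancomycin"), ("pathogen", "Staphylococcus aureus")},
}
results = search({("pathogen", "Streptococcus pneumoniae")}, markup, sections)
```

Because matching is performed on assigned concept-values rather than on words in the text, a section that merely mentions a term without being marked up for it is not returned in this sketch.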
The user query interface can be a computer program that calls a function that selects the concept-values to be submitted to the search engine. The search engine can also output the search results, the concept-values assigned to the search results, or the original concept-values submitted by the query, to an external electronic resource.
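As a hedged illustration of output to an external electronic resource, the submitted concept-values and the search results could be serialized before being handed off. JSON is chosen here purely as an assumption; the format, the function `export_payload`, and the field names are hypothetical and are not specified by the described system.

```python
import json


def export_payload(query_pairs, results):
    """Serialize the submitted concept-values and the search results as JSON text."""
    return json.dumps(
        {
            "query": sorted([concept, value] for concept, value in query_pairs),
            "results": [{"section": s, "excerpt": e} for s, e in results],
        },
        sort_keys=True,
    )


payload = export_payload(
    {("drug", "penicillin")},
    [("sec-1", "Penicillin remains effective")],
)
```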