The present invention is directed to a method and system for maintaining a knowledge base and evidence set.
The field of search engines is known. Known search engines include those developed by Verity, Inc., Alta Vista, and Lycos. By implementing a search engine, a user can express with precision a focussed area of interest in order to retrieve needed information. Typically, a search engine retrieves documents satisfying the exact terms in a search query. For example, if the search query includes the term "PDA," the search will not retrieve occurrences of "personal digital assistant," "pocket device," or other related terms. This produces under-inclusive results, meaning that documents containing relevant information are not retrieved. Often, however, it is difficult for a user to formulate a query capable of producing appropriately-inclusive results without existing knowledge of a subject area. This difficulty is especially prevalent when a lay user searches in subject areas containing technical terminology or jargon, which is unfamiliar to the lay user. For instance, when searching in the subject area of medical terminology, the lay user is more likely to employ everyday names for terms rather than the technical terms used by medical professionals. Even medical professionals may have difficulty in correctly spelling or recalling a proper medical term. Under-inclusive results also occur when relatively inexperienced users attempt to use search engines. For example, inexperienced users may fail to appreciate that certain search engines are case sensitive or require specific syntax.
Three approaches have been adopted to address under-inclusive results. The first approach employs manual query expansion. As noted above, if a search query is "PDA," the search will not retrieve occurrences of "personal digital assistant," "pocket device," or other related terms. Users familiar with these related terms may manually expand the query by replacing "PDA" in the search query with "'PDA' OR 'personal digital assistant' OR 'pocket device'". This query uses the logical OR operator and would retrieve those documents containing at least one of these terms. Manual query expansion, however, requires user knowledge of related terms. In addition, manual query expansion requires excessive user input. For instance, if a user manually expands a query term and wishes to conduct the search repeatedly, the user must reenter the same related terms each time the query is submitted. Finally, users must have working knowledge of the search engine syntax and the vocabulary of the subject matter that is being searched.
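By way of illustration only, the difference between the exact-term query and the manually expanded query may be sketched as follows. The sample documents, the function name matches_any, and the use of case-insensitive substring matching are assumptions made for this sketch and do not describe any particular search engine.

```python
def matches_any(document: str, terms: list[str]) -> bool:
    """Return True if the document contains at least one term (logical OR)."""
    text = document.lower()
    return any(term.lower() in text for term in terms)


# Hypothetical sample documents, invented for this sketch.
documents = [
    "The PDA market grew last year.",
    "A personal digital assistant stores contacts.",
    "This pocket device syncs with a desktop.",
    "Laptops remain popular.",
]

# Exact-term query: only documents containing "PDA" are retrieved.
exact = [d for d in documents if matches_any(d, ["PDA"])]

# Manually expanded query: 'PDA' OR 'personal digital assistant' OR 'pocket device'.
expanded_terms = ["PDA", "personal digital assistant", "pocket device"]
expanded = [d for d in documents if matches_any(d, expanded_terms)]

print(len(exact), len(expanded))
```

In this sketch the user must supply the list of related terms again each time the search is run, which is the burden described above.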
The second approach to address under-inclusive results employs meta tagging. To implement meta tagging, the author of a document inserts metadata, also known as metainformation, into the contents of the document itself or otherwise associates it with the document. Metadata is data that describes other data. For example, an author of a web page on the Internet's World Wide Web may insert meta tags into the source code of the web page. Typically, the meta tag is invisible to those viewing the web page with a traditional browser, such as Netscape Navigator, but is present in the source code and visible to search engines. Meta tags are usually words and phrases that are related to the content of the web page but do not appear in the text of the web page visible to the user. For example, when a search engine searches for "PDA" on the World Wide Web, the search engine retrieves documents containing "PDA" if "PDA" is either in the meta tag or the contents of the document. One disadvantage to meta tagging, however, is the time investment required by authors to insert meta tags in each document. Moreover, once a document is created, it is time-consuming to modify the meta tags; each document must be reopened to edit the meta tags. Also, since meta tag information is inserted into each document, there is an increased likelihood of a data entry error in the spelling or format of the meta tag information. In addition, the meta tag vocabulary might change, thus requiring a modification to all documents containing the meta tag information. Yet another disadvantage is that an individual inserting the meta tags must be taught the vocabulary relating to the content of the document. Finally, meta tagging requires knowledge of the content of the document. In many instances the author of a web page is a web page developer, who is developing the web page for others who are familiar with the content. Thus, meta tagging often requires coordination between a web page developer and those familiar with the content of the web page.
The third approach to address under-inclusive results employs evidence sets. An evidence set contains evidences, which are phrases or terms. The evidences are organized into topics. This knowledge is organized, typically in a hierarchical structure or taxonomy, and made available as a shared resource to users. An evidence set is employed by an application, such as a search engine, by incorporating knowledge about topics and associated phrases. One company, Sageware, Inc., has developed a number of Knowledge Sets, which are functionally similar to evidence sets, for specific subject areas. See SAGEWARE, INC., Our Products: Sageware KnowledgeSets (accessed on Mar. 21, 1998; copyright 1997). One use of evidence sets is for query expansion. In contrast to manual query expansion, query expansion with evidence sets does not require a manual substitution of related terms for each query. Rather, the search engine may access the contents of the evidence set to automatically expand the search query.
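By way of illustration only, automatic query expansion with an evidence set may be sketched as follows. The dictionary layout, the topic name "handheld computers", and the function names expand_query and to_or_query are hypothetical and are not drawn from any actual evidence set format.

```python
# Hypothetical evidence set: each topic maps to its evidences (phrases or terms).
EVIDENCE_SET = {
    "handheld computers": ["PDA", "personal digital assistant", "pocket device"],
}


def expand_query(term: str, evidence_set: dict[str, list[str]]) -> list[str]:
    """Return the evidences of the first topic that lists the term.

    If the term is neither a topic name nor an evidence, it is returned
    unchanged as a one-element list.
    """
    lowered = term.lower()
    for topic, evidences in evidence_set.items():
        names = [topic] + evidences
        if lowered in (name.lower() for name in names):
            return list(evidences)
    return [term]


def to_or_query(terms: list[str]) -> str:
    """Join the terms with the logical OR operator."""
    return " OR ".join(f'"{t}"' for t in terms)


print(to_or_query(expand_query("PDA", EVIDENCE_SET)))
```

Here the related terms are stored once in the shared evidence set, so the user enters only "PDA" and the expansion is performed on the user's behalf.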
Known methods for creating evidence sets require extensive user input. Other methods for learning evidence sets exist; however, it is known that learning algorithms applied to training data typically produce evidence sets of inferior quality. In addition, known methods for creating evidence sets often produce evidence sets that are difficult to modify. Typically, methods for creating evidence sets include the use of either a standard text editor or a graphical user interface (GUI). An evidence set may be created with a text editor by inputting text and symbols in accordance with a known evidence set format. As evidence sets generally require a specific syntax, text editor creation has the disadvantage that minor inadvertent input errors may create an improperly formatted or non-working evidence set. For instance, a misplaced symbol or term may inadvertently change the relationship between evidences or topics in an evidence set. Because the syntax of evidence sets is often cumbersome, a user cannot readily apprehend when mistakes have occurred. Moreover, once an evidence set has been created with a text editor, it can be relatively difficult to modify its structure. A text-edited modification requires reentry of evidences in the evidence set to comport with the newly-modified structure. Also, modifying an evidence set with a text editor requires a user with working knowledge of the syntax of the evidence set. In addition, a user may create an inconsistent evidence set. For instance, a user may create a text-edited evidence set with multiple occurrences of the same topic. Moreover, when a text editor is used to create an evidence set, each topic may have a different set of evidences. This could create an internal inconsistency in the evidence set and result in an evidence set that is non-functioning or, at the very least, capable of producing inconsistent results. Finally, when making changes to a text-edited evidence set, a regression test must often be performed to fully understand the impact of changes to the evidence set.
A second known method for creating evidence sets employs GUIs. Such a method, developed by Verity, Inc., is topicEditor. VERITY, INC., Introduction to Topics Guide V2.0 (copyrighted Sep. 23, 1996; visited Mar. 21, 1998) discloses the use of topicEditor. In topicEditor users create topics and evidences in a hierarchical GUI environment, which allows users to expand and collapse topics, copy or move topics using drag and drop, and re-use topics by selecting them from a drop-down list. Once a topic is created in topicEditor, a user may generate topic sets, which are functionally similar to evidence sets. These topic sets may be stored in a knowledge base. Typically, these types of knowledge bases only include information that is represented in the GUI environment. For instance, a GUI-created knowledge base typically contains only information that relates to the hierarchical structure of the topics and evidences. Typically, for any given GUI-created knowledge base there exists only one corresponding evidence set. Finally, modification of a GUI-created knowledge base requires excessive manipulation of the GUI environment.
U.S. pending application Ser. No. 09/054,886 filed by McGuinness et al. and entitled "System and Method for Searching," which is incorporated herein by reference, discloses deriving an evidence set based on a knowledge base.
Often, however, the mere creation of an evidence set and the corresponding knowledge base from which it is derived is insufficient to satisfy the needs of users. Changes to the evidence set may be required to correct errors, modify content information, and add new terms that arise in the content area. For similar reasons, it may be desirable to modify the structure of a knowledge base to improve its structure or to modify information contained within the structure. Thus, once an evidence set is derived from a knowledge base it may be desirable to modify the structure of the evidence set, the knowledge base, or both.
Modification of the evidence set and knowledge base may be accomplished by modifying each independently from the other. For instance, the evidence set may be modified by using the methods for creating an evidence set discussed above. For example, the evidence set may be modified by opening the evidence set as a file in a standard text editor, modifying its syntax and contents in the editor, and saving the file in memory. Similarly, a knowledge base may be modified using the methods for creating a knowledge base that were discussed in U.S. pending application Ser. No. 09/054,886.
Disadvantages exist when modifying the evidence set and knowledge base independently of each other. First, independent modification requires duplicative actions in both the evidence set and the knowledge base. This requires more user input than is required to modify either the knowledge base or the evidence set alone. Second, there may be separate software tools for independently modifying the evidence set and the knowledge base, thus requiring the execution of two separate software programs. Third, independent modification allows users to create inconsistencies between a given knowledge base and its corresponding evidence set. For instance, an inconsistency would arise if a user modifies an evidence set by deleting a term and fails to delete the same representation of the term in the corresponding knowledge base. Fourth, independent modification requires a user to understand the syntax of both the knowledge base and the evidence set. This, in turn, also introduces the possibility for input errors in both the modification of the evidence set and the modification of the knowledge base.
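By way of illustration only, the inconsistency described in the third disadvantage may be detected with a comparison such as the following. The sketch assumes, purely for simplicity, that both the knowledge base and the evidence set can be reduced to a mapping from topic names to sets of terms; actual knowledge bases and evidence sets may be represented quite differently, and the function name find_inconsistencies is hypothetical.

```python
def find_inconsistencies(
    knowledge_base: dict[str, set[str]],
    evidence_set: dict[str, set[str]],
) -> dict[str, dict[str, list[str]]]:
    """Report, per topic, the terms present on one side but not the other."""
    report = {}
    for topic in sorted(set(knowledge_base) | set(evidence_set)):
        kb_terms = set(knowledge_base.get(topic, ()))
        ev_terms = set(evidence_set.get(topic, ()))
        only_kb = kb_terms - ev_terms
        only_ev = ev_terms - kb_terms
        if only_kb or only_ev:
            report[topic] = {
                "only_in_knowledge_base": sorted(only_kb),
                "only_in_evidence_set": sorted(only_ev),
            }
    return report


# Hypothetical data: a term was deleted from the evidence set
# but not from the corresponding knowledge base.
kb = {"handheld computers": {"PDA", "personal digital assistant", "pocket device"}}
ev = {"handheld computers": {"PDA", "personal digital assistant"}}

print(find_inconsistencies(kb, ev))
```

Such a check only reveals a discrepancy after the fact; it does not remove the duplicative user input that independent modification requires.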