Information retrieval systems, such as databases or search engines, allow users to retrieve information related to a specific subject by using one or more keywords that may be related to the specific subject. For example, a legal search service called Lexis® provided by the LexisNexis Group is widely used in the legal field to search for cases, journal articles, treaties, as well as other publications that are related to a specific topic or issue. Another information retrieval system, Google®, provided by Google, Inc., is a search engine commonly employed by internet users to search for web sites or online documents that are related to a specific subject matter.
In order to search for and retrieve documents related to a specific subject matter, users need to formulate a query that typically comprises a set of keywords, phrases, symbols, commands, and/or other entities that are considered to be relevant to the subject matter or possibly contained in the documents relating to the subject matter. This type of information retrieval system poses problems to users because the users need to be familiar with the proper format for inputting queries into such systems. In addition, users need to have a basic understanding of the subject matter to be searched as well as of properties of the language used to describe that subject in order to formulate a proper query to conduct the search.
Some information retrieval systems provide assistance on query formation. For example, a website www.ask.com provides a search function called Ask Jeeves that allows users to input their questions in natural language. The system will extract keywords from the questions and conduct a search accordingly. Lexis® also provides a similar function allowing users to input search terms in natural language, either as a question or a statement. The system then extracts keywords from such natural language inputs to search for information related to the keywords.
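The keyword extraction step described above can be illustrated with a minimal sketch. It assumes a simple stop-word filter; the function name extract_keywords and the stop-word list are hypothetical and are not taken from Ask Jeeves, Lexis®, or any other actual system, whose internal methods are not described here.

```python
import re

# Hypothetical stop-word list for illustration only; a real system
# would likely use a much larger, curated list.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "what", "how", "does", "do", "of",
    "in", "to", "into", "and", "or", "for", "on", "by", "it", "that",
    "this", "with", "when", "where", "why", "which", "who",
}


def extract_keywords(question: str) -> list[str]:
    """Return the non-stop-word tokens of a question, in order, without duplicates."""
    # Lowercase the input and split it into alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9']+", question.lower())
    seen = set()
    keywords = []
    for token in tokens:
        if token not in STOP_WORDS and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords


print(extract_keywords("How does a caterpillar turn into a butterfly?"))
# prints ['caterpillar', 'turn', 'butterfly']
```

A filter of this kind keeps only the words that appear in the question itself, so it carries over none of the surrounding context in which the question was asked.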
Although these tools provide basic assistance on query formation during information search and retrieval, such tools cannot function effectively in more realistic work environments in which the context of the query or question plays a paramount role. For example, consider the following search scenarios related to the same keyword “caterpillar”:
Scenario 1:
A biology student writing a term paper on animal development. In this case, the information search should be related to metamorphosis, the process by which the caterpillar becomes a butterfly.
Scenario 2:
A contractor working on a construction plan for a new building. The contractor is most likely referring to Caterpillar, Inc., a major manufacturer of construction equipment.
Scenario 3:
A grade-school student writing a book report on Lewis Carroll's book, Alice's Adventures in Wonderland. In this case, information retrieved should preferably be related to the character in the book, chapter excerpts, and pictures that the student could include in her paper.
These scenarios illustrate various problems associated with conventional information retrieval systems. The first problem is that conventional information retrieval systems do not consider relevance of active goals in searching for information. The active goals of the user contribute significantly to the interpretation of the search terms and to the criteria for judging a resource as being relevant to the search terms. Typically, these goals are not fully expressed by users in forming their queries when using conventional information retrieval systems.
The second problem is that conventional information retrieval systems are subject to word-sense ambiguity. For example, the word “caterpillar” in scenario 1 should be treated differently from that in scenario 2. The context of the request provides a clear choice of word sense between the insect and the company. Conventional information retrieval systems cannot distinguish the subtle differences unless additional keywords or information are provided by the user.
The third problem is that conventional information retrieval systems fail to consider audience appropriateness when searching and retrieving information based on keywords or queries provided by the user. In addition to the keywords provided by the user, attributes related to the user in each of the above scenarios should also influence the choice of results. Sources appropriate for an advanced biology student will likely not be appropriate for a student in grade school.
Moreover, when using conventional information retrieval systems, users often are unable to provide sufficient information in their queries. Studies show that on average, users' queries tend to be two to three words long. Needless to say, a two-word query most likely does not contain enough information to discern the active goals of the user, or even the appropriate senses of the words in the query.
Furthermore, even if the user has sufficient knowledge to formulate workable queries to conduct a search, the user must be aware of the variety of available resources, decide where to find them, and know how to use different information retrieval systems correctly, including details such as special operators like “and,” “or,” or “+” that are used differently in different information retrieval systems.
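The burden of operator differences can be illustrated with a short sketch that renders the same search terms in two query formats. Both formats, their names, and the function format_query are hypothetical and are used only to show the kind of variation involved; they do not reproduce the syntax of any particular information retrieval system.

```python
def format_query(terms, style):
    """Render a list of required search terms in a given (hypothetical) query syntax."""
    if style == "boolean_words":
        # Assumed syntax: terms joined by an explicit AND operator.
        return " AND ".join(terms)
    if style == "plus_prefix":
        # Assumed syntax: each required term prefixed with "+".
        return " ".join("+" + term for term in terms)
    raise ValueError(f"unknown query style: {style}")


terms = ["caterpillar", "metamorphosis"]
print(format_query(terms, "boolean_words"))  # prints caterpillar AND metamorphosis
print(format_query(terms, "plus_prefix"))    # prints +caterpillar +metamorphosis
```

A user who searches several sources by hand must perform this translation for each source, which is the kind of detail the paragraph above refers to.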
Therefore, there is a need to provide an automatic query formation system to assist users in retrieving information related to their active goals without their intervention. There is another need for an information retrieval system to consider the context of words or phrases when conducting an information search and retrieval. There is also a need to improve the performance of an information retrieval system by refining queries based on various attributes related to the users. An additional need exists to automate the information search and retrieval process by forming queries in the proper format for conducting information searches in different information sources.