1. Technical Field
The present invention relates in general to data processing and in particular to the utilization of data processing systems to locate desired data. Still more particularly, the present invention relates to methods and systems for processing search expressions for use in locating desired electronic documents.
2. Description of the Related Art
The World Wide Web (i.e., the Web) denotes a vast set of interlinked documents (i.e., Web pages) residing on various data processing systems around the globe. In recent years, the Web has experienced rapid growth, to the point that the Web now contains millions of documents. The data processing systems that serve up these documents on request are called servers, and when a data processing system is utilized to retrieve a document from a server, the retrieving data processing system is considered a client.
In general, the interlinked documents are publicly accessible and are retrieved using the communications protocols known as “HTTP” (which stands for Hypertext Transfer Protocol) and “TCP/IP” (which stands for Transmission Control Protocol/Internet Protocol). The servers, communications networks, and related facilities that provide access to the documents of the Web are known collectively as the Internet.
In addition to Web documents, a number of services are also available via the Internet, including search engines, which help in identifying which of the millions of Web documents relate to particular subjects of interest. Typically, a search engine includes a Web page that serves as a user interface through which a user enters a search expression, a database that associates Web page addresses with Web page content, and a comparator that determines which of the Web pages in the database include content corresponding to the entered search expression. The addresses of the corresponding Web pages are returned in what is called a “hit list.” For example, if a user were to enter a search expression consisting of a particular word, the resulting hit list would provide the addresses of Web pages containing that word.
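As a minimal sketch of the database and comparator described above, the following treats the database as a dictionary from page address to page text and returns the addresses whose text contains the entered word. The function name `hit_list`, the sample addresses, and the whitespace-based word splitting are assumptions made for illustration only; they are not taken from any particular search engine.

```python
def hit_list(database, word):
    """Return the addresses of pages whose content contains the given word."""
    needle = word.lower()
    return sorted(
        address
        for address, content in database.items()
        if needle in content.lower().split()
    )


# Hypothetical database associating page addresses with page content.
database = {
    "http://example.com/a": "History of World War II in Europe",
    "http://example.com/b": "Buy books about gardening",
    "http://example.com/c": "World atlas and maps",
}

print(hit_list(database, "world"))
```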
However, search expressions that simply list a small number of words relating to a subject often cause search engines to produce inefficient hit lists (i.e., hit lists that include unhelpful sites and/or that fail to include a reasonably large number of helpful sites). For instance, a user wanting to identify Web pages with substantive content concerning World War II might enter the search expression “World War II.” The search engine would then return a hit list of Web pages containing the entered words. In addition to the hits with the desired substantive content, however, the hit list will likely also contain hits with no substantive content relating to the subject in question (such as hits identifying Web pages with mere advertisements for books on the subject). Unless one is looking for a book, the hits relating to mere book advertisements get in the way because they show up in the hit list but generally do not answer any substantive questions or provide any significant amount of substantive information regarding the subject of interest. In addition, due to the large number of Web pages now in existence, overbroad hit lists often identify substantially more Web pages than a user can conveniently explore.
Obtaining efficient hit lists is one of the biggest challenges associated with utilizing the Web. To address this challenge, many search engines allow users to enter searches, known as “boolean searches,” that are more complex than a simple list of words. In a boolean search, the user enters boolean operators along with the words of the search expression. Among the most common boolean operators are AND, OR, and NOT. Further, according to the syntax utilized by some search engines, AND, OR, and NOT may be abbreviated as &, |, and !, respectively. Also, OR is generally the default operator (which means that a search expression containing words but no explicit boolean operators is interpreted as if those words were joined with the OR operator). Quotation marks also act as boolean operators, allowing the user to group words into a phrase. Such a phrase produces a match only when that same phrase (i.e., all of the words in the same arrangement) is found in a Web page.
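The operators described above can be illustrated with a few small predicates over a page's text. This is only a sketch: the helper names (`has_word`, `has_phrase`, `match_or`, `match_and`) and the sample page are hypothetical, and real search engines tokenize and match text in more elaborate ways.

```python
def words(text):
    """Split text into lowercase words (a simplifying assumption)."""
    return text.lower().split()


def has_word(text, word):
    return word.lower() in words(text)


def has_phrase(text, phrase):
    """True only if all words of the phrase appear together, in the same order."""
    doc, target = words(text), words(phrase)
    n = len(target)
    return any(doc[i:i + n] == target for i in range(len(doc) - n + 1))


def match_or(text, terms):
    """Default behavior: any one of the words is enough."""
    return any(has_word(text, t) for t in terms)


def match_and(text, terms):
    """Every word must be present."""
    return all(has_word(text, t) for t in terms)


page = "an advertisement for a book on the second world war"

print(match_or(page, ["world", "peace"]))   # OR: one matching word suffices
print(match_and(page, ["world", "peace"]))  # AND: "peace" is missing
print(not has_word(page, "book"))           # NOT: page does contain "book"
print(has_phrase(page, "world war"))        # quoted phrase, same arrangement
```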
Some search engines also support include and exclude boolean operators, which may be entered as + and −, respectively. If a word is qualified with the include operator, a document is a match only if the document includes that word. If a word is qualified with the exclude operator, a document is a match only if the document does not include that word. In addition, parentheses may be utilized to group pieces of a search expression together, for instance to associate an include operator with one group of words but not another.
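A rough sketch of the include and exclude operators follows. It assumes that operators are written as a single-character prefix on a word, that the exclude operator may be typed as either an ASCII hyphen or a minus sign, and that unqualified words fall back to the default OR behavior only when no include-qualified word is present. The function name `matches` and these rules are illustrative assumptions; parentheses are not handled.

```python
def matches(text, query):
    """Check a page against a query using + (include) and - (exclude) prefixes."""
    doc = set(text.lower().split())
    required, excluded, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token[0] in "-\u2212" and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)

    # Include operator: the page must contain every qualified word.
    if any(word not in doc for word in required):
        return False
    # Exclude operator: the page must contain none of the qualified words.
    if any(word in doc for word in excluded):
        return False
    # Assumption: unqualified words are joined by OR when nothing is required.
    if required or not optional:
        return True
    return any(word in doc for word in optional)


print(matches("world war ii history", "+war -advertisement"))
print(matches("advertisement for a war book", "+war -advertisement"))
```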
By utilizing boolean expressions, skilled database searchers are able to obtain more efficient hit lists. Many Web users, however, do not know and do not want to learn how to specify boolean searches. Furthermore, even for skilled searchers, substantial effort may be required to formulate and enter a search expression that is sufficiently complete to obtain a reasonably efficient hit list. What is needed, therefore, is a more convenient and effective way to generate efficient hit lists.
The present invention relates to a method, system, and program product for utilizing metawords to find electronic documents. According to the method of the present invention, a user specifies an initial search expression that includes at least one metaword. It is determined that the at least one metaword corresponds to a boolean expression, and, in response, an expanded search expression is generated. The expanded search expression includes the boolean expression in lieu of the at least one metaword, such that the expanded search expression is utilized in lieu of the initial search expression to find the electronic documents.
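The expansion step can be sketched as a table lookup that substitutes a boolean expression for each metaword in the initial search expression. The metaword `nosales`, its associated boolean expression, and the function name `expand` are hypothetical and chosen only for illustration; how metawords are actually defined and recognized is not specified in this passage.

```python
# Hypothetical table associating each metaword with a boolean expression.
METAWORDS = {
    "nosales": "NOT (advertisement OR price OR order)",
}


def expand(initial_expression, metawords=METAWORDS):
    """Replace each metaword with its boolean expression, in parentheses."""
    expanded = []
    for token in initial_expression.split():
        replacement = metawords.get(token.lower())
        expanded.append(f"({replacement})" if replacement else token)
    return " ".join(expanded)


# The expanded expression, not the initial one, would be sent to the search engine.
print(expand('"World War II" nosales'))
```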
In an illustrative embodiment, the determining step includes the step of determining that one or more terms and a count qualifier are associated with the at least one metaword. The count qualifier specifies a threshold number of occurrences of the one or more terms within a single electronic document. The one or more terms and the count qualifier are included in the expanded search expression generated in the generating step.
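One possible reading of the count qualifier is sketched below: a document satisfies the qualifier when the associated terms occur at least a threshold number of times in that single document. Treating the threshold as a minimum and summing occurrences across all of the terms are assumptions; the function name `meets_count` is likewise hypothetical.

```python
def meets_count(text, terms, threshold):
    """True if the terms occur at least `threshold` times in one document."""
    wanted = {term.lower() for term in terms}
    occurrences = sum(1 for word in text.lower().split() if word in wanted)
    return occurrences >= threshold


print(meets_count("war war peace", ["war"], 2))
print(meets_count("war and peace", ["war"], 2))
```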
All objects, features, and advantages of the present invention will become apparent in the following detailed written description.