Numerous search engines are currently available for searching large collections of documents, such as the entire World Wide Web. A search is performed in response to a query provided by a user, wherein the query typically includes keywords and logical connectors. Different search engines handle queries in different ways, and some of them support “advanced” searching. The results of a search are typically reported in the form of a ranked list of documents, which the user may examine. The returned list may sometimes consist of hundreds or even thousands of documents. Because the number of documents that a user can manually examine is naturally quite limited, the search results are not very useful if a relevant document cannot be found among the relatively few documents at the top of the list.
Various methods for recognizing keywords in a search query are well known. An “alphabet” is a set of symbols. For example, a first alphabet may include only the digits 0 and 1, and a second alphabet may include all the lower-case letters a–z and upper-case letters A–Z. For text search purposes, the alphabet is typically either the ASCII character set, coded with the numbers 0–127, or an extended set of 256 characters coded 0–255. A “string” is a finite sequence of symbols from the alphabet. Thus, “0110101” is a string over the alphabet {0,1} and “AbZYe” is a string over the alphabet of lower-case and upper-case letters. A “language” is a set of strings. A computer program recognizes a particular language if the program can tell, for any given string, whether or not the string is in the language.
A “regular expression” is a simple description of a language that is recognized by simple computer programs called finite automata. The simplest type of an expression is a single symbol. More complicated expressions can be constructed from simpler expressions by applying operations. For example, the expression ‘(o+i)n’ refers to the set {on,in} and can be described as “either an ‘o’ or an ‘i’, followed by an ‘n’”. Further, by example, the expression ‘(s+t)(i+o)n’ refers to the set {sin,son, tin,ton}.
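The two operations used in these examples, choice (written +) and concatenation, can be sketched in Python by treating each expression as the finite set of strings it denotes. This is only an illustrative sketch: the helper name concat and the variable names are assumptions introduced here, and Python sets stand in for languages.

```python
from itertools import product

def concat(*languages):
    """Concatenate languages: every way of picking one string from each, in order."""
    return {"".join(parts) for parts in product(*languages)}

# Choice ('+' in the text) is set union.
o_or_i = {"o"} | {"i"}

# (o+i)n : either an 'o' or an 'i', followed by an 'n'
on_in = concat(o_or_i, {"n"})

# (s+t)(i+o)n
four_words = concat({"s"} | {"t"}, {"i"} | {"o"}, {"n"})

print(sorted(on_in))
print(sorted(four_words))
```

Representing a language as an explicit set only works for finite languages such as these; it is used here purely to make the meaning of the two operations concrete.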
The concepts of alphabet and regular expression, described above, are further illustrated by the example of recognizing times that are expressed in hours and minutes, e.g., 11:43, 09:32, 7:19, etc. Here the alphabet consists of the numerals 0, 1, . . . , 9 and the colon (:). The first digit of the minutes part can be defined as M1={0, . . . , 5}, whereas the second digit of the minutes part can be defined as M2={0, . . . , 9}. The minutes part, MIN, is then defined by the concatenation operation MIN=(M1)(M2)={00, 01, . . . , 09, 10, 11, . . . , 59}. Similarly, the hours part, HOUR, can be expressed by defining H1={1, . . . , 9}, H0={0,1}, and H2=H0+(2)={0,1,2}, so that HOUR=(H1)+(0)(H1)+(1)(H2) (i.e., the hour part is either a numeral from the set {1, . . . , 9}, the numeral 0 followed by a numeral from the set {1, . . . , 9}, or the numeral 1 followed by a numeral from the set {0,1,2}). Finally, the time is TIME=(HOUR)(:)(MIN).
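As a rough illustration, the TIME expression can be transcribed into the regular-expression syntax of Python's re module, where the vertical bar | plays the role of + and adjacency plays the role of concatenation. The names HOUR, MIN, TIME_PATTERN, and is_time are assumptions made for this sketch, and the pattern follows the definition given in the text (a leading 0 must be followed by a numeral from 1 to 9).

```python
import re

# HOUR: a numeral 1-9, or 0 followed by 1-9, or 1 followed by 0-2
HOUR = r"(?:[1-9]|0[1-9]|1[0-2])"
# MIN: a numeral 0-5 followed by a numeral 0-9
MIN = r"[0-5][0-9]"
# TIME = (HOUR)(:)(MIN)
TIME_PATTERN = re.compile(HOUR + ":" + MIN)

def is_time(s):
    """Return True if the whole string s belongs to the TIME language."""
    return TIME_PATTERN.fullmatch(s) is not None

for candidate in ["11:43", "09:32", "7:19", "13:00", "7:60"]:
    print(candidate, is_time(candidate))
```

Using fullmatch rather than search makes the function a recognizer in the sense defined earlier: it decides membership for the entire string instead of looking for a matching substring.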
A “lexical analyzer” is a computer program that receives text and recognizes strings described by a regular expression. A “parser” is a program that analyzes a stream of lexical units according to a given grammar. For a specific language, a lexical analyzer finds within any given text all the occurrences of strings described by the regular expression. The lexical analyzer converts characters or sequences of characters into so-called conventional tokens that become atomic units that are passed to a parser. A “lexical analyzer generator” is a computer program that receives a regular expression and generates a corresponding lexical analyzer. A well-known lexical analyzer generator is LEX, which is supplied together with most of the UNIX systems. LEX typically works with YACC, which is a parser generator, i.e., a program that generates a parser corresponding to a given set of grammar rules.
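The role of a lexical analyzer can be sketched with a few lines of Python. This is not LEX itself, only a stand-in built on Python's re module; the token names TIME, NUMBER, and WORD, the TOKEN_SPEC table, and the tokenize function are all assumptions chosen for illustration.

```python
import re

# Each entry pairs a hypothetical token name with the regular expression describing it.
TOKEN_SPEC = [
    ("TIME", r"(?:1[0-2]|0[1-9]|[1-9]):[0-5][0-9]"),
    ("NUMBER", r"\d+"),
    ("WORD", r"[A-Za-z]+"),
]

# Combine the rules into one pattern with a named group per token type.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Yield (token_name, matched_text) pairs; characters matching no rule are skipped."""
    for match in MASTER.finditer(text):
        yield match.lastgroup, match.group()

tokens = list(tokenize("Meet at 7:19 or 11:43"))
print(tokens)
```

The resulting pairs are the atomic units that would be handed to a parser. Listing TIME before NUMBER matters in this sketch, because alternatives are tried in order and a time would otherwise be split into separate numbers.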
The example of TIME can be processed by LEX if given in the form of a set of so-called “lexical rules.” The notation can be explained as follows. Each rule ends with a semi-colon. Each rule consists of two parts separated by a colon. The first part of the rule is a conventional token that is defined by the second part of the rule. The vertical bar | signifies the logical “or,” and literal characters are written in quotes, e.g., ‘0’, ‘1’, etc. For any two characters a, b, the notation [a–b] stands for any one character that occurs between a and b, inclusive, in the standard order on the set of characters. So, [1–9] means any one of the digits 1, 2, . . . , 9. Conventional tokens that occur in the second part of a rule must also appear in the first part of exactly one of the rules or be declared in advance as “terminal” conventional tokens. The example of TIME, in the form processed by LEX, is as follows.
time : hours ‘:’ minutes;
hours : [1–9] | ‘0’[1–9] | ‘1’[0–2];
minutes : [0–5][0–9];
The following example illustrates one of the weaknesses of conventional search methods that support only keyword queries. Suppose a user wants to find documents that describe how many workers were laid off. A query such as “laid off” is inadequate for this purpose, since it returns too many pages that mention no number at all. When this query was submitted to conventional search engines, the Google search engine returned 47,600 hits and the AltaVista search engine returned 75,683 hits. Only a small fraction of these hits were relevant. In this case, numbers convey the quantitative information sought, so it would be desirable to filter out occurrences of the phrase “laid off” in documents where no number is mentioned.
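The kind of filtering described here can be sketched as a post-processing step over the text of each hit. The sketch below is only an assumption about how such a filter might look, not a method taken from any existing search engine; the function name mentions_phrase_with_number and the 60-character window are hypothetical choices.

```python
import re

PHRASE = re.compile(r"\blaid\s+off\b", re.IGNORECASE)
# A run of digits, optionally with comma separators, e.g. 1,200
NUMBER = re.compile(r"\b\d[\d,]*\b")

def mentions_phrase_with_number(text, window=60):
    """Return True if a number occurs within `window` characters of 'laid off'."""
    for match in PHRASE.finditer(text):
        start = max(0, match.start() - window)
        end = match.end() + window
        if NUMBER.search(text[start:end]):
            return True
    return False

documents = [
    "The company laid off 1,200 workers last week.",
    "He was laid off and moved back home.",
]
relevant = [doc for doc in documents if mentions_phrase_with_number(doc)]
print(relevant)
```

A character window is a crude proximity test. It is used here only to show that combining a keyword with a pattern for numbers can discard hits in which no quantity appears.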
Accordingly, there is a need for a method and system for searching and retrieving documents that permits users to search and retrieve a greater number of relevant documents in a shorter amount of time.