Computer-based text searches have many applications. Some of these include applications to automated taxonomy, the categorization of text documents. The most prominent applications, however, though certainly not the most sophisticated, are those on the Internet. Most Internet-based companies, such as AltaVista, Yahoo, or Excite, provide users with a simple, literal text search function. Some of them also provide an “Advanced Search” function.
When using the simple, literal search function, the user types in the literal word or words to be found, and the search engine lists all the texts found. Many such search engines appear to allow for some synonyms of the words typed and are largely word based. That is, unless the text typed by the user is placed in quotes (or designated as literal in some other way), the search assumes a match if the typed words appear in a document in any order and, in most search engines, with an implied “OR” between each word and the next. To distinguish the different possibilities, the found documents or URLs (Uniform Resource Locators) are often listed with the “best” matches first. Apparently, “best” matches are those which contain the largest number of the typed words.
For example, when we are searching for documents containing the words “search engines,” typing these into the simple AltaVista search engine finds over 680,000 documents!
Advanced search engines allow a full “logical” or boolean search, by which is meant that words or phrases can be combined using the boolean operators “AND”, “OR”, “NOT”, and “NEAR”. (Usually NEAR means within 10 words of each other, and this number of words is not under the user's control.)
For example, to find documents containing the word “education” and either the word “Internet” or “networking” or “network,” but not containing the words “school” or “college,” the user would type the advanced search expression:
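The NEAR operator can be illustrated with a minimal sketch. The function names `words` and `near`, the tokenization rule, and the `max_distance` parameter are assumptions made for illustration only; they do not describe how any particular search engine implements NEAR.

```python
import re

def words(text):
    """Return the lowercase word tokens of `text`, in document order."""
    return re.findall(r"[a-z0-9']+", text.lower())

def near(text, first, second, max_distance=10):
    """True if `first` and `second` occur within `max_distance` words of each other.

    The default of 10 mirrors the usual fixed distance mentioned above;
    exposing it as a parameter is an illustrative assumption.
    """
    tokens = words(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    # Compare every occurrence of one word with every occurrence of the other.
    return any(abs(i - j) <= max_distance
               for i in first_positions
               for j in second_positions)
```

With this sketch, the same pair of words can satisfy NEAR at one distance and fail it at a smaller one, which is the kind of control a fixed distance does not offer.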
education AND (networking OR network OR Internet) AND NOT (school OR college)
Such a set of boolean operator features gives the user greater control over the text being searched. It makes it easier to find the desired document among the millions of available documents, by allowing the user to narrow down the description of the documents of interest.
A little more control is provided by a “wild character” feature, which allows the user to substitute a special symbol, the wild character, for any uncertain character. Another feature, sometimes available, allows for either the presence or absence of any wild character so designated.
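As a rough illustration of what the advanced search expression above asks for, the following sketch evaluates it against a single document held as a string. The helper names `word_set` and `matches_example_query` are hypothetical, and whole-word, case-insensitive matching is an assumption; real search engines evaluate such expressions against an index rather than raw text.

```python
import re

def word_set(text):
    """Return the set of distinct lowercase words in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_example_query(text):
    """Evaluate: education AND (networking OR network OR Internet)
    AND NOT (school OR college)."""
    found = word_set(text)
    return ("education" in found
            and bool(found & {"networking", "network", "internet"})
            and not (found & {"school", "college"}))
```

Each boolean operator maps onto a set operation: AND onto a conjunction of membership tests, OR onto a non-empty intersection with the listed alternatives, and NOT onto an empty intersection with the excluded words.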
As users of the search engines become more discriminating and more experienced, they will demand more control than even the current advanced search engines can provide.
For example, regarding the search target, it would be useful to specify that the words searched for must all be within a sentence, or perhaps within a paragraph, rather than, as at present, anywhere in the document. Clearly, the chance that a document containing the specified words is the one we want is lower if these words are spread throughout a long document than if they are all within one sentence or one paragraph. Such a feature is not currently available with search engines on the Internet, though it is available, for example, in a search tool for Eudora Email called “PowerSleuth,” distributed by Nisus Software Inc.
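The effect of restricting the target to a sentence or paragraph can be sketched as follows. The function names, the `unit` parameter, and the naive rules for splitting sentences (at `.`, `!`, or `?` followed by whitespace) and paragraphs (at blank lines) are all assumptions for illustration; this is not the behavior of PowerSleuth or of any other product.

```python
import re

def split_targets(text, unit="sentence"):
    """Split `text` into candidate targets: sentences, paragraphs, or the whole text."""
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)          # blank line ends a paragraph
    elif unit == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)    # naive sentence boundary
    else:
        parts = [text]                              # any other value: whole document
    return [p for p in parts if p.strip()]

def all_words_in_one_target(text, required, unit="sentence"):
    """True if every word in `required` occurs inside a single target of `text`."""
    needed = {w.lower() for w in required}
    for target in split_targets(text, unit):
        found = set(re.findall(r"[a-z0-9]+", target.lower()))
        if needed <= found:
            return True
    return False
```

A document whose required words are scattered across different sentences matches when the target is the whole document but not when the target is a sentence.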
Other possible extensions of search features include a complete text-pattern description language, allowing users to describe the text pattern without the need to know the specific text. For example, we may want to search for documents or web pages of a particular company containing a phone number or a street address without knowing either, or precisely because we do not know them. Such text pattern matching is implemented in the Unix search tools, which support “Regular Expressions” for describing text patterns and are referred to as GREP searches.
An example of a text pattern matching engine, implemented as part of a Macintosh word processor, is the PowerFind™ and PowerFind Pro features within the Nisus Writer word processor for the Macintosh, first published as a software product in January of 1989 under the U.S. registered trade name “Nisus” and in more recent versions renamed “Nisus Writer.”
The PowerFind Pro engine, implemented in Nisus Writer, is an extension of the Unix GREP and includes only one boolean operator: the “OR.” The “AND” operator can, however, be simulated by using the “OR” and the other features of PowerFind or PowerFind Pro, although doing so is not very convenient. Simulating the NOT operator is not possible without additional features.
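A text pattern of this kind can be sketched with a regular expression. The pattern below assumes one common North American phone number layout (three digits, three digits, four digits, with optional parentheses around the first group); the name `PHONE_PATTERN`, the function `find_phone_numbers`, and the sample numbers are made up for illustration, and the syntax is that of Python's `re` module rather than classic Unix grep.

```python
import re

# Assumed layout: optional "(", 3 digits, optional ")", a separator,
# 3 digits, a separator, 4 digits. Other phone formats are not covered.
PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")

def find_phone_numbers(text):
    """Return every substring of `text` that fits the assumed phone layout."""
    return PHONE_PATTERN.findall(text)
```

The pattern describes the shape of the text rather than its content, so it finds numbers the user did not know in advance.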
The present invention combines the features of the full boolean search with the extended Regular Expression search features, adding the control of the search target, to create a more powerful and useful search engine than any presently available. In addition, the invention adds several more boolean search features (such as the user-definable NEAR, the FOLLOWED BY, and the NOT FOLLOWED BY binary operators) and extends further some of the already extended Regular Expression features from PowerFind Pro of Nisus Writer. The straightforward combination of a Regular Expression search engine and an extended Boolean search engine results in two types of OR and two types of parentheses: one used in GREP expressions and one used in Boolean expressions. Mixed expressions have to be parsed twice: once by the Regular Expression parser and a second time by the Boolean parser.
GREP expressions can be concatenated to form new meaningful expressions whose match is the concatenation of the respective matches, which is an intuitive result. Logical expressions, on the other hand, can only be combined using one of the binary boolean operators.
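The concatenation property of GREP expressions can be shown with a small sketch using Python's `re` module. The pattern names `WORD` and `YEAR` and the helpers `concat` and `fully_matches` are hypothetical; each piece is wrapped in a non-capturing group so that an alternation inside one piece cannot spill into its neighbor.

```python
import re

WORD = r"[A-Z][a-z]+"   # a capitalized word
YEAR = r" \d{4}"        # a space followed by four digits

def concat(*patterns):
    """Concatenate regular expressions, isolating each one in its own group."""
    return "".join(f"(?:{p})" for p in patterns)

def fully_matches(pattern, text):
    """True if `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None
```

If `WORD` matches "January" and `YEAR` matches " 1989", then `concat(WORD, YEAR)` matches "January 1989", the concatenation of the two matches. No comparable rule exists for two boolean expressions placed side by side, which is the gap discussed next.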
For example, using uppercase to designate operators and lowercase to designate any boolean expression, the boolean expression
NOT z
cannot generally be concatenated with the boolean expression
NOT a
to form a meaningful boolean expression, except by using one of the binary logical operators, such as OR or AND, between them. So one possibility for combining the two expressions would be:
NOT z AND NOT a
This specifies the contents of the target independently of the positions of the matches to the boolean “a” or the boolean “z”. However, frequently we need to search for a text string which can be intuitively designated as
(NOT z)(NOT a)
which means
NOT z IMMEDIATELY FOLLOWED BY NOT a
or, using a more understandable description, NOT z NOT IMMEDIATELY FOLLOWED BY a, which could also be designated as:
NOT (az)
once concatenation of a and z is defined.
It is relatively simple to define such concatenations of boolean expressions. Including such concatenations is equivalent to the unification of the GREP language with the Boolean language. Such a unification is a great convenience to the user and is an innovation.
The combined availability of both the Regular Expression language and the Boolean operators (other than OR) is also an innovation, even when the user has to formulate the search expressions correctly so as not to (illegally) imply concatenation of boolean expressions, that is, even before unification of booleans with Regular Expressions.
Boolean Expressions often need the definition of the “Search Target.” In current search engines on the Internet, the Search Target is implicitly the whole document or web page, and the user has no ability to control that. As exemplified in the Introduction above, it is often useful to give the user better control of what part of the text is to contain the defined search pattern. This is best done by formally defining the Search Target. Although defining the search target in itself is not new, its combination with Regular Expressions and boolean searches is new, and its use for searches on the Internet is also new.