A known approach to searching and managing numerical data on the web is to hide the data behind vertical pipes of proprietary servers, expensive programmers, DBMS administrators, and non-standard formats. Users cannot search directly for numerical data. They must do a search for likely publishers of a certain type of data, then visit each site, go through a proprietary search routine, interpret the data to determine whether it fits the overall search criteria, then collect all of the individual results into a single result table—usually by manually retyping each site's results.
Furthermore, web search engines (“WSEs”) perform only “lookup” functions based on keywords. They cannot perform select queries of the kind that database management systems (“DBMSs”) perform (where a precise definition of the desired result is given and the matching data is structured into a result dataset or table). Database management systems, in turn, can query only datasets that are structured on the relational model and that have been specifically programmed to work together; they cannot perform queries directly over the web, and they are limited to special types of queries. Therefore, current WSEs cannot search for numerical data, and DBMSs cannot work across the web.
FIG. 15 illustrates a conventional approach to searches on the web, which is characterized by two layers of query engines. The one closest to the end user 1500 (i.e., the web search engine 1502) provides indexing to particular sites or collections of tables. These web search engines 1502 are typically operated by companies such as Yahoo! and Lycos. The next layer back is a layer of query engines or servers 1504, which submit queries to and retrieve data from either a relational database management system 1506 or an object-oriented database repository. For the most part, the query engines or servers 1504 are maintained by database administrators and developers for the individual web sites.
Four general sub-architectures of a conventional approach described above are used to conduct searches. They are summarized as follows. First, WSEs conduct searches using direct keyword indexing to HTML documents (e.g., Yahoo, Alta-Vista, Lycos, etc.). These search engines maintain very large indexes that map keywords to URLs. If a user types, for example, “57” as an input keyword, the user will receive instances where “57” is used in an HTML document (“NASDAQ falls 57”, “#57—Doug Henry, Major League Baseball”, etc.). As an example, for this particular query on AltaVista, a user will receive a list of over 11 million pages. The shortcomings of this approach are obvious: no context to the numbers, too many returns, no way to narrow the query to useful numerical data.
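By way of illustration only, the keyword-to-URL index described above can be sketched as follows. This is a minimal sketch, not the implementation of any actual search engine; the page URLs, the third page, and the function name build_index are hypothetical.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word-like token to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(url)
    return index

# Hypothetical pages; the first two texts echo the examples given above.
pages = {
    "http://example.com/markets": "NASDAQ falls 57",
    "http://example.com/baseball": "#57 Doug Henry, Major League Baseball",
    "http://example.com/weather": "High of 57 degrees today",
}

index = build_index(pages)

# A lookup for "57" returns every page containing those characters,
# with no indication of what the number measures on each page.
hits = sorted(index["57"])
```

All three pages are returned for the keyword “57”, although the number is a point change on one, a jersey number on another, and a temperature on the third.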
Second, WSEs conduct searches for database publishers (e.g., Yahoo, AltaVista, Lycos, etc.). If the user is searching for a number and it is not on an HTML page itself, it may be in a relational database that is accessible through an HTML form. The web search engine therefore can be queried for words or phrases that might be on that HTML page or related pages. In this approach, the burden is on the user to guess what words or phrases might be associated with such numbers, and who might publish such data.
Third, users may conduct searches of a repository of XML documents. For XML data, vendors such as XYZFind take the approach of essentially modeling the XML documents in a relational or object-oriented database structure, and building indexes to documents based on this internal “repository” structure. Among the shortcomings are the facts that only documents in that particular relational database can be accessed (not data distributed in documents across the web) and that data from different taxonomies are not directly comparable and, therefore, a search would not produce all possible results.
Finally, users may conduct searches of direct relational databases (e.g., proprietary database access objects based on SOAP, network objects, etc.). The approach for these databases is to directly connect a relational database management system to the user's client browser. The advantages are speed of access and ability for the programmers to control exactly what is accessed, by whom, and in what form. The disadvantages are that multiple, unrelated document sites cannot be searched, a common data model is assumed, and there is a high cost for middleware programmers and database administrators.
Based on the above conventional systems and the general sub-architectures of the conventional approaches, WSEs and relational database systems have serious drawbacks in uniting a network of remote computers. They are incapable of efficiently tying millions of different data tables together into a single database, so that a query put to the system returns a table of data just as a single database system would.
Also, conventional systems and general sub-architectures of conventional approaches are incapable of performing non-select queries. A “keyword select query” is a request for all items in a dataset that contain a particular word or words. An example is “Give me a list of all web pages with the word ‘baseball’ on them.” Conventional WSEs are designed to do keyword select queries, but they cannot perform many other types of queries, such as the following.
First, conventional WSEs do not perform navigational queries. A common occurrence for users is “I don't know what I'm looking for, but I want to know what's available”. A user undertaking a navigational query wants to see a lot of context, suggestions for related information, and leads to other items. A “select query”, by contrast, returns only an answer that meets the strict requirements of the query put to the system.
Second, conventional WSEs do not perform database record-level queries. The current model for accessing data on the web is the “client-server-database” model. The client PC sends a request to the database server, which sends the request on to a relational database, which returns an answer. The problem with this approach is that current web search engines can only index HTML pages, not relational data tables—the database server acts as a wall preventing the search engine from seeing the data. Users must guess which sites might internally have the data they want, do a search for those sites, then go query those databases individually.
Third, conventional WSEs do not perform semantic queries. Current search engines operate by indexing keywords according to the URLs of pages that contain them. A search for “car” will bring up pages with “car” in them, but not pages with “truck” or “bus” or other related items that the user may wish to know about, nor pages that differ in spelling, usage, or language. For example, a user searching for “dog” may want German pages with “Hund,” the German word for dog.
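The gap between a keyword lookup and a semantic query can be illustrated with a small sketch. The pages, the hand-made table of related terms, and both function names are assumptions made for illustration; a practical system would need a far richer source of related terms.

```python
# Hypothetical pages and a hand-made table of related terms; both are
# assumptions for illustration, not part of any real search engine.
PAGES = {
    "http://example.com/en/pets": "my dog likes the park",
    "http://example.com/de/haustiere": "mein hund mag den park",
}
RELATED_TERMS = {"dog": {"hund"}}

def keyword_search(term, pages):
    """Return URLs whose text contains the exact term as a word."""
    return sorted(url for url, text in pages.items()
                  if term in text.lower().split())

def expanded_search(term, pages, related):
    """Return URLs containing the term or any term listed as related to it."""
    terms = {term} | related.get(term, set())
    return sorted(url for url, text in pages.items()
                  if terms & set(text.lower().split()))

keyword_hits = keyword_search("dog", PAGES)
expanded_hits = expanded_search("dog", PAGES, RELATED_TERMS)
```

The plain keyword lookup finds only the English page, while the expanded lookup also finds the German page containing “hund”.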
Fourth, conventional WSEs do not perform numerical queries. Numbers on HTML pages are merely text characters; they possess no value, units, measure, meaning, or structure. A search cannot, therefore, be performed for numbers that possess such qualifiers. A user cannot, for example, search for pages that reference companies with “sales >$100 million” (value >100, meaning = sales, units = US Dollar). Companies with “$101 million”, “$100,000,000”, or even “revenues >$100 million” cannot be found.
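As a rough illustration of why text matching fails here, the following sketch contrasts a literal string match with a comparison on parsed values. The snippets, the MAGNITUDES table, and the parse_dollars helper are hypothetical, and the pattern handles only simple US-dollar phrasings.

```python
import re

# Hypothetical magnitude words and their multipliers.
MAGNITUDES = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

def parse_dollars(text):
    """Return the first dollar amount in text as a float, or None if absent."""
    match = re.search(
        r"\$\s*([\d,]+(?:\.\d+)?)\s*(thousand|million|billion)?",
        text,
        re.IGNORECASE,
    )
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    scale = MAGNITUDES.get((match.group(2) or "").lower(), 1)
    return value * scale

# Hypothetical page snippets.
snippets = [
    "Acme reported sales of $101 million",
    "Widgets Inc. had sales of $250,000,000 last year",
    "Gadget Co. posted sales of $90 million",
]

# A literal keyword match on the query text finds nothing.
text_hits = [s for s in snippets if "sales >$100 million" in s]

# Comparing parsed values finds the first two snippets.
numeric_hits = [s for s in snippets if (parse_dollars(s) or 0) > 100e6]
```

Once each amount carries a value and a magnitude, the comparison “greater than $100 million” becomes an ordinary numeric test rather than a search for matching characters.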
Fifth, conventional WSEs do not perform transformational queries. Transformational queries are those that require numbers to be transformed in some way to test whether they meet the requirements of a query statement. Suppose a company's financial statements are presented as quarterly data, and a query is made for companies with “annual sales >$100 million”. This request may be equivalently stated as companies with “annual sales >75 million (British Pounds)”, or as companies whose sales are listed quarterly and exceed $25 million per quarter. Keyword searches (and general database select queries) cannot make these types of transformations in the course of their searches.
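The transformations involved can be sketched as follows. The records, field names, and the annual_sales_usd helper are hypothetical; the exchange rate is merely the one implied by the example above ($100 million corresponding to 75 million British Pounds) and is not a real market rate. Amounts are in millions.

```python
# Exchange rate implied by the example in the text; an assumption,
# not a real market rate.
USD_PER_GBP = 100 / 75

def annual_sales_usd(record):
    """Sum the reported periods and convert the total to US dollars."""
    total = sum(record["sales"])  # one figure if annual, four if quarterly
    if record["currency"] == "GBP":
        total *= USD_PER_GBP
    return total

# Hypothetical records, each reported in a different form.
records = [
    {"name": "A", "currency": "USD", "sales": [120.0]},
    {"name": "B", "currency": "USD", "sales": [26.0, 27.0, 25.0, 28.0]},
    {"name": "C", "currency": "GBP", "sales": [80.0]},
    {"name": "D", "currency": "USD", "sales": [20.0, 20.0, 20.0, 20.0]},
]

# Query: annual sales greater than $100 million.
matches = [r["name"] for r in records if annual_sales_usd(r) > 100.0]
```

Companies A, B, and C satisfy the query only after their figures are brought to a common period and currency; none of them would match a literal search for the query text.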
Sixth, conventional WSEs do not perform arithmetic queries, that is, queries that require a mathematical calculation to be performed on the data during the search. For example, a user cannot run a search that automatically calculates a batting average over time.
Arithmetic queries involve complex calculations, often requiring a specialized language (such as the Reusable Macro Language, U.S. patent application Ser. No. 09/573,780). For example, a query might draw financial data from 10 web sites, calculate a set of financial ratios, then conduct a search for companies that meet that profile. By their use of multiple sources, derived values, and complex operations, arithmetic queries are distinguished from numerical queries, which use basic comparison operators, and from transformational queries, which change the underlying units, measures and magnitudes.
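An arithmetic query of this kind might be sketched, under simplifying assumptions, as shown below. The sketch is written in ordinary Python, not in the Reusable Macro Language; the two data sources, the company names, the figures, and the profile thresholds are all hypothetical.

```python
# Two hypothetical sources, standing in for data drawn from separate web sites.
income_source = {
    "ACME": {"revenue": 200.0, "net_income": 30.0},
    "BETA": {"revenue": 100.0, "net_income": 5.0},
    "GAMMA": {"revenue": 300.0, "net_income": 45.0},
}
balance_source = {
    "ACME": {"debt": 50.0, "equity": 100.0},
    "BETA": {"debt": 20.0, "equity": 80.0},
    "GAMMA": {"debt": 150.0, "equity": 100.0},
}

def financial_ratios(income, balance):
    """Derive ratios for companies that appear in both sources."""
    ratios = {}
    for name in income.keys() & balance.keys():
        ratios[name] = {
            "net_margin": income[name]["net_income"] / income[name]["revenue"],
            "debt_to_equity": balance[name]["debt"] / balance[name]["equity"],
        }
    return ratios

ratios = financial_ratios(income_source, balance_source)

# Hypothetical profile: net margin of at least 10% and debt no greater than equity.
matches = sorted(
    name for name, r in ratios.items()
    if r["net_margin"] >= 0.10 and r["debt_to_equity"] <= 1.0
)
```

The search criterion is applied to values that appear in neither source; they exist only after the calculation, which is what separates this kind of query from a numerical or transformational one.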
Seventh, conventional WSEs are incapable of performing time-dependent queries. Queries that are time-dependent may take the form of “Let me know when...” These types of queries may have a refresh capability, with related controls on scheduling, expiration of the request, and so forth. An example of this type of query is a user requesting a notification (for example, by email) any time the search engine becomes aware of a bank stock trading at less than 1.0 times book value.
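A standing query with refresh and expiration controls might look roughly like the following sketch. The StandingQuery class, its field names, and the sample records are hypothetical; time is passed in as a plain number so that no scheduler or email delivery is implied.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class StandingQuery:
    """A query that is re-run on each refresh until it expires."""
    predicate: Callable[[dict], bool]
    expires_at: float
    notified: set = field(default_factory=set)

    def refresh(self, records, now):
        """Return names that newly match; return nothing once expired."""
        if now >= self.expires_at:
            return []
        new_matches = [
            r["name"] for r in records
            if self.predicate(r) and r["name"] not in self.notified
        ]
        self.notified.update(new_matches)
        return new_matches

# "Let me know when a bank stock trades below 1.0 times book value."
query = StandingQuery(
    predicate=lambda r: r["price"] / r["book_value"] < 1.0,
    expires_at=100.0,
)

first = query.refresh([{"name": "First Bank", "price": 12.0, "book_value": 10.0}], now=0.0)
second = query.refresh([{"name": "First Bank", "price": 9.0, "book_value": 10.0}], now=10.0)
third = query.refresh([{"name": "First Bank", "price": 9.0, "book_value": 10.0}], now=20.0)
expired = query.refresh([{"name": "Second Bank", "price": 5.0, "book_value": 10.0}], now=100.0)
```

In this sketch the user is notified only on the second refresh, when the price first falls below book value; later refreshes do not repeat the notification, and nothing is reported after the request expires.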
Finally, conventional WSEs are incapable of performing select queries between unrelated databases, that is, when there is no common key. “Keys” are words or strings of text that are common to two datasets and allow joins to be performed on the two tables (linking columns from each to create a new dataset) in a relational database. A “join” is a method by which data from two different data tables can be combined into one table by matching records based on a common set of information (e.g., a social security number). Joining on keys is not an efficient solution where the whole database is not under the control of one authority that can enforce consistent vocabulary, spelling and usage.
The current reliance of relational databases on matching exact spellings of key fields is a major hindrance to the development of a web-like linking of documents. Differences in wording, spelling and usage mean that documents created by different people, in different countries, potentially in different languages, will never have a common “key” that allows them to be linked.
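The dependence of a join on exactly matching key values can be seen in a short sketch. The two tables, the company names, and the inner_join helper are hypothetical and stand in for tables published independently by different authors.

```python
def inner_join(left, right, key):
    """Combine rows from two tables whose key values match exactly."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]

# Hypothetical tables from two unrelated publishers.
sales_table = [
    {"company": "Intl. Widgets Inc.", "sales": 88.0},
    {"company": "Acme Corp", "sales": 1.2},
]
staff_table = [
    {"company": "International Widgets, Incorporated", "employees": 3000},
    {"company": "Acme Corp", "employees": 40},
]

joined = inner_join(sales_table, staff_table, "company")
```

Only the row whose key is spelled identically in both tables survives the join; the other company is silently dropped even though both tables describe it.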
Therefore, based on the deficiencies of conventional systems and the general sub-architectures of conventional approaches to searching data on networks such as the Internet and in relational databases, it is desirable to overcome the aforementioned problems and other related problems.