In today's economies, data is generated, gathered, and stored at an ever-accelerating rate. Stocks trade at constantly changing prices, scientists decode the human genome, patents are filed, and each of these events is reported in some publication or stored in some database. The ability to access these different sources of information, and to combine them, is becoming crucial for making informed decisions. Ignoring the available information, on the other hand, can result in bad investments, wastefully repeated scientific efforts, and violated intellectual property rights. The list of advantages of having access to relevant information, and the corresponding list of disadvantages of lacking it, could be extended indefinitely.
At the same time, providing access to relevant information is a challenging technical problem. Data is stored in distributed locations and varying formats. It is stored in structured databases, electronic libraries, or even on pages of the World Wide Web. Moreover, different information sources use different vocabularies and have different degrees of credibility.
Information retrieval techniques have been developed to find relevant documents in electronic libraries. These techniques have been widely deployed and refined in order to search for information on the World Wide Web. Users formulate queries by typing in keywords that are related to the information they want to find. For example, if a user is searching for a listing of the law firms in the Palo Alto area, she might provide the keywords "LLP" and "Palo Alto". If the user is lucky, she will retrieve a listing of all law firms in the Palo Alto area. Very likely, though, she will also have to scan through pages that provide irrelevant information, like news articles about a Palo Alto-based software company suing a Seattle-based software company. Moreover, relevant information, like a listing of law firms based in neighboring Menlo Park, might not be retrieved.
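The two failure modes described above can be illustrated with a minimal Python sketch of keyword matching. The document collection, the firm names, and the function name keyword_search are all invented for illustration. Real retrieval systems rank documents rather than applying a strict all-keywords filter, so this shows only the underlying problem.

```python
# Hypothetical mini-collection; ids and texts are invented for illustration.
DOCUMENTS = {
    "palo_alto_law_firms": "Smith & Jones LLP, Palo Alto: attorneys at law",
    "lawsuit_news": (
        "Palo Alto software company sues Seattle rival; "
        "Doe LLP represents the plaintiff"
    ),
    "menlo_park_law_firms": "Miller & Brown LLP, Menlo Park: attorneys at law",
}


def keyword_search(documents, keywords):
    """Return the sorted ids of documents containing every keyword.

    Matching is a plain case-insensitive substring test, which is an
    assumption of this sketch and not how a production engine works.
    """
    lowered = [keyword.lower() for keyword in keywords]
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if all(keyword in text.lower() for keyword in lowered)
    )


hits = keyword_search(DOCUMENTS, ["LLP", "Palo Alto"])
print(hits)
```

With this sample data, the news article is returned alongside the wanted listing, while the Menlo Park listing is missed because it never mentions "Palo Alto".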
These retrieval problems are the subject of considerable academic interest. See, for example, the Proceedings of the Annual International ACM SIGIR Conferences on Research and Development in Information Retrieval.
Whereas the problem of guessing a document's relevance given a list of keywords is "just" difficult, searching a structured database by entering keywords is in most cases impossible. As an example, consider a database that stores all sales transactions of a department store chain. Assume a manager of this company wants to promote the sales clerk who generated the highest revenue in the previous year. In order to find this sales clerk, the database system has to scan all sales transactions, add up the sales for each clerk, and find the clerk with the highest total sales. A keyword search cannot express this computation, so it could never yield an answer to the manager's query.
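The three steps just described (scan, add up per clerk, pick the maximum) can be sketched in a few lines of Python. The sample transactions and the function name top_clerk are hypothetical. The sketch only shows why the answer requires computation over all records instead of matching individual ones.

```python
from collections import defaultdict

# Hypothetical sample data: (clerk, sale amount) pairs, invented for illustration.
TRANSACTIONS = [
    ("alice", 120.0),
    ("bob", 300.0),
    ("alice", 250.0),
    ("carol", 90.0),
]


def top_clerk(transactions):
    """Return (clerk, total) for the clerk with the highest total sales.

    Raises ValueError if there are no transactions at all.
    """
    totals = defaultdict(float)
    # Scan every transaction and add its amount to that clerk's running total.
    for clerk, amount in transactions:
        totals[clerk] += amount
    # Pick the clerk whose accumulated total is largest.
    return max(totals.items(), key=lambda item: item[1])


best = top_clerk(TRANSACTIONS)
print(best)
```

No single transaction identifies the answer here: the largest individual sale belongs to one clerk, while the highest total belongs to another.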
Database management systems can be queried using sophisticated query languages. These query languages are expressive enough to formulate a query that would answer the manager's question in the previous example. For instance, using the relational query language SQL, the query might look as follows:
-- Total sales per employee.
CREATE VIEW Totals AS
    SELECT employee_id, SUM(sales_amount) AS total_sales
    FROM Transactions
    GROUP BY employee_id;

-- Name of every employee whose total is at least as large as all others.
SELECT employee_name
FROM Employees, Totals
WHERE Employees.employee_id = Totals.employee_id
    AND total_sales >= ALL (SELECT total_sales FROM Totals);
This query accesses just a single database. Data in this database is stored in a single common format. Clearly, query languages that allow formulating queries across multiple databases or across multiple formats, or that allow combining information from structured databases, electronic libraries, and the World Wide Web, are even more complex. Non-technical users, like managers in a department store chain, obviously cannot be expected to formulate their information requests in these complex query languages.
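For readers who want to try the single-database query above, the following is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. Several things here are assumptions and not part of the original text: the table contents are invented, the identifiers use underscores because unquoted SQL identifiers cannot contain hyphens, and the ">= ALL" comparison is replaced by an equivalent comparison against MAX because SQLite does not support the ALL quantifier.

```python
import sqlite3

# In-memory database with hypothetical sample rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Employees (employee_id INTEGER PRIMARY KEY, employee_name TEXT);
    CREATE TABLE Transactions (employee_id INTEGER, sales_amount REAL);

    INSERT INTO Employees VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO Transactions VALUES (1, 120.0), (2, 300.0), (1, 250.0);

    -- Total sales per employee, as in the view from the text.
    CREATE VIEW Totals AS
        SELECT employee_id, SUM(sales_amount) AS total_sales
        FROM Transactions
        GROUP BY employee_id;
    """
)

# SQLite lacks ">= ALL (...)", so compare against the maximum total instead.
rows = conn.execute(
    """
    SELECT employee_name
    FROM Employees
    JOIN Totals ON Employees.employee_id = Totals.employee_id
    WHERE total_sales = (SELECT MAX(total_sales) FROM Totals)
    """
).fetchall()

top_names = [name for (name,) in rows]
print(top_names)
conn.close()
```

Even in this small form, the query requires knowing the table layout, the join condition, and the aggregation syntax, which is the burden on non-technical users that the text describes.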