The “hidden Web” has grown explosively as an increasing number of databases go online, from product catalogs and census data to celebrities' burial sites. That information is hidden in the sense that the pages displaying it are constructed on demand by query programs that draw on a database not directly available to World Wide Web (“Web”) users. It is estimated that 80% of all data on the Web can be accessed only via forms in this manner.
There are many reasons for providing such interfaces on the Web. Transferring the large files that result from broad queries of large databases can unnecessarily overload Web servers, especially if users are interested in only a small subset of the data. Further, many users may find it cumbersome to locate the particular record they require by accessing a database directly. Giving direct access to the databases through expressive query languages such as SQL or XML-QL is not practical, as those languages are too complex for casual Web users. Form interfaces are thus a good choice, as they provide a very simple way to query (and filter) data. A final consideration is attractiveness to both users and providers. For providers, a restrictive form interface (or a series of them) allows the presentation of many more advertisement hits than simply presenting a database for users to search with the browser. For users, a point-and-click interface may be more appealing than a cold and official-looking flat file.
Form interfaces can, however, be quite restrictive, disallowing interesting queries and hindering data exploration. In some cases, the restrictions are intentionally imposed by the content provider to protect its data. For example, a book database and readers' comments presented on a bookseller's Web site may be competitively important to the bookseller, and it would therefore be to the bookseller's benefit to prevent large-scale replication of that data by requiring the use of restricted queries. Frequently, such entities discourage replication of the data available on their Web sites by detecting series of systematic queries or large numbers of queries from a single source.
In other instances, the restrictions appear simply to be the result of poor interface design. For example, the U.S. Census Bureau Tract Street Locator (http://tier2.census.gov/ctsl/ctls.htm) currently requires both a ZIP code and the first letter of a street name, making it difficult for users to gather information about all streets within a given ZIP code. As a result of such interfaces, a great wealth of information is buried and apparently inaccessible in many Web sites.
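To illustrate the burden such an interface places on a user, the following minimal sketch shows how retrieving all streets in one ZIP code would require one form submission per possible first letter. It is an illustration under stated assumptions, not the interface's actual protocol: the field names `zip` and `first_letter`, the function names, and the stand-in `fake_fetch` are all hypothetical, and street names are assumed to begin with one of the letters A through Z.

```python
import string
from typing import Callable, Dict, List

Query = Dict[str, str]


def letter_queries(zip_code: str) -> List[Query]:
    """Build one restricted query per possible first letter (A-Z).

    The field names "zip" and "first_letter" are hypothetical placeholders.
    """
    return [
        {"zip": zip_code, "first_letter": letter}
        for letter in string.ascii_uppercase
    ]


def collect_streets(zip_code: str, fetch: Callable[[Query], List[str]]) -> List[str]:
    """Issue every restricted query in turn and merge the results.

    `fetch` stands in for submitting the form and parsing the result page.
    """
    streets = set()
    for params in letter_queries(zip_code):
        streets.update(fetch(params))
    return sorted(streets)


def fake_fetch(params: Query) -> List[str]:
    """Hypothetical stand-in for the Web form, backed by made-up sample data."""
    table = {
        ("12345", "A"): ["ADAMS ST", "ASH AVE"],
        ("12345", "M"): ["MAIN ST"],
    }
    return table.get((params["zip"], params["first_letter"]), [])


if __name__ == "__main__":
    print(len(letter_queries("12345")))          # 26 form submissions
    print(collect_streets("12345", fake_fetch))  # merged, de-duplicated list
```

Even in this simplified form, a single broad question ("all streets in this ZIP code") expands into 26 separate restricted queries, each of which would incur its own round trip to the Web server.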
Retrieving information through restricted interfaces can be a difficult task. Network traffic and high latencies from Web servers often make access times so long that it is not feasible to retrieve data using serial queries through the provided interfaces. Furthermore, because of the methods some providers have put in place to discourage data replication, systematic queries may not be possible. It would be advantageous to have a method of presenting large-scale queries to Web databases that solves those problems.