The World Wide Web (the “Web” or “WWW”) is an architectural framework for accessing documents, called Web pages, that are stored on the Internet, a worldwide network of distributed servers. An information source is any networked repository, e.g., a corporate database, a WWW site or any other processing service. The architectural framework of the Web integrates Web pages using links. A Web page may contain elements such as text, graphics, images, video and audio. Web pages sent over the Web are prepared in HTML (HyperText Markup Language) format. An HTML file includes elements describing the document's content as well as numerous markup instructions that control how the document is displayed to the user.
Access to online information via the Web is exploding. Search engines must integrate a huge variety of repositories that store this information in heterogeneous formats. Although the files sent over the Web are prepared in HTML format, heterogeneity remains both in search query formats and in search result formats. A search engine must therefore provide homogeneous access to the underlying heterogeneous information and a homogeneous presentation of the information found. To do so, a search engine (including a meta-search engine such as Xerox askOnce) uses wrappers to extract and regenerate information from documents stored in heterogeneous Web information sources.
A wrapper is a type of interface, or container, that is tied to data: it encapsulates and hides the intricacies of a remote information source in accordance with a set of rules known as a grammar (or wrapper grammar), and it provides two functions to an information broker. First, a wrapper translates a client query into a corresponding query that the remote information source will understand. Second, a wrapper extracts the information stored in the HTML files that represent the individual Web pages: it scans the HTML files returned by the search engine, drops the markup instructions and extracts the information related to the query. If an information broker is involved, the wrapper parses (or processes) the results into a form that the information broker can interpret and filter. The wrapper then takes the search answers, whether from the different document repositories or from the information broker, and puts them into a new format that can be viewed by the user. Extraction and parsing are done in accordance with the grammar, i.e., the rules for the particular type of response file.
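The extraction function described above can be sketched as follows. This is a minimal illustration, not a full wrapper grammar: it assumes a hypothetical result-page layout in which each hit appears as a list item with a “result” class, drops all markup instructions, and keeps only the content related to the query.

```python
# Minimal wrapper sketch (hypothetical page layout; a real wrapper grammar
# encodes source-specific extraction rules). The parser scans an HTML result
# page, discards the markup, and collects the text of each result item.
from html.parser import HTMLParser


class ResultWrapper(HTMLParser):
    """Extracts result text from <li class="result"> elements."""

    def __init__(self):
        super().__init__()
        self._in_result = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        # Markup instructions are inspected only to locate result items,
        # then dropped from the output.
        if tag == "li" and ("class", "result") in attrs:
            self._in_result = True
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_result = False

    def handle_data(self, data):
        if self._in_result:
            self.results[-1] += data


page = '<ul><li class="result">First hit</li><li class="result">Second hit</li></ul>'
wrapper = ResultWrapper()
wrapper.feed(page)
print(wrapper.results)  # -> ['First hit', 'Second hit']
```

In a full system, the same grammar-driven pass would also recover structured fields (title, URL, snippet) before handing the results to the information broker for filtering.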
In addition to generating a wrapper capable of extracting and regenerating information from documents stored in heterogeneous Web information sources, a search engine must also be capable of automatically acquiring the search query language features supported by each Web information source. Search query language features include, for example, the Boolean operators, how a source treats quotation marks around search keywords, how a source treats parentheses and commas, and so on. Due to the innate heterogeneity of Web information sources, sources differ not only in the query features they support but also in the syntactic representation of those features. For example, the Boolean operator “AND” can be represented as “and”, “&” or “ ” (whitespace).
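The syntactic variation among sources can be captured in a simple translation table. The sketch below uses hypothetical source names and syntax entries to show how one abstract operator (Boolean AND) maps to the representation each source expects:

```python
# Sketch of per-source query translation (source names and syntax entries
# are illustrative assumptions, not a real catalog of Web sources).
AND_SYNTAX = {
    "source_a": " and ",  # explicit keyword
    "source_b": " & ",    # ampersand
    "source_c": " ",      # whitespace implies conjunction
}


def translate_and(terms, source):
    """Join query terms using the AND representation the source expects."""
    return AND_SYNTAX[source].join(terms)


print(translate_and(["hidden", "web"], "source_b"))  # -> hidden & web
print(translate_and(["hidden", "web"], "source_c"))  # -> hidden web
```

A complete query translator would hold one such table per operator (OR, NOT, phrase quoting, grouping), which is exactly the knowledge that must be acquired for each source.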
Searching for relevant information is a primary activity on the Web. Often, people search for information using general-purpose search engines, such as Google or Yahoo!, which collect and index billions of Web pages. However, an important fragment of the Web, called the “Hidden” Web, remains unavailable for centralized indexing; it includes the content of local databases and document collections accessible through search interfaces offered by various small and mid-sized Web sites, including company sites, university sites, media sites, etc. According to a study conducted by BrightPlanet in 2000, the Hidden Web is about 400 to 550 times larger than the commonly defined (or “Visible”) World Wide Web.
Commercial approaches to the Hidden Web usually take the form of Yahoo!-like directories that organize local sites in specific domains. Important examples of such directories are www.InvisibleWeb.com and www.BrightPlanet.com. BrightPlanet.com's gateway site, CompletePlanet.com, is both a directory and a meta-search engine. For each database it incorporates into its search, a meta-search engine uses a manually written wrapper.
Like the Visible Web, search resources on the Hidden Web are highly heterogeneous. In particular, they use different document retrieval models, such as the Boolean or the vector-space model. They allow different operators for query formulation and, moreover, the syntax of supported operators can vary from one site to another. Conventionally, query languages are determined manually: reading the help pages associated with a given search interface, probing the interface with sample queries and checking the result pages is often the method of choice.
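The probing step can be sketched as a simple heuristic: issue a query containing the operator under test, issue a matched query containing a nonsense token, and compare hit counts. The `search` callable below is a stand-in for querying a live interface; the simulated engine is an assumption for illustration only.

```python
# Sketch of probe-based operator detection. `search` stands in for issuing
# a query to a live search interface and returning its hit count.
def supports_operator(search, op, terms=("knowledge", "discovery")):
    """Heuristic: if a probe with the operator returns no more hits than a
    probe with a nonsense token, the operator is likely treated as a plain
    (unmatched) literal term, i.e., not supported."""
    with_op = search(f"{terms[0]} {op} {terms[1]}")
    nonsense = search(f"{terms[0]} qzxqzx {terms[1]}")
    return with_op > nonsense


# Simulated engine: understands "AND" as an operator and treats every other
# token, including unknown operators, as a literal term that must match.
def fake_search(query):
    tokens = query.split()
    if "qzxqzx" in tokens:
        return 0
    return 120 if "AND" in tokens else 0


print(supports_operator(fake_search, "AND"))   # -> True
print(supports_operator(fake_search, "NEAR"))  # -> False
```

Real probing must cope with noisy hit counts and result pages that need wrapping before counts can even be read, which is one reason the manual version of this process is error-prone.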
The manual acquisition of Web search interfaces has important shortcomings. First, the manual approach hardly scales to the thousands of search resources that compose the Hidden Web. Second, the manual testing of Web resources with probe queries is often error-prone. Third, incorrect or incomplete help pages are frequent: operators that are actually supported by an engine may not be mentioned in the help pages, and conversely, help pages may mention operators that the engine does not support.