The World Wide Web (“www” or “Web”) continues to rapidly “deepen” as many searchable databases come online, where data are hidden behind query forms. Unlike the surface Web, which provides link-based navigation, these “deep Web” sources support query-based access, so their data are reachable only through their query interfaces. With the myriad databases online, on the order of 10^5, the deep Web has clearly rendered large-scale integration a real necessity and a real challenge.
Guarding the data behind them, such query interfaces serve as “entrances” to the deep Web. These interfaces, or HTML query forms, express query conditions for accessing objects from the databases behind them. Other documents may also guard or provide access to data in an analogous manner. Each condition, in general, specifies an attribute, one or more supported operators (or modifiers), and a domain of allowed values. A condition is thus a three-tuple [attribute; operators; domain], e.g., Cauthor=[author;{“first name . . . ”, “start . . . ”, “exact name”}; text] in interface Qam (see FIG. 3(a)). Users can then use the condition to formulate a specific constraint, e.g., [author=“tom clancy”], by selecting an operator (e.g., “exact name”) and filling in a value (e.g., “tom clancy”).
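By way of illustration only, the three-tuple condition and the constraint formed from it may be sketched in Python as follows. The class names Condition and Constraint and the method constrain are hypothetical names chosen for this sketch and do not appear in the figures; the operator labels are copied as they appear above, truncation included.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Constraint:
    """A specific constraint a user formulates, e.g. [author="tom clancy"]."""
    attribute: str
    operator: str
    value: str


@dataclass(frozen=True)
class Condition:
    """A query condition as a three-tuple [attribute; operators; domain]."""
    attribute: str
    operators: Tuple[str, ...]
    domain: str

    def constrain(self, operator: str, value: str) -> Constraint:
        # A constraint may only use an operator the condition supports.
        if operator not in self.operators:
            raise ValueError(f"unsupported operator: {operator!r}")
        return Constraint(self.attribute, operator, value)


# Cauthor from interface Qam; the "..." in the labels is kept as truncated
# in the text and is not filled in here.
c_author = Condition(
    attribute="author",
    operators=("first name ...", "start ...", "exact name"),
    domain="text",
)

# The user selects an operator and fills in a value.
constraint = c_author.constrain("exact name", "tom clancy")
```

In this sketch, an operator that the condition does not list is rejected with a ValueError, which mirrors the idea that an interface exposes only the capabilities its conditions specify.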
For modeling and integrating Web databases, the first step is to “understand” what a query interface says, i.e., what query capabilities a source supports through its interface, in terms of specifiable conditions. For instance, amazon.com (FIG. 3(a)) supports a set of five conditions (on author, title, . . . , publisher). These query conditions establish the semantic model underlying the Web query interface. According to an aspect of the present invention, one may extract such form semantics.
Automatic capability extraction is critical for large-scale integration. Any mediation task generally relies on source descriptions that characterize the sources. Such descriptions, largely constructed by hand today, have been identified as a major obstacle to scaling up integration scenarios. For massive and ever-changing sources on the Web, automatic capability extraction is essential for many tasks, e.g., to model Web databases by their interfaces, to classify or cluster query interfaces, to match query interfaces, or to build unified query interfaces.
Such form understanding essentially requires both grouping elements hierarchically and tagging their semantic roles. First, grouping associates semantically related HTML elements into one construct. For instance, Cauthor in Qam is a group of 8 elements: a text “author”, a textbox, three radio buttons, and their associated texts. Such grouping is hierarchical, with nested subgroups (e.g., each radio button is first associated with the text to its right, before further grouping). Second, tagging assigns a semantic role to each element (e.g., in Cauthor, “author” has the role of an attribute, and the textbox that of an input domain).
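The result of grouping and tagging for Cauthor may be pictured as a small tree, sketched below in Python. The classes Element and Group are hypothetical, and tagging the radio buttons and their texts with the role "operator" is an assumption made for this sketch; the text above states roles only for “author” and the textbox. The operator labels are again kept as truncated.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Union


@dataclass
class Element:
    kind: str                    # "text", "textbox", or "radio"
    label: str = ""
    role: Optional[str] = None   # semantic role assigned by tagging


@dataclass
class Group:
    name: str
    children: List[Union["Group", Element]] = field(default_factory=list)

    def leaves(self) -> Iterator[Element]:
        """Yield every element in this group, descending into subgroups."""
        for child in self.children:
            if isinstance(child, Group):
                yield from child.leaves()
            else:
                yield child


def radio_pair(label: str) -> Group:
    # Inner grouping: a radio button is first paired with the text to its right.
    # Role "operator" is an assumption of this sketch.
    return Group("operator", [
        Element("radio", role="operator"),
        Element("text", label, role="operator"),
    ])


# Outer grouping: attribute text, input textbox, and three nested subgroups.
c_author = Group("Cauthor", [
    Element("text", "author", role="attribute"),
    Element("textbox", role="domain"),
    radio_pair("first name ..."),
    radio_pair("start ..."),
    radio_pair("exact name"),
])

element_count = len(list(c_author.leaves()))
```

Here element_count covers one attribute text, one textbox, three radio buttons, and three associated texts, matching the 8 elements described above.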
Such extraction is challenging, since query forms are often created autonomously. This task seems to be rather “heuristic” in nature, with no clear criteria but only a few fuzzy heuristics as well as exceptions. First, grouping is hard, because a condition is generally n-ary, with various numbers of elements nested in different ways. ([heuristic]: Pair closest elements by spatial proximity. [exception]: Grouping is often not pairwise.) Second, tagging is also hard, as there is no semantic labeling in HTML forms. ([heuristic]: A text element closest to a textbox field is its attribute. [exception]: Such an element can instead be an operator of this or the next field.) Finally, with various form designs, extraction can be inherently confusing. The infamous Florida “butterfly” ballots in US Election 2000 indicate that ill-designed “forms” can make it difficult, even for human voters, to simply associate candidates with their punch holes. This incident in fact generated discussions on Web-form designs.
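The tagging heuristic and its exception can be made concrete with a minimal Python sketch. The layout coordinates below are invented purely for illustration and are not taken from any figure; the names Item and nearest_text are likewise hypothetical.

```python
import math
from typing import List, NamedTuple


class Item(NamedTuple):
    kind: str    # "text" or "textbox"
    label: str
    x: float     # hypothetical rendered position, in pixels
    y: float


def nearest_text(target: Item, items: List[Item]) -> Item:
    """Heuristic: the text element closest to a field is taken as its attribute."""
    texts = [it for it in items if it.kind == "text"]
    return min(texts, key=lambda t: math.hypot(t.x - target.x, t.y - target.y))


# Invented layout: the attribute text sits to the left of the textbox,
# while an operator text sits directly beneath it.
layout = [
    Item("text", "author", 0, 0),
    Item("textbox", "", 100, 0),
    Item("text", "exact name", 100, 15),
]

box = layout[1]
guess = nearest_text(box, layout)
```

Under these assumed positions the heuristic selects “exact name”, which is an operator rather than the attribute, because that text lies 15 units from the textbox while “author” lies 100 units away. This is the kind of exception that makes proximity alone unreliable for tagging.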