The number of online databases and services has been growing at a rapid pace. These are typically accessed through Web forms filled out by users, and content is returned on demand upon form submission. This content can reside in databases or can be generated by an application. Such sites are collectively called “hidden-Web sites.” As the volume of hidden information grows, there has been increased interest in techniques that allow users and applications to leverage this information. Applications that attempt to make hidden-Web information more easily accessible include metasearchers, hidden-Web crawlers, online-database directories, and Web information integration systems. Because, for any given domain of interest, there are many hidden-Web sources whose data need to be integrated or searched, an important requirement for these applications is the ability to locate the hidden-Web sources. However, doing so is a challenging problem given the scale and the dynamic nature of the Web. Thus, it is important to be able to automatically discover the forms that serve as entry points to hidden-Web databases and Web applications.
Forms are very sparsely distributed over the Web, even within a well-defined domain. To efficiently maintain an up-to-date collection of hidden-Web sources, a crawling strategy should perform a broad search while simultaneously avoiding unproductive regions of the Web. The crawler must also produce high-quality results. Having a homogeneous set of forms in the same domain is useful, and sometimes required, for a number of applications. For example, the effectiveness of form integration techniques can be greatly diminished if the set of input forms is noisy and contains forms that are not in the integration domain. However, an automated crawling process invariably retrieves a diverse set of forms, because a focus topic may encompass pages that contain searchable forms from many different database domains. For instance, while crawling to find “airfare” search interfaces, a crawler is likely to retrieve a large number of forms in different domains, such as “rental cars” and “hotels”, because these are often co-located with “airfare” search interfaces in travel sites. The set of retrieved forms also includes many non-searchable forms that do not represent database queries, such as forms for login, mailing list subscriptions, quote requests, and Web-based email.
Given a set of heterogeneous forms, it is also beneficial to group together forms that correspond to similar databases or Web services provided by Web applications, so that people and applications can more easily find the correct databases or services and, consequently, the hidden information they are seeking on the Web. There are several challenges in organizing these forms. Notably, a scalable solution must be able to automatically parse, process, and group form interfaces that are designed primarily for human consumption. In addition, because there is very wide variation in the way Web-site designers model aspects of a given domain, it is not possible to assume certain standard form field names and structures. Even in simple domains such as job search, forms are highly heterogeneous. Different terms may be used to represent the same attributes. For example, in a first form related to a job search topic, two fields are named “Job Category” and “State”, whereas in a second form related to the same topic, the corresponding fields are named “Industry” and “Location” to represent the same concepts. Simple search interfaces often have a single attribute with a generic label such as “Search”, while others have no labels. For example, a text field of a form may have no associated label between the “FORM” tags, even though the label “Search Jobs” appears above the text field in a hypertext markup language (HTML) subtree that lies outside the form. There are also forms that do not contain any parseable attribute names because GIF images are used in place of text labels. Thus, what is needed is a method and a system for identifying relevant content, including forms used to access content on a network such as the Internet, and/or for clustering the identified relevant content.
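By way of illustration only, the labeling difficulty described above can be sketched with a short Python example. The page fragment, the class name `FormTextCollector`, and the helper `collect` are hypothetical and are not taken from any particular Web site or from any particular embodiment; the sketch merely shows that visible text gathered strictly between the “FORM” tags can be empty while the descriptive label sits in a neighboring HTML subtree.

```python
from html.parser import HTMLParser

# Hypothetical page fragment (an assumption for illustration): the only
# descriptive text, "Search Jobs", lies outside the FORM tags.
PAGE = """
<table>
  <tr><td><b>Search Jobs</b></td></tr>
  <tr><td>
    <form action="/jobs" method="get">
      <input type="text" name="q">
      <input type="submit" value="Go">
    </form>
  </td></tr>
</table>
"""


class FormTextCollector(HTMLParser):
    """Collects visible text inside and outside FORM elements separately."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0      # greater than 0 while inside a FORM element
        self.inside_text = []    # visible text found between FORM tags
        self.outside_text = []   # visible text found elsewhere on the page
        self.field_names = []    # name attributes of fields inside a FORM

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag in ("input", "select", "textarea") and self.form_depth:
            name = dict(attrs).get("name")
            if name:
                self.field_names.append(name)

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth:
            self.form_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        target = self.inside_text if self.form_depth else self.outside_text
        target.append(text)


def collect(html):
    """Parse an HTML string and return the populated collector."""
    parser = FormTextCollector()
    parser.feed(html)
    parser.close()
    return parser


result = collect(PAGE)
print("fields:", result.field_names)
print("text inside form:", result.inside_text)
print("text outside form:", result.outside_text)
```

In this hypothetical fragment, the text collected inside the form is empty and the only field name is the uninformative `q`, so a process that examines the form alone has little to work with, whereas the surrounding page text carries the label.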