Searching for information on the World Wide Web can often become a very frustrating experience for users. A free form text search engine provides no help to a user regarding how to structure a search to find desired content. Thus, users often enter a search term or terms in a free form text box without any concern for the form of the data for which they are searching. A search may return an unmanageable number of irrelevant results while potentially missing some of the most relevant information because of the format in which that information was stored or displayed on a hosted website. These are some of the major challenges with searching web pages, also referred to as unstructured content.
Such challenges with web search become much more apparent when searching structured data, also known as the “Deep Web”. The Deep Web comprises a vast set of content hidden inside a multitude of databases, each stored in a format that may be unknown and may differ from the storage format of any other database. Estimates from the University of California, Berkeley place the size of the Deep Web at over 500× the amount of content contained in all currently existing web pages, also called the surface web.
While a number of current approaches attempt to search the Deep Web using automated crawlers and keyword searches, they face many challenges. This is because, in general, search technology that works for web pages or unstructured content does not translate to structured data. As is shown in FIG. 1, keyword search engines for searching data contained in web pages primarily 1) crawl web pages, where keywords are searched and captured, 2) organize content based upon the popularity of content, 3) match the content to a small number of keywords, and 4) present links to web pages that match the keywords from a user query.
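The four steps above can be illustrated with a minimal sketch. The corpus, URLs, popularity scores, and function names below are hypothetical and chosen only for illustration; they do not represent the implementation of any actual search engine.

```python
from collections import defaultdict

# Hypothetical toy corpus: URL -> (page text, popularity score).
PAGES = {
    "http://example.com/a": ("used cars for sale in boston", 0.9),
    "http://example.com/b": ("car repair tips and tricks", 0.4),
    "http://example.com/c": ("boston weather forecast", 0.7),
}


def build_index(pages):
    """Step 1: 'crawl' each page and record which pages contain each keyword."""
    index = defaultdict(set)
    for url, (text, _popularity) in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(query, index, pages):
    """Steps 2-4: match keywords, order matches by popularity, return links."""
    keywords = query.lower().split()
    if not keywords:
        return []
    # Keep only pages containing every keyword; an unknown keyword
    # contributes an empty set, so the intersection is then empty.
    matches = set.intersection(*(index.get(k, set()) for k in keywords))
    return sorted(matches, key=lambda url: pages[url][1], reverse=True)


INDEX = build_index(PAGES)
```

In this sketch, a query for "car" matches only the page containing that exact token and misses the page containing "cars", which illustrates how literal keyword matching can miss relevant content because of the form in which it is stored.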
In contrast, as has been recognized by the inventors of the present invention, the requirements for searching through the structured Deep Web are quite different. 1) In crawling such Deep Web data, understanding the structure of the dataset is more important than capturing keywords. 2) In composing queries to retrieve data, knowledge of the structure of the dataset being searched and recognition of data fields and their inter-relationships are critical; popularity does not help in presenting relevant answers, as this data is typically not subject to popularity determinations. 3) When querying such Deep Web data, users tend to ask elaborate queries, unlike the 2-3 keywords typical when searching web pages. Understanding the meaning of user queries is therefore critical, and simply employing a keyword-to-keyword correlation does not suffice. 4) Lastly, when interpreting results from data, users expect meaningful summaries and visuals of the data, not links to detailed records of the datasets, which is what they would receive if web page search technology were simply carried over to data search. In short, a knowledge base specific to the content, an understanding of a user's complex queries and of the relationships between the query terms, and a meaningful matching of the two to present relevant answers are all important in understanding and delivering the content contained in the Deep Web.
Prior solutions for searching web data either are not knowledge base driven or cannot practically scale to the vast Deep Web.
Google®, Microsoft®, Kosmix™, DeepPeep™, DeepDyve™, Socrata™, Infochimps™, Data.gov and many others have been trying to implement methods for searching the Deep Web. Each has different technologies, tools, and most importantly very different approaches, as will now be described.
Automated Crawlers to Peek-thru Web Forms or APIs to search Underlying Databases. This approach focuses on searching datasets behind HTML forms, and thus excludes from web search the vast majority of datasets that do not have forms in front of them. For example, if one is looking to buy a car, one might visit Edmunds.com and fill in the search form by selecting the Manufacturer, Model, Price Range, Zip Code, etc. The information filled into the form is used to compose a database query, which is then submitted to one or more databases, and the results are presented as an HTML page. Because this page is created on demand, current search engines cannot see the page.
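The form-to-query flow described above can be sketched as follows. The table schema, sample rows, field names, and function names are assumptions made for illustration only, and are not taken from any actual website's back end.

```python
import sqlite3


def make_db():
    """Build a hypothetical in-memory database standing in for a site's back end."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cars (manufacturer TEXT, model TEXT, price INTEGER, zip_code TEXT)"
    )
    conn.executemany(
        "INSERT INTO cars VALUES (?, ?, ?, ?)",
        [
            ("Honda", "Civic", 21000, "02139"),
            ("Honda", "Accord", 27000, "02139"),
            ("Toyota", "Corolla", 20000, "94105"),
        ],
    )
    return conn


def compose_query(form):
    """Turn the filled-in form fields into a parameterized SQL query."""
    clauses, params = [], []
    if form.get("manufacturer"):
        clauses.append("manufacturer = ?")
        params.append(form["manufacturer"])
    if form.get("max_price") is not None:
        clauses.append("price <= ?")
        params.append(form["max_price"])
    if form.get("zip_code"):
        clauses.append("zip_code = ?")
        params.append(form["zip_code"])
    sql = "SELECT manufacturer, model, price FROM cars"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY price", params


def render_results(conn, form):
    """Run the composed query and build the on-demand HTML result page."""
    sql, params = compose_query(form)
    rows = conn.execute(sql, params).fetchall()
    items = "".join(f"<li>{make} {model}: ${price}</li>" for make, model, price in rows)
    return f"<ul>{items}</ul>"
```

Under these assumptions, the HTML returned by `render_results` exists only after a particular form submission, so a crawler that merely follows static links never encounters it.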
Google®'s approach to the Deep Web is to find HTML forms, send input to these forms, and index the resulting HTML pages. Google®'s approach is fully automated, can easily scale, and fits nicely with its infrastructure built for searching web pages. A more in-depth discussion of this approach is included in Alon's VLDB paper published in 2008. Kosmix™ takes an approach similar to that of Google®, but taps into the underlying data by using API calls instead of web forms. DeepPeep™ follows a similar approach of tapping into web forms to search underlying databases.
While this approach offers some benefits, it has severe limitations in the scope of content that can be searched and the level of analytics that can be conducted. Given the simplicity of this approach, it can be easily scaled to a large number of Web Forms or APIs. This approach, however, leaves out a huge portion of the Deep Web comprising datasets that do not have a Form or API in front of them, as is the case with many government, finance, research, or other similar datasets on the web. Additionally, forms and APIs offer only a limited window into the underlying databases, and hence allow only simple queries but not advanced analytics.
Powerset™, Hakia™, Kosmix™, Wolfram Alpha™ and others use knowledge bases to search content, but members of the community cannot make their own datasets searchable for other users on the web. These approaches, which depend on internally built taxonomies, knowledge bases, etc., can be applied to limited content but cannot be scaled to the billions of web pages and datasets.
Therefore, it would be desirable to provide an apparatus, method, system and solution that overcomes the drawbacks of the prior art and allows for efficient and effective searching of all of the Web.