The World Wide Web (WWW) comprises an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as Web pages. Users navigate these pages by means of computer software programs commonly known as Web browsers.
The authors of Web pages provide information known as metadata within the body of the markup language document that defines each Web page; the markup language commonly used is Hypertext Markup Language (HTML). A computer software product known as a Web crawler systematically accesses Web pages by sequentially following hypertext links from page to page. The crawler indexes the pages for use by search engines, using information about each Web page as provided by its address or uniform resource locator (URL), its metadata, and other criteria found within the page.
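The crawling and indexing behavior described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory collection of pages (the `PAGES` dictionary, its URLs, and its contents are invented for illustration) in place of live network access, and it uses Python's standard `html.parser` module to read `<meta>` tags and hyperlinks. It is not intended to represent how any particular crawler is implemented.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageScanner(HTMLParser):
    """Collects <meta name=... content=...> pairs and <a href=...> targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.metadata = {}
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs and "content" in attrs:
            self.metadata[attrs["name"]] = attrs["content"]
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the address of the current page.
            self.links.append(urljoin(self.base_url, attrs["href"]))


# Hypothetical stand-in for the Web: URL -> HTML source.
PAGES = {
    "http://example.com/": (
        '<html><head><meta name="keywords" content="weather, forecast"></head>'
        '<body><a href="/quotes.html">Quotes</a></body></html>'
    ),
    "http://example.com/quotes.html": (
        '<html><head><meta name="description" content="Stock quotes"></head>'
        '<body><a href="/">Home</a></body></html>'
    ),
}


def crawl(start_url, pages):
    """Follow links from page to page and index each page's metadata by URL."""
    index = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in index or url not in pages:
            continue  # already indexed, or not reachable in this sketch
        scanner = PageScanner(url)
        scanner.feed(pages[url])
        index[url] = scanner.metadata
        queue.extend(scanner.links)
    return index


INDEX = crawl("http://example.com/", PAGES)
```

In this sketch, the index is keyed by URL and holds only the metadata found on each page; a practical index would presumably record additional criteria drawn from the page content.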
Much of the publicly available information on the Internet is intended to be viewed by humans using browsers. The Web sites that carry this information are primarily built on the Hypertext Transfer Protocol (HTTP) and the HTML display markup language. The use of HTML for presentation of Web data makes it difficult for application developers to gain programmatic access to the data on those Web sites.
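The difficulty can be seen by comparing two encodings of the same datum. The sketch below is illustrative only: the documents, the element names, and the helper functions are all hypothetical. It extracts a price from a structured data document (XML is assumed here as the data-oriented counterpart) and from an HTML fragment written for display.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical documents carrying the same stock quote.
XML_DOC = '<quote><symbol>ACME</symbol><price currency="USD">12.50</price></quote>'
HTML_DOC = (
    '<table><tr><td><b>ACME</b></td>'
    '<td><font color="green">12.50</font></td></tr></table>'
)


def price_from_xml(doc):
    """The element name identifies the datum, so extraction is direct."""
    root = ET.fromstring(doc)
    return float(root.findtext("price"))


def price_from_html(doc):
    """The markup describes appearance only; the code must assume that the
    price sits in the second table cell, which breaks if the layout changes."""
    cells = re.findall(r"<td>(.*?)</td>", doc, flags=re.S)
    text = re.sub(r"<[^>]+>", "", cells[1])  # strip presentation tags
    return float(text)
```

Both functions return the same value for these inputs, but the HTML version depends on the position and formatting of the table cells rather than on any statement of what the data means.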
The World Wide Web is increasingly used as a critical information source for both businesses and consumers. Recent advances in business-to-business (B2B) technologies have resulted in the definition of standard interfaces for data exchange between companies. Web services are standard mechanisms by which applications interoperate using Web protocols; that is, they are programs that access remote computers to obtain data or to invoke other services, such as weather reporting. These Web services use a remote procedure call (RPC) mechanism. RPC is an established concept, but Web services implement it with Web protocols and data formats such as HTTP, XML, SOAP, or UDDI. The design of application programming interfaces (APIs) to access Web data is relatively difficult because Web sites make data available in a presentation markup language such as HTML rather than in a computer-friendly format like XML, SOAP, or UDDI.
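As a rough illustration of a remote procedure call expressed in a Web data format, the sketch below encodes a call and its reply as XML. It uses XML-RPC, a simpler relative of SOAP that is available in the Python standard library, purely as a stand-in; the method name `quotes.getPrice` and its argument are hypothetical, and nothing is sent over a network.

```python
import xmlrpc.client

# Client side: marshal a call to a hypothetical remote method into XML.
request_body = xmlrpc.client.dumps(("ACME",), methodname="quotes.getPrice")

# Server side: recover the method name and arguments from the XML.
params, method = xmlrpc.client.loads(request_body)

# Server side: marshal the return value into an XML response.
response_body = xmlrpc.client.dumps((12.5,), methodresponse=True)

# Client side: recover the return value from the XML response.
result, _ = xmlrpc.client.loads(response_body)
```

In an actual Web service the request and response bodies would typically travel as the payloads of an HTTP exchange; only the marshalling step is shown here.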
As an example, a Web site may provide data in which a business is interested, such as weather or stock quotes. Human users can view this site using a browser, but the business wishes to automate the retrieval of information from that site programmatically. There is currently no means to automatically create an API to access the data on the Web site of interest.
Access to the Web site includes the request for information by the user and the response of the Web site to that request. Currently, for every Web site, and for every request to a specific Web site, the programmer must spend significant time and effort analyzing methods for sending the request and retrieving the desired information. This task requires a great deal of programming skill, and the business wishing to programmatically access the information on the desired Web site may not have an employee with sufficient skill.
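The request half of this exchange can be sketched as follows. The site address, the `symbol` parameter, and the `build_request` helper are hypothetical; the point is that the programmer must determine by inspection, for each site, which address, parameter names, and HTTP method the site expects. The sketch only constructs the request objects and does not send them.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(base_url, params, method="GET"):
    """Build an HTTP request the way a hand-written site wrapper might.

    Whether a site expects its parameters in the query string (GET) or in a
    form-encoded body (POST) is one of the details found by analysis.
    """
    query = urlencode(params)
    if method == "GET":
        return Request(f"{base_url}?{query}", method="GET")
    return Request(
        base_url,
        data=query.encode("ascii"),
        method=method,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical site and parameter names.
get_req = build_request("http://example.com/quote", {"symbol": "ACME"})
post_req = build_request("http://example.com/quote", {"symbol": "ACME"}, method="POST")
```

Retrieving the desired information from the response would require a further site-specific step, since the reply is typically an HTML page laid out for human viewing.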
Though the lack of application programming interfaces (APIs) for Web data is partially addressed by recent advances in Web data extraction technologies, the problem remains that no solution exists for creating a standard interface description for the Web at large. What is therefore needed is a system for Web site analysis and program development that aids programmers in the development of programs or applications to automatically request and retrieve data from Web sites on the Internet that use a variety of data forms and protocols. The need for such a system has heretofore remained unsatisfied.