This invention relates to capture and storage of information retrieved from a network.
The World Wide Web (WWW) is a collection of Hypertext Mark-Up Language (HTML) documents resident on computers that are distributed over networks such as the Internet. The WWW has become a vast repository for knowledge. Web pages provide information spanning the realm of human knowledge from information on foreign countries to information about the community in which one lives. The number of Web pages providing information over the Internet has increased exponentially since the World Wide Web's inception in 1990. Multiple Web pages are sometimes linked together to form a Web site, which is a collection of Web pages devoted to a particular topic or theme.
Accordingly, the collection of existing and future World Wide Web pages represents one of the largest databases in the world. However, access to the data residing on individual Web pages is hindered by the fact that World Wide Web pages are not a structured source of data. That is, there is no defined "structure" for organizing information provided by the Web page, as there is in traditional, relational databases. For example, different Web pages may provide the same geographic information about a particular country, but the information may appear in various locations of each page and may be organized differently from page to page. One particular example of this is that one Web site may provide relevant information on one Web page, i.e., in one HTML document, while another Web site may provide the same information distributed over multiple, interrelated Web pages.
These problems are not limited to retrieving data from HTML documents distributed over the Internet. Larger organizations have begun building "intranets", which are collections of linked HTML documents internal to the organization. While intranets are intended to provide a member of an organization with easy access to information about the organization, the problems discussed above with respect to the WWW apply equally to intranets. Requiring members of the organization to learn the data context of each Web page, or requiring them to learn a specialized query language for accessing Web pages, would defeat the purpose of the intranet and would be virtually impossible on the Internet.
The periodic retrieval of Web pages and extraction of useful information are hindered by several difficulties that have not been solved by prior art. In particular, a large percentage of Web pages are dynamically created. Those Web pages contain data that depends upon input parameters sent to the Web server. Thus, a single uniform resource locator (URL) may, with appropriate parameters, return many data sets. Further, the pages returned may vary in format. For example, some pages may have additional elements, while other pages have had elements deleted. In addition, valuable information may be contained in graphical elements, such as JPEG or BMP images. This information often does not exist in text form in the page data.
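To illustrate how a single URL can return many data sets depending on its input parameters, the following is a minimal sketch in Python. The function name `expand_urls`, the address `http://example.com/quote`, and the parameter names `symbol` and `period` are hypothetical and chosen only for illustration; they are not taken from the specification.

```python
from itertools import product
from urllib.parse import urlencode


def expand_urls(base_url, param_values):
    """Return one URL per combination of the supplied parameter values.

    param_values maps each (hypothetical) parameter name to a list of
    values that should be substituted into the query string.
    """
    names = sorted(param_values)  # fixed order so results are reproducible
    urls = []
    for combo in product(*(param_values[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{base_url}?{query}")
    return urls


# Hypothetical example: one address, four distinct data sets.
urls = expand_urls(
    "http://example.com/quote",
    {"symbol": ["IBM", "HP"], "period": ["1d", "1y"]},
)
```

Each resulting URL addresses the same network location but may return a different page, which is why a fixed address alone does not identify the data being captured.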
A method for capturing and storing data from a network includes specifying at least one target data accessible from a network location addressable by a network address. The method also includes capturing the target data from data received from the network location at specified dates and times.
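Capture at specified dates and times might be sketched as follows. This is an illustrative outline under stated assumptions, not the claimed implementation: the helper names `capture_times` and `run_due_captures`, and the `fetch`, `extract`, and `store` arguments, are hypothetical stand-ins for retrieving data from the network location, locating the target data, and storing it.

```python
from datetime import datetime, timedelta


def capture_times(start, interval, count):
    """Build a list of `count` capture times spaced `interval` apart."""
    return [start + i * interval for i in range(count)]


def run_due_captures(schedule, now, fetch, extract, store):
    """Capture target data for every scheduled time that is due.

    fetch   -- callable returning the data received from the network location
    extract -- callable pulling the target data out of that data
    store   -- list receiving (scheduled_time, target_data) records
    Returns the scheduled times that are still in the future.
    """
    remaining = []
    for when in schedule:
        if when <= now:
            store.append((when, extract(fetch())))
        else:
            remaining.append(when)
    return remaining


# Hypothetical usage with a stubbed page instead of a real network request.
schedule = capture_times(datetime(2000, 1, 1, 9, 0), timedelta(hours=1), 3)
records = []
remaining = run_due_captures(
    schedule,
    now=datetime(2000, 1, 1, 10, 30),
    fetch=lambda: "<b>42</b>",
    extract=lambda page: page[3:-4],  # strip the surrounding <b>...</b>
    store=records,
)
```

Passing `fetch` and `extract` as callables keeps the scheduling logic independent of how a particular page is retrieved or parsed.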
In some embodiments, the method further includes easy-to-use graphical user interfaces; integration with Web browsers; point-and-click selection of data targets; automatic input element parameter substitution to retrieve multiple pages from a single network address; periodic Web page retrieval from Internet servers at pre-specified intervals; target data matching; intelligent character recognition of graphical HTML or XML elements; graphical database, database table and table record creation; and automatic creation of formatted data files or direct storage to database.
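One simple way to approach target data matching when the same information appears in different positions on differently formatted pages is to anchor on a nearby text label rather than on a fixed position in the markup. The sketch below assumes this label-based approach, which is an illustration only and is not described in the specification; the function name `extract_after_label` and the sample pages are hypothetical.

```python
import re


def extract_after_label(html, label):
    """Return the first token following `label` in the page text, or None.

    Tags are removed first so that the match does not depend on how the
    page is laid out.
    """
    text = re.sub(r"<[^>]+>", " ", html)      # drop markup
    text = re.sub(r"\s+", " ", text).strip()  # normalise whitespace
    match = re.search(re.escape(label) + r"\s*:?\s*(\S+)", text)
    return match.group(1) if match else None


# Two hypothetical pages carrying the same datum in different layouts.
page_a = "<table><tr><td>Population:</td><td>5,000,000</td></tr></table>"
page_b = "<div><p>Area: 100</p><p><b>Population</b> 5,000,000</p></div>"

value_a = extract_after_label(page_a, "Population")
value_b = extract_after_label(page_b, "Population")
```

A regular expression over stripped text is deliberately crude; it would not handle information held only in images, for which the character recognition mentioned above would be needed.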
The present invention also includes a data capture and storage system. The system includes a graphical interface element configured to display at least one target page. The system also includes a selection device and a processor. The selection device operates to enable selection of target data on the target page for capture and storage. The processor is coupled to the graphical interface element, and is capable of being programmed with a plurality of configurations to locate, extract, and store the target data according to the plurality of configurations.