The present invention relates to acquisition of data and, more particularly, to web browsers for the Internet, as well as to database utilities for data accessible through the Internet. Specifically, one embodiment of the present invention provides a system to navigate to one or more data sources on the Internet preferably in an automated manner, extract data irrespective of the format of that data and display, store and/or process the extracted data.
The number of users professionally using the Internet (and particularly the “World Wide Web”) as a data source, and hence analogously to a database or collection of databases, on a daily basis is increasing. The Internet has helped create rich new sources of information accessible through a ubiquitous user interface, i.e., the web browser such as those provided by Microsoft (called Internet Explorer) and Netscape (called Navigator). However, today's web merely brings up individual web pages to individual users. Unfortunately, these web pages are typically depicted as HTML “pictures” of data, and usually not the data itself. Users can easily browse information, but it is difficult to edit, analyze or manipulate the underlying data. Gleaning relevant information from individual web pages is tedious. Most web operations are largely performed manually. This is true of the input side, for example, entering uniform resource locator (“URL”) specifications, login names, passwords and other access codes, profiles, queries and other inputs, as well as on the output side, for example, evaluating search results, data scraping from a web page, composing, editing and further processing data. Moreover, useful applications of information accessible through the Internet often require consolidation of data from multiple sources. Professional web users currently lack tools that are standard on modern databases, and, accordingly, a substantial amount of time is spent performing mundane manipulations with repetitive and less than systematic inputs.
One of the reasons why standard database tools cannot readily be used on the web is the fact that there is no standardized way to access data, largely because web pages are designed primarily for human, and not machine, readability. Further exacerbating the situation, data is typically not stable; i.e., even if the core information of a web page remains the same, presentation, and therefore the coding, of a page can change arbitrarily often, thus defeating any hard-coded access, search or retrieval and other techniques.
Accordingly, there is a need to overcome these problems, and an object of the present invention is to provide a data location and extraction tool capable of automated operation. A further object of the present invention is to provide a computerized tool capable of automatically navigating to a plurality of destination web sites, extracting select pieces of data therefrom, processing the extracted data and displaying the processed data in an organized format.
One embodiment of the present invention provides a system for collecting unstructured data from one or more web sites on the Internet and providing structured data, for example, to navigate to multiple web sites and extract data snippets. The system in accordance with one embodiment of the present invention enables the process of collecting such data to be automated so that one or more target data sources can be constantly monitored. In accordance with a preferred embodiment of the present invention, the data location and scraping tool of the present invention comprises a browser plug-in to facilitate data collection; for example, scripts are added to a browser such as Microsoft Internet Explorer. Thus, the browser effectively serves as the operating system, and the scripts embedded in the browser form an input layer that locates and extracts data and effectively serves as a BIOS for retrieval of unstructured data. The data can be simply displayed, or imported and stored in a database, for example, or can be further processed, for example, using a spreadsheet application, and even imported directly to one or more applications.
The system of the present invention performs the tasks of precisely locating and extracting the select data with a granularity specified by the user from any information source such as search engine results, web pages, other web-accessible documents, e-mail or text feeds in any format, for example, HTML, .txt, .pdf, Word, Excel, .ppt, .ftp text feeds, databases, XML and other standard, as well as non-standard, formats. The system scrapes or transforms the information into a format that is understood by database-centric machines. Transformation may involve the intermediate step of first converting non-HTML to HTML, or in some cases, for example, in the case of a .pdf document, a browser plug-in is preferably provided to convert directly to XML without that intermediate step. Preferably, the system in accordance with the present invention converts information to “XMLized” snippets of valuable data gleaned by meta-surfing through one or more web pages or other web-accessible documents. Thus, the system in accordance with the preferred embodiment of the present invention enables conversion of any web page or web-accessible document in any format in any location into a usable XML snippet of relevant data. The XML tagged data will in turn be database friendly and in a form that is easily integrated into existing business processes.
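By way of illustration only, the conversion of extracted values into an XML snippet might be sketched as follows. This is a minimal sketch and not the conversion used by the invention; the function name `to_xml_snippet`, the tag names, the sample values and the example URL are assumptions introduced here.

```python
import xml.etree.ElementTree as ET


def to_xml_snippet(record_tag, fields, source=None):
    """Wrap extracted field values in a small XML element and return it as text.

    record_tag, fields and source are hypothetical names used for this sketch.
    """
    record = ET.Element(record_tag)
    if source is not None:
        # Keep a note of where the data was gleaned from.
        record.set("source", source)
    for name, value in fields.items():
        child = ET.SubElement(record, name)
        child.text = str(value)
    # encoding="unicode" returns a str; special characters are escaped.
    return ET.tostring(record, encoding="unicode")


# Sample values are made up for illustration.
snippet = to_xml_snippet(
    "quote",
    {"symbol": "ACME", "last_price": "12.50", "issuer": "Acme & Sons"},
    source="http://example.com/quotes",
)
print(snippet)
```

Because the snippet is well-formed XML, a database loader or spreadsheet import can read the tagged fields without knowing anything about the layout of the page from which they came.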
The system of the present invention preferably comprises a navigation module that accesses one or more web pages or other web-accessible documents. The navigation module provides the capability for a user to specify and store a procedure such as a series of clicks and entries of information, for example, a user name and password, to access a web page or other web-accessible document, as well as the capability to perform the procedure to actually access the web page or other web-accessible document in an automated manner. The system in accordance with the present invention also preferably comprises an extraction module that scrapes information from the accessed web page or other web-accessible document. The web page or other web-accessible document can have any format, because the extraction module has the capability for the user to identify the data to be collected, whether the data appears in HTML or other format. If the data is in HTML format, the data can be analyzed, and a scraping procedure specified by the user based on the contents, structure and formatting of the HTML web page or other web-accessible document can extract data. The user can lock onto an item of relevant data on the web page or other web-accessible document for extraction by specifying relationships of contents, structure and/or formatting within the web page or other web-accessible document such that the data can be located even if the web page or other web-accessible document is modified to some extent in the future. If the format of the web page or other web-accessible document is other than HTML, for example, a text (.txt) document, e-mail, Microsoft Excel or other legacy document, the data can first be converted to HTML using a conventional translator. If a conventional translator is not available such as in the case of .pdf, for example, a translation module comprising a visual programming interface can be used to extract relevant data. 
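The idea of locking onto an item of data through its relationship to surrounding content, rather than through a fixed position in the page, might be sketched as follows. This is an illustrative sketch only, assuming the relevant value sits in the table cell that follows a cell containing a known label; the class name `LabelAnchoredExtractor`, the function `extract_by_label` and the sample pages are hypothetical.

```python
from html.parser import HTMLParser


class LabelAnchoredExtractor(HTMLParser):
    """Finds the text of the first <td> that follows a <td> holding a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.in_cell = False
        self.cell_text = []
        self.label_seen = False
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cell_text = []

    def handle_data(self, data):
        # Formatting tags inside a cell (<b>, <font>, ...) are ignored;
        # only the text they enclose is collected.
        if self.in_cell:
            self.cell_text.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or not self.in_cell:
            return
        self.in_cell = False
        text = "".join(self.cell_text).strip()
        if self.value is not None:
            return  # already found; keep the first match
        if self.label_seen:
            self.value = text
        elif text == self.label:
            self.label_seen = True


def extract_by_label(html, label):
    parser = LabelAnchoredExtractor(label)
    parser.feed(html)
    parser.close()
    return parser.value


# Two hypothetical versions of the same page: the layout changes, the label does not.
original_page = (
    "<table><tr><td>Symbol</td><td>ACME</td></tr>"
    "<tr><td>Last price</td><td>12.50</td></tr></table>"
)
redesigned_page = (
    "<div class='new'><table border='1'><tr><td><b>Last price</b></td>"
    "<td><font color='red'>12.50</font></td></tr></table></div>"
)
print(extract_by_label(original_page, "Last price"))
print(extract_by_label(redesigned_page, "Last price"))
```

A rule anchored on the label continues to find the value after rows are reordered or formatting tags are added, whereas a rule such as "second cell of the second row" would not.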
The extraction module also has the capability to scrape or harvest the data from the source that is identified by the location procedure so that data can be imported. Preferably, the data is converted to a format that provides structured data such as XML format which is standardized for use by various database and other applications so that the data can be stored or further processed as determined by the user. The system of the present invention preferably provides a visual programming interface for the user to specify the navigation procedure and the one or more items of data to extract from a web page or other web-accessible document accessed by the navigation procedure.
Accordingly, the present invention provides a method for automatically extracting data from at least one electronic document accessible over a computer network such as the Internet, the method including: recording a sequence of actions operable to electronically navigate to a target page of the electronic document, the target page including a plurality of elements each having a structural definition wherein the structural definitions interrelate the plurality of elements; identifying a target pattern for a select subset of the plurality of elements; automatically accessing the target page according to the recorded sequence; and automatically identifying and copying and/or processing select ones of the plurality of elements dependent upon the target pattern. The method and system in accordance with the various embodiments of the present invention enable extraction of data irrespective of the format of the electronic document. The data can be stored, made available for further processing or displayed such as by Web Bands so that a customized data display can be structured by the user.
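The steps of the method (recording a sequence of actions, identifying a target pattern, replaying the sequence and extracting the matching elements) might be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the invention: an in-memory `FakeSite` stands in for a browser and a live web site, a regular expression stands in for the target pattern, and all class, function and page names are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RecordedProcedure:
    """A stored navigation sequence plus the pattern identifying target elements."""
    actions: list = field(default_factory=list)
    target_pattern: str = ""

    def record(self, kind, **params):
        self.actions.append((kind, params))


class FakeSite:
    """In-memory stand-in for a web site; a real system would drive a browser."""

    def __init__(self, pages, credentials):
        self.pages = pages
        self.credentials = credentials
        self.logged_in = False
        self.current = None

    def perform(self, kind, params):
        if kind == "open":
            self.current = params["url"]
        elif kind == "login":
            self.logged_in = (params["user"], params["password"]) == self.credentials
        elif kind == "click":
            if not self.logged_in:
                raise PermissionError("login required before following this link")
            self.current = params["link"]
        else:
            raise ValueError(f"unknown action: {kind}")


def replay_and_extract(procedure, site):
    """Replay the recorded actions, then copy every element matching the pattern."""
    for kind, params in procedure.actions:
        site.perform(kind, params)
    return re.findall(procedure.target_pattern, site.pages[site.current])


# Hypothetical pages and credentials for the sketch.
PAGES = {
    "/home": "<a href='/quotes'>Quotes</a>",
    "/quotes": "<tr><td>ACME</td><td class='price'>12.50</td></tr>"
               "<tr><td>INIT</td><td class='price'>7.25</td></tr>",
}

procedure = RecordedProcedure(target_pattern=r"<td class='price'>([^<]+)</td>")
procedure.record("open", url="/home")
procedure.record("login", user="analyst", password="secret")
procedure.record("click", link="/quotes")

prices = replay_and_extract(procedure, FakeSite(PAGES, ("analyst", "secret")))
print(prices)
```

Keeping the recorded actions and the target pattern together in one stored procedure is what allows the same extraction to be repeated unattended, for example on a schedule to monitor a data source.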
In summary, the system of the present invention provides an engine for accessing data on one or more web pages or other web-accessible documents primarily intended for human readability preferably using a browser, for scraping web page or other web-accessible document data identified by a user as being relevant and for structuring the collected data so that relevant data is in a structured form that can be utilized by a microprocessor-based device. Using a convenient visual programming interface, the user can automate collection of data from the Internet and transform the data to a machine usable format such that the unstructured data available on the Internet can be stored and later processed, effectively converting document-centric information to database-centric information and thus to accessible intelligence. This enables applications to be run using the extracted data and avoids the presently required laborious manual or hard-coded inputting of information gleaned from the Internet into such applications. The result is that the user can access and manipulate not only database-centric forms of information available within an enterprise, but also document-centric forms of information available on the Internet.