The present invention relates to a method and system for automated browsing and extraction of data found at global communication network sites, such as Web sites that include HTML or XML data.
The Internet is becoming the de facto network for people and computers to connect to each other because of its truly global reach and its free nature. HTML (HyperText Markup Language) is the widely accepted standard for human interaction with the Internet and particularly the World Wide Web (the “Web”). HTML, in conjunction with a browser, allows people to communicate with other people or computers at the other end of the network.
The Internet can also be viewed as a global database. A large amount of valuable business information is present on the Internet as HTML pages. However, HTML pages are meant for human eyes, not for computers to read, which seriously limits how that information can be used in an automated manner.
HTML Web pages are built as HTML tags within other tags, in effect forming a “tree”. Certain automated browsers interpret the hierarchy and type of tags and render a visual picture of the HTML for a user to view. HTML data-capture technology currently available follows a paradigm of “design” and “run” modes. In design mode, a user (e.g., a designer), through software, locates Web sites and extracts data from those sites, by way of an “example”. The software program saves the example data and, in the “run” mode, automatically repeats the example for the new data. However, most Web pages can, and do, change as frequently and as much as their Webmaster desires, sometimes changing the tree hierarchy completely between design time and run time. As a result, reliable extraction of data, including business data, from an HTML page becomes a challenging task.
There are certain known methods for extracting this information. For example, OnDisplay Inc. of San Ramon, Calif. has a “CenterStage eContent” product that can access, integrate and transform data from multiple HTML pages. OnDisplay's HTML data recognition algorithm works by remembering the depth and location of the required business information within the HTML “tree” between the design and run modes.
As another example, Neptunet Inc., of San Jose, Calif., provides a system and method whereby, after the Web data is obtained, all further processing of that data has to be programmatically specified. Neptunet's HTML data recognition algorithm works by remembering the depth and location of the required business information within the HTML “tree” between the design and run modes.
Other HTML data capture mechanisms include methods whereby HTML data extraction is performed by specifying (i.e., hard coding) the exact HTML tag number of the data to be extracted using a programming language such as Visual Basic or Visual C++. The drawback of these types of methods is that, at the slightest change in the appearance of the Web page, the program has to be changed, making them impractical for reliable data processing.
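By way of illustration only, the fragility of a hard-coded position can be sketched in Python. The sketch below is not taken from any of the products named above. It assumes a hypothetical helper, flatten, that lists the visible text pieces of a page in document order, and it uses the position in that list as a stand-in for the "HTML tag number". The page contents and the index are invented for the example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text pieces of a page in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def flatten(html):
    """Hypothetical helper: return the page's text pieces as a flat list."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


# Invented page as seen when the extraction program was written.
design_page = "<table><tr><td>Last trade</td><td>85</td></tr></table>"
# The same page after the Webmaster adds a greeting above the table.
run_page = "<p>Welcome back</p>" + design_page

HARD_CODED_INDEX = 1  # position of the price, fixed in the program

design_value = flatten(design_page)[HARD_CODED_INDEX]  # the price
run_value = flatten(run_page)[HARD_CODED_INDEX]        # no longer the price
```

In this sketch a single added paragraph shifts every later position by one, so the hard-coded index silently returns a label instead of the price.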
HTML is a very useful information presentation protocol. It allows visually pleasing formatting and colors to be set for the data being presented, to make the data more understandable. For example, a stock price change can be shown in green if the stock is going up and in red if the stock is going down, making the change visually and intuitively more understandable.
But more and more, the Internet is also being used for machine-to-machine (i.e., computer-to-computer) communications. While HTML is a wonderful mechanism for the purpose of human interaction, it is not ideally suited for computer-to-computer communication. Its main disadvantage for this purpose is that there is no way to describe “what” the data being sent is supposed to represent. For example, a number “85” appearing on a Web stock trading screen in the browser may be the stock price or the share quantity. The data just gets shown in the browser, and it is the human being looking at the browser who knows what each number means because of casual context information shown around the data. But in machine-to-machine communication, the receiving computer lacks the context resolution intelligence and has to be told very specifically that the number “85” is the stock price and not the share quantity.
The need for correct and specific understanding of the data at the receiving computer's end has been conventionally satisfied via EDI (Electronic Data Interchange), where the sending and receiving computers have to be synched up to agree on the sequence, length and format of the data elements that can be sent as a complete message. This mechanism, while it works, is cumbersome because of the prior agreement required between the two computers and hence can be used effectively only in a network of relatively few computers in communication with one another. It does not work in an extremely large network like the Internet.
The void of clarity of data definition in a large network is being filled today by a new Internet protocol called XML (Extensible Markup Language). XML provides a perfect solution to specify explicitly and clearly what each number reaching the receiving computer is supposed to be. XML has a feature called “tags” which go with the data and describe what the data is supposed to be. For example, the stock price will be sent in an XML stream as:
<Stock Price> 85 </Stock Price>
The “/” in the second tag signifies that the data description for that data element is complete. Other tag pairs may follow, describing and giving values of other data elements. This allows computer-to-computer data exchange without needing a prior agreement between the computers about how the data is formatted or sequenced. Additionally, XML is capable of showing relationships between pieces of data using a “tree” or hierarchical structure.
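By way of illustration only, the following Python sketch shows how a receiving computer might read such tagged data using the standard xml.etree.ElementTree module. The message, the enclosing Quote element and the ShareQuantity element are invented for the example. The tag is written as StockPrice because XML element names cannot contain spaces, so the stock price tag shown above would need such a spelling in a well-formed stream.

```python
import xml.etree.ElementTree as ET

# Invented message: each value travels with a tag naming what it is.
message = (
    "<Quote>"
    "<StockPrice>85</StockPrice>"
    "<ShareQuantity>200</ShareQuantity>"
    "</Quote>"
)


def read_fields(xml_text):
    """Map each child tag of the root element to its text value."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


fields = read_fields(message)

# The receiver asks for the price by name, not by position or length,
# so no prior agreement on sequence or format is needed.
price = int(fields["StockPrice"])
quantity = int(fields["ShareQuantity"])
```

Because the values are looked up by tag name, the sender could reorder the two elements without affecting the receiver in this sketch.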
But XML has its own unique problems. While useful as data definition mechanisms, XML tree structures cannot be fed to existing data manipulation mechanisms operating on relational (tabular) data formats using well known languages like SQL.
It is believed that OnDisplay, Neptunet and WebMethods are companies allowing a fairly user-friendly design time specification of XML data interchange between computers, saving the specifications and reapplying them at a later point in time on new data. Several companies offer point-and-click programming environments with varying capabilities. Some are used to generate source code in other programming languages, while others execute the language directly. Examples are Visual Flowcoder by FlowLynx, Software-through-pictures by Aonix, ProGraph by Pictorius, LabView by National Instruments and Sanscript by Northwoods Software. All of these methods lack the critical built-in ability to capture and use Web based (HTML/XML) real-time data.
One aspect of the present invention provides a computer-implemented method for automated data extraction from a Web site. The method comprises: navigating to a Web site during a design phase; extracting data elements associated with the Web site and producing a visible display corresponding to the extracted data elements; selecting and storing at least one Page ID data element in the display from the data elements; selecting and storing one or more Extraction data elements in the display; selecting and storing at least one Base ID data element having an offset distance from the Extraction elements; setting a tolerance for possible deviation from the offset distance; and renavigating to the Web site during a playback phase and extracting data from the Extraction data elements if the Page ID data element is located in the Web site and if the offset distance of the Base ID data element has not changed by more than the tolerance.
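By way of illustration only, the playback check described in this aspect can be sketched in Python under stated assumptions. The summary does not fix how the deviation is measured. This sketch assumes that the page has already been reduced to a list of text rows, that the Base ID is a stable label found by its text, that the tolerance bounds how many rows the Base ID may have drifted from its design-time row, and that the Extraction element sits at a fixed number of rows from the Base ID. The Recipe class, the playback function and all sample rows are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Recipe:
    """Selections stored during the design phase (names are hypothetical)."""
    page_id_text: str    # text that identifies the expected page
    base_id_text: str    # stable label near the data to extract
    base_id_row: int     # row of the Base ID at design time
    extract_offset: int  # rows from the Base ID to the Extraction element
    tolerance: int       # maximum rows the Base ID may drift


def playback(rows: List[str], recipe: Recipe) -> Optional[str]:
    """Return the extracted text, or None if the checks fail."""
    # Check 1: the Page ID must be present on the page.
    if recipe.page_id_text not in rows:
        return None
    # Check 2: the Base ID must lie within the tolerance window.
    first = max(0, recipe.base_id_row - recipe.tolerance)
    last = min(len(rows), recipe.base_id_row + recipe.tolerance + 1)
    for row in range(first, last):
        if rows[row] == recipe.base_id_text:
            target = row + recipe.extract_offset
            return rows[target] if 0 <= target < len(rows) else None
    return None


# Design time rows: ["Acme Quotes", "Symbol", "ACME", "Last trade", "85"]
recipe = Recipe("Acme Quotes", "Last trade", base_id_row=3,
                extract_offset=1, tolerance=2)

# Playback: one row was inserted, so the Base ID drifted by one row.
shifted = ["Acme Quotes", "Sponsored link", "Symbol", "ACME", "Last trade", "87"]
# Playback: three rows were inserted, which exceeds the tolerance of two.
too_far = ["Acme Quotes", "a", "b", "c", "Symbol", "ACME", "Last trade", "90"]

value_shifted = playback(shifted, recipe)
value_too_far = playback(too_far, recipe)
```

In this sketch a small layout change still yields the new price, while a larger change causes extraction to be refused rather than returning data from the wrong place.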
Preferably, user-specific information is entered into the Web site and used in connection with producing the data to be extracted from the Extraction data elements. The data elements preferably are HTML elements. The visible display may comprise a grid containing rows and columns including information about each of the data elements extracted. Desirably, the information comprises, for each data element, fixed information of grid row number, HTML tag number and visible text, and user-selected information of Page ID, Base ID, Extract and tolerance. Also preferably, a position of the Page ID data element within the Web site is stored and the extracting occurs during the playback phase if the Page ID data element has not changed position. Further, the Page ID data element is desirably selected as a data element that is unlikely to change position upon reformatting of the Web site, and the display contains the data desired to be extracted.
Another aspect of the present invention provides a computer-implemented method for automated data extraction from a Web site, comprising: navigating to a Web site during a design phase; extracting data elements associated with the Web site and producing a visible current display grid corresponding to the extracted data elements; selecting and storing at least one Page ID data element in the current display from the data elements; selecting and storing one or more Extraction data elements in the current display; selecting and storing at least one Base ID data element in the current display having an offset distance from the Extraction elements; entering a tolerance in the current display for possible deviation from the offset distance; displaying a playback display grid during a playback phase with the selected Page ID data element, the Extraction data elements, and the Base ID data element; renavigating to the Web site; extracting data elements associated with the Web site to the visible current display grid; and comparing the extracted data elements in the current display grid with the playback display grid and extracting data from the Extraction data elements if the Page ID data element is found in the current display grid and if the offset distance of the Base ID data element has not changed by more than the tolerance. Preferably, the tolerance comprises a forward and backward tolerance.
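By way of illustration only, a forward and backward tolerance can be sketched in Python as an asymmetric search window around the row where the Base ID was stored. The function name find_base_id, its parameters and the choice to prefer the match nearest the stored row are assumptions made for the example and are not specified above.

```python
from typing import List, Optional


def find_base_id(rows: List[str], text: str, stored_row: int,
                 backward: int, forward: int) -> Optional[int]:
    """Return the row of the Base ID within the window, or None.

    The window runs from `backward` rows before the stored row to
    `forward` rows after it (both assumed to be non-negative).
    """
    first = max(0, stored_row - backward)
    last = min(len(rows), stored_row + forward + 1)
    matches = [row for row in range(first, last) if rows[row] == text]
    if not matches:
        return None
    # Assumption: prefer the match closest to the design-time position.
    return min(matches, key=lambda row: abs(row - stored_row))


rows = ["Price", "x", "y", "Price", "z"]

forward_hit = find_base_id(rows, "Price", stored_row=2, backward=1, forward=2)
backward_hit = find_base_id(rows, "Price", stored_row=2, backward=2, forward=0)
no_hit = find_base_id(rows, "Price", stored_row=2, backward=0, forward=0)
```

Separate forward and backward values let a designer allow for, say, rows being added above the data without also accepting a match that has moved up the page.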
A further aspect of the present invention provides a computer-implemented method for automated browsing of Web sites on a global communications network and for extracting usable data, comprising: accessing at least one Web site page containing data, wherein the data comprises a hierarchy of HTML tags; transforming the hierarchy of tags into a computer-readable list; identifying a base data element from the list; identifying an offset from the base data element to the usable data; and extracting the usable data for use by a user regardless of changes to the Web site, provided that the offset between the base data element and the usable data does not change. Desirably, the offset is identified during a design phase and saved for use in a run time phase, which extracts the usable data.
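By way of illustration only, the transformation of a tag hierarchy into a list, followed by extraction at an offset from a base data element, can be sketched in Python with the standard html.parser module. The Flattener class, the flatten and extract_by_offset functions, and both sample pages are hypothetical. The sketch assumes reasonably well-formed markup and records, for each piece of visible text, its nesting depth and enclosing tag.

```python
from html.parser import HTMLParser


class Flattener(HTMLParser):
    """Turn nested tags into a flat list of (depth, tag, text) rows."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags
        self.rows = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the tag, discarding any unclosed tags nested inside it.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            enclosing = self.stack[-1] if self.stack else ""
            self.rows.append((len(self.stack), enclosing, text))


def flatten(html):
    parser = Flattener()
    parser.feed(html)
    parser.close()
    return parser.rows


def extract_by_offset(rows, base_text, offset):
    """Return the text found `offset` rows after the base data element."""
    texts = [text for _, _, text in rows]
    if base_text not in texts:
        return None
    target = texts.index(base_text) + offset
    return texts[target] if 0 <= target < len(texts) else None


old_page = ("<html><body><table><tr><td>Last trade</td><td>85</td></tr>"
            "</table></body></html>")
# Redesigned page: extra wrapper and heading, so depth and position change.
new_page = ("<html><body><div><h1>Quotes</h1><table><tr><td>Last trade</td>"
            "<td>91</td></tr></table></div></body></html>")

old_value = extract_by_offset(flatten(old_page), "Last trade", 1)
new_value = extract_by_offset(flatten(new_page), "Last trade", 1)
```

In this sketch the depth of the price cell changes between the two pages, yet the offset from the base data element stays at one row, so the same stored offset recovers the value from both.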
Another aspect of the present invention provides a computer-implemented method for automated browsing of Web sites and for extracting usable data, comprising: filling a current display grid with rows of HTML data elements from at least one Web site page currently selected by a Web browser; displaying in a playback display grid previously-stored HTML data elements; examining the rows of the playback grid to locate an HTML data element previously selected as a Page ID data element; comparing the rows of the current grid to locate an HTML element that matches the Page ID data element; examining the rows of the playback grid to locate HTML data elements previously selected as Extraction data elements and a Base ID data element used as a reference for locating the Extraction data elements; comparing the rows of the current grid to locate HTML elements that match the Extraction data elements and match the Base ID data element; and extracting data from the Extraction data elements regardless of changes to the Web site, provided that the Page ID elements match and any offset between the Base ID elements is within a predetermined tolerance.
A still further aspect of the present invention provides a computer-based system for automatically browsing Web sites, comprising a client computer and a server computer for receiving requests from the client computer over a network connecting the client and server computers, the client computer running an application to: navigate to a Web site during a design phase; extract data elements associated with the Web site and produce a visible display corresponding to the extracted data elements; select and store at least one Page ID data element in the display from the data elements; select and store one or more Extraction data elements in the display; select and store at least one Base ID data element having an offset distance from the Extraction elements; set a tolerance for possible deviation from the offset distance; and renavigate to the Web site during a playback phase and extract data from the Extraction data elements if the Page ID data element is located in the Web site and if the offset distance of the Base ID data element has not changed by more than the tolerance.
Another aspect of the instant invention provides a computer-implemented method for automated XML data extraction, comprising: identifying selections of XML data elements for extraction from a source of XML data comprising XML data stored in XML format; storing information related to the identified selections of XML data elements for subsequent use; acquiring the source of XML data and retrieving the XML data elements; comparing the retrieved XML data elements to the identified selections and extracting only the data from the XML data elements that correspond to the identified selections; and reformatting the extracted XML data into a relational format. The source of XML data can be a Web site or a file. The extracted data may be saved into a relational data table, and the reformatted extracted XML data is passed to a calling application.
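By way of illustration only, the steps of this aspect can be sketched in Python using the standard xml.etree.ElementTree and sqlite3 modules. The XML source, the element names, the table name and the extract_rows function are invented for the example, and an in-memory SQLite table merely stands in for whatever relational data table an implementation would use.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented XML source with one record per Holding element.
XML_SOURCE = """
<Portfolio>
  <Holding>
    <Symbol>ACME</Symbol><StockPrice>85</StockPrice><ShareQuantity>200</ShareQuantity>
  </Holding>
  <Holding>
    <Symbol>INIT</Symbol><StockPrice>12</StockPrice><ShareQuantity>50</ShareQuantity>
  </Holding>
</Portfolio>
"""

# Stored selections: only these elements are to be extracted.
SELECTED = ["Symbol", "StockPrice"]


def extract_rows(xml_text, record_tag, selected):
    """Return one tuple per record, holding only the selected elements."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(record_tag):
        rows.append(tuple((record.findtext(name) or "").strip()
                          for name in selected))
    return rows


rows = extract_rows(XML_SOURCE, "Holding", SELECTED)

# Reformat into a relational table so that SQL can be applied.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE holdings (symbol TEXT, stock_price INTEGER)")
conn.executemany("INSERT INTO holdings VALUES (?, ?)", rows)
expensive = conn.execute(
    "SELECT symbol FROM holdings WHERE stock_price > 50"
).fetchall()
conn.close()
```

In this sketch the unselected ShareQuantity element never reaches the table, and the extracted rows can be queried with ordinary SQL once they are in tabular form.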
A further aspect of the instant invention provides a computer-implemented method for automated XML data extraction, comprising: navigating to a Web site containing XML data; identifying selections of XML data elements for extraction from the Web site, the XML data comprising data elements containing the data stored in XML format; storing information related to the identified selections of XML data elements for subsequent use; re-navigating to the Web site and retrieving the XML data elements; comparing the retrieved XML data elements to the identified selections and extracting only the data from the XML data elements that correspond to the identified selections; and reformatting the extracted XML data into a relational format. The extracted data is desirably saved into a relational data table.
A yet further aspect of the present invention provides a computer-implemented method for automated XML data extraction, comprising: navigating a client computer to a Web site containing XML data; generating a graphical tree structure on the client computer to display XML nodes and subnodes representing the XML data at the Web site; selecting one or more of the nodes and/or subnodes from the tree structure associated with the data to be extracted; storing information related to the selected nodes and/or subnodes; renavigating the client computer to the Web site and retrieving the XML data using the information; comparing the retrieved XML data with the selected nodes and/or subnodes and extracting only the data corresponding to the selected nodes and/or subnodes; and reformatting the extracted XML data into a relational format. Desirably, selecting one subnode under a parent node automatically selects all subnodes under the parent node.
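By way of illustration only, the tree display and the sibling selection rule can be sketched in Python. The outline function prints the node hierarchy as indented text in place of a graphical tree, and select_with_siblings is a hypothetical helper that, given one chosen subnode, returns every subnode under the same parent. The XML document is invented for the example.

```python
import xml.etree.ElementTree as ET

XML_SOURCE = (
    "<Portfolio><Holding>"
    "<Symbol>ACME</Symbol><StockPrice>85</StockPrice>"
    "</Holding></Portfolio>"
)


def outline(node, depth=0):
    """Return the nodes and subnodes as indented lines, one per node."""
    lines = ["  " * depth + node.tag]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


def select_with_siblings(root, chosen_tag):
    """Return the tags of all subnodes under the parent of chosen_tag.

    Only the first parent in document order that contains the chosen
    subnode is considered; an empty list means it was not found.
    """
    for parent in root.iter():
        children = list(parent)
        if any(child.tag == chosen_tag for child in children):
            return [child.tag for child in children]
    return []


root = ET.fromstring(XML_SOURCE)
tree_lines = outline(root)
selection = select_with_siblings(root, "StockPrice")
```

Selecting all subnodes under one parent keeps each extracted record complete, which suits the later step of reformatting the data into rows and columns.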
Another aspect provides a computer readable medium storing a set of instructions for controlling a computer to automatically extract desired XML data from a source of XML data, the medium comprising a set of instructions for causing the computer to: identify selections of XML data elements for extraction from a source of XML data comprising XML data stored in XML format; store information related to the identified selections of XML data elements for subsequent use; acquire the source of XML data and retrieve the XML data elements; compare the retrieved XML data elements to the identified selections and extract only the data from the XML data elements that correspond to the identified selections; and reformat the extracted XML data into a relational format.
A still further aspect provides a computer-based system for automated XML data extraction, comprising a client computer and server computer for receiving requests from the client computer over a network connecting the client and server computers, the client computer running an application to: identify selections of XML data elements for extraction from a source of XML data contained at the server computer and comprising XML data stored in XML format; store information related to the identified selections of XML data elements for subsequent use; acquire the source of XML data and retrieve the XML data elements; compare the retrieved XML data elements to the identified selections and extract only the data from the XML data elements that correspond to the identified selections; and reformat the extracted XML data into a relational format.