1. Field of the Invention
The invention generally relates to data query and collection, and particularly relates to a system and method for adaptively locating dynamic web page elements.
2. Description of Related Art
Along with the rapid growth of the World Wide Web (WWW), web content is becoming increasingly rich. In the era of Web 2.0, it is estimated that there are about 15 to 30 billion web pages on the Web. It is therefore becoming burdensome for users to manually access web pages one by one and locate content of interest. For this reason, many web sites provide web services for machine access, such as REST, SOAP, WSDL, FEED, and other web services. However, in contrast to the fast growth of web pages and their contents, the improvement of these web services has been much slower. Most information on web pages is still accessible only to people visiting the web pages.
Although web pages may be well designed for access by users, such design focuses only on the presentation structure or typesetting for end users. It is difficult to simultaneously accommodate the demands of web services for machine access. Further, web pages distributed on the Web are usually highly dynamic, volatile, distributed, and heterogeneous. Moreover, compared with traditional plain text documents, web page contents are often much more diverse.
To this end, in order to leverage the huge informational and functional resources of the Web, many existing tools allow users to cut user interfaces from the existing Web, extract data, functions, and processes, and transform them into reusable subscription files (FEEDs) and services.
The extraction of data from web pages is commonly implemented through XPath. XPath, the XML Path Language, is a language for finding information in an Extensible Markup Language (XML) document and for determining the location of parts of that document. Developers can use XPath as a lightweight query language for navigating the elements and attributes of an XML document. There are seven kinds of nodes in XPath: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes. XML documents are treated as trees of nodes. The root of a tree is called the document node or root node. XPath uses path expressions to select nodes or node-sets in an XML document.
These path expressions look very much like the paths used in a traditional computer file system. A path can be absolute or relative, and a path expression may contain predicates, wildcards, and operators. XPath also includes over one hundred built-in standard functions for string values, numeric values, date and time comparison, node manipulation, sequence manipulation, Boolean values, and more. Some exemplary XPath path expressions are shown below:
    /html/body/div/div/form/table/tr[1]/td/input[@name=keyword];
    /html/body/div/ . . . p/div/a[@content=next]; and
    . . . /input[@id=12345].
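As an illustration only, the following minimal sketch evaluates path expressions of this kind with Python's standard xml.etree.ElementTree module. The sample page held in PAGE and the variable names are hypothetical and not taken from any real site. ElementTree implements only a subset of XPath and needs well-formed markup, so the paths are written relative to the root element, and attribute values are quoted as string literals.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for illustration.
PAGE = """
<html>
  <body>
    <div>
      <form>
        <table>
          <tr><td><input name="keyword" id="12345"/></td></tr>
          <tr><td><input name="submit" id="67890"/></td></tr>
        </table>
      </form>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# Fully spelled-out path with a positional predicate and an attribute predicate.
by_full_path = root.find("./body/div/form/table/tr[1]/td/input[@name='keyword']")

# Descendant search: any <input> in the tree with the given id.
by_id = root.find(".//input[@id='12345']")

# Wildcard step: '*' matches any element name at that level.
all_inputs = root.findall("./body/*/form/table/tr/td/input")

print(by_full_path.get("id"), by_id.get("name"), len(all_inputs))
```

In this sketch the first two expressions select the same input element by different routes, which is the property a recorded path expression relies on.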
Many new technologies and applications in the commercial, academic, and industrial fields have been developed and implemented for extracting data, functions, and processes from the Web based on XPath. For example, a web page is first parsed into an HTML (Hypertext Markup Language) DOM (Document Object Model) tree. The DOM here is the standard document object model defined by the W3C (World Wide Web Consortium). It represents HTML and XML documents as tree structures, and defines methods and attributes for traversing the tree and for checking and modifying its nodes.
Under the DOM tree structure, various nodes of an HTML document are regarded as various types of node objects. Each node object has its own attributes and methods, which may be utilized for traversing the whole document tree. After a DOM document tree is generated, the required elements may be queried with attribute and tag names. Then, the elements, i.e., the required data, can be located through XPath. Once the data required by a user has been extracted from the web page, its XPath-based path expression can be recorded, and the data can be located and accessed once again through the recorded XPath path expression when needed in the future.
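The record-and-replay flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular tool: the sample page PAGE and the helper record_path are hypothetical, and Python's xml.etree.ElementTree stands in for a full DOM and XPath implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for illustration.
PAGE = """
<html>
  <body>
    <div>
      <form>
        <table>
          <tr><td><input name="keyword"/></td></tr>
          <tr><td><input name="submit"/></td></tr>
        </table>
      </form>
    </div>
  </body>
</html>
""".strip()


def record_path(root, target):
    """Build a positional path from root to target (hypothetical helper)."""
    # ElementTree nodes have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        # Positions in path predicates are 1-based.
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "./" + "/".join(reversed(steps)) if steps else "."


root = ET.fromstring(PAGE)

# Step 1: query the tree by tag name and attribute value.
target = next(e for e in root.iter("input") if e.get("name") == "keyword")

# Step 2: record a path expression for the element that was found.
path = record_path(root, target)

# Step 3: later, locate the same element again through the recorded path.
relocated = root.find(path)
print(path, relocated is target)
```

The recorded path encodes only tag names and sibling positions, so it remains valid only as long as the page structure around the element stays the same.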
However, web pages are highly dynamic: most are generated dynamically, so their contents change often. Further, many web sites update their web pages periodically, for example by adding, modifying, or deleting contents, formats, or layouts of the existing web pages. These updates or modifications often affect the XPath path expressions of the data in the web pages, such that when a user tries to access the required data through a previously recorded XPath path expression, the data may not be found, or the wrong data may be located. Therefore, the above method for accessing and extracting data based on XPath is not adaptive.
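Both failure modes can be reproduced in a small sketch. The three page versions, the recorded path, and the helper locate below are hypothetical examples constructed for illustration; Python's xml.etree.ElementTree is used as a stand-in XPath evaluator.

```python
import xml.etree.ElementTree as ET

# Page as it looked when the path expression was recorded.
OLD_PAGE = """<html><body>
  <div><input name="keyword"/></div>
</body></html>"""

# Update 1: a new block is inserted ahead of the search box.
BANNER_ADDED = """<html><body>
  <div><input name="newsletter"/></div>
  <div><input name="keyword"/></div>
</body></html>"""

# Update 2: the search box is wrapped in a new <form> element.
WRAPPER_ADDED = """<html><body>
  <div><form><input name="keyword"/></form></div>
</body></html>"""

# Path expression recorded against OLD_PAGE.
RECORDED_PATH = "./body/div[1]/input[1]"


def locate(page, path=RECORDED_PATH):
    """Return the name attribute of the element at path, or None if absent."""
    element = ET.fromstring(page).find(path)
    return None if element is None else element.get("name")


original = locate(OLD_PAGE)            # the element the user selected
after_banner = locate(BANNER_ADDED)    # wrong data: another element sits there now
after_wrapper = locate(WRAPPER_ADDED)  # not found: the path no longer matches
print(original, after_banner, after_wrapper)
```

Neither update changes the search box itself; only its position in the tree changes, which is enough to invalidate the recorded expression.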
Thus, in order to extract required data and functions from web pages that vary dynamically, one of the biggest challenges is to locate unstructured or semi-structured data accurately. Therefore, there is a need for a technology for XPath-based location of required web contents in dynamic web pages, in spite of the variation in those contents.