1. Field of the Invention
This invention generally relates to world wide web navigation and content extraction and, more specifically, to methods and systems for automating such processes.
2. Description of the Related Art
The amount of information on the world wide web has increased dramatically in the past several years. Many individuals and organizations use this resource for gathering information. Unfortunately, harvesting information from the world wide web is typically a time-consuming process. In particular, collecting information from the world wide web often involves manually navigating through sites and extracting information through manual data reentry and/or cut and paste features. In some cases, custom applications can be written to automate the collection process. However, the development of such applications is time consuming. In particular, the development of custom applications typically involves a great deal of analysis to outline the navigational routes through a website and the steps needed to query and extract content from the website. In addition to being time consuming to prepare, custom applications are generally complex to write and, therefore, typically require one or more skillful programmers. In cases in which the content is not formatted in the industry standard layout, the application will further need to outline steps to convert the content to the standard layout so that the material may be interpreted correctly. Such a conversion will further complicate the application code, consuming more of the programmer's time.
In addition to requiring a complex set of program instructions and taking a large amount of time to prepare, custom applications are highly prone to failure when changes are made to the websites that the code accesses. More specifically, the navigational routes included in the custom applications may be rendered useless when information on an accessed website has changed. Consequently, custom applications are generally restricted to collecting unscripted content from websites. “Unscripted content,” as used herein, may generally refer to website content that can be obtained without executing client-side scripts. In other words, unscripted content may refer to information displayed on a website that is governed by the website's server.
In contrast, “scripted content” may refer to website content which includes one or more executable scripts through which the content is accessed. In this manner, scripted content may refer to information on a website which is susceptible to change without interaction with the website's server. Such scripted content is sometimes referred to as dynamic hyper text markup language (DHTML). However, other markup languages known in the website development industry may be referred to as scripted content as well. Examples of information which may be desirable to display as scripted content may include, for instance, stock quotes from brokerage websites, prices of specific items from online commercial vendors and online auction sites, regional weather information, airline ticket information, shipment tracking information, news headlines on news organizations' websites, and bank account balances. Other information may be displayed as scripted content additionally or alternatively, depending on the design specifications of a website.
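The distinction between unscripted and scripted content may be illustrated with a minimal sketch. The page fragment, the element identifiers, and the helper names below are hypothetical and are assumed only for illustration. The sketch reads the markup as delivered by a server, without executing any script, and shows that the unscripted headline is available while the scripted price is not.

```python
from html.parser import HTMLParser

# Hypothetical page as delivered by a server. The headline is unscripted;
# the price is filled in only when a client executes the script.
PAGE = """
<html><body>
  <p id="headline">Markets open higher</p>
  <span id="price"></span>
  <script>document.getElementById("price").textContent = "10.50";</script>
</body></html>
"""


class TextById(HTMLParser):
    """Collects the text found directly inside elements that carry an id."""

    def __init__(self):
        super().__init__()
        self.text = {}
        self._open = []  # ids (or None) of the currently open elements

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        self._open.append(element_id)
        if element_id is not None:
            self.text[element_id] = ""

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        # Script source is reported here too, but the script element has
        # no id in this sketch, so its text is ignored.
        if self._open and self._open[-1] is not None:
            self.text[self._open[-1]] += data


def static_text(markup):
    """Return the id-to-text mapping visible without running any script."""
    parser = TextById()
    parser.feed(markup)
    return parser.text


print(static_text(PAGE))
```

In this sketch the entry for the price element is an empty string, since its value exists only after a client executes the script. A collector that does not execute scripts would therefore miss such content.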
As such, it would be advantageous to develop systems and methods for automating world wide web navigation and content extraction. For example, it would be beneficial to develop systems and methods which extract content, particularly scripted content, from websites without user intervention. Such systems and methods may also be configured to navigate websites without user intervention. In addition to automating website navigation and content extraction, it would be advantageous to develop a system which can standardize web content and/or allow for the incorporation of the XPath query language within a custom application.
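By way of illustration only, the following sketch suggests how XPath-style queries might be applied once web content has been standardized into well-formed markup. It is not the system described herein. The page fragment, the table identifier, the class names, and the function name are hypothetical, and the sketch relies on the limited XPath subset supported by the Python standard library rather than the full XPath language.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, assumed to be already standardized into
# well-formed markup so that an XML parser can read it.
PAGE = """
<html>
  <body>
    <table id="quotes">
      <tr><td class="symbol">AAA</td><td class="price">10.50</td></tr>
      <tr><td class="symbol">BBB</td><td class="price">7.25</td></tr>
    </table>
  </body>
</html>
"""


def extract_quotes(markup):
    """Map each symbol to its price using XPath-style path expressions."""
    root = ET.fromstring(markup)
    quotes = {}
    # Select every row of the table whose id attribute is "quotes".
    for row in root.findall(".//table[@id='quotes']/tr"):
        symbol = row.find("td[@class='symbol']").text
        price = float(row.find("td[@class='price']").text)
        quotes[symbol] = price
    return quotes


print(extract_quotes(PAGE))
```

Because the queries name the desired elements by attribute rather than by position in the page, such an expression may be less sensitive to layout changes than hand-written extraction steps, although it still depends on the attribute names remaining stable.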