1. Field of the Invention
This invention relates to the field of content retrieval. In particular, the invention relates to a computer system and methodology for extracting and aggregating data from dynamic content.
2. Description of the Background Art
Computers are very powerful tools for storing and providing access to vast amounts of information. The first personal computers were largely stand-alone units with no direct connection to other computers or computer networks. Data exchanges between computers were mainly accomplished by exchanging magnetic or optical media such as floppy disks. Over time, more and more computers were connected to each other and exchanged information using Local Area Networks (“LANs”) and/or Wide Area Networks (“WANs”). Initially such connections were primarily amongst computers within the same organization via an internal network. More recently, the explosive growth of the Internet has provided access to tremendous quantities of information from a wide variety of sources. The Internet comprises a vast number of computers and computer networks that are interconnected through communication links. The World Wide Web (WWW) portion of the Internet allows a server computer system to send graphical Web pages of information to a remote client computer system. The remote client computer system can then display the Web pages in a Web browser application (e.g., Netscape Navigator or Microsoft Internet Explorer). To view a specific Web page, a client computer system specifies the Uniform Resource Locator (“URL”) for that Web page in a request (e.g., a HyperText Transfer Protocol (“HTTP”) request). The request is forwarded to the Web server that supports that Web page. When that Web server receives the request, it sends the specified Web page to the client computer system. When the client computer system receives that Web page, it typically displays the Web page using a browser application.
Currently, Web pages are typically defined using HyperText Markup Language (“HTML”). HTML provides a standard set of tags that define how a Web page is to be displayed. When a user indicates to the browser to display a Web page, the browser sends a request to the server computer system to transfer to the client computer system an HTML document that defines the Web page. When the requested HTML document is received by the client computer system, the browser displays the Web page as defined by the HTML document. The HTML document contains various tags that control the displaying of text, graphics, controls, and other features. The HTML document may also contain URLs of other Web pages available on that server computer system or other server computer systems. Web pages may also be defined using other markup languages, including cHTML, XML, and XHTML.
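By way of illustration only, the following minimal sketch shows how the URLs contained in an HTML document might be collected by a program. It uses the Python standard library; the sample page, the host names, and the names `LinkCollector` and `extract_links` are hypothetical and are not taken from any particular system.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the target URL of every anchor tag in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page being parsed
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Anchor tags carry the URL of another page in their href attribute.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative references against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical page content, used only for this illustration.
SAMPLE_PAGE = """<html><body>
<h1>Quotes</h1>
<p>See <a href="/weather.html">weather</a> and
<a href="http://other.example.com/news.html">news</a>.</p>
</body></html>"""


def extract_links(html_text, base_url):
    """Return the absolute URLs referenced by the given HTML text."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.links


if __name__ == "__main__":
    for url in extract_links(SAMPLE_PAGE, "http://www.example.com/index.html"):
        print(url)
```

In this sketch, the first link refers to a page on the same server and the second to a page on a different server, corresponding to the two cases described above.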
Every day, more and more information is made available via the Internet. The challenge posed to users is how to efficiently locate, access, and use information and applications that are relevant to them from amongst the huge quantities of materials that are available in a variety of different formats. The WWW is made up of millions of “Web sites” with each site having a number of HTML pages. Each HTML page usually contains a number of Web objects such as graphics, text, and “Hyper-Text” references (URLs) to other HTML pages. For example, a user may wish to collect information from three different sources. Each of these sources may potentially maintain information in a different format. For instance, one source may be a database, a second may be a spreadsheet, and a third may be a Web page. There is also a need to identify and retrieve dynamically updated content from these diverse network sources.
One mechanism for providing access to personalized information is a “portal”. Corporate portals or enterprise information portals (EIPs) have many uses and advantages, but the most common overarching task of any portal is to provide users with efficient access to personalized information and applications. For an internal network, this can mean employee information such as retirement account balances and vacation requests. It may also include sales force automation and enterprise resource planning (ERP) applications. Externally, portals can collect and make information available to third parties such as business partners and customers. Portals can be used to simplify customer support, drive e-business, and/or accelerate content distribution.
A basic portal assembles static information in a single browser page for employee or customer use. Typically, static content is retrieved and placed into a giant repository which maintains all of the information used in the portal. The information is then reformatted and published to users of the portal. However, managing a huge repository of content presents a number of difficulties to the organization operating the portal. One problem is that some types of information must be constantly updated. For example, stock prices or weather reports change regularly and frequently.
Another problem with information content retrieved via the Internet is that much of the information is not in a structured format. The data is not formatted in tables and rows like information stored in a database system. Instead, most of the information is unstructured or semi-structured data.
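As a purely illustrative sketch of this distinction, the following Python fragment takes a piece of semi-structured HTML and recovers a structured record from it. The sample fragment, the ticker symbol, the pattern, and the function name `to_record` are all hypothetical; a pattern of this kind is tied to one particular page layout and would have to change whenever that layout changes.

```python
import re

# Hypothetical fragment: the symbol and price are embedded in display markup
# rather than stored in named columns as they would be in a database table.
SAMPLE_FRAGMENT = (
    '<p>Shares of <b>ACME</b> closed at '
    '<span class="price">$12.50</span> today.</p>'
)

# Pattern assumed for this one layout only.
PATTERN = re.compile(
    r'<b>(?P<symbol>[A-Z]+)</b>.*?'
    r'<span class="price">\$(?P<price>\d+\.\d+)</span>'
)


def to_record(fragment):
    """Return a dict with 'symbol' and 'price', or None if nothing matches."""
    match = PATTERN.search(fragment)
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        "price": float(match.group("price")),
    }


if __name__ == "__main__":
    print(to_record(SAMPLE_FRAGMENT))
```

The resulting record has named fields comparable to a row in a database table, whereas the original fragment expresses the same values only through their position within display markup.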
What is needed is a solution that enables users to capture dynamic content from a variety of sources such as Web pages, databases, and XML documents. The solution should provide an easy-to-use and flexible means to extract and aggregate data from content captured from various sources. Ideally, the solution should enable useful data to be extracted from dynamic content available from a data source and aggregated with other information. The present invention provides a solution for these and other needs.