1. Field of the Invention
This invention relates to the field of hypertext systems. Specifically, this invention is a new and useful method, apparatus and computer program product for archiving hyperlinks contained in hypertext documents.
2. Background
The World Wide Web (WWW) is a massive hypertext system accessed by a computer user using an information access apparatus such as a WWW browser computer application. The WWW browser application communicates with an information provider executing on a computer apparatus to obtain information and services in the form of a hypertext document. The hypertext document can represent a variety of information, including, but not limited to, news, mail, documentation, menus of options, database queries and results, simple documents with graphics, and hypertext views of bodies of information. The background of the WWW is described by reference to the first chapter of Instant HTML Web Pages, by Wayne Ause, Ziff-Davis Press, ISBN 1-56276-363-6, copyright 1995, pages 1-15, hereby incorporated by reference as illustrative of the prior art.
The hypertext document is identified in the WWW context by a uniform resource locator (URL). The URL specification, also incorporated by reference, is described in RFC1738 and can be found on the WWW at:
http://andrew2.andrew.cmu.edu/rfc/rfc1738.html (RFC1738)
http://andrew2.andrew.cmu.edu/rfc/rfc1808.html (RFC1808)
http://andrew2.andrew.cmu.edu/rfc/rfc1866.html (RFC1866)
Briefly, the URL contains a protocol specification and a path specification. The protocol specification notifies the browser of what protocol to use when accessing a remote server containing the hypertext document. The path specification is generally a hierarchical path that specifies a data server followed by a hypernode (such as a hypertext web page document) that actually provides the information for the browser.
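As an illustrative sketch only, the protocol specification and the path specification of a URL can be separated using the urllib.parse module of the Python standard library. The helper name split_url is hypothetical and is not part of RFC1738; the URL used is the RFC1738 location cited above.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (protocol, data server, path) for an absolute URL.

    Hypothetical helper for illustration; not defined by RFC1738.
    """
    parts = urlsplit(url)
    # scheme -> protocol specification
    # netloc -> data server named at the start of the path specification
    # path   -> hypernode (for example, a web page) on that server
    return parts.scheme, parts.netloc, parts.path


protocol, server, path = split_url("http://andrew2.andrew.cmu.edu/rfc/rfc1738.html")
print(protocol)  # http
print(server)    # andrew2.andrew.cmu.edu
print(path)      # /rfc/rfc1738.html
```

Here the protocol tells the browser how to contact the server, and the remaining components identify the server and the document it provides.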
The currently presented hypertext document is termed the base document. The base document often includes one or more hyperlinks to related information outside the base document. A hyperlink is a labeled relationship to a resource. A hyperlink generally contains a user-meaningful label and an identifier of the referenced resource. Activating the hyperlink often results in accessing a completely different hypertext document supplied from completely different WWW server applications on other computer systems.
In HTML (Hypertext Markup Language), a commonly used markup language that describes the hypertext document, a hyperlink can be defined by an anchor (specified by an <A> element). The anchor contains a number of attributes, one of which can be an HREF attribute. The HREF attribute identifies the portion of the hyperlink that specifies the URL. The URL specified by the HREF attribute may be an absolute URL or a relative URL. The absolute URL is the URL in its complete form; it includes the scheme, the network location and the URL-path. The relative URL is a compact representation of the location of a resource relative to an absolute URL. The relative URL is parsed from an absolute URL using the protocol specified in RFC1808. RFC1808 can be found on the WWW at http://andrew2.andrew.cmu.edu/rfc/rfc1808.html
The absolute URL may also be derived from the relative URL using the protocol described in section 4 of RFC1808.
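The derivation of an absolute URL from a relative URL can be sketched with the urljoin function of the Python standard library, which resolves a relative reference against a base URL. The base URL http://www.example.com/docs/index.html, the relative references and the helper name to_absolute are assumptions made for illustration. Python's resolver follows later URL specifications, so its handling of unusual cases may differ from RFC1808.

```python
from urllib.parse import urljoin

# Hypothetical base document location, assumed for illustration only.
BASE_URL = "http://www.example.com/docs/index.html"


def to_absolute(relative_url, base_url=BASE_URL):
    """Resolve a relative URL against the URL of the base document."""
    return urljoin(base_url, relative_url)


# A path relative to the directory that holds the base document.
print(to_absolute("developers/chat.html"))
# http://www.example.com/docs/developers/chat.html

# A reference that climbs one directory level before descending.
print(to_absolute("../rfc/rfc1808.html"))
# http://www.example.com/rfc/rfc1808.html

# A fragment-only reference stays within the base document.
print(to_absolute("#section"))
# http://www.example.com/docs/index.html#section
```

An absolute URL passed to the same function is returned unchanged, because it already carries its own scheme and network location.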
FIG. 1a illustrates a sample of HTML data as indicated by general reference character 100. The HTML data sample 100 includes a head section 101 that contains HTML header information. The HTML data sample 100 also contains a relative hyperlink anchor 103 that contains an HREF (Hypertext Reference) attribute that specifies a hyperlink to a file named "chat.html" within a directory named "developers" that is a subdirectory of a base directory known to the browser. The HTML data sample 100 also contains an absolute hyperlink anchor 105 that provides the absolute URL. Also, the HTML data sample 100 includes a base document fragment anchor 107 that provides a hyperlink to a named section in the base document. The named section in the base document is defined by a fragment defining anchor 109. One version of the HTML specification is defined by RFC1866 and can be found on the WWW at http://andrew2.andrew.cmu.edu/rfc/rfc1866.html
FIG. 1b illustrates a presentation of the HTML data sample 100 as indicated by general reference character 120. The presentation 120 is generated by a browser application that processed the HTML data within the base document. The presentation 120 is similar regardless of whether it is displayed on a computer display with active hyperlinks or stored in an archival form such as a printed page. When a browser application displays the presentation 120 on a computer display, the information is presented in a window 121. Each of a plurality of displayed hyperlinks 123 is indicated by the display text provided within the anchor definition of the corresponding HTML markup as is well known in the art. A fragment text 125 starts after the fragment defining anchor 109. When a browser application displays the presentation 120 on a computer display, the user can select any of the plurality of displayed hyperlinks 123 to present the information referenced by that hyperlink. However, when the browser application archives the presentation 120 (such as on a printed page), the reader of the archive cannot determine the location of the information referenced by the hyperlink. The only information the reader receives is the display text associated with the URL by the hyperlink. Thus, the reader is unable to access the information referenced by the hyperlink.
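The loss of the URL in an archived presentation can be illustrated with a short sketch that uses the html.parser module of the Python standard library. The HTML fragment SAMPLE and the class name AnchorCollector are hypothetical; the fragment is not the HTML data sample 100 of FIG. 1a, and the text-only output merely approximates what a printed page would retain.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (display text, HREF) pairs and the text-only presentation."""

    def __init__(self):
        super().__init__()
        self.links = []     # (display text, HREF) for each hyperlink anchor
        self.text = []      # text as it would appear in a printed archive
        self._href = None   # HREF of the anchor currently open, if any
        self._label = []    # display text of the anchor currently open

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._label = []

    def handle_data(self, data):
        self.text.append(data)
        if self._href is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._label), self._href))
            self._href = None


# Hypothetical HTML fragment with one relative and one absolute hyperlink.
SAMPLE = (
    '<P>See the <A HREF="developers/chat.html">chat page</A> or '
    '<A HREF="http://www.example.com/index.html">the home page</A>.</P>'
)

collector = AnchorCollector()
collector.feed(SAMPLE)
collector.close()

printed = "".join(collector.text)
print(printed)          # See the chat page or the home page.
print(collector.links)  # the HREF values that the printed text omits
```

The printed text carries only the display text of each hyperlink, while the HREF values survive only in the separately collected list.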
FIG. 1c illustrates an overview of a `prior art print processing` process as indicated by general reference character 150. The process initiates at a `start` terminal 151 and continues to a `print command initialization` procedure 153 that initializes the print command options. Then a `get print parameters` procedure 155 displays a dialog and retrieves print information and options such as the number of copies, the page range to be printed and other printing related information from the user. Next a `print pages` procedure 157 prints the pages in accordance with the print command options. The process completes through an `end` terminal 159.
To summarize, the `prior art print processing` process 150 does not archive the hyperlinks contained in the base document because the URLs contained in the hyperlinks are not printed. Although a browser application's display of a hypertext document described by HTML indicates the hyperlinks and each hyperlink's URL, the archived document (such as a printed version of the hypertext document) does not. Only the display text associated with the URL by the hyperlink is printed. Thus, a reader of the printed document does not have the URL associated with the hyperlink, and the information referenced by the hyperlink is not accessible to the reader.
One skilled in the art will understand that saving the HTML data describing the document will save the URL specifications within the document. However, finding the URL of interest when it is embedded within the HTML definition of a document is often difficult and time consuming for the average WWW user. Another approach is to save the HTML statements in a file. This saved HTML file can later be input to a WWW browser so that the hyperlinks can be accessed in the normal manner. The difficulty with this approach is that users often prefer to keep paper images (often for filing with other information that is not from a computer source, such as a news clipping). Additionally, handouts and seminar and conference proceedings are now being created in hypertext form. However, paper copies of these handouts and proceedings do not include the hyperlink addresses. Finally, when the document is archived onto paper, the user can physically write the URL on the paper. However, a handwritten URL is error prone to produce and often difficult for another person to read.