The invention relates to capturing hypertext web pages for convenient viewing.
The World Wide Web (“the web”) of the Internet has in recent years become a popular means of publishing documentary information. In particular, it is now common for users with access to the web to browse collections of linked documents using hypertext browsers, such as Netscape Navigator™ or Microsoft Internet Explorer™, whereby selection by the user of certain screen objects in a displayed document causes the contents of another document to be retrieved and displayed to the user.
Many of the documents on the web are encoded using a markup language known as the Hypertext Markup Language (HTML). HTML Version 3.2 with Frame Extensions is described in Graham, HTML Sourcebook, Third Edition, published by Wiley Computer Publishing, 1997. A markup language is a set of codes or tags that can be embedded within a document to describe how it should be displayed on a display device, such as a video screen or a printer. HTML is what is known as a “semantic” markup language. This means that, while it is possible to use HTML to dictate certain physical characteristics of a document (such as line spacing or font size), many HTML tags merely identify the logical features of the document, such as titles, paragraphs, lists, tables, and the like. The precise manner in which these logical features are displayed is then left to the browser software to determine at the time the document is displayed.
Because HTML tags often do not specify a fixed physical size of a document or its components, the precise appearance of a particular document displayed by a browser will often depend on the size of the browser window in which it is displayed. For example, FIGS. 1 and 2 show two views of the home web page of the US Patent and Trademark Office (specified by Uniform Resource Locator (URL) http://www.uspto.gov/ in September of 1997). In FIG. 2, the web browser window is significantly smaller than that in FIG. 1 and, as can be seen, the web page as seen through the two windows differs in its overall appearance, for example with respect to the width of the title 30 and list element 40.
One important feature of HTML is the ability, within an HTML document, to refer to external data resources. One way that such references are used within HTML is to identify auxiliary documents that are sources of content to be displayed as part of the display of the HTML document. For example, the HTML tag “IMG” specifies that the contents of a specified image document should be displayed within a portion of the display of the HTML document in which the IMG tag is found. Similarly, the tag “FRAME” within an HTML document specifies that the content of a specified document should be displayed within a particular frame of a frame set defined by the HTML document. The use of frames and frame sets within HTML is explained in more detail below.
HTML also features the ability to have a hypertext link within an HTML document. A hypertext link within an HTML document creates an association between a screen object (e.g., a word or an image) and an external resource. When the HTML document is displayed by a browser, a user may select the screen object, and the browser will respond by retrieving and displaying content from the external resource. A hypertext link may be specified within an HTML document with, for example, the HTML anchor tag with an HREF attribute.
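The two kinds of external reference described above (auxiliary content displayed as part of the document, and hypertext links followed only when the user selects them) can be illustrated with a short sketch. It is offered only as an illustration, using the HTMLParser class from the Python standard library; the class name ReferenceCollector, the function collect_references, and the sample file names logo.gif and section2.html are hypothetical and do not come from any particular browser or document.

```python
from html.parser import HTMLParser


class ReferenceCollector(HTMLParser):
    """Collects external references from IMG, FRAME, and A tags (hypothetical helper)."""

    # Maps a tag name to the attribute that carries its external reference.
    REFERENCE_ATTRS = {"img": "src", "frame": "src", "a": "href"}

    def __init__(self):
        super().__init__()
        self.embedded = []    # content displayed as part of the document
        self.hyperlinks = []  # content retrieved only when the user selects a link

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this method.
        wanted = self.REFERENCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                target = self.hyperlinks if tag == "a" else self.embedded
                target.append(value)


def collect_references(html_text):
    """Return (embedded references, hypertext links) found in html_text."""
    collector = ReferenceCollector()
    collector.feed(html_text)
    collector.close()
    return collector.embedded, collector.hyperlinks


# Hypothetical sample: one embedded image and one hypertext link.
SAMPLE = '<IMG SRC="logo.gif"><A HREF="section2.html">Next section</A>'
embedded, hyperlinks = collect_references(SAMPLE)
```

In this sketch the image reference ends up in the embedded list and the anchor reference in the hyperlinks list, mirroring the distinction drawn in the text between content displayed with the document and content retrieved on selection.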
The use of such external references within HTML facilitates distributed document storage on a wide area network (WAN). A large document may be broken up and stored as a set of smaller documents logically associated by external references. For example, it is common for the graphical images in an HTML document to be stored as separate documents (e.g., in the GIF or JPEG format). It is also common to store sections of a large text as separate documents, and to facilitate easy movement from one section to another through the use of hypertext links.
In addition, a set of pre-existing documents may be linked together with HTML tags to form a coherent whole. For example, an HTML document may be created containing hypertext links to a set of pre-existing documents relating to a common subject, thus facilitating the systematic review of such documents by a user.
A characteristic of HTML documents is that they are not paginated. That is, the displayed “height” of an HTML document is determined solely by the amount and arrangement of the screen objects defined within it, as displayed by the browser used to view it, and not by any fixed page size associated with the document. (Here “page size” does not necessarily refer to physical pages printed on paper, but is simply a characteristic of an electronic document in which the content of the document is divided into a sequence of regions with fixed dimensions.) If the displayed document does not fit within the height of the browser window, the browser allows the web page to be scrolled so that additional content can be viewed. FIG. 3 shows the home web page of the US Patent and Trademark Office displayed within the same browser window as in FIG. 2, except that the page has been scrolled somewhat to reveal additional material.
A recent extension to HTML permits multiple scrollable and resizable “frames” to be displayed within a single browser window. A frame is defined by a special type of HTML document known as a “frame set”. A frame set provides information giving the size and orientation of frames in a window, and specifies the contents of each frame. The contents of a frame may be either the contents of an HTML document, or a subsidiary frame set (i.e., a frame set, the entire contents of which appear within a single frame of the larger frame set). As with other HTML screen objects, the height or width of a frame may be specified in absolute or relative terms.
FIGS. 4, 5 and 6 illustrate the operation of frames in HTML. FIG. 4 shows a browser window displaying a frame set containing two frames. Frame 50 is a narrow vertical column on the left hand side of the screen. Frame 55 is a wider column to the right of frame 50. Frame 50 contains an HTML document that is as long as the browser window is high, while frame 55 contains a document that is longer than the browser window's height. As can be seen in FIG. 5, frame 55 can be scrolled independently of frame 50 to display the remainder of the HTML document contained within it.
In the above example, frame 50 is defined to have a fixed width of 115 pixels, whereas the width of frame 55 is defined in relative terms: it is set equal to the browser window's width, less the 115 pixels used by frame 50. As can be seen in FIG. 6, when the browser window is made smaller, frame 55 shrinks accordingly, while frame 50 remains at a fixed width.
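The arithmetic behind this behavior can be sketched as follows. The sketch assumes that the frame set expresses its column widths in a comma-separated form such as "115,*", as the COLS attribute of the FRAMESET tag permits; the source does not reproduce the actual markup, so that spec string, the function name resolve_frame_widths, and the window widths of 800 and 400 pixels are illustrative assumptions. Real browsers apply further rules (for example, when fixed widths exceed the window) that are not modeled here.

```python
def resolve_frame_widths(cols_spec, window_width):
    """Resolve a comma-separated column spec into pixel widths (simplified sketch).

    Each entry is a pixel count ("115"), a percentage of the window ("25%"),
    or a relative share of the leftover space ("*" or "2*").
    """
    entries = [entry.strip() for entry in cols_spec.split(",")]
    widths = [0] * len(entries)
    weights = {}  # index of each relative entry -> its share weight

    for index, entry in enumerate(entries):
        if entry.endswith("*"):
            weights[index] = int(entry[:-1] or "1")
        elif entry.endswith("%"):
            widths[index] = window_width * int(entry[:-1]) // 100
        else:
            widths[index] = int(entry)

    # Relative entries divide whatever the fixed entries leave over.
    remaining = max(window_width - sum(widths), 0)
    total_weight = sum(weights.values())
    for index, weight in weights.items():
        widths[index] = remaining * weight // total_weight
    return widths


# Hypothetical window widths: the fixed frame keeps 115 pixels in both cases.
wide = resolve_frame_widths("115,*", 800)    # [115, 685]
narrow = resolve_frame_widths("115,*", 400)  # [115, 285]
```

As in FIG. 6, shrinking the window changes only the width of the relative frame; the fixed frame is unaffected.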
As explained above, the ultimate appearance of an HTML document being displayed by a browser will usually depend on the size of the browser window (or frame) in which it is to be displayed. In general, a web browser will extract from an HTML document a series of screen objects (e.g., words, images, lists, frames or tables), and place them sequentially in rows on the screen. When a row has been filled, the next object is placed in a successive row. This process continues until all screen objects within the HTML document have been placed.
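The row-filling process just described can be sketched as a simple greedy placement. This is a minimal illustration under stated assumptions, not the layout algorithm of any actual browser: each screen object is reduced to a single width in pixels, spacing between objects is ignored, and the function name place_in_rows and the sample widths are hypothetical.

```python
def place_in_rows(object_widths, window_width):
    """Place objects left to right, starting a new row whenever one is full.

    An object wider than the window is still placed, alone on its own row,
    where it overflows the window.
    """
    rows = []
    current_row = []
    used = 0
    for width in object_widths:
        # Start a new row if this object does not fit in what is left.
        if current_row and used + width > window_width:
            rows.append(current_row)
            current_row = []
            used = 0
        current_row.append(width)
        used += width
    if current_row:
        rows.append(current_row)
    return rows


# Hypothetical object widths placed in a window 100 pixels wide.
rows = place_in_rows([40, 30, 50, 20, 60], 100)  # [[40, 30], [50, 20], [60]]
```

Running the same objects through a narrower or wider window yields a different arrangement of rows, which is why the appearance of a document depends on the window in which it is displayed.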
This general principle, however, is limited by the constraint that the displayed HTML document cannot be narrower than the largest minimum width of any screen object contained within it. Under this constraint, if the minimum width of a screen object is wider than the width of the browser window, parts of the document will remain off screen (to the left or right) when viewed through the browser window, and a horizontal scroll bar will typically be displayed to permit the user to shift the view of the document to the left or right.
HTML screen objects may have either a fixed or a variable width. For example, the width of a single word of text in an HTML document is fixed once the browser has chosen the font in which to display it: its width is determined by the characters in the word and the size of the font in which they will be displayed. Similarly, the width of a cell in an HTML table may be made fixed by explicitly specifying its width as a certain number of pixels.
By contrast, the width of a variable width screen object will vary, depending on the width of the browser window in which it appears. However, even a variable width screen object will have a minimum width. For example, the width of a paragraph of text will generally vary according to the size of the browser window; however, it can be no narrower than the widest word contained within the paragraph. Similarly, a table containing images may have cells whose widths are defined in relative terms, but the table nonetheless cannot be narrower than the sum of the widths of the images within its widest row.
This constraint is illustrated in FIGS. 7, 8, 9 and 10. In each of FIGS. 7, 8 and 9, an identical HTML document is displayed in a browser window 65. An excerpt of the underlying HTML code is shown in FIG. 10. Referring to FIGS. 7 and 10, the document being displayed includes a table 80 having two cells aligned to the top, one cell 85 containing a client-side image map and the other cell 90 containing the heading “US Patent and Trademark Office”, a horizontal line, and an unordered list with the heading “New on the PTO site:”. In FIG. 8, window 65 is narrower than in FIG. 7, but wider than the minimum width of any object on the screen. Therefore, each line of the document is adjusted to be as wide as window 65 and nothing is hidden from the user to the right of the browser window. By contrast, in FIG. 9, window 65 is narrower than the minimum width of table 80, since the fixed width of the image map in cell 85 plus the width of the widest word in cell 90 (the word “Trademark”) is greater than the width of the browser window 65. Therefore, the resulting display width of the document is wider than window 65, resulting in the rightmost part of the document being hidden from view.
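The minimum-width constraint can be expressed as a small calculation. The following is a rough sketch only: it assumes a fixed-pitch font so that the width of a word is its character count times a constant, and the figures used (8 pixels per character, a 200-pixel image map, window widths of 250 and 400 pixels) are hypothetical values chosen for illustration, not measurements taken from the figures. The function names are likewise hypothetical.

```python
def paragraph_min_width(words, char_width):
    """A paragraph can be no narrower than its widest word."""
    return max((len(word) * char_width for word in words), default=0)


def table_min_width(rows):
    """A table can be no narrower than its widest row.

    rows is a list of rows, each a list of the minimum widths of its cells.
    """
    return max((sum(row) for row in rows), default=0)


def displayed_width(window_width, object_min_widths):
    """The document fills the window unless some object forces it wider."""
    return max([window_width, *object_min_widths])


# Hypothetical figures loosely modeled on the two-cell table described above.
heading_words = ["US", "Patent", "and", "Trademark", "Office"]
text_cell = paragraph_min_width(heading_words, char_width=8)  # 9 chars * 8 = 72
image_cell = 200                                              # assumed fixed width
table = table_min_width([[image_cell, text_cell]])            # 200 + 72 = 272

fits = displayed_width(400, [table])       # 400: the window is wide enough
overflows = displayed_width(250, [table])  # 272: wider than the window
```

When the displayed width exceeds the window width, as in the last line, the excess is hidden from view and is reached by horizontal scrolling.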
While collections of visual display data on the web are typically stored as sets of linked HTML documents, it is also common and convenient for visual display data to be stored as a single document, having a fixed page size, using a physical markup language such as the portable document format (PDF). PDF is described in the publication Adobe Systems, Inc., Portable Document Format Reference Manual, Addison-Wesley Publishing Co., 1993.