Electronic publication of documents, using non-paper media for transmission and storage, has become increasingly common. Electronically published documents are generally viewed by computer, and are preferably rendered or displayed on a computer screen or other output device in a formatted form. The DYNATEXT system, a computer system available from Electronic Book Technologies of Providence, Rhode Island, is a system which is particularly useful for this purpose for very large documents.
Electronically published documents are increasingly being made available using a general markup language. A markup language provides indications of the structure of the document, but excludes the streams of graphic display instructions that are typically found in formatted documents. Markup languages are therefore more portable among machines that may use different graphic display commands. A commonly used markup language is the Standard Generalized Markup Language (SGML), an ISO standard.
Client-server computer systems for electronically publishing documents have also become increasingly available. Such a system typically includes one computer system (the server) on which documents are stored so that other computer systems (the clients) can access the information. The server and client communicate via messages conforming to a communication protocol sent over a communication channel such as a computer network. The server responds to messages from clients and processes requests to transmit a requested document.
An example of a client-server computer system for retrieval of electronically published documents that use a markup language is the World Wide Web (WWW) on the Internet. The WWW is a "web" of interconnected documents that are located at various sites on a global computer network. The WWW is also described in "The World-Wide Web," by T. Berners-Lee, R. Cailliau, A. Luotonen, H. F. Nielsen, and A. Secret, Communications of the ACM, 37 (8), pp. 76-82, August 1994, and in "World Wide Web: The Information Universe," by T. Berners-Lee, et al., in Electronic Networking: Research, Applications and Policy, Vol. 1, No. 2, Meckler, Westport, Conn., Spring 1992.
Documents that are published on the WWW typically are written in the Hypertext Markup Language (HTML), such as described in Hypertext Markup Language Specification--2.01 by T. Berners-Lee and D. Connolly, Internet Draft Document, Oct. 14, 1994, and in "World Wide Web & HTML," by Douglas C. McArthur, in Dr. Dobbs Journal, December 1994, pp. 18-20, 22, 24, 26 and 86. HTML documents stored as such are generally static, that is, the contents do not change over time unless the publisher modifies the document.
HTML is a markup language used for writing hypertext documents. HTML documents are SGML documents that conform to a particular Document Type Definition (DTD). An HTML document includes a hierarchical set of markup elements, where most elements have a start tag, followed by content, followed by an end tag. The content is a combination of text and nested markup elements. Tags are enclosed in angle brackets ("<" and ">") and indicate how the document is structured and how to display the document, as well as destinations and labels for hypertext links. There are tags for markup elements such as titles, headers, text attributes such as bold and italic, lists, paragraph boundaries, links to other documents or other parts of the same document, in-line graphic images, and many other features.
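The hierarchical nesting of start tags, content, and end tags described above can be illustrated with a minimal sketch using Python's standard html.parser module. The sample document, the OutlineParser class, and its attribute names are hypothetical and are chosen only for illustration; the sketch assumes every element in the sample has both a start tag and an end tag.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Records each start tag together with its nesting depth.

    Assumption: every element in the input has a matching end tag,
    so the depth counter stays balanced.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# Hypothetical sample document containing only paired tags.
SAMPLE = (
    "<html><head><title>Guide</title></head>"
    "<body><h1>Intro</h1>"
    '<p>Some <b>bold</b> text and a <a href="ch2.html">link</a>.</p>'
    "</body></html>"
)

parser = OutlineParser()
parser.feed(SAMPLE)
parser.close()

# Print an indented outline of the element hierarchy.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The printed outline shows the "title" element nested inside "head", and the "b" and "a" elements nested inside the paragraph, which is the structure the markup conveys independently of any display instructions.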
Each document available on the WWW has one or more identifiers, each called a Uniform Resource Identifier (URI). These identifiers are described in more detail in Universal Resource Identifiers for the World Wide Web, T. Berners-Lee, submitted as an Internet Request for Comments (RFC), as yet unnumbered. A URI allows any object on the Internet to be referred to by name or address, such as in a hypertext link in an HTML document. There are two types of URIs: a Uniform Resource Name (URN) and a Uniform Resource Locator (URL). A URN references an object by name within a given name space. The Internet community has not yet defined the syntax of URNs. A URL references an object by defining a location and/or an access algorithm using network protocols. An example URL is "http://www.ebt.com". A URL has the syntax "scheme://host:port/path?selector" where "scheme" identifies the access protocol (such as HTTP, FTP or GOPHER); "host" is the Internet domain name of the machine that supports the protocol; "port" is the optional Transmission Control Protocol (TCP) port number of the appropriate server (if different from the default); "path" is an identification of the object; and "selector" contains optional parameters.
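As an illustration of the URL syntax just described, the following sketch splits a URL into the named components using Python's standard urllib.parse module. The function name describe_url, the port number, the path, and the selector value in the example are hypothetical; only the host name "www.ebt.com" comes from the text above.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Map a URL onto the scheme://host:port/path?selector components."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,    # access protocol, e.g. "http"
        "host": parts.hostname,    # Internet domain name of the server
        "port": parts.port,        # None when the default port is implied
        "path": parts.path,        # identification of the object
        "selector": parts.query,   # optional parameters after "?"
    }


# Hypothetical URL that exercises every component.
info = describe_url("http://www.ebt.com:8080/docs/manual?chapter=4")
for name, value in info.items():
    print(name, "=", value)
```

When the port is omitted, as in "http://www.ebt.com", the "port" entry is None and the client falls back to the default port for the scheme.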
A site on a network which electronically publishes documents on the WWW is called a "Web site" and runs a "Web server," which is a computer program that allows a computer on the network to make documents available via the WWW. The documents are often hypertext documents in the HTML language, but may be other types of documents. Several Web server software packages exist, such as the Conseil Europeen pour la Recherche Nucleaire (CERN, the European Laboratory for Particle Physics) server or the National Center for Supercomputing Applications (NCSA) server. Web servers have been implemented for several different platforms, including the Sun Sparc 11 workstation running the Unix operating system, and personal computers with the Intel Pentium processor running the Microsoft MS-DOS operating system and the Microsoft Windows operating environment. The Web server also has a standard interface for running external programs, called the Common Gateway Interface (CGI). A gateway is a program that handles incoming information requests and either returns the appropriate document or generates a document dynamically. For example, a gateway might receive queries, look up the answer in an SQL database, and translate the response into a page of HTML so that the server can send the result to the client. A gateway program may be written in a language such as "C" or in a scripting language such as Practical Extraction and Report Language (Perl), Tcl, or one of the Unix operating system shell languages. Perl is described in more detail in Programming Perl, by Larry Wall and Randal L. Schwartz, O'Reilly & Associates, Inc., Sebastopol, Calif., USA, 1992. The CGI standard specifies how the script or application receives input and parameters, and specifies how any output should be formatted and returned to the server.
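The gateway behavior described above can be sketched as follows. This is a simplified illustration, not the implementation of any particular server: the function name gateway, the parameter name "term", and the in-memory CATALOG dictionary (standing in for the SQL database mentioned above) are all assumptions. The sketch relies only on two CGI conventions: the query arrives in the QUERY_STRING environment variable, and the output begins with a Content-Type header followed by a blank line.

```python
import html
from urllib.parse import parse_qs

# Hypothetical stand-in for the database a real gateway would query.
CATALOG = {
    "sgml": "Standard Generalized Markup Language",
    "html": "Hypertext Markup Language",
}


def gateway(environ: dict) -> str:
    """Build the CGI output for a lookup request.

    `environ` plays the role of the process environment; a real CGI
    program would read os.environ and write the result to stdout.
    """
    params = parse_qs(environ.get("QUERY_STRING", ""))
    term = params.get("term", [""])[0].lower()
    answer = CATALOG.get(term, "no entry found")
    # Escape user-supplied text before placing it in the HTML page.
    body = "<html><body><p>{}: {}</p></body></html>".format(
        html.escape(term), html.escape(answer)
    )
    # Header line, blank line, then the generated document.
    return "Content-Type: text/html\n\n" + body


output = gateway({"QUERY_STRING": "term=SGML"})
print(output)
```

The server would relay the generated page to the client in its response; the gateway itself never communicates with the browser directly.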
A user who accesses documents published on the WWW (typically using a machine other than the machine used by the Web server) runs a client program called a "Web browser." The Web browser allows the user to retrieve and display documents from Web servers. Some of the popular Web browser programs are: the Navigator browser from Netscape Communications Corp., of Mountain View, Calif.; the Mosaic browser from the National Center for Supercomputing Applications (NCSA); the WinWeb browser, from Microelectronics and Computer Technology Corp. of Austin, Tex.; and the InternetWorks browser, from BookLink Technology, of Needham, Mass. Browsers exist for many platforms, including personal computers with the Intel Pentium processor running the Microsoft MS-DOS operating system and the Microsoft Windows environment, and Apple Macintosh personal computers.
The Web server and the Web browser communicate using the Hypertext Transfer Protocol (HTTP) message protocol and the underlying TCP/IP data transport protocol of the computer network. HTTP is described in Hypertext Transfer Protocol--HTTP/1.0 by T. Berners-Lee, R. T. Fielding, H. Frystyk Nielsen, Internet Draft Document, Dec. 19, 1994, and is currently in the standardization process. In HTTP, the Web browser establishes a connection to a Web server and sends an HTTP request message to the server. In response to an HTTP request message, the Web server checks for authorization, performs any requested action and returns an HTTP response message containing either an HTML document resulting from the requested action or an error message. For instance, to retrieve a static document, a Web browser sends an HTTP request message to the indicated Web server, requesting a document by its URL. The Web server then retrieves the document and returns it in an HTTP response message to the Web browser. If the document has hypertext links, then the user may select a link to request that a new document be retrieved and displayed. As another example, if a user fills in a form requesting a database search, the Web browser sends an HTTP request message to the Web server including the name of the database to be searched, the search parameters, and the URL of the search script. The Web server calls a program or script, passing in the search parameters. The program examines the parameters and attempts to answer the query, perhaps by sending a query to a database interface. When the program receives the results of the query, it constructs an HTML document that is returned to the Web server, which then sends it to the Web browser in an HTTP response message.
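The request and response exchange for a static document can be sketched without a network connection by constructing and parsing the messages as text. The helper names build_request and parse_response, the path "/doc.html", and the canned response are hypothetical; the sketch assumes the simple HTTP/1.0 message layout of a first line, optional header lines, a blank line, and then the body.

```python
def build_request(path: str) -> str:
    """Return a minimal HTTP/1.0 request message for the given path."""
    return "GET {} HTTP/1.0\r\n\r\n".format(path)


def parse_response(raw: str):
    """Split a response message into status code, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Status line has the form: version, status code, reason phrase.
    version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status), headers, body


request = build_request("/doc.html")

# Hypothetical response a server might return for the request above.
canned = (
    "HTTP/1.0 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body><p>Hello</p></body></html>"
)
status, headers, body = parse_response(canned)
print(status, headers["content-type"], len(body))
```

Note that the body returned here is always the complete document: nothing in the request identifies a portion of the document to be transferred.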
Interaction between Web browsers and Web servers has a number of drawbacks. First, when a document is retrieved from a server by a client, the client typically must load the entire document into the client's memory. There is no protocol which allows access to only a portion of a document. To provide acceptable performance, publishers must maintain a large document as a collection of small document fragments, each typically less than the equivalent of a few tens of printed pages in length. Such collections of small document fragments lead to document management problems.
Another restriction of the Web is that the destination of a link is typically an entire document file identified by its URL. There is no protocol for linking to targets that are a portion of a document. Although bookmarks may be used which are in the form of "http://x.com/doc.html#chap4", using such a URL causes the whole document "doc.html" to be loaded and causes the client browser to scroll to the portion labeled "chap4". Since, in practice, URLs point to entire documents, the protocol effectively requires transfer of an entire document when requested. The use of whole documents in the current implementation of the World Wide Web requires end users to wade through irrelevant information after invoking a hyperlink unless publishers commit to managing reusable information in many little files.
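The behavior of the "#chap4" label can be illustrated with a short sketch using Python's standard urllib.parse module, with the example URL from the paragraph above. The sketch reflects the usual client convention, assumed here, that the label is separated from the URL and handled by the browser rather than sent to the server.

```python
from urllib.parse import urldefrag

# Separate the label from the part of the URL that identifies the file.
document_url, label = urldefrag("http://x.com/doc.html#chap4")

print(document_url)  # the whole document that is requested and transferred
print(label)         # used only by the browser to scroll after loading
```

Because only the document URL takes part in the request, the server has no opportunity to return just the labeled portion.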
One difficulty with maintaining several small documents is that an electronic document without reference to a paper-based medium may not have clearly definable portions. Although documents prepared using a descriptive markup language have a structure defined by the markup, such markup defines segments which generally have variable sizes. They may be as small as one word or as large as several printed pages.
When a publisher provides many small documents, a user may want to view related documents which could be considered as occurring before or after the document being viewed. In current systems designed for Web servers on either a global or a local computer network, where the document is already divided into predetermined segments, the publisher typically inserts a hypertext link, in the form of a graphic or text for example, in each document to refer to the previous or next related document. Such a publication system, however, places an unnecessary document management burden on the publisher.
In systems like the DYNATEXT publishing system, a predetermined amount of data is selected from within a document and is viewed by the user. Such a system may read files from a CD-ROM or from an electronic document stored on a file server on a LAN. If a previous or subsequent document fragment is requested, another predetermined amount of data is prepared, or the system scrolls through previous and subsequent portions of the electronic document. However, a sequence of requests for previous segments and then following segments may not produce the same result all of the time in DYNATEXT.
Accordingly, it is a general aim of the invention to provide a mechanism for accessing only a portion of a large electronically published document, and to automatically determine what portion of the document to select as a previous portion or a next portion without maintaining separate data files of each portion of the document.