This invention is related to client-server computer systems for retrieval of electronically published documents. More particularly, this invention is related to computer document retrieval systems for large documents written using a generalized markup language.
Electronic publication of documents, using non-paper media for transmission and storage, has become increasingly common. Electronically published documents are generally viewed by computer, and are preferably rendered or displayed on a computer screen or other output device in a formatted form. The DynaText system, a computer system available from Electronic Book Technologies of Providence, Rhode Island, is a system which is particularly useful for this purpose for very large documents.
Electronically published documents are increasingly being made available using a general markup language. A markup language provides indications of structure of the document, but excludes streams of graphic display instructions which are typically found in formatted documents. Markup languages are more portable between a variety of different machines that may use different graphic display commands. A commonly used markup language is the Standard Generalized Markup Language (SGML), an ISO standard.
Client-server computer systems for electronically publishing documents have also become increasingly available. Such a system typically includes one computer system (the server) on which documents are stored so that other computer systems (the clients) can access the information. The server and client communicate via messages conforming to a communication protocol sent over a communication channel such as a computer network. The server responds to messages from clients and processes requests to transmit a requested document.
An example of a client-server computer system for retrieval of electronically published documents that use a markup language is the World Wide Web (WWW) on the Internet. The WWW is a "web" of interconnected documents that are located in various sites on the Internet. The WWW is also described in "The World-Wide Web," by T. Berners-Lee, R. Cailliau, A. Luotonen, H. F. Nielsen, and A. Secret, Communications of the ACM, 37 (8), pp. 76-82, August 1994, and in "World Wide Web: The Information Universe," by T. Berners-Lee, et al., in Electronic Networking: Research, Applications and Policy, Vol. 1, No. 2, Meckler, Westport, Conn., Spring 1992.
Documents that are published on the WWW are written in the Hypertext Markup Language (HTML), such as described in Hypertext Markup Language Specification - 2.01 by T. Berners-Lee and D. Connolly, Internet Draft Document, Oct. 14, 1994, and in "World Wide Web and HTML," by Douglas C. McArthur, in Dr. Dobb's Journal, December 1994, pp. 18-20, 22, 24, 26 and 86. HTML documents stored as such are generally static, that is, the contents do not change over time unless the publisher modifies the document.
HTML is a markup language used for writing hypertext documents. HTML documents are SGML documents that conform to a particular Document Type Definition (DTD). An HTML document includes a hierarchical set of markup elements, where most elements have a start tag, followed by content, followed by an end tag. The content is a combination of text and nested markup elements. Tags are enclosed in angle brackets ('<' and '>') and indicate how the document is structured and how to display the document, as well as destinations and labels for hypertext links. There are tags for markup elements such as titles, headers, text attributes such as bold and italic, lists, paragraph boundaries, links to other documents or other parts of the same document, in-line graphic images, and many other features.
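As a minimal sketch of the hierarchical tag structure described above, the following Python fragment walks a small HTML document with the standard library's html.parser module and records each start tag, end tag, and run of text at its nesting depth. The sample document and the OutlineParser class are illustrative assumptions, not part of any system described here.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record tags and text with indentation that reflects nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1
        self.lines.append("  " * self.depth + f"</{tag}>")

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.lines.append("  " * self.depth + repr(data.strip()))


# Hypothetical sample document: a header, a paragraph, bold text, and a link.
SAMPLE = (
    "<html><body><h1>Title</h1>"
    '<p>Some <b>bold</b> text and a <a href="other.html">link</a>.</p>'
    "</body></html>"
)

parser = OutlineParser()
parser.feed(SAMPLE)
parser.close()
outline = parser.lines
print("\n".join(outline))
```

The sketch assumes every start tag has a matching end tag, which holds for the sample but not for all HTML.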
Each document available on the WWW has an identifier called a Uniform Resource Identifier (URI). These identifiers are described in more detail in Universal Resource Identifiers for the World Wide Web, T. Berners-Lee, submitted as an Internet Request for Comments (RFC), as yet unnumbered. A URI allows any object on the Internet to be referred to by name or address, such as in a hypertext link in an HTML document. There are two types of URIs: a Uniform Resource Name (URN) and a Uniform Resource Locator (URL). A URN references an object by name within a given name space. The Internet community has not yet defined the syntax of URNs. A URL references an object by defining an access algorithm using network protocols. An example URL is "http://www.ebt.com". A URL has the syntax "scheme://host:port/path?selector" where "scheme" identifies the access protocol (such as HTTP, FTP or GOPHER); "host" is the Internet domain name of the machine that supports the protocol; "port" is the optional Transmission Control Protocol (TCP) port number of the appropriate server (if different from the default); "path" is an identification of the object; and "selector" contains optional parameters.
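The URL syntax above can be illustrated with a short Python sketch that uses the standard library's urllib.parse module to separate a URL into the named parts. The port, path, and selector in the sample URL are made up for illustration; only the host name appears in the text above.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Break a URL into the parts named in the syntax description."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,       # None when the default port is implied
        "path": parts.path,
        "selector": parts.query,  # the optional parameters after '?'
    }


# Hypothetical URL: the port, path, and selector are illustrative only.
example = split_url("http://www.ebt.com:8080/docs/manual?chapter=3")
bare = split_url("http://www.ebt.com")
print(example)
print(bare)
```

When the port is omitted, as in the second call, the client is expected to fall back to the default port for the scheme.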
An Internet site that electronically publishes documents on the WWW is called a "Web site" and runs a "Web server," which is a computer program that allows a computer on the network to make documents available via the WWW. The documents are often hypertext documents in the HTML language, but may be other types of documents. Several Web server software packages exist that provide information on the Web, such as the Conseil Europeen pour la Recherche Nucleaire (CERN, the European Laboratory for Particle Physics) server or the National Center for Supercomputing Applications (NCSA) server. Web servers have been implemented for several different platforms, including the Sun Sparc 11 workstation running the Unix operating system, and personal computers with the Intel Pentium processor running the Microsoft MS-DOS operating system and the Microsoft Windows operating environment. The Web server also has a standard interface for running external programs, called the Common Gateway Interface (CGI). A gateway is a program that handles incoming information requests and returns the appropriate document or generates a document dynamically. For example, a gateway might receive queries, look up the answer in an SQL database, and translate the response into a page of HTML so that the server can send the result to the client. A gateway program may be written in a language such as "C" or in a scripting language such as Practical Extraction and Report Language (Perl) or Tcl or one of the Unix operating system shell languages. Perl is described in more detail in Programming Perl, by Larry Wall and Randal L. Schwartz, O'Reilly and Associates, Inc., Sebastopol, Calif., USA, 1992. The CGI standard specifies how the script or application receives input and parameters, and specifies how any output should be formatted and returned to the server.
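The gateway behavior described above (receive a query, look up the answer in an SQL database, and return a page of HTML) might be sketched as follows. This is a simplified illustration in Python rather than C or Perl; the table name, the rows, and the "product" parameter are invented for the example, and an in-memory SQLite database stands in for whatever database a real gateway would use.

```python
import html
import sqlite3
from urllib.parse import parse_qs


def make_demo_db():
    """Create a throwaway in-memory database (made-up rows, for illustration)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE manuals (title TEXT, product TEXT)")
    db.executemany(
        "INSERT INTO manuals VALUES (?, ?)",
        [("Installation Guide", "widget"), ("User Guide", "widget"),
         ("Service Manual", "gadget")],
    )
    return db


def gateway(query_string, db):
    """Answer one request: parse parameters, query SQL, emit CGI-style output."""
    params = parse_qs(query_string)
    product = params.get("product", [""])[0]
    rows = db.execute(
        "SELECT title FROM manuals WHERE product = ? ORDER BY title", (product,)
    ).fetchall()
    items = "".join(f"<li>{html.escape(title)}</li>" for (title,) in rows)
    body = f"<html><body><ul>{items}</ul></body></html>"
    # A CGI program writes a header block, a blank line, then the document.
    return "Content-Type: text/html\n\n" + body


# A real CGI program would read os.environ["QUERY_STRING"] and print the result.
response = gateway("product=widget", make_demo_db())
print(response)
```

Passing the query string and database in as arguments, rather than reading the environment directly, keeps the sketch testable without a Web server.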
A user (typically using a machine other than the machine used by the Web server) who accesses documents published on the WWW runs a client program called a "Web browser." The Web browser allows the user to retrieve and display documents from Web servers. Some of the popular Web browser programs are: the Navigator browser from Netscape Communications Corp., of Mountain View, Calif.; the Mosaic browser from the National Center for Supercomputing Applications (NCSA); the WinWeb browser, from Microelectronics and Computer Technology Corp. of Austin, Tex.; and the InternetWorks browser, from BookLink Technology, of Needham, Mass. Browsers exist for many platforms, including personal computers with the Intel Pentium processor running the Microsoft MS-DOS operating system and the Microsoft Windows environment, and Apple Macintosh personal computers.
The Web server and the Web browser communicate using the Hypertext Transfer Protocol (HTTP) message protocol and the underlying TCP/IP data transport protocol of the Internet. HTTP is described in Hypertext Transfer Protocol - HTTP/1.0 by T. Berners-Lee, R. T. Fielding, H. Frystyk Nielsen, Internet Draft Document, Dec. 19, 1994, and is currently in the standardization process. In HTTP, the Web browser establishes a connection to a Web server and sends an HTTP request message to the server. In response to an HTTP request message, the Web server checks for authorization, performs any requested action and returns an HTTP response message containing an HTML document resulting from the requested action, or an error message. For instance, to retrieve a static document, a Web browser sends an HTTP request message to the indicated Web server, requesting a document by its URL. The Web server then retrieves the document and returns it in an HTTP response message to the Web browser. If the document has hypertext links, then the user may again select a link to request that a new document be retrieved and displayed. As another example, a user may fill in a form requesting a database search. The Web browser then sends an HTTP request message to the Web server including the name of the database to be searched, the search parameters, and the URL of the search script. The Web server calls a program or script, passing in the search parameters. The program examines the parameters and attempts to answer the query, perhaps by sending a query to a database interface. When the program receives the results of the query, it constructs an HTML document that is returned to the Web server, which then sends it to the Web browser in an HTTP response message.
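The request and response exchange described above can be illustrated, without any network access, by composing a minimal HTTP/1.0 request message and parsing a canned response message. The function names, the path, and the response text are assumptions made for this sketch; a real browser would send additional header fields.

```python
def build_get_request(path):
    """Compose a minimal HTTP/1.0 request for a document path."""
    return f"GET {path} HTTP/1.0\r\n\r\n"


def parse_response(raw):
    """Split a response message into (status code, header dict, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    _version, code, _reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body


request = build_get_request("/index.html")

# Canned response standing in for what a server might return.
canned = (
    "HTTP/1.0 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)
status, headers, body = parse_response(canned)
print(repr(request))
print(status, headers, body)
```

Note that the body returned is always the whole document: nothing in this exchange lets the client ask for only part of it.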
At present, interaction between Web browsers and Web servers has a number of drawbacks. First, when a document is retrieved from a server by a client, the client must load the entire document into the client's memory. There is no protocol for accessing only a portion of a document. To provide acceptable performance, publishers are forced to maintain a large document as a collection of small document fragments, typically less than ten printed pages equivalent in length.
Another drawback is that Web browsers and servers generally do not support navigation tools to enhance reader navigation of a document once received by the client. Generally navigation is limited to scrolling, string searches, and links within a document. There generally is no table of contents or index that takes advantage of a document structure unless it is manually created and maintained by the publisher as a separate document.
Another restriction is that the destination of a link must always be an entire document file identified by its URL. There is no protocol for linking to targets that are inside of a document. All hyperlinks are required to behave in the same way and publishers can only use one type of hypertext link. Since URLs point to entire documents, the protocol requires transfer of an entire document when requested. The use of whole documents in the current implementation of the World Wide Web requires end users to wade through irrelevant information after invoking a hyperlink unless publishers commit to managing reusable information in many little files.
Using HTML for electronic publishing has additional problems. For example, if an application-specific SGML DTD is used by a publisher, the content must be down-converted to conform to the HTML document type before it can be retrieved via a Web browser. Potentially useful information, attributes and structure may be lost during conversion from SGML to HTML, because HTML tags are very limited, providing only rudimentary document elements and a fixed set of distinctions inadequate for effective information retrieval. There is also no direct support for SGML tables or SGMLTeX equations in HTML. As a result of the limitations of HTML, optimization of information retrieval capabilities using document structure is lost.
Web browsers also do not typically provide much control over the style or format to be applied to document elements. The same document is generally viewed in the same way every time, with variations depending only on the size of the window on the computer screen in which it is viewed. Sophisticated view controls that show information selectively (e.g., based on an access authorization code for security) are not available. Additionally, publishers cannot support similar documents that have multiple views of a single deliverable. Instead, WWW publishers must either annotate the exceptions in a single document or clone the shared content in multiple files. The end result is that either the end user is presented with irrelevant information or publishers must perform redundant maintenance.
Yet another problem with Web browsers is that they have no collection-level or SGML-aware full-text searching. The only way to find out about a document is to know its URL in advance. Once inside a document only rudimentary string search operations are supported, and there is no full-text index. A few Web servers support full-text search across documents via WAIS, which is not SGML-aware. However, WAIS cannot communicate with Web browsers directly. Thus, it can be difficult for a user to find relevant information because browsers generally a) have no full-text search across large collections of documents, b) have no support for Boolean expressions or proximity searches, and c) cannot use the SGML structure to narrow the scope of a search using SGML attributes.
Finally, there is little data anywhere but on the World Wide Web which conforms to the HTML document-type definition. Most of this existing HTML data does not pass any SGML parser and is difficult to reuse. Also, a number of client systems, also known as Web browsers, are providing their own enhanced versions of HTML, which results, or may result, in incompatibility among the different types of systems. At present, server systems do not include the processing capability to be client-aware, and thus require that multiple versions of documents be made available to accommodate different client systems.
These and other problems need to be overcome before the WWW or similar client-server system can be used to make very large documents available to many people over a heterogeneous computer network.
In the present invention, the client-server system improves over the prior art by providing the ability to reference an element within a document on the server. Improvements are also provided by using a mapping table to map elements in a first markup language, e.g., SGML, in one document into markup elements in another markup language, e.g., HTML. Additionally, customizable and client-aware formatting features may be provided in this down-conversion process. A mechanism to generate references to elements within a document, such as a full-text search engine, table of contents or concordance, may also be provided.
The document server system that handles documents written using a general markup language, such as SGML, down-converts documents into another markup language, such as HTML, using the mapping table. Any client system that processes the other markup language can then be used to view the down-converted documents. A client system can request an element within a document, as well, and the server system can down-convert only that portion of the document for transmission to the client. The reference to the element within the document includes a document locator, which is a reference to the document on the server, such as a URL, and an element locator, which indicates an element within the referenced document. Such a system takes advantage of the generality and rich contextual information provided by a general markup language, which improves portability and the ability to interact with many client systems in a heterogeneous computer system, and simplifies the electronic publishing process by reducing storage of redundant information. Tables of contents, full-text indexing and other navigational tools can be provided automatically and dynamically without requiring the generation of separately published documents with such information.
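The down-conversion of a single element through a mapping table can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the system described: a well-formed XML string stands in for an SGML document (Python's standard library has no SGML parser), and the element names and the MAPPING table are hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical mapping table: source element name -> HTML element name.
MAPPING = {"book": "div", "chapter": "div", "title": "h2",
           "para": "p", "emph": "em"}


def down_convert(elem):
    """Recursively convert one source element and its content to HTML."""
    inner = html.escape(elem.text or "")
    for child in elem:
        inner += down_convert(child)
        inner += html.escape(child.tail or "")
    tag = MAPPING.get(elem.tag)
    if tag is None:
        return inner  # unmapped element: keep its content, drop the tag
    return f"<{tag}>{inner}</{tag}>"


# Stand-in document with two chapters (illustrative content).
SOURCE = (
    "<book>"
    "<chapter><title>Intro</title>"
    "<para>Some <emph>important</emph> text.</para></chapter>"
    "<chapter><title>Usage</title><para>More.</para></chapter>"
    "</book>"
)

root = ET.fromstring(SOURCE)
second_chapter = root.findall("chapter")[1]
fragment_html = down_convert(second_chapter)  # only this portion is converted
print(fragment_html)
```

Because the conversion starts at whichever element is requested, the same function serves a whole document or a single chapter.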
In one embodiment of this invention on the WWW, the system provides a mechanism whereby very large SGML documents can be easily viewed by a client system by sending only a portion of the document, such as a chapter or section. In this embodiment, a document locator is a URL used in the WWW which is provided with an additional field, defined by a delimiter character followed by a unique element identifier which identifies an element within the referenced document by a number. Standard methods used for identifying an element within a document, such as those defined by the Text Encoding Initiative (TEI), may also be used. Such URLs can be generated at the time of construction of the element identifiers for cross-references within a document, at the time of a full-text index, or at the time of generation of a table of contents.
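A reference made up of a document locator and an element locator might be handled as in the following sketch. The text does not name the delimiter character or the numbering scheme, so both are assumptions here: a semicolon is used as the delimiter, and elements are numbered in document order starting at 1. The URL path is also hypothetical, and XML again stands in for SGML.

```python
import xml.etree.ElementTree as ET

DELIMITER = ";"  # assumed delimiter; the text does not specify the character


def split_reference(ref):
    """Split a reference into (document locator, element number or None)."""
    doc, sep, eid = ref.rpartition(DELIMITER)
    if not sep or not eid.isdigit():
        return ref, None  # no element locator: the whole document is meant
    return doc, int(eid)


def find_element(root, eid):
    """Return the element with the given number (assumed document-order count)."""
    for number, elem in enumerate(root.iter(), start=1):
        if number == eid:
            return elem
    return None


SOURCE = (
    "<book>"
    "<chapter><title>Intro</title></chapter>"
    "<chapter><title>Usage</title></chapter>"
    "</book>"
)
root = ET.fromstring(SOURCE)

# Hypothetical reference: document locator plus element number 4.
doc_url, element_id = split_reference("http://www.ebt.com/books/manual;4")
target = find_element(root, element_id)
print(doc_url, element_id, target.tag, target.find("title").text)
```

A table of contents generator could emit one such reference per chapter or section, so that following a link retrieves only that element.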
In an embodiment of the invention using SGML documents, down-conversion of an SGML document fragment can be performed in a manner similar to formatting of the SGML document fragment by using a mapping table for the corresponding document type definition which maps SGML elements to elements of the other markup language. The mapping table can be readily implemented as a style sheet, commonly used for specifying formatting properties for SGML documents. The use of a mapping table, particularly using style sheets, can also provide for a variety of customizing features, including automatic insertion of copyright notices, or other text before and after the document fragment. Conditional formatting, including client-aware formatting, can also be performed.
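The customizing features mentioned above (text inserted before and after a fragment, and client-aware formatting) could be expressed in a style sheet along the following lines. This is a sketch only: the STYLE_SHEET layout, its "before", "after", and "client_tags" keys, the client name "PlainBrowser", and the notice text are all invented for illustration, and XML stands in for SGML.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical style sheet: one entry per source element.
STYLE_SHEET = {
    "section": {"tag": "div", "after": "<p>(c) Example Publisher</p>"},
    "title": {"tag": "h2"},
    "para": {"tag": "p"},
    # Client-aware entry: a different HTML tag for one named client.
    "emph": {"tag": "em", "client_tags": {"PlainBrowser": "b"}},
}


def render(elem, client):
    """Convert an element to HTML, applying per-element style entries."""
    style = STYLE_SHEET.get(elem.tag, {})
    tag = style.get("client_tags", {}).get(client, style.get("tag"))
    inner = html.escape(elem.text or "")
    for child in elem:
        inner += render(child, client) + html.escape(child.tail or "")
    out = f"<{tag}>{inner}</{tag}>" if tag else inner
    # Optional fixed text placed before and after the converted element.
    return style.get("before", "") + out + style.get("after", "")


fragment = ET.fromstring(
    "<section><title>Safety</title>"
    "<para>Read <emph>all</emph> warnings.</para></section>"
)
default_view = render(fragment, "OtherBrowser")
plain_view = render(fragment, "PlainBrowser")
print(default_view)
print(plain_view)
```

Keeping the client-specific choices in the style sheet means one stored document can serve several kinds of client, rather than requiring a separate version for each.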
Another aspect of the invention is the provision of a mechanism which transmits two frames of information to a client, for those client programs which can provide for multiple views. For example, a table of contents and full-text search results can be sent to the client which can then display them in two separate display areas.