1. Field of the Invention
The present invention relates to a computer system, and deals more particularly with a method, system, and computer-readable code for reducing the size of documents (such as XML and DTD documents) through novel compression techniques.
2. Description of the Related Art
Extensible Markup Language, or “XML”, is a standardized formatting notation, created for structured document interchange on the World Wide Web (hereinafter, “Web”). XML is a tag language, where specially-designated constructs referred to as “tags” are used to delimit (or “mark up”) information. In the general case, a tag is a keyword that identifies the data associated with it, and is typically composed of a character string enclosed in special characters. “Special characters” means characters other than letters and numbers, which are defined and reserved for use with tags. Special characters are used so that a parser processing the data stream will recognize that this is a tag. A tag is normally inserted preceding its associated data; a corresponding tag may also be inserted following the data, to clearly identify where that data ends. As an example of using tags, the syntax “<email>” could be used as a tag to indicate that the character string appearing in the data stream after this tag is to be treated as an e-mail address; the syntax “</email>” would then be inserted after the character string, to delimit where the e-mail character string ends.
The syntax of XML is extensible because it provides users the capability to define their own tags. XML is based on SGML (Standard Generalized Markup Language), which is an international standard for specifying document structure. SGML provides for a platform-independent specification of document content and formatting. XML is a simplified version of SGML, tailored to Web document content. (Refer to ISO 8879, “Standard Generalized Markup Language (SGML)”, (1986) for more information on SGML, and to “Extensible Markup Language (XML), W3C Recommendation Feb. 10, 1998” which is available on the World Wide Web at http://www.w3.org/TR/1998/REC-xml-19980210, for more information on XML.)
XML is widely accepted in the computer industry for defining the semantics (that is, by specifying meaningful tags) and content of the data encoded in a file. The extensible, user-defined tags enable the user to easily define a data model, which may change from one file to another. When an application generates the tags (and corresponding data) for a file according to a particular data model and transmits that file to another application that also understands this data model, the XML notation functions as a conduit, enabling a smooth transfer of information from one application to the other. By parsing the tags of the data model from the received file, the receiving application can re-create the information for display, printing, or other processing, as the generating application intended it.
A Document Type Definition, or “DTD”, may be used with an XML file. In general, a DTD is a definition of the structure of an SGML document, and is written using SGML syntax. The DTD is encoded in a file which is intended to be processed, along with the file containing a particular document, by an SGML parser. The DTD tells the parser how to interpret the document which was created according to that DTD. DTDs are not limited to use with XML, and may in fact be used to describe any document type. For example, suppose a DTD has been created for documents of type “memo”. Memos typically contain “To” and “From” information. The DTD would contain definitional elements for these items, telling the parser what to do when it encounters “To” and “From” in an actual memo (such as using bold text for printing or displaying the words “To” and “From”, left-justifying the lines on which they appear, etc.). The HyperText Markup Language, or “HTML”, is a popular example of a notation defined using an SGML DTD. HTML is used for specifying the content and formatting of Web pages, where “Web browser” software processes the HTML definition along with a Web page in the same manner an SGML parser is used for other DTDs and document types. When used with XML, a DTD specifies how the tags defined for this particular document type are to be inserted into the XML data stream when the XML file is being created. When a user wishes to print or display a document encoded according to this DTD, the software (i.e. the parser, compiler or other application) uses the DTD file to determine how to process the contents of the XML document file.
Because the XML tags are defined by humans, and intended to be human-readable as well as machine-processable, they may become quite long in terms of character length. Each opening tag requires a matching closing (or “end”) tag, so that the number of characters required to express a given tag effectively doubles. As an example of a tag that may be defined, suppose a user wishes to represent names and addresses in a file. The tags used to delimit the name may be simply “<name>” and “</name>”, where the angle brackets are the SGML (and XML) syntax designated as bracketing a tag, and the combination of the “/” symbol with an opening angle bracket further designates that this is the end tag. Alternatively, longer tags could be used such as “<customer_name>” and “</customer_name>”, or separate tags could be used to separate the first name, middle initial, and last name when the name was associated with a person. The longer the tag, the more descriptive it will tend to be. For example, if the data model includes not only one person's name, but perhaps a spouse name and children's names, or an employer's name, then more characters will need to be used in the tags (such as “<employee_name>” and “<company_name>”) to enable a human reader to understand which name is which. The value to be used for the information represented by a tag is then encoded between the opening and closing tag. For example, suppose a company name is “Acme Widget”. According to this example, the string “<company_name>Acme Widget</company_name>” would be used to encode this information in a document.
The document could contain many other company names, which would be similarly encoded. Other document types which do not use company names simply define different tags, for the information that is pertinent to those document types.
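As a rough illustration of how tag length inflates a document, the following sketch measures the fraction of characters consumed by markup for the short and long tag names used in the example above. The function name `markup_overhead` is hypothetical, and the regular expression assumes simple tags with no comments or CDATA sections.

```python
import re

def markup_overhead(xml_text):
    """Return the fraction of characters in xml_text that belong to tags."""
    # Assumption: every "<...>" run is a tag (no comments, CDATA, or DOCTYPE).
    tag_chars = sum(len(tag) for tag in re.findall(r"<[^<>]*>", xml_text))
    return tag_chars / len(xml_text)

short_doc = "<name>Acme Widget</name>"
long_doc = "<company_name>Acme Widget</company_name>"

# The same 11-character value costs 13 tag characters with the short tags
# and 29 tag characters with the more descriptive ones.
print(markup_overhead(short_doc), markup_overhead(long_doc))
```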
There is one exception to the requirement for matching end tags for each opening tag. It may be that there is no value for the tag in a particular usage. Suppose, for example, that the person from the data model discussed above has no spouse. In that situation, no value appears between the tags where the spouse name would otherwise be located. A short-hand specification technique has been defined for this null-value case, where a “/” character is inserted into the opening tag preceding the “>” character. If “<spouse>” and “</spouse>” are the tags used for bracketing the spouse string in this model, then the shorthand representation takes the form “<spouse/>”.
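The equivalence of the two forms can be checked with any standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the enclosing `person` element and the helper `describe` are illustrative assumptions, not part of the data model described above.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<person><spouse></spouse></person>")
short_form = ET.fromstring("<person><spouse/></person>")

def describe(root):
    """Summarize the spouse element as (tag name, text value, child count)."""
    spouse = root.find("spouse")
    # An element with no character data is normalized to an empty string.
    return (spouse.tag, spouse.text or "", len(spouse))

# Both spellings yield the same element with no value.
print(describe(long_form) == describe(short_form))
```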
The longer the length of the tags in the file, the larger the file becomes. While file size may not be an issue in some computing environments, such as where a server in a network has access to banks of storage devices, there are many situations where file size can become a critical factor in operating a computer. When the file is to be received at a constrained-storage device such as a handheld computer, Personal Digital Assistant (“PDA”), or other pervasive computing device, the larger the size of the file, the more likely it is that problems will arise when trying to store it at the receiver. And, the larger the file, the longer it will take to transmit the file between computers. The popularity of using portable computers such as handheld devices for connecting to the Internet, or other networks of computers, is increasing as user interest in computing becomes pervasive and users are more often working in mobile environments. At the same time, the popularity of making network connections using connection services that charge fees based upon the duration of connections (such as cellular services, which are commonly used for wireless connections from portable computers) is also growing. When using this type of relatively expensive connection, the longer the user must wait to receive a file, the higher his connection charges will be.
These factors illustrate the importance of minimizing the size of files being transmitted and stored. Efforts to compress XML files using binary compression algorithms are underway. This type of compression is similar to the commonly-known techniques with which “zip” files are created to reduce file size. A widely used program for compressing files into zip format is known as “PKZIP”, developed by PKware, Inc. A companion program, “PKUNZIP”, is required to decompress the zipped file back into a usable form. In a similar manner, binary compression of XML files will require that a decompression program is installed and running on a machine that receives compressed XML and wishes to process the original XML contents. If the receiving machine is storage-constrained, storing an XML decompression program on the machine will result in less storage available for the user's data files and other applications. Further, the compressed XML file will be unreadable (and therefore unusable) on any machine which does not have the decompression software, thereby limiting the transportability of the XML files in contradiction to one of the original goals of the XML notation. Thus, if compression is implemented in a manner that requires complementary decompression software, care must be taken to ensure that the decompression software is available at the receiver, and that the software is as compact as possible.
Accordingly, a need exists for a technique with which files encoded according to an SGML derivative notation can be compressed, thereby mitigating the drawbacks of storing and transmitting large files discussed above. The proposed technique provides a novel way to reduce the size of XML files for transmission and storage, and a novel way to reduce the size of DTD files as well. In one aspect, this compression is achieved in a manner that does not require special decompression software to decompress the files when the user requests to process them.
An object of the present invention is to provide a technique whereby XML files can be compressed, reducing the file size for storage and transfer.
Another object of the present invention is to provide a technique whereby DTD files can be compressed, reducing their file size for storage and transfer.
Still another object of the present invention is to provide a technique whereby XML file size is reduced by compressing strings.
It is another object of the present invention to provide this XML string compression in a manner that does not require special decompression software to decompress the files for processing.
A further object of the present invention is to provide a technique whereby XML and/or DTD file size is reduced by compressing tags.
Yet another object of the present invention is to provide a technique whereby XML and/or DTD file size is reduced by compressing attributes within tags.
Other objects and advantages of the present invention will be set forth in part in the description and in the drawings which follow and, in part, will be obvious from the description or may be learned by practice of the invention.
To achieve the foregoing objects, and in accordance with the purpose of the invention as broadly described herein, the present invention provides a software-implemented process for use in a computing environment for reducing document file size by tag compression, comprising: an input file encoded in a derivative of Standard Generalized Markup Language (SGML); a subprocess for reading the encoded file; a subprocess for locating each of a plurality of tags in the encoded file; a subprocess for substituting a unique short tag for each unique one of the located tags in the encoded file; and a subprocess for storing a correspondence between each of the short tags and the located tag for which it was substituted. Preferably, the derivative is Extensible Markup Language (XML). Further, a subprocess for decompressing a compressed file resulting from the subprocess for substitution may be used, comprising: a subprocess for reading the compressed file; a subprocess for locating each of the substituted short tags in the compressed file; a subprocess for reading the stored correspondence between short tags and located tags, retrieving the stored located tag corresponding to the located substituted short tag; and a subprocess for substituting the retrieved tag for the located substituted short tag in the compressed file.
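The tag compression and decompression subprocesses described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the short-tag naming scheme (`t0`, `t1`, ...), the function names, and the use of an in-memory dictionary to hold the stored correspondence are all hypothetical, and the regular expression assumes the input contains no comments or CDATA sections in which tag-like text appears.

```python
import re

# Matches the name at the start of an opening or closing tag,
# e.g. "<company_name" or "</company_name".
TAG_NAME = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)")

def compress_tags(xml_text):
    """Substitute a unique short tag for each unique tag name.

    Returns the compressed text and the correspondence
    (short tag -> original tag) needed for decompression.
    """
    table = {}  # original tag name -> short tag name

    def shorten(match):
        slash, name = match.groups()
        short = table.setdefault(name, f"t{len(table)}")
        return f"<{slash}{short}"

    compressed = TAG_NAME.sub(shorten, xml_text)
    return compressed, {short: name for name, short in table.items()}

def decompress_tags(compressed, correspondence):
    """Restore the original tag for each substituted short tag."""
    def restore(match):
        slash, short = match.groups()
        return f"<{slash}{correspondence[short]}"

    return TAG_NAME.sub(restore, compressed)

doc = ("<customers><company_name>Acme Widget</company_name>"
       "<company_name>Best Bolts</company_name></customers>")
compressed, correspondence = compress_tags(doc)
print(compressed)
print(decompress_tags(compressed, correspondence) == doc)
```

Because every tag name in the input is replaced in a single pass, a short tag that happens to coincide with an original tag name does not create ambiguity on decompression. In practice the correspondence would have to be stored or transmitted with the compressed file, and its size counts against the savings.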
The present invention also provides a software-implemented process for reducing document file size by tag attribute compression, comprising: an input file encoded in a derivative of Standard Generalized Markup Language (SGML); a subprocess for reading the encoded file; a subprocess for locating each of a plurality of tag attributes in the encoded file; a subprocess for substituting a unique short tag for each unique one of the located tag attributes in the encoded file; and a subprocess for storing a correspondence between each of the short tags and the located tag attribute for which it was substituted. Preferably, the derivative is Extensible Markup Language (XML). Further, a subprocess for decompressing a compressed file resulting from the subprocess for substitution may be used, comprising: a subprocess for reading the compressed file; a subprocess for locating each of the substituted short tags in the compressed file; a subprocess for reading the stored correspondence between short tags and located tags, retrieving the stored located tag corresponding to the located substituted short tag; and a subprocess for substituting the retrieved tag for the located substituted short tag in the compressed file.
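A comparable sketch for tag attribute compression is given below. It assumes that the attribute names (rather than the attribute values) are what is shortened, that attribute values are quoted and contain no "<" or ">" characters, and that the correspondence is held in a dictionary; the names `compress_attributes`, `decompress_attributes`, and the `a0`, `a1`, ... scheme are hypothetical.

```python
import re

# An opening (or empty-element) tag, assuming no "<" or ">" inside attribute values.
START_TAG = re.compile(r"<[A-Za-z_][^<>]*>")
# One attribute: leading whitespace, name, "=", and a quoted value.
ATTRIBUTE = re.compile(r'''(\s)([A-Za-z_][\w.\-]*)(\s*=\s*)("[^"]*"|'[^']*')''')

def _rename_attributes(xml_text, rename):
    """Apply rename() to every attribute name that appears inside a tag."""
    def in_attribute(match):
        space, name, equals, value = match.groups()
        return f"{space}{rename(name)}{equals}{value}"

    def in_tag(match):
        return ATTRIBUTE.sub(in_attribute, match.group(0))

    return START_TAG.sub(in_tag, xml_text)

def compress_attributes(xml_text):
    """Substitute a unique short name for each unique attribute name."""
    table = {}  # original attribute name -> short name
    compressed = _rename_attributes(
        xml_text, lambda name: table.setdefault(name, f"a{len(table)}"))
    return compressed, {short: name for name, short in table.items()}

def decompress_attributes(compressed, correspondence):
    """Restore the original attribute name for each short name."""
    return _rename_attributes(compressed, correspondence.__getitem__)

doc = ('<order customer_identifier="17" shipping_method="air">'
       '<item customer_identifier="17"/></order>')
compressed, correspondence = compress_attributes(doc)
print(compressed)
print(decompress_attributes(compressed, correspondence) == doc)
```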
The present invention also provides a software-implemented process for reducing document file size by string compression, comprising: an input file encoded in a derivative of Standard Generalized Markup Language (SGML); a subprocess for reading the encoded file; a subprocess for locating each of a plurality of strings in the encoded file; a subprocess for substituting a unique entity name reference for each unique one of the located strings in the encoded file, provided that a first cost of substituting the located string is less than a second cost of using the located string without substitution, and provided that the located string contains no embedded entity references; and a subprocess for creating an entity declaration for each of the unique entity name references. Preferably, the derivative is Extensible Markup Language (XML). Further, a subprocess for decompressing a compressed file resulting from the subprocess for substitution may be used, comprising using a standard parser for the derivative.
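The string compression aspect can be illustrated with the sketch below, which replaces repeated text strings with entity references and emits matching entity declarations in an internal DTD subset. Several details are assumptions made for illustration only: cost is modeled as a simple character count (the declaration plus all references, versus all literal occurrences), entity names follow a hypothetical `e0`, `e1`, ... scheme, only whole text runs between tags are considered, and the input is assumed to have no existing document type declaration. Strings containing "&" are skipped, reflecting the condition that a substituted string contain no embedded entity references.

```python
import re
from collections import Counter
import xml.etree.ElementTree as ET

# A whole text run between two tags, excluding characters that would need
# escaping in an entity value ('"', '%') or that start an entity reference ('&').
TEXT_RUN = re.compile(r'(?<=>)[^<>&"%]+(?=<)')

def compress_strings(xml_text):
    """Replace repeated strings with entity references where that is cheaper."""
    counts = Counter(TEXT_RUN.findall(xml_text))
    entities = {}  # string -> entity name
    for text, occurrences in counts.items():
        name = f"e{len(entities)}"
        declaration = f'<!ENTITY {name} "{text}">'
        cost_with = len(declaration) + occurrences * len(f"&{name};")
        cost_without = occurrences * len(text)
        if cost_with < cost_without:
            entities[text] = name
    if not entities:
        return xml_text

    def substitute(match):
        text = match.group(0)
        return f"&{entities[text]};" if text in entities else text

    body = TEXT_RUN.sub(substitute, xml_text)
    root = re.match(r"\s*<([A-Za-z_][\w.\-]*)", xml_text).group(1)
    declarations = "".join(
        f'<!ENTITY {name} "{text}">' for text, name in entities.items())
    return f"<!DOCTYPE {root} [{declarations}]>{body}"

doc = ("<customers>"
       + "<company_name>Acme Widget Corporation</company_name>" * 4
       + "<company_name>Best Bolts</company_name>"
       + "</customers>")
compressed = compress_strings(doc)

# Decompression needs only a standard XML parser, which expands the entities.
original_values = [element.text for element in ET.fromstring(doc)]
restored_values = [element.text for element in ET.fromstring(compressed)]
print(restored_values == original_values)
```

In this example the string that occurs four times is replaced, while the string that occurs once is left alone because declaring an entity for it would cost more characters than it saves. The cost model ignores the fixed overhead of the document type declaration wrapper itself, which a fuller treatment would include.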