Markup languages provide the formatting information that Web browsers need to display documents found on the World Wide Web. Commonly used markup languages include Hypertext Markup Language (HTML) and Extensible Markup Language (XML). A document formatted or written in a markup language contains two types of information: formatting information and content information. “Markup” refers to sets of commands or tags that describe to the Web browser how to format and lay out content information on a page. The content data consists of readable characters, typically encoded according to the American Standard Code for Information Interchange (ASCII), that are actually displayed to a user. The markup commands or “tags” typically consist of multiple ASCII characters that describe the format in which the content information is to be displayed. For example, a table in a document starts with an opening tag <TABLE>, followed by the content information for the table, and is closed by an end tag </TABLE>.
As markup language documents are transmitted and stored on the World Wide Web, their binary representations may be compressed to facilitate efficient transmission of the data. Conventional compression techniques reduce the size of the binary representation of a markup language document, which reduces the time required to transmit the document and the space required to store it. Each character within a markup language document is weighted equally when converted to its binary representation. Many markup tags, however, contain multiple characters to describe a particular format type. For example, the tag <FONT FACE=“ARIAL” SIZE=2 COLOR=“#339966”> describes the font, size, and color of text to be displayed to the user. Each character within that markup tag is translated to a binary representation that is then compressed using conventional compression algorithms. However, using multiple characters to establish a particular format takes up more storage and requires more bandwidth to transmit the document. Furthermore, many markup tags are needed to display a document properly, which increases both the storage needed to hold the document and the bandwidth needed to transmit it.
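The equal weighting of characters described above can be illustrated with a minimal Python sketch. It assumes a one-byte-per-character ASCII encoding and uses zlib's DEFLATE as a stand-in for a "conventional" compression algorithm; neither choice is specified in the text. The helper names `encoded_size` and `compressed_size` and the sample document are hypothetical, and the tag is written with straight quotes so that it is valid ASCII.

```python
import zlib

# The FONT tag from the paragraph above, written with straight ASCII quotes.
FONT_TAG = '<FONT FACE="ARIAL" SIZE=2 COLOR="#339966">'


def encoded_size(text):
    """Bytes needed when every character is weighted equally (one byte each)."""
    return len(text.encode("ascii"))


def compressed_size(text):
    """Bytes after general-purpose compression (zlib, chosen for illustration)."""
    return len(zlib.compress(text.encode("ascii")))


# Hypothetical sample document: the same formatting tag wrapped around
# fifty short pieces of content.
sample_document = "".join(
    FONT_TAG + "item " + str(i) + "</FONT>" for i in range(50)
)

if __name__ == "__main__":
    print("tag alone:", encoded_size(FONT_TAG), "bytes")
    print("document, uncompressed:", encoded_size(sample_document), "bytes")
    print("document, zlib-compressed:", compressed_size(sample_document), "bytes")
```

A general-purpose compressor of this kind operates on the raw byte stream and has no notion of which bytes are markup and which are content.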
Markup language tags that use multiple characters to represent certain formats are said to exhibit a high degree of redundancy, or low information entropy. As a result, data representing character formats carries the same “value” as the character information itself, which produces a document in which many formatting characters are associated with fewer information characters. Currently, there is no meaningful method for describing the markup language tags in proper proportion to the content data. For example, showing a character in bold requires the opening tag <B>, the character data, and the closing tag </B>. Thus, displaying a single character in bold requires seven characters of markup in addition to the character itself.
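The proportion of markup to content in the bold example can be checked with a short Python sketch. It makes the simplifying assumption that anything between "<" and ">" is a tag, which ignores comments, entities, and other cases a real parser would handle; the function name `markup_overhead` is hypothetical.

```python
import re
from typing import Tuple

# Simplifying assumption: a tag is any run of characters enclosed in < and >.
TAG_PATTERN = re.compile(r"<[^>]*>")


def markup_overhead(document: str) -> Tuple[int, int]:
    """Return (markup_chars, content_chars) for a markup fragment."""
    markup_chars = sum(len(tag) for tag in TAG_PATTERN.findall(document))
    content_chars = len(document) - markup_chars
    return markup_chars, content_chars


if __name__ == "__main__":
    markup, content = markup_overhead("<B>x</B>")
    print(markup, content)  # prints: 7 1
```

For the fragment <B>x</B>, the opening tag contributes three characters and the closing tag four, so seven of the eight characters are formatting.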
Low-bandwidth applications, or transmission systems sending large amounts of information, require information to be compressed as efficiently as possible. Current compression techniques may continue to be acceptable for applications that do not have bandwidth limitations. However, narrow-bandwidth applications require additional methods and techniques to optimize compression of data and its associated formatting information to achieve efficient transmission. One such narrow-bandwidth application is transmitting information via satellite, where bandwidth is limited.
Current compression techniques are also adequate for hardware configurations that have large amounts of memory in which to store markup language documents. However, as more devices are introduced to receive markup language files and display such information to a user, storage space in such devices may become limited. These devices may include hand-held personal digital assistants (PDAs) or cellular telephones.
Accordingly, there exists a need to optimize the storage and transmission of documents formatted in markup languages. Moreover, there is a need for such a method that is more efficient, more economical and faster than conventional methods for compressing data in a markup language format.