Parsing is a process of extracting information from a document. The process can in general yield either a tree structure description of the document, or a logical chain of events. Tree-based parsers compile a document into an internal tree structure, providing a hierarchical model that applications are able to navigate. On the other hand, event-based parsers report parsing events, such as the start and end of elements, directly to the application for which the parsing is being performed.
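By way of illustration only, the two parser styles can be sketched with the Python standard library. The sample document `DOC` and the names `parse_to_tree`, `EventRecorder` and `parse_to_events` are assumptions introduced for this sketch and do not appear in the described process.

```python
import xml.etree.ElementTree as ET
import xml.sax

# Hypothetical sample document used only for this illustration.
DOC = "<html><body><p>Hello</p><p>World</p></body></html>"


def parse_to_tree(text):
    """Tree-based parsing: the whole document is compiled into an in-memory tree."""
    return ET.fromstring(text)


class EventRecorder(xml.sax.ContentHandler):
    """Event-based parsing: the parser calls back as each construct is met."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # A SAX parser may deliver one run of text in several chunks.
        if content.strip():
            self.events.append(("text", content))


def parse_to_events(text):
    """Return the parsing events reported for the document, in order."""
    handler = EventRecorder()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.events


if __name__ == "__main__":
    root = parse_to_tree(DOC)
    print([p.text for p in root.iter("p")])  # navigate the finished tree
    print(parse_to_events(DOC))              # inspect the reported events
```

In the tree-based case the application navigates a finished hierarchy, whereas in the event-based case the application receives a sequence of events and must itself retain whatever context it needs.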
Formatting is a process of preparing a document's information for output on human-readable media (e.g. a computer monitor or printed paper) according to preset specifications. Formatting always builds on parsing results.
A markup language provides a way of structuring document information. Examples of such languages include HyperText Markup Language (HTML), Extensible HyperText Markup Language (XHTML), Scalable Vector Graphics (SVG) and eXtensible Markup Language (XML) in general. In turn, a tree structure provides a natural model for representation of structured documents, and that model is suitable for processing by a computer application. Each markup element defines a node in a tree in this model.
Style sheets provide a mechanism for adding styles (e.g. fonts, colors, and spacing) to structured documents. The World Wide Web Consortium (W3C) has actively promoted the use of style sheets on the Web since the consortium was founded in 1994. The consortium has produced several recommendations including Cascading Style Sheets (CSS), XML Path Language (XPath), and Extensible Stylesheet Language Transformations (XSLT). The CSS language has been widely adopted on the Web. The CSS language considers elements in relation to other elements (e.g., parent-child, sibling-sibling, ancestor-descendant). To format elements, their relationships with other elements of a structured document must be known. Again, the tree structure is well suited to describe such types of relationships.
Whatever parsing method is used, a tree structure is essential for formatting of markup-language documents. Even if an event-based parser does not require an internal tree structure, the formatting process needs a full or partial document tree representation. To format a structured document, the tree structure is traversed from the upper nodes to the lower nodes. When a node having character data is reached, relevant style information is retrieved from style rules and applied to the node element. In this manner, document elements are prepared for output. Node relationships with other document nodes are encoded in the tree structure. This is described in greater detail in relation to FIG. 11.
FIG. 11 is a flow diagram illustrating steps of a process 1100 for parsing and formatting of markup documents. In step 1110, the process 1100 parses a document and builds an internal representation of the document in a tree structure. The step 1110 is also responsible for building a representation of any CSS rules found in the document. Following the step 1110, the step 1112 traverses the document tree from the upper nodes to the lower nodes. When a node having character data is reached, relevant style information is retrieved from the style rules and applied to the node element, thus preparing document elements for output. The next step 1114 traverses the document tree again, lays out the document nodes, and outputs them to a target medium (e.g., screen, paper, and the like). Steps 1112 and 1114 may be combined to avoid additional traversals of the document tree.
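The three steps of the process 1100 can be sketched roughly as follows. This is a minimal illustration under stated assumptions and not the implementation of FIG. 11: the rule table `STYLE_RULES` is a hypothetical stand-in that matches on element name only (real CSS selectors are considerably richer), styles are computed for every node rather than only for nodes having character data, and the function names and sample document are invented for the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, greatly simplified style rules keyed by element name.
STYLE_RULES = {"body": {"font": "serif"}, "em": {"weight": "bold"}}

# Hypothetical sample document.
DOC = "<body><p>Plain <em>bold</em></p></body>"


def parse_document(text):
    """Step 1110 (sketch): build the full document tree in memory."""
    return ET.fromstring(text)


def compute_styles(root):
    """Step 1112 (sketch): traverse from upper to lower nodes, resolving styles.

    Each node inherits the style of its parent and then applies its own rule,
    which is why the ancestor relationships encoded in the tree are needed.
    """
    styles = {}

    def visit(node, inherited):
        style = {**inherited, **STYLE_RULES.get(node.tag, {})}
        styles[node] = style
        for child in node:
            visit(child, style)

    visit(root, {})
    return styles


def lay_out(root, styles):
    """Step 1114 (sketch): traverse the tree again and emit styled text runs.

    For brevity, text that follows a child element (its "tail") is ignored.
    """
    runs = []
    for node in root.iter():  # document order
        if node.text and node.text.strip():
            runs.append((node.text.strip(), styles[node]))
    return runs


if __name__ == "__main__":
    tree = parse_document(DOC)
    for text, style in lay_out(tree, compute_styles(tree)):
        print(text, style)
```

Note that no output is produced until `parse_document` has returned the complete tree, and that the tree is walked twice unless the second and third functions are merged.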
The described process of parsing and formatting a document has several significant limitations and/or disadvantages.
One disadvantage is that the tree structure places a strain on system resources. The amount of memory required to store a full document tree has no fixed upper bound, because the tree grows with the document. Further, the memory requirements depend not only on document size, but also on document structure complexity, since every markup element adds a node to the tree. This constitutes a significant disadvantage, because the process 1100 requires significant amounts of memory to work successfully.
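The dependence on structure rather than size alone can be illustrated with a small sketch. The two sample documents and the helper `tree_stats` are assumptions made for the illustration; node count and depth are used here only as a rough proxy for the memory a tree would occupy.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents of identical length in characters.
FLAT = "<a>" + "x" * 28 + "</a>"      # one element holding a run of text
NESTED = "<a>" + "<b/>" * 7 + "</a>"  # one element holding seven children


def tree_stats(text):
    """Return (node count, maximum depth) of the tree built from the text."""
    root = ET.fromstring(text)

    def depth(node):
        return 1 + max((depth(child) for child in node), default=0)

    return sum(1 for _ in root.iter()), depth(root)


if __name__ == "__main__":
    for name, doc in (("flat", FLAT), ("nested", NESTED)):
        print(name, len(doc), tree_stats(doc))
```

Although both documents contain the same number of characters, the second requires eight tree nodes where the first requires one.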
Yet another disadvantage of the process 1100 of FIG. 11 is that this process does not allow streaming processing. The process 1100 cannot lay out a document and output the laid-out document to a target medium until the full document is available. This is important for Internet applications, where document downloading time may be noticeable for a user. Output of requested information should desirably start as soon as possible, before all data is copied from a remote server and available locally.
Another disadvantageous limitation of the process 1100 is that there is little room for recovery from memory allocation failures. If the process is unable to allocate memory for a next node in a tree, parsing of the document cannot continue. Thus, the process 1100 fails at step 1110. Because steps 1112 and 1114 require successful completion of the step 1110, the entire process 1100 fails to accomplish its task.
Thus, a need clearly exists for an improved technique of parsing and formatting markup-language documents, which is advantageously adapted for a memory-constrained environment.