The present invention relates to representing and processing structured documents.
The Internet is a global network that uses a common communication protocol, the Transmission Control Protocol/Internet Protocol (“TCP/IP”), to transmit data from one computer to another. To use the transmitted data, computer applications adopt communication standards. For example, the World Wide Web (“Web”) is a system that includes applications supporting Hypertext Markup Language (“HTML”) documents.
An HTML document includes content, e.g., text, and corresponding instructions, typically about how to format the content (i.e., the document is “marked up” with formatting instructions). To include an instruction, a tag is added to the document. The tag has a name that identifies the corresponding instruction, and identifiers that mark the content to which the tag applies. For example, the tag can have opening and closing elements that bracket the marked content in the document. The marked content can include one or more additional tags, called child tags. Child tags can include their own children, so that the tags form a hierarchical structure.
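The hierarchical structure of tags can be illustrated with a minimal sketch, offered only as an example and not as part of any described method. It assumes a hypothetical fragment that happens to be well formed (HTML in general need not be), and the helper name `outline` is an assumption introduced for illustration. The sketch uses Python's standard `xml.etree.ElementTree` module to walk each tag and its children.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up fragment: the opening <p> and closing </p> bracket
# the marked content, which in turn contains child tags.
doc = "<p>Hello <b>bold <i>and italic</i></b> world</p>"


def outline(element, depth=0):
    """Return (depth, tag name) pairs for an element and all of its descendants."""
    rows = [(depth, element.tag)]
    for child in element:  # iterating an Element yields its direct children
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(doc)
hierarchy = outline(root)
for depth, name in hierarchy:
    print("  " * depth + name)
```

Each level of indentation in the printed outline corresponds to one level of nesting, i.e., a child tag inside its parent.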
In markup languages such as Standard Generalized Markup Language (“SGML”) and eXtensible Markup Language (“XML”), generalized tags can also be used to represent structure (and not just formatting) in the content of an electronic document (and, more generally, in any type of text data). For example, a generalized <name> tag can be used to mark up all names in a document, and optionally a separate file, e.g., in eXtensible Stylesheet Language (“XSL”), can describe how tagged names should be formatted. A definition file can be used to specify the generalized tags and their relations to each other, e.g., by using Document Type Definition (“DTD”) or XML Schema languages. For example, a definition file can specify which tags are allowed, which tags can have children, or how many and what type of children a particular tag can have. The generalized markup language document becomes self-descriptive when combined with the corresponding definition file.
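The role of a definition file can be sketched as follows. This is a simplified illustration under stated assumptions, not a DTD or XML Schema validator: the `DEFINITION` dictionary is a hypothetical stand-in for a definition file, and the tag names `person`, `name`, `email`, and `phone` and the function `violations` are invented for the example. The sketch checks which tags are allowed, which children each tag may have, and how many of each.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stand-in for a definition file: for each allowed tag, the child
# tags it may contain and the maximum number of each (None means unlimited).
DEFINITION = {
    "person": {"name": 1, "email": None},
    "name": {},   # no children allowed
    "email": {},  # no children allowed
}


def violations(element, definition=DEFINITION):
    """Return messages describing where the element breaks the definition."""
    if element.tag not in definition:
        return [f"<{element.tag}> is not an allowed tag"]
    allowed = definition[element.tag]
    problems = []
    counts = Counter(child.tag for child in element)
    for tag, count in counts.items():
        if tag not in allowed:
            problems.append(f"<{tag}> is not allowed inside <{element.tag}>")
        elif allowed[tag] is not None and count > allowed[tag]:
            problems.append(f"<{element.tag}> allows at most {allowed[tag]} <{tag}>")
    for child in element:
        if child.tag in allowed:  # disallowed children were already reported
            problems.extend(violations(child, definition))
    return problems


valid = ET.fromstring(
    "<person><name>Ada</name><email>a@example.org</email></person>"
)
invalid = ET.fromstring(
    "<person><name>Ada</name><name>A.</name><phone>555</phone></person>"
)
print(violations(valid))
print(violations(invalid))
```

In this sketch the document and the definition are separate objects, mirroring the separation between a generalized markup language document and its definition file; a production system would typically rely on a DTD or XML Schema processor instead.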