The Internet has brought into being a number of new applications for various purposes. Underlying the rapid growth of the Internet is the HyperText Markup Language (HTML) standard for defining documents in digital form. HTML is a subset of the Standard Generalized Markup Language (SGML). Within SGML there is another, rapidly growing family of definitions, called the Extensible Markup Language (XML). Furthermore, there is the Wireless Markup Language (WML), which is designed especially for use in mobile communications. Both HTML and WML are subsets of XML.
HTML is used for an enormous number of documents published on the Internet. These documents are usually available to the public and provide a highly diverse source of information. Information in digital form is often referred to as content.
Unlike HTML, WML is designed particularly for wireless terminals. The amount of content in the form of WML is, as yet, very limited compared to that in the form of HTML. A wireless terminal supporting only WML (a WML terminal) cannot use content in HTML.
In order to make interesting HTML-formatted content available for use in WML terminals, there are two options. Firstly, the content documents can be re-written in WML. Secondly, a network relaying HTML content to a WML terminal can perform an automatic conversion from HTML to WML when the terminal requests such content. This can be arranged by placing, between the WML terminal and the Internet, a gateway server which has the capability of converting content from HTML to WML.
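The gateway-side conversion can be sketched as follows. This is a purely illustrative sketch: the tag mapping from HTML to WML elements is an assumption made here for demonstration, and a real gateway would be far more elaborate.

```python
from html.parser import HTMLParser

# Hypothetical mapping from a few HTML tags to rough WML counterparts
# (illustrative assumption, not a standardized correspondence).
TAG_MAP = {"html": "wml", "body": "card", "p": "p", "b": "b"}

class HtmlToWml(HTMLParser):
    """Rewrites recognized HTML tags as WML tags, passing text through."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_MAP:
            self.out.append("<%s>" % TAG_MAP[tag])

    def handle_endtag(self, tag):
        if tag in TAG_MAP:
            self.out.append("</%s>" % TAG_MAP[tag])

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(text)

converter = HtmlToWml()
converter.feed("<html><body><p>Hello</p></body></html>")
wml = "".join(converter.out)
print(wml)  # <wml><card><p>Hello</p></card></wml>
```

The essential point is that the conversion happens in the network, transparently to the WML terminal, which only ever sees the converted result.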
Both HTML and XML are under constant development. HTML is converging towards XML; in other words, HTML is becoming one instance of the XML language family.
The generation and reconstruction of HTML and XML documents are described next. A document is first broken up so that its formatting and its meaning (the actual content) are stored separately in different mark-up tags, or simply tags. In HTML documents the tags form a sequence and are hence transmitted sequentially. Some tags contain structural information defining the structure of the document, whilst other tags contain the meaning as clips of information to be output to a user according to the defined structure. Typically, these clips are text.
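The separation of structural tags from content clips can be illustrated with a minimal sketch, here using Python's standard `html.parser` module; the sample document is an assumption made for demonstration.

```python
from html.parser import HTMLParser

class TagSplitter(HTMLParser):
    """Collects structural tags and textual content clips separately."""
    def __init__(self):
        super().__init__()
        self.tags = []   # structural information
        self.clips = []  # meaning: clips of information

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.clips.append(text)

s = TagSplitter()
s.feed("<html><body><h1>Title</h1><p>Some text.</p></body></html>")
print(s.tags)   # ['html', 'body', 'h1', 'p']
print(s.clips)  # ['Title', 'Some text.']
```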
A terminal receiving HTML, XML, or WML content receives a series of tokens. For reconstructing a document, or content, in a mark-up language, an assembler is required. The assembler puts the formatting and the meaning of the document back together. At the core of the assembler is a parser, which parses the data according to certain parse rules and a parsing language, that is, a grammar. The parser is typically a program controlling a processor of the terminal. The parser receives input in the form of sequential mark-up tags (interleaved with character data tokens) and breaks the input up into parts (for example, the nouns (objects), verbs (methods), and their attributes or options) that can then be managed by other software components. The parser may also check that all necessary input has been provided. In this context, the parser breaks the input into tokens and builds the structure according to the tokens. The tokens are typically the parser's internal representation of tags or of textual data (character data tokens).
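The tokenizing and assembling steps can be sketched as follows. The token kinds and the nested tuple representation are assumptions chosen for illustration; an actual assembler's internal representation may differ.

```python
import re

# Break mark-up input into tokens: end tags, start tags, character data.
TOKEN_RE = re.compile(r"</(\w+)>|<(\w+)>|([^<]+)")

def tokenize(markup):
    tokens = []
    for end, start, text in TOKEN_RE.findall(markup):
        if end:
            tokens.append(("END", end))
        elif start:
            tokens.append(("START", start))
        elif text.strip():
            tokens.append(("DATA", text.strip()))
    return tokens

def assemble(tokens):
    """Rebuilds a nested (tag, children) structure from the token stream."""
    root = ("document", [])
    stack = [root]
    for kind, value in tokens:
        if kind == "START":
            node = (value, [])
            stack[-1][1].append(node)  # attach to current parent
            stack.append(node)         # descend into the new element
        elif kind == "END":
            stack.pop()                # ascend to the parent element
        else:
            stack[-1][1].append(value) # character data clip
    return root

tree = assemble(tokenize("<card><p>Hello</p></card>"))
print(tree)  # ('document', [('card', [('p', ['Hello'])])])
```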
A parser is needed in a plurality of different HTML or XML processing applications, such as gateways, HTML browsers, mobile terminals, authoring tools and, on some occasions, web servers.
Unfortunately, the constant development of HTML and XML results in a need to frequently update the equipment used for conversion between these languages in order to deal with new documents. Therefore, the parser contained in the equipment should be updated frequently to cope with different dialects of the languages. This has usually been carried out by building a new version of the assembler whenever required. For this purpose there are at least two tool programs, namely “yacc” (Yet Another Compiler Compiler) and “lex”. While these tool programs greatly facilitate the building of a new parser, the syntax of the language to be parsed must first be described in a dedicated language. The syntax defines how the parsing should be carried out. The syntax description (in the dedicated language) is then processed with filtering tools to generate a source code representation of the parser, typically in the C programming language. The produced parser is a monolithic piece of software, which combines the syntax rules and the parser logic. Finally, the generated parser source code is compiled and linked with the application code to produce an executable program with parser functionality. The drawback of this procedure is the large amount of labour required to adapt the equipment to changes in the input language (HTML, XML).
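The monolithic character of such a generated parser can be illustrated with a sketch in which the lexical rules are frozen directly into the program. In a yacc/lex workflow the rules below would live in a separate grammar file and be compiled into C source; the rule names and patterns here are illustrative assumptions.

```python
import re

# Lexical rules baked into the program itself: adapting to a new
# dialect means editing these rules and rebuilding the whole parser,
# which is the drawback described above.
LEX_RULES = [
    ("ETAGO", re.compile(r"</(\w+)>")),  # end tag
    ("STAGO", re.compile(r"<(\w+)>")),   # start tag
    ("CDATA", re.compile(r"[^<]+")),     # character data
]

def lex(markup):
    pos, tokens = 0, []
    while pos < len(markup):
        for name, pattern in LEX_RULES:
            m = pattern.match(markup, pos)
            if m:
                tokens.append((name, m.group(m.lastindex or 0)))
                pos = m.end()
                break
        else:
            raise SyntaxError("no rule matches at position %d" % pos)
    return tokens

print(lex("<p>hi</p>"))  # [('STAGO', 'p'), ('CDATA', 'hi'), ('ETAGO', 'p')]
```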
Typically, modern XML documents identify a grammatical definition that is to be retrieved from a network, and thus a dynamically replaceable grammatical definition is required. Typically, a particular reference in the document is used for identifying the grammar definition.
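Such a reference commonly appears in the document type declaration. The following sketch extracts it; the document and the DTD URL are made-up examples, and a real XML processor would use a full parser rather than a regular expression.

```python
import re

def grammar_reference(document):
    """Returns the system identifier named in the DOCTYPE declaration,
    i.e. the network location of the grammar definition, if present."""
    m = re.search(r'<!DOCTYPE\s+\w+\s+SYSTEM\s+"([^"]+)"', document)
    return m.group(1) if m else None

doc = '<!DOCTYPE wml SYSTEM "http://www.example.com/wml13.dtd"><wml></wml>'
print(grammar_reference(doc))  # http://www.example.com/wml13.dtd
```

A parser that can fetch and switch to the referenced definition at run time needs no rebuild when the grammar changes.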
U.S. Pat. No. 5,687,378 provides an alternative procedure that allows the dedicated syntax description language to be changed without recompiling or re-linking the assembler. This is based on the use of switchable syntax modules, each comprising a different set of parse rules. The parse rules define the grammar used by the parser. In this way, the actual parser engine is separated from the rules used, and the rules can be easily changed. While the parsing rules can thus be changed and adaptation to new description languages or dialects has become easier, certain problems remain.
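The separation of the parser engine from switchable rule modules can be sketched as follows. The rule-module format and the tag sets are toy assumptions for illustration, not the cited patent's actual representation.

```python
def validate(tokens, rules):
    """Engine: checks well-nesting and that every tag is in the grammar.
    The engine never changes; only the rules module is swapped."""
    stack = []
    for kind, tag in tokens:
        if tag not in rules["tags"]:
            return False            # tag unknown to this grammar
        if kind == "START":
            stack.append(tag)
        elif kind == "END":
            if not stack or stack.pop() != tag:
                return False        # mismatched nesting
    return not stack                # all elements closed

# Switchable rule modules: different grammars, same engine.
HTML_RULES = {"tags": {"html", "body", "p"}}
WML_RULES = {"tags": {"wml", "card", "p"}}

doc = [("START", "wml"), ("START", "card"), ("END", "card"), ("END", "wml")]
print(validate(doc, WML_RULES))   # True
print(validate(doc, HTML_RULES))  # False
```

Switching languages is thus a matter of plugging in a different rules object, with no recompilation of the engine.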
The parsers generated with standard tools become rather complex and memory-consuming, since they provide complex syntax description languages for covering languages more descriptive than XML. In the case of XML, a less descriptive syntax description language would suffice, and a parser optimised for XML would be smaller. Furthermore, a great number of the pages present on the Internet are deficient. The defects in these pages hinder and/or slow down parsing. It is also known that some content providers may deliberately generate certain errors in order to prevent or hamper the use of certain applications, such as certain WWW browsers.
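One way a parser can tolerate such deficient pages is to close open tags implicitly instead of aborting. The sketch below assumes a token stream of (kind, value) pairs and a nested tuple output; both conventions are illustrative assumptions.

```python
def assemble_lenient(tokens):
    """Builds a tree even when end tags are missing or mismatched:
    stray end tags are ignored, and open tags are closed implicitly."""
    root = ("document", [])
    stack = [root]
    open_tags = []
    for kind, value in tokens:
        if kind == "START":
            node = (value, [])
            stack[-1][1].append(node)
            stack.append(node)
            open_tags.append(value)
        elif kind == "END":
            if value in open_tags:
                # implicitly close everything down to the matching tag
                while open_tags[-1] != value:
                    open_tags.pop()
                    stack.pop()
                open_tags.pop()
                stack.pop()
            # else: stray end tag with no matching start tag, ignored
        else:
            stack[-1][1].append(value)
    return root  # any tags still open are simply left closed here

# Deficient input "<card><p>text</card>": the <p> is never closed,
# yet the parse still succeeds.
tokens = [("START", "card"), ("START", "p"), ("DATA", "text"), ("END", "card")]
print(assemble_lenient(tokens))  # ('document', [('card', [('p', ['text'])])])
```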