1. Field of the Invention
The present invention relates to a technique for analyzing tokens and the syntax of programs and documents written in accordance with predetermined rules, and for correcting errors.
2. Related Art
As techniques for correcting errors in programs or documents written in accordance with predetermined rules, there are a number that detect errors by analyzing the tokens and the syntax of program or document data strings. A conventional technique of this type uses the following two methods to handle errors.
According to the first method, a warning is issued upon the detection of an error, and the analysis is either halted or resumed at a synchronization point following the location of the error. That is, according to this method, errors are not aggressively corrected to recover to a normal state. This method is widely employed for cases wherein the processing system for a programming language does not permit the analysis of documents that contain errors, or for analysis systems that are developed as general applications that are not tied to specific programs.
The XML (Extensible Markup Language) processing system will now be explained. For XML, general structure analysis systems that do not depend on an application are provided by several vendors (for example, OpenXML by Open XML Corp. or XML4J by IBM Corp.). When an error is detected in a document, these processing systems either abandon the analysis, or ignore the erroneous token and resume the process at a synchronization point following the error. Before the processing is continued, an external module, an ErrorHandler, can receive error location data and an explanatory message through the interface (SAX: Simple API for XML) that operates the parsing system. However, this module only receives a warning, and does not provide a function for changing the state of the parsing system or the output results.
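By way of illustration, the warning-only character of such an interface can be sketched with the SAX binding in Python's standard library, whose ErrorHandler mirrors the module described above: it receives the error location and an explanatory message, but cannot alter the parser's state or its output. The LoggingErrorHandler class and the sample document below are hypothetical; only the xml.sax interface itself is real.

```python
import io
import xml.sax

# Hypothetical warning-only handler: it records the error location and
# message received through the SAX ErrorHandler interface, but it has no
# means of changing the state of the parsing system or the output results.
class LoggingErrorHandler(xml.sax.handler.ErrorHandler):
    def __init__(self):
        self.messages = []

    def warning(self, exc):
        self.messages.append(("warning", exc.getLineNumber(), exc.getMessage()))

    def error(self, exc):
        self.messages.append(("error", exc.getLineNumber(), exc.getMessage()))

    def fatalError(self, exc):
        # A fatal error ends the analysis; the handler can only record it.
        self.messages.append(("fatal", exc.getLineNumber(), exc.getMessage()))
        raise exc

handler = LoggingErrorHandler()
try:
    # The <b> and <i> elements are not nested, so the parse is abandoned.
    xml.sax.parse(io.BytesIO(b"<root><b><i>text</b></i></root>"),
                  xml.sax.handler.ContentHandler(), handler)
except xml.sax.SAXParseException:
    pass  # the analysis stops; only the recorded message remains

print(handler.messages[0][0])  # -> fatal
```

Note that the handler can decline to raise inside error() for recoverable errors, but for a well-formedness violation such as the one above, the underlying parser produces no further output in any case.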
According to the second method, analysis results are output based on a special recovery method corresponding to the application that employs the analysis system, and the analysis is continued. That is, according to this method, as part of an error recovery process, not only are errors detected, but corrective action is also initiated to correct the errors and to provide a program or document that is free of syntax rule violations. This method is employed for a case wherein the person who reads a source document is not the person who created it, and wherein the reader is tasked with generating output results regardless of whether the document contains errors.
The HTML (HyperText Markup Language) processing system will now be explained. Since the person who creates an HTML document for a web page on the Internet usually differs from the person who browses it, when a syntax rule error is present in an HTML document, merely noting the presence of the error in the document that is to be read by a browser is insufficient; a state must be attained wherein a user who browses the HTML document does not have to contend with errors. Therefore, some web browsers (web browsing software applications) include functions for analyzing tokens and syntax and for correcting HTML rule errors, and are thus able to provide errorless documents for users.
Assume that in a predetermined HTML document there is a portion <P>str0<B>str1<I>str2</B>str3</I>str4</P>. Since tags <B></B> and <I></I> in this portion are not nested structures, this is a syntax rule error. In order for a web browser to display this portion, the error must be corrected, so that the parsing means of the web browser can generate data for output.
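The nesting violation in this portion can be detected with a simple tag stack: each start tag is pushed, and an end tag that does not match the innermost open tag reveals that the elements are not properly nested. The following Python sketch, built on the standard-library html.parser module, is illustrative only and is not any browser's actual code.

```python
from html.parser import HTMLParser

# Hypothetical stack-based checker: an end tag that does not close the
# innermost open element indicates a non-nested (overlapping) structure.
class NestingChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # The end tag closes across another open element.
            self.errors.append(tag)
            if tag in self.stack:
                self.stack.remove(tag)

checker = NestingChecker()
checker.feed("<P>str0<B>str1<I>str2</B>str3</I>str4</P>")
print(checker.errors)  # -> ['b']  (the </B> that closes across <I>)
```

Here the </B> end tag arrives while <I> is the innermost open element, so it is flagged; this is precisely the condition a web browser's parsing means must repair before generating output.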
Netscape Navigator by Netscape Communications Corp., which is a representative web browser, corrects the above portion as follows:
 <P>str0<B>str1<I>str2</I>str3</B>str4</P>
That is, a nested structure is fabricated by exchanging </B> and </I>, which provides the output results shown in FIG. 14 (str1 and str3 are bold and str2 is italic and bold).
On the other hand, Internet Explorer by Microsoft Corp., which is another representative web browser, corrects the above portion as follows:
 <P>str0<B>str1<I>str2</I></B><I>str3</I>str4</P>
That is, a nested structure is fabricated by inserting </I> before </B> and <I> after </B>, which provides the output results shown in FIG. 15 (str1 is bold, str2 is italic and bold, and str3 is italic).
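The close-and-reopen strategy attributed above to Internet Explorer can be sketched as a small repair routine over a token stream: when an end tag does not match the innermost open element, every intervening element is closed before it and reopened after it. The repair function and its token representation below are hypothetical, not the browser's actual code; only the resulting output corresponds to the corrected portion above.

```python
# Hypothetical "close and reopen" repair over a stream of
# ("start", tag), ("end", tag), and ("text", data) tokens.
def repair(tokens):
    out, stack = [], []
    for kind, tag in tokens:
        if kind == "start":
            stack.append(tag)
            out.append(f"<{tag}>")
        elif kind == "end":
            reopen = []
            # Close every element opened after the one being ended ...
            while stack and stack[-1] != tag:
                t = stack.pop()
                out.append(f"</{t}>")
                reopen.append(t)
            if stack:
                stack.pop()
            out.append(f"</{tag}>")
            # ... then reopen them so their formatting continues.
            for t in reversed(reopen):
                stack.append(t)
                out.append(f"<{t}>")
        else:  # text
            out.append(tag)
    return "".join(out)

tokens = [("start", "P"), ("text", "str0"), ("start", "B"), ("text", "str1"),
          ("start", "I"), ("text", "str2"), ("end", "B"), ("text", "str3"),
          ("end", "I"), ("text", "str4"), ("end", "P")]
print(repair(tokens))
# -> <P>str0<B>str1<I>str2</I></B><I>str3</I>str4</P>
```

Applied to the erroneous portion, the routine inserts </I> before </B> and <I> after </B>, reproducing the corrected output described above.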
FIG. 12 is a diagram showing the configuration of a conventional parsing system, and FIG. 13 is a flowchart for explaining the parsing process performed by the parsing system in FIG. 12.
In FIG. 12, a parsing system 120 comprises: a lexical analyzer 121, for receiving a predetermined stream included in an input document and analyzing tokens; a parser 122, for analyzing the syntax of the tokens obtained by the lexical analyzer 121, and for generating and outputting an abstract syntax tree (AST) that describes the structure of the input document; and a node generator 123, which is used to generate the abstract syntax tree. The lexical analyzer 121 includes a buffer 121a, which is used for the token analysis, and a token recovery unit 121b, for correcting token errors. The parser 122 includes a buffer 122a, which is used for syntax analysis; a context pointer; and a syntax recovery unit 122b, for correcting syntax rule errors. When the process is initiated by the parser 122, a grammar information object 124 is generated that is used for the parsing.
As is shown in FIG. 13, when the parsing process is initiated, first, the parser 122 is initialized (step 1301). For this initialization, the following three steps are performed: (1) the document type of the input document is analyzed, and a grammar information object 124 is generated; (2) the buffer 122a is emptied; and (3) the context pointer is set to represent the root node of the abstract syntax tree. Note that the input of the token stream and the token analysis are completed before the parser 122 is initialized.
Then, the parser 122 extracts a token from the buffer 122a as the token t to be processed (step 1302). When the buffer 122a is empty (it is always empty immediately after the initialization at step 1301), a token is requested from the lexical analyzer 121, and the obtained token is defined as the token t. When the token t marks the end of the input document, the generated abstract syntax tree is output and the processing is thereafter terminated (step 1303).
When the token t is not the end of the input document, the parser 122 inquires, of the grammar information object 124, whether the token t grammatically matches the context pointer. When the token t matches, the token t is added at the position indicated by the context pointer (Yes at step 1304). This addition is performed in the following manner. First, a node n, which is a non-terminal symbol, is generated by the node generator 123 and is added at the context pointer (step 1305). Then, the destination of the context pointer is shifted to the non-terminal symbol node n that has newly been added (step 1306). When the non-terminal symbol node n pointed to by the context pointer has obtained all its child nodes, the context pointer is shifted to the parent node (steps 1307 and 1308). If the non-terminal symbol node n indicated by the context pointer has not obtained all its child nodes, or if the context pointer has been shifted to the parent node at step 1308, program control returns to step 1302, whereat the next token is obtained and the previous processing is repeated.
When, at step 1304, the token t does not grammatically match the context pointer, the parser 122 outputs an error message (step 1309), and after a predetermined error process has been performed (step 1310), program control is returned to step 1302, whereat the next token is obtained and processed. The error process includes processing whereby the pertinent token t is skipped and a subsequent token is processed, and recovery processing that employs a fixed method. For the recovery processing, the parser 122 calls the syntax recovery unit 122b to correct the error, so that the token t grammatically matches the context pointer. Thereafter, program control is returned to step 1302.
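The loop of steps 1302 through 1310 can be sketched schematically as follows. The Node class, the arity-based grammar encoding, and the skip-only error process are illustrative simplifications for exposition, not the actual implementation of the parsing system in FIG. 12.

```python
# Schematic sketch of the parse loop in FIG. 13.
class Node:
    def __init__(self, symbol, arity, parent=None):
        self.symbol, self.arity = symbol, arity
        self.parent, self.children = parent, []

def parse(tokens, grammar, root_arity):
    # grammar: {parent symbol: {allowed child symbol: its child count}}
    root = Node("root", root_arity)
    context = root                               # context pointer (step 1301)
    errors = []
    for t in tokens:                             # obtain next token (step 1302)
        allowed = grammar.get(context.symbol, {})
        if t not in allowed:                     # grammar query (step 1304)
            errors.append(t)                     # warn and skip (steps 1309-1310)
            continue
        n = Node(t, allowed[t], parent=context)  # node generator (step 1305)
        context.children.append(n)
        context = n                              # shift pointer to n (step 1306)
        # While the current node has obtained all its child nodes,
        # shift the pointer back to the parent (steps 1307-1308).
        while context.parent and len(context.children) == context.arity:
            context = context.parent
    return root, errors

# Toy grammar: the root holds one "pair"; a "pair" holds two "num" leaves.
grammar = {"root": {"pair": 2}, "pair": {"num": 0}}
tree, errors = parse(["pair", "num", "bogus", "num"], grammar, root_arity=1)
print([c.symbol for c in tree.children[0].children], errors)
# -> ['num', 'num'] ['bogus']
```

In this sketch the error process merely skips the mismatched token; a recovery-oriented system would instead call a syntax recovery unit at that point to replace or reorder tokens so that the match at step 1304 succeeds.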
This recovery process can be performed because such parsing systems 120 (for example, the parsing means provided for a web browser) are developed especially for HTML and for specific applications. Example parsing systems are Ark, by Just System Corp., and W3C Tidy, by the W3C (World Wide Web Consortium).
In the above example operation, while the lexical analyzer 121 analyzes a token, the token recovery unit 121b corrects errors in the token. Since this is a simple process of replacing a token in the input stream with an appropriate token fabricated in accordance with a predetermined rule, no explanation for this will be given.