1. Field of the Invention
The present invention relates to an automatic language recognition method, and more specifically, to the automatic recognition of a language in which digital data is received, in particular by a terminal of a computer system.
2. Discussion of Related Art
Various languages or formats exist in which information to be reproduced by a plotter or a printer can be encoded as digital data transmitted from a host computer. The received data must be interpreted or decoded by means of an interpretation module specific to each language. Interpretation serves to transform the data into a form that is directly usable for printing, independently of the language used, and in particular into a bitmap image, i.e. an image that is fully described by dots.
A given printing device may receive digital data encoded in different languages. This applies when a single user chooses to use different languages depending on the tasks to be performed, or when a plurality of users using different languages have access to a common printing device over a network. To enable the received data to be processed, it is then necessary to select the interpretation module that corresponds to the language in use.
The stream of digital data is assumed to be made up of a succession of drawing files. Each drawing file uses a language that is defined in a list. If a drawing file does not have an explicit end, then different drawing files can be distinguished by detecting loss of synchronization or a change of code. If a file contains errors, the language must nevertheless be capable of being detected correctly, providing the number of errors remains reasonable.
Interspersed among the files that use the languages of the list, there may be text files that do not use any of those languages and that are referred to below as "texts". Drawings may be separated by texts in a special format that is referred to below as a "banner" format. A banner is thus defined as all of the coded digital data (or characters) in which no language in a predefined list of languages has been recognized. The languages that are identified may belong to various classes: languages having signatures; languages having keywords or synchronization characters; and languages using mnemonics. A mnemonic may be considered as a set of encoded digital data of predetermined size, e.g. a run of two significant characters.
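By way of illustration only, the three classes of language and the definition of a banner can be sketched as follows. All names, signatures, keywords, and mnemonics in this sketch are hypothetical toy values, and the use of ";" as a statement separator is an assumption; none of them is taken from an actual language or from the method of the invention.

```python
from typing import Optional

# Hypothetical descriptors, one table per class of language named in the text.
SIGNATURES = {"toy_sig": b"\x1bSIG"}            # languages having signatures
KEYWORDS = {"toy_kw": (b"BEGIN", b"END")}        # languages having keywords
MNEMONICS = {"toy_mn": {b"PU", b"PD", b"PA"}}    # languages using two-character mnemonics


def recognize(data: bytes) -> Optional[str]:
    """Return the name of the first toy language recognized, or None."""
    for name, signature in SIGNATURES.items():
        if data.startswith(signature):
            return name
    for name, words in KEYWORDS.items():
        if any(word in data for word in words):
            return name
    # Assumption: statements are separated by ";" and begin with a mnemonic.
    statements = [s for s in data.split(b";") if s]
    for name, mnemonics in MNEMONICS.items():
        if statements and all(s[:2] in mnemonics for s in statements):
            return name
    return None


def classify(data: bytes) -> str:
    """Data in which no listed language is recognized is treated as a banner."""
    return recognize(data) or "banner"
```

In this sketch, a block such as `b"Job 42 for user\n"` matches no descriptor and is therefore classified as a banner.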
Various methods have been proposed for automatically recognizing a language on the basis of at least a portion of the received data. The term "automatic recognition" is used herein to designate any process which not only avoids any need for physical intervention by a user to perform selection at the printing device, but also avoids any need to add special control sequences or headers to the data normally generated by means of a language.
One known method consists in using all of the interpretation modules to process the received digital data, and then in retaining the module that generates the fewest errors. A method of that type is described in document EP-A-0 556 059. Although very reliable, such a method cannot be adopted in most cases because of the time it requires and the need to store all of the received data.
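The principle of this approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited document: the two "interpreters" are hypothetical toy functions that merely count lines they cannot accept, whereas real interpretation modules would fully decode the data.

```python
from typing import Callable, Dict


def count_errors_toy_a(data: bytes) -> int:
    # Hypothetical toy language A: every non-empty line must start with b"A".
    return sum(1 for line in data.splitlines() if line and not line.startswith(b"A"))


def count_errors_toy_b(data: bytes) -> int:
    # Hypothetical toy language B: every non-empty line must start with b"B".
    return sum(1 for line in data.splitlines() if line and not line.startswith(b"B"))


INTERPRETERS: Dict[str, Callable[[bytes], int]] = {
    "toy_a": count_errors_toy_a,
    "toy_b": count_errors_toy_b,
}


def recognize_by_fewest_errors(data: bytes,
                               interpreters: Dict[str, Callable[[bytes], int]] = INTERPRETERS) -> str:
    # Every interpreter processes the whole of the data, which is why the
    # data must be stored and why the method is slow.
    error_counts = {name: interpret(data) for name, interpret in interpreters.items()}
    # Retain the module that generated the fewest errors.
    return min(error_counts, key=error_counts.get)
```

The sketch makes the cost visible: the processing time grows with the number of interpretation modules, and the complete data must remain available until the last module has run.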
Another known method, described in U.S. Pat. No. 5,293,466, consists in initially producing samples of data encoded using different languages and in analyzing them statistically so as to deduce characteristics that are specific to each language, in the form of data groups that are stored. Thereafter, the initial portion of the digital data received by the printing device is extracted and compared with the stored data groups, and the language in use is deduced therefrom. The difficulty here lies in determining characteristics suitable for limiting the error rate in recognition.
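A statistical approach of this general kind can be sketched as follows. The choice of characteristic (a byte-frequency profile), the distance measure, and the size of the initial portion are all assumptions made for illustration; the characteristics actually used in the cited patent are not reproduced here.

```python
from collections import Counter
from typing import Dict, List

Profile = Dict[int, float]


def byte_profile(data: bytes) -> Profile:
    """Relative frequency of each byte value in the data."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return {byte: count / total for byte, count in counts.items()}


def build_profiles(samples: Dict[str, List[bytes]]) -> Dict[str, Profile]:
    """Training step: one stored profile per language, from sample data."""
    return {name: byte_profile(b"".join(chunks)) for name, chunks in samples.items()}


def profile_distance(p: Profile, q: Profile) -> float:
    """Sum of absolute frequency differences (0.0 = identical, 2.0 = disjoint)."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def recognize_by_profile(data: bytes, profiles: Dict[str, Profile],
                         head_size: int = 256) -> str:
    """Compare the initial portion of the data with each stored profile."""
    head = byte_profile(data[:head_size])
    return min(profiles, key=lambda name: profile_distance(head, profiles[name]))
```

The quality of the result depends entirely on how well the chosen characteristic separates the languages, which is the difficulty noted above.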
It is also known, from document EP-A-0 558 804, to analyze the syntax of a received data block and, for each language, to identify "FOR" and "AGAINST" keys in the data block, to weight the keys, and to sum the results obtained in order to select the best-placed candidate amongst all of the languages. Again, this is a relatively lengthy process, and there is once more the difficulty of selecting keys and weighting factors for minimizing errors and uncertainties in recognition.
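The weighted-key principle can be sketched as follows. The language names, keys, and weights are hypothetical values chosen for illustration, and counting raw substring occurrences stands in for the syntax analysis described in the cited document.

```python
from typing import Dict

# Hypothetical key table: for each toy language, keys that count FOR it and
# keys that count AGAINST it, each with an assumed weighting factor.
KEY_TABLE: Dict[str, Dict[str, Dict[bytes, float]]] = {
    "toy_vector": {"for": {b"PU": 2.0, b"PD": 2.0}, "against": {b"%!": 3.0}},
    "toy_page": {"for": {b"%!": 3.0, b"showpage": 2.0}, "against": {b"PU": 1.0}},
}


def score_language(block: bytes, keys: Dict[str, Dict[bytes, float]]) -> float:
    """Sum the weighted FOR keys and subtract the weighted AGAINST keys."""
    score = 0.0
    for key, weight in keys["for"].items():
        score += weight * block.count(key)
    for key, weight in keys["against"].items():
        score -= weight * block.count(key)
    return score


def recognize_by_weighted_keys(block: bytes,
                               key_table: Dict[str, Dict[str, Dict[bytes, float]]] = KEY_TABLE) -> str:
    """Select the best-placed candidate amongst all of the languages."""
    scores = {name: score_language(block, keys) for name, keys in key_table.items()}
    return max(scores, key=scores.get)
```

Even in this toy form, the outcome is sensitive to the selection of keys and to the weights assigned to them, which illustrates the tuning difficulty mentioned above.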