The invention relates to a method of identifying a language in which a text is composed in the form of a string of characters, and also to a method of controlling a speech synthesis unit and to a communication device.
Communication devices, that is to say terminal devices used in a communication network, such as for example mobile phones or PCs (personal computers), may have at their user interface a speech reproduction unit for reproducing texts. In order to reproduce texts, in particular received texts or messages such as short messages (SMS), e-mails, traffic information and the like, with the correct pronunciation, the language of the received text or message must be known.
To make possible the correct pronunciation of a name by means of a speech synthesis unit, EP 0 372 734 B1 proposes a method of identifying the language of a name in which a spoken name to be reproduced is broken down into groups of letters of 3 letters each and for each of the 3-letter groups the probability of the respective 3-letter group belonging to a certain language is established, in order then to ascertain from the sum of the probabilities of all the 3-letter groups the association with a language or a language group.
In a known method (GB 2 318 659 A) of identifying a language in which a document is written, the words of a language that are used most frequently are selected for each of a multiplicity of languages available and are stored in respective word tables of the language. In order to identify the language of a document, words of the documents are compared with the most frequently used words of the various languages, the number of matches being counted. The language for which the greatest number of matches is obtained in the word-for-word comparison is then established as the language of the document.
In a further known method of identifying a language on the basis of 3-letter groups (U.S. Pat. No. 5,062,143), a text is broken down into a multiplicity of 3-letter groups in such a way that at least some of the 3-letter groups overlap neighbouring words, that is to say are given a space in the middle. The 3-letter groups obtained in this way are compared with key sets of 3-letter groups of various languages, in order to ascertain the language of a text from the ratio of groups of letters of the text matching the 3-letter groups of a key set in relation to the total number of 3-letter groups of the text.
The invention is based on the object of providing a further method of identifying a language which makes it possible with little expenditure to identify reliably the language in which the text is composed, even in the case of short texts. In addition, the invention is based on the object of providing a method of controlling a speech synthesis unit and a communication device with which correct speech reproduction is possible for various languages with little expenditure.
This object is achieved by the methods according to claims 1 and 14 and by the communication device according to claim 15.
Thus, according to the invention, a frequency distribution of letters in a text of which the language is sought is ascertained. This frequency distribution is compared with corresponding frequency distributions of available languages, in order to establish similarity factors which indicate to what extent the ascertained frequency distribution coincides with the frequency distributions of each available language. The language for which the ascertained similarity factor is the greatest is then established as the language of the text. In this case, it is expedient if the language is established only if the greatest similarity factor ascertained is greater than a threshold value.
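The comparison described above can be illustrated by the following minimal sketch. The description does not specify how the similarity factor is calculated; cosine similarity is used here purely as an assumed stand-in, and the names letter_distribution, similarity, identify_language and the default threshold of 0.5 are hypothetical.

```python
from collections import Counter
from math import sqrt

def letter_distribution(text):
    """Relative frequency of each individual letter in the text (case-folded)."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}

def similarity(p, q):
    """Cosine similarity of two distributions (an assumed measure, not from the source)."""
    dot = sum(value * q.get(key, 0.0) for key, value in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def identify_language(text, profiles, threshold=0.5):
    """Return the language with the greatest similarity factor,
    or None if that factor does not exceed the threshold."""
    observed = letter_distribution(text)
    scores = {lang: similarity(observed, prof) for lang, prof in profiles.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None
```

Here profiles is assumed to map each available language to its stored frequency distribution. Returning None when the threshold is not exceeded corresponds to the case in which no language is established.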
In other words, the statistical distribution of letters, that is to say of individual letters, groups of 2 letters or groups of more than 2 letters, in a text to be analysed is established and compared with corresponding statistical distributions of the languages respectively available. This procedure requires relatively little computing capacity and relatively little storage space in its implementation.
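As an illustration, frequency distributions of individual letters and of groups of letters could be obtained as sketched below. The function names are hypothetical, and it is assumed here that groups are formed from all overlapping runs of n consecutive characters, a space being treated like any other character.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every overlapping group of n consecutive characters in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def frequency_distribution(text, n):
    """Relative frequency of each n-character group; empty if the text is too short."""
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    return {group: c / total for group, c in counts.items()} if total else {}
```

Calling frequency_distribution with n equal to 1, 2 or 3 yields the distributions of individual letters, 2-letter groups and 3-letter groups respectively.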
In an advantageous development of the invention, it is provided that the ascertained frequency distribution is stored as the frequency distribution of a new language or is added to a corresponding frequency distribution of a language if, in response to an inquiry, a language to which the ascertained frequency distribution is to be assigned is indicated. In this way, it is made possible in a self-learning process for frequency distributions to be produced for further languages or, if a frequency distribution for this language has already been stored, to increase its statistical reliability.
In an advantageous development of the invention, it may be provided that the ascertained frequency distribution is added to the corresponding frequency distribution of the language established. As a result, the statistical reliability of stored frequency distributions of available languages can be automatically further improved, without the user needing to intervene.
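The two self-learning developments above could be sketched as follows. It is assumed, for this illustration only, that the stored distributions are kept as raw counts so that a newly ascertained distribution can simply be added; the name learn and the dictionary store are hypothetical.

```python
from collections import Counter

def learn(store, language, counts):
    """Add the ascertained counts to the stored counts of the given language.
    A language not yet present in the store is created as a new entry."""
    store.setdefault(language, Counter()).update(counts)
    return store
```

The same function covers both cases: the language may be the one indicated in response to an inquiry, or the one established automatically by the identification.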
In order to facilitate the processing of the text when ascertaining the frequency distribution of letters and groups of letters in the text, it is provided in an advantageous development of the invention that all non-letter characters, apart from spaces, are removed from the string of characters of the text, in order to ascertain from the string of characters thus obtained frequency distributions of letters and groups of letters in the text.
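A minimal sketch of such a filtering step is given below. It assumes that a letter is any character for which Python's str.isalpha returns true, which includes special letters such as ä or é; the function name is hypothetical.

```python
def strip_non_letters(text):
    """Remove every character that is neither a letter nor a space."""
    return "".join(ch for ch in text if ch.isalpha() or ch == " ")
```

Note that in this sketch other whitespace, such as tabs and line breaks, is removed as well, since only the space character itself is retained.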
In another development of the invention, it is provided that the length of the text is established and, depending on that length, one, two or more frequency distributions of letters and groups of letters in the text are ascertained. The length of the text is established as the number of letters in the text, and this number is compared with the number of letters in an alphabet in order to determine which frequency distributions are ascertained.
In this way, the computing effort in ascertaining the frequency distribution or frequency distributions and in the subsequent comparison of the frequency distributions for establishing similarity factors can be reduced without significantly impairing the reliability of the language identification, since the only frequency distributions omitted are those whose statistical significance would be extremely low.
In particular, it is expedient that the frequency distributions of groups of letters with three letters, of groups of letters with two letters and of individual letters are ascertained if the number of letters in the text is greater than the square of the number of letters in the alphabet. Thus, if the number of letters in the text is very great, it is advantageous if not only the frequency distributions of individual letters and of 2-letter groups but also the frequency distribution of 3-letter groups are ascertained, whereby the statistical reliability of the overall finding is significantly increased.
If the number of letters in the text is smaller, namely greater than the number of letters in the alphabet but less than its square, the frequency distributions of groups of letters with 2 letters and of individual letters are ascertained. If the number of letters in the text is less than the number of letters in the alphabet, expediently only the frequency distribution of individual letters is ascertained, since the statistical significance of the frequency distributions of groups of letters is then practically no longer assured in the method of evaluation according to the invention.
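The length-dependent choice of frequency distributions can be sketched as follows. The description leaves open what happens when the number of letters is exactly equal to the alphabet size or to its square; in this sketch such boundary cases are assigned to the smaller set of distributions, which is an assumption. The function name is hypothetical.

```python
def group_lengths_for(letter_count, alphabet_size):
    """Return the group lengths (1 = individual letters) to be evaluated."""
    if letter_count > alphabet_size ** 2:
        return (1, 2, 3)   # individual letters, 2-letter and 3-letter groups
    if letter_count > alphabet_size:
        return (1, 2)      # individual letters and 2-letter groups
    return (1,)            # individual letters only
```

For an alphabet of 26 letters, for example, 3-letter groups would be evaluated only for texts of more than 676 letters.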
A particularly expedient development of the invention is distinguished by the fact that a complete alphabet is used, including special letters of various languages based on Latin letters. The use of a complete alphabet, that is to say an alphabet which contains not only the Latin letters common to all languages using Latin letters but also the special letters based on Latin letters, such as for example ä, ö, ü in German, é, ç in French or å in Swedish, means that every text to be analysed can be processed in the same way, without the letters first having to be investigated for special letters, in order to choose the corresponding alphabet. As a result, a significant simplification of the method according to the invention can thus be achieved.
To speed up the identification of the language in which the text is composed, it is expedient if the letters present in the text are investigated for special letters, in order to select according to the presence or absence of special letters characteristic of certain languages the languages which are to be taken into consideration in the comparison of the ascertained frequency distribution with corresponding frequency distributions of available languages.
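One possible way of preselecting languages by special letters is sketched below. The table SPECIAL_LETTERS is illustrative and deliberately incomplete, and the selection rule (keep only languages whose special letters include every special letter found, and keep all languages if none is found or none matches) is an assumption rather than a rule taken from the description.

```python
# Illustrative, incomplete table of special letters per language (hypothetical).
SPECIAL_LETTERS = {
    "de": set("äöüß"),
    "fr": set("éèçàâêîôû"),
    "sv": set("åäö"),
    "en": set(),
}

def candidate_languages(text, special=SPECIAL_LETTERS):
    """Return the languages to be considered in the subsequent comparison."""
    all_special = set().union(*special.values())
    found = {ch for ch in text.lower() if ch in all_special}
    if not found:
        return sorted(special)  # no special letters: no language is excluded
    matches = sorted(lang for lang, chars in special.items() if found <= chars)
    return matches or sorted(special)  # contradictory evidence: exclude nothing
```

Only the frequency distributions of the returned languages would then need to be compared with the ascertained frequency distribution.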
In addition, it may be provided that, after establishing the language, the letters present in the text are investigated for special letters which are characteristic of the language established and of languages not established, in order to confirm the language established. By comparing special letters present in the text to be analysed with the special letters of a language established, it can be established in a simple way to what extent the language established for the text is plausible for it.
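Such a plausibility check could, for example, look as follows. The rule used here, namely that the established language is considered plausible as long as the text contains no special letter belonging exclusively to other languages, is an assumption; the function name and the form of the table are hypothetical.

```python
def is_plausible(text, language, special):
    """Check that the text contains no special letter foreign to the language.
    'special' maps each language to the set of its special letters."""
    own = special.get(language, set())
    foreign = set().union(*special.values()) - own
    return not any(ch in foreign for ch in text.lower())
```

Letters shared between the established language and other languages do not count against the result in this sketch.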
The method according to the invention can be used particularly expediently for identifying a language in a method of controlling a speech synthesis unit, in which the language established in the language identification according to the invention is transmitted to a speech synthesis unit, in which the pronunciation rules of the language established are selected and used for the synthetic speech reproduction of the text by a speech synthesis module of the speech reproduction unit.
In a communication device according to the invention, which has not only a receiving module and a speech synthesis module but also a language identification module, it is provided that a text to be output by the speech synthesis module can be fed to the language identification module for identifying the language in which the text to be output is composed, and that the language identification module is connected to the speech synthesis module for transmitting a language established for this text.
It is expediently provided in this case that pronunciation rules for various languages are stored in the speech synthesis module, a pronunciation-rules selection circuit being provided in the speech synthesis module, which circuit is connected to the language identification module and, depending on the language transmitted by the language identification module, selects the corresponding pronunciation rule, so that it can be used by a speech synthesis unit of the speech synthesis module.
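In software terms, the selection of pronunciation rules may be pictured as a simple lookup, as in the following sketch. The fallback to a default language when no language has been established is an assumption not taken from the description, and all names are hypothetical.

```python
def select_rules(language, rules_by_language, fallback_language):
    """Return the pronunciation rules for the transmitted language, or those of
    the fallback language if the language is None or has no stored rules."""
    if language in rules_by_language:
        return rules_by_language[language]
    return rules_by_language[fallback_language]
```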
In order to be able to carry out a language identification simply and effectively in the communication device according to the invention, it is provided that the language identification module comprises a filter circuit, in order to remove all non-letter characters, apart from spaces, from a string of characters of a text.
Furthermore, it is expedient if the language identification module comprises a statistics circuit, in order to ascertain a frequency distribution of letters in the text, the statistics circuit having first, second and third computing circuits, in order to ascertain frequency distributions of individual letters, of groups of letters with two letters and of groups of letters with three letters.
An expedient development of the invention is distinguished by the fact that the language identification module has a comparator circuit which, in order to ascertain similarity factors, compares the frequency distributions of letters ascertained for a text with corresponding stored frequency distributions of available languages. The language identification module further comprises an evaluation circuit, to which the similarity factors can be fed by the comparator circuit, in order to establish the language for which the ascertained similarity factor is greatest as the language of the text.