The invention relates to the improvement of voice-controlled systems with text-based speech synthesis, in particular to the improvement of the synthetic reproduction of a stored train of characters whose pronunciation is subject to certain peculiarities.
The use of speech to operate technical devices is becoming increasingly important. This applies to data and command input as well as to message output. Systems that utilize acoustic signals in the form of speech to facilitate communication between users and machines in both directions are called voice response systems. The utterances output by such systems can be prerecorded natural speech or synthetically created speech, which is the subject of the invention described in this document. Devices are also known in which such utterances are combinations of synthetic and prerecorded natural speech.
A few general explanations and definitions relating to speech synthesis are provided in the following to give a better understanding of the invention.
The object of speech synthesis is the machine transformation of the symbolic representation of an utterance into an acoustic signal that is sufficiently similar to human speech that it will be recognized as such by a human.
Systems used in the field of speech synthesis are divided into two categories:
1) A speech synthesis system produces spoken language based on a given text.
2) A speech synthesizer produces speech based on certain control parameters.
The speech synthesizer therefore represents the last stage of a speech synthesis system.
A speech synthesis technique is a technique that allows a speech synthesizer to be built. Examples of speech synthesis techniques are direct synthesis, synthesis using a model, and the simulation of the vocal tract.
In direct synthesis, parts of the speech signal are combined to produce the corresponding words, either on the basis of stored signals (e.g. one signal stored per phoneme), or the transfer function of the vocal tract used by humans to create speech is simulated by the signal energy in certain frequency ranges. In this manner voiced sounds are represented by a quasi-periodic excitation at a certain frequency.
The term "phoneme" mentioned above denotes the smallest unit of language that can be used to differentiate meanings but that has no meaning itself. Two words with different meanings that differ by only a single phoneme (e.g. fish/wish, woods/wads) form a minimal pair. The number of phonemes in a language is relatively small (between 20 and 60); the German language uses about 45 phonemes.
To take the characteristic transitions between phonemes into account, diphones are usually used in direct speech synthesis. Simply stated, a diphone can be defined as the span from the invariable (stationary) part of one phoneme to the invariable part of the following phoneme.
Phonemes and sequences of phonemes are written using the International Phonetic Alphabet (IPA). The conversion of a piece of text to a series of characters belonging to the phonetic alphabet is called phonetic transcription.
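To illustrate phonetic transcription, the following sketch combines an exception lexicon with naive letter-to-sound rules. All entries, rule pairs, and IPA-like symbols are illustrative assumptions, not part of the invention; a real transcription module would use a far richer rule set.

```python
# Hypothetical exception lexicon: words whose pronunciation cannot be
# derived from general rules (IPA-like symbols, illustrative only).
LEXICON = {"Itzehoe": "ɪtsəˈhoː", "Laboe": "laˈbøː"}

# Hypothetical letter-to-sound rules, applied left to right in order.
RULES = [("sch", "ʃ"), ("ch", "ç"), ("ei", "aɪ"), ("e", "ə")]

def transcribe(word):
    """Phonetically transcribe a word: lexicon lookup first, rules otherwise."""
    if word in LEXICON:                 # special-case pronunciations
        return LEXICON[word]
    out = word.lower()
    for graphemes, phoneme in RULES:    # naive rule application
        out = out.replace(graphemes, phoneme)
    return out
```

The ordering of the rules matters: "sch" must be handled before "ch", which in turn must precede the single-letter rule for "e".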
In synthesis using a model, a production model is created that is usually based on minimizing the difference between a digitized human speech signal (original signal) and a predicted signal.
The simulation of the vocal tract is another method. In this method the form and position of each organ used to articulate speech (tongue, jaws, lips) is modeled. To do this, a mathematical model of the airflow characteristics in a vocal tract defined in this manner is created and the speech signal is calculated using this model.
Short explanations of other terms and methods used in conjunction with speech synthesis will be given in the following.
The phonemes or diphones used in direct synthesis must first be obtained by segmenting the natural language. There are two approaches used to accomplish this:
In implicit segmentation only the information contained in the speech signal itself is used for segmentation purposes.
Explicit segmentation, on the other hand, uses additional information such as the number of phonemes in the utterance.
To segment an utterance, features must first be extracted from the speech signal. These features can then be used as the basis for differentiating between segments.
These features are then classified.
Possible methods for extracting features are spectral analysis, filter bank analysis or the linear prediction method, amongst others.
Hidden Markov models, artificial neural networks or dynamic time warping (a method for normalizing time) are used to classify the features, for example.
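Of the classification methods mentioned, dynamic time warping is the simplest to sketch. The following minimal implementation computes the DTW distance between two one-dimensional feature sequences; the absolute-difference cost is an illustrative choice, and real systems operate on multidimensional feature vectors.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences.

    Builds the standard cumulative-cost matrix; each cell extends the
    cheapest of the three predecessor alignments (insertion, deletion,
    match), which normalizes differences in timing between sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

A stretched copy of a sequence (e.g. one sample held twice as long) has distance zero, which is exactly the time normalization that makes DTW useful for comparing utterances spoken at different speeds.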
The Hidden Markov Model (HMM) is a two-stage stochastic process. It consists of a Markov chain, usually with a low number of states, to which probabilities or probability densities are assigned. The speech signals and/or their parameters described by these probability densities can be observed; the intermediate states themselves remain hidden. HMMs have become the most widely used models in speech recognition due to their high performance and robustness and because they are easy to train.
The Viterbi algorithm can be used to determine how well a given observation sequence matches each of several HMMs, and thus which model fits best.
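A minimal Viterbi implementation for a discrete-observation HMM is sketched below. It returns both the most likely hidden state path and its probability; the latter can serve as the matching score when comparing several HMMs. The array layout (initial probabilities `pi`, transition matrix `A`, emission matrix `B`) is a conventional choice, not mandated by the text.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path for an observation sequence obs (list of
    symbol indices). pi: initial probs (N,), A: transitions (N, N),
    B: emission probs (N, M). Returns (path, path probability)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))            # best path probability per state
    psi = np.zeros((T, N), dtype=int)   # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A   # extend every previous path
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):       # trace backpointers
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), float(delta[-1].max())
```

In practice log probabilities are used to avoid numerical underflow on long utterances; the linear-domain version above is kept for readability.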
More recent approaches use multiple self-organizing feature maps (Kohonen maps). This special type of artificial neural network models certain processes carried out in the human brain.
A widely used approach is the classification into voiced/unvoiced/silence in accordance with the various excitation forms arising during the creation of speech in the vocal tract.
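The voiced/unvoiced/silence classification can be sketched with two classic frame features: short-time energy and zero-crossing rate. The thresholds below are illustrative assumptions; real systems calibrate them to the recording conditions.

```python
import numpy as np

def classify_frame(frame, energy_sil=0.01, zcr_voiced=0.1):
    """Classify one signal frame as 'silence', 'voiced' or 'unvoiced'.

    Low energy indicates silence; among speech frames, a low
    zero-crossing rate suggests quasi-periodic (voiced) excitation,
    a high rate suggests noise-like (unvoiced) excitation.
    Both thresholds are hypothetical and need tuning in practice."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    if energy < energy_sil:
        return "silence"
    return "voiced" if zcr < zcr_voiced else "unvoiced"
```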
Regardless of which synthesis technique is used, one problem remains with text-based synthesis devices: even though the spelling of a stored train of characters generally corresponds relatively closely to its pronunciation, every language contains words whose pronunciation cannot be determined from their spelling if no context is given. In particular, it is often impossible to specify general phonetic pronunciation rules for proper names. For example, the names of the cities "Itzehoe" and "Laboe" have the same ending, even though the ending of Itzehoe is pronounced "oe" and the ending of Laboe is pronounced "ö". If these words are provided as trains of characters for synthetic reproduction, the application of a general rule would cause the endings of both city names to be pronounced either "ö" or "oe", resulting in an incorrect pronunciation whenever the "ö" version is used for Itzehoe or the "oe" version is used for Laboe. If these special cases are to be taken into consideration, the corresponding words of the language must be subjected to special treatment for reproduction. However, this also means that purely text-based input can no longer be used for all words intended to be reproduced later on.
Because giving certain words of a language special treatment is extremely complex, announcements output by voice-controlled devices are currently made up of a combination of recorded and synthesized speech. For example, in a route finder, the desired destination, which is specified by the user and whose pronunciation often displays peculiarities compared with other words of the language, is recorded and copied into the corresponding destination announcement. For the destination announcement "Itzehoe is three kilometers away", the portion "is three kilometers away" would be synthesized and the rest, the word "Itzehoe", would be taken from the user's spoken destination input. The same situation arises when setting up mailboxes, where the user is required to input his or her name. In this case, to avoid these complexities, the announcement played back when a caller is connected to the mailbox is created from the synthesized portion "You have reached the mailbox of" and the original recording, e.g. "John Smith", made when the mailbox was set up.
Apart from the fact that combined announcements of the type just described leave a more or less unprofessional impression, the recorded speech included in the announcement can also lead to problems when listening to it. One need only consider the problems that arise when speech is input in noisy environments. The invention is therefore based on the task of specifying a reproduction method for voice-controlled systems with text-based speech synthesis that eliminates the disadvantages inherent in the current state of the art.
This task is accomplished by the features of the present invention; advantageous extensions and expansions of the invention are also provided. In accordance with the present invention, actual spoken speech input corresponding to a stored train of characters is available, and the train of characters, phonetically described according to general rules and converted to a purely synthetic form, is compared with this spoken input before the converted train of characters is actually reproduced. The converted train of characters is reproduced only if this comparison yields a deviation below a threshold value. As a result, the use of the original speech recording for reproduction, as in the current state of the art, becomes superfluous. This applies even when the spoken word deviates significantly from the converted train of characters corresponding to it: it must only be ensured that at least one variation is created from the converted train of characters, and that this variation is output instead of the original converted train of characters if its deviation from the original speech input is below the threshold value.
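The claimed reproduction method can be sketched as follows. All helper names (`transcribe`, `variants_of`, `deviation`) are illustrative assumptions standing in for the transcription, variation-generation, and comparison components; they are not terminology from the invention itself.

```python
def reproduce_synthetically(spoken, chars, transcribe, variants_of,
                            deviation, threshold):
    """Sketch of the claimed method: phonetically transcribe the stored
    train of characters by general rules, compare it with the actual
    spoken input, and reproduce it only if the deviation is below the
    threshold; otherwise try variations of the transcription."""
    candidate = transcribe(chars)
    if deviation(candidate, spoken) < threshold:
        return candidate                 # general rules were good enough
    for variant in variants_of(candidate):
        if deviation(variant, spoken) < threshold:
            return variant               # a variation matches the speaker
    return None                          # no acceptable synthetic form
```

With toy components (lowercasing as "transcription", a single "oe" to "ö" variation, and a character-difference deviation), the Laboe example from above is resolved by the variation rather than the rule-based form.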
If the method of the present invention is performed, then the amount of computational and memory resources required remains relatively low. The reason for this is that only one variation must be created and examined.
If at least two variations are created in accordance with the present invention and the variation with the lowest deviation from the original speech input is determined and selected, then, in contrast to performing the method as described above, a synthetic reproduction of the original speech input is always possible.
Performing the method is made easier when the speech input and the converted train of characters or the variations created from it are segmented. Segmentation allows segments in which there are no deviations or in which the deviation is below a threshold value to be excluded from further treatment.
If the same segmenting approach is used, the comparison becomes especially simple because there is a direct association between the corresponding segments.
As per the present invention, different segmenting approaches can also be used for the two sides of the comparison. This is advantageous particularly when examining the original speech input, because the information contained in the speech signal, which can only be extracted in a very complex step, must be used for its segmentation in any case, whereas the known number of phonemes in the utterance can simply be used to segment the trains of characters.
The method of the present invention becomes very efficient when segments with a high degree of correlation are excluded, and only those segments of the train of characters that deviate from their corresponding segments in the original speech input by a value above the threshold are altered, by replacing the phoneme in the affected segment with a replacement phoneme.
The method of the present invention is especially easy to perform when, for each phoneme, at least one similar replacement phoneme is linked to it or stored in a list.
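Such a replacement table and the generation of single-segment variations can be sketched as follows. The table entries are illustrative IPA-like symbols chosen for the Laboe example; a real system would cover the full phoneme inventory.

```python
# Hypothetical similarity table: each phoneme is linked to an ordered
# list of acoustically similar replacement phonemes (symbols illustrative).
REPLACEMENTS = {
    "oː": ["øː", "ɔ"],
    "øː": ["oː", "œ"],
}

def single_replacement_variants(phonemes):
    """Yield all variations of a phoneme sequence that replace exactly
    one phoneme by one of its listed replacement phonemes; phonemes
    without an entry in the table are left unchanged."""
    for i, p in enumerate(phonemes):
        for r in REPLACEMENTS.get(p, []):
            yield phonemes[:i] + [r] + phonemes[i + 1:]
```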
The amount of computation is further reduced when, for a variation of a train of characters determined to be worthy of reproduction, the peculiarities arising in its reproduction are stored together with the train of characters. In this case the special pronunciation of the corresponding train of characters can be accessed in memory immediately, without much additional effort, when it is used later.
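The stored association described above amounts to a lookup structure keyed by the train of characters. A minimal sketch, with assumed method names, could look like this:

```python
class PronunciationStore:
    """Sketch: once a variation has been found worthy of reproduction,
    its phonetic form is stored together with the train of characters,
    so later reproductions need no renewed comparison or search."""

    def __init__(self):
        self._known = {}

    def lookup(self, chars):
        # Returns the stored phonetic form, or None if this train of
        # characters has not been given special treatment before.
        return self._known.get(chars)

    def remember(self, chars, phonetic_form):
        self._known[chars] = phonetic_form
```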