In text-to-speech (TTS) systems, a portion of an inputted text (or a text file) is converted into audio speech (or an audio speech file). Such systems are used in a wide variety of applications such as electronic games, e-book readers, e-mail readers, satellite navigation, automated telephone systems, and automated warning systems. For example, some instant messaging (IM) systems use TTS synthesis to convert text chat to speech. This can be very useful for people who have difficulty reading, people who are driving, or people who simply do not want to take their eyes off whatever they are doing to change focus to the IM window.
Another recent area of application for TTS systems is “personal assistants”. These personal assistants are implemented as either software integrated into a device (such as the SIRI™ assistant provided with APPLE™ devices) or stand-alone hardware devices with the associated software (such as the AMAZON™ ECHO™ device). The personal assistants provide an utterance-based interface between the electronic device and the user. The user is able to issue commands by voice (for example, by saying “What is the weather today in New York, USA?”).
The electronic device is configured to capture the utterance, convert the utterance to text and to process the user-generated command. In this example, the electronic device is configured to execute a search and determine the current weather forecast for New York. The electronic device is then configured to generate a machine-generated utterance representative of a response to the user query. In this example, the electronic device may be configured to generate a spoken utterance: “It is 5 degrees centigrade with the winds out of North-East”.
One of the main challenges associated with TTS systems is the generation of machine utterances that are “natural sounding”. In other words, the main challenge is making the machine-generated utterance sound as close as possible to the way a human would sound. Typically, TTS systems employ Machine Learning Algorithms (MLAs) that are trained, using a corpus of pre-recorded utterances, to generate the machine utterance for a given text that needs to be processed.
These utterances are pre-recorded by a human (typically an actor with good diction). The MLA is then configured to “cut and paste” various pieces of the corpus of pre-recorded utterances to generate the required machine utterance. Put another way, the MLA of the TTS system generates synthesized speech by “concatenating” pieces of recorded speech that are stored in a database.
For example, if the portion of the text to be processed is “ma”, the MLA picks the most appropriate piece of the pre-recorded utterances to generate the associated portion of the machine-generated utterance. One can easily appreciate that if a human were to pronounce the utterance “ma”, it can sound different depending on many circumstances: the surrounding phonemes (i.e. the “context”), whether it is part of a stressed syllable or not, whether it is at the beginning of a word or at the end, etc. Thus, a given corpus of the pre-recorded utterances may have a number of utterances representing the text “ma”, some of them sounding very different from the others and, thus, some of them being more (or less) suitable for generating a particular instance of the machine-generated utterance representing “ma”.
Therefore, one of the challenges in this approach is to determine which pieces of the pre-recorded utterances to use for the given machine-generated utterance to make it sound as natural as possible. Two parameters are typically used to select a given piece for inclusion into the currently generated machine utterance: a target cost and a join (concatenation) cost.
Generally speaking, the target cost is indicative of whether a given piece of the pre-recorded utterances is suitable for processing a given text portion. The join cost is indicative of how well two neighbouring pieces of the pre-recorded utterances (out of the potential selections of neighbouring pieces) will sound together (i.e. how natural the transition from one pre-recorded utterance to the next neighbouring utterance sounds).
The target cost can be calculated using Formula 1:
$$C^{t}(t_i, u_i) = \sum_{j=1}^{p} \omega_j^{t}\, C_j^{t}(t_i, u_i) \qquad \text{(Formula 1)}$$
In other words, the target cost can be calculated as a weighted sum of differences between features of the text portion to be processed into the machine-generated utterance and features of the specific one of the pre-recorded utterances to be used to process that text portion. The features that can be processed by the MLA for determining the target cost include: frequency of the main tone, duration, context, position of the element in the syllable, the number of stressed syllables in the phrase, etc.
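The weighted sum of Formula 1 can be illustrated with a minimal Python sketch. The feature names (`pitch_hz`, `duration_ms`), the absolute-difference sub-cost, and the weight values are assumptions chosen for illustration only; the text does not specify how the individual sub-costs or weights are defined.

```python
from typing import Callable, Dict, Sequence

# Hypothetical representation: each target or unit is a dict of numeric features.
Features = Dict[str, float]
SubCost = Callable[[Features, Features], float]


def abs_diff(name: str) -> SubCost:
    """Build a sub-cost C_j^t: absolute difference of one named feature (an assumption)."""
    return lambda target, unit: abs(target[name] - unit[name])


def target_cost(
    target: Features,
    unit: Features,
    sub_costs: Sequence[SubCost],
    weights: Sequence[float],
) -> float:
    """Weighted sum over j = 1..p of w_j^t * C_j^t(t_i, u_i), as in Formula 1."""
    if len(sub_costs) != len(weights):
        raise ValueError("need exactly one weight per sub-cost")
    return sum(w * c(target, unit) for w, c in zip(weights, sub_costs))


# Illustrative values only.
target = {"pitch_hz": 120.0, "duration_ms": 80.0}
candidate = {"pitch_hz": 110.0, "duration_ms": 100.0}
sub_costs = [abs_diff("pitch_hz"), abs_diff("duration_ms")]
weights = [0.5, 0.25]

# 0.5 * |120 - 110| + 0.25 * |80 - 100| = 5.0 + 5.0
print(target_cost(target, candidate, sub_costs, weights))  # 10.0
```

A lower value indicates that the candidate piece is a closer match to the requested text portion under the chosen features and weights.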
The join cost can be calculated using Formula 2:
$$C^{c}(u_{i-1}, u_i) = \sum_{j=1}^{q} \omega_j^{c}\, C_j^{c}(u_{i-1}, u_i) \qquad \text{(Formula 2)}$$
In other words, the join cost is calculated as a weighted sum of features of two potentially neighbouring elements of the pre-recorded utterances.
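As a rough illustration of Formula 2, the sketch below scores how well two candidate pieces would join. It assumes, purely for illustration, that each sub-cost measures the mismatch between a feature at the end of the previous piece and the same feature at the start of the next piece; the key names and weights are hypothetical and not taken from the text.

```python
from typing import Callable, Dict, Sequence

# Hypothetical representation: each unit stores features at its start and end.
Unit = Dict[str, float]
JoinSubCost = Callable[[Unit, Unit], float]


def boundary_mismatch(end_key: str, start_key: str) -> JoinSubCost:
    """Build a sub-cost C_j^c: mismatch between the end of one unit and the start of the next."""
    return lambda prev_unit, next_unit: abs(prev_unit[end_key] - next_unit[start_key])


def join_cost(
    prev_unit: Unit,
    next_unit: Unit,
    sub_costs: Sequence[JoinSubCost],
    weights: Sequence[float],
) -> float:
    """Weighted sum over j = 1..q of w_j^c * C_j^c(u_{i-1}, u_i), as in Formula 2."""
    if len(sub_costs) != len(weights):
        raise ValueError("need exactly one weight per sub-cost")
    return sum(w * c(prev_unit, next_unit) for w, c in zip(weights, sub_costs))


# Illustrative values only.
prev_unit = {"end_pitch_hz": 112.0, "end_energy": 0.75}
next_unit = {"start_pitch_hz": 108.0, "start_energy": 0.25}
join_sub_costs = [
    boundary_mismatch("end_pitch_hz", "start_pitch_hz"),
    boundary_mismatch("end_energy", "start_energy"),
]
join_weights = [0.5, 2.0]

# 0.5 * |112 - 108| + 2.0 * |0.75 - 0.25| = 2.0 + 1.0
print(join_cost(prev_unit, next_unit, join_sub_costs, join_weights))  # 3.0
```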
The total cost can be calculated using Formula 3:
$$C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^{t}(t_i, u_i) + \sum_{i=2}^{n} C^{c}(u_{i-1}, u_i) + C^{c}(S, u_1) + C^{c}(u_n, S)$$

$$C(t_1^n, u_1^n) = \sum_{i=1}^{n} \sum_{j=1}^{p} \omega_j^{t}\, C_j^{t}(t_i, u_i) + \sum_{i=2}^{n} \sum_{j=1}^{q} \omega_j^{c}\, C_j^{c}(u_{i-1}, u_i) + C^{c}(S, u_1) + C^{c}(u_n, S) \qquad \text{(Formula 3)}$$
The total cost is thus the sum of the target costs of all the selected elements of the pre-recorded utterances and the join costs between each pair of neighbouring elements, including the two boundary terms involving S. Therefore, in order to process the text into the machine utterance, the server executing the MLA needs to select a set u_1, u_2, . . . , u_n such that the total cost calculated according to Formula 3 is minimized.
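One common way to carry out such a minimization is a dynamic-programming (Viterbi-style) search over the candidate pieces for each text portion. The text does not state which search procedure is used, so the sketch below is an assumption. The function name `select_units`, the toy integer "pitch" units, the cost functions, and the use of `None` as the boundary marker S are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def select_units(
    targets: Sequence[int],
    candidates: Sequence[Sequence[int]],
    target_cost: Callable[[int, int], float],
    join_cost: Callable[[Optional[int], Optional[int]], float],
    boundary: Optional[int] = None,
) -> Tuple[float, List[int]]:
    """Return the minimum total cost per Formula 3 and the unit sequence achieving it."""
    if not targets or len(targets) != len(candidates) or any(not c for c in candidates):
        raise ValueError("need one non-empty candidate list per target")

    # best[k]: cheapest cost of any partial sequence ending in candidates[i][k].
    best = [join_cost(boundary, u) + target_cost(targets[0], u) for u in candidates[0]]
    back: List[List[int]] = []  # back[i-1][k]: best predecessor index for candidates[i][k]

    for i in range(1, len(targets)):
        prev = candidates[i - 1]
        new_best, pointers = [], []
        for u in candidates[i]:
            k = min(range(len(prev)), key=lambda j: best[j] + join_cost(prev[j], u))
            new_best.append(best[k] + join_cost(prev[k], u) + target_cost(targets[i], u))
            pointers.append(k)
        best = new_best
        back.append(pointers)

    # Add the closing boundary term C^c(u_n, S) and pick the cheapest end point.
    finals = [b + join_cost(u, boundary) for b, u in zip(best, candidates[-1])]
    k = min(range(len(finals)), key=finals.__getitem__)
    total = finals[k]

    # Walk the back-pointers to recover the chosen index at every position.
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return total, [candidates[i][idx] for i, idx in enumerate(path)]


# Toy example: units and targets are plain integer "pitch" values (illustrative only).
def toy_target_cost(t: int, u: int) -> int:
    return abs(t - u)


def toy_join_cost(a: Optional[int], b: Optional[int]) -> int:
    # Joins with the boundary marker are assumed to be free in this toy setup.
    return 0 if a is None or b is None else 2 * abs(a - b)


total, units = select_units([100, 120], [[100, 112], [125, 113]], toy_target_cost, toy_join_cost)
print(total, units)  # 21 [112, 113]
```

In this toy case, choosing each piece by target cost alone would give the sequence 100, 125 with a total cost of 55, whereas accounting for the join cost selects 112, 113 with a total cost of 21.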
U.S. Pat. No. 7,308,407 (published on Dec. 11, 2007 to IBM) discloses a method for generating synthetic speech that can include identifying a recording of conversational speech and creating a transcription of the conversational speech. Using the transcription, rather than a predefined script, the recording can be analyzed and acoustic units extracted. Each acoustic unit can include a phoneme and/or a sub-phoneme. The acoustic units can be stored so that a concatenative text-to-speech engine can later splice the acoustic units together to produce synthetic speech.
U.S. Pat. No. 5,809,462 (published on Sep. 15, 1998 to Ericsson Messaging Systems Inc.) discloses an automated speech recognition system that converts a speech signal into a compact, coded representation that correlates to a speech phoneme set. A number of different neural network pattern matching schemes may be used to perform the necessary speech coding. An integrated user interface guides a user unfamiliar with the details of speech recognition or neural networks to quickly develop and test a neural network for phoneme recognition. To train the neural network, digitized voice data containing known phonemes that the user wants the neural network to ultimately recognize are processed by the integrated user interface. The digitized speech is segmented into phonemes with each segment being labelled with a corresponding phoneme code. Based on a user selected transformation method and transformation parameters, each segment is transformed into a series of multiple dimension vectors representative of the speech characteristics of that segment. These vectors are iteratively presented to a neural network to train/adapt that neural network to consistently distinguish and recognize these vectors and assign an appropriate phoneme code to each vector. Simultaneous display of the digitized speech, segments, vector sets, and a representation of the trained neural network assists the user in visually confirming the acceptability of the phoneme training set. A user may also selectively audibly confirm the acceptability of the digitization scheme, the segments, and the transform vectors so that satisfactory training data are presented to the neural network. If the user finds a particular step or parameter produces an unacceptable result, the user may modify one or more of the parameters and verify whether the modification effected an improvement in performance.
The trained neural network is also automatically tested by presenting a test speech signal to the integrated user interface and observing both audibly and visually automatic segmentation of the speech, transformation into multidimensional vectors, and the resulting neural network assigned phoneme codes. A method of decoding such phoneme codes using the neural network is also disclosed.