1. Field of the Invention
The invention relates to the field of emotion synthesis, in which an emotion is simulated, for example, in a voice signal, and more particularly aims to provide a new degree of freedom in controlling the possibilities offered by emotion synthesis systems and algorithms.
In the case of an emotion to be conveyed in voice data, the latter can comprise intelligible words or unintelligible vocalisations or sounds, such as babble or animal-like noises.
Such emotion synthesis finds applications in the animation of communicating objects, such as robotic pets, humanoids and interactive machines, in educational training, in systems for reading out texts, and in the creation of sound tracks for films and animations, among others.
2. Discussion of the Background
FIG. 1 illustrates the basic concept of a classical voiced emotion synthesis system 2 based on an emotion simulation algorithm.
The system receives at an input 4 voice data Vin, which is typically neutral, and produces at an output 6 voice data Vout which is an emotion-tinted form of the input voice data Vin. The voice data is typically in the form of a stream of data elements, each corresponding to a sound element such as a phoneme or syllable. A data element generally specifies one or more values concerning the pitch and/or intensity and/or duration of the corresponding sound element. The voice emotion synthesis operates by performing algorithmic steps that modify at least one of these values in a specified manner to produce the required emotion.
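The per-element modification described above can be sketched as follows; the element fields and the three transformation parameters (a pitch scale, an intensity shift and a duration scale) are illustrative assumptions, not taken from the original specification.

```python
from dataclasses import dataclass, replace

@dataclass
class SoundElement:
    """One element of the voice data stream, e.g. a phoneme or syllable."""
    pitch_hz: float       # pitch value of the sound element
    intensity_db: float   # intensity value of the sound element
    duration_ms: float    # duration value of the sound element

def apply_emotion(stream, pitch_scale, intensity_shift_db, duration_scale):
    """Produce an emotion-tinted stream Vout by modifying the pitch,
    intensity and duration values of each element of the input stream Vin."""
    return [
        replace(e,
                pitch_hz=e.pitch_hz * pitch_scale,
                intensity_db=e.intensity_db + intensity_shift_db,
                duration_ms=e.duration_ms * duration_scale)
        for e in stream
    ]
```

For example, a neutral stream passed through `apply_emotion(stream, 1.5, 6.0, 0.8)` would come out with raised pitch, increased intensity and shortened durations, as one might choose for an excited or angry tint.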
The emotion simulation algorithm is governed by a set of input parameters P1, P2, P3, . . . , PN, referred to as emotion-setting parameters, applied at an appropriate input 8 of the system 2. These parameters are normally numerical values, and possibly indicators, for parameterising the emotion simulation algorithm, and are generally determined empirically.
Each emotion E to be portrayed has its specific set of emotion-setting parameters. In the example, the values of the emotion-setting parameters P1, P2, P3, . . . , PN are respectively C1, C2, C3, . . . , CN for calm, A1, A2, A3, . . . , AN for angry, H1, H2, H3, . . . , HN for happy, S1, S2, S3 . . . , SN for sad.
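The per-emotion parameter sets can be represented as a simple lookup table, as sketched below. The interpretation given to each parameter and all numerical values are hypothetical placeholders for the empirically determined constants C1 . . . CN, A1 . . . AN, and so on.

```python
# Hypothetical emotion-setting parameter sets (N = 3 here for brevity):
# P1 = pitch scale, P2 = intensity shift in dB, P3 = duration scale.
# In practice each set would be determined empirically.
EMOTION_PARAMS = {
    "calm":  (1.00,  0.0, 1.05),
    "angry": (1.30,  6.0, 0.85),
    "happy": (1.20,  3.0, 0.90),
    "sad":   (0.90, -4.0, 1.20),
}

def params_for(emotion):
    """Return the emotion-setting parameters P1..PN for emotion E."""
    return EMOTION_PARAMS[emotion]
```

Selecting an emotion then amounts to feeding the corresponding tuple to the synthesis algorithm's parameter input.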
There also exist emotion simulation algorithm systems that are entirely generative, inasmuch as they do not convert an input stream of voice data, but generate the emotion-tinted voice data Vout internally. These systems also use sets of parameters P1, P2, P3, . . . , PN analogous to those described above to determine the type of emotion to be generated.
Whatever the emotion simulation algorithm system, while these parameterisations can effectively synthesize the corresponding emotions, there is additionally a need to be able to associate a magnitude with a synthesized emotion E. For instance, it is advantageous to be able to produce, for a given emotion E, a range of intensities of the emotion portrayed in the voice data Vout, e.g. from mild to intense.
One possibility would be to create empirically-determined additional sets of parameters for a given emotion, each corresponding to a degree of emotion portrayed. However, such an approach suffers from significant drawbacks:
the elaboration of the additional sets would be extremely laborious,
their storage in an application would occupy a portion of memory that could be penalizing in a memory-constrained device such as a small robotic pet,
the management and processing of the additional sets consume significant processing power,
and, from the point of view of performance, it would not make it possible to envisage embodiments that create smooth changes in the quantity of emotion.
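The memory drawback of such discrete per-degree sets can be made concrete with a rough back-of-the-envelope calculation; all figures below are illustrative assumptions, not values from the original.

```python
# Rough storage cost of the discrete approach: one empirically-determined
# parameter set per emotion and per degree of intensity.
NUM_EMOTIONS = 4      # e.g. calm, angry, happy, sad
NUM_PARAMS = 10       # N emotion-setting parameters per set (assumed)
BYTES_PER_PARAM = 4   # e.g. 32-bit floating-point values

def storage_bytes(num_degrees):
    """Total bytes needed to store all parameter sets for the given
    number of discrete intensity degrees per emotion."""
    return NUM_EMOTIONS * num_degrees * NUM_PARAMS * BYTES_PER_PARAM
```

With a single set per emotion this gives 160 bytes, but ten discrete intensity steps per emotion already requires 1600 bytes, and the cost grows linearly with the number of steps; moreover, however many steps are stored, transitions between them remain stepwise rather than smooth.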