1. Field of the Invention
The present invention relates to a method and to a system for providing speech synthesis on a user terminal over a communications network. In particular, the present invention relates to a service architecture for providing speech synthesis on user terminals with limited memory availability, such as mobile phones, PDAs (Personal Digital Assistants), personal organizers and digital cameras.
The invention has been developed with particular attention paid to its possible use in wireless telecommunications networks, for providing enhanced text-to-speech (TTS) services to mobile terminals having an embedded speech synthesizer module based on the concatenation of speech waveforms stored in a database.
2. Description of the Related Art
Speech synthesis based on the concatenation technique is well known in the art, e.g. from patent application WO 00/30069 or from the paper “A concatenative speech synthesis method using context dependent phoneme sequences with variable length as search units”, NHK (Nippon Hoso Kyokai; Japan Broadcasting Corp.) Science and Technical Research Laboratories, 5th ISCA Speech Synthesis Workshop, Pittsburgh, USA, June 2004.
Document WO 00/30069 discloses a speech synthesizer based on concatenation of digitally sampled speech units from a large database.
The paper “A concatenative speech synthesis method using context dependent phoneme sequences with variable length as search units” provides a method for dividing an input text into context dependent phoneme sequences and a method for selecting a proper voice waveform database from a static speech database. The speech quality increases when a large speech database is used.
The inventors have observed that the quality of such a speech synthesis system, when embedded on a mobile terminal, is intrinsically limited by the maximum database size, which cannot be increased at will on a terminal with limited resources.
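By way of illustration only, the relationship between database size and concatenation quality can be sketched as follows. The sketch is not taken from either cited document: the function names, the greedy longest-match selection, the toy databases and the integer "samples" are all assumptions introduced here. It shows that a database holding longer units covers the same phoneme sequence with fewer concatenation points, which are where audible discontinuities tend to arise.

```python
def select_units(phonemes, database):
    """Greedily cover a phoneme sequence with the longest stored units.

    `database` maps a tuple of phonemes to a list of waveform samples.
    This greedy rule is an illustrative assumption, not a cited method.
    """
    units = []
    i = 0
    while i < len(phonemes):
        # Try the longest candidate first, shrinking until a stored unit matches.
        for j in range(len(phonemes), i, -1):
            key = tuple(phonemes[i:j])
            if key in database:
                units.append(key)
                i = j
                break
        else:
            raise KeyError(f"no stored unit covers phoneme {phonemes[i]!r}")
    return units


def synthesize(phonemes, database):
    """Concatenate the selected units; return (samples, number of joins)."""
    units = select_units(phonemes, database)
    samples = []
    for unit in units:
        samples.extend(database[unit])
    return samples, len(units) - 1


# Hypothetical toy databases: integers stand in for waveform samples.
small_db = {("h",): [1], ("e",): [2], ("l",): [3], ("o",): [4]}
large_db = dict(small_db)
large_db.update({("h", "e"): [10, 20], ("l", "o"): [30, 40]})

text = ["h", "e", "l", "o"]
small_samples, small_joins = synthesize(text, small_db)
large_samples, large_joins = synthesize(text, large_db)
print(small_joins, large_joins)
```

In this toy case the larger database needs one join where the smaller one needs three, at the cost of storing additional waveforms, which is the trade-off that constrains a memory-limited terminal.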
Document EP 1471499 A1 illustrates a method of distributed speech synthesis, performing a text-to-speech conversion based on processing distributed between a remote server and a user terminal. In particular, the synthesis of speech segments is performed by the server. The user terminal downloads synthesized speech segments and concatenates them by means of rules provided by the server. Moreover, the user terminal implements a caching mechanism according to the rules provided by the server.
The inventors have observed that, although high quality speech synthesis can be achieved using a distributed speech synthesis system, in such systems it is not feasible to perform speech synthesis without an active network connection, thus limiting the effectiveness of some user terminals, e.g. PDAs.
Document US 2004/0054534 illustrates an example of speech synthesis customization based on user preferences. The user selects voice criteria at a local user terminal. The voice criteria represent characteristics that the user desires for a synthesized voice. The voice criteria are communicated to a server. The server generates a set of synthesized voice rules based on the voice criteria and sends them to the local user terminal. The synthesized voice rules represent prosodic aspects of the synthesized voice.
The inventors have observed that the speech synthesis quality of the above-mentioned speech synthesis systems is, as a general rule, directly related to the size of the database of speech waveforms used.
The inventors have tackled the problem of obtaining a significant increase in the quality of speech synthesis on systems which are embedded on mobile terminals, without unduly increasing the memory requirements of the speech waveform database. In particular, the inventors have tackled the problem of dynamically customizing a speech synthesis system based on the concatenation technique, so as to achieve the same quality as a static solution based on a database of speech waveforms too large to be stored in portable user terminals.