1. Field of the Invention
The present invention relates to computer-generated text-to-speech conversion, and, more particularly, to updating a Concatenative Text-To-Speech (CTTS) system with a speech database from a base version to a new version.
2. Description of the Related Art
Natural speech output is one of the key elements for a wide acceptance of voice-enabled applications and is indispensable for interfaces that cannot make use of other output modalities, such as plain text or graphics. Recently, major improvements in the field of text-to-speech synthesis have been made through the development of so-called “corpus-based” methods. Systems such as the IBM trainable text-to-speech system or AT&T's NextGen system make use of explicit or parametric representations of short segments of natural speech, referred to herein as “synthesis units,” that are extracted from a large set of recorded utterances in a preparative synthesizer training session, and which are retrieved, further manipulated, and concatenated during a subsequent speech synthesis runtime session.
In more detail, and with a particular focus on the disadvantages of prior art, such methods for operating a CTTS system include the following features:
a) The CTTS system uses natural speech (stored in either its original form or any parametric representation) obtained by recording some base text, which is designed to cover a variety of envisaged applications;
b) In a preparative step (synthesizer construction) the recorded speech is dissected by a respective computer program into synthesis units, which are stored in a base speech database;
c) The synthesis units are distinguished in the base speech database with respect to their acoustic and/or prosodic contexts, which are derived from and thus are specific for said base text; and
d) Synthetic speech is constructed by a concatenation and appropriate modification of the synthesis units.
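Features b) through d) can be illustrated with a minimal Python sketch. It is not taken from any of the systems named above: the `SynthesisUnit` type, the use of neighboring phones as the only context, the `#` boundary marker, and the assumption that each recording arrives already aligned into one sample segment per phone are all hypothetical simplifications, and the modification step of feature d) is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisUnit:
    phone: str      # unit identity (here: a phone label)
    left: str       # preceding phone in the base text (acoustic context)
    right: str      # following phone in the base text (acoustic context)
    samples: tuple  # stand-in for the waveform or a parametric representation


def build_base_database(recordings):
    """Dissect aligned recordings into units, indexed by phone and context.

    `recordings` is an iterable of (phones, segments) pairs, where
    segments[i] holds the samples that were aligned to phones[i].
    """
    database = defaultdict(list)
    for phones, segments in recordings:
        padded = ["#"] + list(phones) + ["#"]  # "#" marks an utterance boundary
        for i, samples in enumerate(segments):
            unit = SynthesisUnit(
                phone=padded[i + 1],
                left=padded[i],
                right=padded[i + 2],
                samples=tuple(samples),
            )
            database[(unit.phone, unit.left, unit.right)].append(unit)
    return database


def concatenate(units):
    """Join the samples of the selected units in order (no smoothing)."""
    output = []
    for unit in units:
        output.extend(unit.samples)
    return output
```

Because the context keys are derived from the recorded base text, a context that never occurs in that text has no entry in the database, which is the limitation that feature c) points to.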
FIG. 1 depicts a schematic block diagram of a prior art CTTS system. According to FIG. 1, prior art speech synthesizers 10 basically execute a run-time conversion from text to speech, where speech is shown by audio arrow 15. For that purpose, a linguistic front-end component 12 of system 10 performs text normalization, text-to-phone unit conversion (baseform generation), and prosody prediction, i.e., creation of an intonation contour that describes energy, pitch, and duration of the required synthesis units. Intonation and pauses for the text are specified at this pre-processing stage.
The pre-processed text, the requested sequence of synthesis units, and the desired intonation contour are passed to a back-end concatenation module 14 that generates the synthetic speech in a synthesis engine 16. For that purpose, a back-end database 18 of speech segments is searched for units that best match the acoustic/prosodic specifications computed by the front-end. The back-end database 18 stores an explicit or parametric representation of the speech data.
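The back-end search for a best-matching unit can be sketched as follows. This is an illustrative simplification, not the method of any particular system: the `Target` and `Candidate` types, the weighted sum of absolute differences used as the cost, and the weights themselves are assumptions, and the sketch scores each unit in isolation rather than also considering how well neighboring units join.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Prosodic specification computed by the front-end for one unit."""
    pitch_hz: float
    duration_ms: float
    energy_db: float


@dataclass
class Candidate:
    """A stored speech segment together with its measured prosody."""
    unit_id: str
    pitch_hz: float
    duration_ms: float
    energy_db: float


def target_cost(candidate, target, weights=(1.0, 1.0, 1.0)):
    """Weighted distance between a candidate and the requested prosody."""
    w_pitch, w_duration, w_energy = weights
    return (
        w_pitch * abs(candidate.pitch_hz - target.pitch_hz)
        + w_duration * abs(candidate.duration_ms - target.duration_ms)
        + w_energy * abs(candidate.energy_db - target.energy_db)
    )


def select_best(candidates, target, weights=(1.0, 1.0, 1.0)):
    """Return the candidate with the lowest cost for the given target."""
    return min(candidates, key=lambda c: target_cost(c, target, weights))
```

A segment whose stored pitch, duration, and energy lie closest to the front-end specification is returned, so less signal modification is needed before concatenation.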
Synthesis units, such as phones, sub-phones, diphones, or syllables, are well known to sound different when articulated in different acoustic and/or prosodic contexts. Consequently, a large number of these units have to be stored in the synthesizer's database in order to enable the system to produce high quality speech output across a broad variety of applications or domains. For combinatorial and performance reasons, it is prohibitive to search all instances of a required synthesis unit during runtime. Accordingly, a fast selection of suitable candidate segments is generally performed based upon previously established criteria, rather than upon the entirety of synthesis units in the synthesizer's database.
With reference to FIG. 2, in state-of-the-art conventional systems this fast selection is usually achieved by taking into consideration the acoustic and/or prosodic context of the speech segments. For that purpose, decision trees for the identification of relevant contexts are created during system construction 19. The leaves of these trees represent individual acoustic and/or prosodic contexts that significantly influence the short term spectral and/or prosodic properties of the synthesis units, and thus their sound. The traversal of these decision trees during runtime is fast and restricts the number of segments to consider in the back-end search to only a few out of several hundreds or thousands.
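The runtime traversal can be sketched in Python as follows. The tree shown is a hand-built toy, not the output of an actual system construction: the `Node` type, the two context questions, and the unit identifiers are hypothetical and serve only to show how a leaf restricts the candidate set.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """Inner nodes ask a yes/no context question; leaves hold unit ids."""
    question: Optional[Callable[[dict], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    unit_ids: List[str] = field(default_factory=list)


def find_candidates(node, context):
    """Walk from the root to a leaf and return that leaf's candidates."""
    while node.question is not None:
        node = node.yes if node.question(context) else node.no
    return node.unit_ids


# Toy tree for one phone; questions and unit ids are illustrative only.
VOWELS = {"a", "e", "i", "o", "u"}
tree = Node(
    question=lambda ctx: ctx["left"] in VOWELS,  # preceded by a vowel?
    yes=Node(
        question=lambda ctx: ctx["stressed"],    # in a stressed syllable?
        yes=Node(unit_ids=["ae_017", "ae_112"]),
        no=Node(unit_ids=["ae_203"]),
    ),
    no=Node(unit_ids=["ae_044", "ae_310", "ae_311"]),
)

candidates = find_candidates(tree, {"left": "a", "stressed": True})
```

The cost of a lookup grows with the depth of the tree rather than with the number of stored segments, and only the segments stored at the reached leaf are passed on to the back-end search.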
While concatenative text-to-speech synthesis is able to produce synthetic speech of remarkable quality, it is also true that such systems sound most natural for applications and/or domains that have been thoroughly covered by the recording script (i.e., the above-mentioned base text) and are thus present in the speech database. Different speaking styles and acoustic contexts are only two reasons that help to explain this observation.
Since it is impossible to record speech material for all possible applications in advance, both the construction of synthesizers for limited domains and adaptation with additional, domain-specific prompts have been proposed in the literature. Limited domain synthesis constructs a specialized synthesizer for each individual application. Domain adaptation adds speech segments from a domain-specific speech corpus to an already existing, general synthesizer.
With reference to FIG. 3, when an existing CTTS system is to be updated in order to either adapt it to a new domain or to deal with changes made to existing applications (e.g., a re-design of the prompts to be generated by a conversational dialog system), prior art methods and systems first specify a new, domain/application-specific text corpus 31, which usually is not covered by the base speech database. Disadvantageously, the new text 31 must be read by a professional human speaker in a new recording session 32, and the system construction process (shown in FIG. 2) needs to be carried out in order to generate a speech database 18 adapted to the new application.
Therefore, while both approaches, limited domain synthesis and domain adaptation, can help to increase the quality of synthetic speech for a particular application, these methods are disadvantageously time-consuming and expensive, since a professional human speaker (preferably the original voice talent) has to be available for the update speech session, and because of the need for expert phonetic-linguistic skills in the synthesizer construction step (shown in FIG. 2).