Machine-generated speech can be produced in many different ways and for many different applications. The most popular and practical approach towards speech synthesis from text is the so-called concatenative speech synthesis technique in which segments of speech extracted from recorded speech messages are concatenated sequentially, generating a continuous speech signal.
Many different concatenative synthesis techniques have been developed, which can be classified by their features:
- The type of the smallest speech segments (diphones, demi-phones, phones, syllables, words, phrases, ...)
- The number of prototypes for each speech segment class (one prototype per speech segment vs. many prototypes per speech segment)
- The signal representation of the basic speech units (prosody modification vs. no prosody modification)
- Prosody modification techniques (LPC, TD-PSOLA, HNM, ...)
A common method for generating speech waveforms is by a speech segment composition process that consists of re-sequencing and concatenating digital speech segments that are extracted from recorded speech files stored in a speech corpus, thereby avoiding substantial prosody modifications.
The quality of segment resequencing systems depends among other things on appropriate selection of the speech units and the position where they are concatenated. The synthesis method can range from restricted input domain-specific “canned speech” synthesis where sentences, phrases, or parts of phrases are retrieved from a database, to unrestricted input corpus-based unit selection synthesis where the speech segments are obtained from a constrained optimization problem that is typically solved by means of dynamic programming.
Table 1 establishes a typology of TTS engines depending on several characteristics.
TABLE 1

Domain                      Specific          Specific        General
Purpose                     Canned speech     Corpus-based    Corpus-based
Quality/naturalness         Transparent       High            Medium
Selection complexity        Trivial           Complex         Very complex
Unit size after selection   Determined        Variable        Variable
Number of units             Small             Medium          Large
Segmental and prosodic
  richness                  Low               Low             High
Vocabulary                  Strictly limited  Limited         Unlimited
Flexibility                 Low               Low             Limited
Footprint                   Medium            Large           Application
                                                              dependent

All the technologies mentioned in Table 1 are currently available in the TTS market. The choice made by TTS integrators in different platforms and products is determined by a compromise between processing power needs, storage capacity requirements (footprint), system flexibility, and speech output quality.
In contrast to corpus-based unit selection synthesis, canned speech synthesis can only be used for restricted input domain-specific applications where the output message set is finite and completely described by means of a number of indices that refer to the actual speech waveforms.
While canned speech synthesizers use large units such as phrases (described in E. Klabbers, “High-Quality Speech Output Generation Through Advanced Phrase Concatenation,” Proc. of the COST Workshop on Speech Technology in the Public Telephone Network: Where are we today?, Rhodes, Greece, pages 85-88, 1997), words (described in H. Meng, S. Busayapongchai, J. Glass, D. Goddeau, L. Hetherington, E. Hurley, C. Pao, J. Polifroni, S. Seneff, and V. Zue, “WHEELS: A Conversational System In The Automobile Classifieds Domain,” in Proc. ICSLP '96, Philadelphia, Pa., October 1996, pp. 542-545), and morphemes, corpus-based speech synthesizers use smaller units such as phones (described in A. W. Black, N. Campbell, “Optimizing Selection Of Units From Speech Databases For Concatenative Synthesis,” Proc. Eurospeech '95, Madrid, pp. 581-584, 1995), diphones (described in P. Rutten, G. Coorman, J. Fackrell & B. Van Coile, “Issues in Corpus-based Speech Synthesis,” Proc. IEE symposium on state-of-the-art in Speech Synthesis, Savoy Place, London, April 2000), and demi-phones (described in M. Balestri, A. Pacchiotti, S. Quazza, P. L. Salza, S. Sandri, “Choose The Best To Modify The Least: A New Generation Concatenative Synthesis System,” Proc. Eurospeech '99, Budapest, pp. 2291-2294, September 1999).
The two types of applications use different unit sizes because, under the condition of full coverage, the size of the database grows exponentially with the size of the unit. Canned speech synthesis is widely used in domain-specific areas such as announcement systems, games, speaking clocks, and IVR systems.
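The exponential growth under full coverage can be seen with a little arithmetic. The sketch below is a rough upper bound, assuming an illustrative inventory of 40 phones and ignoring phonotactic constraints, which rule many sequences out in practice:

```python
# Arithmetic sketch of why full coverage forces small units: with an
# inventory of P phones, the number of distinct units of length n grows
# like P**n. This is an upper bound; phonotactics excludes many sequences.

def unit_inventory_size(n_phones, unit_length):
    return n_phones ** unit_length

# Illustrative inventory of 40 phones:
for n in (1, 2, 3, 4):
    print(n, unit_inventory_size(40, n))
```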
Corpus-based speech synthesis systems make use of a large segment database. A large segment database refers to a speech segment database that references speech waveforms. The database may directly contain digitally sampled waveforms, or it may include pointers to such waveforms, or it may include pointers to parameter sets that govern the actions of a waveform synthesizer. The database is considered “large” when, in the course of waveform reference for the purpose of speech synthesis, the database commonly references many waveform candidates, occurring under varying linguistic conditions. In this manner, most of the time in speech synthesis, the database will likely offer many waveform candidates from which a single waveform is selected. The availability of many such waveform candidates can permit prosodic and other linguistic variation in the speech output stream.
Speech resequencing systems access an indexed database composed of natural speech segments. Such a database is commonly referred to as the speech segment database. Besides the speech waveform data, the speech segment database contains the locations of the segment boundaries, possibly enriched by symbolic and acoustic features that discriminate the speech segments. The speech segments that are extracted from this database to generate speech are often referred to in the speech processing literature as “speech units” (SU). These units can be of variable length (e.g. polyphones). The smallest units that are used in the unit selector framework are called basic speech units (BSUs). In corpus-based speech synthesis, these BSUs are phonetic or sub-word units. If part of a synthesized message is constructed from a number of BSUs that are adjacent in the speech corpus (i.e. a convex sequence of BSUs), then the concatenation step can be avoided between these units. We will use the term Monolithic Speech Unit (MSU) when it is necessary to emphasize that a given speech unit corresponds to a convex sequence of BSUs.
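The grouping of adjacent BSUs into monolithic speech units can be sketched as follows; the (file, index) identifiers and the function name are illustrative, not taken from any particular system:

```python
# Sketch: grouping selected basic speech units (BSUs) into monolithic
# speech units (MSUs). A BSU is identified here by (corpus_file, index);
# BSUs that are adjacent in the corpus form a convex sequence and can be
# taken as one waveform, skipping the concatenation step between them.

def group_into_msus(bsus):
    """Group a selected BSU sequence into maximal convex runs (MSUs)."""
    msus = []
    for bsu in bsus:
        if msus:
            last = msus[-1][-1]
            # Adjacent in the corpus: same file, consecutive index.
            if bsu[0] == last[0] and bsu[1] == last[1] + 1:
                msus[-1].append(bsu)
                continue
        msus.append([bsu])
    return msus

# Example: units 10-11-12 of file "a" are adjacent in the corpus, so only
# two concatenation points remain instead of four.
selected = [("a", 10), ("a", 11), ("a", 12), ("b", 3), ("a", 25)]
print(group_into_msus(selected))
```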
A corpus-based speech synthesizer includes a large database with speech data and modules for linguistic processing, prosody prediction, unit selection, segment concatenation, and prosody modification. The task of the unit selector is to select from a speech database the ‘best’ sequence of speech segments (i.e. speech units) to synthesize a given target message (supplied to the system as a text).
The target message representation is obtained through analysis and transformation of an input text message by the linguistic modules. The target message is transformed into a chain of target BSU representations. Each target BSU is represented by a target feature vector that contains symbolic and possibly numeric values that are used in the unit selection process. The input to the unit selector is a single phonetic transcription supplemented with additional linguistic features of the target message. In a first step, the unit selector converts this input information into a sequence of BSUs with associated feature vectors. Some of the features are numeric, e.g. syllable position in the phrase. Others are symbolic, such as BSU identity and phonetic context. The features associated with the target BSUs (e.g. diphones) are used as a way to describe the segmental and prosodic target in a linguistically motivated way. The BSUs in the speech database are also labeled with the same features.
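A target BSU description of this kind might be encoded as below; the field names and example values are assumptions chosen purely for illustration:

```python
# Illustrative encoding of a target BSU description: a symbolic part
# (unit identity, phonetic context) plus a numeric feature (syllable
# position in the phrase). Field names are assumptions for this sketch.
from dataclasses import dataclass

@dataclass
class TargetBSU:
    identity: str             # e.g. a diphone name such as "h-E"
    left_context: str         # phone to the left of the unit
    right_context: str        # phone to the right of the unit
    syllable_position: float  # numeric: relative position in the phrase

# A target message is a chain of such descriptions:
target = [
    TargetBSU("#-h", "#", "E", 0.0),
    TargetBSU("h-E", "h", "l", 0.2),
    TargetBSU("E-l", "E", "O", 0.5),
]
print([t.identity for t in target])
```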
For each BSU in the target description, the unit selector retrieves the feature vectors of a large number of BSU candidates (e.g. diphones as illustrated in FIG. 1). Each BSU candidate is described by a speech unit descriptor that consists of a speech unit feature vector and a reference to the speech unit waveform parameters that is sometimes referred to as a segment identifier. This is shown in FIG. 2. FIG. 3 shows how the speech unit feature vector can be split into an acoustic part and a linguistic part.
Each of these candidate BSUs is scored by a multi-dimensional cost function that reflects how well its feature vector matches the target feature vector; this is the target cost. A concatenation cost, also calculated by a multi-dimensional cost function, is computed for each possible sequence of BSU candidates. In this case the cost reflects the cost of joining two candidate BSUs together. If the prosodic or spectral mismatch at the segment boundaries of two candidates exceeds the hearing threshold, concatenation artifacts occur.
In order to reduce and preferably avoid concatenation artifacts, masking functions (as defined in G. Coorman, J. Fackrell, P. Rutten & B. Van Coile, “Segment selection in the L&H Realspeak laboratory TTS system”, Proceedings of ICSLP 2000, pp. 395-398) that facilitate the rejection of bad segment combinations in the unit selection process are introduced. A dynamic programming algorithm is used to find the lowest cost path through all possible sequences of candidate BSUs, taking into account a well-chosen balance between target costs and concatenation costs. The dynamic programming assesses many different paths, but only the BSU sequence that corresponds with the lowest cost path is retained and converted to a speech signal by concatenating the corresponding monolithic speech units (e.g. polyphones as illustrated in FIG. 1).
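A minimal sketch of this search might look as follows, with toy costs, a masking function that rejects joins above a threshold, and a dynamic-programming pass that keeps the cheapest path ending in each candidate. All names, costs, and the threshold value are illustrative assumptions, not the actual functions of any particular system:

```python
# Toy unit-selection search: per target position a set of candidate units,
# a target cost, a masked concatenation cost, and dynamic programming.

INF = float("inf")

def mask(join_mismatch, threshold=1.0):
    # Masking function: joins whose mismatch exceeds the (toy) audibility
    # threshold get an infinite cost, i.e. they are rejected outright.
    return join_mismatch if join_mismatch <= threshold else INF

def select_units(candidates, target_cost, join_cost):
    """candidates: one list of candidate unit ids per target position."""
    # paths[c] = (cumulative cost, unit sequence ending in candidate c)
    paths = {c: (target_cost(0, c), [c]) for c in candidates[0]}
    for t in range(1, len(candidates)):
        new_paths = {}
        for c in candidates[t]:
            # Cheapest predecessor, join cost passed through the mask.
            prev_cost, prev_seq = min(
                paths.values(),
                key=lambda cs, c=c: cs[0] + mask(join_cost(cs[1][-1], c)))
            total = (prev_cost + mask(join_cost(prev_seq[-1], c))
                     + target_cost(t, c))
            new_paths[c] = (total, prev_seq + [c])
        paths = new_paths
    return min(paths.values(), key=lambda cs: cs[0])

# Toy example: two target positions, two candidates each.
tcosts = {(0, "a1"): 0.1, (0, "a2"): 0.0, (1, "b1"): 0.2, (1, "b2"): 0.3}
jcosts = {("a1", "b1"): 0.0, ("a1", "b2"): 0.2,
          ("a2", "b1"): 2.0,  # exceeds the masking threshold -> rejected
          ("a2", "b2"): 0.1}
cost, seq = select_units([["a1", "a2"], ["b1", "b2"]],
                         lambda t, c: tcosts[(t, c)],
                         lambda p, c: jcosts[(p, c)])
# The a2->b1 join is masked out despite a2's better target cost.
print(seq)  # -> ['a1', 'b1']
```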
Although the quality of corpus-based speech synthesis systems is often very good, there is a large variance in the overall speech quality. This is mainly because the segment selection process as described above is only an approximation of a complex perceptual process.
FIG. 1 depicts a typical corpus-based synthesis system. The text processor 101 receives a text input, e.g., the text phrase “Hello!” The text processor 101, which includes a grapheme-to-phoneme converter, then converts the text phrase into an input phonetic data sequence. In FIG. 1, this is a simple phonetic transcription: #′hE-lO#. In various alternative embodiments, the input phonetic data sequence may be in one of various different forms.
The input phonetic data sequence is converted by the target generator 111 into a multi-layer internal data sequence to be synthesized. This internal data sequence representation, known as extended phonetic transcription (XPT), contains mainly the linguistic feature vectors (including phonetic descriptors, symbolic descriptors, and prosodic descriptors) such as those in the speech segment database 141.
The unit selector 131 retrieves from the speech segment database 141 descriptors of candidate speech units that can be concatenated into the target utterance specified by the XPT transcription. The unit selector 131 creates an ordered list of candidate speech units by comparing the XPTs of the candidate speech units with the target XPT, assigning a target cost to each candidate. Candidate-to-target matching is based on symbolic feature vectors, such as phonetic context and prosodic context, and numeric descriptors, and determines how well each candidate fits the target specification. Poorly matching candidates may be excluded at this point.
The unit selector 131 determines which candidate speech units can be concatenated without causing disturbing quality degradations such as clicks, pitch discontinuities, etc. Successive candidate speech units are evaluated by the unit selector 131 according to a quality degradation cost function. Candidate-to-candidate matching uses frame-based information such as energy, pitch and spectral information to determine how well the candidates can be joined together. Using dynamic programming, the best sequence of candidate speech units is selected for output to the speech waveform concatenator 151.
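A candidate-to-candidate join cost of the kind described above could be sketched as follows; the feature layout (energy, pitch, spectral vector) and the weights and normalisation are illustrative assumptions, not the cost function of any particular system:

```python
# Sketch of a frame-based join cost: compare the trailing-edge frame of
# one candidate with the leading-edge frame of the next. Each frame is a
# tuple (energy, pitch_hz, spectral_vector); weights are illustrative.
import math

def join_cost(frame_a, frame_b, weights=(1.0, 1.0, 1.0)):
    w_e, w_p, w_s = weights
    e_cost = abs(frame_a[0] - frame_b[0])
    p_cost = abs(frame_a[1] - frame_b[1]) / 100.0   # coarse normalisation
    # Euclidean distance between the spectral vectors:
    s_cost = math.sqrt(sum((x - y) ** 2
                           for x, y in zip(frame_a[2], frame_b[2])))
    return w_e * e_cost + w_p * p_cost + w_s * s_cost

# A perfectly matching boundary costs nothing:
print(join_cost((0.5, 120.0, [1.0, 2.0]), (0.5, 120.0, [1.0, 2.0])))
```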
The speech waveform concatenator 151 retrieves the selected speech units (e.g. diphones and/or polyphones) from the speech segment database 141 and concatenates them, forming the output speech that represents the target input text.
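Concatenation itself can be illustrated with a toy crossfade; real systems join segments at carefully chosen positions (e.g. pitch-synchronously), so the fixed-length linear crossfade below is only a simplification:

```python
# Minimal waveform concatenation sketch: join two sampled segments with a
# short linear crossfade at the boundary to soften discontinuities. The
# fixed overlap length is an assumption made for brevity.

def concatenate(seg_a, seg_b, overlap=4):
    overlap = min(overlap, len(seg_a), len(seg_b))
    out = list(seg_a[:-overlap]) if overlap else list(seg_a)
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)      # fade-in weight for seg_b
        out.append((1.0 - w) * seg_a[len(seg_a) - overlap + i]
                   + w * seg_b[i])
    out.extend(seg_b[overlap:])
    return out

# Joining a constant-amplitude segment to silence yields a short ramp:
print(concatenate([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], overlap=2))
```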
It has been reported that the average quality of unit selection synthesis is increased if the application domain is closer to the domain of the recordings. Canned speech synthesis, which is a good example of domain specific synthesis, results in high quality and extremely natural synthesis beyond the quality of current corpus-based speech synthesis systems. The success of canned speech synthesis lies in the size of the speech segments that are being used. By recording words and phrases in prosodic contexts similar to the ones in which they will be used, a very high naturalness can be achieved. Because the segments used in canned speech applications are large, they embed detailed linguistic and paralinguistic information. It is not straightforward to embed this information in synthesized speech waveforms by concatenating smaller segments such as diphones or demi-phones using automatic algorithms.
The quality of domain-specific unrestricted input TTS can be further increased by combining canned speech synthesis with corpus-based speech synthesis into carrier-slot synthesis. Carrier-slot speech synthesis combines carrier phrases (i.e. canned speech) with open slots to be filled out by means of corpus-based concatenative synthesis. The corpus-based synthesis can take into account the properties of the boundaries of the carriers to select the best unit sequences.
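A carrier-slot system of this kind can be sketched schematically; the carrier inventory and the stand-in fill function below are hypothetical:

```python
# Carrier-slot synthesis sketch: a canned carrier phrase with open slots,
# where each slot is filled by (here, a stand-in for) corpus-based
# synthesis. Working on text keeps the example self-contained; a real
# system assembles waveforms, not strings.

carriers = {
    # None marks an open slot inside the canned carrier material.
    "arrival": ["The train from ", None, " arrives at platform ", None, "."],
}

def synthesize(carrier_id, slot_values, fill):
    parts = []
    slots = iter(slot_values)
    for piece in carriers[carrier_id]:
        if piece is None:
            # In a real system this would run unit selection, possibly
            # constrained by the carrier boundaries on either side.
            parts.append(fill(next(slots)))
        else:
            parts.append(piece)
    return "".join(parts)

# str.upper stands in for the slot synthesizer, to make fills visible:
print(synthesize("arrival", ["Brussels", "7"], fill=str.upper))
```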
Canned speech synthesis systems work with a fixed set of recorded messages that can be combined to create a finite set of output speech messages. If new speech messages have to be added, new recordings are required. This also means that the size of the database grows almost linearly with the number of messages that can be generated. Similar remarks can be made about corpus-based synthesis. Whatever speech unit is used in the database, it is desirable that the database offers sufficient coverage of the units to make sure that an arbitrary input text can be synthesized with a more or less homogeneous quality. In practical circumstances it is difficult to achieve full coverage. In what follows we will refer to this as the data scarcity problem.
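The data scarcity problem can be quantified by measuring unit coverage, e.g. the fraction of the diphones needed by a target transcription that actually occur in the corpus. A toy sketch with phone strings (the transcriptions are invented):

```python
# Sketch of measuring diphone coverage of a corpus against a target
# transcription, as one way to quantify the data scarcity problem.

def diphones(phones):
    """Set of adjacent phone pairs occurring in a phone sequence."""
    return {tuple(phones[i:i + 2]) for i in range(len(phones) - 1)}

def coverage(corpus_phones, target_phones):
    have = diphones(corpus_phones)
    need = diphones(target_phones)
    return len(have & need) / len(need), need - have

# Toy corpus and target transcriptions as phone strings:
cov, missing = coverage(list("#hElO#wa"), list("#hOlA#"))
print(cov, sorted(missing))
```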
A common approach to increase the number of messages that can be synthesized with high quality is to add more speech data to the speech unit database until the average quality of the system saturates. This approach has several drawbacks, such as:
- Long production cycle (recording/segmentation/annotation/validation)
- Large databases, consuming lots of memory
- Slowdown of the unit selection process because of the increased search space
- The speaker's timbre may change over time
The speech segment database development procedure starts with making high quality recordings in a recording studio followed by auditory and visual inspection. Then an automatically generated phonetic transcription is verified and corrected in order to describe the speech waveform correctly. Automatic segmentation results and prosodic annotation are manually verified and corrected. The acoustic features (spectral envelope, pitch, etc.) are estimated automatically by means of techniques well known in the art of speech processing. All features which are relevant for unit selection and concatenation are extracted and/or calculated from the raw data files.
Single speaker speech compression at bit rates far below the bit rates of traditional coding systems can be accomplished by resequencing speech segments. Such coders are referred to as very low bit rate (VLBR) coders. Initially, VLBR coding was achieved by modeling speech as a sequence of acoustically segmented variable-length speech segments.
Phonetic vocoding techniques can achieve lower bit rates by exploiting more detailed linguistic knowledge about the information embedded in the speech signal. The phonetic vocoder distinguishes itself from a vector quantization system in the manner in which spectral information is transmitted. Rather than transmitting individual codebook indices, a phone index is transmitted along with auxiliary information describing the path through the model.
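The resulting bit rates can be estimated with back-of-the-envelope arithmetic; all figures below (phone inventory size, speaking rate, auxiliary bits per phone) are illustrative assumptions:

```python
# Rough estimate of a phonetic vocoder's bit rate: only a phone index
# plus a little auxiliary information is transmitted per phone.
import math

def phonetic_vocoder_rate(n_phones=50, phones_per_sec=12.0, aux_bits=10):
    index_bits = math.ceil(math.log2(n_phones))  # bits per phone index
    return phones_per_sec * (index_bits + aux_bits)

# e.g. 12 phones/s * (6 + 10) bits = 192 bit/s, orders of magnitude below
# classic waveform coder rates.
print(phonetic_vocoder_rate())
```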
Phonetic vocoders were initially speaker-specific coders, resulting in a substantial coding gain because there was no need to transmit speaker-specific parameters. The phonetic vocoder was later extended to a speaker-independent coder by introducing multiple-speaker codebooks or speaker adaptation. Voice quality was further improved by having the decoding stage produce PCM waveforms corresponding to the nearest templates rather than reconstructing them from their spectral envelope representation. Copy synthesis was then applied to match the prosody of the segment prototype to the prosody of the target segment. These prosodically modified segments are then concatenated to produce the output speech waveform. It was reported that the resulting synthesized speech had a choppy quality, presumably due to spectral discontinuities at the segment boundaries.
The naturalness of the decoded speech was further increased by using multiple segment candidates for each recognized segment. In order to select the best sounding segment combination, the decoder performs a constrained optimization similar to the unit selection procedure in corpus-based synthesis.
Extremely low bit rates were achieved by combining an ASR system with a TTS system. Such systems are very error prone, however, because they depend on two processes, recognition and synthesis, that each introduce significant errors.