1. Field of the Invention
The present invention relates to the field of electronic speech processing and, more particularly, to synthetic speech generation.
2. Description of the Related Art
Synthetic speech can be generated using various techniques. For example, one well-established technique for generating synthetic speech is a data-driven approach which, based on a textual guide, splices samples of actual human speech together to form a desired text-to-speech (TTS) output. This splicing technique for generating TTS output is sometimes referred to as a concatenative text-to-speech (CTTS) technique.
CTTS techniques require a set of phonetic units, called a CTTS voice, that can be spliced together to form CTTS output. A phonetic unit can be any defined speech segment, such as a phoneme, an allophone, and/or a sub-phoneme. Each CTTS voice has the acoustic characteristics of the particular human speaker from whom the CTTS voice was generated. A CTTS application can include multiple CTTS voices to produce different-sounding CTTS output. That is, each CTTS voice is language specific and can generate output simulating a single speaker, so that if different speaking voices are desired, different CTTS voices are necessary.
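The splicing described above can be pictured with a minimal sketch. It assumes, purely for illustration, that a CTTS voice is a lookup table holding one recorded unit per label and that the textual guide has already been converted into a sequence of unit labels. The names `PhoneticUnit` and `splice`, and the sample values, are hypothetical; a practical CTTS system would typically choose among many candidate units per label and smooth the joins between them.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class PhoneticUnit:
    """One stored speech segment (hypothetical layout, for illustration only)."""
    label: str                   # identifier of the unit, e.g. "M"
    samples: Tuple[float, ...]   # recorded audio samples for this unit


def splice(voice: Dict[str, PhoneticUnit], labels: Sequence[str]) -> List[float]:
    """Concatenate the stored units in the order given by the label sequence."""
    output: List[float] = []
    for label in labels:
        output.extend(voice[label].samples)
    return output


# A toy "CTTS voice" with two units; the sample values are made up.
voice = {
    "M": PhoneticUnit("M", (0.1, 0.2)),
    "AA": PhoneticUnit("AA", (0.3, 0.4, 0.5)),
}

waveform = splice(voice, ["M", "AA", "M"])
```

Because every stored unit comes from recordings of one speaker, swapping in a different `voice` table is what would change the speaker that the output simulates.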
A large sample of human speech, called a CTTS speech corpus, can be used to derive the phonetic units that form a CTTS voice. Due to the large quantity of phonetic units involved, automatic methods are typically employed to segment the CTTS speech corpus into a multitude of labeled phonetic units. Each phonetic unit is verified and stored within a phonetic unit data store. A build of the phonetic unit data store can result in the CTTS voice.
Unfortunately, the automatic extraction methods used to segment the CTTS speech corpus into phonetic units can occasionally result in errors due to misaligned phonetic units. A misaligned phonetic unit is a labeled phonetic unit containing significant inaccuracies. Common misalignments include the mislabeling of a phonetic unit and improper boundary establishment for a phonetic unit. Mislabeling occurs when the identifier or label associated with a phonetic unit is erroneously assigned. For example, if a phonetic unit for an “M” sound is labeled as a phonetic unit for an “N” sound, then the phonetic unit is a mislabeled phonetic unit. Improper boundary establishment occurs when a phonetic unit has not been properly segmented, so that its duration, starting point, and/or ending point is erroneously determined.
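The two kinds of misalignment can be made concrete with a short sketch. It assumes that a hand-verified reference segmentation is available for comparison, which is rarely the case for a whole corpus, and it uses an arbitrary 10 ms tolerance. The names `LabeledUnit`, `misalignments`, and `tolerance_ms` are hypothetical and serve only to illustrate the definitions given above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LabeledUnit:
    """A segmented unit: its label plus start and end times in the recording."""
    label: str
    start_ms: float
    end_ms: float


def misalignments(auto: LabeledUnit, reference: LabeledUnit,
                  tolerance_ms: float = 10.0) -> List[str]:
    """Name the kinds of misalignment in an automatically segmented unit.

    The reference unit and the tolerance are assumptions of this sketch.
    """
    problems: List[str] = []
    # Mislabeling: the identifier assigned to the unit is wrong.
    if auto.label != reference.label:
        problems.append("mislabeled")
    # Improper boundary: the start or end point (and hence the duration) is off.
    if (abs(auto.start_ms - reference.start_ms) > tolerance_ms
            or abs(auto.end_ms - reference.end_ms) > tolerance_ms):
        problems.append("improper boundary")
    return problems


reference = LabeledUnit("M", 120.0, 185.0)
wrong_label = LabeledUnit("N", 121.0, 184.0)    # an "M" sound labeled as "N"
wrong_bounds = LabeledUnit("M", 120.0, 230.0)   # end point placed too late
```

In this sketch, `misalignments(wrong_label, reference)` reports only the label error and `misalignments(wrong_bounds, reference)` reports only the boundary error.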
Since a CTTS voice constructed from misaligned phonetic units can result in low-quality synthesized speech, it is desirable to exclude misaligned phonetic units from a final CTTS voice build. Unfortunately, manually detecting misaligned units is typically infeasible due to the time and effort involved in such an undertaking. Conventionally, technicians remove misaligned units when synthesized speech output produced during CTTS voice tests contains errors. That is, the technicians attempt to “test out” misaligned phonetic units, a process that can correct only the most grievous errors contained within a CTTS voice build. There remains, however, a need for more efficient, more rapid techniques for performing such “voice cleanings,” both with respect to CTTS voices and other synthetically generated voices based upon a phonetic data store.