1. Technical Field
The present invention relates to the field of synthetic speech and, more particularly, to the detection of misaligned phonetic units for a concatenative text-to-speech voice.
2. Description of the Related Art
Synthetic speech generation via text-to-speech (TTS) applications is a critical facet of any human-computer interface that utilizes speech technology. One predominant technology for generating synthetic speech is a data-driven approach which splices samples of actual human speech together to form a desired TTS output. This splicing technique for generating TTS output can be referred to as a concatenative text-to-speech (CTTS) technique.
CTTS techniques require a set of phonetic units, called a CTTS voice, that can be spliced together to form CTTS output. A phonetic unit can be any defined speech segment, such as a phoneme, an allophone, and/or a sub-phoneme. Each CTTS voice has acoustic characteristics of a particular human speaker from which the CTTS voice was generated. A CTTS application can include multiple CTTS voices to produce different sounding CTTS output.
A large sample of human speech called a CTTS speech corpus can be used to derive the phonetic units that form a CTTS voice. Due to the large quantity of phonetic units involved, automatic methods are typically employed to segment the CTTS speech corpus into a multitude of labeled phonetic units. Each phonetic unit is verified and stored within a phonetic unit data store. A build of the phonetic data store can result in the CTTS voice.
Unfortunately, the automatic extraction methods used to segment the CTTS speech corpus into phonetic units can occasionally result in errors or misaligned phonetic units. A misaligned phonetic unit is a labeled phonetic unit containing significant inaccuracies. Two common misalignments can include the mislabeling of a phonetic unit and improper boundary establishment for a phonetic unit. Mislabeling occurs when the identifier or label associated with a phonetic unit is erroneously assigned. For example, if a phonetic unit for an “M” sound is labeled as a phonetic unit for “N” sound, then the phonetic unit is a mislabeled phonetic unit. Improper boundary establishment occurs when a phonetic unit has not been properly segmented so that its duration, starting point and/or ending point is erroneously determined.
Since a CTTS voice constructed from misaligned phonetic units can result in low quality synthesized speech, it is desirable to exclude misaligned phonetic units from a final CTTS voice build. Unfortunately, manually detecting misaligned units is typically unfeasible due to the time and effort involved in such an undertaking. Conventionally, technicians remove misaligned units when synthesized speech output produced during CTTS voice tests contains errors. That is, the technicians attempt to “test out” misaligned phonetic units, a process that can usually only correct the most grievous errors contained within a CTTS voice build.