The present disclosure relates to transformation of scalars or vectors, for example, using a Gaussian Mixture Model (GMM) based technique for the generation of a voice conversion function. Voice conversion is the adaptation of characteristics of a source speaker's voice (e.g., pitch, pronunciation) to those of a target speaker. In recent years, interest in voice conversion systems and applications for the efficient generation of other related conversion models has risen significantly. One application for such systems relates to the use of voice conversion in individualized text-to-speech (TTS) systems. Without voice conversion technology and efficient transformations of speech vectors from different speakers, new voices could only be created with time-consuming and expensive processes, such as extensive recordings and manual annotations.
Well-known GMM based vector transformation can be used in voice conversion and other transformation applications, by generating joint feature vectors based on the feature vectors of source and target speakers, then by using the joint vectors to train GMM parameters and ultimately create a conversion function between the source and target voices. Typical voice conversion systems include three major steps: feature extraction, alignment between the extracted feature vectors of source and target speakers, and GMM training on the aligned source and target vectors. In typical systems, the vector alignment between the source vector sequence and target vector sequence must be performed before training the GMM parameters or creating the conversion function. For example, if a set of equivalent utterances from two different speakers are recorded, the corresponding utterances must be identified in both recordings before attempting to build a conversion function. This concept is known as alignment of the source and target vectors.
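For illustration only, the joint-vector GMM approach described above is commonly written in the following form. The notation (z, alpha, mu, Sigma, M, F) is assumed here for the sake of the sketch and does not appear in the present disclosure; the expressions reflect the widely used joint-density formulation rather than any particular system.

```latex
% Joint feature vector formed from an aligned source/target pair (x_t, y_t)
z_t = \begin{bmatrix} x_t \\ y_t \end{bmatrix}

% GMM with M components trained on the joint vectors
p(z) = \sum_{i=1}^{M} \alpha_i \, \mathcal{N}(z;\, \mu_i, \Sigma_i),
\qquad
\mu_i = \begin{bmatrix} \mu_i^{x} \\ \mu_i^{y} \end{bmatrix},
\qquad
\Sigma_i = \begin{bmatrix} \Sigma_i^{xx} & \Sigma_i^{xy} \\ \Sigma_i^{yx} & \Sigma_i^{yy} \end{bmatrix}

% Conversion function: expected target vector given a source vector x
F(x) = \sum_{i=1}^{M} p_i(x) \left[ \mu_i^{y} + \Sigma_i^{yx} \left( \Sigma_i^{xx} \right)^{-1} \left( x - \mu_i^{x} \right) \right],
\qquad
p_i(x) = \frac{\alpha_i \, \mathcal{N}(x;\, \mu_i^{x}, \Sigma_i^{xx})}{\sum_{j=1}^{M} \alpha_j \, \mathcal{N}(x;\, \mu_j^{x}, \Sigma_j^{xx})}
```

Because each joint vector is built from one source vector and one target vector, the quality of the trained parameters depends directly on how the pairs (x_t, y_t) were aligned.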
Conventional vector alignment is typically performed either manually, for example, by human experts, or automatically by a dynamic time warping (DTW) process. However, both manual alignment and DTW have significant drawbacks that can negatively impact the overall quality and efficiency of the vector transformation. For example, both schemes rely on the notion of “hard alignment.” That is, each source vector is determined to be completely aligned with exactly one target vector, or is determined not to be aligned at all, and vice versa for each target vector.
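By way of a non-limiting sketch, a basic DTW process of the kind referred to above may be expressed as follows. The function name dtw_align, the Euclidean local distance, and the three-direction step pattern are assumptions chosen for illustration; practical systems may use different distances and path constraints.

```python
import math


def dtw_align(source, target):
    """Return a hard alignment path [(i, j), ...] between two vector sequences.

    Hypothetical helper: uses Euclidean distance and the basic step pattern
    (diagonal, vertical, horizontal). Each (i, j) pairs source[i] with target[j].
    """
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning the first i source
    # vectors with the first j target vectors.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


# Toy one-dimensional "feature vectors"; the source holds one sound longer.
source = [(0.0,), (1.0,), (1.0,), (2.0,)]
target = [(0.0,), (1.0,), (2.0,)]
path = dtw_align(source, target)
print(path)
```

In this toy case the path pairs the second target vector with two source vectors, which illustrates the duplication that a hard alignment requires when the sequences differ in length. Every pair on the path is treated as fully aligned, with no measure of how uncertain the pairing is.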
Referring to FIG. 1, an example of a conventional hard alignment scheme is shown between a source vector sequence 110 and a target vector sequence 120. Vector sequences 110 and 120 contain sets of feature vectors x1-x16 and y1-y12, respectively, where each feature vector (speech vector) may represent, for example, a basic speech sound in a larger voice segment. These vector sequences 110 and 120 may be equivalent (i.e., contain many of the same speech features), such as, for example, vector sequences formed from audio recordings of two different people speaking the same word or phrase. As shown in FIG. 1, even equivalent vector sequences often contain different numbers of vectors, and may also have equivalent speech features (e.g., x16 and y12) in different locations in the sequence. For example, the source speaker may pronounce certain sounds more slowly than the target speaker, or may pause slightly longer between sounds than the target speaker. Thus, the one-to-one hard alignment between the source and target vectors often results in discarding certain feature vectors (e.g., x4, x5, x10, . . . ), or in duplication or interpolation of feature vectors to create additional pairs for alignment matching. As a result, small alignment errors may be magnified into larger errors, and the entire alignment process may become more complex and expensive. Finally, hard alignment may simply be impossible in many instances. Feature vectors extracted from human speech often cannot be perfectly aligned even by the best human experts or any DTW automation. Thus, hard alignment implies a certain degree of error even if performed flawlessly.
As an example of alignment error magnification resulting from a hard alignment scheme, FIG. 2 shows a block diagram of a source sequence 210 and target sequence 220 to be aligned for a vector transformation. The sequences 210 and 220 are identical in this example, but have been decimated by two on distinct parities. Thus, as in many real-world scenarios, perfect one-to-one feature vector matching is impossible because perfectly aligned source-target vector pairs are not available. Using a hard alignment scheme, each target vector has been paired with its nearest source vector and the pair is assumed thereafter to be completely and perfectly aligned. Thus, alignment errors might not be detected or taken into account because other nearby vectors are not considered in the alignment process. As a result, the hard alignment scheme may introduce noise into the data model, increase alignment error, and result in greater complexity for the alignment process.
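The decimation scenario described above may be reproduced, purely for illustration, with the following sketch. The sinusoidal test signal, the function name nearest_hard_pairs, and the interpretation of "nearest" as nearest in time are assumptions made here and are not taken from FIG. 2.

```python
import math


def nearest_hard_pairs(source, target):
    """Pair each target sample with the source sample nearest in time.

    Hypothetical helper: samples are (time_index, value) tuples. On a tie,
    the earlier source sample is chosen.
    """
    pairs = []
    for t_time, t_val in target:
        s_time, s_val = min(source, key=lambda s: abs(s[0] - t_time))
        pairs.append(((s_time, s_val), (t_time, t_val)))
    return pairs


# One underlying sequence, decimated by two on distinct parities:
# the source keeps the even-indexed samples, the target the odd-indexed ones.
signal = [math.sin(0.3 * n) for n in range(16)]
source = [(n, signal[n]) for n in range(0, 16, 2)]
target = [(n, signal[n]) for n in range(1, 16, 2)]

pairs = nearest_hard_pairs(source, target)

# Residual mismatch inside each pair that hard alignment treats as "perfect".
errors = [abs(s_val - t_val) for (_, s_val), (_, t_val) in pairs]
for ((s_time, _), (t_time, _)), err in zip(pairs, errors):
    print(f"source t={s_time:2d} paired with target t={t_time:2d}, mismatch={err:.3f}")
```

Although the two sequences originate from the same signal, no pair coincides in time, so every pair carries a nonzero mismatch. A hard alignment scheme passes these pairs to training as if they were exact, which is the source of the noise described above.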
Accordingly, there remains a need for methods and systems of aligning data sequences for vector transformations, such as GMM based transformations for voice conversion.