Segmentation of continuous speech is beneficial for many applications, including speech analysis, automatic speech recognition (ASR), and speech synthesis. However, manually determining phonetic transcriptions and segmentations, for example, requires expert knowledge, and the process is laborious and expensive for large databases. Thus, many automatic segmentation and labeling methods have been proposed to tackle this problem.
Proposed methods include [1] S. Dusan and L. Rabiner, “On the relation between maximum spectral transition positions and phone boundaries,” in Proc. of ICSLP, 2006 (hereinafter “Reference [1]”); [2] Y. Qiao, N. Shimomura, and N. Minematsu, “Unsupervised optimal phoneme segmentation: objectives, algorithm and comparisons,” in Proc. of ICASSP, 2008 (hereinafter “Reference [2]”); [3] F. Brugnara, D. Falavigna, and M. Omologo, “Automatic segmentation and labeling of speech based on hidden Markov models,” Speech Communication, vol. 12, no. 4, pp. 357-370, 1993 (hereinafter “Reference [3]”); [4] A. Sethy and S. S. Narayanan, “Refined speech segmentation for concatenative speech synthesis,” in Proc. of ICSLP, 2002 (hereinafter “Reference [4]”); and [5] Y. Estevan, V. Wan, and O. Scharenborg, “Finding maximum margin segments in speech,” in Proc. of ICASSP, 2007 (hereinafter “Reference [5]”).
These proposed methods correspond to references [1, 2, 3, 4, 5] cited in a phoneme segmentation paper entitled “Automatic Phoneme Segmentation Using Auditory Attention Features” by Ozlem Kalinli, INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Portland, Oreg., USA, Sep. 9-13, 2012, which is incorporated herein by reference.
A first group of proposed segmentation methods requires transcriptions, which are not always available. When a transcription is not available, one may consider using a phoneme recognizer for the segmentation. However, speech recognition techniques such as hidden Markov models (HMMs) cannot place phone boundaries accurately, since they are optimized for correct identification of the phone sequence. See Reference [4]. A second group of methods does not require any prior knowledge of transcriptions or acoustic models of phonemes, but their performance is usually limited.
It is within this context that aspects of the present disclosure arise.