Time alignment of orthographic transcriptions with speech data is important for effectively using speech corpora. An exemplary corpus includes approximately 2500 conversations collected over long-distance telephone lines. Each conversation includes 5 to 10 minutes of unscripted speech signals from multiple interlocutors. Unscripted speech is spontaneously produced speech, as compared to speech read from a predetermined text. Each channel is recorded separately so that speech data can be separated or combined at will.
For speech research, it is frequently desirable to process each file in such a corpus to determine the timing of each speaker's turn and the timing of each word in the orthographic transcription. Sufficient precision is desired so that words of interest can be localized to a small segment of the speech data. Such timing information is useful in developing, training, and testing systems for speaker identification, word spotting, and large-vocabulary speech recognition.
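As an illustration only, the kind of timing information described above can be pictured as a list of word records, each carrying a speaker label and start and end times. The following is a minimal sketch under stated assumptions: the names `AlignedWord`, `locate_word`, `speaker_turns`, and `SAMPLE`, the field layout, and the sample times are all hypothetical and are not taken from any particular corpus or alignment system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AlignedWord:
    """One word of an orthographic transcription with its timing (hypothetical layout)."""
    speaker: str   # channel or speaker label, e.g. "A" or "B" (assumed convention)
    word: str      # orthographic form of the word
    start: float   # start time in seconds from the beginning of the file
    end: float     # end time in seconds from the beginning of the file


def locate_word(alignment: List[AlignedWord], target: str) -> List[Tuple[str, float, float]]:
    """Return (speaker, start, end) for every occurrence of `target`, ignoring case."""
    wanted = target.lower()
    return [(w.speaker, w.start, w.end) for w in alignment if w.word.lower() == wanted]


def speaker_turns(alignment: List[AlignedWord]) -> List[Tuple[str, float, float]]:
    """Merge consecutive words by the same speaker into (speaker, start, end) turns.

    Words are ordered by start time. This simple merge does not treat
    simultaneous speech specially: overlapping words from two speakers
    simply produce alternating short turns.
    """
    turns: List[Tuple[str, float, float]] = []
    for w in sorted(alignment, key=lambda item: item.start):
        if turns and turns[-1][0] == w.speaker:
            spk, start, end = turns[-1]
            turns[-1] = (spk, start, max(end, w.end))
        else:
            turns.append((w.speaker, w.start, w.end))
    return turns


# Invented sample values, for illustration only.
SAMPLE = [
    AlignedWord("A", "hello", 0.10, 0.45),
    AlignedWord("A", "there", 0.45, 0.80),
    AlignedWord("B", "hello", 1.00, 1.30),
    AlignedWord("A", "yes", 1.50, 1.75),
]

if __name__ == "__main__":
    print(locate_word(SAMPLE, "hello"))
    print(speaker_turns(SAMPLE))
```

With records of this kind, a word of interest can be localized to a short segment of the speech data by looking up its start and end times, and each speaker's turn can be recovered from the word-level timing.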
Some previous techniques for automatic time alignment of speech use independently supplied, manually generated phonetic transcriptions. However, the manual generation of phonetic transcriptions is cumbersome, expensive, and frequently impractical, especially where the corpus is large. To avoid manually generating phonetic transcriptions, other previous techniques have attempted to determine time alignment directly from orthographic transcriptions. Those techniques have attempted to achieve correct time alignment at the phonetic level on relatively tractable data, such as speech data recorded in a controlled environment from individuals reading sentences.
Nevertheless, such previous techniques for time alignment typically fail to adequately address a number of challenges presented by a corpus of speech data that realistically represents characteristics of real-world conversations. For example, such previous techniques typically fail to adequately address challenges presented by a corpus of speech data in which each speech data file is recorded continuously for several (at least five) minutes. Also, such previous techniques typically fail to adequately address challenges presented by speech data having instances of simultaneous speech, and by unscripted speech signals having a wide range of linguistic material characteristic of spontaneous speech. Further, such previous techniques typically fail to adequately address challenges presented by speech signals from multiple interlocutors, and by speech data collected over the public switched telephone network or from air traffic control radio transmissions.
Thus, a need has arisen for a method and system for time aligning speech for a corpus of speech data that realistically represents characteristics of real-world conversations. Also, a need has arisen for a method and system for effectively time aligning speech, despite a corpus of speech data in which each speech data file is recorded continuously for several minutes. Further, a need has arisen for a method and system for effectively time aligning speech, despite speech data having instances of simultaneous speech. Moreover, a need has arisen for a method and system for effectively time aligning speech, despite unscripted speech signals having a wide range of linguistic material characteristic of spontaneous speech. Additionally, a need has arisen for a method and system for effectively time aligning speech, despite speech signals being from multiple interlocutors. Finally, a need has arisen for a method and system for effectively time aligning speech, despite speech data being collected over the public switched telephone network or from air traffic control radio transmissions.