Several applications rely on automatic speech recognition (ASR). Examples of systems that rely on ASR output include automatic speech-to-text transcription, speech-to-speech translation, and topic detection and tracking. Speech recognizers take recorded or live speech as input and attempt to generate a text transcript of what was spoken. Recorded speech data is available on the web, especially in the form of news accompanied by transcripts. Although attempts have been made in the past to access such data and develop a well-transcribed speech corpus from it, this process has certain limitations, including (a) limited speaker variability (number of speakers), (b) limited environment (recording environment) and (c) limited domain.
Hence, it is difficult to create a phonetically balanced corpus from data already available on the web, and to provide reasonable variability in terms of environment, gender, age and accent.
A speech recognizer in general comprises a pattern recognition program and some reference models. These reference models are generated using a language-specific speech corpus.
There are two primary types of reference models: (i) the acoustic model and (ii) the language model. The acoustic model may contain a set of models representing the various sounds of the language, or models representing complete words; these are built from speech that contains those sounds. The acoustic model is assisted by a lexicon, which contains the phonetic transcriptions of the domain and dictionary words. The language model aids in determining which words and word sequences occur in speech, by applying known patterns of occurrence of those words. The language model can be generated using a text corpus that represents the actual speech to be recognized.
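As an illustration only, the two text-side resources described above can be sketched in a few lines of Python: a lexicon mapping words to phonetic transcriptions, and a language model estimated from a text corpus. The lexicon entries, the phoneme symbols, the sample sentences and the function names are all hypothetical, and a bigram model with plain maximum-likelihood estimates is merely one simple choice of language model, not one named in this description.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> phonetic transcription.
# The phoneme symbols are illustrative, not taken from any standard set.
LEXICON = {
    "call": ["k", "ao", "l"],
    "the": ["dh", "ah"],
    "bank": ["b", "ae", "ng", "k"],
    "now": ["n", "aw"],
}

# Hypothetical toy text corpus standing in for domain text.
TEXT_CORPUS = [
    "call the bank",
    "call the bank now",
    "call now",
]


def train_bigram_model(sentences):
    """Estimate P(word | previous word) by maximum likelihood from counts."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Sentence boundary markers let the model score first and last words.
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return {
        (prev, word): count / context_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }


def bigram_probability(model, prev, word):
    """Return the estimated probability, or 0.0 for an unseen word pair."""
    return model.get((prev, word), 0.0)


model = train_bigram_model(TEXT_CORPUS)
p_the_after_call = bigram_probability(model, "call", "the")  # 2 of 3 contexts
```

In this sketch an unseen word pair receives probability zero; a practical language model would normally apply some form of smoothing, which is omitted here for brevity.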
FIG. 1 (prior art) illustrates a sample lexicon and text corpus for a language model. A speech corpus is required in order to generate acoustic models. A typical speech corpus is a set of speech files and their associated transcriptions. Availability of a speech corpus for a specific language is an essential requirement for building a speech recognition engine, and speech recognition based solutions, in that language. Creating a speech corpus in any language is laborious, expensive and time-consuming. The usual process of speech corpus creation starts with a linguist determining the language-specific idiosyncrasies, after which a text corpus is built so as to ensure a uniform distribution of the phonemes of the language (also called a phonetically balanced corpus). Subsequently, a target distribution of speaker age, accent and gender is computed, leading to the recruitment phase, in which the speakers are recruited.
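By way of a hedged example, the notion of a phonetically balanced text corpus can be made concrete by measuring the relative frequency of each phoneme in candidate text. The sketch below assumes a hypothetical lexicon and phoneme inventory and uses illustrative function names; it shows one simple way such a check might be performed, not the procedure a linguist would necessarily follow.

```python
from collections import Counter

# Hypothetical lexicon and phoneme inventory; symbols are illustrative only.
LEXICON = {
    "call": ["k", "ao", "l"],
    "the": ["dh", "ah"],
    "bank": ["b", "ae", "ng", "k"],
    "now": ["n", "aw"],
}
INVENTORY = ["k", "ao", "l", "dh", "ah", "b", "ae", "ng", "n", "aw", "sh"]


def phoneme_distribution(sentences, lexicon):
    """Return the relative frequency of each phoneme found in the sentences."""
    counts = Counter()
    for sentence in sentences:
        for word in sentence.lower().split():
            # Words missing from the lexicon are skipped in this sketch.
            counts.update(lexicon.get(word, []))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {phoneme: count / total for phoneme, count in counts.items()}


def missing_phonemes(distribution, inventory):
    """List inventory phonemes that never occur in the candidate text."""
    return sorted(set(inventory) - set(distribution))


candidate_text = ["call the bank", "call now"]
distribution = phoneme_distribution(candidate_text, LEXICON)
uncovered = missing_phonemes(distribution, INVENTORY)
```

A corpus designer could use such a distribution to add sentences containing under-represented or uncovered phonemes until the text is acceptably balanced.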
The actual speech is then recorded from the recruited speakers in predetermined environments. Typically, the text corpus is created keeping in mind the underlying domain for which the speech recognition is going to be used. For spontaneous conversational speech, such as telephone calls and meetings, the process of speech corpus creation may start directly from the speaker recruitment phase. Once the speech data is collected, it is listened to carefully by a native speaker of the language and transcribed manually.
The complete set of speech data and the corresponding transcriptions together form the speech corpus. This is quite an elaborate process, and as a result several languages have no speech corpus available, especially languages for which speech recognition based solutions are not commercially viable.
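Purely as an illustrative sketch, a speech corpus of this kind can be represented as a list of records, each pairing a speech file with its transcription and with metadata describing the speaker and the recording environment. The field names, file paths and sample values below are hypothetical; the summary function simply counts utterances per category, so that limited speaker, environment or accent variability becomes visible.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: a speech file plus its manual transcription."""
    audio_path: str      # hypothetical file path
    transcript: str
    speaker_id: str
    gender: str
    accent: str
    environment: str


def variability_summary(corpus):
    """Count distinct speakers and utterances per metadata category."""
    return {
        "speakers": len({u.speaker_id for u in corpus}),
        "gender": Counter(u.gender for u in corpus),
        "accent": Counter(u.accent for u in corpus),
        "environment": Counter(u.environment for u in corpus),
    }


# Hypothetical sample entries.
corpus = [
    Utterance("spk01_001.wav", "call the bank", "spk01", "female", "north", "office"),
    Utterance("spk01_002.wav", "call now", "spk01", "female", "north", "office"),
    Utterance("spk02_001.wav", "call the bank now", "spk02", "male", "south", "street"),
]
summary = variability_summary(corpus)
```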
Thus, there exists a long-felt need for an effortless and inexpensive method and system that enables creation of a speech corpus.