The present disclosure relates to multi-core processing for parallel speech-to-text processing.
Speech-to-text systems generate a text transcript from audio content. Speech-to-text techniques typically use speech recognition to identify speech from audio. Speech-to-text can be used for several speech recognition applications including, for example, voice dialing, call routing, data entry, and dictation.
A speech-to-text recognition system typically digitizes an audio signal into discrete samples. Those discrete samples are generally processed to provide a frequency domain representation of the original input audio signal. Using this frequency domain analysis, a recognition system maps the frequency domain information into phonemes. Phonemes are the phonetic sounds that serve as the basic building blocks of words in every spoken language. For example, written English has an alphabet of 26 letters, but the inventory of English phonemes is a different size, commonly counted at roughly 44 depending on dialect. The mapping yields a string of phonemes corresponding to the frequency domain representation of the original input signal. Speech detection processing then resolves the phonemes into words using a concordance or a dictionary.
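The first two stages described above, digitizing into discrete samples and producing a frequency domain representation, can be illustrated with a minimal sketch. It is not taken from the disclosure: the function names, the 8 kHz sample rate, the frame size, and the use of a naive discrete Fourier transform are all assumptions chosen for brevity, and a practical recognizer would use a fast transform and further feature extraction before any phoneme mapping.

```python
import cmath
import math

def frame_signal(samples, frame_size, hop):
    """Cut a list of discrete samples into fixed-size, possibly overlapping frames."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, hop)]

def magnitude_spectrum(frame):
    """Naive O(n^2) DFT of one frame; returns magnitudes for bins 0..n/2."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def dominant_frequency(frame, sample_rate):
    """Frequency in Hz of the strongest bin in the frame's spectrum."""
    spectrum = magnitude_spectrum(frame)
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * sample_rate / len(frame)

# Illustrative input (assumed): a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
samples = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]
frames = frame_signal(samples, frame_size=64, hop=32)
print(len(frames), dominant_frequency(frames[0], SAMPLE_RATE))
```

Each frame's spectrum is the kind of frequency domain information that a recognizer would subsequently map to phonemes; that mapping is outside the scope of this sketch.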
A typical parallel processing technique includes a split function that physically divides an audio file into roughly equal portions. The split function divides the audio file intelligently, e.g., so that the division does not split words. The split points fall in intervals with no sound or in intervals that a signal classifier identifies as non-dialogue. The split function accepts an optional exclusion interval file to identify and filter non-dialogue from the audio file. The split function separates the entire audio file into portions having approximately the same amount of transcription data, so that the processes handling the portions complete at approximately the same time. Each portion is processed into a text file.
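The choice of split points described above can be sketched as follows. This is only an illustration under stated assumptions: the silent intervals are supplied as a precomputed list of (start, end) times in seconds, equal duration is used as a stand-in for "approximately the same amount of transcription data", and the names choose_split_points and build_portion_index are hypothetical rather than drawn from the disclosure.

```python
def choose_split_points(duration, silences, n_portions):
    """Pick up to n_portions - 1 split times, each inside a silent interval,
    as close as possible to the ideal equal-length boundaries."""
    if not silences:
        raise ValueError("at least one silent interval is required")
    points = []
    for i in range(1, n_portions):
        ideal = duration * i / n_portions
        # For each silent interval, the point inside it nearest the ideal boundary.
        candidates = [min(max(ideal, start), end) for start, end in silences]
        points.append(min(candidates, key=lambda t: abs(t - ideal)))
    # Two boundaries may snap to the same silence; duplicates are collapsed.
    return sorted(set(points))

def build_portion_index(duration, split_points):
    """Return (portion_id, start_time, end_time) tuples covering the whole file."""
    bounds = [0.0] + list(split_points) + [duration]
    return [(i, bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

# Illustrative values (assumed): a 120 s file with three detected silences.
silences = [(28.0, 31.0), (55.0, 57.5), (93.0, 94.0)]
split_points = choose_split_points(120.0, silences, n_portions=4)
portion_index = build_portion_index(120.0, split_points)
print(portion_index)
```

Because every split time lies inside a silent interval, no word is cut in two, at the cost of portions that are only approximately equal in length.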
Once the text files are generated, a merge function that accepts partial speech-to-text transcripts merges the separate text files from the processed portions into one transcription file. The merge function uses a master portion index of start and end times for each portion. The split function generates this master portion index, which is used to sequence the time codes of the text files being merged. When the last portion has been processed into the last text file, the initiating master process invokes the merge function to recombine the results. The output of this process is a single textual transcription of the original input signal.
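The merge step described above can be sketched in the same spirit. The data shapes are assumptions made for illustration: the index is a list of (portion_id, start_time, end_time) tuples, each partial transcript is a list of (local_time, text) pairs with times measured from the start of its own portion, and merge_transcripts is a hypothetical name.

```python
def merge_transcripts(portion_index, partial_transcripts):
    """Combine per-portion transcripts into one list of (absolute_time, text).

    portion_index: list of (portion_id, start_time, end_time) tuples.
    partial_transcripts: dict mapping portion_id to a list of
        (local_time, text) pairs, timed from the start of that portion.
    """
    merged = []
    # Sequence portions by start time, regardless of the order they finished in.
    for portion_id, start, _end in sorted(portion_index, key=lambda p: p[1]):
        for local_time, text in partial_transcripts[portion_id]:
            # Shift each local time code onto the timeline of the original file.
            merged.append((start + local_time, text))
    return merged

# Illustrative values (assumed): portion 1 finished before portion 0.
index = [(0, 0.0, 30.0), (1, 30.0, 57.5)]
partials = {
    1: [(0.5, "world")],
    0: [(1.0, "hello"), (2.5, "there")],
}
merged = merge_transcripts(index, partials)
print(merged)
```

Sorting by the start times in the index, rather than by completion order, is what lets the parallel processes finish in any order while the merged transcript still follows the original audio.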