Historically, books and other literary works have been expressed in the form of text. Given the growing use of computers, text is now frequently represented and stored in electronic form, e.g., in the form of text files. Accordingly, in the modern age, users of computer devices can obtain electronic copies of books and other literary works.
Frequently text is read aloud so that the content of the text can be provided to one or more people in an oral, as opposed to written, form. The reading of stories to children and the reading of text to the physically impaired are common examples where text is read aloud. The commercial distribution of literary works in both electronic and audio versions has been commonplace for a significant period of time. The widespread availability of personal computers and other computer devices capable of displaying text and playing audio files stored in electronic form has begun to change the way in which text versions of literary works and their audio counterparts are distributed.
Electronic distribution of books and other literary works in the form of electronic text and audio files can now be accomplished via compact discs and/or the Internet. Electronic versions of literary works in both text and audio versions can now be distributed far more cheaply than paper copies. While the relatively low cost of distributing electronic versions of a literary work provides authors and distributors an incentive for distributing literary works in electronic form, consumers can benefit from having such works in electronic form as well.
In the case of recordings of literary and other text documents which are read aloud and recorded for commercial purposes, a single reader is often responsible for the reading of the entire text. The reader is often carefully chosen by the company producing the audio version of the literary work for proper pronunciation, inflection, general understandability and overall accuracy. In addition, audio recordings of books and other literary works are normally generated in a sound-controlled environment designed to keep background noise to a minimum. Thus, commercial audio versions of books or other literary works intended to be offered for sale, either alone or in combination with a text copy, are often of reasonably good quality with a minimum of background noise. Furthermore, they tend to accurately reflect the punctuation in the original work. In addition, a single individual may be responsible for the audio versions of several books or stories, since commercial production companies tend to use the same reader to produce the audio versions of multiple literary works, e.g., books.
Consumers may wish to switch between audio and text versions of a literary work. For example, in the evening an individual may wish to read a book. However, on their way to work, the same individual may want to listen to the audio version of the same literary work from the point, e.g., sentence or paragraph, where they left off reading the night before. Consumers attempting to improve their reading skills can also find text and audio versions in the form of electronic files beneficial. For example, an individual attempting to improve his/her reading skills may wish to listen to the audio version of a book while having text corresponding to the audio being presented highlighted on a display device. Also, many vision-impaired or hearing-impaired readers might benefit from having linked audio and text versions of a literary work.
While electronic text and audio versions of many literary works exist, relatively few of these works include links between the audio and text versions needed to support the easy accessing of the same point in both versions of a work. Without such links between the text and audio versions of a work, it is difficult to easily switch between the two versions of the work or to highlight text corresponding to the portion of the audio version being played at a given moment in time.
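By way of illustration only, the kind of link discussed above can be pictured as a table pairing the starting position of each sentence in the text file with the starting time of the same sentence in the audio file. The following Python sketch is a hypothetical example and not a description of any existing system; the names SyncPoint and SyncIndex, the use of character offsets and seconds, and the sample values are all assumptions made for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPoint:
    """One hypothetical link: where a sentence starts in each version."""
    char_offset: int      # start of the sentence in the text file
    audio_seconds: float  # start of the same sentence in the audio file


class SyncIndex:
    """Sentence-level index linking a text file and an audio file.

    Assumes the audio follows the text in order, so offsets and times
    both increase from one sync point to the next.
    """

    def __init__(self, points):
        self.points = sorted(points, key=lambda p: p.char_offset)
        self._offsets = [p.char_offset for p in self.points]
        self._times = [p.audio_seconds for p in self.points]

    def audio_for_text(self, char_offset):
        """Audio time of the sentence containing the given text position."""
        i = max(bisect_right(self._offsets, char_offset) - 1, 0)
        return self.points[i].audio_seconds

    def text_for_audio(self, seconds):
        """Text offset of the sentence being spoken at the given time."""
        i = max(bisect_right(self._times, seconds) - 1, 0)
        return self.points[i].char_offset


# Made-up sample data: three sentences.
index = SyncIndex([
    SyncPoint(0, 0.0),
    SyncPoint(42, 3.5),
    SyncPoint(97, 8.2),
])

# Reader stopped at character 50; resume audio at that sentence's start.
resume_time = index.audio_for_text(50)

# Audio is at 9.0 seconds; highlight the sentence starting at this offset.
highlight_offset = index.text_for_audio(9.0)
```

With such an index, switching from reading to listening, or highlighting the sentence currently being played, reduces to a single lookup. Without it, neither operation can be performed without searching the audio by hand.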
Links or indexes used to synchronize audio and text versions of the same work may be generated manually. However, such human involvement can be costly and time-consuming. Accordingly, there is a need for methods and apparatus for automating the synchronization of electronic text and audio versions of a work.
Previous attempts to automate the synchronization of electronic text files and audio files of the same work have focused primarily on the indexing of audio files corresponding to radio and other broadcasts with electronic text files representing transcripts of the broadcasts. Such indexing is designed to allow an individual viewing an excerpt from a transcript over the Internet to hear an audio clip corresponding to the excerpt. In such applications, the precision required in the alignment is often considered not to be critical and an error in alignment of up to 2 seconds is considered by some to be acceptable.
One particular known attempt at synchronizing audio files with text transcripts of programs relies on the use of an iterative speech recognition and text alignment process. In the known system, both a language model and an acoustic model are used for speech recognition purposes. During each iteration of the speech recognition and alignment process, the results of the speech recognition process are used to segment the audio and text files at points where the files are aligned based on the recognition results. During each subsequent iteration of the speech recognition process, each audio and corresponding text segment is processed separately, thereby reducing the size of the segments being processed. While the acoustic model is not altered during each iteration of the speech recognition process in the known system, the language model is made more restrictive during each iteration so that only those words in the corresponding text segment are considered as words which may be recognized. Thus, the known system attempts to break the text/audio alignment process into steps of aligning smaller and smaller segments using ever more restrictive language models without attempting to improve or modify the acoustic model being used.
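The segmentation step of such an iterative process can be sketched roughly as follows. This Python example is a simplified illustration under stated assumptions and is not the implementation of the known system: the speech recognizer is replaced by a hard-coded list of recognized words with start times, the standard-library difflib.SequenceMatcher stands in for the alignment step, and the function names find_anchors and split_at_anchors, the min_run threshold, and the sample data are all hypothetical.

```python
from difflib import SequenceMatcher


def find_anchors(text_words, recognized, min_run=3):
    """Return (text_index, recognized_index) pairs where the text and the
    recognition output agree for at least min_run consecutive words."""
    rec_words = [word for word, _ in recognized]
    matcher = SequenceMatcher(a=text_words, b=rec_words, autojunk=False)
    return [(m.a, m.b) for m in matcher.get_matching_blocks()
            if m.size >= min_run]


def split_at_anchors(text_words, recognized, anchors):
    """Cut the text and audio at the anchors. Each segment carries the
    restricted vocabulary (only the words in its own text) that the next
    recognition pass over that segment would be limited to."""
    cuts = list(anchors)
    if not cuts or cuts[0][0] != 0:
        cuts.insert(0, (0, 0))  # first segment starts at the beginning
    segments = []
    for k, (t_start, r_start) in enumerate(cuts):
        t_end = cuts[k + 1][0] if k + 1 < len(cuts) else len(text_words)
        words = text_words[t_start:t_end]
        segments.append({
            "text": words,
            "audio_start": recognized[r_start][1],
            "vocabulary": set(words),
        })
    return segments


# Made-up data: the text, and a recognizer output containing two errors
# ("fax" for "fox", "dig" for "dog"), one word every half second.
text_words = ("the quick brown fox jumps over the lazy dog "
              "and runs away home").split()
heard = ("the quick brown fax jumps over the lazy dig "
         "and runs away home").split()
recognized = [(word, i * 0.5) for i, word in enumerate(heard)]

anchors = find_anchors(text_words, recognized)
segments = split_at_anchors(text_words, recognized, anchors)
# In a further iteration, each segment's audio would be recognized again
# using only segment["vocabulary"]; the acoustic model stays unchanged.
```

In this sketch, each pass yields shorter segments with smaller vocabularies, which mirrors the narrowing of the language model described above. Nothing in the loop adjusts the acoustic model.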
In the case of synchronizing text and audio versions of literary works, given the above-discussed uses of such files, accurately synchronizing the starting points of paragraphs and sentences is often more important than being able to synchronize individual words within sentences.
In view of the above discussion, it is apparent that there is a need for new methods and apparatus which can be used to accurately synchronize audio and text files. It is desirable that at least some methods and apparatus be well suited for synchronizing text and audio versions of literary works. It is also desirable that the methods and apparatus be capable of synchronizing the starting points of sentences and/or paragraphs in audio and text files with a high degree of accuracy.
To the extent that methods and/or apparatus for synchronizing audio and text files involve the use of speech recognition, there is also a need for speech recognition methods and apparatus which are well suited to use in an audio/text file synchronization process.