Historically, books and other literary works have been expressed in the form of text. Given the growing use of computers, text is now frequently represented and stored in electronic form, e.g., in the form of text files. Accordingly, in the modern age, users of computer devices can obtain electronic copies of books and other literary works.
Frequently, text is read aloud so that the content of the text can be provided to one or more people in an oral, as opposed to written, form. The reading of stories to children and the reading of text to the physically impaired are common examples where text is read aloud. The commercial distribution of literary works in both text and audio versions has been commonplace for a significant period of time. The widespread availability of personal computers and other computer devices capable of displaying text and playing audio files stored in electronic form has begun to change the way in which text versions of literary works and their audio counterparts are distributed.
Electronic distribution of books and other literary works in the form of electronic text and audio files can now be accomplished via compact discs and/or the Internet. Electronic versions of literary works in both text and audio versions can now be distributed far more cheaply than paper copies. While the relatively low cost of distributing electronic versions of a literary work provides authors and distributors with an incentive for distributing literary works in electronic form, consumers can benefit from having such works in electronic form as well.
Consumers may wish to switch between audio and text versions of a literary work. For example, in the evening an individual may wish to read a book. However, on their way to work, the same individual may want to listen to an audio version of the same literary work, starting from the point, e.g., sentence or paragraph, where they left off reading the night before. Consumers attempting to improve their reading skills can also find text and audio versions in the form of electronic files beneficial. For example, an individual attempting to improve his/her reading skills may wish to listen to the audio version of a book while having text corresponding to the audio being presented highlighted on a display device. Also, many vision-impaired or hearing-impaired readers might benefit from having linked audio and text versions of the literary work.
While electronic text and audio versions of many literary works exist, relatively few of these works include links between the audio and text versions needed to support the easy accessing of the same point in both versions of a work. Without such links between the text and audio versions of a work, it is difficult to easily switch between the two versions of the work or to highlight text corresponding to the portion of the audio version being played at a given moment in time.
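As an illustration only, the kind of link discussed above can be pictured as a table pairing the character offset at which each sentence starts in the text file with the time at which the same sentence starts in the audio file. The following Python sketch is not taken from any particular system; the names `SyncPoint`, `SyncIndex`, `audio_for_text` and `text_for_audio`, as well as the offsets and times used, are hypothetical and chosen purely for this example. It assumes that the audio follows the text in order, so that offsets and times increase together.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPoint:
    """One hypothetical link between the text and audio versions."""
    text_offset: int      # character offset where a sentence starts in the text
    audio_seconds: float  # time at which that sentence starts in the audio


class SyncIndex:
    """Illustrative lookup table built from sentence-start links."""

    def __init__(self, points):
        if not points:
            raise ValueError("at least one sync point is required")
        # Assumption: the reading follows the text order, so sorting by
        # text offset also sorts the audio times.
        self._points = sorted(points, key=lambda p: p.text_offset)
        self._offsets = [p.text_offset for p in self._points]
        self._times = [p.audio_seconds for p in self._points]

    def audio_for_text(self, text_offset):
        """Audio start time of the sentence containing text_offset."""
        i = max(bisect_right(self._offsets, text_offset) - 1, 0)
        return self._points[i].audio_seconds

    def text_for_audio(self, audio_seconds):
        """Text offset of the sentence being spoken at audio_seconds."""
        i = max(bisect_right(self._times, audio_seconds) - 1, 0)
        return self._points[i].text_offset


# Made-up sentence starts for a three-sentence passage.
index = SyncIndex([
    SyncPoint(text_offset=0, audio_seconds=0.0),
    SyncPoint(text_offset=58, audio_seconds=4.2),
    SyncPoint(text_offset=131, audio_seconds=9.75),
])

resume_time = index.audio_for_text(70)        # reader stopped mid second sentence
highlight_from = index.text_for_audio(10.0)   # audio is in the third sentence
```

With such a table, resuming playback where reading stopped, or highlighting the sentence currently being played, reduces to a single sorted lookup in either direction. The difficulty lies in producing the table, not in using it.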
Links or indexes used to synchronize audio and text versions of the same work may be manually generated via human intervention. However, such human involvement can be costly and time-consuming. Accordingly, there is a need for methods and apparatus for automating the synchronization of electronic text and audio versions of a work.
Previous attempts to automate the synchronization of electronic text files and audio files of the same work have focused primarily on the indexing of audio files corresponding to radio and other broadcasts with electronic text files representing transcripts of the broadcasts. Such indexing is designed to allow an individual viewing an excerpt from a transcript over the Internet to hear an audio clip corresponding to the excerpt. In such applications, the precision required in the alignment is often considered not to be critical and an error in alignment of up to 2 seconds is considered by some to be acceptable.
While the task of aligning audio files corresponding to TV and radio broadcasts and text transcripts of the broadcasts is similar in nature to the task of aligning text files of books or other literary works with audio versions made therefrom, there are important differences between the two tasks which arise from the differing content of the files being aligned and the ultimate use of the aligned files.
In the case of recordings of literary and other text documents which are read aloud and recorded for commercial purposes, a single reader is often responsible for the reading of the entire text. The reader is often carefully chosen by the company producing the audio version of the literary work for proper pronunciation, inflection, general understandability and overall accuracy. In addition, audio recordings of books and other literary works are normally generated in a sound-controlled environment designed to keep background noise to a minimum. Thus, commercial audio versions of books or other literary works intended to be offered for sale, either alone or in combination with a text copy, are often of reasonably good quality with a minimum of background noise. Furthermore, they tend to accurately reflect the punctuation in the original work. In addition, a single individual may be responsible for the audio versions of several books or stories, since commercial production companies tend to use the same reader to produce the audio versions of multiple literary works, e.g., books.
In the case of transcripts produced from, e.g., radio broadcasts, television broadcasts, or court proceedings, multiple speakers with different pronunciation characteristics, e.g., accents, frequently contribute to the same transcript. Each speaker may contribute to only a small portion of the total recording. The original audio may have a fair amount of background noise, e.g., music or other noise. In addition, in TV and radio broadcasts, speech from multiple speakers may overlap, making it difficult to distinguish the end of a sentence spoken by one speaker and the start of a sentence from a new speaker. Furthermore, punctuation in the transcript may be less accurate than desired given that the transcript may be based on unrehearsed conversational speech generated without regard to how it might later be transcribed using written punctuation marks.
In the case of attempting to synchronize text and audio versions of literary works, given the above discussed uses of such files, accurately synchronizing the starting points of paragraphs and sentences is often more important than being able to synchronize individual words within sentences.
In view of the above discussion, it is apparent that there is a need for new methods and apparatus which can be used to accurately synchronize audio and text files. It is desirable that at least some methods and apparatus be well suited for synchronizing text and audio versions of literary works. It is also desirable that the methods and apparatus be capable of synchronizing the starting points of sentences and/or paragraphs in audio and text files with a high degree of accuracy.