1. Field of the Invention
The present invention relates to the field of speech processing technology, and in particular to a method and apparatus for aligning texts, a method for automatically archiving multimedia resources, and a method for automatically searching multimedia resources.
2. Description of Related Art
At present, with the development of information technology, repositories for storing multimedia resources have grown ever larger. For example, news agencies and television stations normally hold voluminous broadcast news resources, typically including program videos and broadcast manuscripts, that need to be queried and managed. These historical program videos are typically not integrated with metadata for querying their contents and are thus inconvenient to query and manage. However, broadcast manuscripts, which are in text form, provide a natural interface for querying program videos because their contents are easy to search.
Manual query and management of these broadcast news resources is time-consuming and laborious, and is often impossible. Thus, it is desirable to enable automatic alignment between program videos and broadcast manuscripts. It is further desirable to enable automatic integration of program videos and broadcast manuscripts into a search-friendly multimedia resource. It is also desirable that a search engine can automatically search a broadcast manuscript for a queried word or phrase and play back the queried content from a video file aligned to the broadcast manuscript.
As another example, video or audio is now often recorded during a meeting or a speech. These meeting minutes in video/audio form may be saved on a server for future browsing. A manuscript used in the meeting or speech, for example a PowerPoint (PPT) manuscript, provides a natural interface for browsing the meeting minutes. To browse the manuscript while playing back the meeting minutes, the textual content of the manuscript must be synchronized with the speech content of the meeting minutes in video/audio form.
Current methods must first identify the corresponding video/audio and reference text pairs, then use a speech recognition engine to decode the audio data and obtain a recognition result. A dynamic programming algorithm is then used to find the maximum character match between the recognition result and the reference text, so as to realize sentence-level alignment. These methods are affected by the recognition rate and by the accuracy of the reference text. When the recognition rate is low or the reference text contains errors, the alignment is poor, and in the worst case no alignment result can be output. Moreover, these methods cannot obtain accurate time information.
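The character-matching step described above can be illustrated with a minimal sketch. It assumes that "maximum character match" is computed as a longest common subsequence (LCS) between the recognized text and each reference sentence; the function names `lcs_length` and `best_matching_sentence` and the length normalization are illustrative assumptions, not taken from any of the cited methods.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def best_matching_sentence(recognized: str, reference_sentences: list[str]):
    """Return (index, score) of the reference sentence sharing the most characters.

    The score is the LCS length divided by the reference sentence length
    (an assumed normalization). reference_sentences must be non-empty.
    """
    scores = [lcs_length(recognized, s) / max(len(s), 1) for s in reference_sentences]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

With a recognition result that is close to one reference sentence, the matching sentence is found; as the recognition result degrades, the scores of all candidates converge and the choice becomes unreliable, which is the weakness noted above.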
Other prior-art methods use phoneme-based forced alignment to align the speech in the video/audio with the reference text. However, these methods depend on the precision of sentence-level alignment and may fail to output an alignment result; moreover, errors in the reference document also degrade the alignment. Additionally, forced alignment relies on a phoneme-based acoustic model, which imposes a considerable computational load. Details on forced alignment can be found, for example, in E. F. Lussier, “A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition”, Lecture Notes in Computer Science, 2003, 2705: 38-77.
U.S. Pat. No. 5,649,060A1, “Automatic Indexing and Aligning of Audio and Text Using Speech Recognition”, discloses a method in which a speech recognizer produces a recognition result, and time information is then transferred to a correct text by aligning the recognition result with the correct text, thereby realizing automatic editing and search of audio. However, this method realizes alignment mainly through identical words, so its alignment quality relies heavily on the speech recognition quality, and it cannot be applied to aligning audio with an error-containing reference text.
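The idea of transferring time information through identical words can be sketched as follows. This is an illustration only, not the procedure of the cited patent: it assumes the recognizer outputs `(word, start, end)` tuples, uses Python's standard `difflib.SequenceMatcher` as a stand-in for the word alignment, and the function name `transfer_times` is hypothetical.

```python
from difflib import SequenceMatcher


def transfer_times(recognized, reference):
    """Copy start/end times from recognized words onto identical reference words.

    recognized: list of (word, start_seconds, end_seconds) tuples (assumed format).
    reference:  list of words from the correct text.
    Reference words with no identical recognized counterpart keep None times.
    """
    rec_words = [word for word, _, _ in recognized]
    matcher = SequenceMatcher(None, rec_words, reference, autojunk=False)
    timed = [(word, None, None) for word in reference]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = recognized[block.a + k]
            timed[block.b + k] = (reference[block.b + k], start, end)
    return timed
```

In this sketch, a reference word that the recognizer splits or misrecognizes (for example "everyone" recognized as "every" and "one") receives no time information, which illustrates why alignment based on word sameness degrades with recognition errors.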
United States patent application publication No. US2008294433A1 provides a text-speech mapping tool. Its method uses VAD (Voice Activity Detection) to obtain a candidate sentence ending point, then obtains the best match between the audio and the sentence through forced alignment, then aligns the next sentence, and so forth, until all mapping relationships are obtained, thereby finally realizing word-level alignment. As mentioned above, forced alignment is based on an acoustic model, which requires a considerable computational load and aligns poorly under complex conditions.
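As a rough illustration of how a VAD might propose candidate sentence ending points, the sketch below marks a candidate wherever the frame energy stays below a threshold for a minimum number of consecutive frames. The energy-threshold approach, the parameter names, and the function name `candidate_sentence_ends` are assumptions made for illustration; the cited publication's actual VAD is not described here.

```python
def candidate_sentence_ends(frame_energies, threshold, min_silent_frames):
    """Return frame indices where a sufficiently long pause begins.

    A pause is a run of at least min_silent_frames consecutive frames whose
    energy is below threshold; the index of the first frame of each such run
    is reported as a candidate sentence ending point.
    """
    ends = []
    run = 0  # length of the current run of silent frames
    for i, energy in enumerate(frame_energies):
        if energy < threshold:
            run += 1
            if run == min_silent_frames:
                ends.append(i - min_silent_frames + 1)
        else:
            run = 0
    return ends
```

Each candidate would then still have to be verified by forced alignment against the next reference sentence, which is where the computational load discussed above arises.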
The paper “Automatic Align between Speech Records and Their Text Transcriptions for Audio Archive Indexing and Searching”, INFOS2008, Mar. 27-29, 2008 Cairo-Egypt, by Jan Nouza, et al, discloses a method in which an associated language model is first obtained from a text, a recognition result Hi of relatively better quality is then obtained using that language model, the standard text is further divided into small segments by text alignment, and the segments that have not been accurately aligned are then subjected to forced alignment to obtain a best alignment result. The alignment quality is determined by the recognition result of an Automatic Speech Recognition (ASR) system, and forced alignment requires a considerable computational load.
For programs such as xiangsheng (traditional Chinese crosstalk) or talk shows, the language is quite free and includes many accents, so speech recognition performs poorly. Current alignment methods based on word similarity are likely unable to align such programs with their reference texts (for example, a xiangsheng manuscript or a play), and may even fail to output an alignment result. On the other hand, the computational load of methods based on forced alignment may be considerable, because under these circumstances it is hard to segment sentences accurately, and forced alignment of a longer speech segment requires a still greater computational load.
Therefore, it is desirable to provide an efficient method for aligning video/audio with a reference text, which can quickly achieve a good alignment result even with a low-accuracy recognition result and an error-containing reference text.