Various systems and methods for reporting multimedia events are known. For example, in conventional court reporting, a court reporter uses a stenotype machine to document all spoken words as a written transcript. While stenotype machines allow multiple keys to be pressed simultaneously in a single hand motion to record combinations of letters representing syllables, words, or phrases, they can be tedious to use or difficult to master. Consequently, fewer and fewer qualified stenographers who can report at high speed while maintaining high accuracy are available. Therefore, this method may not be suitable for event reporting in real time.
Some reporting systems use voice recognition technology. Such systems typically have a recorder for collecting speech audio and generating a digitized audio file, and a speech recognition engine for transcribing the digitized audio file into text. However, the accuracy of the text transcribed by existing systems is usually low, so human review or modification is often necessary to produce a text report with acceptable accuracy.
For example, speech recognition may not work well on the speech of the original speakers at an event due to a number of factors, including the speakers' imperfect pronunciation, their accents, their distance from the recorder, and their lack of training in the proper use of a speech recognition product. As such, an automatically generated report based on the original speech will require further editing by a reporter at a much later time, often with concurrent playback of the recorded audio file to ensure accuracy.
In some reporting, a reporter is on site at the event and repeats the speaker's utterances verbatim into a recorder coupled to a speech recognition device. Such a reporter is usually equipped with customized dictionaries containing context-dependent words or terminology to work more efficiently in specific types of reporting. However, the transcription accuracy of this method remains unsatisfactory, and subsequent editing is usually required to produce the report. Furthermore, current automatic speech recognition technology generally does not allow a real-time, flexible workflow. As a result, it has limitations in providing accurate real-time transcription and cannot be easily adapted to meet the requirements for multimedia reporting in multiple languages.
U.S. Pat. No. 6,816,468 discloses a teleconferencing system for providing transcription and translation services during a teleconference. However, the disclosed system uses conventional speech recognition software, which cannot provide transcription accurate enough to serve as an official written report. Further, the machine translation is performed on the transcribed text and thus may further reduce the accuracy of the text output to the user.
U.S. Pat. No. 6,385,586 discloses a language capture device that allows for translation into another language. This device converts a captured utterance into text, but requires manual verification of the correctness of the text before performing the translation. If the converted text is incorrect, the speech needs to be repeated. As such, the disclosed device does not provide accurate, real-time conversion from speech to text and is not suitable for producing official reports in real time.
Therefore, there remains a need for an improved system and method for multimedia event reporting with enhanced accuracy while meeting the requirements for reporting in real time.