Business professionals routinely use audio conferencing systems, rather than in-person meetings, to collaborate. Conference calls are now a mainstay of business life and continue to grow in popularity. Conference calling is used not only on a "stand-alone" basis, but also as part of video calls and "web conferences." Conference calls are often recorded and then transcribed, so that those who could not attend can review the conversation, or so that those who did attend have a written record of what was said. The transcription, usually performed by a human transcriptionist, is typically available hours or days after the conference call takes place.
Real-time teleconference transcription, which converts the conference call conversation to text while the teleconference is occurring and makes it accessible via a display and computer network (such as a web browser over the Internet), has a number of applications.
Real-time teleconference transcription would enable those with hearing impairments to participate. Latecomers could review what they had missed. An individual could readily monitor multiple conference calls by watching, rather than listening. Participants who needed to step away or were interrupted could easily catch up when they returned. Participants could refer back to earlier dialogue if they could not recall what had been said. Internet "chat" (entered via keyboard) could easily be mixed with spoken conversation.
Unfortunately, conference call transcription has been hampered by high cost, since historically it has been very labor-intensive. Automated speech-to-text technology (also called automatic speech recognition, or ASR) has been improving and shows increasing promise. However, there are challenges to using ASR for real-time conference call transcription. The technology generally does not perform well in the presence of double-talk (more than one party speaking at once) or background noise. ASR generally lacks the ability to identify who is talking (it cannot recognize individual voices). Many ASR algorithms cannot run in real time (the algorithm can take more than one minute to convert a minute of speech). Finally, ASR can be costly to run, both in the computer resources required and in potential royalties that must be paid.
Therefore, what is needed is a solution that addresses the challenges of conference call transcription, some of which are described above.