People with hearing challenges frequently make use of speech-to-text conversion software, also called speech recognition software. Their challenge is magnified when such software is applied to transcribe a group interaction such as a meeting or a panel discussion. Conventional speech recognition software focuses primarily on accuracy of transcription rather than on differentiating incoming voice signals, and is often tuned to the characteristics of a particular speaker. Such software therefore struggles to accurately transcribe the proceedings of a group interaction in which several individuals interact unpredictably. In addition, the capability to identify the speaker of each utterance and to capture this information in a compact transcript format that facilitates storage and management is highly desirable for this application.
It is further beneficial if the stored transcript can be used to regenerate the group interaction as audio data with some fidelity to the original. However, the output of existing text-to-speech conversion software is often monotonous, either because the transcript format does not record the emotional content of the speech or because the software cannot make use of such additional information.
In addition, group interactions often make use of, and generate information on, physical aids such as whiteboards. Conventional speech-to-text conversion software, by relying solely on the audio data, therefore neglects an important source of auxiliary information about the group interaction.