A normal audio or video file is in effect a serial access medium: in order to access certain audio (including speech) contained within it, it is necessary to listen to or watch the file at its original speed (or very close to it) until that data is found. Thus, for someone tasked with listening to an audio file or watching a video file to search for certain words or phrases (e.g. a paralegal in a discovery process), the task is time consuming and fatiguing. In contrast, a paper transcript, for example, can be quickly skim read by a human at rates in excess of 700 words per minute, i.e. in a fraction of the time and with a fraction of the effort.
A human transcription of audio, whilst generally accurate, is time consuming, often taking 6 to 8 hours to transcribe one hour of audio. Furthermore, whilst machine transcription of audio does exist, it is not perfect, and even if it were, it is often difficult to make full sense of a machine transcription unless the audio is played at the same time to give the transcription context.
It is known for lengthy machine or human transcripts to be provided with time stamps interspersed therein, for example indicating when a conversation, part of a conversation or a paragraph begins and how long it lasts.
It is also known from European Patent Application EP0649144A1 to analyse audio in order to align a written transcript with speech in video and audio clips, in effect providing an index for random access to corresponding words and phrases in the video and audio clips.
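A time-aligned transcript of this kind can be sketched very simply. The following is a minimal illustration, with hypothetical data and structure (the names and alignment values are not taken from EP0649144A1): each recognised word is stored with its start time in the recording, and an inverted index then allows a text search to jump straight to the corresponding point in the audio or video clip, rather than replaying the clip serially.

```python
from collections import defaultdict

# Hypothetical alignment output: (word, start time in seconds) pairs,
# as might be produced by aligning a transcript with the audio.
aligned_transcript = [
    ("the", 0.0), ("meeting", 0.4), ("opened", 0.9),
    ("with", 1.3), ("the", 1.5), ("quarterly", 1.7),
    ("figures", 2.3),
]

# Build an inverted index from each word to every time it was spoken.
index = defaultdict(list)
for word, start in aligned_transcript:
    index[word.lower()].append(start)

def seek_points(phrase):
    """Return the start times at which the first word of the phrase occurs,
    i.e. candidate points to seek to in the recording."""
    first = phrase.lower().split()[0]
    return index.get(first, [])

print(seek_points("quarterly figures"))  # [1.7]
print(seek_points("the"))                # [0.0, 1.5]
```

The essential point is that the index converts the serial medium into a randomly accessible one: a word or phrase query yields time offsets directly, without listening through the recording.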
Automated Speech Recognisers (ASRs) receive an audio information signal representative of the spoken word and output a transcript of the recognised speech. However, the transcripts are grammatically unstructured, and it is therefore not possible to gain any contextual understanding of the spoken word, or to derive other potentially important information about it, from the transcript.
Moreover, determining and monitoring the context of the spoken word in, for example, telephone conversations is particularly problematic for automated systems, because telephone conversations are more fragmented and broken than the spoken word in, for example, presentations, dictations and face-to-face conversations. Also, when monitoring telephone conversations for unlawful or adverse practices, parties to the telephone conversation may use coded words or covert behaviour.
Whilst automated speech recognisers attempt to screen out variations in the pronunciation of a word, so as to arrive at the same recognised word irrespective of the speaker and the speaker's mood or emphasis, downstream added-value analysis of recognised speech benefits from the presence of such variations, for example in the recognition of emotional intensity. It is therefore desirable to provide a method and system which preserves audio information beyond an automated speech recognition phase of analysis.
It is therefore desirable in a number of industries for there to be a system and method for indexing unstructured text representative of the spoken word and for determining and monitoring its context.
It is also desirable to optimise the transducing of sound energy, such as an audio information signal representative of the spoken word, into a digitised signal, so as to optimise the use of a system and method for indexing recognised speech after automated speech recognition.