There has been a continuing requirement for captioning and subtitling services to display a text version of something which is spoken. Such captioning improves accessibility, allowing those suffering from hearing disabilities to follow a broadcast or performance by being able to read a transcript of a speech being delivered. Captioning is provided in a number of different contexts. In one context, films may be provided with captions which are synchronised to the verbal delivery of lines during the film. Such synchronisation is relatively easy to perform, because the script and timing are known in advance and synchronised captions can be prepared for delivery at the same time as the moving image of the film or its audio track. Captions have long been available as a service which can be accessed by people with a hearing disability. So-called ‘closed captions’ are viewed only when the service is selected by a viewer. More recently, so-called smart glasses have become available, by means of which captions can be delivered directly to members of a cinema audience wearing them. Subtitles are also known, for example for conveying a foreign language text of speech in the film.
In theatre, live performances present particular challenges for captioning. At present, captioning services are triggered manually for certain accessible theatre performances. Predefined captions are created based on the script which is to be spoken. The captions capture dialogue for display to allow a deaf or hard of hearing person to follow the performance. The captions are manually triggered for display by a caption cuer whose task it is to follow the performance and to press a button to trigger the display of each caption synchronised with the oral delivery of the lines of the performance. In live theatre, the timing of captions needs to accommodate variations in speech, timing, breaks and noises other than speech, which may be intentional in the performance or not. A person (caption cuer) can accommodate such variations by watching the performance and then providing the caption at the correct time.
Subtitles are also available in the case of live television broadcasts, again to assist the deaf or hard of hearing so that they may follow a broadcast even if they cannot fully hear the speaker delivering the broadcast. Text for the subtitles may be created by a human intermediary who follows the broadcast and dictates suitable summary language (including text and punctuation) to a speech recognition system to provide a real-time transcript which can be displayed. Alternatively, machine shorthand can be used to generate the text. In such contexts, there are inevitably delays between the broadcast being delivered and the captions being displayed on a screen. Attempts have been made to play the broadcast soundtrack into a speech recognition system, the speech recognition system being configured to provide a transcript of the speech.
The term speech recognition system is used herein to denote a system capable of picking up human speech and converting it into text to be displayed, the text corresponding to the speech which has been delivered. A speech recognition system might be used as part of a speech follower. Speech followers are used to synchronise an audio signal carrying speech to a corresponding script. This process may be performed by computer processing of a media file in order to time the script to the corresponding audio signal. The process may also be performed in real time to control the rate at which a script is displayed to a speaker, for example as a teleprompt. That is, a speaker is reading from a script in a live context, and a speech follower assists in making sure that the rate at which the script is displayed to the speaker matches the speaking rate of the speaker. These systems thus display part of a script corresponding to the location that they have detected the speaker has reached in the script.
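By way of illustration only, the general idea of a speech follower tracking a location in a script can be sketched as follows. This is a minimal sketch under stated assumptions and does not represent any particular known system: the names `tokenize` and `ScriptFollower`, the `window` and `context` parameters, and the majority-match rule are all hypothetical, and the recognised text is supplied as plain strings rather than by an actual speech recognition system.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class ScriptFollower:
    """Hypothetical follower that tracks a position within a known script."""

    def __init__(self, script: str, window: int = 8, context: int = 3):
        self.words = tokenize(script)
        self.window = window      # how far ahead of the position to search
        self.context = context    # how many recently heard words to match
        self.position = 0         # index of the next script word expected

    def update(self, recognised: str) -> int:
        """Advance the position using the latest recognised text.

        The last `context` recognised words are compared with the script
        at each start index in the search window. The position moves only
        if more than half of those words match, so unrelated noise leaves
        the position unchanged.
        """
        heard = tokenize(recognised)[-self.context:]
        if not heard:
            return self.position
        best_score, best_end = 0, self.position
        last_start = min(self.position + self.window, len(self.words) - 1)
        for start in range(self.position, last_start + 1):
            candidate = self.words[start:start + len(heard)]
            score = sum(1 for a, b in zip(heard, candidate) if a == b)
            if score > best_score:
                best_score, best_end = score, start + len(candidate)
        if best_score * 2 > len(heard):
            self.position = best_end
        return self.position


follower = ScriptFollower("To be or not to be that is the question")
first = follower.update("to be or")
second = follower.update("not to be")
after_noise = follower.update("cough")
final = follower.update("that was the question")
```

In this sketch the position advances when the recognised words match the script, stays where it is for an unrelated utterance, and tolerates a single misrecognised word. Such a simple matching rule assumes reasonably clean recognised text delivered at a steady pace.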
Attempts have been made to provide a speech follower (using speech recognition) in the context of live theatre, to display a script in time with its delivery. The idea is to use a speech follower to follow the speech which is being spoken during the live performance, and to display the script at the correct time corresponding to the speech. However, it is very difficult to implement such speech followers in the context of live theatre, due to the many variables that can occur in live theatre. Previously, live speech follower systems have been used in studio contexts with good quality audio. They are not suited generally to live theatre, which poses a number of different challenges. Because of the theatre surroundings, rather than a studio context, there may be poor audio quality. The system has to cope with a number of different styles and speeds of speech and therefore cannot be trained to a standard style and speed of delivery. It is known that speech following systems perform more accurately when they can be trained to a standard style and speed of delivery. Theatres are subject to general background noise, which may be part of the performance itself, or may be unexpected. There may be long pauses between utterances on stage, while the action proceeds without dialogue. Utterances in theatres may consist not only of words but also other utterances such as exclamations or cries or whimpers, and may be unusually loud or quiet. Speech follower systems have previously generally been designed for a context where the speech (more or less) consists of words and is spoken at an even pace and at a reasonably even volume. Actors may have different accents, or deliberately be speaking in an affected way. The performance may include non-verbal information which nevertheless is part of the performance and which a person who is hard of hearing would like to know something about.
For all of these reasons, it has not been possible to date to successfully implement a speech following system to automatically cue captions for display in the generality of performances. So far, captioning services which have been provided, and which are increasingly in demand, have been manual.