This invention relates to methods of segmenting and transcribing recordings of speech and speech components of video, audio or multimedia files or transmissions, such recordings and components herein collectively being called “speech media”. “Speech” as used herein includes both spoken audio and any other forms of oral delivery which may be interpreted as utterances capable of being represented textually.
Audio and video media in their “raw” state are opaque in the sense that, in order to know what is in them, a person has to listen to the material (and watch in the case of video). Additional information can be associated with the audio or video by tagging the media as an entity with titles, copyright, author, keywords and other information (for example, as is done with media metadata associated with MP3 files under the ID3 standard). In addition, timed information including speech text information, herein collectively being called “timed media metadata”, can be associated with the media file, which allows suitable systems to display information such as captions and subtitles (as well as other metadata if desired) at the correct time in the media.
Whether tagging the whole file or providing information regarding timed events in the media, the associated timed media metadata may either be embedded in the media file itself (if suitable tools, formats and players exist), or held separately in Timed Text files (in one of many different formats, including standardized formats such as the W3C Timed Text Markup Language (TTML, also known as the Distribution Format Exchange Profile, DFXP) or the Synchronized Multimedia Integration Language (SMIL), as well as proprietary formats) or in databases.
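By way of illustration only, a separate Timed Text file of the kind mentioned above might be produced along the following lines. This is a minimal sketch and not a validated TTML document: the function name `build_ttml` and the (begin, end, text) cue tuples are assumptions introduced for the example, and only the TTML namespace, the `tt`/`body`/`div`/`p` elements and the `begin`/`end` attributes are taken from the W3C format.

```python
import xml.etree.ElementTree as ET

TTML_NS = "http://www.w3.org/ns/ttml"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ttml(cues, lang="en"):
    """Serialise (begin_seconds, end_seconds, text) cues as a bare TTML string.

    Hypothetical helper for illustration; styling, layout and metadata
    elements that a real caption file would usually carry are omitted.
    """
    # Emit TTML elements without a prefix (default namespace).
    ET.register_namespace("", TTML_NS)
    tt = ET.Element(f"{{{TTML_NS}}}tt", {f"{{{XML_NS}}}lang": lang})
    body = ET.SubElement(tt, f"{{{TTML_NS}}}body")
    div = ET.SubElement(body, f"{{{TTML_NS}}}div")
    for begin, end, text in cues:
        # Offset-time expressions in seconds, e.g. "1.500s".
        p = ET.SubElement(
            div,
            f"{{{TTML_NS}}}p",
            {"begin": f"{begin:.3f}s", "end": f"{end:.3f}s"},
        )
        p.text = text
    return ET.tostring(tt, encoding="unicode")


if __name__ == "__main__":
    print(build_ttml([(0.0, 1.5, "Hello."), (1.5, 4.0, "Welcome to the talk.")]))
```

Each `p` element carries one caption together with the interval during which a suitable player would display it.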
A timed portion of the speech media (which may also include temporal offset, playback rates and references to original media), taken in conjunction with the textual and other metadata associated with that portion (which may also include detailed timing information at a shorter interval), is herein referred to as a “segment”.
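For concreteness, a segment as defined above could be represented by a record such as the following. This is a sketch under stated assumptions: the class name `Segment`, its field names and the helper `format_timestamp` are hypothetical and chosen for illustration only; the specification does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Segment:
    """One timed portion of speech media plus its associated metadata."""

    start: float  # seconds from the start of the media
    end: float  # seconds from the start of the media
    text: str  # transcribed speech for this portion
    media_ref: str = ""  # reference to the original media (assumed field)
    offset: float = 0.0  # temporal offset applied on playback (assumed field)
    playback_rate: float = 1.0
    # Optional finer-grained timing: (time_in_seconds, word) pairs.
    word_timings: List[Tuple[float, str]] = field(default_factory=list)
    # Any other metadata, e.g. speaker name or an associated image URL.
    metadata: Dict[str, str] = field(default_factory=dict)

    @property
    def duration(self) -> float:
        return self.end - self.start


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm, a common caption timestamp form."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


if __name__ == "__main__":
    seg = Segment(start=61.25, end=64.0, text="Good evening.")
    print(format_timestamp(seg.start), "->", format_timestamp(seg.end), seg.text)
```

Keeping the word-level timings optional lets the same record serve both coarse captions and more detailed timed transcripts.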
Media files and associated metadata may also be grouped into playlists or channels, which allow display, selection and playback of a group of media. Where such playlists can be associated with applicable timed media metadata, then the whole playlist effectively embodies the timing, textual and other metadata for applications of this method.
There is substantial value in enabling the location of media by search because effective search by search engines provides revenue opportunities from advertisers and sponsors. From the consumer's perspective (a consumer being anyone who is seeking to watch or listen to media), there is substantial value in being able to find suitable video and audio content through textual search of the contents, rather than relying on titles and overall media tags.
In addition, once the media is found, consumers may (with suitable players) seek to particular time positions within the media playback on the basis of a text search within the timed media metadata, which allows the player to commence playback at the position of interest (rather than the consumer needing to scrub through the playback). This allows the consumer to experience the relevant portions of the media rather than watch irrelevant sections.
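The text-based seeking just described can be sketched as follows. The names `TimedText` and `find_positions` are hypothetical, and a simple case-insensitive substring match is assumed; a practical player might instead use tokenised or fuzzy matching.

```python
from typing import List, NamedTuple


class TimedText(NamedTuple):
    """One entry of timed media metadata: an interval and its text."""

    start: float  # seconds from the start of the media
    end: float
    text: str


def find_positions(entries: List[TimedText], query: str) -> List[float]:
    """Return the start times of all entries whose text contains the query.

    Matching is a case-insensitive substring test (an assumption made
    for this sketch).
    """
    needle = query.casefold()
    return [e.start for e in entries if needle in e.text.casefold()]


if __name__ == "__main__":
    captions = [
        TimedText(12.0, 15.0, "The budget was announced today."),
        TimedText(15.0, 19.5, "Reaction has been mixed."),
        TimedText(95.5, 99.0, "Critics say the Budget goes too far."),
    ]
    # A player could seek to any of the returned positions.
    print(find_positions(captions, "budget"))
```

A player would then commence playback at one of the returned start times instead of requiring the consumer to scrub through the media.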
Currently, legislation in many jurisdictions requires broadcast media to provide suitable assistance for Access, with the result that timed text metadata is available, for example as Closed Captions.
In the case of video and audio material which is delivered on the Web (for example, on the BBC iPlayer, Google's YouTube™ service, and other online video publishing services which support captions or subtitles), the prevalence of material which has associated metadata available is limited (as is the legislative position). This is despite the fact that the availability of this metadata is even more valuable than in the Broadcast situation: not only does it assist with Access, but it also allows the media to be found more easily by search engines and makes it possible for the user to quickly locate the section of relevance within the media.
In addition, there are possibilities for rich and varied metadata delivery (e.g. associated images) with the timed media metadata which enhances its engagement and value for the user and makes it more likely for the user to absorb the desired message, or to “click-through” onto other places of relevance. Also, it is possible to associate the current textual segment metadata with the context for relevant advertisements. The timed association of materials also assists in a pedagogical context.
The main impediments to adding the rich metadata to audio and video material are the complexity and effort required to do so with current production and publishing systems.
There are a variety of current systems that assist with the production of captions, subtitles and various timed text formats. For example, captions and markers can be added manually to the timeline of video/audio production systems; in the case of video, using systems such as Microsoft Expression® Encoder, Sony Vegas™ Pro, Apple® Final Cut Pro® or Adobe Premiere®. Alternatively, dedicated caption and subtitle systems can be used, requiring the user to mark the timing of events as well as to add the metadata, such as the transcription, that will make up the timed media metadata. The results can either be imported into the media production tools to create embedded timed media information, or else saved as Timed Text files which are associated with the media in the players. Additionally, stenography requires special transcription terminals operated by trained transcribers, and is particularly suited to live captioning. Also, Automatic Speech Recognition (ASR) systems are able to produce timed text, in which the speech is recognised according to various speech models and the expected words are generated. Because of the inaccuracy of ASR systems, one approach is to use ASR trained to an individual transcriber's speech patterns, and to have that individual re-read what is said in the audio/video material; the recognised re-reading is inserted into the caption, giving higher quality results because the recognition is superior.
It is also possible to derive timed captions from existing video material by use of Optical Character Recognition techniques. This relies of course on an existing transcript being already embedded in the material.
The current prevalence of speech media that have been associated with timed media metadata is low. This reflects the challenges of time and/or expense in using current systems. In the case of automated ASR, the quality of the resulting transcript is inadequate for many serious applications, and the re-reading approach is also time-consuming.