A media presentation, such as a broadcast of an event, may be understood as a stream of audio/video frames (a live media stream). It is desirable to add information to the media stream to enhance the viewer's experience; this is generally referred to as annotating the media stream. Annotating a media stream is a tedious and time-consuming task for a human. Visual inspection of text, players, balls, and field/court position is mentally taxing and error-prone. Keyboard and mouse entry is needed to enter annotation data, but it is likewise error-prone and mentally taxing. Accordingly, systems have been developed to at least partially automate the annotation process.
Pattern Recognition Systems (PRS), e.g., Computer Vision (CV) or Automatic Speech Recognition (ASR), process media streams in order to generate meaningful metadata. Recognition systems operating on natural media streams always perform with less than absolute accuracy due to the presence of noise. CV is notoriously error-prone, and ASR is usable only under constrained conditions. Measuring the accuracy of a system requires knowledge of the correct PRS result, referred to here as Ground Truth Metadata (GTM). The development of a PRS requires the generation of GTM that must be validated by Human Annotators (HA). GTM can consist of positions in space or time, labeled features, events, text, region boundaries, or any data with a unique label that allows referencing and comparison.
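By way of illustration only, the comparison of PRS output against GTM might be sketched as follows. The sketch assumes that both the PRS results and the GTM are stored as mappings from a unique label to a value; the function name, the label format, and the sample values are hypothetical and are not taken from any particular system.

```python
def score_against_gtm(prs_output, gtm):
    """Return (precision, recall) of PRS results measured against GTM.

    Both arguments are assumed to be dicts keyed by a unique label,
    so that each PRS result can be looked up and compared directly.
    """
    # A PRS result counts as correct only if the GTM holds the same label
    # with the same value.
    correct = sum(
        1 for label, value in prs_output.items()
        if label in gtm and gtm[label] == value
    )
    precision = correct / len(prs_output) if prs_output else 0.0
    recall = correct / len(gtm) if gtm else 0.0
    return precision, recall


# Hypothetical sample data: labels combine a frame number and a feature name.
gtm = {
    "frame_120/jersey": "23",
    "frame_120/score": "14-7",
    "frame_450/jersey": "10",
}
prs_output = {
    "frame_120/jersey": "23",    # agrees with GTM
    "frame_120/score": "14-1",   # misread digit
    "frame_450/jersey": "10",    # agrees with GTM
    "frame_900/jersey": "8",     # no corresponding GTM entry
}

precision, recall = score_against_gtm(prs_output, gtm)
```

In this sample, two of the four PRS results agree with the GTM, and two of the three GTM entries are recovered. Exact equality is used for simplicity; spatial or numeric metadata would likely need a distance-based comparison instead.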
The timestamp of a piece of GTM may not be precise, or may have to be estimated based on its time of arrival relative to a live broadcast. GTM with imprecise timestamps cannot be directly compared to PRS output, which does have precise timestamps.
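The difficulty can be illustrated with a short sketch. It assumes that events are represented as (time in seconds, label) pairs and shows that exact comparison finds no agreement, whereas a tolerance window, used here purely as an illustrative workaround, recovers the correspondence. The function name, the tolerance values, and the sample events are hypothetical.

```python
def match_within_tolerance(gtm_events, prs_events, tolerance_s):
    """Greedily pair each GTM event with the nearest unused PRS event
    that has the same label and lies within tolerance_s seconds."""
    matches = []
    used = set()  # indices of PRS events that are already paired
    for g_time, g_label in gtm_events:
        best_index, best_gap = None, None
        for i, (p_time, p_label) in enumerate(prs_events):
            if i in used or p_label != g_label:
                continue
            gap = abs(p_time - g_time)
            if gap <= tolerance_s and (best_gap is None or gap < best_gap):
                best_index, best_gap = i, gap
        if best_index is not None:
            used.add(best_index)
            matches.append(((g_time, g_label), prs_events[best_index]))
    return matches


# Hypothetical sample data: GTM times are estimated, PRS times are precise.
gtm_events = [(62.0, "goal"), (130.0, "foul")]
prs_events = [(58.4, "goal"), (131.2, "foul"), (300.0, "goal")]

# Direct comparison fails because no timestamps coincide exactly.
exact_matches = [event for event in gtm_events if event in prs_events]

# A 5-second window pairs both GTM events; a 1-second window pairs none.
loose_matches = match_within_tolerance(gtm_events, prs_events, 5.0)
tight_matches = match_within_tolerance(gtm_events, prs_events, 1.0)
```

The choice of window is itself a source of error: too narrow a window discards valid correspondences, and too wide a window may pair unrelated events.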
A compilation of acronyms used herein is appended to this Specification.
There remains a need for a system that can reduce the human time and effort required to create the GTM.