In modern television systems the sound portion of a television program is frequently conveyed with the video signal via multiple channels. For example, a typical stereo television system could include a video channel and left and right sound channels. The well known intent of using left and right sound channels is to provide spatially located sound to the viewer, whereby sounds created by images at a given location on the television screen are perceived by the viewer as coming from that location.
The corresponding images and sounds are known as mutual events or MUEVs. When the audio and image MUEVs as perceived by the viewer do not properly correspond, the result is annoying because the sound is perceived to come from a different location than the image making the sound. This is especially true for dialogue (e.g. speech of a person in a one way or two way conversation with another) when the speaker is seen in a different location than the one the sound comes from. Consider for example a two way conversation between two newscasters, one on the right of the screen and one on the left. If the left and right sound channels are reversed, the right speaker's speech will appear to come from the left side of the screen and vice versa.
In systems including images and sound, it is important that mutual events or MUEVs in audio and video are perceived by the viewer as being spatially aligned. MUEVs are those events in the video and sound which have a high probability of occurring together, for example the instant change of direction of a thrown baseball and the crack of the bat hitting the ball. Other MUEVs include the shape and/or movement of a person's lips and the sound being created. The video lip shapes are referred to as visemes or the visual MUEV and the sounds as phonemes or the sound MUEV. MUEVs however are not just visemes and phonemes but encompass simultaneously occurring events which have a probability of being related, such as the above baseball direction and bat crack example.
In other systems, both audio only (such as radio) and audio video (such as television), it is desired to convey dialogue in a particular channel or channels. Because sound signals in modern audio only and audio video acquisition and production systems are frequently recorded and carried by multiple sound channels, there is a possibility of the dialogue being misplaced, that is, of the dialogue being carried by the wrong audio channel. It is also possible for dialogue to be lost entirely, for example when sound is acquired via a sound effects channel which is subsequently discarded.
As used in this specification and claims, if a system sound channel conveys the proper sound signal (e.g. dialogue in the proper channel(s) and/or leading the viewer to perceive sounds as properly corresponding to the image location), the sound channel or signal is said to (properly) track, and if it does not convey the proper sound signal the channel or signal is said to mistrack. For example, if the left and right sound channel signals are reversed, that is, the left channel carries the right sound signal and vice versa (sometimes called swapping), the sound signals mistrack. Likewise, if the dialogue sound signal is missing from the dialogue sound channel(s), the sound signal mistracks.
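The track and mistrack terminology above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the intended and actual channel assignments are already known and can be written as simple mappings from channel name to signal name; the function names `find_mistracking` and `is_swap` and the signal labels are hypothetical and do not come from this specification. The sketch only restates the definition and does not show how a mistracking signal would be detected from audio and video content.

```python
from typing import Dict, List, Optional

Assignment = Dict[str, Optional[str]]  # channel name -> signal carried (None if absent)

def find_mistracking(expected: Assignment, actual: Assignment) -> List[str]:
    """Return the channels whose carried signal differs from the intended one."""
    return sorted(ch for ch, sig in expected.items() if actual.get(ch) != sig)

def is_swap(expected: Assignment, actual: Assignment, a: str, b: str) -> bool:
    """True when channels a and b carry each other's intended signals."""
    return (
        expected[a] != expected[b]
        and actual.get(a) == expected[b]
        and actual.get(b) == expected[a]
    )

# Hypothetical two channel (stereo) assignments.
expected = {"left": "left_sound", "right": "right_sound"}
swapped = {"left": "right_sound", "right": "left_sound"}
missing = {"left": "left_sound", "right": None}

print(find_mistracking(expected, expected))        # both channels track
print(find_mistracking(expected, swapped))         # both channels mistrack
print(is_swap(expected, swapped, "left", "right")) # the reversal is a swap
print(find_mistracking(expected, missing))         # a missing signal also mistracks
```

In this sketch both a reversal and a missing signal are reported as mistracking, which mirrors the two examples given in the definition.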
As another example of multiple channel sound systems, the sound of the performers in the television program is conveyed via left and right sound dialogue channels whereas sound effects such as music and other non speech sounds are conveyed by left and right sound effects channels. Another example is 5.1 channel sound, sometimes referred to as 3-2 stereo, with a center dialogue channel, front left and right dialogue channels, rear left and right effects channels, and a low frequency effects channel.
Yet another example of a multiple channel sound system is the Japan Broadcasting Corporation (NHK) experimental Super Hi-Vision television having 22.2 sound channels. These channels are grouped relative to the viewer as 9 above the ear, 10 ear level, 3 below the ear and 2 low frequency effects channels. The various sound channels surround the viewer to provide a highly realistic audio sensation where the sound can be perceived as coming from anywhere within about 300 degrees vertically and 360 degrees horizontally, depending on the location of the viewer relative to the sound transducers (e.g. speakers).
Due to widespread audio processing, for example program conversion between different sound systems, and other problems such as poor microphone placement, incorrect wiring, equipment failures and operator error, the sound signals often find their way into the wrong sound channels. For example, having the dialogue carried in the wrong channel can cause problems for the viewer ranging from annoying sound to loss of dialogue audio.
For example, if the left and right channels in a two channel system are reversed, the location of the sound does not match the location of the image, such as when a person on the left of the image frame is talking but the sound comes from the right sound transducer (speaker). As another example, consider the NHK system, where the sound is intended to be perceived as coming from various directions around the viewer, including ear level, higher and lower directions, to correspond to the images which are displayed to the viewer (or were previously or are about to be displayed to the viewer). In this system, if a sound signal is placed in the wrong channel various annoying effects can occur, such as a speaking person located to the viewer's lower right being heard behind, above, to the left or in some other direction different from where the viewer sees the image of the person speaking.
It is also important that sound corresponding to images that are not displayed, not yet displayed, or previously displayed to the viewer be in the correct channel. For example, consider a television scene of a person in the middle of the frame carrying on a conversation with an unseen person to the right side. If the center dialogue channel and the right front channel are reversed, the conversation will appear unnatural.
As another example, consider a television program that conveys an airplane flying at low level from behind the viewer, to above the viewer, and on to be displayed in front of the viewer. The sound will start from behind, progress to above and further progress to in front of the viewer. In this instance the sound from behind and from above will correspond to an image not yet seen by the viewer. Of course the opposite will happen if the aircraft is flying from the front of the viewer to behind the viewer. In this instance the sound from above and behind the viewer corresponds to an image previously displayed.
In all situations it is important to have the sound perceived by the viewer as corresponding to the location of the image creating the sound, i.e. tracking the image location. This is true even when the image is not currently displayed, for example because it is in a location that is not being displayed at that instant, such as behind the viewer.
It is of course possible that the image is never displayed but the sound signals nevertheless need to track. As an example similar to that above, consider a conversation between two people, one located in front of the viewer and seen on the image frame, the other located behind the viewer, walking from side to side, and never seen. If the second person's sound signal mistracks, the viewer could hear the sound from behind and to the viewer's right whereas he would see the first person looking toward the viewer's left. If the unseen person were walking about as he talked, the viewer would see the first person following the unseen person, but if the unseen person's sound signals mistrack, the visual and audio cues to the unseen person's location would be inconsistent.
In television, film and other systems which provide images to the viewer in more than one direction, such as wide screen (e.g. 16×9), specialized surround projection systems (e.g. IMAX), or systems providing three dimensional or simulated three dimensional images (e.g. 3D-TV), it is likewise important that the sound matches the viewer's perceived image location. When the sound is not present in the correct sound channel this perception is negatively affected. Mistracking sound signals will cause conflicting audio and visual cues which can be annoying to the viewer.
As another example of problems with sound not being in the proper channel, when the dialogue audio is carried in the wrong channel or not carried in all the proper channels, a loss of dialogue can occur, for example when the television program is passed through equipment which is incapable of handling all of the audio channels and those containing dialogue are discarded. Such is the case when a television program having center, left and right dialogue channels and rear effects channels is passed through an audio signal processing device that can only handle left and right dialogue channels. If the dialogue is located only in the center dialogue channel and the audio signal processing device discards or otherwise never utilizes the center channel, that dialogue will be lost. Generally, whenever there is a mistracking sound signal there is a risk of important sounds being lost.
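The loss of dialogue described above can be sketched as follows. This is a simplified model under stated assumptions: each channel is represented only by a set of labels for the sound types it carries, and the device is modeled as simply discarding every channel it cannot handle. The names `passthrough` and `dialogue_survives` and the channel labels are hypothetical and are used only for illustration.

```python
from typing import Dict, Set

Program = Dict[str, Set[str]]  # channel name -> sound types carried by that channel

def passthrough(program: Program, supported: Set[str]) -> Program:
    """Model a device that keeps only the channels it supports and discards the rest."""
    return {ch: content for ch, content in program.items() if ch in supported}

def dialogue_survives(program: Program) -> bool:
    """True if at least one remaining channel still carries dialogue."""
    return any("dialogue" in content for content in program.values())

# Hypothetical program in which dialogue is present only in the center channel.
program = {
    "center": {"dialogue"},
    "left": set(),
    "right": set(),
    "rear_left": {"effects"},
    "rear_right": {"effects"},
}

# Hypothetical device that can only handle the left and right channels.
output = passthrough(program, supported={"left", "right"})

print(dialogue_survives(program))  # dialogue is present before the device
print(dialogue_survives(output))   # dialogue is lost after the device
```

Had the dialogue also been carried in the left and right channels, the same device would have passed it through, which is why carrying dialogue in all the proper channels matters.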
In the prior art it is known to detect the presence of audio in one or more audio channels and sound an alarm if the channel is silent for a predetermined period of time. One such system is described by Basse in U.S. Pat. No. 7,424,160 wherein in FIG. 7 the flow diagram of an audio silence detector is shown. Basse's system does not distinguish the type of audio which is present. Consequently, if dialogue is missing from a dialogue channel which is nevertheless carrying sound effects, Basse's invention would not catch the problem.
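The limitation of silence detection can be illustrated with a generic sketch. This is not a reproduction of Basse's flow diagram, which is not set out here; it is a minimal silence detector written under assumptions chosen for illustration: audio arrives as blocks of samples, silence is judged by comparing each block's RMS level to a threshold, and the predetermined period is expressed as a count of consecutive silent blocks. The names `rms` and `silence_alarm` and the threshold values are hypothetical.

```python
import math
from typing import Sequence

def rms(block: Sequence[float]) -> float:
    """Root mean square level of one block of samples (0.0 for an empty block)."""
    if not block:
        return 0.0
    return math.sqrt(sum(x * x for x in block) / len(block))

def silence_alarm(
    blocks: Sequence[Sequence[float]],
    threshold: float = 0.01,      # assumed silence level
    max_silent_blocks: int = 3,   # assumed "predetermined period", in blocks
) -> bool:
    """True if the level stays below threshold for max_silent_blocks consecutive blocks."""
    silent_run = 0
    for block in blocks:
        if rms(block) < threshold:
            silent_run += 1
            if silent_run >= max_silent_blocks:
                return True
        else:
            silent_run = 0  # any non-silent block resets the timer
    return False

# A truly silent channel raises the alarm.
silent_channel = [[0.0, 0.0, 0.0, 0.0]] * 5
# A dialogue channel that wrongly carries only sound effects is not silent,
# so a level-based detector raises no alarm even though the dialogue is missing.
effects_only = [[0.5, -0.5, 0.5, -0.5]] * 5

print(silence_alarm(silent_channel))
print(silence_alarm(effects_only))
```

Because the decision depends only on signal level, any detector of this general kind treats sound effects and dialogue alike.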
Basse does mention that system operators desire to monitor their systems to ensure that quality audio and video reaches the viewers, and describes prior systems such as cable TV systems where employees monitored the quality. Basse also points out that hiring employees to monitor every channel in a system can be expensive and notes several problems with utilizing employees to monitor modern systems consisting of as many as 800 TV channels.
Generally, as Basse suggests, in television, film and other systems using multiple channel sound it is desirable to have a human operator monitor the sound to ensure that each sound signal has been properly assigned to its corresponding channel. As the number of sound channels increases the task of monitoring becomes more difficult, and as the number of systems to be monitored increases, such as in the aforementioned 800 TV channel systems, the number of operators required for proper monitoring increases dramatically.
Typically, due to the costs involved, proper monitoring of dialogue presence and spatial sound location is not performed in modern systems. Instead, the monitoring task falls to a single operator who performs occasional checking. The use of occasional checking leads to errors not being discovered promptly, and in some systems they may not be discovered for an entire program.
What is needed is an automated system which can monitor particular sound types such as dialogue to ensure they are carried properly, and which can monitor the spatial location of sound to ensure it properly matches the corresponding video location.