This invention relates to the field of multimedia temporal information streams, such as video and audio, and databases of such streams. More specifically, the invention relates to the field of video and audio processing for detecting known reference streams in target streams and for retrieving streams from databases based upon measures of correlation or similarity between two streams. The invention also relates to similarity or correlation measurements of temporal media sequences.
Traditional databases handle structured parametric data. Such databases, for example, can contain a list of employees of a company, their salaries, home addresses, years of service, etc. Information is very easily retrieved from such databases by formulating parametric queries, e.g., 'retrieve the salary of employee X' or 'how many employees with a salary between x and y live in city Y.'
Beyond data that can be represented in machine readable tabular form and, of course, machine readable text documents, many other forms of media are transitioning to machine readable digital form. For example, audio data such as speech and music and visual data such as images and video are more and more produced in digital form or converted to digital form. Large collections and catalogues of these media objects need to be organized similarly to structured traditional parametric data using database technology enhanced with new technologies that allow for convenient search and retrieval based on visual or audio content of the media. Such collections of media are managed using multimedia databases where the data that are stored are combinations of numerical, textual, auditory and visual data.
Audio and video are a special, peculiar type of data object in the sense that there is a notion of time associated with this data. This type of data is referred to as streamed information, streamed multimedia data or temporal media. When transporting this data from one location to some other location for viewing and/or listening purposes, it is important that the data arrives in the right order and at the right time. In other words, if frame n of a video is displayed at time t, frame n+1 has to be at the viewing location at time t plus 1/30th of a second. Of course, if the media are moved or transported for other purposes, there is no such requirement.
Similarly to text documents, which can be segmented into sections, paragraphs and sentences, temporal media data can be divided up into smaller, more or less meaningful time-continuous chunks. For video data, these chunks are often referred to as scenes, segments and shots, where a shot is the continuous depiction of space by a single camera between the time the camera is switched on and switched off, i.e., it is an image of continuous space-time. In this disclosure, we refer to these temporal, time-continuous (but not necessarily space-continuous) chunks of media as media segments or temporal media segments. These media segments include video and audio segments and, in general, information stream segments. Examples of media segments are commercial segments (or groups) broadcast at regular time intervals on almost every TV channel; a single commercial is another example of a media segment or video segment.
Multimedia databases may contain collections of such temporal media segments in addition to non-streamed media objects such as images and text documents. Associated with the media segments may be global textual or parametric data, such as the director of the video/music (audio) or the date of recording. Searching these global keys of multimedia databases to retrieve temporal media segments can be accomplished through traditional parametric and text search techniques. Multimedia databases may also be searched on data content, such as the amount of green or red in images or video and sound frequency components of audio segments. The databases then have to be preprocessed and the results indexed so that these data computations do not have to be performed by a linear search through the database at query time. Searching audio and video databases on semantic content, the actual meaning (subjects and objects) of the audio and video segments, on the other hand, is a difficult issue. For audio, a speech transcript may be computed using speech recognition technology; for video, speech may be recognized, but beyond that, the situation is much more complicated because of the rudimentary state of the art in machine interpretation of visual data.
Determining whether a given temporal media segment is a member or segment, or is similar to a member or segment, of a plurality of temporal media streams or determining whether it is equal or similar to a media segment or equal or similar to a sub segment in a multimedia database is another important multimedia database search or query. A variant here is the issue of determining if a given temporal input media stream contains a segment which is equal or similar to one of a plurality of temporal media stream segments or determining if the input stream contains a segment which is equal or similar to a media segment in a multimedia database. To achieve this one needs to somehow compare a temporal media segment to a plurality of temporal media stream segments or databases of such segments. This problem arises when certain media segments need to be selected or deselected in a given temporal media input stream or in a plurality of temporal media input streams. An example here is the problem of deselecting or suppressing repetitive media segments in a television broadcast program. Such repetitive media segments can be commercials or commercial segments or groups which are suppressed either by muting the sound channel or by both muting the sound channel and blanking the visual channel.
Much of the prior art is concerned with the issue of commercial detection in temporal video streams. The techniques for detecting commercials or other specific program material can be more or less characterized by the method of commercial representation. These representations are: 1) global representations, 2) static frame-based representations, and 3) dynamic sequence-based representations. Three examples that use global representations or properties of commercials are:
An example of a method and apparatus for detection and identification of portions of temporal video streams containing commercials is described in U.S. Pat. No. 5,151,788 to Blum. Here, a blank frame is detected in the video stream, a timer is set for a given period after detection of a blank frame, and the video stream is tested for "activity" (properties such as sound level, brightness level and average shot length) during the period representative of a commercial advertisement. Here the properties that commercials start with a blank frame and that the activity of a commercial is different from the surrounding video material are used as global properties of commercials. U.S. Pat. No. 5,696,866 to Iggulden et al. extends the idea of detecting a blank frame to what they call a "flat" frame, which has a constant signal throughout a frame or within a window within the frame. In addition to a frame being flat at the beginning and end of a commercial, Iggulden et al. require that the frame be silent, i.e., there should be no audio signal during the flat frame. Further, a commercial event is analyzed with respect to surrounding events to determine whether this segment of the video stream is part of a commercial message or part of the regular program material.
U.S. Pat. No. 5,151,788 to Blum and U.S. Pat. No. 5,696,866 to Iggulden et al. detect commercials and commercial groups based on representations of commercials which are coarse and determined by examining global properties of commercials and, most importantly, the fact that a commercial is surrounded by two blank frames. Additional features of the video signal are used in U.S. Pat. No. 5,343,251 to Nafeh. Here features such as changes in the audio power or amplitude and changes in brightness of the luminance signal between program and commercial segments are used to train an artificial neural network which ideally has as output +1 for detected program and −1 for detected commercial. If the output of the trained neural network is −1, the broadcast audio and video is discerned.
Two examples that use static frame-based representations for commercials are: U.S. Pat. No. 5,708,477 to S. J. Forbes et al. uses the notion of an abbreviated frame for representing commercial video segments. An abbreviated frame is an array of digital values representing the average intensities of the pixels in a particular portion of the video frame. Each commercial is represented by an abbreviated frame, extracted from the first shot (scene) of the commercial, and the duration of the commercial. These abbreviated frames are stored in memory in a linked list where the ordering is determined by the total brightness of all the pixels in the video frame portion. Upon detection of a scene change in the live video stream, an abbreviated frame is computed along with the average intensity. An abbreviated frame in memory is found that has total brightness close to the computed abbreviated frame within a predetermined threshold. The best matching abbreviated commercial frame is found by traversing the linked list both in order of increasing total brightness and in order of decreasing total brightness. If a stored abbreviated commercial frame is found that matches the abbreviated frame in the live video stream within a predetermined threshold, the TV set is muted and/or blanked for the duration of the commercial. Note that finding a stored abbreviated frame which is close in average brightness to a current abbreviated frame is independent of the number of stored commercials; however, retrieving the identity of a commercial (a commercial with the same or close abbreviated frame as the current abbreviated frame, if this is even uniquely possible) will take search time which is in the order of the number of commercials stored in the database.
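The abbreviated-frame lookup described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the cited patent: the function names, the block grid, the use of a sorted Python list with binary search in place of the linked list, and the choice of the largest per-block difference as the match criterion are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right

def abbreviated_frame(frame, blocks_y, blocks_x):
    """Average pixel intensity of each block of a 2-D grayscale frame (hypothetical layout)."""
    h, w = len(frame), len(frame[0])
    bh, bw = h // blocks_y, w // blocks_x
    out = []
    for by in range(blocks_y):
        for bx in range(blocks_x):
            total = sum(frame[y][x]
                        for y in range(by * bh, (by + 1) * bh)
                        for x in range(bx * bw, (bx + 1) * bw))
            out.append(total / (bh * bw))
    return out

def find_match(stored, query, brightness_tol, frame_tol):
    """stored: list of (total_brightness, abbreviated_frame, name), sorted by total_brightness.

    Only entries whose total brightness lies within brightness_tol of the query
    are compared block by block (assumed criterion: largest absolute difference).
    """
    keys = [entry[0] for entry in stored]
    q_total = sum(query)
    lo = bisect_left(keys, q_total - brightness_tol)
    hi = bisect_right(keys, q_total + brightness_tol)
    best = None
    for _, abbrev, name in stored[lo:hi]:
        dist = max(abs(a - b) for a, b in zip(abbrev, query))
        if dist <= frame_tol and (best is None or dist < best[0]):
            best = (dist, name)
    return best[1] if best else None

stored = [(60.0, [0, 10, 20, 30], "ad_A"), (200.0, [50, 50, 50, 50], "ad_B")]
print(find_match(stored, [1, 9, 21, 29], brightness_tol=5, frame_tol=2))
```

The brightness ordering narrows the candidates cheaply, but every candidate inside the brightness window still has to be compared individually, which is the dependence on database size noted above.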
Many techniques described in the commercial seeking literature reduce videos to a small set of representative frames, or keyframes, and then use well-known image matching schemes to match the keyframes. An example of such a technique is presented in reference:
J. M. Sanchez, X. Binefa, J. Vitria, and P. Radeva,
Local color analysis for scene break detection applied to TV commercial recognition,
Third International Conference, Visual'99, Amsterdam, June 1999, pp. 237-244.
This reference is incorporated herein in its entirety.
Each commercial in the database is represented by a number of color histograms, one color histogram of a representative frame for each shot in the commercial. The shots of a commercial are detected by some shot boundary detection algorithm (finding scene breaks). Commercials are detected in a live video stream by comparing all the color histograms of all the commercials to a color histogram representing a shot in the video stream. A color histogram is a vector of length n (n being the number of colors) and comparing histograms amounts to measuring the distance between two vectors. In order to decrease the computation time for computing the distance between the two vectors, the eigenvectors of the covariance matrix of a large number M of vectors representing color histograms are computed. These eigenvectors form an orthogonal basis which spans the color vector space. Rotating the color vectors and projecting the rotated vectors on the m-dimensional subspace spanned by the first m eigenvectors captures the principal components of the color vectors and does not change the amount of information about colors much. Hence comparing two color vectors of length n is close to comparing two rotated color vectors of length m. If, during the commercial seeking process, a color vector computed from a shot in the live video is determined to be close to a color vector of some commercial A, it is assumed that this commercial is present in the live stream. This is verified by checking if the following color vectors representing shots in the live stream also are close to color vectors representing shots of commercial A.
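The shot-by-shot histogram comparison and the verification over following shots can be sketched as below. This is a simplified illustration under stated assumptions: the eigenvector (principal component) reduction is omitted, histograms are compared at full length with a Euclidean distance, and the function names and the tolerance parameter are hypothetical.

```python
from math import sqrt

def histogram(pixels, n_colors):
    """Normalized color histogram; pixels are color indices in range(n_colors)."""
    counts = [0] * n_colors
    for p in pixels:
        counts[p] += 1
    return [c / len(pixels) for c in counts]

def distance(h1, h2):
    """Euclidean distance between two histograms of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def detect(commercials, live_shots, tol):
    """commercials: {name: [histogram per shot]}; live_shots: [histogram per shot].

    A commercial is reported at a live shot position only if its first shot and
    all of its following shots are within tol of the corresponding live shots.
    """
    found = []
    for start in range(len(live_shots)):
        for name, shots in commercials.items():
            window = live_shots[start:start + len(shots)]
            if len(window) == len(shots) and all(
                    distance(a, b) <= tol for a, b in zip(shots, window)):
                found.append((start, name))
    return found

commercials = {"A": [[1, 0, 0], [0, 1, 0]]}
live = [[0, 0, 1], [0.9, 0.1, 0], [0.1, 0.9, 0], [0, 0, 1]]
print(detect(commercials, live, tol=0.2))
```

Every live shot is compared against every stored commercial here, so the cost per shot grows with the number of commercials even when each individual comparison is made cheaper by dimension reduction.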
Three examples that use dynamic sequence-based representations are:
U.S. Pat. No. 4,677,466 to Lert, Jr. et al. uses signatures that are stored and used to determine the identity of a commercial. The video and audio signals are converted to digital samples by A/D converters. Video signatures are average values of consecutive samples of the video envelope; audio signatures are average values of consecutive samples of band-passed audio signals. Multiple events are defined which are used as potential start points for computing a signature from a video stream, either for storing reference signatures or for computing target signatures that are to be compared to reference signatures. Such events include the occurrence of a scene change in the video stream, a blank frame having silent audio followed by a non-blank frame, a predetermined magnitude change in the audio signal or color component of the video signal, and predetermined time periods after previously determined events. Upon detecting a stability condition (a comparison of samples of a frame and counterpart samples of a subsequent frame) after such an event, a video signature is extracted. This signature is compared to stored reference signatures using a technique which the author calls 'variable range hash code search.' Sequentially, the absolute difference between a first reference signature and the extracted signature is calculated and when this is greater than a predetermined threshold, the variable range hash code search is continued. When the absolute difference is less than the threshold, a correlation coefficient between the extracted and reference signatures is computed. When this correlation is significantly high, a possible match is recorded; otherwise the variable range hash code search is continued. Even if the correlation coefficient is sufficiently high but lower than a predetermined threshold, the variable range hash code search is further continued.
Eventually, a commercial group will give rise to multiple signatures; the commercials are then identified by determining the sequential ordering of the matched signatures and predefined decision rules to recognize a particular repetitive broadcast. Note that this is a sequential process that compares reference features one at a time.
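The two-stage comparison described above (a cheap absolute-difference test followed by a correlation coefficient) can be sketched as follows. This is a loose illustration, not the hash code search of the cited patent: treating a signature as a list of window averages, using the mean absolute difference as the first test, and using a Pearson correlation as the second test are assumptions, as are all names and thresholds.

```python
from math import sqrt

def signature(samples, window):
    """Averages of consecutive, non-overlapping windows of samples."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples) - window + 1, window)]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)

def match(extracted, references, diff_threshold, corr_threshold):
    """Compare the extracted signature to the references one at a time."""
    for name, ref in references:
        mean_abs_diff = sum(abs(x - y) for x, y in zip(extracted, ref)) / len(ref)
        if mean_abs_diff > diff_threshold:
            continue  # cheap rejection, move on to the next reference
        if pearson(extracted, ref) >= corr_threshold:
            return name
    return None

refs = [("A", [2, 6, 10]), ("B", [10, 6, 2])]
print(match(signature([1.5, 3.5, 5.5, 7.5, 9.5, 11.5], 2), refs, 1.0, 0.9))
```

The loop over references makes the sequential nature of the search explicit: the work per extracted signature grows with the number of stored reference signatures.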
U.S. Pat. No. 5,504,518 to Ellis et al. also concerns itself with the detection of broadcast segments of interest (commercials) using dynamic key signatures to represent these segments. Key signatures include eight 16-bit match words derived from eight frames of broadcast video stream, where these frames are selected from the video segment to be identified according to specific rules. Thirty-two different areas within a frame are selected, each area paired with another area. The average luminance values of the 16 paired areas are compared producing a 1 or a 0 based on the average luminance values of the first set being 'greater or equal' or 'less than' those of the paired set, producing a 16-bit word. A 16-bit mask is also produced for each frame signature, where each bit indicates the susceptibility of the corresponding bit to noise (based on the magnitude of the absolute value of the difference of average luminance values). The keywords (or signatures) along with the match word and offset information are stored as representation of the broadcast segment.
For broadcast segment recognition, the received signal is digitized and 16-bit frame signatures are computed the same way as explained above. Each of these signatures is assumed to correspond to the first 16-bit word of one of the previously stored eight-word key signatures and compared to all key signatures beginning with that word. Then, using the offset information, subsequently computed words are compared with the corresponding stored words to determine if a match exists or not.
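The construction of a frame signature word and its noise mask from paired area luminances can be sketched as follows. This is a minimal sketch under assumptions: the pairing of areas, the noise margin, the convention that a mask bit of 1 marks a reliable bit, and the rule that only bits reliable in both words are compared are all hypothetical choices made for illustration.

```python
def frame_signature(area_means, pairs, noise_margin):
    """Build (word, mask) from average luminances of paired areas.

    Bit k of word is 1 when the first area of pair k is at least as bright as
    the second. Bit k of mask is 1 when the luminance difference is at least
    noise_margin, i.e. the bit is assumed unlikely to flip under noise.
    """
    word = 0
    mask = 0
    for bit, (i, j) in enumerate(pairs):
        diff = area_means[i] - area_means[j]
        if diff >= 0:
            word |= 1 << bit
        if abs(diff) >= noise_margin:
            mask |= 1 << bit
    return word, mask

def words_match(word_a, mask_a, word_b, mask_b, max_errors=0):
    """Compare two words on the bits that both masks mark as reliable."""
    reliable = mask_a & mask_b
    differing = (word_a ^ word_b) & reliable
    return bin(differing).count("1") <= max_errors

# Sixteen pairs over thirty-two areas, as in the description above.
pairs = [(k, k + 16) for k in range(16)]
stored = frame_signature(list(range(32)), pairs, noise_margin=4)
received = frame_signature([v + 1 for v in range(32)], pairs, noise_margin=4)
print(words_match(*stored, *received))
```

Because each bit depends only on which area of a pair is brighter, a uniform brightness shift of the whole frame leaves the word unchanged in this sketch.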
The time needed to compare a stored key signature to a segment signature of a newly received broadcast is reduced by using a keyword lookup data reduction method. For this, one frame is selected from the frames corresponding to the key signature, in accordance with a set of predetermined rules. This frame is a key frame with an associated keyword. The key signature still has eight 16-bit words but the offset is measured with respect to the keyword. A keyword also may have multiple key signatures associated with it. This keyword is used with the lookup table to find a smaller number of signatures that contain this keyword and thereby significantly reduce the number of signatures that have to be compared. Certain values of video frame signatures (keywords) occur more often than others, which has two effects: 1) for those keywords many more signatures have to be compared, 2) these signatures also become closer together and the signature comparison (correlator) may report a false match. A video preprocessing step is introduced to produce video frame signatures which are more uniformly distributed. Note, however, that although the introduction of a lookup table decreases the signature search time, this time is still linearly dependent on the number of commercials in the database, although the slope of the linear dependency is reduced significantly.
Audio signatures are also computed by utilizing the fast Fourier transform. The audio signatures are handled similarly to the video signatures.
A method for matching and similarity searching of arbitrary video sequences, including commercials, is described in:
R. Mohan, "Video Sequence Matching",
International Conference on Acoustics, Speech and Signal Processing, (ICASSP),
Seattle, May 1998.
This reference and the patents cited above are incorporated herein by reference in their entirety.
Mohan defines that there is a match between a given video sequence and some segment of a database video sequence if each frame in the given sequence matches the corresponding frame in the database video segment. That is, the matching sequences are of the same temporal length; matching slow-motion sequences is performed by temporal sub-sampling of the database segments. The representation of a video segment is a vector of representations of the constituent frames in the form of an ordinal measure of a reduced intensity image of each frame. Before matching, the database is prepared for video sequence matching by computing the ordinal measure for each frame in each video segment in the database. Finding a match between some given action video sequence and the database then amounts to sequentially matching the input sequence against each sub-sequence in the database and detecting minimums.
Some of the problems with the prior art are now presented.
1. For many solutions, the representations that are used to characterize and match target video to stored reference videos are global video sequence properties. This reduces the problem of finding repetitive segments from an identification problem to the problem of detecting two or more classes of media. That is, program segments of certain characteristics can be detected but not identified.
2. For individual video segment identification, the representation of the video segments that are to be identified is reduced to representing a sequence of keyframes or a sequence of otherwise characteristic frames or to one or more digital representations of a time interval of the analog video and/or audio signal (signatures of keyframes, of sets of frames, of analog signal). So, the program segments that are to be identified are partially represented which increases the likelihood of false reject (mistakenly rejecting a pertinent segment).
3. In almost all prior art, the computation of signatures is started when certain predefined events occur in the video stream. A simple example of a video event here is the presence of a blank frame. This is the case for both deriving representations or signatures for reference videos and for triggering the computation of signatures in the target video. This again increases the likelihood of false rejects.
4. Features used for representing video segments of interest are computed from one frame at a time. No features or signatures are derived from combined multiple frames, be they consecutive frames or frames which are spaced further apart in time. Therefore, no explicit motion information can be incorporated in the video segment representations.
5. The video signature store representations are not designed to handle the large numbers of features that can potentially be extracted from video sequence segments. This inherently limits the number of video segments that can be distinguished, i.e., the discrimination power of the signatures will not extend beyond a certain (not very large) number of different reference video segments.
6. The prior art does not scale well because the search techniques used for searching the reference signature database are inherently sequential. In combination with item 5 above, this makes the prior art unusable for large numbers of reference segments, e.g., the number of samples a system searches for simultaneously.
7. The schemes proposed hitherto are tailored specifically for matching identical segments or for segments that have a number of identical shots in common. The schemes do not generalize to other types of similarity measurement (examples include structural similarity of sequences, motion similarity of sequences, etc.), where matching segments are not detected based on exact signature (feature) values but rather on signatures that are similar in some sense.
8. The schemes are rigid and can recognize subsections of the known segments only as defined according to certain rules. For example, a certain number of signatures have to be present in a target segment in order to match a reference segment.
An object of this invention is an improved system and method for exact matching of multimedia time sequences.
An object of this invention is an improved system and method for similarity matching of multimedia time sequences for multimedia database indexing and searching.
An object of this invention is a system capable of performing similarity matching of multimedia time sequences using a multiplicity of similarity metrics.
An object of this invention is a scaleable system (search time increases sublinearly with the reference set) for performing exact or similarity matching between a large number of reference media sequences and a query time sequence.
An object of this invention is a scaleable system (search time increases sublinearly with the reference set) for performing exact or similarity matching between a large database of reference temporal media sequences and a query time sequence.
An object of this invention is a scaleable system (search time increases sublinearly with the reference set) for performing similarity measurement between segments of reference media sequences stored in a large database and a segment of a query time sequence.
An object of this invention is to optimize the coding scheme to provide the best discrimination between the target sequences.
The present invention is a representation scheme and a search scheme for detection and retrieval of known stream-oriented data. Examples include video and audio (media streams), stock price data, and binary data from a computer disk (for purposes of detecting known virus patterns). The invention comprises an off-line indexing phase where the representations for a set of known video (information) reference segments are computed and stored in a segment index structure. For each segment, a set of key frames or key intervals (key intervals, for short) is determined, which can either be regularly sampled in time or based on the information content. From each key interval, a set of features is extracted from a number of regions in the key intervals. These regions can be different for each feature and are indexed by domain codes. The extracted features are quantized according to quantization intervals, where each interval is represented by a feature code. Hence, each key interval is coded by a set of (domain code, feature code) pairs. These pairs are used to populate the segment index structure, one structure for each feature. For the first feature and the first key interval of all the known segments, this table is a two-dimensional array, where the columns represent domain codes and the rows feature codes. The (domain code, feature code) cells of this array are populated by segment identifiers of the corresponding set of code pairs. A second array is established for the second feature and the first key frame or interval, and so on. The use of multiple index structures corresponding to the features provides flexibility to the invention to operate in a wide variety of applications. For the second key interval, this process is repeated in a second structure of two-dimensional arrays, indexed by time or number for regular or content-based key interval selection. The result is a compact data structure that allows for fast similarity detection and search of the known segments.
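One way to picture the segment index structure is sketched below. This is only an illustration under assumptions, not the disclosed implementation: the nested dictionary stands in for the two-dimensional arrays, the uniform quantizer, the value range 0 to 256, the feature name "brightness" and all function names are hypothetical.

```python
from collections import defaultdict

def quantize(value, lo, hi, levels):
    """Map a feature value to a feature code in 0..levels-1 (uniform intervals, assumed)."""
    if value >= hi:
        return levels - 1
    if value <= lo:
        return 0
    return int((value - lo) / (hi - lo) * levels)

def build_index(segments, lo=0.0, hi=256.0, levels=8):
    """Populate one table per (feature, key interval number).

    segments: {segment_id: [key_interval, ...]} where each key_interval is
    {feature_name: [one value per region]}; the region position serves as the
    domain code. Each table maps (domain_code, feature_code) to the set of
    segment identifiers that produced that code pair.
    """
    index = defaultdict(lambda: defaultdict(set))
    for seg_id, intervals in segments.items():
        for k, interval in enumerate(intervals):
            for feature, region_values in interval.items():
                for domain_code, value in enumerate(region_values):
                    feature_code = quantize(value, lo, hi, levels)
                    index[(feature, k)][(domain_code, feature_code)].add(seg_id)
    return index

segments = {
    "ad1": [{"brightness": [10, 200]}, {"brightness": [100, 100]}],
    "ad2": [{"brightness": [20, 50]}],
}
index = build_index(segments)
print(sorted(index[("brightness", 0)][(0, 0)]))
```

In this sketch, looking up a code pair costs the same regardless of how many reference segments are stored; only the size of the returned identifier set grows.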
In the search and detection phase, the process of computing (domain code, feature code) pairs is repeated in real time from a target media stream. Additionally, a segment counter is initialized for each of the known segments. The computed code pairs are used to index into the segment index table. In the case that a first key interval of a known segment is encountered, many, if not all, of the code pairs will point to cells in the segment index table that contain the particular segment identifier and the appropriate segment counter will be incremented for all domain codes and all features. Using the two-dimensional arrays representing the second key interval of the reference segments, the accumulation is continued for the second key interval in the target stream. This process is repeated for features and key intervals until sufficient evidence is accumulated in the segment counter that a known reference segment is present.
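The counter accumulation of the search phase can be sketched as follows. This is a simplified illustration under assumptions: the index is a small hand-written dictionary keyed by (feature, key interval number), the target key intervals are assumed to be already aligned with the start of a reference segment, and the evidence threshold and all names are hypothetical.

```python
from collections import Counter

def detect_segment(index, target_intervals, threshold):
    """Accumulate votes per segment until some counter reaches the threshold.

    index: {(feature, interval_no): {(domain_code, feature_code): set of segment ids}}
    target_intervals: one {feature: [(domain_code, feature_code), ...]} per key interval.
    Returns (sorted list of detected segment ids, counters).
    """
    counters = Counter()
    for k, code_pairs in enumerate(target_intervals):
        for feature, pairs in code_pairs.items():
            table = index.get((feature, k), {})
            for pair in pairs:
                for seg_id in table.get(pair, ()):
                    counters[seg_id] += 1
        hits = sorted(s for s, c in counters.items() if c >= threshold)
        if hits:
            return hits, counters
    return [], counters

index = {
    ("brightness", 0): {(0, 0): {"ad1", "ad2"}, (1, 6): {"ad1"}, (1, 1): {"ad2"}},
    ("brightness", 1): {(0, 3): {"ad1"}, (1, 3): {"ad1"}},
}
target = [
    {"brightness": [(0, 0), (1, 6)]},
    {"brightness": [(0, 3), (1, 3)]},
]
hits, counters = detect_segment(index, target, threshold=4)
print(hits, dict(counters))
```

A code pair that matches no stored segment simply contributes no votes, so partial or noisy agreement still accumulates evidence instead of failing outright.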