1. Technical Field
The invention is related to media stream identification and segmentation, and in particular, to a system and method for identifying and extracting repeating audio and/or video objects from one or more streams of media such as, for example, a media stream broadcast by a radio or television station.
2. Related Art
There are many existing schemes for identifying audio and/or video objects such as particular advertisements, station jingles, or songs embedded in an audio stream, or advertisements or other videos embedded in a video stream. For example, with respect to audio identification, many such schemes are referred to as "audio fingerprinting" schemes. Typically, audio fingerprinting schemes take a known object, and reduce that object to a set of parameters, such as, for example, frequency content, energy level, etc. These parameters are then stored in a database of known objects. Sampled portions of the streaming media are then compared to the fingerprints in the database for identification purposes.
Thus, in general, such schemes typically rely on a comparison of the media stream to a large database of previously identified media objects. In operation, such schemes often sample the media stream over a desired period using some sort of sliding window arrangement, and compare the sampled data to the database in order to identify potential matches. In this manner, individual objects in the media stream can be identified. This identification information is typically used for any of a number of purposes, including segmentation of the media stream into discrete objects, or generation of play lists or the like for cataloging the media stream.
However, as noted above, such schemes require the use of a preexisting database of pre-identified media objects for operation. Without such a preexisting database, identification, and/or segmentation of the media stream are not possible when using the aforementioned conventional schemes.
Therefore, what is needed is a system and method for efficiently identifying and extracting or segmenting repeating media objects from a media stream such as a broadcast radio or television signal without the need to use a preexisting database of pre-identified media objects.
An "object extractor" as described herein automatically identifies and segments repeating objects in a media stream comprised of repeating and non-repeating objects. An "object" is defined to be any section of non-negligible duration that would be considered to be a logical unit, when identified as such by a human listener or viewer. For example, a human listener can listen to a radio station, or listen to or watch a television station or other media broadcast stream and easily distinguish between non-repeating programs, and advertisements, jingles, and other frequently repeated objects. However, automatically distinguishing repeating content from non-repeating content in a media stream is generally a difficult problem.
For example, an audio stream derived from a typical pop radio station will contain, over time, many repetitions of the same objects, including, for example, songs, jingles, advertisements, and station identifiers. Similarly, an audio/video media stream derived from a typical television station will contain, over time, many repetitions of the same objects, including, for example, commercials, advertisements, station identifiers, program "signature tunes", or emergency broadcast signals. However, these objects will typically occur at unpredictable times within the media stream, and are frequently corrupted by noise caused by any acquisition process used to capture or record the media stream.
Further, objects in a typical media stream, such as a radio broadcast, are often corrupted by voice-overs at the beginning and/or end point of each object. Further, such objects are frequently foreshortened, i.e., they are not played completely from the beginning or all the way to the end. Additionally, such objects are often intentionally distorted. For example, audio broadcast via a radio station is often processed using compressors, equalizers, or any of a number of other time/frequency effects. Further, audio objects, such as music or a song, broadcast on a typical radio station are often cross-faded with the preceding and following music or songs, thereby obscuring the audio object start and end points, and adding distortion or noise to the object. Such manipulation of the media stream is well known to those skilled in the art. Finally, it should be noted that any or all of such corruptions or distortions can occur either individually or in combination, and are generally referred to as "noise" in this description, except where they are explicitly referred to individually. Consequently, identification of such objects and locating the endpoints for such objects in such a noisy environment is a challenging problem.
The object extractor described herein successfully addresses these and other issues while providing many advantages. For example, in addition to providing a useful technique for gathering statistical information regarding media objects within a media stream, automatic identification and segmentation of the media stream allows a user to automatically access desired content within the stream, or, conversely, to automatically bypass unwanted content in the media stream. Further advantages include the ability to identify and store only desirable content from a media stream; the ability to identify targeted content for special processing; the ability to de-noise, or clear up any multiply detected objects, and the ability to archive the stream more efficiently by storing only a single copy of multiply detected objects.
As noted above, a system and method for automatically identifying and segmenting repeating media objects in a media stream identifies such objects by examining the stream to determine whether previously encountered objects have occurred. For example, in the audio case this would mean identifying songs as being objects that have appeared in the stream before. Similarly, in the case of video derived from a television stream, it can involve identifying specific advertisements, as well as station "jingles" and other frequently repeated objects. Further, such objects often convey important synchronization information about the stream. For example, the theme music of a news station conveys the time and the fact that the news report is about to begin or has just ended.
For example, given an audio stream which contains objects that repeat and objects that do not repeat, the system and method described herein automatically identifies and segments repeating media objects in the media stream, while identifying object endpoints by a comparison of matching portions of the media stream or matching repeating objects. Using broadcast audio, i.e., radio, as an example, "objects" that repeat may include, for example, songs on a radio music station, call signals, jingles, and advertisements.
Examples of objects that do not repeat may include, for example, live chat from disc jockeys, news and traffic bulletins, and programs or songs that are played only once. These different types of objects have different characteristics that allow for identification and segmentation from the media stream. For example, radio advertisements on a popular radio station are generally less than 30 seconds in length, and consist of a jingle accompanied by voice. Station jingles are generally 2 to 10 seconds in length, are mostly music and voice, and repeat very often throughout the day. Songs on a "popular" music station, as opposed to classical, jazz or alternative, for example, are generally 2 to 7 minutes in length and most often contain voice as well as music.
In general, automatic identification and segmentation of repeating media objects is achieved by comparing portions of the media stream to locate regions or portions within the media stream where media content is being repeated. In a tested embodiment, identification and segmentation of repeating objects is achieved by directly comparing sections of the media stream to identify matching portions of the stream, then aligning the matching portions to identify object endpoints. In a related embodiment, segments are first tested to estimate the probability that an object of the type being sought is present in the segment. If such an object is likely to be present, comparison with other segments of the media stream proceeds; if not, further processing of the segment in question can be skipped in the interests of improving efficiency.
In another embodiment, automatic identification and segmentation of repeating media objects is achieved by employing a suite of object dependent algorithms to target different aspects of audio and/or video media for identifying possible objects. Once a possible object is identified within the stream, confirmation of an object as a repeating object is achieved by an automatic search for potentially matching objects in an automatically instantiated dynamic object database, followed by a detailed comparison between the possible object and one or more of the potentially matching objects. Object endpoints are then automatically determined by automatic alignment and comparison to other repeating copies of that object.
Specifically, identifying repeat instances of an object includes first instantiating or initializing an empty "object database" for storing information such as, for example, pointers to media object positions within the media stream, parametric information for characterizing those media objects, metadata for describing such objects, object endpoint information, or copies of the objects themselves. Note that any or all of this information can be maintained in either a single object database, or in any number of databases or computer files. The next step involves capturing and storing at least one media stream over a desired period of time. The desired period of time can be anywhere from minutes to hours, or from days to weeks or longer. However, the basic requirement is that the sample period should be long enough for objects to begin repeating within the stream. Repetition of objects allows the endpoints of the objects to be identified when the objects are located within the stream.
As noted above, in one embodiment, automatic identification and segmentation of repeating media objects is achieved by comparing portions of the media stream to locate regions or portions within the media stream where media content is being repeated. Specifically, in this embodiment, a portion or window of the media stream is selected from the media stream. The length of the window can be any desired length, but typically should not be so short as to provide little or no useful information, or so long that it potentially encompasses too many media objects. In a tested embodiment, windows or segments of about two to five times the length of the average object of the sought class were found to produce good results. This portion or window can be selected from either end of the media stream, or can even be randomly selected from the media stream.
Next, the selected portion of the media stream is directly compared against similar sized portions of the media stream in an attempt to locate a matching section of the media stream. These comparisons continue until either the entire media stream has been searched to locate a match, or until a match is actually located, whichever comes first. As with the selection of the portion for comparison to the media stream, the portions which are compared to the selected segment or window can be taken sequentially beginning at either end of the media stream, or can even be randomly taken from the media stream.
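The direct comparison just described can be sketched as follows. This is a minimal illustration only, assuming the stream is a list of numeric samples, that similarity is measured by normalized correlation, and that a fixed threshold of 0.9 declares a match; the function names, the similarity measure, and the threshold are assumptions made for illustration and are not specified in this description.

```python
from math import sqrt

def normalized_correlation(a, b):
    """Mean-removed, normalized correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a constant portion carries no pattern to match
    return sum(x * y for x, y in zip(da, db)) / denom

def find_first_match(stream, start, length, threshold=0.9):
    """Compare the window stream[start:start+length] against similar sized
    portions of the stream, taken sequentially from the beginning, and return
    the offset of the first match, or None if the whole stream was searched
    without finding one. The threshold is an assumed value."""
    window = stream[start:start + length]
    for offset in range(len(stream) - length + 1):
        if abs(offset - start) < length:
            continue  # skip portions that overlap the selected window itself
        portion = stream[offset:offset + length]
        if normalized_correlation(window, portion) >= threshold:
            return offset
    return None
```

The search stops at the first match or at the end of the stream, whichever comes first, mirroring the stopping rule in the text. A practical system would more likely compare compact features than raw samples.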
In this tested embodiment, once a match is identified by the direct comparison of portions of the media stream, identification and segmentation of repeating objects is then achieved by aligning the matching portions to locate object endpoints. Note that because each object includes noise, and may be shortened or cropped, either at the beginning or the end, as noted above, the object endpoints are not always clearly demarcated. However, even in such a noisy environment, approximate endpoints are located by aligning the matching portions using any of a number of conventional techniques, such as simple pattern matching, aligning cross-correlation peaks between the matching portions, or any other conventional technique for aligning matching signals. Once aligned, the endpoints are identified by tracing backwards and forwards in the media stream, past the boundaries of the matching portions, to locate those points where the two portions of the media stream diverge. Because repeating media objects are not typically played in exactly the same order every time they are broadcast, this technique for locating endpoints in the media stream has been observed to satisfactorily locate the start and endpoints of media objects in the media stream.
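The endpoint tracing step can be sketched as below. This is a simplified illustration, assuming the two matching portions have already been aligned (for example, by a cross-correlation peak), that the first copy precedes the second, and that divergence is detected by a sample-by-sample difference exceeding a tolerance. The function name, the tolerance, and the per-sample test are assumptions; a practical system would more likely compare short frames or features so that noise does not end the trace prematurely.

```python
def trace_endpoints(stream, pos_a, pos_b, length, tol=0.5):
    """Extend two aligned matching portions outward until they diverge.

    stream[pos_a:pos_a+length] and stream[pos_b:pos_b+length] are the
    already-aligned matching portions, with pos_a < pos_b (assumed).
    Returns (start, end) of the first copy; the second copy spans
    (start + shift, end + shift), where shift = pos_b - pos_a.
    """
    shift = pos_b - pos_a
    start, end = pos_a, pos_a + length

    # Trace backwards past the start of the matching portion.
    while start > 0 and abs(stream[start - 1] - stream[start - 1 + shift]) <= tol:
        start -= 1

    # Trace forwards past the end of the matching portion.
    while end + shift < len(stream) and abs(stream[end] - stream[end + shift]) <= tol:
        end += 1

    return start, end
```

The trace relies on the observation in the text that the content surrounding a repeated object usually differs between broadcasts, so the point of divergence approximates the object boundary.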
Alternately, as noted above, in one embodiment, a suite of algorithms is used to target different aspects of audio and/or video media for computing parametric information useful for identifying objects in the media stream. This parametric information includes parameters that are useful for identifying particular objects, and thus, the type of parametric information computed is dependent upon the class of object being sought. Note that any of a number of well-known conventional frequency, time, image, or energy-based techniques for comparing the similarity of media objects can be used to identify potential object matches, depending upon the type of media stream being analyzed. For example, with respect to music or songs in an audio stream, these algorithms include, for example, calculating easily computed parameters in the media stream such as beats per minute in a short window, stereo information, energy ratio per channel over short intervals, and frequency content of particular frequency bands; comparing larger segments of media for substantial similarities in their spectrum; storing samples of possible candidate objects; and learning to identify any repeated objects.
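As one illustration of an easily computed parameter, the sketch below computes an energy ratio per channel over short intervals for a two-channel audio stream. The exact definition of the ratio is not given in this description; the version here (left-channel energy divided by total energy, per non-overlapping interval) and the function name are assumptions chosen for simplicity.

```python
def channel_energy_ratio(left, right, interval):
    """Return, for each consecutive non-overlapping interval of `interval`
    samples, the fraction of total energy carried by the left channel.
    Silent intervals are reported as 0.5 (an assumed convention)."""
    ratios = []
    usable = min(len(left), len(right))
    for i in range(0, usable - interval + 1, interval):
        e_left = sum(x * x for x in left[i:i + interval])
        e_right = sum(x * x for x in right[i:i + interval])
        total = e_left + e_right
        ratios.append(e_left / total if total > 0 else 0.5)
    return ratios
```

A sequence of such ratios is cheap to compute and could serve as one component of the parametric information stored for a candidate object.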
In this embodiment, once the media stream has been acquired, the stored media stream is examined to determine a probability that an object of a sought class, i.e., song, jingle, video, advertisement, etc., is present at a portion of the stream being examined. Once the probability that a sought object exists reaches a predetermined threshold, the position of that probable object within the stream is automatically noted within the aforementioned database. Note that this detection or similarity threshold can be increased or decreased as desired in order to adjust the sensitivity of object detection within the stream.
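The thresholding step can be sketched as follows, assuming some class-dependent estimator (not specified here) has already produced a probability for each examined position in the stream. The function name and the default threshold are hypothetical. A position is noted only where the probability first reaches the threshold, so that a sustained run above the threshold yields a single entry.

```python
def probable_object_positions(probabilities, threshold=0.8):
    """Return the positions at which the estimated probability that a sought
    object is present first reaches the detection threshold."""
    positions = []
    above = False
    for pos, p in enumerate(probabilities):
        if p >= threshold and not above:
            positions.append(pos)  # note the position of the probable object
        above = p >= threshold
    return positions
```

Lowering the threshold makes detection more sensitive and raising it makes detection more selective, which is the adjustment described above.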
Given this embodiment, once a probable object has been identified in the stream, parametric information for characterizing the probable object is computed and used in a database query or search to identify potential object matches with previously identified probable objects. The purpose of the database query is simply to determine whether two portions of a stream are approximately the same, that is, whether the objects located at two different time positions within the stream are approximately the same. Further, because the database is initially empty, the likelihood of identifying potential matches naturally increases over time as more potential objects are identified and added to the database.
Once the potential matches to the probable object have been returned, a more detailed comparison between the probable object and one or more of the potential matches is performed in order to more positively identify the probable object. At this point, if the probable object is found to be a repeat of one of the potential matches, it is identified as a repeat object, and its position within the stream is saved to the database. Conversely, if the detailed comparison shows that the probable object is not a repeat of one of the potential matches, it is identified as a new object in the database, and its position within the stream and parametric information is saved to the database as noted above.
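The query-then-confirm flow described above can be sketched as below. This is an illustrative outline only: the database is modeled as a list of dictionaries, the coarse query as a per-parameter tolerance test, and the detailed comparison as a caller-supplied function `detailed_match`. All of these names and representations are assumptions and are not part of this description.

```python
def process_probable_object(database, position, params, detailed_match, tolerance=0.1):
    """Classify a probable object as a repeat or as a new object.

    database: list of entries {"params": tuple, "positions": [int, ...]}
    position: location of the probable object within the stream
    params: parametric information computed for the probable object
    detailed_match: hypothetical callable (known_position, position) -> bool
    """
    # Coarse database query: entries whose parameters are approximately equal.
    candidates = [
        entry for entry in database
        if all(abs(a - b) <= tolerance for a, b in zip(entry["params"], params))
    ]

    # Detailed comparison against each potential match.
    for entry in candidates:
        if detailed_match(entry["positions"][0], position):
            entry["positions"].append(position)  # save the repeat's position
            return "repeat"

    # No confirmed match: record a new object with its parameters.
    database.append({"params": params, "positions": [position]})
    return "new"
```

Because the database starts empty, the first occurrence of every object is recorded as new, and repeats are recognized only from the second occurrence onward.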
Further, as with the previously discussed embodiment, the endpoints of the various instances of a repeating object are automatically determined. For example, if there are N instances of a particular object, they may not all be of precisely the same length. Consequently, a determination of the endpoints involves aligning the various instances relative to one instance and then tracing backwards and forwards in each of the aligned objects to determine the furthest extent at which each of the instances is still approximately equal to the other instances.
It should be noted that the methods for determining the probability that an object of a sought class is present at a portion of the stream being examined, and for testing whether two portions of the stream are approximately the same both depend heavily on the type of object being sought (e.g., music, speech, advertisements, jingles, station identifications, videos, etc.) while the database and the determination of endpoint locations within the stream are very similar regardless of what kind of object is being sought.
In still further modifications of each of the aforementioned embodiments, the speed of media object identification in a media stream is dramatically increased by restricting searches of previously identified portions of the media stream, or by first querying a database of previously identified media objects prior to searching the media stream.
Further, in a related embodiment, the media stream is analyzed by first analyzing a portion of the stream large enough to contain repetitions of at least the most common repeating objects in the stream. A database of the objects that repeat in this first portion of the stream is maintained. The remainder of the stream is then analyzed by first determining whether segments match any object in the database, and then subsequently checking against the rest of the stream.
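The database-first ordering can be sketched as follows. This is a minimal outline under stated assumptions: segments are opaque values, `is_match` is a caller-supplied comparison, the database is a dictionary keyed by an integer identifier, and a segment that matches elsewhere in the stream is added to the database as a newly found repeating object. The last point, like all the names used, is an assumption made for illustration.

```python
def identify_segment(segment, database, stream_segments, is_match):
    """Check a segment against known repeating objects first, then against
    the rest of the stream. Returns the object identifier, or None if the
    segment does not repeat."""
    # Fast path: query the database of previously identified objects.
    for object_id, known in database.items():
        if is_match(segment, known):
            return object_id

    # Slow path: search the rest of the stream for a repetition.
    for other in stream_segments:
        if is_match(segment, other):
            new_id = len(database)
            database[new_id] = segment  # assumed: record the new repeating object
            return new_id

    return None
```

Since the most common objects are found in the database, most repeats are resolved on the fast path without a full search of the stream.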
In addition to the just described benefits, other advantages of the system and method for automatically identifying and segmenting repeating media objects in a media stream will become apparent from the detailed description which follows hereinafter when taken in conjunction with the accompanying drawing figures.