This invention relates generally to media identification systems, and in particular to the identification of unknown media items from a database of known media items that may have portions of common content.
Digital fingerprinting is a process that can be used to identify unknown digital media samples, such as audio or video samples. In an example media identification system, digital fingerprints are generated for each of a number of known media samples, which may be obtained from data files, broadcast programs, streaming media, or any of a variety of other media sources. Each digital fingerprint may comprise a data segment that contains characteristic information about a sample of the media from which it was generated. U.S. Pat. No. 7,516,074, which is incorporated by reference in its entirety, describes embodiments for generating characteristic digital fingerprints from a data signal.
The reference fingerprints are then stored in a database, or repository, and indexed in a way that associates the reference fingerprints with their corresponding media samples and/or metadata related to the media samples. U.S. Pat. No. 7,516,074 also discloses embodiments for indexing reference fingerprints in a database. The database of reference fingerprints can be used to identify an unknown media sample. To identify an unknown media item, a test fingerprint is generated from a sample of the media item. The test fingerprint is then matched against the database of reference fingerprints and, if a match is found, the unknown media sample is declared to be the media sample associated with the matching reference fingerprint. Various exact matching and fuzzy matching algorithms and criteria for declaring a valid match may be used.
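By way of illustration only, the matching step described above might be sketched as follows. The sketch assumes, hypothetically, that each fingerprint is a fixed-length bit string held as an integer and that a match is declared when the bit error rate falls at or below a chosen threshold. The names `identify`, `n_bits`, and `max_bit_error_rate`, and the threshold value, are illustrative assumptions and are not taken from U.S. Pat. No. 7,516,074.

```python
from typing import Dict, Optional


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def identify(test_fp: int,
             references: Dict[str, int],
             n_bits: int = 32,
             max_bit_error_rate: float = 0.1) -> Optional[str]:
    """Return the ID of the closest reference within the threshold, else None.

    Assumption: fingerprints are n_bits-long bit strings stored as ints, and
    the match criterion is a simple bit-error-rate threshold.
    """
    best_id, best_ber = None, 1.0
    for media_id, ref_fp in references.items():
        ber = hamming_distance(test_fp, ref_fp) / n_bits
        if ber < best_ber:
            best_id, best_ber = media_id, ber
    return best_id if best_ber <= max_bit_error_rate else None


# Hypothetical reference database of two known media samples.
references = {
    "song_a": 0b1011_0011_1100_0101_1010_1111_0000_1001,
    "song_b": 0b0100_1100_0011_1010_0101_0000_1111_0110,
}

# Flip two bits of a known fingerprint to simulate a distorted test sample.
noisy_sample = references["song_a"] ^ 0b101
print(identify(noisy_sample, references))
```

An exact matching criterion corresponds to setting `max_bit_error_rate` to zero, while a larger value tolerates more distortion at the cost of more false matches.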
Due to the large number of reference fingerprints in a practical application, the reference fingerprints may be stored in a large-scale distributed database. Because the distributed database may include a large number of items (e.g., reference fingerprints) stored on multiple servers, the database may contain duplicates and different versions of the same or similar reference fingerprints. While exact duplicates can be detected and removed from the database, the database may still include many partial duplicates that share some common parts but represent different media objects. For example, different episodes of the same TV or radio program usually have a few common portions, such as the introduction, the opening music, and the credits. Another example is a set of movies produced by the same movie company: although the movies may be completely unrelated, they usually present the identical company logo and music in their beginning frames. Different broadcast streams may also contain a significant number of repeating fragments (e.g., commercials, promos, or jingles) in common.
Multimedia search engines often employ techniques to reduce the database size and speed up the search process. For example, a multimedia search engine may use an indexing scheme to quickly identify a set of candidate reference fingerprints, which are then compared against a test fingerprint to verify a match. The fingerprint index is usually stored in computer memory (e.g., RAM), which makes the candidate selection process fast and efficient. In contrast, the candidate verification process involves loading additional information (usually, a complete fingerprint) from a storage memory (e.g., a hard disk drive) into RAM. This storage memory input/output is significantly slower than RAM access, and the large number of slow storage memory input/output operations required to verify candidates can significantly degrade system performance.
While the number of false candidates can be reduced by improving the fingerprint indexing technology and tuning the search discriminating properties, this does not change the number of reference media items that have common content, and that number may be significant. Previous techniques use methods of candidate verification that are based on comparison of multiple fingerprint blocks around the initial candidate matching point. Although these methods may enable finding a proper target media object (i.e., the longest match) among all candidates, they require verification and evaluation of all found candidates, including all partial candidates. For example, if a database contains 1000 episodes of the same media program, and all of these episodes contain the same introduction (or logo or overture), all 1000 candidate episodes must be verified to find the best match.
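The cost of such exhaustive verification may be illustrated by the following minimal sketch, which is not an implementation of any particular prior system. It assumes, hypothetically, that each reference fingerprint is a list of integer blocks, that the in-memory index maps a block value to the reference items containing it, and that the class `SlowStore` stands in for the slow storage memory by counting how many complete fingerprints are loaded. All names and data values are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Index = Dict[int, List[Tuple[str, int]]]


class SlowStore:
    """Stands in for storage memory; counts complete-fingerprint loads."""

    def __init__(self, fingerprints: Dict[str, List[int]]) -> None:
        self._fingerprints = fingerprints
        self.loads = 0

    def load(self, media_id: str) -> List[int]:
        self.loads += 1
        return self._fingerprints[media_id]


def build_index(fingerprints: Dict[str, List[int]]) -> Index:
    """Build the in-memory index: block value -> [(media_id, position), ...]."""
    index: Index = defaultdict(list)
    for media_id, blocks in fingerprints.items():
        for pos, block in enumerate(blocks):
            index[block].append((media_id, pos))
    return index


def match_length(ref: List[int], ref_pos: int, test: List[int]) -> int:
    """Count consecutive agreeing blocks from the candidate matching point."""
    n = 0
    while ref_pos + n < len(ref) and n < len(test) and ref[ref_pos + n] == test[n]:
        n += 1
    return n


def search(test: List[int], index: Index, store: SlowStore) -> Optional[str]:
    """Verify every candidate and return the one with the longest match."""
    best_id, best_len = None, 0
    for media_id, pos in index.get(test[0], []):  # fast: lookup in RAM
        ref = store.load(media_id)                # slow: one load per candidate
        length = match_length(ref, pos, test)
        if length > best_len:
            best_id, best_len = media_id, length
    return best_id


# 1000 hypothetical episodes sharing the same three-block introduction.
INTRO = [101, 102, 103]
fingerprints = {f"episode_{i}": INTRO + [1000 + i, 2000 + i] for i in range(1000)}
store = SlowStore(fingerprints)
index = build_index(fingerprints)

test = INTRO + [1042, 2042]  # a sample taken from episode 42
result = search(test, index, store)
print(result, store.loads)   # prints: episode_42 1000
```

Under these assumptions, the correct episode is found, but only after one slow load for each of the 1000 episodes that share the introduction, because the first block of the test fingerprint cannot distinguish among them.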
It would be desirable to provide a search method that can more efficiently identify unknown media items using a database of known media items that may have portions of common content.