The present invention relates generally to the extraction of a characteristic thumbprint from a data signal, such as an audio data file, and further to the comparison or matching of such thumbprints.
Because of the variations in file formats, compression technologies, and other methods of representing data, the problem of identifying a data signal or comparing it to others raises significant technical difficulties. For example, in the case of digital music files on a computer, there are many formats for encoding and compressing the songs. In addition, the songs are often sampled into digital form at different data rates and have slightly different characteristics. These minor differences make direct comparison of such files a poor choice for efficient file or signal recognition or comparison. Direct file comparison also does not allow comparison of media encoded in different formats (e.g., comparing the same song encoded in MP3 and WAV).
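The limitation of direct comparison can be illustrated with a minimal sketch. It assumes a synthetic 440 Hz tone stands in for a song and that only the sampling rate differs between the two copies; the helper names `sine_pcm16` and `digest` are hypothetical and are not part of any described method.

```python
import hashlib
import math
import struct


def sine_pcm16(freq_hz, sample_rate, duration_s):
    """Return a sine tone as little-endian 16-bit PCM bytes (illustrative helper)."""
    n = int(sample_rate * duration_s)
    samples = (
        int(32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


def digest(data):
    """Hash the raw bytes, as a naive direct file comparison might."""
    return hashlib.sha256(data).hexdigest()


# The same half-second tone, sampled at two different rates.
tone_44k = sine_pcm16(440.0, 44100, 0.5)
tone_22k = sine_pcm16(440.0, 22050, 0.5)

# Byte-level and hash-level comparisons both report a mismatch,
# even though both buffers represent the same underlying sound.
same_bytes = tone_44k == tone_22k
same_digest = digest(tone_44k) == digest(tone_22k)
print(len(tone_44k), len(tone_22k), same_bytes, same_digest)
```

Both buffers encode the same audible content, yet they differ in length and in every derived hash, so a byte-for-byte comparison treats them as unrelated files.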
For these reasons, identifying and tracking media and other content, such as that distributed over the Internet, is often done by attaching metadata, watermarks, or some other code that contains identification information for the media. However, this attached information is often incomplete, incorrect, or both. For example, metadata is rarely complete, and filenames are even more rarely uniform. In addition, approaches such as watermarking are invasive, altering the original file with the added data or code. Another drawback of these approaches is that they are vulnerable to tampering. Even if every media file were to include accurate identification data such as metadata or a watermark, the files could be “unlocked” (and thus pirated) if the information were successfully removed.
Accordingly, other methods have been developed based on the concept of analyzing the content of a data signal itself. These methods, however, suffer from significant limitations and a lack of robustness. Moreover, many of these techniques rely on knowing the beginning and ending of a signal. As a result, they cannot identify a signal whose beginning and end points are not defined, as in the case of streaming media provided over a broadcast network like the Internet. Signal identification in streaming media is nevertheless very desirable, for example, to independently determine which audio data files have been broadcast over the Internet.
One example of a content-based approach, U.S. Pat. No. 5,918,223, issued Jun. 29, 1999, entitled “Method and article of manufacture for content-based analysis, storage, retrieval, and segmentation of audio information,” describes a method for identifying “sounds” that fit a particular set of attributes (e.g., sounds that are “scratchy” versus sounds that are “bright”). This technique has been adapted for use in song recognition applications, but the algorithm does not allow for the identification of streaming signal sources, nor does it work with types of data signals other than audio. Moreover, the algorithm described in the '223 patent generates large 1000-character thumbprints that are not well suited to client/server applications and other large-volume applications. Lastly, the algorithm relies on the Fast Fourier Transform (FFT) to process the audio signals, a process that is resource-intensive and thus not very efficient.
Accordingly, there exists a need to overcome the limitations that current signal recognition techniques have failed to address.