With the rapid growth of digital content, there is an increasing demand for properly managing and locating the content. One prospective application is identifying an excerpt of audio or video within a repository of known content. This can be useful in monitoring illegal download/transfer of content on the Internet. It can also be useful in providing intelligent access to content that users are listening to or watching and are interested in, but for which they do not know the title or author information. The user can, for example, place a cell phone against the loudspeaker currently playing a song, and let the cell phone operator's software find out the song's title, its artist and album information, etc.
Such identification capability is generally implemented by first generating feature information (called fingerprints) designed to uniquely identify audio and video signals, and then performing some form of pattern matching search between the fingerprints from the repository/database and those from the excerpt in the search query. Such a database is often implemented in the form of a search tree, although other data structures are also possible. Generally, each fingerprint corresponds to a certain segment of audio or video, so a two second audio fingerprint would correspond to a two second segment of audio. A fingerprint is usually implemented as a concatenation of small blocks of feature information that are often called signatures. A two second fingerprint may, for example, consist of two hundred 10 millisecond (ms) long signatures, wherein each signature is computed from 10 ms of new audio or video information. The pattern matching of fingerprints is therefore a process of comparing corresponding signatures. This is illustrated in FIG. 1, which shows the creation of a fingerprint from signatures.
In order to perform proper pattern matching between a fingerprint from a query excerpt and those from the database, it is crucial to have proper time alignment between the two during a comparison. To ensure that, usually all fingerprints starting at every possible time offset are added to the database to guarantee that at least one of them will have time alignment that is close enough to the query fingerprint. If a signature is 10 ms long, then a two second fingerprint is shifted every 10 ms over a two second sliding window and then added to the database. This is also illustrated in FIG. 1, and in this case it creates a 99.5% overlap between successive fingerprints, but such redundancy is usually required to ensure good search performance. For any remaining time misalignment less than 10 ms (or in general the duration of each signature), a well-designed fingerprint generation method should select whichever signature is closer in terms of timing to better match the corresponding signature from the query. In short, the goal of a fingerprint search system is to find a query fingerprint's corresponding counterpart fingerprint, and the counterpart should minimize its time misalignment with the query fingerprint, if such misalignment exists at all.
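The sliding-window scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `sliding_fingerprints` and `overlap_fraction` are hypothetical, and the signatures are stand-in integers rather than real feature data.

```python
def sliding_fingerprints(signatures, sigs_per_fingerprint):
    """Return every fingerprint obtained by shifting the window one signature at a time.

    Each fingerprint is a tuple of consecutive signatures (hypothetical representation).
    """
    count = len(signatures) - sigs_per_fingerprint + 1
    return [
        tuple(signatures[start:start + sigs_per_fingerprint])
        for start in range(max(count, 0))
    ]


def overlap_fraction(sigs_per_fingerprint, hop=1):
    """Fraction of signatures shared by two successive fingerprints."""
    return (sigs_per_fingerprint - hop) / sigs_per_fingerprint


# Stand-in data: 500 signatures of 10 ms each, i.e. 5 seconds of content.
signatures = list(range(500))

# A two second fingerprint holds 200 signatures of 10 ms each.
fingerprints = sliding_fingerprints(signatures, 200)

print(len(fingerprints))      # number of fingerprints added to the database
print(overlap_fraction(200))  # 199 of 200 signatures are shared, i.e. 99.5%
```

With a hop of one signature, almost every signature is stored 200 times across different fingerprints, which is the redundancy the text refers to.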
Because the query excerpt may have undergone some editing or processing steps, such as recapturing sound played from a loudspeaker using a cell phone, there may be some distortions in the captured audio/video signals. As a result, the resulting fingerprints may also change slightly with respect to their counterparts in the database, assuming there is a counterpart.
The possibility of distortions in the excerpt means the best match in such a search is often not an identical match, but a form of closest match. Defining the closest match requires defining a measure of difference between two fingerprints. For example, a commonly used measure of difference is the Hamming distance, that is, the number of differing bits between the fingerprint from the query excerpt and that from the database. With this measure of difference, the closest match is the fingerprint from the database that has the minimum Hamming distance from the fingerprint from the query excerpt. The Hamming distance between two fingerprints divided by the number of bits in a fingerprint is often referred to as the bit error rate (BER). The BER is an example of a measure of relative difference. The minimum Hamming distance criterion works well when the BER between the fingerprint from the excerpt and its counterpart is small. However, as the BER increases, the search result producing the minimum Hamming distance increasingly fails to find the actual counterpart. Fortunately, in most fingerprint search applications, it is only necessary to identify the correct audio/video piece, but not necessarily the corresponding segment. But when the BER increases further, the search result may even find the wrong audio/video piece, let alone the correct segment within that piece. The BER depends on both the level of distortion in the query excerpt and the robustness of the fingerprint extraction method with respect to such distortions.
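As a minimal sketch of the minimum Hamming distance criterion, the following Python snippet treats each fingerprint as an integer bit pattern and scans a small dictionary for the closest match. The helper names, the 16-bit fingerprint length, and the piece labels are assumptions made for illustration; a real system would use far longer fingerprints and a search structure such as a tree rather than a linear scan.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def bit_error_rate(a: int, b: int, num_bits: int) -> float:
    """Hamming distance divided by the number of bits in a fingerprint."""
    return hamming_distance(a, b) / num_bits


def closest_match(query: int, database: dict, num_bits: int):
    """Return (label, BER) of the database fingerprint nearest to the query."""
    label = min(database, key=lambda name: hamming_distance(query, database[name]))
    return label, bit_error_rate(query, database[label], num_bits)


NUM_BITS = 16  # toy length, chosen only to keep the example readable

# Hypothetical database of two pieces.
database = {
    "piece_a": 0b1011001110001111,
    "piece_b": 0b0100110001110000,
}

# A distorted copy of piece_a with two bits flipped.
query = database["piece_a"] ^ 0b0000010000000001

print(closest_match(query, database, NUM_BITS))  # ('piece_a', 0.125)
```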
Furthermore, it is possible that an excerpt does not belong to any piece in the database. For example, the excerpt may be the recording of a new composition of music. Because no search algorithm can know beforehand (without being told) whether an excerpt belongs to the database, the best it can do is to apply the same minimum Hamming distance criterion, with the expectation that the minimum Hamming distance found in such cases will differ markedly from (preferably be higher than) that of an excerpt originally from the database, and then use a threshold to determine whether the excerpt is from the database.
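The threshold decision described above might look like the following Python sketch. The function name `identify`, the 8-bit toy fingerprints, and the threshold value of 0.2 are illustrative assumptions, not values prescribed by any particular system.

```python
from typing import Dict, Optional


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded fingerprints."""
    return bin(a ^ b).count("1")


def identify(query: int, database: Dict[str, int], num_bits: int,
             ber_threshold: float) -> Optional[str]:
    """Return the closest piece, or None if its BER exceeds the threshold."""
    if not database:
        return None
    label = min(database, key=lambda name: hamming_distance(query, database[name]))
    ber = hamming_distance(query, database[label]) / num_bits
    # A high minimum BER suggests the excerpt is not in the database at all.
    return label if ber <= ber_threshold else None


# Hypothetical 8-bit fingerprints for two known pieces.
database = {"piece_a": 0b11110000, "piece_b": 0b00001111}

known_query = 0b11110001    # one bit away from piece_a (BER 0.125)
unknown_query = 0b11001100  # four bits away from both pieces (BER 0.5)

print(identify(known_query, database, 8, ber_threshold=0.2))    # piece_a
print(identify(unknown_query, database, 8, ber_threshold=0.2))  # None
```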
Therefore, there are three possible outcomes after a single search operation (wherein only one query fingerprint is used to search the database), before applying any thresholds (for example in terms of BER):
1. The excerpt belongs to the database, and the search returns the correct audio/video piece (finding the correct piece is enough; here it is not necessary to find the correct counterpart segment).
2. The excerpt belongs to the database, and the search returns the wrong audio/video piece.
3. The excerpt does not belong to the database, and because the search always returns some audio/video piece, the answer will always be wrong.
FIG. 2 shows an example of BER distribution for three different possible outcomes of a single search. Each of these outcomes would generate a corresponding probability-density-function (PDF) distribution of the BER. For a well-designed fingerprint extraction algorithm, the BER of the first outcome should generally be smaller than the BER of the second and third outcomes, as illustrated in FIG. 2.
However, if the BERs of the second and third outcomes have very similar PDF distributions, it would be difficult to distinguish between an excerpt that belongs to the database but produces a wrong search result and an excerpt that does not belong to the database at all. Furthermore, for pieces originally from the database, after applying common audio/video distortions such as codec compression, the percentage of search results in a typical implementation that are correct (in terms of identifying the correct piece) usually ranges from 90 to 99%, depending on the fingerprint duration and the type of distortions, before applying any BER threshold. This is good, but a higher level of accuracy is certainly desirable. After applying BER thresholding (say at BER=0.2 in FIG. 2), the ratio of correct search results only decreases, albeit slightly, because the tail of the BER distribution of outcome one is discarded to avoid falsely picking up too much of the head of the distribution of outcome two. This means that tweaking the BER threshold alone cannot lead to very high accuracy (say 99.9%) in a single search.
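The trade-off behind a single BER threshold can be illustrated with a small Python sketch. The BER samples below are invented purely for illustration and are not taken from FIG. 2 or from any measurement; the function name `acceptance_rates` is likewise hypothetical.

```python
def acceptance_rates(correct_bers, wrong_bers, threshold):
    """Fractions of correct and wrong search results accepted at a BER threshold."""
    kept = sum(ber <= threshold for ber in correct_bers) / len(correct_bers)
    admitted = sum(ber <= threshold for ber in wrong_bers) / len(wrong_bers)
    return kept, admitted


# Invented BER samples: outcome one (correct piece) and outcome two (wrong piece).
correct_bers = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.22, 0.25]
wrong_bers = [0.19, 0.24, 0.28, 0.30, 0.33, 0.35, 0.38, 0.40]

for threshold in (0.15, 0.20, 0.25, 0.30):
    kept, admitted = acceptance_rates(correct_bers, wrong_bers, threshold)
    print(f"threshold={threshold:.2f}  correct kept={kept:.3f}  wrong admitted={admitted:.3f}")
```

In this toy data, a threshold of 0.20 discards a quarter of the correct results while still admitting one wrong result, and raising the threshold to 0.30 keeps every correct result but admits half of the wrong ones. Because the two distributions overlap, no single threshold value separates them cleanly.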