Digital audio signals offer many advantages over conventional media in terms of audio quality and ease of transmission. With the ever-increasing popularity of the Internet, digital audio clips have become a mainstay of the Web experience, buoyed by such advances as the increasing speed at which data is carried over the Internet and improvements in Internet multimedia technology for playing such audio clips. Every day, numerous digital audio clips are added to Web sites around the world.
An audio “clip” indicates an audio signal (or bit stream), in whole or part. A clip may be stored and retrieved, transmitted and received, or the like.
As audio clip databases grow, the needs for indexing them and protecting copyrights in the audio clips are becoming increasingly important. The next generation of database management software will need to accommodate solutions for fast and efficient indexing of digital audio clips and protection of copyrights in those digital audio clips.
A hashing technique is one possible solution to the audio clip indexing and copyright protection problem. Hashing techniques are used in many areas, such as database management, querying, cryptography, and other fields involving large amounts of raw data. A hashing technique maps a large block of data (which may appear to be raw and unstructured) into a relatively small and structured set of identifiers (the identifiers are also referred to as “hash values” or simply “hash”). By introducing structure and order into raw data, the hashing technique drastically reduces the raw data to short identifiers. This simplifies many data management issues and reduces the computational resources needed for accessing large databases.
Thus, one property of a good hashing technique is the ability to produce small-size hash values. Searching and sorting can be done much more efficiently on smaller identifiers as compared to the large raw data. For example, smaller identifiers can be more easily sorted and searched using standard methods. Thus, hashing generally yields greater benefits when smaller hash values are used.
Unfortunately, there is a point at which hash values become too small and begin to lose the desirable quality of uniquely representing a large mass of data items. That is, as the size of hash values decreases, it becomes increasingly likely that more than one distinct raw data item will be mapped into the same hash value, an occurrence referred to as “collision”. Mathematically, for an alphabet of cardinality A for each hash digit and a hash value length l, an upper bound on the number of all possible hash values is $A^l$. If the number of distinct raw data items is larger than this upper bound, collision must occur.
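The bound above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the helper `short_hash`, the choice of SHA-256 truncated to hexadecimal digits (so A = 16), and the item names are all hypothetical and are not taken from any particular hashing technique discussed here.

```python
import hashlib

def short_hash(data: bytes, length: int) -> str:
    """Hypothetical helper: keep only the first `length` hex digits of SHA-256.

    Each hex digit comes from an alphabet of cardinality A = 16.
    """
    return hashlib.sha256(data).hexdigest()[:length]

A = 16                # cardinality of the alphabet for each hash digit
l = 2                 # hash value length, in digits
upper_bound = A ** l  # at most A^l distinct hash values exist

# Hash one more distinct item than there are possible hash values.
items = [f"clip-{i}".encode() for i in range(upper_bound + 1)]
hashes = {short_hash(item, l) for item in items}

# By the pigeonhole principle, at least two items must share a hash value.
has_collision = len(hashes) < len(items)
print(upper_bound, len(items), len(hashes), has_collision)
```

With more distinct items than possible hash values, at least one collision is guaranteed regardless of how well the underlying function distributes its outputs.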
Accordingly, another property of a good hashing technique is to minimize the probability of collision. However, if a considerable reduction in the length of the hash values can be achieved, it is sometimes justified to tolerate collision. The length of the hash value is thus a trade-off against the probability of collision. A good hashing technique should minimize both the probability of collision and the length of the hash values. This is a concern in the design of both hashing techniques in compilers and message authentication codes (MACs) in cryptographic applications.
Good hashing techniques have long existed for many kinds of digital data. These techniques have good characteristics and are well understood. A hashing technique for audio clip database management would be very useful and could potentially be used to identify audio clips for data retrieval and copyright protection.
Unfortunately, while there are many good existing hashing functions, digital audio clips present a unique set of challenges not experienced with other digital data, primarily because audio clips are subject to evaluation by human listeners. A slight pitch or phase shift in an audio clip does not make much difference to the human ear, but such changes appear very different in the digital domain. Thus, when using conventional hashing functions, a shifted version of an audio clip generates a very different hash value from that of the original audio clip, even though the audio clips sound essentially identical (i.e., perceptually the same).
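This sensitivity can be sketched in Python with a conventional hash function. The example is a rough illustration under assumptions: the synthetic 440 Hz tone, the 8 kHz sample rate, the 0.01 radian phase offset, and the helper name `synth_clip` are all hypothetical, and SHA-256 stands in for any conventional hashing function.

```python
import hashlib
import math
import struct

def synth_clip(n=8000, freq=440.0, rate=8000, phase=0.0) -> bytes:
    """Hypothetical helper: a sine tone as 16-bit little-endian PCM bytes."""
    samples = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate + phase))
        for i in range(n)
    ]
    return struct.pack(f"<{n}h", *samples)

original = synth_clip()
shifted = synth_clip(phase=0.01)  # small phase shift of the same tone

h1 = hashlib.sha256(original).hexdigest()
h2 = hashlib.sha256(shifted).hexdigest()

# Count how many of the 256 digest bits differ between the two clips.
differing_bits = bin(int(h1, 16) ^ int(h2, 16)).count("1")
print(h1)
print(h2)
print(differing_bits, "of 256 bits differ")
```

The two clips differ only by a small phase offset, yet the digests are unrelated, which is the behavior that makes conventional hash functions unsuitable for identifying perceptually identical audio.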
Another example is the deletion of a short block of time from an audio clip. If the deleted block is short and in an otherwise quiet portion of the clip, most people will not recognize this deletion in the audio clip itself, yet the digital data is altered significantly if viewed in the data domain.
Human ears are rather tolerant of certain changes in audio clips. For instance, human ears are less sensitive to changes in some ranges of frequency components of an audio clip than other ranges of frequency components. Human ears are also unable to catch small stretching and shrinking of short segments in audio clips.
Many of these characteristics of the human auditory system can be used advantageously in the delivery and presentation of digital audio clips. For instance, such characteristics enable compression schemes, like MP3, to compress audio clips with good results, even though some of the audio clip data may be lost or go unused. There are many audio clip restoration/enhancement algorithms available today that are specially tuned to the human auditory system. Commercial sound editing systems often include such algorithms.
At the same time, these characteristics of the human auditory system can be exploited for illegal or unscrupulous purposes. For example, a pirate may use advanced audio processing techniques to remove copyright notices or embedded watermarks from an audio clip without perceptually altering the audio clip. Such malicious changes to the audio clip are referred to as “attacks”, and result in changes in the data domain.
Unfortunately, a human is unable to perceive these changes, allowing the pirate to successfully distribute unauthorized copies in an unlawful manner. Traditional hashing techniques are of little help because the original audio clip and pirated copy hash to very different hash values, even though the audio clips sound the same.
Common Attacks
The standard set of plausible attacks is itemized in the Request for Proposals (RFP) of IFPI (International Federation of the Phonographic Industry) and RIAA (Recording Industry Association of America). The RFP encapsulates the following security requirements:
    two successive D/A and A/D conversions,
    data reduction coding techniques such as MP3,
    adaptive transform coding (ATRAC),
    adaptive subband coding,
    Digital Audio Broadcasting (DAB),
    Dolby AC2 and AC3 systems,
    applying additive or multiplicative noise,
    applying a second Embedded Signal, using the same system, to a single program fragment,
    frequency response distortion corresponding to normal analogue frequency response controls such as bass, mid and treble controls, with maximum variation of 15 dB with respect to the original signal, and
    applying frequency notches with possible frequency hopping.
Accordingly, there is a need for a hashing technique for digital audio clips that allows slight changes to the audio clip which are tolerable or undetectable (i.e., imperceptible) to the human ear, yet do not result in a different hash value. For an audio clip hashing technique to be useful, it should accommodate the characteristics of the human auditory system and withstand various audio signal manipulation processes common to today's digital audio clip processing.
A good audio hashing technique should generate the same unique identifier even after some forms of attack have been applied to the original audio clip, provided that the altered audio clip sounds reasonably similar (i.e., perceptually similar) to the original when a human listener compares the two. However, if the modified audio clip is audibly different or the attacks cause irritation to listeners, the hashing technique should recognize this degree of change and produce a hash value different from that of the original audio clip.
Content Categorization
Like anti-piracy, semantic categorization of the audio content of audio clips often requires subjective comparisons to other existing audio works. Works of a similar nature are grouped into the same category. The content of audio clips may be semantically classified into any number of categories, such as classical music, conversation, hard rock, easy listening, polka, lecture, country, and other such semantic categories.
Typically, such semantic categorization is determined by manual (i.e., human) subjective analysis of a work so that it may be grouped with an existing category. No technique exists for automatically (i.e., without substantial human involvement) analyzing and categorizing the semantic audio content of audio clips.