Recent years have seen a marked increase in the use of automatic recognition of media such as music or other audio (collectively and generically referred to herein as “content” or “content signals”) generated from a variety of sources. For example, owners of copyrighted works or advertisers can apply automatic content recognition techniques to obtain data on the frequency of broadcast of their material. Music tracking services can provide playlists of major radio stations in large markets. Consumers can identify content such as songs, television shows, movies, advertising, etc., broadcast on the radio or television, streamed over the Internet, played from a CD or DVD, etc., and rendered (i.e., played) via a loudspeaker. Once the content is identified, consumers can purchase or gain access to new and interesting music or other products and services, as well as access meta-data (e.g., artist, song title, show title, episode, etc.) corresponding to the content.
Content recognition techniques commonly rely upon various content fingerprinting algorithms to compute or derive one or more “fingerprints” that characterize a content signal. As commonly understood, the “fingerprint” of a content signal represents one or more salient features of that content signal at or near a particular anchor or landmark therein. Within the field of content recognition, it is commonly understood that a “salient feature” of a content signal is an intrinsic characteristic of the content signal, as opposed to an extrinsic feature (e.g., title, identification number, author, publication date, etc.) which may describe or otherwise be assigned to or associated with the content signal. Recognition of a sampled content signal is carried out by identifying one or more fingerprints derived from a known content signal that sufficiently correspond to, or match, one or more fingerprints derived from the sampled content signal.
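By way of illustration only, the general notion of deriving fingerprints at landmarks and matching them against a known content signal can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of any particular conventional technique: the frame length `FRAME`, the use of the dominant frequency bin of a frame as the salient feature, the pairing of adjacent frames into a fingerprint, and the helper names `peak_bin`, `fingerprints`, `match_score`, and `tone_sequence` are all hypothetical choices made for this example.

```python
import cmath
import math
import random
from collections import Counter

FRAME = 64  # samples per analysis frame (assumed value for this sketch)

def peak_bin(frame):
    """Return the index of the strongest non-DC frequency bin (naive DFT)."""
    n = len(frame)
    best_bin, best_mag = 1, -1.0
    for k in range(1, n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin

def fingerprints(signal):
    """Return (fingerprint, landmark) pairs for a signal.

    The landmark is a frame index; the fingerprint is the pair of peak
    bins of that frame and the next one (an intrinsic feature only).
    """
    peaks = [peak_bin(signal[i:i + FRAME])
             for i in range(0, len(signal) - FRAME + 1, FRAME)]
    return [((peaks[t], peaks[t + 1]), t) for t in range(len(peaks) - 1)]

def match_score(known, sample):
    """Count fingerprint matches that agree on a single time offset."""
    index = {}
    for fp, t in fingerprints(known):
        index.setdefault(fp, []).append(t)
    offsets = Counter()
    for fp, t in fingerprints(sample):
        for t_known in index.get(fp, []):
            offsets[t_known - t] += 1
    return max(offsets.values(), default=0)

def tone_sequence(bins):
    """Synthetic test signal: one pure tone per frame (hypothetical data)."""
    return [math.sin(2 * math.pi * b * i / FRAME)
            for b in bins for i in range(FRAME)]

# Two synthetic "known" content signals and a noisy excerpt of the first.
catalog = {
    "track_a": tone_sequence([3, 7, 5, 11, 9, 4, 13, 6]),
    "track_b": tone_sequence([8, 2, 12, 10, 3, 14, 5, 7]),
}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
excerpt = catalog["track_a"][2 * FRAME:6 * FRAME]
sample = [x + rng.uniform(-0.2, 0.2) for x in excerpt]

scores = {name: match_score(sig, sample) for name, sig in catalog.items()}
best = max(scores, key=scores.get)
```

In this sketch, requiring matched fingerprints to agree on a common time offset is one simple way to decide whether the correspondence is sufficient; an actual system would typically use richer features and a tuned decision threshold.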
Frequently, content signals are sampled as they are rendered so as to be present within the ambient aural environment. However, the aural environment in which a content signal is rendered may undesirably contain ambient noise (e.g., people talking, coffee grinders grinding, espresso machines brewing, doors slamming, sirens blaring, etc.), acoustic reflections, reverberations, etc., that can be captured with the sampled content signal and incorporated into the derived fingerprint (or otherwise recorded as a fingerprint) for the sample. The presence of such environment-influenced fingerprints can undesirably affect accurate and reliable identification of the sampled content signal. Similarly, if a content signal is rendered below a certain “loudness” or sound pressure level (either in absolute terms, or relative to other sounds present within the aural environment), then conventional content recognition techniques may have problems accurately and reliably identifying the content signal. Further, the manner in which the content signal is rendered or sampled can introduce temporal distortion (e.g., time scaling) that can undesirably affect accurate and reliable identification of the sampled content signal. Thus, conventional content recognition techniques can exhibit undesirably low robustness in the presence of degradation sources such as background noise, acoustic reflections, and channel distortion. It was a recognition of these and other problems associated with conventional content recognition techniques that formed the impetus of the embodiments exemplarily disclosed herein.