The background of the present disclosure and the illustrative embodiments disclosed herein are described in the context of identifying known audio recordings encountered during an outbound telephone call, for example during a call placed from a contact center. However, the present invention has applicability to the identification of any segment of audio or an image (as used herein, the term “image” is intended to encompass both still and moving images), regardless of the type or source of the audio or image, and regardless of the circumstances in which the audio or image is encountered. Furthermore, the present invention also has applicability to the identification of any segment of data such as, for example, data obtained from any type of sensor. Therefore, as used herein, the term “dataset” shall encompass a collection of any type of data, whether comprising audio, image, or other type of data.
In a classic contact center scenario, outbound calls are made either automatically (by a class of devices known as “automated dialers” or “autodialers”) or manually. A number of human “agents” are available to join calls that are determined to have reached a live person at the called end. In this way, efficiencies are obtained by not having agents involved in a call until it is determined that there is a live person at the called end with whom the agent may speak. The use of automated equipment to monitor the telephone line during the outbound call is referred to as call progress analysis (CPA). CPA is a class of algorithms that operate on audio and network signaling during call setup. The goal of CPA is to determine the nature of the callee, or the outcome of call setup to an external network (traditional public switched telephone network or Voice over Internet Protocol (VoIP)). Specifically, when a call or session is being established, the caller or initiator must determine whether it was answered by a live speaker, whether the line is busy, etc. When the caller is an automated application, such as an automated dialer or message broadcasting system, CPA algorithms are used to perform the classification automatically. CPA is used to interpret so-called call-progress tones, such as ring back and busy, that are delivered by the telephone network to the calling entity. Traditional CPA is performed using low- and high-pass frequency discriminators together with energy measurements over time to qualify in-band signaling tones.
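By way of illustration only, the energy-over-time portion of traditional CPA may be sketched as follows. The sketch measures how long the line energy stays above and below a threshold and compares that cadence with two call-progress tones. The function names, the threshold, and the tolerances are hypothetical, and the cadence values (busy: about 0.5 seconds on and 0.5 seconds off; ring back: about 2 seconds on and 4 seconds off) are assumed from the common North American tone plan rather than taken from this disclosure. The frequency discrimination step is omitted for brevity.

```python
from typing import List, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def runs(flags: Sequence[bool]) -> List[Tuple[bool, int]]:
    """Collapse a flag sequence into (flag, run_length) pairs."""
    out: List[Tuple[bool, int]] = []
    for flag in flags:
        if out and out[-1][0] == flag:
            out[-1] = (flag, out[-1][1] + 1)
        else:
            out.append((flag, 1))
    return out


def classify_cadence(energies: Sequence[float], frame_sec: float,
                     threshold: float) -> str:
    """Return 'busy', 'ringback', or 'unknown' from the on/off cadence."""
    segments = runs([e > threshold for e in energies])
    # The first and last runs may be cut off by the capture window,
    # so only interior runs are measured.
    interior = segments[1:-1]
    on = [n * frame_sec for flag, n in interior if flag]
    off = [n * frame_sec for flag, n in interior if not flag]
    if not on or not off:
        return "unknown"
    mean_on = sum(on) / len(on)
    mean_off = sum(off) / len(off)
    # Assumed cadences and tolerances (hypothetical values).
    if abs(mean_on - 0.5) <= 0.1 and abs(mean_off - 0.5) <= 0.1:
        return "busy"
    if abs(mean_on - 2.0) <= 0.3 and abs(mean_off - 4.0) <= 0.5:
        return "ringback"
    return "unknown"
```

A practical system would also confirm the tone frequencies before trusting the cadence, since speech or music can produce similar on/off energy patterns.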
Another method for classifying audio on an outbound call is known as Voice Activity Detection (VAD), which is a class of audio processing algorithms that identify where speech is present in an audio stream. The detected speech may originate from any source, including a live speaker or a prerecorded message. Modern VAD algorithms use spectral analysis to distinguish the utterance of a primary speaker from background noise.
A subclass of CPA algorithms that extract speaking patterns using VAD, and determine whether the patterns originate from a live speaker or a prerecorded message, is known as Answering Machine Detection (AMD). By identifying calls that do not connect to a live speaker, an accurate AMD algorithm can significantly increase throughput of an automated dialer. However, false positives from AMD lead to silent or abandoned calls, causing revenue loss for the contact center, and negative impressions amongst the public. The quality of an AMD algorithm is a function of the accuracy and response time, and some regions of the world (notably the U.S. and U.K.) impose strict legal requirements on both.
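As a minimal sketch of the kind of speaking-pattern heuristic an AMD algorithm might apply, the following example takes per-frame VAD decisions as input and examines the first utterance. It assumes, for illustration only, that a live speaker gives a short greeting followed by a pause, while a prerecorded message continues speaking. The function name `classify_answer` and both default thresholds are hypothetical and are not drawn from any particular commercial system.

```python
from typing import Sequence


def classify_answer(vad_flags: Sequence[bool], frame_sec: float,
                    max_greeting_sec: float = 1.5,
                    min_pause_sec: float = 0.5) -> str:
    """Return 'live', 'machine', or 'unknown' from per-frame VAD flags.

    vad_flags[i] is True when frame i was judged to contain speech.
    The thresholds are illustrative assumptions, not tuned values.
    """
    n = len(vad_flags)
    i = 0
    # Skip any silence before the callee starts speaking.
    while i < n and not vad_flags[i]:
        i += 1
    if i == n:
        return "unknown"  # no speech observed at all

    # Measure the length of the first continuous utterance.
    start = i
    while i < n and vad_flags[i]:
        i += 1
    speech_sec = (i - start) * frame_sec
    if speech_sec > max_greeting_sec:
        return "machine"  # long uninterrupted speech

    # Measure the pause that follows the first utterance.
    pause_start = i
    while i < n and not vad_flags[i]:
        i += 1
    pause_sec = (i - pause_start) * frame_sec
    if pause_sec >= min_pause_sec:
        return "live"  # short greeting, then waiting for a reply
    return "unknown"
```

The trade-off between accuracy and response time noted above is visible here: lengthening the thresholds gives the classifier more evidence but delays the moment at which an agent can be connected.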
AMD is not an exact science, and the optimal approach is an open problem. To achieve acceptable accuracy, speed, and flexibility, AMD algorithms use a combination of heuristics and statistical models such as neural networks to classify an utterance as live or prerecorded. Although many commercial AMD systems available on the market report high accuracy rates in their marketing literature (e.g., 95% or more), there is no independent auditor for these figures, and the actual accuracy rate is typically much lower in practice (e.g., 80% or less), as reflected by continued widespread complaints. A general ban has been proposed by some consumer advocacy groups, and some contact centers simply cannot use AMD because of its limitations.
A relatively new science of audio identification is known as Acoustic Fingerprinting, in which a system generates a “fingerprint” of a candidate audio stream, and compares it against a database of known fingerprints, analogous to human fingerprinting used in forensics. In this context, a “fingerprint” is a condensed digest of an audio stream that can quickly establish perceptual equality with other audio streams. A database of known fingerprints may associate known fingerprints with meta-data such as “title”, “artist”, etc. The past ten years have seen a rapidly growing scientific and industrial interest in fingerprinting technology for audio and images. Applications include identifying songs and advertisements, media library management, and copyright compliance.
Various acoustic fingerprinting algorithm classes have been proposed, and the most prevalent today are those based on either “landmarks” or “bitmaps”. Landmark-based algorithms extract discrete features from an audio stream called “landmarks”, such as spectral peaks, sudden changes in tone, pitch, loudness, etc. The optimal choice of landmark is an open question guided mostly by heuristics. The acoustic fingerprint is stored as a sequence of data structures that describe each landmark. At runtime, landmarks extracted from a candidate audio stream are compared to a database of fingerprints based on a distance metric.
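The landmark approach may be illustrated with the following sketch, which treats spectral peaks as the landmarks. It assumes the audio has already been converted into a rectangular grid of spectrogram magnitudes indexed by frame and frequency bin. The function names are hypothetical, and the Jaccard distance used for comparison is only one possible choice of distance metric, selected here for simplicity.

```python
from typing import Sequence, Set, Tuple

Landmark = Tuple[int, int]  # (frame index, frequency bin)


def find_landmarks(spec: Sequence[Sequence[float]],
                   floor: float = 0.0) -> Set[Landmark]:
    """Return cells that exceed `floor` and all of their 4-neighbours."""
    peaks: Set[Landmark] = set()
    for t, frame in enumerate(spec):
        for f, value in enumerate(frame):
            if value <= floor:
                continue
            neighbours = []
            if t > 0:
                neighbours.append(spec[t - 1][f])
            if t + 1 < len(spec):
                neighbours.append(spec[t + 1][f])
            if f > 0:
                neighbours.append(frame[f - 1])
            if f + 1 < len(frame):
                neighbours.append(frame[f + 1])
            # A landmark must be strictly greater than every neighbour.
            if all(value > other for other in neighbours):
                peaks.add((t, f))
    return peaks


def landmark_distance(a: Set[Landmark], b: Set[Landmark]) -> float:
    """Jaccard distance: 0.0 for identical sets, 1.0 for disjoint sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

A candidate stream would be matched to the database entry with the smallest distance, subject to a maximum acceptable distance chosen by the implementer.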
Bitmap-based algorithms analyze an audio stream as a sequence of frames, and use a filter bank to quantize each frame into a bit vector of size N, where N is typically chosen for convenience as the number of bits in a C-style integer, e.g., N ∈ {8, 16, 32, 64}. A popular and well-studied example is known as the “Haitsma-Kalker algorithm”, which computes a binary bitmap using a filter that compares short-term differences in both time and frequency. Its inventors, Jaap Haitsma and Ton Kalker, have reported using the Haitsma-Kalker algorithm, together with the comparison of binary acoustic fingerprint bitmaps, to identify three (3) second recordings of songs from a database of millions of songs (Haitsma and Kalker, “A Highly Robust Audio Fingerprinting System,” Journal of New Music Research, Vol. 32, No. 2 (2003), pp. 211-221). The complete acoustic fingerprint is stored as a sequence of bit vectors, or a bitmap. FIGS. 1A-C show three images of an audio stream containing a message from a telephone network saying “This number has been disconnected”. FIG. 1A shows the original audio wave signal, with 1.5 seconds of audio sampled at 8000 Hz. FIG. 1B shows a spectrogram of the original audio input signal, with dark regions indicating high energy at a particular frequency. FIG. 1C shows a binary acoustic fingerprint bitmap created using the Haitsma-Kalker algorithm, with height N=16. The height is determined by the number of bits computed at each frame, and the width is determined by the number of frames in the audio stream. At runtime, the bitmap computed from a candidate audio stream is compared to a database of bitmaps based on the number of non-matching bits, also known as the Hamming distance.
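The bitmap construction and comparison described above may be sketched as follows. The example assumes that a filter bank has already produced N+1 band energies for each frame, and it sets each bit from the sign of the energy difference taken across adjacent bands and then across adjacent frames, in the manner of the Haitsma-Kalker filter. The function names are hypothetical, and the framing, windowing, and filter bank stages are omitted.

```python
from typing import List, Sequence


def fingerprint_bits(energies: Sequence[Sequence[float]]) -> List[int]:
    """Quantize per-frame band energies into one N-bit integer per frame.

    energies[n][m] is the energy of band m in frame n. Each frame needs
    N + 1 bands to yield N bits. The first frame serves only as the
    "previous frame" reference, so len(energies) - 1 integers are returned.
    """
    bitmap: List[int] = []
    for prev, cur in zip(energies, energies[1:]):
        word = 0
        for m in range(len(cur) - 1):
            # Difference across adjacent bands, then across adjacent frames.
            diff = (cur[m] - cur[m + 1]) - (prev[m] - prev[m + 1])
            word = (word << 1) | (1 if diff > 0 else 0)
        bitmap.append(word)
    return bitmap


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the non-matching bits between two equal-length bitmaps."""
    if len(a) != len(b):
        raise ValueError("bitmaps must contain the same number of frames")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

Because each frame is a single integer, comparing two frames reduces to one exclusive-or followed by a bit count, which is one reason N is usually matched to a native integer width.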
Bitmap matching and acoustic fingerprinting are powerful emerging tools in the science of audio recognition; however, they are computationally intensive and in many cases require several seconds of sampled audio to make a match. This delay makes them poorly suited for use in call progress analysis. Accordingly, there remains a need for faster and more accurate systems and methods for identifying audio, both in the general case and during an outbound call attempt.