Many different types of telephony technologies and devices are presently available and in use for processing, storing, and transmitting audio streams, and in particular voice streams, and new telephony technologies and devices are constantly being developed and introduced into the market. These technologies and devices span a gamut that includes plain old telephony systems (POTS) and devices, voice over IP (VOIP), voice over ATM, voice over mobile (e.g. GSM, UMTS), and various speech coding technologies and devices. For convenience of presentation, any technology and/or device, such as by way of example, a technology or device noted above, that provides a reproduction of a voice signal, is generically referred to as a “CODEC”.
Whether a CODEC provides speech of acceptable quality was, and often still is, determined by having human subjects listen to, and grade, voice signals that the CODEC produces. An advantage of using human subjects to test and grade a CODEC is that humans provide a measure of quality of voice reproduction as it is perceived by the consumers who use the CODEC. The measures they provide reflect the human auditory-brain system and are responsive to features of sound to which the human auditory-brain system is sensitive and to how sound is perceived by humans. Quality grades for CODEC voice reproduction signals perceived by human subjects have been standardized in a Mean Opinion Score (MOS), which ranks perceived quality of voice reproduction on a scale of 1 to 5, with 5 being the best perceived quality.
However, using human subjects to grade CODEC sound quality is generally expensive, time consuming, not easily used in many venues, difficult to arrange and often not reproducible. A method referred to as “Perceptual Evaluation of Speech Quality (PESQ)” provides an “objective” method of grading quality of voice reproductions provided by a CODEC and is presently a standard for measuring voice reproduction quality. PESQ is configured to generate a voice quality score for a reproduced vocal signal that is indicative of, and generally correlates highly with, a quality score that would be perceived for the voice signal by human subjects. PESQ is described in ITU-T Recommendation P.862, the disclosure of which is incorporated herein by reference, and was adopted as a standard by the ITU-T for assessing speech quality for CODECs in February of 2001.
In accordance with PESQ, a CODEC is graded for quality of the voice signal reproductions that it provides by comparing an input voice signal that it receives with the reproduced output voice signal that the CODEC outputs responsive to the input. To make the comparison, the input and output voice signals are processed to provide input and output psychophysical “perceptual” representations of the signals. The perceptual representations, hereinafter “perceptual signals”, are representative of the way in which the input and output signals are perceived by the human auditory system. The perceptual signals are a frame-by-frame mapping of the frequencies and loudness of the input and output signals onto frequency and loudness scales that reflect sensitivity of the human auditory system.
Typically, the perceptual signals are generated by performing a windowed, frame-by-frame, fast Fourier transform (FFT) of the signals to provide a frequency spectrum for each frame of the signals. The frequency spectra are warped to the human perceptual frequency and loudness scales, measured in barks and sones respectively, to provide for each frame, in the input and output perceptual signals, loudness in sones as a function of frequency in barks, hereinafter referred to as a “sone density function”. The input signal and output signal are each therefore represented by a two-dimensional “perceptual” array of sone values as a function of frame number and frequency. A typical frame is a 32 ms long period of PCM samples acquired at a sampling rate of 8 kHz or 16 kHz, with 50% overlap between consecutive frames, and windowing is defined by a multiplication of each frame with a Hanning window 32 ms long.
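By way of illustration only, the framing and windowing described above may be sketched as follows. The sketch assumes a periodic form of the Hanning window and uses a naive discrete Fourier transform as a stand-in for an FFT; the function names are hypothetical and the warping to bark and sone scales is not shown.

```python
import cmath
import math

def frame_length(sample_rate_hz, frame_ms=32):
    """Number of PCM samples in one frame (256 at 8 kHz, 512 at 16 kHz)."""
    return sample_rate_hz * frame_ms // 1000

def hann_window(n):
    """Periodic Hanning window of n samples (assumed form of the window)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def windowed_frames(samples, sample_rate_hz, frame_ms=32):
    """Split samples into frames with 50% overlap and apply the window."""
    n = frame_length(sample_rate_hz, frame_ms)
    hop = n // 2  # 50% overlap between consecutive frames
    window = hann_window(n)
    return [
        [samples[start + i] * window[i] for i in range(n)]
        for start in range(0, len(samples) - n + 1, hop)
    ]

def power_spectrum(frame):
    """Power per frequency bin from a naive DFT (stand-in for an FFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum
```

For a signal sampled at 8 kHz, each frame in this sketch holds 256 samples and consecutive frames start 128 samples apart. In practice an FFT routine would replace the naive transform.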
Signed differences between the sone density functions of corresponding frames in the perceptual input and output signals are determined to provide a frame-by-frame “audible perceptual difference” between the original, input signal and the output signal as a function of bark frequency. The perceptual differences are adjusted for masking to define a “disturbance” function of bark frequency for each frame, which function is conventionally referred to as a “disturbance density function” of the frame.
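A minimal sketch of this step is given below. It assumes that the difference is taken as output minus input, and that masking is applied as a dead zone whose width in each time-frequency cell is a fraction of the smaller of the two loudness values. The factor of 0.25 and the function names are assumptions made for illustration and should be checked against ITU-T Recommendation P.862.

```python
def signed_difference(input_sones, output_sones):
    """Per-frame, per-band signed difference (output minus input, assumed)."""
    return [
        [out_v - in_v for in_v, out_v in zip(in_frame, out_frame)]
        for in_frame, out_frame in zip(input_sones, output_sones)
    ]

def disturbance_density(input_sones, output_sones, mask_factor=0.25):
    """Pull each signed difference toward zero by a per-cell mask value."""
    result = []
    for in_frame, out_frame in zip(input_sones, output_sones):
        frame = []
        for in_v, out_v in zip(in_frame, out_frame):
            diff = out_v - in_v
            mask = mask_factor * min(in_v, out_v)  # assumed masking rule
            if diff > mask:
                frame.append(diff - mask)
            elif diff < -mask:
                frame.append(diff + mask)
            else:
                frame.append(0.0)  # difference is masked entirely
        result.append(frame)
    return result
```

Each argument is a two-dimensional array indexed by frame number and then by bark band, matching the perceptual arrays described above.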
The disturbance density function for a given pair of corresponding input and output signal frames is particularly sensitive to temporal misalignment between the frames. A two-part article entitled “Perceptual Evaluation of Speech Quality (PESQ), the new ITU standard for end-to-end speech quality assessment”, Part I: Time alignment, by A. W. Rix, et al., J. Audio Eng. Soc., Vol. 50, No. 10, October 2002, pp. 754-764, and Part II: Psychoacoustic model, J. Audio Eng. Soc., Vol. 50, No. 10, October 2002, pp. 765-778, the disclosures of which are incorporated herein by reference, notes that PESQ values are very sensitive to temporal frame misalignment by even small fractions of a frame length. As a result, prior to calculating disturbances and disturbance density functions, PESQ performs a relatively elaborate procedure for determining relative delays between corresponding input and output signal frames and time aligning the frames.
Conventionally, PESQ assumes that delays are piecewise constant, i.e., that a delay for a given section of the output signal, generally comprising a plurality of frames, relative to a corresponding section of the input signal, is constant for all frames in the output section. The section delay is determined responsive to cross-correlating a portion of the output signal that comprises the section and/or the section with, respectively, a portion of the input signal that comprises the corresponding section and/or the corresponding section of the input signal.
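The piecewise-constant delay assumption may be illustrated by the following simplified sketch, which estimates one delay per fixed-length section by choosing the lag that maximizes a direct cross-correlation sum. The fixed section length, the bounded lag search, and the function names are assumptions made for illustration; the alignment procedure actually performed by PESQ is considerably more elaborate.

```python
def correlation_at_lag(reference, degraded, start, length, lag):
    """Cross-correlation of one reference section with the degraded signal
    shifted by lag samples (positive lag: degraded is delayed)."""
    total = 0.0
    for i in range(start, min(start + length, len(reference))):
        j = i + lag
        if 0 <= j < len(degraded):
            total += reference[i] * degraded[j]
    return total

def estimate_section_delay(reference, degraded, start, length, max_lag):
    """Lag in [-max_lag, max_lag] with the largest cross-correlation."""
    lags = range(-max_lag, max_lag + 1)
    return max(
        lags,
        key=lambda lag: correlation_at_lag(reference, degraded, start, length, lag),
    )

def piecewise_delays(reference, degraded, section_length, max_lag):
    """One delay estimate per section, held constant within the section."""
    return [
        estimate_section_delay(reference, degraded, start, section_length, max_lag)
        for start in range(0, len(reference), section_length)
    ]
```

Each returned value would then be applied to every frame of its section before the sone density functions of corresponding frames are compared.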
The disturbance density function for a frame is processed in accordance with a metric defined by a cognitive model that models human sensitivity to disturbances to calculate a disturbance and an asymmetric disturbance for each frame. The frame disturbances and asymmetric frame disturbances are processed in accordance with the cognitive model to provide an “objective” PESQ measure of perceived quality, typically in MOS units, of the output signal.
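As a rough illustration of how a disturbance density function may be reduced to per-frame values, the sketch below aggregates over bark bands with a weighted Lp norm. The exponents, the band-width weighting, and in particular the crude asymmetry rule, which simply weights added (positive) disturbances more heavily, are assumptions made for illustration and are not the cognitive model defined in ITU-T Recommendation P.862.

```python
def lp_norm(values, weights, p):
    """Weighted Lp norm: (sum_k w_k * |v_k|**p) ** (1/p)."""
    return sum(w * abs(v) ** p for v, w in zip(values, weights)) ** (1.0 / p)

def frame_disturbance(density, band_widths, p=3.0):
    """Single disturbance value for one frame (exponent p is assumed)."""
    return lp_norm(density, band_widths, p)

def asymmetric_frame_disturbance(density, band_widths, added_weight=2.0):
    """Hypothetical stand-in for an asymmetry factor: positive (added)
    disturbances are scaled up before an L1 aggregation."""
    scaled = [added_weight * d if d > 0 else d for d in density]
    return lp_norm(scaled, band_widths, 1.0)
```

Here `density` is the disturbance density function of one frame, and `band_widths` holds one weight per bark band.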