The present invention relates to the field of telecommunications. More particularly, the present invention relates to estimating the quality of a speech signal.
In a conventional telecommunications system, the transmission chain over which a speech signal (e.g., a signal carrying a spoken sentence) must pass may include speech encoders, speech decoders, an air interface, public switched telephone network (PSTN) links, computer network links, receive buffering, signal processing logic, and/or playback equipment. As one skilled in the art will readily appreciate, any one or more of the elements that make up the transmission chain may distort the speech signal. Estimating the quality of speech signals is important to ensure that speech quality exceeds minimum acceptable standards, so that speech signals can be heard and understood by a listener.
Typically, estimating speech quality involves transmitting a reference speech signal (herein referred to as a “reference signal”) across a transmission chain to a receiving entity. The received signal, having been distorted by the various elements that make up the transmission chain, is herein referred to as the test signal. The test signal and the original reference signal are then forwarded to a speech quality estimation algorithm.
There are a number of conventional speech quality estimation algorithms. Most, however, employ the same basic technique, which is illustrated in FIG. 1. As shown, a reference signal 105 and a test signal 110 are each divided into N short time frames (e.g., 20 ms each). A new representation, such as a frequency representation, is then derived for each of the N time frames associated with the reference signal 105 and each of the N time frames associated with the test signal 110. A difference vector comprising N time frames is then derived by comparing the representation associated with each of the N time frames of the reference signal 105 with the corresponding representation associated with the test signal 110. The comparison might be accomplished by subtracting the corresponding representations on a frame-by-frame basis. For each frame, the differences between the corresponding representations may be summed so that a single distortion metric is derived for each of the N time frames. The N distortion metrics may then be averaged, and the average value can be used as a measure of total signal distortion or speech quality.
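The frame-based comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: a naive discrete Fourier transform magnitude stands in for the "new representation", the per-frame distortion metric is the summed absolute difference between the two spectra, and `frame_len` is expressed in samples (160 samples corresponds to 20 ms at an assumed 8 kHz sampling rate). The helper names `split_frames`, `magnitude_spectrum`, `frame_distortions` and `overall_distortion` are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def split_frames(signal: Sequence[float], frame_len: int) -> List[List[float]]:
    """Divide a signal into N non-overlapping frames; trailing samples are dropped."""
    n_frames = len(signal) // frame_len
    return [list(signal[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)]


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Naive DFT magnitude (bins 0..n/2), standing in for the per-frame representation."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def frame_distortions(reference: Sequence[float], test: Sequence[float],
                      frame_len: int) -> List[float]:
    """One distortion metric per frame: the summed absolute spectral difference."""
    ref_frames = split_frames(reference, frame_len)
    test_frames = split_frames(test, frame_len)
    metrics = []
    # zip stops at the shorter list, so only frames present in both signals are compared.
    for ref_frame, test_frame in zip(ref_frames, test_frames):
        ref_spec = magnitude_spectrum(ref_frame)
        test_spec = magnitude_spectrum(test_frame)
        metrics.append(sum(abs(a - b) for a, b in zip(ref_spec, test_spec)))
    return metrics


def overall_distortion(reference: Sequence[float], test: Sequence[float],
                       frame_len: int = 160) -> float:
    """Average the N per-frame metrics into a single distortion score (lower is better)."""
    metrics = frame_distortions(reference, test, frame_len)
    if not metrics:
        raise ValueError("signals are shorter than one frame")
    return sum(metrics) / len(metrics)
```

Comparing a signal with itself yields a distortion of zero, whereas comparing it with a copy delayed by only 40 samples (5 ms at 8 kHz) yields a nonzero value even though no coding distortion was introduced, which illustrates the sensitivity to time shifts discussed below.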
A problem with the above-identified speech quality estimation technique is that it is highly sensitive to time shifts (e.g., transmission delays); the greater the time shift, the more unreliable the speech quality estimation. In an attempt to avoid this problem, conventional speech quality estimation algorithms align the reference signal and the test signal before performing the speech quality estimation, as illustrated in FIG. 2. Of course, just as there are a number of conventional approaches for estimating speech quality, there are a number of conventional techniques for aligning a reference signal and a test signal.
One such technique for aligning a reference signal and a test signal utilizes a known, estimated “global” delay factor, as illustrated in FIG. 2. In accordance with this technique, the test signal or the reference signal is shifted in the time domain by an amount equal to the estimated global delay. Thereafter, the two signals may be fed to the speech quality estimation algorithm. Another well-known technique for aligning a reference signal and a test signal involves iteratively shifting one signal relative to the other in the time domain until a cross-correlation measurement, or another similar metric, is maximized. Still another technique involves transmitting, along with the reference signal, information that identifies one or more portions of the signal, for example, by inserting sinusoidal signals or chirps into the reference signal. The corresponding portions of the test signal can then be more easily recognized and aligned with the corresponding portions of the reference signal.
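The cross-correlation alignment technique can be sketched as an exhaustive search over candidate lags. The sketch assumes a single, constant (global) delay bounded by a hypothetical `max_lag` parameter in samples; a positive lag means the test signal lags the reference. The function names are illustrative only.

```python
from typing import List, Sequence, Tuple


def cross_correlation(reference: Sequence[float], test: Sequence[float], lag: int) -> float:
    """Sum of reference[n] * test[n + lag] over the samples where both are defined."""
    total = 0.0
    for n, r in enumerate(reference):
        m = n + lag
        if 0 <= m < len(test):
            total += r * test[m]
    return total


def estimate_global_delay(reference: Sequence[float], test: Sequence[float],
                          max_lag: int) -> int:
    """Lag in [-max_lag, max_lag] (samples) that maximizes the cross-correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: cross_correlation(reference, test, lag))


def align(reference: Sequence[float], test: Sequence[float],
          max_lag: int) -> Tuple[List[float], List[float], int]:
    """Shift the test signal by the single estimated delay and trim both to equal length."""
    delay = estimate_global_delay(reference, test, max_lag)
    if delay >= 0:
        shifted = list(test[delay:])             # test lags: drop its leading samples
    else:
        shifted = [0.0] * (-delay) + list(test)  # test leads: pad its start with silence
    n = min(len(reference), len(shifted))
    return list(reference[:n]), shifted[:n], delay
```

Because one lag is applied to the entire signal, this approach inherits exactly the fixed-delay assumption that becomes problematic in packet switched networks.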
Each of the above-identified techniques for aligning a reference signal and a test signal, however, assumes that the delay introduced by the various components that make up the transmission chain is fixed, or changes slowly enough over time that periodic resynchronization is possible. In other words, it is assumed that a constant time shift exists between the reference signal and the test signal. While this assumption may hold for circuit switched networks, transmission delays are rarely fixed or constant in packet switched networks, for example, Internet Protocol (IP) based networks. In virtually all packet switched network scenarios, transmission delays vary with traffic load (i.e., the level of congestion in the network). Since traffic load generally changes continuously, the transmission delay experienced by a single speech signal traversing the network may vary over the duration of that signal. If these variable transmission delays go undetected, the reference signal and the test signal cannot be properly aligned, and the speech quality estimation algorithm cannot perform an accurate speech quality estimation. Furthermore, the use of inexpensive personal computer systems as communications devices may also cause a speech signal to experience variable delays.
The present invention involves a speech quality estimation technique that permits the use of an arbitrary speech quality estimation algorithm. In general, the present invention analyzes the reference signal and the test signal, and based on this analysis, identifies delay variations and/or discontinuities in the test signal, if any. These portions of the test signal are then removed so that the reference signal and the test signal are similarly scaled with respect to time. The reference signal and the test signal are then forwarded to a standard speech quality estimation algorithm. The resulting speech quality estimation is then adjusted based on an analysis of the portions of the test signal that were previously removed.
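The removal step can be illustrated with a short sketch. It assumes, hypothetically, that the analysis has already located each delay-induced discontinuity as a `(start, length)` segment measured in test-signal samples, for example silence inserted by a receive buffer; how those segments are found is not shown here. `removed_fraction` is a hypothetical summary of the removed material that a later adjustment step could draw on.

```python
from typing import List, Sequence, Tuple

# Hypothetical segment format: (start sample, length in samples) within the test signal.
Segment = Tuple[int, int]


def remove_segments(test: Sequence[float], segments: Sequence[Segment]) -> List[float]:
    """Return a copy of the test signal with every identified segment cut out."""
    kept: List[float] = []
    cursor = 0
    for start, length in sorted(segments):
        if start < cursor:
            raise ValueError("segments must not overlap")
        kept.extend(test[cursor:start])  # keep the samples before this segment
        cursor = start + length          # skip over the segment itself
    kept.extend(test[cursor:])           # keep everything after the last segment
    return kept


def removed_fraction(test: Sequence[float], segments: Sequence[Segment]) -> float:
    """Share of the test signal that was removed (a hypothetical input to later adjustment)."""
    return sum(length for _, length in segments) / len(test)
```

After removal, the test signal has the same time scale as the reference signal, so it can be passed unchanged to a standard frame-based estimator.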
Accordingly, it is an object of the present invention to provide a speech quality estimation technique that is capable of assessing speech quality despite the presence of variable transmission delays, including continuous and intermittent variable transmission delays.
It is another object of the present invention to prevent the presence of variable transmission delays from precluding the use of a standard speech quality estimation algorithm.
In accordance with a first aspect of the present invention, the above-identified and other objectives are achieved by a method for estimating speech quality. The method involves identifying portions of a first speech signal that exhibit distortions caused by transmission delays. The identified portions are then removed from the first speech signal, and the first speech signal is compared to a second speech signal. A speech quality estimate is then generated, based on the comparison of the first speech signal and the second speech signal.
In accordance with a second aspect of the present invention, the above-identified and other objectives are achieved through a method of estimating speech quality in a telecommunications network, wherein a first speech signal is transported across a transmission chain to a receiving entity. The method involves aligning, at the receiving entity, each of a number of synchronization points along the first speech signal with a corresponding one of a number of synchronization points along a reference speech signal. A determination is then made as to whether any portions of the first speech signal reflect an intermittent delay variation, based on the alignment of the synchronization points along the first speech signal and the reference speech signal. The level of continuous delay variation exhibited by the first speech signal is then determined, and the first speech signal, or the reference speech signal, is adjusted to account for the level of continuous delay variation exhibited by the first speech signal, as well as for any portions of the first speech signal that reflect an intermittent delay variation. The first speech signal is then compared to the reference speech signal, and, based thereon, speech quality is estimated.
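The sync-point analysis of the second aspect might be sketched as follows. Assumptions: sync point positions are already matched pairwise and measured in samples, an offset change between adjacent sync points larger than a hypothetical `jump_threshold` is treated as an intermittent delay variation, and the continuous delay variation is modeled as a linear drift estimated by least squares over the remaining offsets. `compensate_drift` then resamples the test signal by linear interpolation to remove that drift. The names `DelayAnalysis`, `analyze_sync_points` and `compensate_drift` are illustrative and are not taken from the invention.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class DelayAnalysis:
    jumps: List[Tuple[int, int]]  # (reference sync point after the jump, jump size in samples)
    drift: float                  # continuous delay change, in samples per reference sample


def analyze_sync_points(ref_points: Sequence[int], test_points: Sequence[int],
                        jump_threshold: int) -> DelayAnalysis:
    """Split the delay pattern at matched sync points into jumps and a linear drift.

    jump_threshold should exceed the drift accumulated between adjacent sync points;
    otherwise slow drift is misclassified as an intermittent delay variation.
    """
    offsets = [t - r for r, t in zip(ref_points, test_points)]
    jumps: List[Tuple[int, int]] = []
    corrected = [offsets[0]]
    accumulated = 0
    for i in range(1, len(offsets)):
        step = offsets[i] - offsets[i - 1]
        if abs(step) > jump_threshold:
            jumps.append((ref_points[i], step))
            accumulated += step  # remove this jump from all later offsets
        corrected.append(offsets[i] - accumulated)
    # Least-squares slope of the jump-free offsets against reference position.
    n = len(corrected)
    mean_x = sum(ref_points[:n]) / n
    mean_y = sum(corrected) / n
    sxx = sum((x - mean_x) ** 2 for x in ref_points[:n])
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ref_points, corrected))
    return DelayAnalysis(jumps, sxy / sxx if sxx else 0.0)


def compensate_drift(test: Sequence[float], drift: float) -> List[float]:
    """Linearly resample the test signal so that a delay growing at `drift` is removed."""
    rate = 1.0 + drift  # test samples consumed per output sample
    if rate <= 0:
        raise ValueError("drift must be greater than -1")
    out_len = int((len(test) - 1) / rate) + 1
    out = []
    for k in range(out_len):
        pos = k * rate
        i = int(pos)
        frac = pos - i
        nxt = test[i + 1] if i + 1 < len(test) else test[i]
        out.append(test[i] * (1.0 - frac) + nxt * frac)
    return out
```

A `DelayAnalysis` with an empty `jumps` list and a drift near zero corresponds to the constant-delay case, in which conventional global alignment would suffice.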
In accordance with a third aspect of the present invention, the above-identified and other objectives are achieved through a method of estimating speech quality in a packet switched telecommunications network, where speech signals are transported across a transmission chain to a receiving entity. The method involves aligning each of a number of sync point segments along a first speech signal with a corresponding sync pulse segment along a reference speech signal, where the first speech signal was transported across the transmission chain to the receiving entity, and where the reference signal is identical to the first speech signal prior to the first speech signal having been transported across the transmission chain. After aligning the sync point segments along the first speech signal and the sync pulse segments along the reference speech signal, an intermittent delay variation between adjacent sync point segments along the first speech signal, assuming one exists, is identified. Next, the location and size of any identified intermittent delay variation along the first speech signal are determined, as is any level of continuous delay variation exhibited by the first speech signal. The first speech signal or the reference speech signal is then adjusted to account for the presence of any intermittent delay variations and the level of continuous delay variation along the first speech signal. The first speech signal is then compared to the reference signal, and speech quality is estimated based on the comparison of the first speech signal and the reference signal. Finally, the estimated speech quality is adjusted to achieve a perceived speech quality, where the adjustment of the estimated speech quality is based on the intermittent delay variations, if any, and the level of continuous delay variation.
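The final step of the third aspect, adjusting the estimated quality to obtain a perceived quality, could take many forms, and the description does not specify one. The sketch below assumes a MOS-like score on a 1 to 5 scale (higher is better) and applies a purely hypothetical linear penalty: a fixed cost per intermittent delay variation, a cost proportional to each variation's size, and a cost proportional to the magnitude of the continuous drift. All weights are placeholder values that would need calibration against listening tests.

```python
from typing import Sequence, Tuple


def adjust_quality(estimate: float,
                   jumps: Sequence[Tuple[int, int]],
                   drift: float,
                   jump_penalty: float = 0.1,    # hypothetical cost per intermittent variation
                   size_penalty: float = 0.001,  # hypothetical cost per sample of jump size
                   drift_penalty: float = 50.0,  # hypothetical cost per unit of |drift|
                   floor: float = 1.0,
                   ceiling: float = 5.0) -> float:
    """Lower an objective quality estimate to reflect the delay variations that were removed.

    `jumps` holds (position, size) pairs in samples; `drift` is the continuous delay
    variation in samples per sample. The result is clamped to [floor, ceiling].
    """
    penalty = sum(jump_penalty + size_penalty * abs(size) for _, size in jumps)
    penalty += drift_penalty * abs(drift)
    return max(floor, min(ceiling, estimate - penalty))
```

Keeping the penalty separate from the frame-based comparison means the underlying estimator can remain a standard, unmodified algorithm, which matches the stated goal of not precluding its use.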