The rapid increase in the usage of speech processing algorithms in multi-media and telecommunications applications raises the need for speech quality evaluations. An accurate and reliable assessment of speech quality is thus becoming important for the satisfaction of the end-user or customer of a deployed speech processing system (e.g., cell phone, speech synthesis system, etc.). Assessment of speech quality can be done using subjective listening tests or using objective quality measures. Subjective evaluation involves comparisons of original and processed speech signals by a group of listeners who are asked to rate the quality of speech along a predetermined scale. Objective evaluation often involves a mathematical comparison of the original and processed speech signals. Many objective measures quantify quality by measuring the numerical “distance” between the original and processed signals. For an objective measure to be considered valid, the objective measure normally needs to correlate well with subjective listening tests.
Subjective listening tests provide perhaps the most reliable method for assessment of speech quality. However, these subjective listening tests can be time consuming and require, in most cases, access to trained listeners. For these reasons, researchers have investigated the possibility of devising objective, rather than subjective, measures of speech quality.
Objective quality assessment models can be classified into signal-based models, parametric models, and protocol-information-based models. Each of these classes is discussed further below.
Signal-based models employ speech signals transmitted or otherwise modified by speech processing systems to estimate quality. Two general types of signal-based models exist. These include full reference models and reference-free models.
A full reference model, also known as an “intrusive” or “double-ended” model, depends on a reference (system input) speech signal and a corresponding degraded (system output) speech signal. This allows the degraded output being scored to be compared to the original input. In this case, specific test calls are set up and measurement speech signals are transmitted across a network which degrades the communicated signal. From a comparison of the output and input signals, a direct quality estimate or quality-relevant network parameters can be obtained. The International Telecommunication Union (ITU) has standardized Perceptual Evaluation of Speech Quality (PESQ, ITU-T Recommendation P.862) as a full reference model for Narrow Band (NB) speech signals. Unfortunately, this approach requires the original uncorrupted signal, which is often not available at a user location during actual call conditions. This makes a full reference approach impractical for many end users who may seek to measure the quality of a speech signal during actual calls, e.g., to assess the impact and/or degradation caused by the communications network to the speech communicated by a call through the network.
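The double-ended comparison described above can be illustrated with a very simple distance measure. The following is a minimal sketch, assuming the reference and degraded signals are already time-aligned sample lists of equal length; it computes a segmental signal-to-noise ratio and is not PESQ or any other standardized model. The function name, the frame length, and the clamping limits are illustrative assumptions.

```python
import math


def segmental_snr_db(reference, degraded, frame_len=160,
                     floor_db=-10.0, ceil_db=35.0):
    """Mean per-frame SNR in dB between two time-aligned signals.

    frame_len, floor_db and ceil_db are illustrative assumptions,
    not values taken from any standard.
    """
    if len(reference) != len(degraded):
        raise ValueError("signals must be time-aligned and of equal length")

    frame_scores = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        deg = degraded[start:start + frame_len]
        signal_energy = sum(x * x for x in ref)
        noise_energy = sum((x - y) ** 2 for x, y in zip(ref, deg))
        if signal_energy == 0.0:
            continue  # a silent reference frame has no defined SNR
        if noise_energy == 0.0:
            snr = ceil_db  # identical frames: cap instead of dividing by zero
        else:
            snr = 10.0 * math.log10(signal_energy / noise_energy)
        # Clamp so a few extreme frames do not dominate the mean.
        frame_scores.append(min(max(snr, floor_db), ceil_db))

    if not frame_scores:
        raise ValueError("no usable frames")
    return sum(frame_scores) / len(frame_scores)


# Synthetic 440 Hz tone sampled at 8 kHz stands in for a reference signal.
reference = [math.sin(2.0 * math.pi * 440.0 * n / 8000.0) for n in range(800)]
# A hypothetical "system under test" that only attenuates the signal by 10%.
degraded = [0.9 * x for x in reference]
score = segmental_snr_db(reference, degraded)
```

Note that the function cannot be called without `reference`, which is exactly the limitation of the full reference approach when the original signal is not available at the measurement point.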
A reference-free model, also known as a “non-intrusive” or “single-ended” model, depends only on the degraded signal and does not require the availability of the original uncorrupted speech signal. The model is considered “single-ended” because it depends on the signal at only one end, e.g., the measurement end. In this type of model, a measurement signal is acquired at a specific point of the network during normal network operation. From this signal, network or conversation parameters relevant to quality or indicative of quality can be measured and/or derived. The ITU has standardized P.563 as a reference-free model.
FIG. 1 is a drawing 100 illustrating the computation of intrusive and non-intrusive models. As illustrated in FIG. 1, the input signal 101, e.g., a speech signal, is supplied to the system to be tested 102, e.g., a telecommunications system. The processed speech signal output by the system 102 is labeled as degraded signal 103 since the output signal is degraded in quality as compared to the original input signal.
For non-intrusive (reference-free) evaluation model based systems such as system 104, only the degraded processed signal is needed to evaluate the signal quality. Thus, as shown in FIG. 1, the degraded signal 103 is supplied as an input signal to the non-intrusive evaluation system 104. As discussed above, from the supplied degraded, e.g., communicated, signal 103, the non-intrusive evaluation system 104 derives network or conversation parameters relevant to quality, which are output as information 105. Intrusive (full reference) evaluation model based systems such as system 106 need both a reference input signal and a corresponding degraded output signal to evaluate the signal quality. Thus, as shown in FIG. 1, both of these signals (101, 103) are supplied as inputs to the intrusive evaluation system 106. Intrusive evaluation system 106 compares the output and input signals (103, 101) and generates a direct quality estimate and/or derives quality-relevant network parameters, which are output as information 107.
Parametric models will now be discussed. Signal-based models use speech signals as input to the quality estimation methods. Thus, to use a signal-based model, at least a prototype implementation or simulation of the transmission channel has to be set up. However, during the network design process, such signals are commonly not available, but the network can be characterized by the technical specifications of its elements. Such technical specifications typically include the delay associated with a particular transmission path, the probability that packets get lost or discarded in Internet-Protocol (IP)-based transmission, and the type of codec and error concealment techniques used. Many of these specifications can be quantified in terms of planning parameters that enable a parametric estimation of speech quality to be performed before the connection goes live. While parametric models allow a network's effect on speech to be estimated or predicted without the need for actual signal measurements, quality estimates based on parametric models may be less accurate than actual signal measurements. The number of parameters used may be limited, and the parameters may not fully represent or predict the effect of the actual network, during real use, on a speech signal communicated through the network.
One of the common parametric models is the E-model, which is used to estimate the quality associated with a speech transmission channel. The limitations of the E-model are discussed below. The E-model is limited to the speech impairments caused by packet loss and delay, and does not take into account impairments due to noise, clipping and codec distortions. In some cases in which an RTP (real-time transport protocol) stream has been terminated at an intermediate node along the call path, e.g., for transcoding, the terminated RTP stream is regenerated. As part of the regeneration process, packets may be sequentially numbered, making previously lost packets undetectable from the packet numbering of the regenerated RTP stream. Thus a node receiving a regenerated stream communicating a speech signal may be unaware from the packet headers that speech has been lost. Depending on the monitoring point along the call path, the packet delay and loss used by the E-model may therefore not be accurate. For example, if the monitoring occurs after regeneration, a packet loss count based on RTP packet header numbers may be lower than the actual number of lost packets. A quality score that does not reflect the user perception of the call quality may be reported in such a case. The E-model assesses the speech quality based on network level metrics, such as a missing packet count based on packet header numbers, and thus the E-model is not aware of the content of the actual speech signal. For example, the E-model accounts for all the packet loss measured at the network level and may not accurately reflect the impact on the user perception of the speech quality of the received signal. The E-model does not take into account the location of the packet loss, which can impact user perception.
For example, a packet loss that occurs during silence will not impact user perception of speech quality; however, the E-model does not account for differences in user perception as a function of whether the packet loss occurred during a silence. In addition, the E-model does not distinguish a packet loss which occurs near the beginning of a call from a packet loss which occurs near the end of a call. From a user perception viewpoint, a packet loss near the end of a call may more negatively impact the user's perception of speech quality than a packet loss near the beginning of a call. The E-model approach reduces the mean opinion score (MOS) based on the number of packets lost, without regard to where in the call the loss occurred.
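The location-blindness described above follows directly from the inputs the E-model accepts. The following is a minimal sketch, assuming the simplified form of the ITU-T G.107 rating in which a base rating is reduced by a delay impairment and an effective equipment impairment, and then mapped to a mean opinion score (MOS) with the G.107 conversion formula. The default values for `base_r`, `ie` and `bpl` are illustrative assumptions; real values depend on the codec and should be taken from the relevant ITU-T tables. The two loss patterns are hypothetical.

```python
def effective_equipment_impairment(loss_percent, ie=0.0, bpl=25.1, burst_r=1.0):
    """Ie-eff from G.107: grows with the packet loss percentage only.

    ie and bpl are codec-dependent; the defaults here are assumptions
    chosen for illustration. burst_r=1.0 corresponds to random loss.
    """
    return ie + (95.0 - ie) * loss_percent / (loss_percent / burst_r + bpl)


def r_to_mos(r):
    """Map an E-model rating R to an estimated MOS (G.107 conversion)."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


def estimate_mos(loss_percent, delay_impairment=0.0, base_r=93.2):
    """Simplified rating: base_r minus delay and loss impairments.

    base_r=93.2 is the commonly quoted default rating; it is treated
    here as an assumption.
    """
    r = base_r - delay_impairment - effective_equipment_impairment(loss_percent)
    return r_to_mos(r)


def loss_percent(lost_flags):
    """Percentage of packets flagged as lost (True) in a call."""
    return 100.0 * sum(lost_flags) / len(lost_flags)


# Two hypothetical 100-packet calls, each losing the same two packets' worth
# of audio: one early in the call (e.g., during silence), one at the very end.
early_loss = [True, True] + [False] * 98
late_loss = [False] * 98 + [True, True]

mos_early = estimate_mos(loss_percent(early_loss))
mos_late = estimate_mos(loss_percent(late_loss))
```

Because only the loss percentage reaches `estimate_mos`, `mos_early` and `mos_late` are necessarily identical, even though a listener might rate the two calls differently.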
Protocol-information-based models will now be discussed. The E-model has also been used for monitoring quality of VoIP (voice over internet protocol), but often does not provide accurate measurements for individual calls. As a consequence, alternative models have been developed for measuring and/or monitoring the quality of VoIP communicated speech for individual calls. Instead of using the voice payload of the transmitted packets, the known protocol information model exploits protocol header information, such as the timestamps and sequence numbers of RTP headers, for delay and packet-loss related information, together with information on the end-point behavior, such as dropped packet statistics or packet loss concealment (PLC) information. The main goal of such models is to enable passive network and/or end-point monitoring with a lightweight parametric approach, while at the same time avoiding privacy concerns that arise when accessing user-related payload information.
Unfortunately, models and/or monitoring techniques which are based solely on protocol information and/or header information may not accurately reflect the quality of a received speech signal and/or the loss of speech information during the process of communicating the speech signal through a network, e.g., due to packet stream regeneration and/or other factors.
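The effect of packet stream regeneration on header-based monitoring can be shown with a small example. The following is a minimal sketch, assuming packets arrive in order without duplication or reordering and that loss is inferred only from gaps in the 16-bit RTP sequence numbers. The function names and the sample sequence numbers are hypothetical.

```python
RTP_SEQ_MODULUS = 65536  # RTP sequence numbers are 16 bits and wrap around


def count_lost_packets(sequence_numbers):
    """Count missing packets from gaps between consecutive sequence numbers.

    Assumes in-order arrival with no reordering (an assumption of this sketch).
    """
    lost = 0
    for prev, cur in zip(sequence_numbers, sequence_numbers[1:]):
        gap = (cur - prev) % RTP_SEQ_MODULUS
        if gap > 0:
            lost += gap - 1  # a gap of 1 means no packet is missing
    return lost


def regenerate_stream(sequence_numbers, first_seq):
    """Model an intermediate node that re-originates the stream.

    Every packet that actually arrived is renumbered consecutively, so the
    gaps left by upstream losses disappear from the headers.
    """
    return [(first_seq + i) % RTP_SEQ_MODULUS
            for i in range(len(sequence_numbers))]


# Sequence numbers seen before the regenerating node: 102, 103 and 106 are lost.
upstream = [100, 101, 104, 105, 107]
# The same surviving packets as seen by a monitor after regeneration.
downstream = regenerate_stream(upstream, first_seq=5000)

lost_before = count_lost_packets(upstream)
lost_after = count_lost_packets(downstream)
```

Here a monitor placed before the regenerating node counts three lost packets, while a monitor placed after it counts none, although the same speech is missing at both points.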
In view of the above, it should be appreciated that there is a need for methods and/or apparatus which take the actual content of a received speech signal into consideration when measuring and/or estimating received speech signal quality, without merely relying on packet header or network level information. While it is desirable to take the actual content of a received signal into consideration in generating a signal quality measurement, it is also desirable that the content of a received signal be used in a manner that does not create an excessive processing burden on the system generating the signal quality metric.
In view of the above, it should be appreciated that there is a need for methods and apparatus which overcome one or more of the limitations of the various known approaches discussed above, which allow for faster analysis and/or evaluation of signal quality, and/or which are more accurate than the known approaches.