1. Field of the Invention
This invention relates to the classification of data which can be used to train a trainable process. It is of application to the assessment of signals carried by a telecommunications system, for example to assess the condition of telecommunications systems whilst in use. Embodiments will be described of application to audio signals carrying speech, and to video signals.
2. Related Art
Signals carried over telecommunications links can undergo considerable transformations, such as digitisation, data compression, data reduction, amplification, and so on. All of these processes can distort the signals. For example, in digitising a waveform whose amplitude is greater than the maximum digitisation value, the peaks of the waveform will be converted to a flat-topped form (a process known as peak clipping). This adds unwanted harmonics to the signal. Distortions can also be caused by electromagnetic interference from external sources.
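The effect of peak clipping can be illustrated with a minimal sketch. It is not taken from the specification: the window length, the sine amplitude, the full-scale limit of 1.0 and the function names `clip` and `dft_magnitudes` are all assumptions chosen for illustration. It clips a pure sine wave and compares the energy at the third harmonic before and after clipping.

```python
import cmath
import math

def clip(samples, limit):
    """Hard-limit each sample to the range [-limit, +limit]."""
    return [max(-limit, min(limit, s)) for s in samples]

def dft_magnitudes(samples):
    """Normalised DFT magnitudes for bins 0..N/2 (naive O(N^2) transform)."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        mags.append(abs(acc) / n)
    return mags

N = 64        # samples in the analysis window (assumed)
CYCLES = 4    # the sine completes exactly 4 cycles, so it occupies bin 4
sine = [1.5 * math.sin(2 * math.pi * CYCLES * i / N) for i in range(N)]
clipped = clip(sine, 1.0)   # digitiser full scale assumed to be 1.0

clean_spectrum = dft_magnitudes(sine)
clipped_spectrum = dft_magnitudes(clipped)

third_harmonic = 3 * CYCLES
print("3rd harmonic, unclipped:", clean_spectrum[third_harmonic])
print("3rd harmonic, clipped:  ", clipped_spectrum[third_harmonic])
```

The unclipped sine has essentially no energy at the third harmonic, whereas the flat-topped version does, which is the unwanted harmonic content referred to above.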
Many of the distortions introduced by the processes described above are non-linear, so that a simple test signal may not be distorted in the same way as a complex waveform such as speech, or at all. For a telecommunications link carrying data it is possible to test the link using all possible data characters; e.g. the two characters 1 and 0 for a binary link, the twelve tone-pairs used in DTMF (dual tone multi-frequency) systems, or the range of "constellation points" used in a QAM (Quadrature Amplitude Modulation) system. However an analogue signal does not consist of a limited number of well-defined signal elements, but is a continuously varying signal. For example, a speech signal's elements vary according not only to the content of the speech (and the language used) but also the physiological and psychological characteristics of the individual talker, which affect characteristics such as pitch, volume, characteristic vowel sounds etc.
It is known to test telecommunications equipment by running test sequences using samples of the type of signal to be carried. Comparison between the test sequence as modified by the equipment under test and the original test sequence can be used to identify distortion introduced by the equipment under test. However, these arrangements require the use of a pre-arranged test sequence, which means they cannot be used on live telecommunications links (that is, links currently in use), because the test sequence would interfere with the traffic being carried and be perceptible to the users, and also because the live traffic itself (whose content cannot be predetermined) would be detected by the test equipment as distortion of the test signal.
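As a rough illustration of such a comparison, the sketch below computes a signal-to-distortion ratio between a known test sequence and the same sequence after passing through the equipment. The function name `signal_to_distortion_db` and the sample values are hypothetical, and the sequences are assumed to be already time-aligned; practical test systems use more elaborate measures.

```python
import math

def signal_to_distortion_db(reference, degraded):
    """Ratio of reference energy to error energy, in dB.

    Both sequences are assumed to be time-aligned and of equal length.
    """
    if len(reference) != len(degraded):
        raise ValueError("sequences must be equal in length")
    signal_energy = sum(r * r for r in reference)
    if signal_energy == 0:
        raise ValueError("reference sequence has no energy")
    error_energy = sum((d - r) ** 2 for r, d in zip(reference, degraded))
    if error_energy == 0:
        return math.inf   # no measurable distortion
    return 10 * math.log10(signal_energy / error_energy)

# Hypothetical test sequence and a version clipped at +/-0.5 by the equipment.
reference = [0.0, 1.0, 0.0, -1.0] * 4
degraded = [max(-0.5, min(0.5, s)) for s in reference]
sdr = signal_to_distortion_db(reference, degraded)
print(sdr)
```

The calculation depends on the reference sequence being available at the measurement point, which is exactly what is missing when the link is carrying live traffic.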
In order to carry out tests on equipment in use, without interfering with the signals being carried by the equipment (so-called non-intrusive testing), it is desirable to carry out the tests using the live signals themselves as the test signals. However, a problem with using a live signal as the test signal is that there is no instantaneous way of obtaining, at the point of measurement, a sample of the original signal. Any means by which the original signal might be transmitted to the measurement location would itself be subject to distortions similar to those affecting the link under test.
The present Applicant's co-pending International Patent applications WO96/06495 and WO96/06496 (both published on Feb. 29th 1996) propose two possible solutions to this problem. WO96/06495 describes the analysis of certain characteristics of speech which are talker-independent in order to determine how the signal has been modified by the telecommunications link. It also describes the analysis of certain characteristics of speech which vary in relation to other characteristics, not themselves directly measurable, in a way which is consistent between individual talkers, and which may therefore be used to derive information about these other characteristics. For example, the spectral content of an unvoiced fricative varies with volume (amplitude), but in a manner independent of the individual talker. The spectral content can thus be used to estimate the original signal amplitude, which can be compared with the received signal amplitude to estimate the attenuation between the talker and the measurement point.
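The final step of that example, comparing an estimated original amplitude with the received amplitude, can be sketched as below. The mapping from spectral content to original amplitude is not reproduced here; the function name `attenuation_db` and the numerical values are assumptions for illustration only.

```python
import math

def attenuation_db(estimated_original_amplitude, received_amplitude):
    """Attenuation between talker and measurement point, in dB.

    Positive values mean the received signal is weaker than the estimate
    of the original.
    """
    if estimated_original_amplitude <= 0 or received_amplitude <= 0:
        raise ValueError("amplitudes must be positive")
    return 20 * math.log10(estimated_original_amplitude / received_amplitude)

# Hypothetical values: the original amplitude would come from the
# talker-independent spectral analysis, the received one from measurement.
loss = attenuation_db(1.0, 0.5)
print(loss)
```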
In WO96/06496, the content of a received signal is analysed by a speech recogniser and the results of this analysis are processed by a speech synthesiser to regenerate a speech signal having no distortions. The signal is normalised in pitch and duration to generate an estimate of the original speech signal which can be compared with the received speech signal to identify any distortions or interference, e.g. using perceptual analysis techniques as described in International Patent Applications WO94/00922 and WO95/15035.
Typically, speech transmission over a limited bandwidth employs data reduction, e.g. linear predictive codecs (LPCs). Such codecs are based on an approximation to the human vocal tract and represent segments of speech waveform as the parameters required to excite equivalent behaviour in a vocal tract model.
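One common way of obtaining such parameters is linear prediction by the autocorrelation method with the Levinson-Durbin recursion. The sketch below is offered only as an illustration of that general technique, not as the codec used in any particular system; the function names, the predictor order and the 160-sample frame (20 ms at an assumed 8 kHz sampling rate) are assumptions.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation of the frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion.

    Returns (a, err), where x[n] is predicted as sum(a[k-1] * x[n-k])
    for k = 1..order, and err is the remaining prediction error energy.
    """
    r = autocorrelation(frame, order)
    if r[0] == 0:
        raise ValueError("frame has no energy")
    a = [0.0] * (order + 1)   # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:
            break             # signal already perfectly predicted
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err         # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err

# Synthetic frame: a decaying exponential, which a first-order predictor
# with coefficient close to 0.9 should describe almost exactly.
frame = [0.9 ** n for n in range(160)]
coeffs, residual = lpc_coefficients(frame, order=2)
print(coeffs, residual)
```

The frame is thereby reduced to a handful of coefficients plus a residual energy, which is the kind of parametric description referred to above.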
In the Applicant's International Patent Specification WO97/05730, there is disclosed a method and apparatus for assessing the quality of a signal carrying speech, in which the signal is analysed according to a spectral representation model (preferably an imperfect vocal tract model, although auditory models may be used instead) to generate output parameters, the output parameters are classified according to a predetermined network definition function, and an output classification is generated. The classifications can be generated according to a network definition function which is derived in a preliminary step from data for which the output value is known. Alternatively, it could be derived according to predetermined rules based on characteristics known to occur under certain conditions in the system to be tested.
The term "auditory model" in this context means a model whose response to a stimulus is approximately the same as the response of the human auditory system (i.e. the ear-brain combination). It is a particular category of the more general term "perceptual" model; that is, a model whose response to a stimulus is approximately the same as the response of the human sensory system (i.e. eye-brain, ear-brain, etc.).
The term "imperfect vocal tract model" in this context means a vocal tract model which is not "ideal" but is also capable of generating coefficients relating to auditory spectral elements that the human vocal tract is incapable of producing. In particular it means a model that can parametrically represent both the speech and the distortion signal elements, which is not the normal goal for vocal tract model design. Speech samples known to be ill-conditioned or well-conditioned (i.e. respectively including or not including such distortion elements) are analysed by the vocal tract model, and the coefficients generated can then be identified as relating to well-conditioned or ill-conditioned signals, for example by a trainable process such as a neural network. In this way classification data can be generated for vocal tract parameters associated with each type of signal (any parameters which are associated with both, and are therefore unreliable indicators, can be disregarded in generating the classification data), so that when an unknown signal is subsequently processed, an output can be generated using the previously generated classification data associated with those parameters which relate to the unknown signal.
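The generation and use of classification data can be sketched in a much simplified form. The example below substitutes a plain lookup table for the trainable process (it is not a neural network), and assumes that each segment's parameters have been quantised to a tuple; the function names, label strings and parameter values are all hypothetical.

```python
def build_classification_data(labelled_parameters):
    """Build a lookup from parameter tuple to label ("well" or "ill").

    Parameter tuples observed with both labels are unreliable indicators
    and are left out of the classification data.
    """
    seen = {}
    for params, label in labelled_parameters:
        seen.setdefault(params, set()).add(label)
    return {params: next(iter(labels))
            for params, labels in seen.items()
            if len(labels) == 1}

def classify(classification_data, params):
    """Return the stored label, or None if the parameters are unknown or
    were disregarded as ambiguous."""
    return classification_data.get(params)

# Hypothetical training data: quantised vocal tract parameters with labels.
training = [
    ((1, 2), "well"),
    ((3, 4), "ill"),
    ((5, 6), "well"),
    ((5, 6), "ill"),    # seen under both conditions, so disregarded
]
table = build_classification_data(training)
print(classify(table, (3, 4)), classify(table, (5, 6)))
```

A trained network would generalise to parameter values never seen in training, which this table cannot do; the sketch only shows how labelled parameters become stored classification data and how ambiguous parameters are set aside.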
Sequences of parameters, as well as individual parameters, may also be used to characterise a signal. Data compression techniques may be used to store the parameters recorded.
The apparatus of the aforementioned WO97/05730 comprises training means for generating the stored set of classification data, the training means comprising first input means for supplying a sample of speech to the modelling means; second input means for supplying to the training means known output information (referred to hereinafter as "labels") relating to the speech sample; means for generating classification data from the modelling means based on the labels; and storage means for storing classification data generated by the modelling means.
The speech segments used in the training sample must therefore each be labelled as well-conditioned or ill-conditioned. This is a major undertaking, because a typical sample comprises several hours of speech, and many such samples are required in order to train the system to respond correctly to a range of talkers, conditions, and other variables. The duration of an individual segment is typically 20 milliseconds, so in all several million segments must be labelled. Moreover, a number of human analysts would be needed to classify each sample in order to obtain a statistically valid result, because of individual variations in perception, concentration, and other factors. Furthermore, it is not possible for a human observer to identify accurately whether individual segments of such short duration are well-conditioned or ill-conditioned.
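The scale of the labelling task follows from simple arithmetic, sketched below. Only the 20 millisecond segment duration comes from the text; the figures of three hours per sample and ten samples are assumed purely for illustration.

```python
SEGMENT_MS = 20   # segment duration stated in the text

def segments_in(hours):
    """Number of 20 ms segments in a whole number of hours of speech."""
    return hours * 3600 * 1000 // SEGMENT_MS

per_hour = segments_in(1)
# Assumed for illustration: 10 training samples of 3 hours each.
total = 10 * segments_in(3)
print(per_hour, total)
```

One hour of speech alone yields 180,000 segments, so a training set of a few tens of hours already runs to several million labels.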