In recent years, progress has been made in devising automatic speech recognition (ASR) systems which receive input data (generated by a microphone) encoding speech spoken by a speaker (here referred to as a “test speaker”) and from it recognise phonemes spoken by the test speaker. A phoneme is a set of one or more “phones”, which are individual units of sound. Typically, the input data is initially processed to generate feature data indicating whether the input data has certain input features, and the feature data is passed to a system which uses it to recognise the phones. The phones may be recognised as individual phones (“mono-phones”), as pairs of adjacent phones (“diphones”), or as sequences of three phones (“triphones”).
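As a small illustration of the three recognition units named above, the following sketch groups a phone sequence into mono-phones, diphones and triphones, following the definitions given here (runs of one, two or three adjacent phones). The function name `phone_ngrams` and the example phone labels are assumptions made for illustration only; they are not taken from any particular ASR system.

```python
def phone_ngrams(phones, n):
    """Return every run of n adjacent phones in the sequence, as tuples."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


# Hypothetical phone sequence for the word "speech" (ARPAbet-style labels,
# chosen only as an example).
phones = ["S", "P", "IY", "CH"]

mono_phones = phone_ngrams(phones, 1)  # individual phones
diphones = phone_ngrams(phones, 2)     # pairs of adjacent phones
triphones = phone_ngrams(phones, 3)    # sequences of three phones

print(mono_phones)
print(diphones)
print(triphones)
```

A sequence of four phones therefore yields four mono-phones, three diphones and two triphones.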
Since different individuals speak in different ways, it is desirable for the system which recognises the phones to be adapted to the speech of the test speaker, and for the adaptation to be performed automatically using training data consisting of speech spoken by the test speaker.
Desirably, the volume of training data which the test speaker is required to speak should be minimised. For that reason, conventional ASR systems are trained using data from many other speakers (“training speakers”) for whom training data is available. Since there is a huge amount of speaker variability in the data used for training the system, the performance can be very poor for an unknown test speaker. Speaker adaptation, which either transforms the features of the test speaker to better match the trained model or transforms the model parameters to better match the test speaker, has been found to improve ASR performance.
Many adaptive systems are known. Recently there has been increasing interest in so-called deep neural networks (DNNs). A deep neural network is an artificial neural network with more than one hidden layer between the input and output layers. Each layer is composed of one or more neurons, and each neuron performs a function of its inputs which is defined by a set of network parameters, such as numerical weights. DNNs are typically designed as feedforward networks, although recurrent forms of DNN also exist. In a feedforward network, each neuron in the first layer receives multiple input signals; in each successive layer, each neuron receives the outputs of multiple neurons in the preceding layer.
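The feedforward structure described above can be sketched in a few lines. This is a minimal illustration only, assuming fully connected layers, a sigmoid activation, and small random weights; the layer sizes and the names `layer_forward`, `init_layer` and `dnn_forward` are hypothetical and are not drawn from any specific ASR system.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Each neuron applies a sigmoid to a weighted sum of all its inputs."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Create one layer: a weight row per neuron, plus one bias per neuron."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dnn_forward(features, layers):
    """Pass the features through every layer in order (no feedback loops)."""
    activations = features
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


rng = random.Random(0)
# Assumed sizes: 4 input features, two hidden layers of 5 neurons, 3 outputs.
sizes = [4, 5, 5, 3]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

outputs = dnn_forward([0.1, -0.2, 0.3, 0.0], layers)
print(outputs)
```

The network is "deep" in the sense used here because it has more than one hidden layer; the weights and biases in `layers` are the network parameters that training, and later adaptation, would adjust.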
Speaker adaptive training (SAT) is an approach to speaker adaptation in ASR in which speaker variability is normalised both in training and in recognition. SAT improves acoustic modelling and can be helpful both in DNN-based ASR and in speech synthesis. Speaker adaptation in DNNs is performed either by transforming the input features before training the DNN or by tuning parameters of the DNN using data specific to the test speaker. A wide range of systems have been proposed using both approaches. For approaches that focus on transforming the input features before training the DNN, the primary drawback is that the DNN has to be re-trained once a new feature transformation is applied. For approaches that focus on tuning the network parameters, by contrast, the DNN typically requires more adaptive parameters, so the primary challenge is to tune the network parameters with the limited data available from the test speaker.
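The contrast between the two families of approach can be illustrated with a toy sketch. It is not the method of any particular system: per-speaker mean and variance normalisation stands in for a feature-space transform, and tuning only a single bias of an otherwise frozen linear model stands in for parameter tuning with limited speaker data. All names (`speaker_mean_std`, `normalise`, `adapt_bias`) and all numerical values are assumptions made for illustration.

```python
import math


def speaker_mean_std(frames):
    """Per-dimension mean and standard deviation over one speaker's frames."""
    n, dims = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) or 1.0
        for d in range(dims)
    ]
    return means, stds


def normalise(frames, means, stds):
    """Feature-space adaptation: transform the inputs, leave the model alone."""
    return [[(f[d] - means[d]) / stds[d] for d in range(len(f))] for f in frames]


def adapt_bias(weights, bias, frames, targets, lr=0.1, steps=50):
    """Model-space adaptation: weights stay frozen, only the bias is tuned.

    Uses plain gradient descent on the mean squared error.
    """
    for _ in range(steps):
        grad = 0.0
        for x, t in zip(frames, targets):
            y = sum(w * xi for w, xi in zip(weights, x)) + bias
            grad += 2.0 * (y - t)
        bias -= lr * grad / len(frames)
    return bias


# Hypothetical adaptation data from a test speaker: three 2-dimensional frames.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

means, stds = speaker_mean_std(frames)
normalised = normalise(frames, means, stds)

# Hypothetical frozen weights; the targets are offset by 2.0 from the
# frozen model's output, so the adapted bias should approach 2.0.
weights = [0.5, -0.25]
targets = [2.0, 2.5, 3.0]
adapted_bias = adapt_bias(weights, 0.0, frames, targets)
print(normalised, adapted_bias)
```

In the feature-space case, a model trained on untransformed features would need re-training on the normalised ones; in the model-space case, restricting adaptation to a small number of parameters (here, one bias) is one way of coping with the limited amount of test-speaker data.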