1. Technical Field
This invention relates to a method of and an apparatus for speech recognition which is robust to missing speech data. It is particularly useful in distributed speech recognition in which data is transmitted via a packet switched network.
2. Description of Related Art
Recently there has been an enormous increase in the use of mobile devices such as mobile phones and personal digital assistants. It is desirable to make the human-to-device interface as natural and easy to use as possible. Speech recognition is one solution which increases naturalness, and overcomes the difficulties in using the very small keyboards found on many mobile devices. A Personal Computer (PC) usually provides sufficient processing power to operate a speech recogniser. However, on mobile devices processing power is a limiting factor. One solution is to use distributed speech recognition (DSR). DSR makes use of remote speech recognisers which are accessed by a device across a transmission network. Speech data from the device is transmitted across the network to the remote speech recogniser, and the remote speech recogniser processes the speech to provide a recognition result (or set of results) which is then transmitted back to the device.
There are basically two types of network across which such information can be transmitted, namely connection-orientated networks and connectionless networks. The connection-orientated network is essentially the telephony service which has evolved over the last 100 years for the switching and transmission of voice data. A connectionless network is packet-based and its main functionality is the routing and switching of data packets from one location to another.
When a call is made on a connection-orientated network a reservation is made to ensure that sufficient network resources are available to sustain the call. This may be the allocation of a physical connection or of time slots in a pulse code modulation (PCM) system. If sufficient resources are not available then the call is refused, typically accompanied by an engaged signal.
The connectionless network is very much aimed at the routing and switching of data packets and is designed to handle the high burstiness of this traffic efficiently. Packets comprise two parts: a header and a payload. The header contains information regarding the source and destination address while the payload contains the actual data which needs to be sent.
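By way of illustration only, the header and payload split can be sketched as follows. The field layout (4-byte source address, 4-byte destination address, 2-byte payload length) and the names `Packet`, `encode` and `decode` are assumptions made for this example and do not correspond to any particular network protocol.

```python
import struct
from dataclasses import dataclass

# Hypothetical header layout: source address, destination address, payload length.
# "!" selects network byte order with no padding, so the header is 10 bytes.
_HEADER = struct.Struct("!4s4sH")


@dataclass
class Packet:
    source: bytes       # 4-byte source address
    destination: bytes  # 4-byte destination address
    payload: bytes      # the actual data to be sent

    def encode(self) -> bytes:
        """Serialise the header followed by the payload."""
        header = _HEADER.pack(self.source, self.destination, len(self.payload))
        return header + self.payload

    @classmethod
    def decode(cls, data: bytes) -> "Packet":
        """Split received bytes back into header fields and payload."""
        source, destination, length = _HEADER.unpack_from(data)
        start = _HEADER.size
        return cls(source, destination, data[start:start + length])
```

A router in such a scheme would need to read only the header to forward the packet; the payload is carried unchanged.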
For transmitting real-time data such as speech, the essential difference between the two networks is that the connection-orientated network reserves sufficient capacity, or bandwidth, to maintain a connection throughout the call. With a connectionless network sufficient bandwidth is not guaranteed which means that the network may produce delays or missing packets which interrupt the data transmission. Therefore the connection-orientated network is much better suited to delivering real-time data. Voice has therefore traditionally been transmitted using connection-orientated networks. However, because of the enormous growth in data networks, the technique of Voice over Internet Protocol (VoIP) has been developed to allow the real-time transmission of voice signals across connectionless networks.
In a connectionless network the packets containing the speech can be routed across a wide variety of paths depending on the network traffic. Indeed, it may be that successive packets are routed around the network on different paths. As a result it is possible that some packets arrive out of sequence or may never even arrive. This is clearly undesirable in a DSR system as it will introduce recognition errors. One approach to dealing with this problem of missing packets is to use protocols designed specifically for real-time data which ensure that all the data arrives with minimal delay.
The traditional connectionless network is termed best-effort. This means that packets from a source are sent to a destination with no guarantee of a timely delivery. For applications such as file transfer which require a guarantee of delivery, Transmission Control Protocol (TCP) is able to trade packet delay for guaranteed reception. In the event of lost packets TCP allows for the destination to request the retransmission of those lost packets. However, for real-time data it is important to minimise transmission delays. It is therefore impracticable to use TCP and wait for the retransmission of lost packets. A better approach is to use User Datagram Protocol (UDP) as the protocol for sending the packets. This has a short duration buffer which allows for slight delays in packet arrival, after which UDP assumes the packet is not going to arrive. No facility for the retransmission of lost packets is available. This has the advantage that delays are minimised, but at the expense of possibly losing some of the speech signal when network traffic is high and packet loss is probable.
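The effect of a short duration buffer can be illustrated with a minimal sketch. It is an assumption of this example that the buffering is performed by the receiving application, that each packet carries a sequence number, and that each packet holds one 20 ms frame; the function name `playout` and the parameters `frame_ms` and `buffer_ms` are hypothetical. A packet which arrives after its playout deadline is treated as lost, and no retransmission is requested.

```python
from typing import Dict, List, Optional, Tuple


def playout(
    packets: List[Tuple[int, float, bytes]],
    n_frames: int,
    frame_ms: float = 20.0,
    buffer_ms: float = 60.0,
) -> List[Optional[bytes]]:
    """Return the frames in playout order, with None marking lost frames.

    packets: (sequence number, arrival time in ms, payload), in arrival order.
    Arrival times are measured from the moment frame 0 was sent, so frame n
    is due for playout at n * frame_ms + buffer_ms.
    """
    on_time: Dict[int, bytes] = {}
    for seq, arrival_ms, payload in packets:
        deadline = seq * frame_ms + buffer_ms
        # Late packets are discarded rather than waited for.
        if 0 <= seq < n_frames and arrival_ms <= deadline:
            on_time[seq] = payload
    # Indexing by sequence number also restores out-of-sequence packets
    # to their correct order.
    return [on_time.get(seq) for seq in range(n_frames)]
```

For example, with the default settings a packet for frame 3 which arrives at 200 ms misses its 120 ms deadline and is reported as missing, whereas frames 1 and 2 which arrive in the wrong order but within their deadlines are both recovered.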
Protocols designed specifically for real-time data transmission include Resource Reservation Protocol (RSVP). This is a signalling protocol which reserves network resources at the start of a call to ensure that a direct connection to the destination is available throughout. In effect it makes a connection-orientated path from a connectionless network. In order for this to function all the routers in the network from the source to destination must be RSVP enabled. As RSVP is a relatively new protocol not all routers are equipped with this facility.
Another protocol designed specifically for real-time data transmission is DiffServ. This makes use of a byte of data in the packet header to specify a Type of Service (ToS)—i.e. how much priority should be given to the immediate routing of that packet through the network. Clearly some data will have very high priority such as network management and system commands. Lower priority will be given to file transfer and email where immediate delivery is not too important. Depending on the emphasis given to the network, high priority can be given to speech packets to assist real-time use. Again, this protocol is only in development and not available generally.
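As a hedged illustration of how a sending device might mark its speech packets, the sketch below sets the Type of Service byte on a UDP socket. It assumes the Expedited Forwarding code point (decimal 46) is the marking the network honours for speech, which is a deployment choice and not something stated above; the helper names `dscp_to_tos` and `open_marked_udp_socket` are hypothetical. The DiffServ code point occupies the upper six bits of the byte, hence the shift by two.

```python
import socket

# Assumed marking for speech packets: Expedited Forwarding code point.
DSCP_EXPEDITED_FORWARDING = 46


def dscp_to_tos(dscp: int) -> int:
    """Place a 6-bit DiffServ code point in the upper bits of the ToS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    return dscp << 2


def open_marked_udp_socket(dscp: int = DSCP_EXPEDITED_FORWARDING) -> socket.socket:
    """Open a UDP socket whose outgoing packets carry the given marking."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # IP_TOS is not exposed on every platform, so the marking is best-effort.
    if hasattr(socket, "IP_TOS"):
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

Whether the marking has any effect depends on the routers along the path; a router which does not support the scheme will simply ignore the byte.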
The increase of connectionless voice networks, coupled with the increase in automation of call centres, means that the ability to perform robust speech recognition over a connectionless network is becoming more important.
An alternative approach to ensuring that all packets containing the speech signal successfully reach the speech recogniser is to make the recogniser itself robust to missing packets. When the packet loss is low (<5%) the drop in recognition performance is not too significant. However, as packet loss increases—or occurs in bursts—the effect is more detrimental. Therefore, a speech recogniser is required which is able to tolerate this loss of speech.
Known signal processing techniques which deal with missing packets range from very simple to complex—a good review is made in C. Perkins, O. Hodson and V. Hardman, “A survey of packet loss recovery techniques for streaming audio”, IEEE Network Magazine, Vol. 12, No. 5, pp. 40–48, October 1998. Simple techniques include splicing which merely joins the speech signal together either side of the gap. Silence and noise substitution replace the missing frames of speech with either silence or noise. Repetition replaces the lost frames of speech with copies of the speech which arrived before the gap.
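The three simple techniques can be sketched as follows. This is a minimal illustration and not the implementation of any cited work: it assumes the received speech is held as a list of fixed-length frames of samples, with `None` marking a frame lost in transit, and it uses silence substitution only (noise substitution would fill the gap with random samples instead). The function names are hypothetical.

```python
from typing import List, Optional, Sequence

Frame = List[float]


def splice(frames: Sequence[Optional[Frame]]) -> List[Frame]:
    """Splicing: join the speech either side of each gap, shortening the signal."""
    return [f for f in frames if f is not None]


def silence_substitution(
    frames: Sequence[Optional[Frame]], frame_len: int
) -> List[Frame]:
    """Silence substitution: replace each missing frame with zero samples."""
    return [f if f is not None else [0.0] * frame_len for f in frames]


def repetition(frames: Sequence[Optional[Frame]], frame_len: int) -> List[Frame]:
    """Repetition: replace each missing frame with the last frame received."""
    out: List[Frame] = []
    last: Frame = [0.0] * frame_len  # silence if the gap precedes any speech
    for f in frames:
        if f is not None:
            last = f
        out.append(list(last))
    return out
```

Splicing alters the timing of the signal because the output is shorter than the input, whereas silence substitution and repetition preserve the original number of frames.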
More sophisticated techniques attempt to estimate the missing parts of the signal from those parts which have been correctly received. These include waveform substitution which uses the pitch on either side of the gap to estimate the missing speech. Time scale modification stretches the audio signal either side of the gap to fill in the missing speech. Regeneration-based repair uses parameters of the codec to determine the required fill-in speech. All these techniques attempt to reconstruct the time-domain speech signal.