The invention relates to a method and an apparatus for dealing with errors in the transmission of speech.
In order to transmit speech signals via cable-based or wireless networks, it is known for a speech signal to be transmitted in the form of speech signal frames, wherein a receiver, after receiving the speech signal frames, uses them to produce a speech signal to be output. The speech signal frames are preferably transmitted as data in the form of so-called packets via networks, for example a GSM network, a network based on the Internet Protocol, or a network based on the WLAN protocol, in which case a speech signal frame may be lost because of data being transmitted with errors. When data is transmitted in packet-switched form, it is likewise possible for the transmission of a speech signal frame to be delayed for so long that the frame cannot be considered in the course of a continuous output of the speech signal, because the delayed, or else lost, speech signal frame is not available when the speech signal is to be output. If no signal at all is inserted at the appropriate point in the speech signal to be output in place of the speech signal frame which has not been received, the speech signal drops out at that point, which degrades its acoustic quality. For this reason, it is necessary to use a substitute speech signal frame in place of a speech signal frame which has not been received, in order to achieve so-called error concealment.
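The distinction between a frame that is usable and one that is lost or delayed too long can be sketched as follows. This is only an illustrative sketch: the function name `plan_playout`, the frame duration of 20 ms and the playout start offset of 60 ms are assumptions chosen for the example and are not taken from the description.

```python
def plan_playout(arrivals, num_frames, frame_ms=20.0, start_ms=60.0):
    """Decide, per frame, whether it can be played or must be concealed.

    arrivals   -- dict mapping frame index (0-based) to arrival time in ms;
                  a missing key means the frame was lost (assumed convention)
    frame_ms   -- assumed frame duration
    start_ms   -- assumed time at which output of frame 0 begins
    """
    plan = []
    for i in range(num_frames):
        deadline = start_ms + i * frame_ms  # latest usable arrival time
        t = arrivals.get(i)
        if t is not None and t <= deadline:
            plan.append("play")
        else:
            # lost, or delayed beyond its output time: a substitute
            # speech signal frame is needed at this position
            plan.append("conceal")
    return plan


# Frame 1 arrives too late, frame 3 never arrives.
print(plan_playout({0: 10.0, 1: 95.0, 2: 70.0}, 4))
```

In this sketch a late frame and a lost frame are treated identically, which mirrors the statement above that a frame delayed for too long is just as unavailable for output as a lost one.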
The fundamental principle for transmission of a speech signal on the basis of speech signal frames and for production of the speech signal on the basis of these speech signal frames is illustrated in FIG. 1. FIG. 1 shows a speech signal 10 which, for example, comprises three segments in the form of speech signal frames 1, 2, 3. In this case, the total of three segments has been chosen only by way of example. To a person skilled in the art, it is self-evident that the number of speech signal frames 1, 2, 3 need not be three. When the speech signal frames 1, 2, 3 are received after transmission, then the speech signal 10 is output continuously at different times. FIG. 1 shows a time axis 20 along which times 31, 32, 33 are shown, at each of which reception of one speech signal frame 1, 2, 3 is completed. According to the exemplary embodiment, the reception of the first speech signal frame 1 is completed at a first time 31, as a result of which the speech signal 10 can be output, as far as a specific part, at the first time 31. According to the exemplary embodiment, the reception of the second speech signal frame 2 is completed at a second time 32, as a result of which a further part of the speech signal 10 can be output at this second time 32. This also applies to a third time 33, at which the third speech signal frame 3 has been completely received.
FIG. 2 illustrates, according to an exemplary embodiment, the production of a further speech signal 11 which is to be output. In the exemplary embodiment, the further speech signal 11 is assembled such that the received speech signal frames 1, 2, 3 are not adjacent to one another in time, but overlap. According to the exemplary embodiment in FIG. 2, the further speech signal 11 consists of a first segment 111, a second segment 112 and a third segment 113. As can be seen from FIG. 2, the first segment 111 can be determined on the basis of the first speech signal frame 1 and at least a part of the second speech signal frame 2. The second segment 112 can be determined on the basis of the second speech signal frame 2 and at least a part of the third speech signal frame 3. The third segment 113 can be determined on the basis of the third speech signal frame 3 and possibly subsequent further speech signal frames. A first time 41, corresponding to the time at which the first segment 111 of the further speech signal 11 ends, is shown on a second time axis 21 illustrated in FIG. 2. In order to allow the further speech signal 11 to be output at the first time 41 at least until the end of its first segment 111, at least the first speech signal frame 1 and the second speech signal frame 2 must therefore be available. Furthermore, there is a second time 42 on the second time axis 21, which corresponds to the time at which the second segment 112 of the further speech signal 11 ends. In order to allow the further speech signal 11 to be output at least until the end of its second segment 112, the second speech signal frame 2 and the third speech signal frame 3 must therefore be available at the second time 42. The same applies to a third time 43 for the third segment 113 of the further speech signal 11 with respect to the third speech signal frame 3 and possibly subsequent speech signal frames. The speech signal frames 1, 2, 3 shown in FIGS. 1 and 2 preferably have respective indices 11, 12, 13 in order to allow the received speech signal frames to be associated with a time sequence.
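The dependency of each output segment on overlapping frames can be sketched as follows. This is a minimal sketch under the assumption that each segment needs its own frame plus exactly one following frame; the function names and the `overlap` parameter are hypothetical and serve only as illustration.

```python
def frames_needed(segment_number, overlap=1):
    """Frame numbers required to output a given segment.

    Segment k is assumed to need frame k and the `overlap` frames that
    follow it, as in the overlapping arrangement described for FIG. 2.
    """
    return list(range(segment_number, segment_number + overlap + 1))


def missing_frames(received, segment_number, overlap=1):
    """Frame numbers that are required for the segment but not received."""
    have = set(received)
    return [i for i in frames_needed(segment_number, overlap) if i not in have]


# Frames 1 and 3 have arrived, frame 2 has not: both the first and the
# second segment lack frame 2.
print(missing_frames([1, 3], 1))
print(missing_frames([1, 3], 2))
```

The frame numbers play the role of the indices mentioned above: they allow the receiver to place frames in time sequence and to recognise which position is empty.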
FIG. 3 shows the situation in which the second speech signal frame 2 has not been received. If, as shown in FIG. 3, the first speech signal frame 1 has been received by the first time 41, but the second speech signal frame 2 has not, it is not possible to correctly output the further speech signal 11 from FIG. 2 at the first time 41. In addition, although the received third speech signal frame 3 is available for producing the further speech signal to be output at the second time 42, the second speech signal frame 2 is still missing at this second time 42. It is therefore necessary to produce a substitute speech signal frame 100 in place of the speech signal frame 2 which has not been received, and to use this substitute to produce the further speech signal to be output. Appropriate methods for this purpose are already known. The way in which these methods operate is explained in detail below with reference to FIG. 4.
FIG. 4 shows steps in a method, with the aid of which a substitute speech signal frame 100 is produced on the basis of a received speech signal frame 50. For this purpose, the received speech signal frame 50 is first of all passed to a linear prediction analysis process 62, which determines linear prediction coefficients 51 for a linear prediction analysis filter 61. The principle of linear prediction, and the determination of linear prediction coefficients for an analysis filter for linear prediction of a speech signal, modeled as a pulse code, of a received speech signal frame 50, are known. The linear prediction analysis filter 61 filters the speech signal of the received speech signal frame 50, thus resulting in the remaining signal 52, that is to say the prediction residual. This remaining signal 52 is supplied to a decision maker 63, which uses the remaining signal 52 to determine whether the speech signal in the received speech signal frame 50 is a speech signal with voice (voiced) or without voice (unvoiced). The decision maker 63 passes on its decision 53, relating to whether or not the speech signal has voice, to a fundamental frequency determination unit 64. This fundamental frequency determination unit 64 uses the remaining signal 52 and the decision 53 to determine a fundamental frequency 54 of the speech signal. In this case, the fundamental frequency is determined by means of that argument of a normalized autocorrelation function for which the value of the normalized autocorrelation function assumes its maximum.
In this case, the fundamental frequency determination unit 64 considers only those values for the fundamental frequency which are plausible for human speech signals. In the situation where a speech signal without voice is present, which has a noise-like character and therefore does not have a clear fundamental frequency, the fundamental frequency 54 is set to a minimum value, in order to reduce artefacts in the high-frequency range which result from unnatural periodicities in the signal to be determined.
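The determination of the fundamental frequency from the maximum of a normalized autocorrelation function can be sketched as follows. This is only a sketch under stated assumptions: the sampling rate of 8000 Hz, the search range of 60 Hz to 400 Hz for plausible human pitch, and the function names are hypothetical. The result is expressed as a lag in samples (the pitch period), so the minimum fundamental frequency for a signal without voice corresponds to the longest permitted lag.

```python
import math


def normalized_autocorr(x, lag):
    """Normalized autocorrelation of x at the given lag (range -1 to 1)."""
    n = len(x) - lag
    num = sum(x[i] * x[i + lag] for i in range(n))
    e0 = sum(x[i] * x[i] for i in range(n))
    e1 = sum(x[i + lag] * x[i + lag] for i in range(n))
    denom = math.sqrt(e0 * e1)
    return num / denom if denom > 0.0 else 0.0


def estimate_pitch_lag(residual, voiced=True, fs=8000, f_min=60.0, f_max=400.0):
    """Return the pitch period in samples.

    fs, f_min and f_max are assumed values that restrict the search to
    lags which are plausible for human speech.
    """
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(residual) - 1)
    if not voiced:
        # No clear fundamental frequency: use the minimum frequency,
        # that is, the longest permitted period.
        return lag_max
    return max(range(lag_min, lag_max + 1),
               key=lambda lag: normalized_autocorr(residual, lag))


# A 100 Hz tone sampled at 8000 Hz has a period of 80 samples.
tone = [math.sin(2.0 * math.pi * n / 80.0) for n in range(320)]
lag = estimate_pitch_lag(tone)
print(lag, 8000 / lag)
```

Restricting the search range is what prevents the maximum from being found at a multiple of the true period or at an implausibly short lag.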
An estimated remaining signal 55 is determined by means of an estimation unit 65, on the basis of the remaining signal 52 and the fundamental frequency 54. The estimated remaining signal 55 is passed to a linear prediction synthesis filter 66, which uses the previously determined linear prediction coefficients 51 to subject the estimated remaining signal 55 to synthesis filtering, as a result of which the speech signal for the substitute speech signal frame 100 is obtained. In this way, the spectral envelope of the speech signal is extrapolated, while the periodic structure of the signal is maintained at the same time.
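The chain of analysis filtering, estimation of the remaining signal and synthesis filtering can be sketched as follows. This is a minimal sketch and not the specific implementation of the known methods: the autocorrelation method with the Levinson-Durbin recursion, the predictor order of 8, and the periodic repetition of the last pitch period as the estimated remaining signal are assumptions made for illustration, and all function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    return [sum(x[i] * x[i + k] for i in range(len(x) - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a[0..order-1] with x[n] ~ sum a[k] * x[n-1-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # signal fully predicted (or silent): stop early
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def lpc_residual(x, a):
    """Analysis filter: remaining signal e[n] = x[n] - prediction."""
    p = len(a)
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(min(p, n)))
            for n in range(len(x))]


def extend_residual(residual, pitch_lag, length):
    """Assumed estimator: repeat the last pitch period periodically."""
    last_period = residual[-pitch_lag:]
    return [last_period[i % pitch_lag] for i in range(length)]


def lpc_synthesis(excitation, a, history):
    """Synthesis filter, continuing from the samples in `history`."""
    p = len(a)
    buf = [0.0] * p + list(history)  # zero padding if history is short
    out = []
    for e in excitation:
        y = e + sum(a[k] * buf[-1 - k] for k in range(p))
        buf.append(y)
        out.append(y)
    return out


def make_substitute_frame(frame, pitch_lag, order=8):
    a = levinson_durbin(autocorrelation(frame, order), order)
    residual = lpc_residual(frame, a)
    excitation = extend_residual(residual, pitch_lag, len(frame))
    return lpc_synthesis(excitation, a, frame)
```

Because the same coefficients are used for analysis and synthesis, the spectral envelope of the received frame carries over to the substitute frame, while the repeated excitation preserves the periodic structure.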
As shown in FIG. 4, the substitute speech signal frame 100 is produced on the basis of a received speech signal frame 50. In this case, the received speech signal frame 50 may, for example, be the first speech signal frame 1 in FIG. 3. In the event of short-term interference with the reception and transmission of speech signal frames, all that is necessary according to the prior art is to produce a single substitute speech signal frame. However, if the third speech signal frame 3 from FIG. 3 is also not received, then it is necessary to produce a further substitute speech signal frame. In a situation such as this, the fundamental frequency 54 used to produce the further substitute speech signal frame is obtained by analysis of that speech signal frame which was received immediately before the most recently received, first speech signal frame 1 in the time sequence. This results in a variation of the fundamental frequency of the speech signals in the various substitute speech signal frames that are produced, which avoids the undesirable harmonic artefacts that would result if the same speech signal were to be output over an excessively long time period.
For the situation in which a further, third substitute speech signal frame must be produced, the fundamental frequency 54 is once again varied in order to produce the further, third substitute speech signal frame, by obtaining the fundamental frequency 54 on the basis of that speech signal frame which was received two positions before the most recently received, first speech signal frame 1 in the time sequence. In the situation where further substitute speech signal frames must be produced after three substitute speech signal frames have already been determined, the fundamental frequency is not modified any further. Instead of this, all the further substitute speech signal frames are produced by means of that fundamental frequency 54 which was used to produce the third substitute speech signal frame. This fundamental frequency 54 for production of the third substitute speech signal frame is used until the end of the reception interference.
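The rule for choosing the frame from which the fundamental frequency is taken can be sketched as follows. This is only an illustration: the function names and the representation of past fundamental frequencies as a list ordered from oldest to newest are assumptions.

```python
def pitch_source_offset(substitute_number):
    """How many positions before the most recently received frame to look.

    substitute_number -- 1 for the first substitute frame of an
                         interruption, 2 for the second, and so on
    """
    if substitute_number < 1:
        raise ValueError("substitute_number must be at least 1")
    # First: most recent frame; second: one before; third and all
    # later substitute frames: two before.
    return min(substitute_number - 1, 2)


def pitch_for_substitute(pitch_history, substitute_number):
    """Pick the fundamental frequency from the history of received frames.

    pitch_history -- fundamental frequencies of received frames,
                     oldest first (hypothetical data layout)
    """
    offset = min(pitch_source_offset(substitute_number), len(pitch_history) - 1)
    return pitch_history[-1 - offset]


history = [100.0, 110.0, 120.0]  # hypothetical values in Hz
print([pitch_for_substitute(history, n) for n in range(1, 6)])
```

From the fourth substitute frame onwards the offset no longer changes, which corresponds to the fundamental frequency being held until the end of the reception interference.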
Substitute speech signal frames produced in this way are used instead of the speech signal frames which have not been received. A smooth transition between the speech signal frames is preferably used when producing the speech signal 11 to be output.
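One possible form of such a smooth transition is a cross-fade over the overlapping samples of two adjacent frames. The following is a minimal sketch that assumes a linear fade; the description does not specify the shape of the transition, and the function name is hypothetical.

```python
def crossfade(tail, head):
    """Blend the overlapping end of one frame into the start of the next.

    tail -- last samples of the preceding frame
    head -- first samples of the following frame (same length)
    A linear weighting is assumed here purely for illustration.
    """
    n = len(tail)
    if n != len(head):
        raise ValueError("overlap regions must have the same length")
    out = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of the following frame rises
        out.append((1.0 - w) * tail[i] + w * head[i])
    return out


print(crossfade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```

Because the two weights always sum to one, a signal that is identical in both frames passes through the transition unchanged.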