The invention relates to a speech signal processing apparatus, comprising detecting means for selectively detecting a sequence of time instants of glottal closure, by determining specific peaks of a time dependent intensity of a speech signal.
Glottal closure, that is, closure of the vocal cords, usually occurs at sharply defined instants in the human speech production process. Knowledge of where such instants occur can be used in many speech processing applications. For example, in speech analysis, processing of the signal is often performed in successive time frames, each in the same fixed temporal relation to a respective instant of glottal closure. In this way, the effect of glottal closure upon the signal is more or less independent of the time frame, and differences between frames will be largely due to the change over time of the parameters of the vocal tract. In another application example, a train of glottal excitation signals is fed through a synthetic filter modelling the vocal tract in order to produce synthetic speech. To produce high quality speech, glottal excitations derived from physical speech are used to generate the glottal excitation signal.
For such applications, it is desirable to identify the instants of glottal closure from physically received human speech signals. An apparatus for finding these instants, or at least instants which stand in a fixed phase relation to them, is known from U.S. Pat. No. 3,940,565. According to this publication, the instant of glottal closure is identified as an instant of maximum amplitude in the signal. To detect this, the received speech signal is fed to a peak detector, and when the resulting peak signal is sufficiently large, this detector triggers a flip-flop to signal glottal closure.
The disadvantage of this method is that glottal closure does not correspond to the largest peak, or even to a single peak, in all speech signals. In voiced signals, there may be several peaks distributed over one period, which may give rise to false detections. There may also be several comparably large peaks surrounding each instant of glottal closure, which gives rise to jitter in the detected instants as the maximum jumps from one peak to another. Moreover, in unvoiced signals no instants of glottal closure are present, but there are many irregularly spaced peaks, which give rise to false detections.
It is an object of the invention to improve the robustness of glottal closure detection without requiring complex processing operations.
In an embodiment, the invention realizes this object in that the apparatus includes
a filter, for forming from the speech signal a filtered signal, through deemphasis of a spectral fraction below a predetermined frequency, the filter feeding the filtered signal to
an averaging mechanism which generates, through averaging in successive time windows, a time stream of averages representing said time dependent intensity of the speech signal.
In this apparatus, the physical speech signal is first filtered using a high pass or band pass filter which emphasizes frequencies well above the repetition rate of glottal closure. The filtering will emphasize the short term effects of glottal closure over longer term signal development which is due mainly to ringing in the vocal tract after glottal closure. However, in itself the filtering usually will not give rise to a single peak, corresponding to the instant of glottal closure. On the contrary, it will increase the relative contribution of noise peaks, and moreover the effect of glottal closure itself is often distributed over several peaks, an effect which can be worsened by the occurrence of short term echoes.
We have found that near the instant of glottal closure, there will usually be a large peak or many small peaks, both of which correspond to a large local signal density, i.e. aggregate peak number/amplitude count. Therefore, instead of containing only detection means for signal peaks, the apparatus comprises averaging means which determine the signal intensity by averaging contributions from successive windows of time instants. Consequently, each instant of glottal closure will correspond to a single peak in the physical intensity, and, for example, the instant when the peak value is reached or the center of the peak will have a time relation to the instant of glottal closure which is independent of the details of the speech signal.
In an embodiment of the apparatus according to the invention, the filtering means are arranged for feeding the filtered signal to the averaging means via rectifying means, which rectify the filtered signal, through value to value conversion, into a strength signal. By rectifying is meant the process of obtaining a signal with a DC component which is responsive to the amplitude of an AC signal, in this case obtaining the strength signal from the filtered signal. A simple example of a rectifying value to value conversion is the conversion of filtered signal values to their respective absolute values. In general, any conversion in which values of opposite sign do not consistently yield exactly opposite converted values qualifies as rectifying, provided values with successively larger amplitudes are converted to converted values with successively larger amplitudes at least in some value range. Examples of rectifying conversions in this sense are taking the exponential of the signal, any power of its absolute value, or linear combinations thereof.
In one embodiment of the apparatus according to the invention, the conversion comprises squaring of values of the filtered signal. In this way, the DC component of the strength signal, i.e. the physical intensity, represents the energy density of the signal, which will give rise to optimal detection if the peak amplitudes are normally distributed in the statistical sense.
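The two rectifying conversions named above, absolute value and squaring, can be sketched as follows. This is a minimal illustration only; the function name `rectify` and its `mode` parameter are hypothetical and do not appear in the description.

```python
def rectify(filtered, mode="square"):
    """Convert filtered signal values into a non-negative strength signal.

    Hypothetical helper: 'square' gives an energy-like strength signal,
    'abs' gives the absolute-value rectification mentioned in the text.
    """
    if mode == "square":
        return [v * v for v in filtered]
    if mode == "abs":
        return [abs(v) for v in filtered]
    raise ValueError("mode must be 'square' or 'abs'")


# Values of opposite sign map to the same strength value.
strength = rectify([-2.0, 0.5, 3.0])
```

Both conversions discard the sign of the filtered signal, so that positive and negative excursions contribute alike to the subsequent averaging.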
In an embodiment of the apparatus according to the invention, the strength signal is weighted in each of the windows during said averaging, with weighting coefficients which remain constant as a function of time distance from a centre of the window up to a predetermined distance, and which from the predetermined distance decrease monotonically to zero at the edge of the window. A set of weighting coefficients which gradually decreases at the edges of the window mitigates the suddenness of the onset of contributions due to peaks in the filtered signal; this makes the onset of peaks in the physical intensity less susceptible to individual peaks in the filtered signal if the latter contains several peaks for one instant of glottal closure.
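One possible shape for such weighting coefficients is sketched below: constant near the centre, then decreasing towards the edge of the window. The linear taper, the `flat_fraction` parameter, and the names `flat_top_weights` and `windowed_intensity` are assumptions made for illustration; the description only requires a monotonic decrease to zero.

```python
def flat_top_weights(length, flat_fraction=0.5):
    """Weights equal to 1 near the centre, tapering linearly towards the edges.

    The weight would reach exactly zero one sample beyond each end of the
    window, so every sample inside the window keeps a positive weight.
    """
    centre = (length - 1) / 2.0
    half_width = (length + 1) / 2.0      # distance at which the weight is zero
    flat = flat_fraction * half_width    # the "predetermined distance"
    weights = []
    for k in range(length):
        d = abs(k - centre)
        if d <= flat:
            weights.append(1.0)
        else:
            weights.append((half_width - d) / (half_width - flat))
    return weights


def windowed_intensity(strength, weights):
    """Weighted average of the strength signal in a window centred on each sample.

    Samples outside the signal are treated as zero.
    """
    total = sum(weights)
    half = len(weights) // 2
    out = []
    for n in range(len(strength)):
        acc = 0.0
        for k, w in enumerate(weights):
            i = n + k - half
            if 0 <= i < len(strength):
                acc += w * strength[i]
        out.append(acc / total)
    return out


# A single strength peak is spread into one smooth bump around its position.
intensity = windowed_intensity([0, 0, 0, 1, 0, 0, 0], flat_top_weights(5))
```

With the tapered edges, a peak entering the window contributes gradually rather than all at once, which is the mitigation of sudden onsets described above.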
The precise temporal extent of the windows is not critical. However, if the windows are so wide as to encompass more than one successive instant of glottal closure, there will be contributions to the average which do not belong to a single instant of glottal closure and a poorer signal to noise ratio will generally occur in the intensity. To avoid overlap of contributions from neighboring instants of glottal closure, the extent should be made shorter than the time interval between neighboring instants of glottal closure, which for male voices is in the range of 8 to 10 msec and for female voices is in the range of 4 to 5 msec. Too small an extent incurs a risk of multiple detections, which is reduced as the extent is increased. Depending on the quality of the physical speech signal a minimum extent upward of 1 msec has been found practical; an extent of 3 msec was a good tradeoff for both male and female voices.
In one embodiment, the apparatus comprises width setting means, for setting a temporal width of the windows according to a pitch of the speech signal. The width setting means use a prior estimate of the pitch period, i.e. the interval between neighboring instants of glottal closure, to restrict the temporal extent of the window to below this interval. The prior estimate may be obtained in any one of several ways, for example by feeding back an average of the interval lengths between earlier detected instants of glottal closure, by using a separate pitch estimator, or by using a user control selector. Since the most significant pitch differences are between male and female voices, a male/female voice selection button may be used for selecting one of two extents for the window. Accordingly, an embodiment of the apparatus according to the invention is characterized in that the setting means are arranged for setting the temporal width to a first or second extent, the first extent lying between 1 and 5 milliseconds and the second extent lying between 5 and 10 milliseconds.
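The feedback variant of the width setting can be sketched as follows: the interval between earlier detected instants is averaged, and the window is set to a fraction of that interval. The function names, the fraction of 0.5, and the 10 kHz sampling rate are assumptions chosen for illustration, not values prescribed by the description; the 1 msec lower bound echoes the minimum extent mentioned earlier.

```python
def estimate_period_ms(closure_times_ms):
    """Average interval between earlier detected instants of glottal closure."""
    if len(closure_times_ms) < 2:
        raise ValueError("need at least two detected instants")
    intervals = [b - a for a, b in zip(closure_times_ms, closure_times_ms[1:])]
    return sum(intervals) / len(intervals)


def window_length_samples(period_ms, sample_rate_hz=10000,
                          fraction=0.5, minimum_ms=1.0):
    """Window width in samples, kept below the estimated pitch period.

    'fraction' (assumed, below 1) scales the period; 'minimum_ms' guards
    against very short windows, which risk multiple detections.
    """
    width_ms = max(minimum_ms, fraction * period_ms)
    return max(1, round(width_ms * sample_rate_hz / 1000.0))


# Instants detected at 0, 8, 16 and 24 msec give an 8 msec period estimate,
# hence a 4 msec window, i.e. 40 samples at 10 kHz.
period = estimate_period_ms([0.0, 8.0, 16.0, 24.0])
length = window_length_samples(period)
```

A two-position male/female selector would replace `estimate_period_ms` by a choice between two fixed extents within the ranges stated above.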
In an embodiment of the apparatus according to the invention, the filtering means copy a further spectral fraction of the speech signal above 1 kHz substantially indiscriminately into the filtered signal. This makes the filtering means easy to implement. For example, when the physical speech signal is a sampled signal, with 10 kilosamples per second, samples $I_n$ being identified by a sample time index "n", the expression
$s_n = I_n - 0.9\,I_{n-1}$
gives a satisfactory way of producing a filtered signal $s_n$.
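The first-difference expression above can be sketched in a few lines. The function name `preemphasize` and the choice of treating the sample before the first one as zero are assumptions made for this illustration.

```python
def preemphasize(samples, coeff=0.9):
    """Return s[n] = I[n] - coeff * I[n-1], taking the sample before I[0] as 0."""
    filtered = []
    previous = 0.0
    for value in samples:
        filtered.append(value - coeff * previous)
        previous = value
    return filtered


# A constant (zero-frequency) input is attenuated to about one tenth ...
slow = preemphasize([1.0, 1.0, 1.0, 1.0])
# ... while an input alternating at half the sampling rate is amplified.
fast = preemphasize([1.0, -1.0, 1.0, -1.0])
```

After the first sample, the constant input settles at a gain of 1 - 0.9 = 0.1 and the alternating input at a gain of 1 + 0.9 = 1.9, which illustrates the deemphasis of low frequencies relative to high ones.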
The detection of the instants of glottal closure may be performed by locating locally maximal intensity values, or simply by detecting when the physical intensity crosses a threshold, or by measuring the centre position of peaks. In an embodiment of the apparatus according to the invention, detection is accomplished by
determining an average DC content of the strength signal, averaged over a temporal extent wider than the width of the windows, and then
determining whether the time dependent intensity exceeds the average DC content by more than a predetermined factor, excesses corresponding to the specific peaks. In this way, the thresholds are set automatically and are robust against variations in the nature of the signal. When the predetermined factor is set sufficiently high, unvoiced signals will not lead to detection of any instants of glottal closure.
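The self-adjusting threshold described above can be sketched as follows, using plain unweighted averages for both the short window and the wider DC estimate, and reporting the centre of each region of excess as the detected instant. The names `moving_average` and `detect_closures`, the window lengths, and the factor of 2 are illustrative assumptions only.

```python
def moving_average(values, length):
    """Centred average over 2 * (length // 2) + 1 samples, truncated at the ends."""
    half = length // 2
    out = []
    for n in range(len(values)):
        lo = max(0, n - half)
        hi = min(len(values), n + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def detect_closures(strength, short_len=20, long_len=200, factor=2.0):
    """Return sample indices where the intensity exceeds factor * DC content.

    Each contiguous region of excess yields one instant, taken at its centre.
    """
    intensity = moving_average(strength, short_len)   # windowed intensity
    dc = moving_average(strength, long_len)           # wider-extent DC content
    instants = []
    n = 0
    while n < len(intensity):
        if intensity[n] > factor * dc[n]:
            start = n
            while n < len(intensity) and intensity[n] > factor * dc[n]:
                n += 1
            instants.append((start + n - 1) // 2)
        else:
            n += 1
    return instants


# Synthetic "voiced" strength signal: one unit peak every 80 samples
# (8 msec at an assumed 10 kHz sampling rate).
voiced = [0.0] * 400
for position in (40, 120, 200, 280, 360):
    voiced[position] = 1.0
found = detect_closures(voiced)

# A featureless strength signal never exceeds twice its own average.
nothing = detect_closures([1.0] * 400)
```

Because the threshold scales with the long-term average, the same factor serves for loud and quiet passages, and a signal whose strength is spread evenly over time produces no detections.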
In an embodiment of the apparatus according to the invention, the detecting means feed a synchronization input of a frame by frame speech analysis mechanism, for controlling positions of frames during analysis of the physical speech signal.
In an embodiment of the apparatus according to the invention, the detecting means feed an excitation input of a vocal tract simulator, for forming a synthesized speech signal.