1. Field of the Invention
This invention relates generally to electronic speech recognition systems, and relates more particularly to a method for utilizing validity constraints in a speech endpoint detector.
2. Description of the Background Art
Implementing an effective and efficient method for system users to interface with electronic devices is a significant consideration of system designers and manufacturers. Human speech recognition is one promising technique that allows a system user to effectively communicate with selected electronic devices, such as digital computer systems. Speech typically consists of one or more spoken utterances which each may include a single word or a series of closely-spaced words forming a phrase or a sentence. In practice, speech recognition systems typically determine the endpoints (the beginning and ending points) of a spoken utterance to accurately identify the specific sound data intended for analysis. Conditions with significant ambient background-noise levels present additional difficulties when implementing a speech recognition system. Examples of such conditions may include speech recognition in automobiles or in certain manufacturing facilities. In such user applications, in order to accurately analyze a particular utterance, a speech recognition system may be required to selectively differentiate between a spoken utterance and the ambient background noise.
Referring now to FIG. 1, a diagram of speech energy 110 from an exemplary spoken utterance is shown. In FIG. 1, speech energy 110 is shown with time values displayed on the horizontal axis and with speech energy values displayed on the vertical axis. Speech energy 110 is shown as a data sample which begins at time 116 and which ends at time 118. Furthermore, the particular spoken utterance represented in FIG. 1 includes a beginning point ts which is shown at time 112 and also includes an ending point te which is shown at time 114.
In many speech detection systems, the system user must identify a spoken utterance by manually indicating the beginning and ending points with a user input device, such as a push button or a momentary switch. This “push-to-talk” system presents serious disadvantages in applications where the system user is otherwise occupied, such as while operating an automobile in congested traffic conditions. A system that automatically identifies the beginning and ending points of a spoken utterance thus provides a more effective and efficient method of implementing speech recognition in many user applications.
Speech recognition systems may use many different techniques to determine endpoints of speech. However, in spite of attempts to select techniques that effectively and accurately allow the detection of human speech, robust speech detection under conditions of significant background noise remains a challenging problem. A system that utilizes effective techniques to perform robust speech detection in conditions with background noise may thus provide a more useful and powerful method of speech recognition. Therefore, for all the foregoing reasons, implementing an effective and efficient method for system users to interface with electronic devices remains a significant consideration of system designers and manufacturers.
In accordance with the present invention, a method for utilizing validity constraints in a speech endpoint detector is disclosed. In one embodiment, a validity manager preferably includes, but is not limited to, a pulse width module, a minimum power module, a duration module, and a short-utterance minimum power module.
In accordance with the present embodiment, the pulse width module may advantageously utilize several constraint variables during the process of identifying a valid reliable island for a particular utterance. The pulse width module preferably measures individual pulse widths in speech energy, and may then store each pulse width in constraint value registers as a single pulse width (SPW) value. The pulse width module may then reference the SPW values to eliminate any energy pulses that are less than a pre-determined duration.
The pulse width module may also measure gap durations between individual pulses in speech energy (corresponding to the foregoing SPW values), and may then store each gap duration in constraint value registers as a pulse gap (PG) value. The pulse width module may then reference the PG values to control the maximum allowed gap duration between the energy pulses to be included in a TPW value constraint that is discussed below.
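The SPW and PG measurements described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that speech energy has already been reduced to one boolean per frame indicating whether the frame lies inside an energy pulse; the function names and the frame-based units are hypothetical and do not come from the specification.

```python
from typing import List, Tuple

Pulse = Tuple[int, int]  # (start_frame, width_in_frames); hypothetical representation


def measure_pulses(active: List[bool]) -> List[Pulse]:
    """Return one (start, width) pair per run of consecutive active frames (SPW values)."""
    pulses: List[Pulse] = []
    start = None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i  # a pulse begins
        elif not flag and start is not None:
            pulses.append((start, i - start))  # a pulse ends
            start = None
    if start is not None:  # pulse still open at the end of the data
        pulses.append((start, len(active) - start))
    return pulses


def filter_short_pulses(pulses: List[Pulse], min_spw: int) -> List[Pulse]:
    """Eliminate pulses shorter than the assumed minimum duration min_spw."""
    return [(s, w) for s, w in pulses if w >= min_spw]


def gap_durations(pulses: List[Pulse]) -> List[int]:
    """Return the gap (PG value) in frames between each pair of adjacent pulses."""
    return [
        pulses[k + 1][0] - (pulses[k][0] + pulses[k][1])
        for k in range(len(pulses) - 1)
    ]
```

In this sketch the gaps are computed from whichever pulse list is passed in, so calling `gap_durations` after `filter_short_pulses` treats an eliminated pulse as part of the surrounding gap. Whether the actual pulse width module does so is not stated in the text.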
In the present embodiment, the validity manager may advantageously utilize the pulse width module to detect a valid reliable island during conditions where speech energy includes multiple speech energy pulses within a certain pre-determined time period “P”. In certain embodiments, a beginning point for a reliable island is detected when sequential values for the detection parameter DTF are greater than a reliable island threshold Tsr for a given number of consecutive frames. However, for multi-syllable words, a single syllable may not last long enough to satisfy the condition of P consecutive frames.
The pulse width module may therefore preferably sum each energy pulse identified with a SPW value (subject to the foregoing PG value constraint) to thereby produce a total pulse width (TPW) value, which may also be stored in constraint value registers. The validity manager may thus detect a reliable island whenever a TPW value is greater than a reliable island threshold Tsr for a given number of consecutive frames “P”.
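One possible reading of the TPW computation is sketched below. It assumes that pulses are given as hypothetical (start_frame, width) pairs, that adjacent pulses are chained together only while the gap between them does not exceed an assumed maximum `max_pg`, and that a reliable island is declared once the summed width reaches `p_frames`. The comparison against P frames is an interpretation of the text, not a confirmed detail of the method.

```python
from typing import List, Tuple

Pulse = Tuple[int, int]  # (start_frame, width_in_frames); hypothetical representation


def total_pulse_width(pulses: List[Pulse], max_pg: int) -> int:
    """Largest sum of pulse widths over a chain of pulses whose gaps are all <= max_pg."""
    best = 0
    running = 0
    prev_end = None
    for start, width in pulses:
        if prev_end is not None and start - prev_end <= max_pg:
            running += width  # gap is small enough: extend the current chain
        else:
            running = width  # gap too large (or first pulse): start a new chain
        best = max(best, running)
        prev_end = start + width
    return best


def is_reliable_island(pulses: List[Pulse], max_pg: int, p_frames: int) -> bool:
    """Assumed rule: the chained pulse width must cover at least p_frames frames."""
    return total_pulse_width(pulses, max_pg) >= p_frames
```

For example, two syllables of 3 and 2 frames separated by a 1-frame gap would jointly satisfy a 5-frame requirement under this sketch, even though neither syllable does so alone.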
In addition, the validity manager may preferably utilize the minimum power module to ensure that speech energy below a pre-determined level is not classified as a valid utterance, even when the pulse width module identifies a valid reliable island. Therefore, in the present embodiment, the minimum power module preferably compares the magnitude peak of segments of the speech energy to a pre-determined constant value, and rejects utterances with a magnitude peak speech energy below the constant value as invalid.
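The minimum power check can be sketched as a single comparison. The example assumes the segment is available as a sequence of non-negative per-frame energy values and that `min_peak` stands in for the pre-determined constant; both names are hypothetical.

```python
from typing import Sequence


def passes_minimum_power(energy: Sequence[float], min_peak: float) -> bool:
    """Reject a segment whose peak energy is below the assumed constant min_peak."""
    if not energy:
        return False  # an empty segment cannot be a valid utterance
    return max(energy) >= min_peak
```

This check would run in addition to the reliable island test, so a segment that passes the pulse width constraints can still be rejected for low energy.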
In the present embodiment, the validity manager also preferably utilizes the duration module to impose duration constraints on a given detected segment of speech energy. Therefore, the duration module may preferably compare the duration of a detected segment of speech energy to two pre-determined constant duration values. In accordance with the present invention, segments of speech with durations that are greater than a first constant are preferably classified as noise. Segments of speech with durations that are less than a second constant are preferably analyzed further by the short-utterance minimum power module as discussed below.
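The two duration constraints amount to a three-way classification, sketched below. The names `max_duration` (the first constant) and `short_duration` (the second constant), the `DurationClass` labels, and the use of frame counts as the duration unit are all assumptions made for illustration.

```python
from enum import Enum


class DurationClass(Enum):
    NOISE = "noise"    # longer than the first constant
    SHORT = "short"    # shorter than the second constant; needs further analysis
    NORMAL = "normal"  # neither constraint triggered


def classify_duration(duration: int, max_duration: int, short_duration: int) -> DurationClass:
    """Classify a detected segment by comparing its duration to two assumed constants."""
    if duration > max_duration:
        return DurationClass.NOISE
    if duration < short_duration:
        return DurationClass.SHORT
    return DurationClass.NORMAL
```

A segment labeled `SHORT` here is not rejected outright; it is passed on for the stricter energy check performed by the short-utterance minimum power module.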
In the present embodiment, the validity manager may preferably utilize the short-utterance minimum power module to distinguish an utterance of short duration from background pulse noise. To distinguish a short utterance from background noise, the short utterance preferably has a relatively high energy value.
Therefore, the short-utterance minimum power module may preferably compare the magnitude peak of segments of the speech energy to a pre-determined constant value that is relatively larger than the pre-determined constant utilized by the foregoing minimum power module. The present invention thus efficiently and effectively implements a method for utilizing validity constraints in a speech endpoint detector.
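The relationship between the two power thresholds can be sketched as follows. The example assumes the caller already knows whether the segment was classified as short, and that `min_peak` and `short_min_peak` stand in for the two pre-determined constants; the names and the explicit ordering check are illustrative additions.

```python
from typing import Sequence


def passes_power_constraint(
    energy: Sequence[float],
    is_short: bool,
    min_peak: float,
    short_min_peak: float,
) -> bool:
    """Apply the stricter peak threshold to short segments, the ordinary one otherwise."""
    if short_min_peak <= min_peak:
        # The text describes the short-utterance constant as the larger of the two.
        raise ValueError("short_min_peak must be larger than min_peak")
    if not energy:
        return False
    threshold = short_min_peak if is_short else min_peak
    return max(energy) >= threshold
```

Under this sketch a brief burst with a peak of 0.9 would be accepted as an ordinary segment against a threshold of 0.5, but rejected as a short segment against a threshold of 1.0, which is how a short utterance is separated from background pulse noise.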