1. Field of the Invention
The present invention relates to a speech-recognizing apparatus. More particularly, the present invention relates to improvement of a speech-recognition rate in a noisy environment and reduction of the amount of speech-recognition processing.
2. Description of the Related Art
In recent years, products incorporating a speech-recognizing function have become increasingly common. However, speech-recognition technologies of the present state of the art have a problem in that they cannot deliver good performance unless restrictive conditions are satisfied, such as a requirement that the technologies be applied in a quiet environment. Such restrictions serve as a major barrier to popularization of speech recognition, raising a demand for improvement of the speech-recognition rate in a noisy environment. One of the conventional speech-recognition methods for improving the speech-recognition rate in a noisy environment is disclosed in Japanese Patent Laid-open No. Hei5-210396. This disclosed method is referred to hereafter as a method of the first prior art. The first prior art provides a method for correcting a similarity between vectors by using a maximum similarity in the frame of the vectors. To put it in detail, in accordance with this method, characteristics of an input audio signal are first analyzed and converted into a sequence of characteristic vectors along the time axis. A similarity between vectors is then found from a distance between a characteristic vector of one frame of the time-axis sequence of characteristic vectors and a characteristic vector composing a standard pattern cataloged in advance in accordance with a probability distribution. Then, a maximum value of the similarities between vectors is found for each frame.
Subsequently, a correction value is found from the maximum value of similarities between vectors found for each frame. A similarity between vectors is then corrected by using the correction value to produce a corrected similarity. The corrected similarities of the frames are then cumulated to result in a cumulative corrected similarity. Subsequently, the cumulative corrected similarity is compared with a predetermined threshold value. If the cumulative corrected similarity is found greater than the threshold value, speech corresponding to the cumulative corrected similarity is determined to have been input. Since a similarity between vectors is corrected by using a maximum similarity for each frame as described above, the effects of noise cancel each other out, resulting in an improved speech-recognition rate.
One of the conventional speech-recognition methods for improving the speech-recognition rate in a word-spotting process is disclosed in Japanese Patent Laid-open No. Sho63-254498. This disclosed method is referred to hereafter as a method of the second prior art. This method utilizes a difference between the largest and second largest similarities or a ratio of the largest similarity to the second largest similarity. To put it in detail, first of all, a characteristic parameter is extracted from input speech. Then, a similarity between the extracted characteristic parameter and a characteristic parameter of a standard pattern is found. A cumulative similarity is then computed for each standard pattern by cumulating the similarities. The cumulative similarity is found by word spotting, which shifts the start point of time and the end point of time of a cumulating interval little by little. Subsequently, the cumulative similarities are sorted to determine the largest and second largest ones. Then, a difference between the largest and second largest similarities or a ratio of the largest similarity to the second largest similarity is compared with a predetermined threshold value. 
If the difference between the largest and second largest similarities or the ratio of the largest similarity to the second largest similarity is found greater than the threshold value, the input speech is determined to be a word corresponding to the largest cumulative similarity. By comparing a difference between the largest and second largest similarities or a ratio of the largest similarity to the second largest similarity with a predetermined threshold value as described above, only a sufficiently probable result of recognition is accepted as a word. As a result, the speech-recognition rate is improved.
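The decision rule of the second prior art described above can be illustrated by the following minimal sketch. It is not taken from the cited publication; the function name accept_by_margin, the dictionary of cumulative similarities and the numerical values are hypothetical, and positive similarities are assumed when the ratio is used.

```python
def accept_by_margin(cumulative_similarities, threshold, use_ratio=False):
    """Return the top-ranked word if it beats the runner-up by more than
    `threshold`, otherwise None (illustrative helper, not from the source).

    cumulative_similarities: dict mapping each word to its cumulative
    similarity; at least two words and positive values are assumed.
    """
    # Sort the cumulative similarities to find the largest and second largest.
    ranked = sorted(cumulative_similarities.items(),
                    key=lambda item: item[1], reverse=True)
    (best_word, best), (_, second) = ranked[0], ranked[1]
    # Use either the ratio or the difference as the decision score.
    score = best / second if use_ratio else best - second
    return best_word if score > threshold else None


# Hypothetical cumulative similarities for three word standard patterns.
scores = {"yes": 0.9, "no": 0.4, "stop": 0.3}
print(accept_by_margin(scores, threshold=0.3))                  # difference test
print(accept_by_margin(scores, threshold=2.0, use_ratio=True))  # ratio test
```

A strict threshold rejects noise, but it also rejects genuine speech whose margin over the runner-up has been reduced by noise.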
In the first prior art, a similarity between frames found by using a probability distribution is used in comparison of input speech with a standard pattern. In this case, the effect of noise can be inferred to a certain degree by using a maximum similarity. If a distance between vectors is used in place of the similarity between frames, however, the minimum value of the vector-to-vector distances varies in dependence on, among others, the type of the phoneme. It is thus difficult to infer the effect of noise by using the minimum value of the vector-to-vector distances. For this reason, there is a problem in that the method according to the first prior art cannot be applied to a case wherein a distance is used in comparison of input speech with a standard pattern. In the case of the second prior art, on the other hand, the threshold value is set strictly so as to prevent noise from being determined to be speech. In consequence, when the similarity between input speech and a standard pattern decreases due to the effect of noise or the like, speech cannot be detected in many cases.
FIG. 14 is a diagram showing a problem of a word-spotting process. Notations A1, A2, A3, A4, B1, B2, B3, B4, C1, C2, C3 and C4 shown in FIG. 14 each denote a speech interval in a word-spotting process. Speech may exist in any of the speech intervals. The speech intervals have different start and end edges. For each of the speech intervals, a cumulative similarity between frames and a cumulative distance between frames are found by adopting methods such as a DP (Dynamic Programming) matching technique or an HMM (Hidden Markov Model) technique. In the example shown in FIG. 14, the similarity of the speech interval C2 coinciding with the input speech is a maximum. Since speech may exist in any of the speech intervals and cumulative processing is carried out for each of such intervals, the word-spotting process has a problem of a large amount of processing. In order to solve this problem, an end-edge-free method has been proposed. However, the end-edge-free method has the following problem.
FIG. 15 is a diagram showing the problem of the end-edge-free method. In the case of the end-edge-free method shown in FIG. 15, cumulative processing is carried out by identifying only a start edge and treating the interval beginning from that start edge as a speech interval. Since cumulative processing is carried out for speech intervals A, B and C in the case of the end-edge-free method shown in FIG. 15 instead of the speech intervals A1, A2, A3, A4, B1, B2, B3, B4, C1, C2, C3 and C4 shown in FIG. 14 in the word-spotting process, the amount of processing can be reduced. Since a period between the start edge and a speech-input point with a fixed duration in the speech interval is indefinite, however, the end-edge-free method has a problem of a resulting extension. In the case of the speech interval C, for example, a delay τ inevitably results.
It is thus an object of the present invention addressing the problems described above to provide a speech-recognizing apparatus capable of improving the speech-recognition rate by reducing the effect of a noise in a case of using a distance between frames in comparison of an input voice with a standard pattern.
It is another object of the present invention to provide a speech-recognizing apparatus capable of detecting speech even for a case in which a frame-to-frame distance between input speech and a standard pattern increases or a frame-to-frame similarity between input speech and a standard pattern decreases due to an effect of a noise or the like.
It is a further object of the present invention to provide a speech-recognizing apparatus capable of reducing the amount of processing in a word-spotting process and decreasing the magnitude of a delay in the end-edge-free method.
In accordance with an aspect of the present invention, there is provided a speech-recognizing apparatus for recognizing input speech, the apparatus comprising a phoneme-standard-characteristic-pattern storage unit for storing a phoneme characteristic vector of a plurality of phoneme standard patterns in advance, an analysis unit for computing a characteristic vector for each of frames of the input speech, a vector-to-vector-distance-computing unit for computing a vector-to-vector distance between the characteristic vector for each of the frames and the phoneme characteristic vector, an average-value-computing unit for computing an average value of vector-to-vector distances of phonemes for one of the frames, a correction unit for correcting the vector-to-vector distance by subtracting the average value from the vector-to-vector distance, a word-standard-pattern storage unit for storing a word standard pattern defining side information of the phoneme standard patterns, and a recognition unit for cumulating corrected vector-to-vector distances each produced by the correction unit into a cumulative vector-to-vector distance and comparing the cumulative vector-to-vector distance with the word standard pattern in order to recognize the input speech.
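The correction by the average value described in this aspect can be illustrated by the following minimal sketch. It rests on several assumptions that are not stated in the text: noise is modeled as a uniform offset added to every phoneme distance of a frame, the alignment of frames to phonemes is given as a fixed path rather than found by DP matching, and the names correct_distances and cumulative_distance as well as the numerical values are hypothetical.

```python
def correct_distances(frame_distances):
    """Subtract, for each frame, the average distance over all phonemes.

    frame_distances: list of frames, each a list of vector-to-vector
    distances to the phoneme standard patterns (hypothetical layout).
    """
    corrected = []
    for distances in frame_distances:
        average = sum(distances) / len(distances)
        corrected.append([d - average for d in distances])
    return corrected


def cumulative_distance(corrected, phoneme_path):
    """Cumulate corrected distances along a fixed frame-to-phoneme path.

    Assumption: phoneme_path[t] is the phoneme index assigned to frame t;
    a practical recognizer would search for this alignment instead.
    """
    return sum(corrected[t][p] for t, p in enumerate(phoneme_path))


# Two frames, three phonemes; noise is modeled as a uniform +2.0 offset.
clean = [[1.0, 3.0, 5.0], [4.0, 2.0, 6.0]]
noisy = [[d + 2.0 for d in frame] for frame in clean]

print(correct_distances(clean))
print(correct_distances(noisy))  # same values: the uniform offset is removed
print(cumulative_distance(correct_distances(noisy), [0, 1]))
```

Under the uniform-offset assumption, the corrected distances of the noisy input coincide with those of the clean input, which is the effect the correction unit is intended to achieve. Actual noise need not behave this uniformly.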
In accordance with another aspect of the present invention, there is provided a speech-recognizing apparatus for recognizing input speech, the apparatus comprising an analysis unit for computing characteristic vectors of intervals in the input speech, a word-standard-pattern storage unit for storing characteristic vectors of word standard patterns in advance, a similarity-computing unit for comparing the characteristic vectors of the intervals in the input speech with the characteristic vector of the word standard patterns in order to compute first similarities to the word standard patterns for a portion of the input speech in each of the intervals, a first judgment unit for forming a judgment as to whether or not a word of the word standard patterns corresponding to the first similarities is a word represented by the input speech by comparison of the first similarities or a result of computation based on the first similarities with a first threshold value, a candidate storage unit for storing second similarities or a result of computation based on the second similarities, a candidate-determining unit, which is used for storing the first similarities or a result of computation based on the first similarities as the second similarities or a result of computation based on the second similarities respectively into the candidate storage unit if an outcome of a judgment formed by the first judgment unit indicates that the word of the word standard patterns corresponding to the first similarities is not the word represented by the input speech as evidenced by the fact that the first similarities or a result of computation based on the first similarities are smaller than the first threshold value, the first similarities or a result of computation based on the first similarities are greater than a second threshold value smaller than the first threshold value, and the first similarities or a result of computation based on the first similarities are greater than the second 
similarities or a result of computation based on the second similarities respectively, and a second judgment unit, which is used for determining that the word of the word standard patterns corresponding to the second similarities is the word represented by the input speech on the basis of the second similarities or a result of computation based on the second similarities stored in the candidate storage unit in case an outcome of a judgment formed by the first judgment unit indicates that the word of the word standard patterns corresponding to the first similarities is not the word represented by the input speech within a predetermined period.
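The interplay of the first judgment unit, the candidate-determining unit and the second judgment unit can be illustrated by the following minimal sketch. It simplifies the arrangement described above: one similarity per interval is assumed, the predetermined period is assumed to be counted in intervals from the moment a candidate is first stored, and the name spot_word and all numerical values are hypothetical.

```python
def spot_word(scored_intervals, first_threshold, second_threshold, period):
    """Return the recognized word, or None if nothing is accepted.

    scored_intervals: list of (word, similarity) pairs, one per interval,
    in time order (hypothetical layout).
    """
    candidate = None  # contents of the candidate storage: (word, similarity)
    waited = 0        # intervals elapsed since a candidate was first stored
    for word, similarity in scored_intervals:
        # First judgment: accept at once when the first threshold is reached.
        if similarity >= first_threshold:
            return word
        # Candidate determination: below the first threshold, above the
        # second threshold, and better than the stored candidate.
        if similarity > second_threshold and (
                candidate is None or similarity > candidate[1]):
            candidate = (word, similarity)
        # Second judgment: fall back to the stored candidate once the
        # predetermined period has elapsed without a first-judgment hit.
        if candidate is not None:
            waited += 1
            if waited >= period:
                return candidate[0]
    return None


intervals = [("stop", 0.2), ("play", 0.6), ("play", 0.7),
             ("stop", 0.3), ("stop", 0.1)]
print(spot_word(intervals, first_threshold=0.9,
                second_threshold=0.5, period=3))
```

In this example no interval reaches the first threshold, so the word held in the candidate storage is output after the period expires. With the first judgment alone, the utterance would be missed.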
In accordance with a further aspect of the present invention, there is provided a speech-recognizing apparatus for recognizing input speech, the apparatus comprising a phoneme-standard-pattern storage unit for storing a phoneme characteristic vector of a plurality of phoneme patterns in advance, an analysis unit for computing a characteristic vector of each frame in the input speech, a distance storage unit for storing vector-to-vector distances to the phoneme standard patterns for each frame, a vector-to-vector-distance-computing unit for computing a vector-to-vector distance between the characteristic vector of the frame and the phoneme characteristic vector of the phoneme standard patterns and storing the vector-to-vector distance into the distance storage unit, a word-standard-pattern storage unit for storing a word standard pattern defining side information of the phoneme standard patterns for each word in advance, a cumulative-distance-computing unit for reading out the vector-to-vector distances in a backward direction, that is, a direction from a most recent vector-to-vector distance to a less recent vector-to-vector distance, from the distance storage unit and computing a cumulative distance in the backward direction for each word, and a judgment unit for forming a judgment as to whether or not a word corresponding to the cumulative distance computed by the cumulative-distance-computing unit is a word represented by the input voice on the basis of the cumulative distance.
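The backward reading of the distance storage unit can be illustrated by the following minimal sketch. It assumes, purely for illustration, that each phoneme of a word occupies a fixed number of frames, whereas a practical recognizer would determine the alignment by a technique such as DP matching. The name backward_cumulative_distance and the numerical values are hypothetical.

```python
def backward_cumulative_distance(distance_store, word_phonemes,
                                 frames_per_phoneme):
    """Cumulate distances from the most recent frame backward.

    distance_store: list of frames, oldest first; each frame is a list of
    distances indexed by phoneme (hypothetical layout).
    word_phonemes: phoneme indices of the word in spoken order.
    Returns None when too few frames are stored.
    """
    newest_first = list(reversed(distance_store))
    total = 0.0
    index = 0
    # The last phoneme of the word is matched against the newest frames.
    for phoneme in reversed(word_phonemes):
        for _ in range(frames_per_phoneme):
            if index >= len(newest_first):
                return None
            total += newest_first[index][phoneme]
            index += 1
    return total


# Five stored frames, two phonemes; the word is phoneme 0 then phoneme 1.
store = [[9.0, 9.0], [1.0, 5.0], [2.0, 6.0], [7.0, 1.0], [8.0, 2.0]]
print(backward_cumulative_distance(store, [0, 1], frames_per_phoneme=2))
```

Because the cumulation starts at the most recent frame, the end edge is fixed at the current time and only the start edge remains free, so no waiting beyond the end of the utterance is built into the computation.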