1. Field of the Invention
The present invention relates to a method for detecting a similarity between standard information and input information and to a method for recognizing whether or not the input information is the standard information by use of a value obtained by detecting the similarity (a detected value of the similarity) or for judging whether or not the input information is abnormal.
More specifically, the present invention relates to a method for detecting a similarity between a standard voice and an input voice with regard to a voice uttered by a human being and to a method for recognizing a voice by use of a detected value of the similarity. Moreover, the present invention relates to a method for detecting a similarity between a standard vibration wave and an arbitrary vibration wave with regard to a sound or a vibration emitted by equipment in operation or the like and to a method for judging an abnormality in a machine based on a detected value of the similarity. Furthermore, the present invention relates to a method for detecting a similarity between a standard image and an arbitrary image with regard to a letter or a pattern and to a method for recognizing the image by use of a detected value of the similarity. Still further, the present invention relates to a method for detecting a similarity between a standard solid and an arbitrary solid and to a method for recognizing a solid by use of a detected value of the similarity. Yet further, the present invention relates to a method for detecting a similarity between a standard moving picture and an arbitrary moving picture and to a method for recognizing a moving picture by use of a detected value of the similarity.
2. Description of the Related Art
A voice recognition apparatus, in which a computer automatically recognizes a voice uttered by a human being, is equipped with a means for detecting a similarity between a standard voice and an input voice and a means for recognizing the input voice from a detected value of the similarity when a known voice previously registered in the computer is set as the standard voice and an unknown voice newly inputted to the computer is set as the input voice.
In conventional similarity detection for voices, a method has been adopted which includes the steps of: previously registering a standard pattern matrix whose components are feature quantities of the standard voice, such as its power spectrum; preparing an input pattern matrix whose components are feature quantities of the input voice; and calculating a Euclid distance or an angle between the standard pattern matrix and the input pattern matrix. Moreover, in conventional voice recognition, a method has been adopted which includes the step of comparing the calculated value of the Euclid distance or the angle with an arbitrarily set acceptable value. Namely, a pattern space is assumed whose number of dimensions equals the number of kinds of feature quantities, the extent of similarity between the two pattern matrices is numerically evaluated by use of a similarity measure representing a linear distance (Euclid distance) or an angle between a point of the standard pattern matrix and a point of the input pattern matrix, and the voice is then recognized based on the evaluated value.
As a first example of the related art, FIGS. 39 and 40 schematically show a state with regard to a standard voice 20 with a flat power spectrum shape and input voices 21, 22 and 23 with energies equal to that of the standard voice 20 but with different features of the power spectrum shapes. Specifically, FIGS. 39 and 40 show the following state. A standard pattern matrix 20A of seven rows and nine columns with the power spectrum of the standard voice 20 as a component is previously registered. And, each of input pattern matrices 21A, 22A and 23A of seven rows and nine columns with a power spectrum of each of the input voices 21, 22 and 23 as a component is prepared. Then, as a measure of a similarity between the standard pattern matrix 20A and each of the input pattern matrices 21A, 22A and 23A, the Euclid distance or a cosine of the angle indicated by e21, e22 or e23 is calculated.
Here, it is assumed that each of the input voices 21, 22 and 23 has relations γ, δ, ε, ζ, η and θ shown in FIG. 40 with regard to a parameter α. Namely, in the relations shown in FIG. 40, the parameter α prescribes a change of the power spectrum shape of each of the input voices 21, 22 and 23 from the power spectrum shape of the standard voice 20. The Euclid distance is obtained as the square root of the sum of the squares of the differences between the respective components of the standard pattern matrix and the corresponding components of the input pattern matrix. The cosine of the angle is obtained by dividing the sum of the products of the corresponding components of the two pattern matrices by the product of two square roots: the square root of the sum of the squares of the components of the standard pattern matrix and the square root of the sum of the squares of the components of the input pattern matrix.
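The two conventional measures defined above can be written as a minimal sketch. The function names and the small 2x2 matrices are hypothetical and chosen only for illustration; they are not the seven-row, nine-column matrices of FIG. 39.

```python
import math

def euclid_distance(standard, inp):
    """Square root of the sum of squared differences of corresponding components."""
    return math.sqrt(sum((s - x) ** 2
                         for s_row, x_row in zip(standard, inp)
                         for s, x in zip(s_row, x_row)))

def cosine_of_angle(standard, inp):
    """Sum of component products divided by the product of the two matrix norms."""
    dot = sum(s * x
              for s_row, x_row in zip(standard, inp)
              for s, x in zip(s_row, x_row))
    norm_standard = math.sqrt(sum(s * s for row in standard for s in row))
    norm_input = math.sqrt(sum(x * x for row in inp for x in row))
    return dot / (norm_standard * norm_input)

# Hypothetical pattern matrices (illustrative values only).
standard_matrix = [[1.0, 1.0], [1.0, 1.0]]
input_matrix = [[2.0, 0.0], [1.0, 1.0]]

distance = euclid_distance(standard_matrix, input_matrix)
cosine = cosine_of_angle(standard_matrix, input_matrix)
print(distance, cosine)
```

In the conventional recognition step, either value would then be compared with an arbitrarily set acceptable value.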
As a second example of the related art, FIGS. 41 and 42 schematically show a state with regard to a standard voice 24 with two peaks in power spectrum shape and input voices 25, 26 and 27 with energies equal to that of the standard voice 24 but with different peak positions in the power spectrum shapes. Specifically, FIGS. 41 and 42 show the following state. A standard pattern matrix 24A of seven rows and nine columns with the power spectrum of the standard voice 24 as a component is previously registered. And, each of input pattern matrices 25A, 26A and 27A of seven rows and nine columns with a power spectrum of each of the input voices 25, 26 and 27 as a component is prepared. Then, as the measure of the similarity between the standard pattern matrix and each of the input pattern matrices, the Euclid distance or a cosine of the angle indicated by e25, e26 or e27 is calculated.
Here, it is assumed that the standard voice 24 and each of the input voices 25, 26 and 27 have relations ω and φ shown in FIG. 42 with regard to a parameter β. Namely, in the relations shown in FIG. 42, the parameter β prescribes a change of the power spectrum shape of each of the input voices 25, 26 and 27 from the power spectrum shape of the standard voice 24.
However, when the Euclid distance or the angle is used as the measure of the similarity, the calculated values of the Euclid distances or the angles from the standard voice can turn out to be equal for a plurality of input voices whose power spectrum shapes differ from one another. In such a case, it is impossible to distinguish input voices with features different from one another, and the similarity of the voices is detected imprecisely. A detailed description follows.
As the first example, FIG. 43 shows changes of the calculated values e21, e22 and e23 of the Euclid distances when the value of the parameter α in FIG. 40 is increased from 0 to 1. FIG. 44 shows changes of the calculated values e21, e22 and e23 of the cosines of the angles when the value of the parameter α in FIG. 40 is increased from 0 to 1 similarly.
With reference to FIGS. 43 and 44, it is understood that, in this example, the calculated values e21, e22 and e23 of the Euclid distances, and likewise those of the cosines of the angles, are always equal to one another (e21=e22=e23). It is also understood that, as the parameter α increases, the values e21, e22 and e23 of the Euclid distances increase and the values e21, e22 and e23 of the cosines of the angles decrease. Such a decrease in the cosines of the angles means an increase in the angles.
Generally, the power spectrum shape of white noise is flat, and the power spectrum shape of a fricative consonant /s/ in voice is nearly flat in many cases. Note that, although the power spectrum shape of the fricative consonant /s/ is nearly flat, a phenomenon of a “sway of spectrum intensity”, in which the power spectrum shape changes slightly over time, is also observed.
In FIGS. 39 and 40, it is assumed that the input voices 21 and 22 are fricative consonants /s/ with the “sway of spectrum intensity” and the input voice 23 is a voice different from the fricative consonant /s/ in a case where the parameter α is small.
As understood with reference to FIGS. 43 and 44, when the values of the parameter α prescribing the input voice are equal in the three input voices 21, 22 and 23, the values of the Euclid distances or the angles from the standard voice 20 are equal in the three input voices 21, 22 and 23. Therefore, when the values for the three input voices 21, 22 and 23 are compared with an arbitrarily set acceptable value, either all three input voices are judged to be the standard voice or, conversely, all three are judged not to be the standard voice. Consequently, it is impossible to distinguish the three input voices 21, 22 and 23 from one another.
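The following sketch illustrates this limitation under stated assumptions. It does not reproduce the relations γ to θ of FIG. 40, which are not given in the text; instead, each hypothetical input moves an amount alpha of power from one component of a flat matrix of seven rows and nine columns to another, at different positions, so that the total energy stays equal to that of the standard.

```python
import math

ROWS, COLS = 7, 9  # matrix size used in the first example

def flat_matrix(value=1.0):
    """Flat power spectrum: every component has the same value."""
    return [[value] * COLS for _ in range(ROWS)]

def perturbed(alpha, plus, minus):
    """Hypothetical input: move alpha from component `minus` to component `plus`."""
    m = flat_matrix()
    m[plus[0]][plus[1]] += alpha
    m[minus[0]][minus[1]] -= alpha
    return m

def euclid_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))

def cosine_of_angle(a, b):
    dot = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    norm_a = math.sqrt(sum(x * x for row in a for x in row))
    norm_b = math.sqrt(sum(y * y for row in b for y in row))
    return dot / (norm_a * norm_b)

standard = flat_matrix()
alpha = 0.5  # assumed value of the shape-change parameter
inputs = [
    perturbed(alpha, (0, 0), (0, 1)),  # change between adjacent components
    perturbed(alpha, (3, 4), (3, 5)),  # same change at another position
    perturbed(alpha, (0, 0), (6, 8)),  # change between distant components
]

energies = [sum(x for row in m for x in row) for m in inputs]
distances = [euclid_distance(standard, m) for m in inputs]
cosines = [cosine_of_angle(standard, m) for m in inputs]
print(energies, distances, cosines)
```

Although the three hypothetical inputs differ in where the spectrum shape changes, both measures return the same value for each of them, so a comparison with a single acceptable value cannot separate them.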
As the second example, FIG. 45 shows changes of the calculated values e25, e26 and e27 of the Euclid distances when the value of the parameter β in FIG. 42 is increased from 0 to 1. FIG. 46 shows changes of the calculated values e25, e26 and e27 of the cosines of the angles when the value of the parameter β in FIG. 42 is increased from 0 to 1 similarly.
With reference to FIGS. 45 and 46, it is understood that, in this example, the calculated values e25, e26 and e27 of the Euclid distances, and likewise those of the cosines of the angles, are always equal to one another (e25=e26=e27). It is also understood that, as the parameter β increases, the values e25, e26 and e27 of the Euclid distances increase and the values e25, e26 and e27 of the cosines of the angles decrease. Such a decrease in the cosines of the angles means an increase in the angles.
Generally, a plurality of peaks referred to as formants are observed in the power spectrum shape of the voice. With regard to the formants of the voice, a “shift of frequency” phenomenon, in which a peak frequency of the power spectrum shape is slightly shifted, or a “shift of time” phenomenon, in which a peak position is slightly shifted over time, is also observed even in the same voice.
Then, in FIGS. 41 and 42, it is assumed that the input voice 25 is the same as the standard voice 24, in which the “shift of frequency” or “shift of time” occurs in the peak, and that the input voices 26 and 27 are voices different from the standard voice 24.
As understood from FIGS. 45 and 46, when the values of the parameter β prescribing the standard voice and the input voices are equal to one another in the standard voice 24 and the three input voices 25, 26 and 27, the values of the Euclid distances or the angles from the standard voice 24 are equal in the three input voices 25, 26 and 27. Therefore, when the values for the three input voices 25, 26 and 27 are compared with an arbitrarily set acceptable value, either all three input voices are judged to be the standard voice or, conversely, all three are judged not to be the standard voice. Consequently, it is impossible to distinguish the three input voices 25, 26 and 27 from one another.
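A simplified illustration of the peak-shift case follows. It assumes idealized single-component peaks rather than the relations ω and φ of FIG. 42, which are not given in the text; the peak positions and the helper names are hypothetical.

```python
import math

ROWS, COLS = 7, 9
PEAK_ROW = 3  # assumed row holding both peaks

def two_peak_matrix(first_col, second_col, height=1.0):
    """Hypothetical power spectrum that is zero except for two peaks in one row."""
    m = [[0.0] * COLS for _ in range(ROWS)]
    m[PEAK_ROW][first_col] = height
    m[PEAK_ROW][second_col] = height
    return m

def euclid_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))

def cosine_of_angle(a, b):
    dot = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    norm_a = math.sqrt(sum(x * x for row in a for x in row))
    norm_b = math.sqrt(sum(y * y for row in b for y in row))
    return dot / (norm_a * norm_b)

standard = two_peak_matrix(2, 6)
shifted_inputs = {
    "peaks shifted by +1 column": two_peak_matrix(3, 7),
    "peaks shifted by +2 columns": two_peak_matrix(4, 8),
    "peaks shifted by -2 columns": two_peak_matrix(0, 4),
}

shift_distances = {name: euclid_distance(standard, m) for name, m in shifted_inputs.items()}
shift_cosines = {name: cosine_of_angle(standard, m) for name, m in shifted_inputs.items()}
for name in shifted_inputs:
    print(name, shift_distances[name], shift_cosines[name])
```

In this idealized case, a slight shift and a larger shift of the peaks give the same Euclid distance and the same cosine, because neither measure takes into account how far a peak has moved once the peaks no longer overlap.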
As described above, in the conventional method for detecting a similarity between voices, the similarity between the voices cannot be precisely detected, thus causing a problem that a sufficiently satisfactory precision cannot be obtained in recognizing the voice.
The reason is that, in the conventional method for detecting a similarity between voices, the value of the Euclid distance or the angle between the two pattern matrices is used as the measure of the similarity, so that a difference between the shape formed by the standard pattern matrix and the shape formed by the input pattern matrix cannot be numerically evaluated as a geometric distance.
Meanwhile, in the case where the standard pattern matrix with the power spectrum of the standard voice as a component is previously registered, a method is conceivable in which individual standard voices having the “sway of spectrum intensity”, the “shift of frequency” and the “shift of time” are previously registered as a large number of standard pattern matrices. However, since the number of standard pattern matrices that can be registered is limited by factors such as the storage capacity or the processing time of a computer, there are limitations in distinguishing, by use of this method, a standard voice having the “sway of spectrum intensity”, the “shift of frequency” or the “shift of time” from a voice different from the standard voice.
Moreover, the gazette of Japanese Patent Laid-Open No. Hei 10 (1998)-253444 (Japanese Patent Application No. Hei 9(1997)-61007, Title of the Invention: Method for Detecting Abnormal Sound, Method for Judging Abnormality in Machine by Use of the Detected Value, Method for Detecting Similarity Between Vibration Wave and Method for Recognizing Voice by Use of the Detected Value) describes a method for calculating a value of a geometric distance between a standard pattern vector (one-dimensional) and an input pattern vector (one-dimensional). However, it does not describe a method for calculating a value of a geometric distance between a standard pattern matrix (two-dimensional) and an input pattern matrix (two-dimensional) or a method for calculating a value of a geometric distance between a standard pattern matrix layer (three-dimensional) and an input pattern matrix layer (three-dimensional).
The present invention was made in order to solve the foregoing problems. A first object of the present invention is to provide a method for detecting a similarity between voices, which is capable of obtaining a precise value of a geometric distance between two pattern matrices that are a standard pattern matrix and an input pattern matrix. A second object of the present invention is to provide a method capable of recognizing a voice based on a detected value of the similarity between the voices with high precision.
A third object of the present invention is to provide a method for detecting a similarity between vibration waves, which is capable of obtaining a precise value of a geometric distance between two pattern matrices that are a standard pattern matrix and an input pattern matrix. A fourth object of the present invention is to provide a judgement method for judging an abnormality in a machine based on a detected value of the similarity between the vibration waves with high precision.
A fifth object of the present invention is to provide a method for detecting a similarity between images, which is capable of obtaining a precise value of a geometric distance between two pattern matrices that are a standard pattern matrix and an input pattern matrix. A sixth object of the present invention is to provide a method capable of recognizing an image based on a detected value of the similarity between the images with high precision.
A seventh object of the present invention is to provide a method for detecting a similarity between solids, which is capable of obtaining a precise value of a geometric distance between two pattern matrix layers that are a standard pattern matrix layer and an input pattern matrix layer. An eighth object of the present invention is to provide a method capable of recognizing a solid based on a detected value of the similarity between the solids with high precision.
A ninth object of the present invention is to provide a method for detecting a similarity between moving pictures, which is capable of obtaining a precise value of a geometric distance between two pattern matrix layers that are a standard pattern matrix layer and an input pattern matrix layer. A tenth object of the present invention is to provide a method capable of recognizing a moving picture based on a detected value of the similarity between the moving pictures with high precision.
Note that the present invention was made by two-dimensionally extending the method for calculating a value of a geometric distance described in the gazette of Japanese Patent Laid-Open No. Hei 10 (1998)-253444 (Japanese Patent Application No. Hei 9 (1997)-61007) so as to be applicable to voice recognition, judgment of an abnormality in a machine and image recognition, and further by three-dimensionally extending the method so as to be applicable to solid recognition and moving picture recognition.