A number of systems have been developed for recognizing a voice or a word. Many of them are based on so-called pattern matching, in which the voices to be used are registered in advance and an unknown input voice is recognized by determining which of the registered voices it most closely resembles. Pattern matching is widely used because it requires fewer calculations and achieves a higher recognition rate than other methods, such as those using a discrimination function.
FIG. 1 is a structural drawing for describing one example of the above-described pattern matching method. In the drawing, 1 is a sound collecting device, such as a microphone, 2 is a filter bank, 3 is a dictionary, 4 is a local peak detecting unit, 5 is a degree of similarity calculating unit and 6 is a recognition result output unit. As is well known, a voice obtained through a sound collecting device, such as a microphone, is converted into a feature quantity, such as a frequency spectrum, which is used to form a feature pattern for pattern matching. Since one spectrum value is ordinarily represented by 8 to 12 bits, if m samples are taken along the frequency axis, one time sample (1 frame) is represented by 8×m to 12×m bits. In general, since one time sample is taken about every 10 milliseconds, a pattern of n frames has 8×m×n to 12×m×n bits. With one pattern defined by a_11, a_21, ..., a_m1, ..., a_mn and the other pattern defined by b_11, b_21, ..., b_m1, ..., b_mn, the following distance is used to represent the difference between the two patterns for pattern matching. ##EQU1## That is, according to this method, the comparison with one pattern requires a calculation on 8 to 12 bit values to be executed i×j times. Moreover, the above-described example applies only when the two patterns to be compared have the same time length; since the time length of a voice changes with every utterance, even more calculations are required to equalize the time lengths.
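As a rough illustration of the cost described above, the following Python sketch computes the storage size of a multi-bit pattern and a distance between two patterns of equal size. The distance used here, a sum of absolute differences over all elements, is an assumption chosen only for illustration and is not necessarily the distance of the equation above; the function names and the example values of m and n are likewise hypothetical.

```python
def pattern_bits(m, n, bits_per_value):
    """Storage needed for a pattern of n frames with m frequency samples each."""
    return bits_per_value * m * n


def abs_difference_distance(a, b):
    """Sum of absolute differences between two patterns of equal size.

    a and b are lists of n frames, each frame a list of m spectrum values.
    This particular distance is an assumption made for illustration.
    """
    if len(a) != len(b) or any(len(fa) != len(fb) for fa, fb in zip(a, b)):
        raise ValueError("patterns must have the same number of frames and channels")
    # One multi-bit subtraction per element, i.e. m x n operations in total.
    return sum(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))


# Hypothetical example: m = 16 frequency samples, n = 50 frames
# (about 0.5 s of speech at roughly 10 ms per frame).
low = pattern_bits(16, 50, 8)    # 6400 bits at 8 bits per value
high = pattern_bits(16, 50, 12)  # 9600 bits at 12 bits per value
```

The sketch makes the two costs visible: memory grows with m×n×(bits per value), and one comparison touches every element once, before any time-length equalization is taken into account.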
One pattern matching method that requires less data and can be executed with simple calculations, using a BTSP (Binary Time-Spectrum Pattern), has been presented (Lecture Papers of Japan Society of Acoustics, p. 195, Autumn, 1983).
FIG. 2 is a structural drawing for describing one example of the above-described BTSP method. In the drawing, 11 is a sound collecting device, such as a microphone, 12 is a filter bank, 13 is a correcting unit using the least squares method, 14 is a binary converting unit, 15 is a BTSP forming unit, 16 is a unit for adding patterns pronounced n times after linear expansion and contraction, 17 is a dictionary, 18 is a peak pattern forming unit, 19 is a unit for matching pattern lengths by linear expansion and contraction, 20 is a degree of similarity calculating unit and 21 is a result displaying unit. A voice input through the microphone is subjected to frequency analysis using a bandpass filter bank or the like, whereby the frequency content and its temporal variation are represented as a time-spectrum pattern (TSP). This pattern is then converted into a BTSP by binary conversion, with each peak in frequency set to "1" and the rest set to "0", and the BTSPs obtained from a plurality of pronunciations are superimposed and registered as a standard pattern. When an unknown voice is input, a BTSP is formed from it through the same process used to form a standard pattern, and it is compared with the previously registered standard patterns to determine its degree of similarity with each of them. The degree of similarity is obtained from the extent to which the elements "1" overlap when the BTSP of the unknown voice is superimposed on the standard pattern. Typically, a voice recognition apparatus for indefinite speakers, i.e., one capable of recognizing anybody's voice, relies on means that increase the amount of calculations, e.g., forming a plurality of standard patterns for each voice to be registered. With this method, however, if the standard pattern is formed well, a voice recognition apparatus for indefinite speakers can be realized easily without greatly increasing the amount of calculations.
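The formation of a BTSP and of a standard pattern described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the local peak rule (a value greater than its lower neighbor and not smaller than its upper neighbor) is hypothetical, the least squares correction step is omitted, and the BTSPs being superimposed are assumed to have already been brought to the same length. All function names are illustrative.

```python
def binarize_frame(spectrum):
    """Mark local peaks along the frequency axis as 1 and everything else as 0.

    The peak rule used here is an assumption made for illustration.
    """
    m = len(spectrum)
    bits = []
    for k, value in enumerate(spectrum):
        left = spectrum[k - 1] if k > 0 else float("-inf")
        right = spectrum[k + 1] if k < m - 1 else float("-inf")
        bits.append(1 if value > left and value >= right else 0)
    return bits


def make_btsp(tsp):
    """Convert a time-spectrum pattern (list of frames) into a binary pattern."""
    return [binarize_frame(frame) for frame in tsp]


def superimpose(btsps):
    """Add equal-length BTSPs element by element to form a standard pattern."""
    n_frames = len(btsps[0])
    n_channels = len(btsps[0][0])
    if any(len(p) != n_frames for p in btsps):
        raise ValueError("all BTSPs must have the same number of frames")
    return [
        [sum(p[t][k] for p in btsps) for k in range(n_channels)]
        for t in range(n_frames)
    ]
```

Each element of the resulting standard pattern counts how many of the registered pronunciations had a peak at that time and frequency, so its maximum equals the number of pronunciations superimposed.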
The degree of similarity S of two patterns defined by this method is expressed as follows. ##EQU2## Each of the elements a and b is either 1 or 0, or a sum of such values, and can therefore be represented without allotting a large number of bits. However, since each element is commonly given a full unit of computer calculation (4, 8, 16, . . . bits), capacity is wasted in a method that could otherwise be realized with the least amount of calculations and the least amount of memory.
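As an illustration of a similarity based on overlapping "1" elements, the following Python sketch sums the element-wise products of a binary input pattern and a standard pattern, and selects the registered word with the highest score. The exact form of S is given by the equation above; the unnormalized sum of products used here is an assumption, as are the function names and the dictionary layout (a mapping from word to standard pattern).

```python
def overlap_similarity(input_btsp, standard):
    """Sum of element-wise products of two patterns of equal size.

    input_btsp holds 0/1 values; standard holds counts from superimposition.
    This unnormalized form is an assumption made for illustration.
    """
    if len(input_btsp) != len(standard):
        raise ValueError("patterns must have the same number of frames")
    return sum(
        x * y
        for frame_in, frame_std in zip(input_btsp, standard)
        for x, y in zip(frame_in, frame_std)
    )


def recognize(input_btsp, dictionary):
    """Return the registered word whose standard pattern is most similar."""
    return max(
        dictionary,
        key=lambda word: overlap_similarity(input_btsp, dictionary[word]),
    )
```

Because the input elements are only 0 or 1, each product simply selects or discards the corresponding element of the standard pattern, which is why a full machine word per element is more than the calculation needs.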
As with the above-described method, in the field of voice recognition generally, a matching method with a shorter calculation time per pattern becomes necessary as the number of patterns to be compared increases. Comparison may be made against all of the patterns using such a matching method with fewer calculations, or several candidates for the correct answer may first be selected by such a simple method and a smaller number of patterns then compared in detail. As a matching method with a relatively small amount of calculations, a method using a binary converted time-frequency pattern has been proposed.
The apparatus of FIG. 2 performs recognition by linear matching between an input pattern and a dictionary pattern, each obtained from a voice pronounced with a word as a unit. Incidentally, FIG. 2 illustrates the definite speaker type, in which a voice is registered following the shaded path. In the case of voice recognition for indefinite speakers, the dictionary is instead newly formed as a superimposition of BTSPs.
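Linear matching requires the two patterns to have the same number of frames. A minimal Python sketch of linear expansion and contraction is given below; the index mapping (integer scaling of the frame index) and the function name are assumptions made for illustration, not the specific procedure of the apparatus of FIG. 2.

```python
def linear_stretch(pattern, target_len):
    """Expand or contract a pattern to target_len frames by linear index mapping.

    Output frame t is copied from input frame floor(t * n / target_len),
    where n is the original number of frames.
    """
    n = len(pattern)
    if n == 0 or target_len <= 0:
        raise ValueError("pattern must be non-empty and target_len positive")
    return [pattern[(t * n) // target_len] for t in range(target_len)]
```

For example, a 3-frame pattern stretched to 6 frames repeats each frame twice, while a 4-frame pattern contracted to 2 frames keeps every second frame.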
This method has a merit in that, if the filter bank is set at 16 channels, a binary converted result may be treated as 16-bit data. This set of 16 binary values is called a frame. When this 16-bit, 2-byte data is added three times to form a dictionary pattern (or reference pattern), the maximum value of one element is 3, so that each element must be represented by two bits.
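The bit counts stated above can be checked with a short Python sketch. It packs one 16-channel binary frame into a single 16-bit integer, adds three packed frames channel by channel, and reports how many bits the largest possible element needs. Placing channel 0 in the least significant bit is an assumption, and the function names are hypothetical.

```python
def pack_frame(bits):
    """Pack 16 binary channel values into one 16-bit integer (channel 0 = LSB)."""
    if len(bits) != 16:
        raise ValueError("a frame must have exactly 16 channels")
    word = 0
    for k, bit in enumerate(bits):
        if bit not in (0, 1):
            raise ValueError("channel values must be 0 or 1")
        word |= bit << k
    return word


def add_frames(words):
    """Add packed binary frames channel by channel, returning 16 counts."""
    return [sum((w >> k) & 1 for w in words) for k in range(16)]


def bits_needed(max_value):
    """Number of bits required to represent values from 0 to max_value."""
    return max(1, max_value.bit_length())


# Three superimposed frames give elements in the range 0..3,
# which require two bits each.
counts = add_frames([0xFFFF, 0xFFFF, 0xFFFF])
element_bits = bits_needed(max(counts))
```

A single binary frame therefore fits in one 16-bit word, whereas a dictionary frame built from three pronunciations needs 2 bits per element, or 32 bits for 16 channels.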
The present invention has been made to obviate the disadvantages of the prior art as described above, and has as its object to provide a voice recognition apparatus which allows pattern matching to be carried out at high speed, in particular with a minimum of calculations.
Another object of the present invention is to provide a simple method of calculating the degree of similarity between patterns, useful for voice recognition.
A further object of the present invention is to provide a pattern similarity calculating method useful for voice recognition, which allows high-speed processing and minimizes the amount of calculations.