1. Field of the Invention
The present invention relates to a pattern recognition device that recognizes recognition data, which are inputted as the sounds of words and phrases or the images of characters, as one of a plurality of items of standard data prepared in advance, and particularly to a pattern recognition device in which standard elements are represented as tree structures.
2. Description of the Related Art
In the prior art, pattern recognition devices have been employed in speech recognition of words and phrases or in pictorial pattern recognition of characters. Among the various methods that have been proposed for speech recognition is one in which words are recognized in phoneme units.
A prior-art example of this type of pattern recognition device is first described with reference to FIGS. 1 to 6.
The pattern recognition device 1 here described is formed as a single chip, and as shown in FIG. 2, is provided with a CPU (Central Processing Unit) 101, which is a microcomputer.
This CPU 101 is connected by bus line 102 to ROM (Read Only Memory) 103, RAM (Random Access Memory) 104, and I/F (Interface) 105.
CPU 101 can realize a variety of operations in accordance with various programs, and the control programs necessary for processing by CPU 101 are stored as software in ROM 103, which is an information storage medium. Storage areas for temporary storage of processing data of CPU 101 are formed in RAM 104, and I/F 105 effects the input and output of each type of data.
CPU 101 reads each type of program and executes each type of operation as described hereinabove, whereby pattern recognition device 1 as shown in FIG. 1 is logically provided with data storage section 11, element storage section 12, data input section 13, data dividing section 14, distance calculation section 15, distance storage section 16, distance accumulation section 17, and result output section 18.
Data storage section 11 is made up of prescribed storage areas formed in advance in ROM 103, and catalogues in advance a plurality of standard data items made up of a plurality of consecutive standard elements. As shown in FIG. 3, words of a natural language are catalogued as standard data in data storage section 11, and the plurality of consecutive standard elements are established as all phonemes.
Words of this type are catalogued in the graphic character code of the plurality of consecutive phonemes, and position data that indicate the storage positions in element storage section 12 are added to each unit of the graphic character code.
This element storage section 12 is also constituted by storage areas of ROM 103, and as shown in FIG. 4, all phonemes are catalogued in advance in element storage section 12 as the above-described plurality of standard elements. These standard elements also take the character code as identification information, and as shown in FIG. 5, the speech signals of these phonemes are each set individually.
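The relationship between data storage section 11 and element storage section 12 may be pictured with the following minimal Python sketch. It is an illustration only: the names `ELEMENTS`, `ELEMENT_POSITION`, `catalogue`, and `WORDS` are hypothetical, the phoneme list is a small assumed subset, and a list index stands in for the position data.

```python
# Element storage (section 12): each phoneme is catalogued once.
# The list index plays the role of the "position data" (an assumption).
ELEMENTS = ["a", "i", "u", "o", "ha", "yo", "ko", "n", "ni", "ti", "wa"]
ELEMENT_POSITION = {code: pos for pos, code in enumerate(ELEMENTS)}


def catalogue(phonemes):
    """Build one word entry as (character code, storage position) pairs."""
    return [(code, ELEMENT_POSITION[code]) for code in phonemes]


# Data storage (section 11): words catalogued as consecutive phonemes,
# each unit carrying the position of its standard element.
WORDS = {
    "ohayou": catalogue(["o", "ha", "yo", "u"]),
    "konnitiwa": catalogue(["ko", "n", "ni", "ti", "wa"]),
}
```

In this sketch a word entry never holds a speech signal itself; it only points to the standard elements that do.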
Data input section 13 accepts the input of each type of recognition data made up of a plurality of consecutive recognition elements by having CPU 101 store the input data of I/F 105 in prescribed areas of RAM 104 in accordance with the program registered in ROM 103. The recognition elements in this case are also made up from phonemes, and the recognition data are made up of speech signals of words.
Data dividing section 14 divides inputted recognition data into prescribed frames and sequentially generates a plurality of consecutive recognition elements by having CPU 101 execute prescribed data processing in accordance with a program registered in ROM 103.
Distance calculation section 15 also individually calculates the distance of similarity of the plurality of standard elements with respect to each of the plurality of consecutive recognition elements by the execution of prescribed data processing by CPU 101 in accordance with a program registered in ROM 103.
Distance storage section 16 is made up of prescribed storage areas established in advance in RAM 104 and individually temporarily stores the distances of calculated standard elements at prescribed positions of the storage areas.
Through the execution of prescribed data processing by CPU 101 in accordance with a program registered in ROM 103, distance accumulation section 17 reads out all of the standard data stored in data storage section 11, sequentially reads the distances of the plurality of standard elements making up this standard data from distance storage section 16 and accumulates the distances, and individually calculates the distances of a plurality of items of standard data with respect to one item of recognition data.
Through the execution of prescribed data processing by CPU 101 in accordance with a program registered in ROM 103, result output section 18 selectively outputs from I/F 105 the standard data for which the accumulated distances are a minimum as the recognition result.
A pattern recognition device 1 configured according to the foregoing description can recognize the speech signals of a word, which are inputted from the outside as recognition data, as one word which is standard data catalogued in advance.
As an actual example, the speech recognition of Japanese words is explained hereinbelow. Since hiragana are used as phonetic symbols in Japanese, speech recognition is carried out based on hiragana. In the following explanation, words in italics surrounded by quotation marks represent Japanese hiragana.
Japanese hiragana can be arranged as a table of the syllabary made up of five rows and ten columns in which approximately fifty sounds are produced by the combinations of vowel sounds and consonant sounds. As shown in Table 1 below, in the syllabary, five horizontal rows correspond to the five vowel sounds "a, i, u, e, o," and ten vertical columns correspond to the ten vowel and consonant sounds "a, k, s, t, n, h, m, y, r, w."
TABLE 1
______________________________________
wa  ra  ya  ma  ha  na  ta  sa  ka  a
    ri      mi  hi  ni  ti  si  ki  i
    ru  yu  mu  hu  nu  tu  su  ku  u
    re      me  he  ne  te  se  ke  e
    ro  yo  mo  ho  no  to  so  ko  o
______________________________________
In addition, Japanese also includes voiced consonant sounds such as "ga" and "gi," "p-sounds" such as "pa" and "pi," contracted sounds such as "kya" and "kyu," double-consonant sounds such as "a-" and "i-," and the syllabic nasal sound "n." Japanese is therefore not made up of exactly 50 sounds, but this presents no problem because, as with the Western alphabet, the various sounds including the voiced consonant sounds can be represented by hiragana.
As shown in FIG. 6, when pattern recognition device 1 is used to recognize the speech signals of Japanese words, words made up of a plurality of consecutive phonemes are inputted as recognition data to data input section 13 (Step S1).
These recognition data are divided into prescribed frames by data dividing section 14, and a plurality of consecutive recognition elements are generated by mel-cepstrum analysis (Step S2).
In simple terms, in a case in which the inputted recognition data are, for example, "ohayou," the data are divided into the four recognition elements "o, ha, yo, u." Phonemes do not actually have a one-to-one correspondence with frames and one phoneme is generally divided into a multiplicity of recognition elements, but explanation is here simplified as described above.
Distances of similarity of a plurality of standard elements with respect to each of a plurality of recognition elements successively generated as described hereinabove are calculated by distance calculation section 15, and each of the distances of standard elements thus calculated is temporarily stored by distance storage section 16 (Steps S3-S5). The distances of all standard elements are thus detected by frame for each of the plurality of consecutive recognition elements of the recognition data.
In a case in which the recognition elements of recognition data are the four phonemes "o, ha, yo, u" as described hereinabove, the distances of the standard elements of all phonemes "a, i, u, ... n" with respect to each of these four phonemes are calculated, and the distances of all of these standard elements are stored by the frame of the recognition data. Calculation of the distance of the standard element "n" is omitted only for the first frame because this phoneme never occurs at the beginning of a word in Japanese.
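The frame-by-frame calculation and storage of Steps S3-S5 may be sketched as follows. This is a simplified illustration under stated assumptions: the Euclidean distance, the two-dimensional feature vectors, and the names `euclidean`, `distance_table`, `templates`, and `frames` are all hypothetical and are not taken from the device described here.

```python
import math


def euclidean(x, y):
    """Assumed distance measure between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_table(frames, templates, never_initial=("n",)):
    """Return table[t][code]: distance of standard element `code` to frame t."""
    table = []
    for t, frame in enumerate(frames):
        row = {}
        for code, template in templates.items():
            if t == 0 and code in never_initial:
                continue  # "n" is skipped for the first frame only
            row[code] = euclidean(frame, template)
        table.append(row)  # stored by frame, as in distance storage section 16
    return table


# Hypothetical feature vectors for three standard elements and two frames.
templates = {"o": (0.0, 0.0), "ha": (3.0, 4.0), "n": (1.0, 0.0)}
frames = [(0.0, 0.0), (3.0, 4.0)]
table = distance_table(frames, templates)
```

Each row of `table` holds one distance per standard element, so the cost grows with the number of frames multiplied by the number of standard elements.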
All of the standard data stored in data storage section 11 are read out by distance accumulation section 17, the distances of the plurality of standard elements that form these standard data are successively read out from distance storage section 16 and accumulated, and the distances of the plurality of standard data with respect to one item of recognition data are each calculated (Step S6).
If "ohayou" is catalogued as a standard data word, the distance of "ohayou" is calculated as the accumulation of the distance of the first frame made up by the standard element "o," the distance of the second frame made up by the standard element "ha," the distance of the third frame made up by the standard element "yo," and the distance of the fourth frame made up by the standard element "u." Still, as explained hereinabove, recognition data are actually divided into a multiplicity of recognition elements and these recognition elements do not bear a one-to-one correspondence with standard elements.
Here, dynamic programming is employed when the distances of words which are standard data are accumulated as described hereinabove. In such a case, distances are calculated for a plurality of combinations in which the plurality of standard elements of standard data are placed in a variety of correspondences with the large number of recognition elements of recognition data, and the minimum distance is selected as the distance of the standard data with respect to the recognition data.
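The accumulation by dynamic programming may be sketched as below. The sketch assumes a simple alignment rule that the text does not specify: each frame is assigned to exactly one standard element, the elements are visited in order, and each element covers at least one frame. The function name `accumulate` and the sample distances are hypothetical.

```python
def accumulate(table, word):
    """Minimum accumulated distance of `word` over all in-order alignments.

    table[t][code] is the stored distance of standard element `code`
    for frame t; `word` is the sequence of standard-element codes.
    """
    inf = float("inf")
    n_elems = len(word)
    # best[j]: minimum accumulated distance with the current frame on word[j]
    best = [inf] * n_elems
    best[0] = table[0].get(word[0], inf)
    for row in table[1:]:
        new = [inf] * n_elems
        for j in range(n_elems):
            stay = best[j]                          # same element as last frame
            advance = best[j - 1] if j > 0 else inf  # move on to next element
            prev = min(stay, advance)
            if prev < inf:
                new[j] = prev + row.get(word[j], inf)
        best = new
    return best[-1]  # the alignment must end on the last element


# Hypothetical stored distances for three frames.
sample = [
    {"o": 1, "ha": 9},
    {"o": 2, "ha": 8},
    {"o": 7, "ha": 3},
]
distance = accumulate(sample, ["o", "ha"])  # best alignment: o, o, ha
```

In this sample the alignment "o, o, ha" accumulates 1 + 2 + 3 = 6, whereas "o, ha, ha" accumulates 1 + 8 + 3 = 12, so the former is selected as the distance of the word.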
Since the distances of a plurality of standard data are thus calculated for one item of recognition data, the standard data having the minimum distance is selectively outputted as the recognition result by result output section 18 (Step S7).
For example, if the distance of the standard data "konnitiwa" is 450 and the distance of the standard data "ohayou" is 120 with respect to the recognition data "ohayou," "ohayou" is outputted as the recognition result.
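The selection of Step S7 amounts to taking the minimum over the accumulated distances. A minimal sketch using the two distances of the example above follows; the names `recognize` and `accumulated` are hypothetical.

```python
def recognize(accumulated):
    """Return the standard data item whose accumulated distance is smallest."""
    return min(accumulated, key=accumulated.get)


# Accumulated distances with respect to the recognition data "ohayou".
accumulated = {"konnitiwa": 450, "ohayou": 120}
result = recognize(accumulated)
```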
The above-described pattern recognition device 1 can recognize inputted recognition data as one item of standard data catalogued in advance, but the calculation of the distances of the plurality of standard elements with respect to each of a plurality of consecutive recognition elements is a huge processing burden and is extremely time-consuming.
A pattern recognition device to solve these problems is disclosed in U.S. Pat. No. 5,912,989. As shown in FIG. 7, in this pattern recognition device, tree structure data are established that take standard elements as nodes, and those data are employed to simplify the calculation of distances of standard elements.
In more detail, tree structure data are formed in a tree structure in which one root node is linked by a small number of parent nodes to a large number of subordinate nodes that individually correspond to all of the standard elements, these parent nodes bearing an average correspondence to a plurality of mutually similar standard elements.
In more specific terms, although the subordinate nodes correspond to all phonemes "a, i, u, ... n," the parent nodes correspond only to the five phonemes "a, i, u, e, o" of the "a" column.
In this pattern recognition device, distances of similarity of the plurality of tree structure data parent nodes are individually calculated for recognition data that have been divided by frame into recognition elements. Next, parent nodes having small calculated distances are selected, following which the distances of subordinate nodes with respect to the recognition elements are calculated only for the subordinate nodes of the selected parent nodes while the distances of parent nodes are appropriated as the distances of subordinate nodes for which distances are not calculated.
For example, if the distances of parent nodes "a, i, u, e, o" are calculated for a case in which the recognition element is the phoneme "ka," only the distance for "a" is small and the distances for "i-o" are larger. In this case, the distances of the subordinate nodes of "a," i.e., "a-wa," are calculated with respect to "ka", but the distances of the subordinate nodes of "i," i.e., "i-ri," are not calculated, and the distance of the parent node "i" is appropriated as the distance of these subordinate nodes.
After the distances for all of the subordinate nodes have been detected as described hereinabove, they are stored as the distances of all standard elements with respect to the recognition element.
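The per-frame procedure using parent nodes may be sketched as follows. This is an illustration under assumptions and not the method of the cited disclosure itself: features are reduced to single numbers, the distance is an absolute difference, only the single closest parent node is selected, and the names `dist`, `frame_distances`, and `tree` are hypothetical.

```python
def dist(x, y):
    """Assumed distance measure on one-dimensional features."""
    return abs(x - y)


def frame_distances(frame, tree, keep=1):
    """Return (distances of all subordinate nodes, number actually calculated).

    tree maps a parent code to (parent template, {child code: child template}).
    """
    parent_d = {p: dist(frame, tmpl) for p, (tmpl, _) in tree.items()}
    selected = sorted(parent_d, key=parent_d.get)[:keep]  # closest parents
    out, calculated = {}, 0
    for parent, (_, children) in tree.items():
        for child, tmpl in children.items():
            if parent in selected:
                out[child] = dist(frame, tmpl)  # calculated individually
                calculated += 1
            else:
                out[child] = parent_d[parent]   # parent distance appropriated
    return out, calculated


# Hypothetical templates; each parent is the average of its children.
tree = {
    "a": (1.0, {"a": 0.5, "ka": 1.0, "sa": 1.5}),
    "i": (5.0, {"i": 4.5, "ki": 5.0, "si": 5.5}),
}
distances, calculated = frame_distances(1.0, tree)  # frame resembling "ka"
```

In this sketch only the three subordinate nodes of "a" are calculated, while "i," "ki," and "si" all receive the same parent distance, so the stored result still contains one entry per standard element.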
By carrying out this process frame by frame, the distances with respect to each of the consecutive recognition elements of recognition data are stored by frame for all standard elements, and the distances of a plurality of standard data with respect to one item of recognition data can therefore be calculated by appropriately reading and accumulating these distances.
As described hereinabove, the pattern recognition device of the above-described disclosure enables an increase in speed and a reduction in the processing load without reducing recognition accuracy by categorizing all standard elements according to similarity and placing them in correspondence with a small number of parent nodes, and then using distances between these parent nodes and recognition elements to bypass the calculation of distances with respect to subordinate nodes not likely to be taken as recognition results.
However, even in the above-described pattern recognition device, the phonemes that constitute the standard elements have a one-to-one correspondence with the subordinate nodes of the tree structure data, and as a result, the calculation of the distances of standard data still requires the distances of subordinate nodes for which calculation should not be necessary.
In this case, the distances of parent nodes are duplicated as the distances of subordinate nodes for which calculation has been omitted, and identical information is therefore repeatedly stored in RAM, thereby preventing reduction of the storage capacity of RAM and complicating the miniaturization of the pattern recognition device.