An example of a speech recognition system is described in Non Patent Literature 1. FIG. 7 shows the configuration of the speech recognition system described in Non Patent Literature 1. A system 300 includes a speech input unit 301, a speech recognition unit 302, a recognition result unification unit 303, a recognition result selection unit 304, and a recognition result output unit 305. The system 300 operates as described below.
When speech to be recognized is input from the speech input unit 301, the speech recognition unit 302 performs recognition processing on the speech and outputs the result of the recognition. The speech recognition unit 302 includes N speech recognizers for performing speech recognition processing, and outputs N word strings as recognition results. The recognition result unification unit 303 unifies the N recognition result word strings and generates one word string network.
When generating the word string network, the recognition result unification unit 303 first aligns the N recognition result word strings so that they match each other as closely as possible. In each recognition result word string, a node is set at every word boundary, and each word is regarded as an arc. The word string network is a network in which the recognition result word strings thus aligned branch off or join.
The recognition result selection unit 304 selects an optimum word string path included in the word string network. The recognition result output unit 305 outputs the selected path as a final recognition result.
Operation of the system 300 will now be described with reference to FIG. 8 by taking the case where the number of speech recognizers included in the speech recognition unit 302 is three as an example. The speech recognition unit 302 outputs recognition result word strings of three systems (recognition results #1 to #3) for the input speech by using the three speech recognizers as shown in FIG. 8(A). In FIG. 8, each of a, b, c, . . . represents a word. The recognition result unification unit 303 generates a word string network from the recognition result word strings of the three systems according to a procedure described in section 2.1 in Non Patent Literature 1.
The recognition result unification unit 303 aligns recognition result 1 and recognition result 2 with each other by performing DP matching on them, and regards each word as an arc. As a result, a word string network based on the recognition result 1 and the recognition result 2 is generated as shown in FIG. 8(B-1). “φ” represents an empty word. In the illustrated example, the word “b” and the word “d” appear in both the recognition result 1 and the recognition result 2. Therefore, the recognition result 1 and the recognition result 2 are aligned so that the word “b” and the word “d” in one result match the same words in the other.
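The pairwise DP matching described above can be illustrated with a minimal sketch. It assumes a plain edit-distance alignment with unit costs; the cost function actually used in Non Patent Literature 1 is not reproduced here, and the function name `align` and the example word strings are hypothetical (they are not the strings drawn in FIG. 8).

```python
EMPTY = "φ"  # empty-word marker, following the notation of FIG. 8


def align(ref, hyp):
    """Align two word lists by edit-distance DP (assumed unit costs).

    Returns a list of (ref_word, hyp_word) pairs; EMPTY fills a gap.
    """
    n, m = len(ref), len(hyp)
    # cost[i][j]: minimum edit cost of aligning ref[:i] with hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the end, preferring a diagonal (word-to-word) step.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], EMPTY))  # word only in ref
            i -= 1
        else:
            pairs.append((EMPTY, hyp[j - 1]))  # word only in hyp
            j -= 1
    pairs.reverse()
    return pairs


# Hypothetical recognition results sharing the words "b" and "d".
pairs = align(["a", "b", "c", "d"], ["b", "e", "d"])
```

In this sketch each returned pair corresponds to one set of parallel arcs between two adjacent nodes, and the shared words "b" and "d" end up in the same position.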
Next, DP matching of a recognition result 3 is performed against the word string network based on the recognition result 1 and the recognition result 2. As a result, the word string network is expanded as shown in FIG. 8(B-2). When the number of speech recognizers is three or more, the word string network can thus be expanded successively by repeating the above-described procedure for each additional recognition result.
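The successive expansion can be sketched in the same spirit. The sketch assumes the network is stored as a list of slots (one list of candidate words per pair of adjacent nodes) and that a new word matches a slot at zero cost when it equals any word already in that slot; both the representation and the cost rule are assumptions made for illustration, and `extend_network` is a hypothetical name.

```python
EMPTY = "φ"  # empty-word marker


def extend_network(slots, words):
    """Align `words` to `slots` by DP and add one candidate arc per slot."""
    k = len(slots[0]) if slots else 0  # results merged so far
    n, m = len(slots), len(words)

    def sub(i, j):
        # Assumed cost: 0 if the word already occurs in the slot, else 1.
        return 0 if words[j - 1] in slots[i - 1] else 1

    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + sub(i, j),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)

    out = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub(i, j):
            out.append(slots[i - 1] + [words[j - 1]])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out.append(slots[i - 1] + [EMPTY])      # new result skips this slot
            i -= 1
        else:
            out.append([EMPTY] * k + [words[j - 1]])  # new slot for extra word
            j -= 1
    out.reverse()
    return out


# Hypothetical recognition results from three recognizers.
results = [["a", "b", "c", "d"], ["b", "e", "d"], ["b", "c", "d", "f"]]
network = []
for r in results:
    network = extend_network(network, r)
```

Starting from an empty network and calling the function once per recognition result reproduces the repeat-the-procedure pattern described above for any number of recognizers.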
The recognition result selection unit 304 selects an optimum word string path from the word string network obtained as described above by taking a majority decision over each set of word arcs sandwiched between adjacent nodes. As a result, a final recognition result as shown in FIG. 8(C) is output from the recognition result output unit 305. In selecting the optimum word string, the i-th optimum word w_i is determined according to the following [Math. 1], where S(w, i) is the number of times a word w appears in the set of i-th word candidate arcs. The optimum word string path is selected by determining w_i successively for i = 1, 2, . . .
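Before the formal statement in [Math. 1], the majority decision can be illustrated with a short sketch. It assumes the word string network is given as a list of slots, each holding one candidate word per recognizer, and that a winning empty word contributes nothing to the output; the tie-breaking rule (first candidate encountered wins) is also an assumption, since the text above does not specify one. The name `select_path` and the example network are hypothetical.

```python
from collections import Counter

EMPTY = "φ"  # empty-word marker


def select_path(slots):
    """Pick, for each slot i, the word w maximizing the count S(w, i)."""
    path = []
    for candidates in slots:
        counts = Counter(candidates)          # counts[w] plays the role of S(w, i)
        best, _ = counts.most_common(1)[0]    # ties: first-seen word wins
        if best != EMPTY:                     # assumed: empty word emits nothing
            path.append(best)
    return path


# Hypothetical network built from three recognition results.
network = [
    ["a", EMPTY, EMPTY],
    ["b", "b", "b"],
    ["c", "e", "c"],
    ["d", "d", "d"],
    [EMPTY, EMPTY, "f"],
]
final = select_path(network)
```

Here the word "a" and the word "f" each receive one vote against two votes for the empty word, so neither appears in the selected path.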
$$w_i = \underset{w}{\operatorname{argmax}}\; S(w, i) \qquad \text{[Math. 1]}$$

{Citation List}
{Non Patent Literature}
{NPL 1}: Jonathan G. Fiscus, “A Post-Processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction (ROVER),” Proc. IEEE ASRU Workshop, 1997, pp. 352-437
{NPL 2}: Steve Young et al., “The HTK Book (for HTK Version 3.3),” Chapter 3, Cambridge University (http://htk.eng.cam.ac.uk/), 2005, pp. 22-25
{NPL 3}: Nelson Morgan et al., “Speech Recognition Using On-Line Estimation of Speaking Rate,” Proc. Eurospeech, 1997
{NPL 4}: N. Minematsu, M. Sekiguchi, and K. Hirose, “Automatic estimation of one's age with his/her speech based upon acoustic modeling techniques of speakers,” Proc. ICASSP 2002, pp. I-137-140
{NPL 5}: ETSI ES 202 050 V1.1.1, “Speech processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithm,” 2002
{NPL 6}: Frank Wessel et al., “Confidence Measures for Large Vocabulary Continuous Speech Recognition,” IEEE Trans. on Speech and Audio Processing, Vol. 9, No. 3, March 2001