1. Field
The present invention relates to a method and apparatus for improving spontaneous speech recognition performance, and more particularly, to a method and apparatus for enhancing recognition performance for spontaneous speech having various speaking rates.
2. Discussion of Related Art
Generally, spontaneous speech exhibits a wide range of speaking rates. Accordingly, a voice recognizer trained on speech spoken at a typical rate shows reduced recognition performance on spontaneous speech. In order to cope with variation in speaking rate, the length of the speech may be adjusted to suit the acoustic model, either in the feature domain or in the signal domain.
For example, cepstrum length normalization operates in the feature domain, and Pitch Synchronous Overlap and Add (PSOLA)-based time scale modification operates in the signal domain. In either case, the speaking rate must first be measured in order to set the amount of change in the cepstrum length or the overlap factor of the PSOLA.
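As an illustration of adjusting length in the feature domain, the following is a minimal sketch that stretches or shrinks a sequence of cepstral frames by linear interpolation along time. The function name `rescale_feature_length`, the definition of `rate_ratio` as measured rate divided by reference rate, and the use of linear interpolation are all assumptions made for this example; they are not taken from the cepstrum length normalization method referred to above.

```python
import numpy as np

def rescale_feature_length(features, rate_ratio):
    """Stretch or shrink a (num_frames, num_coeffs) feature sequence in time.

    rate_ratio is a hypothetical quantity: measured rate / reference rate.
    A ratio above 1 (fast speech) lengthens the sequence; below 1 shortens it.
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    new_len = max(2, int(round(num_frames * rate_ratio)))
    old_t = np.linspace(0.0, 1.0, num_frames)
    new_t = np.linspace(0.0, 1.0, new_len)
    # Interpolate each coefficient track independently along the time axis.
    return np.stack(
        [np.interp(new_t, old_t, features[:, k]) for k in range(features.shape[1])],
        axis=1,
    )

# Example: 10 frames of 2 coefficients, spoken 1.5 times faster than the reference.
cepstra = np.arange(20, dtype=float).reshape(10, 2)
stretched = rescale_feature_length(cepstra, 1.5)
```

The first and last frames are preserved and only the number of frames in between changes, which is why an accurate speaking rate estimate is needed before any such adjustment.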
A speaking rate may be determined by estimating the number of syllables spoken in a certain period of time. A syllable typically includes a syllabic nucleus composed of a vowel. A syllabic nucleus has higher energy and periodicity than an onset and a coda, so that the energy and periodicity increase at the syllabic nuclei and decrease or disappear between two syllabic nuclei. Since the energy and periodicity reach their peaks at the syllabic nuclei, the syllabic nuclei are detected using the energy and periodicity, and the number of peaks is used as the number of syllables.
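The peak-counting idea described above can be sketched as follows. This is a simplified illustration, assuming a single feature contour (for example, a per-frame energy value) and a fixed height threshold; the names `count_peaks`, `speaking_rate`, and `min_height` are hypothetical.

```python
import numpy as np

def count_peaks(contour, min_height):
    """Count strict local maxima of the contour that reach min_height."""
    c = np.asarray(contour, dtype=float)
    if c.size < 3:
        return 0
    mid = c[1:-1]
    # A frame is a peak if it exceeds both neighbors and the threshold.
    is_peak = (mid > c[:-2]) & (mid > c[2:]) & (mid >= min_height)
    return int(np.count_nonzero(is_peak))

def speaking_rate(contour, duration_s, min_height=0.5):
    """Estimated syllables per second: number of peaks / section length."""
    return count_peaks(contour, min_height) / duration_s

# Toy contour with three clear peaks separated by low-energy gaps.
contour = [0.1, 0.9, 0.2, 0.8, 0.1, 0.7, 0.2]
rate = speaking_rate(contour, duration_s=1.5)  # 3 peaks / 1.5 s = 2.0
```

Such a counter depends on the contour dipping between syllabic nuclei; if two nuclei merge into a single hump, only one peak is counted.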
In detail, a speaking rate is determined by dividing a voice signal into a plurality of frames; extracting, for each of the frames, energy-related features (full-band energy, sub-band energy, an envelope correlation, low-band modulation energy, etc.) and periodicity-related features (a pitch, a harmonic component magnitude, etc.); detecting peaks of the features; and dividing the number of peaks by the length of the voice section. According to this conventional technique, however, when syllabic nuclei are directly connected, as in “fruit,” “almost,” and “import,” or when a sonorant (the Korean characters “ㄴ,” “ㅁ,” “ㄹ,” and “ㅇ”) is present between syllabic nuclei as an onset or a coda, the energy and periodicity do not decrease or disappear between the syllabic nuclei and then increase again. Accordingly, it is difficult to detect the peaks of energy and periodicity.
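The framing and per-frame feature extraction described above can be sketched as follows. This is a minimal illustration, assuming full-band energy as the energy-related feature and the largest normalized autocorrelation value as a stand-in for the periodicity-related feature; the function names, the 25 ms frame length, the 10 ms hop, and the lag range are assumptions chosen for the example.

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames (trailing samples are dropped)."""
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_len:
        return np.empty((0, frame_len))
    num_frames = 1 + (signal.size - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(num_frames)[:, None]
    return signal[idx]

def frame_energy(frames):
    """Full-band energy of each frame (sum of squared samples)."""
    return np.sum(frames ** 2, axis=1)

def frame_periodicity(frames, min_lag, max_lag):
    """Largest normalized autocorrelation over lags min_lag..max_lag (min_lag >= 1)."""
    out = np.zeros(frames.shape[0])
    for i, x in enumerate(frames):
        energy = np.dot(x, x)
        if energy == 0.0:
            continue  # silent frame: leave periodicity at zero
        corr = [np.dot(x[:-lag], x[lag:]) for lag in range(min_lag, max_lag + 1)]
        out[i] = max(corr) / energy
    return out

# Synthetic check: a 200 Hz tone (vowel-like) versus white noise, at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
voiced = np.sin(2 * np.pi * 200 * t)
noise = np.random.default_rng(0).standard_normal(sr)

voiced_frames = frame_signal(voiced, frame_len=400, hop_len=160)
noise_frames = frame_signal(noise, frame_len=400, hop_len=160)
voiced_per = frame_periodicity(voiced_frames, min_lag=40, max_lag=200)
noise_per = frame_periodicity(noise_frames, min_lag=40, max_lag=200)
```

In this sketch the periodic tone yields a periodicity value close to 1 in every frame, while the noise yields values near 0. Peaks of such contours would then be counted and divided by the section length to obtain the speaking rate.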
A deep neural network, which has been actively studied in recent years, is a neural network having a plurality of hidden layers between an input layer and an output layer, and can represent a complex relation between an input and an output. In particular, the deep neural network has the advantage of precisely representing the relation to the output by utilizing dynamic information between frames of an input signal and by extracting implicit characteristics of the input signal. This advantage makes it possible to address the difficulty of detecting syllabic nuclei when the syllabic nuclei are directly connected or when a sonorant is present between the syllabic nuclei.
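To make the notion of utilizing dynamic information between frames concrete, the following is a minimal sketch in which each frame is spliced with its neighboring frames and passed through a small feed-forward network with several hidden layers. The context width, layer sizes, ReLU hidden units, sigmoid output, and random untrained weights are all assumptions for illustration; the description above does not specify an architecture, and a usable network would have to be trained on labeled speech.

```python
import numpy as np

def splice_context(features, left, right):
    """Stack each frame with its neighbors; edges are padded by repetition."""
    features = np.asarray(features, dtype=float)
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    num_frames = features.shape[0]
    # Window i holds the frame at offset (i - left) from the current frame.
    windows = [padded[i:i + num_frames] for i in range(left + right + 1)]
    return np.concatenate(windows, axis=1)

def init_mlp(layer_sizes, seed=0):
    """Random (untrained) weights and zero biases for each layer."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    """ReLU hidden layers followed by a sigmoid output in [0, 1]."""
    for w, b in params[:-1]:
        x = np.maximum(0.0, x @ w + b)
    w, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# Six frames of two features each, with two frames of context on either side.
feats = np.arange(12, dtype=float).reshape(6, 2)
spliced = splice_context(feats, left=2, right=2)   # shape (6, 10)
params = init_mlp([10, 16, 16, 1])
scores = forward(params, spliced)                  # one score per frame
```

Because each input row contains the neighboring frames as well as the current one, the network can in principle respond to how the features change over time rather than to a single frame in isolation.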