1. Field of the Invention
This invention relates to a method of recognizing speech generated by an unspecified speaker.
2. Description of the Prior Art
Some methods of speech recognition use a word spotting technique. T. Kimura et al published "A Telephone Speech Recognition System Using Word Spotting Technique Based on Statistical Measure", Proc. of ICASSP, Dallas, pp. 1175-1178, 1987. S. Hiraoka et al published "A Small Vocabulary Speech Recognizer for Unspecified Speaker Using Word-Spotting Technique", the Japanese Society of Electronics, Information and Communications, SP88-18, 1988.
According to the publication by S. Hiraoka et al, a speaker independent speech recognition method was developed which is relatively immune to noise. The recognition method, named CLM (Continuous Linear Compression/Expansion Matching), uses a word spotting technique. The word spotting technique is performed by a new time normalization algorithm based on a linear time distortion pattern matching method. Word recognition was carried out by using a ten-numeral database of 240 persons which was gathered through a telephone line. The resultant word recognition rate was 96.4%. In practical use, the recognition rate was 95.9%.
In the prior art speech recognition by S. Hiraoka et al, unknown input speech is collated with predetermined standard patterns of preset words (recognition-object words) to provide a speech recognition result. The standard patterns are generated on the basis of data of recognition-object words spoken by many speakers. During the generation of the standard patterns, signals of spoken words are visualized, and speech intervals are extracted from the visualized signals. Signal components in the speech intervals are statistically processed to form the standard patterns.
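By way of illustration only, the collation of unknown input speech with standard patterns by word spotting may be sketched as follows. The sketch is not taken from the cited publications. The function names, the nearest-frame sampling used for linear compression and expansion, and the squared Euclidean distance are all assumptions chosen for brevity; the cited methods use a statistical measure which is not reproduced here.

```python
from typing import Dict, List, Tuple

Frame = List[float]  # one feature vector per unit analysis time


def resample(frames: List[Frame], target_len: int) -> List[Frame]:
    """Linearly compress or expand a frame sequence by nearest-frame sampling."""
    n = len(frames)
    return [frames[min(n - 1, (j * n) // target_len)] for j in range(target_len)]


def distance(a: List[Frame], b: List[Frame]) -> float:
    """Mean squared Euclidean distance (an assumed stand-in measure)."""
    total = sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb))
    return total / len(a)


def spot_word(
    signal: List[Frame],
    patterns: Dict[str, List[Frame]],
    min_len: int,
    max_len: int,
) -> Tuple[str, int, int, float]:
    """Return (word, start, end, distance) of the best-matching interval.

    Every candidate interval of min_len..max_len frames is normalized to the
    length of each standard pattern and compared with it; no prior detection
    of the speech interval is needed.
    """
    best = None
    for start in range(len(signal)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(signal):
                break
            window = signal[start:end]
            for word, pattern in patterns.items():
                d = distance(resample(window, len(pattern)), pattern)
                if best is None or d < best[3]:
                    best = (word, start, end, d)
    if best is None:
        raise ValueError("signal is shorter than min_len frames")
    return best
```

In this sketch the exhaustive search over start positions and interval lengths is kept for clarity; a practical system would bound or share these computations.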
In the prior art speech recognition by S. Hiraoka et al, a word dictionary for an unspecified speaker which contains standard patterns of recognition-object words is formed by using speech data obtained from many speakers, for example, 330 speakers. Specifically, the speakers generate Japanese words representative of numerals of 1 to 10, and the generated Japanese words are converted into speech data. The speech data is visualized into spectrum waveforms, and speech intervals are extracted by visual inspection. Each speech interval is divided into unit analysis times. Feature parameters (LPC cepstrum coefficients) of the speech data are derived for every unit analysis time. The feature parameters for the respective unit analysis times are arranged into a temporal sequence. The intervals of the speech data represented by temporal sequences of feature parameters are compressed or expanded to a preset speech time which varies from word to word. The absolute values of the resultant speech data are used to form a standard pattern of each recognition-object word.
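By way of illustration only, the compression or expansion of a temporal sequence of feature parameters to a preset speech time, and the formation of a standard pattern from many speakers, may be sketched as follows. The function names are hypothetical, the feature vectors are assumed to have been computed already, and the frame-by-frame averaging over speakers is an assumed, simplified form of statistical processing rather than the procedure of the cited publication.

```python
from typing import List

Frame = List[float]  # feature parameters for one unit analysis time


def normalize_length(frames: List[Frame], target_len: int) -> List[Frame]:
    """Linearly compress or expand a frame sequence to target_len frames."""
    if not frames or target_len < 1:
        raise ValueError("need at least one frame and a positive target length")
    n = len(frames)
    if n == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for j in range(target_len):
        pos = j * (n - 1) / (target_len - 1)  # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        # Linear interpolation between the two neighbouring source frames.
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


def build_standard_pattern(
    utterances: List[List[Frame]], target_len: int
) -> List[Frame]:
    """Average the length-normalized utterances of one word, frame by frame."""
    normalized = [normalize_length(u, target_len) for u in utterances]
    dim = len(normalized[0][0])
    count = len(normalized)
    return [
        [sum(u[j][k] for u in normalized) / count for k in range(dim)]
        for j in range(target_len)
    ]
```

Here target_len plays the role of the preset speech time, expressed as a number of unit analysis times, and would be chosen separately for each recognition-object word.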
The prior art speech recognition by S. Hiraoka et al has the following problems. First, many different speakers are necessary to generate a reliable word dictionary containing standard patterns of recognition-object words. Second, it is troublesome to change recognition-object words.