A person can convey his or her thoughts to another party in various ways, but speech is the most basic means of human communication.
Human speech processing can be divided into two aspects: speech production and speech perception. Speech production refers to the series of processes by which a speaker communicates his or her intention, and speech perception refers to the series of processes by which the content of another party's speech is perceived. Research into these two aspects has been conducted separately, and has been pursued in various academic fields such as linguistics, phonetics, phonology, physiology, and anatomy.
Approaches to speech recognition from the standpoint of speech perception can be classified into four types: an acoustic-phonetic method, a statistical pattern recognition method, an artificial intelligence method, and a neural network method.
Speech recognition systems using these approaches can be classified, according to the type of vocalization, into isolated word speech recognition systems, which recognize words spoken in isolation, and continuous speech recognition systems, which recognize continuously spoken words.
Of these speech recognition systems, an isolated word speech recognition system performs recognition in the sequence of Voice Activity Detection (VAD)→feature extraction→pattern comparison→recognition. This technology is suitable for small computational loads or small-scale speech recognition, but is unsuitable for commercialization for two reasons. First, unregistered words are rejected only by confidence detection, and errors frequently occur in confidence detection in noisy environments. Second, a special event, such as pressing a recording button, is required to start speech recognition.
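The pattern comparison and rejection steps of such an isolated word system can be illustrated with a minimal sketch. It assumes dynamic time warping (DTW) as the pattern comparison and a fixed distance threshold as a stand-in for confidence detection; neither choice is stated in the description above, and the template values, the word labels, and the names dtw_distance, recognize, and reject_above are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def recognize(features, templates, reject_above):
    """Return the closest registered word, or None if even the best match is
    farther than reject_above (treated here as an unregistered word)."""
    best_word, best_dist = None, float("inf")
    for word, template in templates.items():
        dist = dtw_distance(features, template)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word if best_dist <= reject_above else None


# Hypothetical registered vocabulary: one feature template per word.
templates = {
    "radio": [1, 3, 4, 3, 1],
    "window": [5, 5, 2, 2, 5],
}

accepted = recognize([1, 3, 3, 4, 3, 1], templates, reject_above=2.0)
rejected = recognize([9, 9, 9, 9], templates, reject_above=2.0)
```

Because rejection in this sketch depends on a single threshold, any noise that distorts the extracted features shifts the distances and can cause either false rejection of a registered word or false acceptance of an unregistered one, which is the weakness noted above.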
FIG. 1 is a diagram schematically showing the construction of a conventional continuous speech recognition network. The continuous speech recognition network performs recognition by post-processing a recognized word sequence using a language model. This scheme is mainly used in large-vocabulary speech recognition systems of ten thousand or more words.
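As a rough illustration of post-processing with a language model, the following sketch rescores candidate word sequences by adding a bigram language model score to an acoustic score. The bigram model, the log-probability values, the candidate sequences, and the names rescore, lm_weight, and unseen are all assumptions made for illustration and are not taken from FIG. 1.

```python
def rescore(hypotheses, bigram_logprob, lm_weight=1.0, unseen=-10.0):
    """Pick the word sequence with the best combined score.

    hypotheses     : list of (words, acoustic_log_score) pairs
    bigram_logprob : dict mapping (previous_word, word) to a log-probability
    unseen         : log-probability assumed for bigrams missing from the model
    """
    def lm_score(words):
        padded = ["<s>"] + list(words)  # "<s>" marks the sentence start
        return sum(bigram_logprob.get((prev, word), unseen)
                   for prev, word in zip(padded, padded[1:]))

    best = max(hypotheses, key=lambda h: h[1] + lm_weight * lm_score(h[0]))
    return best[0]


# Hypothetical candidates from the acoustic model comparison stage.
hypotheses = [
    (["wreck", "a", "nice", "beach"], -10.0),
    (["recognize", "speech"], -10.5),
]
# Hypothetical bigram language model (log-probabilities).
bigram_logprob = {
    ("<s>", "recognize"): -1.0,
    ("recognize", "speech"): -0.5,
}

best_words = rescore(hypotheses, bigram_logprob)
```

In this sketch the first candidate has the slightly better acoustic score, but the language model penalizes its unseen word pairs, so the second candidate is selected. A language model covering ten thousand or more words requires a far larger table than the two entries used here.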
A continuous speech recognition system performs recognition in the sequence of feature extraction→pattern comparison (acoustic model comparison)→language model comparison→post-processing→recognition. It is suitable for large-scale speech recognition on high-specification, server-level Personal Computers (PCs), and advances in storage media and computational processing power have made it possible to implement real-time large-scale speech recognition. However, installing such a continuous speech recognition system in terminals, which are becoming increasingly lightweight, is problematic: the storage capacity of such terminals is still insufficient, complicated floating-point computation is a burden for them, and the system requires a large computational load and storage capacity for a large amount of data.
FIG. 2 is a diagram showing a speech waveform produced by speaking a command in a vehicle which has not been started, and FIG. 3 is a diagram showing a speech waveform produced by speaking the same command in a vehicle which is traveling with the window open. As a comparison of FIGS. 2 and 3 shows, in an actual vehicle environment the performance of Voice Activity Detection (VAD) based on existing energy or Zero Crossing Rate (ZCR) measures decreases remarkably due to the vibration noise of the vehicle engine, the sound output by multimedia devices, and the wind noise that arises when the window of the vehicle is open. Therefore, in a conventional in-vehicle speech recognition apparatus, the driver presses a hot-key to generate a speech recognition event, and speech recognition is then conducted. This requirement inconveniences the user even when the conventional speech recognition apparatus is a small-scale speech recognition system having relatively excellent performance, such as one for the electronic control of the vehicle or the menu control of a navigation terminal installed in the vehicle, and thus is a large obstacle to commercialization.
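The degradation of energy-based VAD in noise can be illustrated with a minimal sketch that computes short-time energy and ZCR per frame and applies a fixed energy threshold. The test signal is synthetic and entirely hypothetical: a 200 Hz tone stands in for speech and Gaussian noise stands in for engine and wind noise. The 8 kHz sampling rate, the 160-sample frame length, and the 0.05 threshold are likewise assumptions, not values from FIGS. 2 and 3.

```python
import math
import random


def frame_features(samples, frame_len=160):
    """Return (short-time energy, zero crossing rate) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # Count sign changes between neighbouring samples.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((energy, crossings / (frame_len - 1)))
    return feats


def energy_vad(samples, threshold, frame_len=160):
    """Mark a frame as speech when its energy exceeds a fixed threshold."""
    return [energy > threshold for energy, _ in frame_features(samples, frame_len)]


def make_signal(noise_std, seed=0):
    """Hypothetical signal at 8 kHz: 0.1 s silence, 0.1 s tone, 0.1 s silence, plus noise."""
    rng = random.Random(seed)
    sample_rate = 8000
    signal = []
    for n in range(2400):
        tone = 0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) if 800 <= n < 1600 else 0.0
        signal.append(tone + rng.gauss(0.0, noise_std))
    return signal


# Low noise, loosely analogous to the stopped vehicle.
quiet = energy_vad(make_signal(noise_std=0.01), threshold=0.05)
# Strong noise, loosely analogous to driving with the window open.
noisy = energy_vad(make_signal(noise_std=0.4), threshold=0.05)
```

With low noise, only the five frames containing the tone exceed the threshold. With strong noise, the noise energy alone exceeds the same threshold, so every frame is marked as speech and the start and end of the utterance can no longer be located.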