With the development of speech recognition and speech synthesis, there is a growing need for a multimodal interface that provides an input means in addition to speech for terminals such as mobile terminals, home network terminals, and robots.
A multimodal channel is obtained by modeling human sensory channels, such as the senses of sight, hearing, taste, and smell, as a plurality of modalities and converting the modeled sensory channels by means of a mechanical device. The interchange of these modalities is referred to as multimodal interaction.
Speech recognition is a process through which a computer maps an acoustic speech signal to text. That is, speech recognition converts an acoustic signal obtained through a microphone or a telephone into a word, a set of words, or text. A speech recognition result can be used as a final result in applications such as command, control, data input, and text preparation, and can also be used as an input to a language processing procedure in a field such as speech understanding. Accordingly, speech recognition enables natural communication between people and computers and enriches human life.
Speech synthesis refers to the automatic generation of speech waveforms using a mechanical device, an electronic circuit, or computer simulation. TTS (text-to-speech), a speech synthesis technology, converts input text data into speech by mechanically analyzing and processing that data.
Publication and information exchange through electronic documents composed of text information are now common. Electronic documents are provided to users through computers, TV receivers, or mobile terminals that include a display, and users edit electronic documents including text information using a mouse, a keypad, or the like.