Various server-client type speech recognition apparatuses are conventionally well known, and are mainly classified into the three types described hereinbelow.
In a first conventional server-client type speech recognition apparatus, the speech is detected on the terminal side (client side) apparatus, the waveform data after detection is transmitted to a server side apparatus, and the transmitted data is analyzed and recognized on the server side apparatus. As one example of the first conventional server-client type speech recognition apparatus, a speech recognition apparatus using Dialogic CSP (Continuous Speech Processing) is well known.
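The terminal side speech detection mentioned above can be illustrated with a minimal sketch. It assumes a simple frame-energy threshold, which is only one possible detection method and is not taken from the cited apparatus; the function name `detect_speech`, the frame length, and the threshold are hypothetical.

```python
def detect_speech(samples, frame_len=4, threshold=1.0):
    """Return (start, end) sample indices spanning the frames whose mean
    energy exceeds the threshold, or None if no frame does.

    Assumption: energy thresholding stands in for whatever detection
    method a real terminal would use.
    """
    active_starts = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold:
            active_starts.append(start)
    if not active_starts:
        return None
    # The detected section runs from the first active frame to the end
    # of the last active frame.
    return active_starts[0], active_starts[-1] + frame_len


waveform = [0, 0, 0, 0, 3, -3, 3, -3, 2, 2, -2, -2, 0, 0, 0, 0]
section = detect_speech(waveform)
```

In the first conventional type, only the samples inside the detected section would then be sent, uncompressed, to the server side apparatus.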
In a second conventional server-client type speech recognition apparatus, the speech is detected on the terminal side apparatus, the waveform data after detection is compressed, and the compressed waveform data is transmitted to the server side apparatus; the server side apparatus decompresses the transmitted waveform data, detects the speech for recognition again, and analyzes and recognizes the waveform data after detection. Here, as a well-known technology in which the speech is detected on the terminal side apparatus, the waveform data after detection is compressed, and the compressed waveform data is transmitted to the server side apparatus, a document 1 (Nikkei Internet Technology, pp. 75 to 93, May 1998) specifically discloses VoIP (Voice over Internet Protocol).
A third conventional server-client type speech recognition apparatus has recently been proposed in a standardization project operated by the ETSI (European Telecommunications Standards Institute) STQ Aurora DSR (Distributed Speech Recognition) working group, as one of the working groups of 3GPP (Third Generation Partnership Project). That is, in the third conventional server-client type speech recognition apparatus, the speech is detected and analyzed on the terminal side apparatus, a parameter after analysis (characteristic vector) is transmitted to the server side apparatus, and the speech is recognized on the server side apparatus.
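The division of processing among the three conventional types can be summarized in a short sketch. The stage names and the `Architecture` class are hypothetical labels chosen for illustration; only the assignment of stages to the terminal side and the server side follows the descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Architecture:
    name: str
    terminal: tuple   # stages run on the terminal side apparatus
    server: tuple     # stages run on the server side apparatus
    payload: str      # what is transmitted from terminal to server


ARCHITECTURES = {
    "first": Architecture(
        "first", ("detect",), ("analyze", "recognize"),
        "uncompressed waveform"),
    "second": Architecture(
        "second", ("detect", "compress"),
        ("decompress", "detect", "analyze", "recognize"),
        "compressed waveform"),
    "third": Architecture(
        "third", ("detect", "analyze"), ("recognize",),
        "characteristic vectors"),
}


def duplicated_stages(arch):
    """Stages performed on both the terminal side and the server side."""
    return sorted(set(arch.terminal) & set(arch.server))
```

For example, `duplicated_stages(ARCHITECTURES["second"])` lists speech detection, the only stage that any of the three types performs on both sides.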
However, the first through third conventional server-client type speech recognition apparatuses have the following problems.
That is, in the first conventional server-client type speech recognition apparatus, inasmuch as the waveform data detected on the terminal side apparatus is transmitted to the server side apparatus without the compression thereof, there is a problem that time and costs for transmission are increased.
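The scale of this transmission cost can be illustrated with a rough calculation. All figures below (8 kHz sampling, 16-bit samples, a 2-second utterance, and an 8 kbit/s coded stream) are illustrative assumptions and do not come from the cited apparatuses.

```python
def pcm_bytes(duration_s, sample_rate_hz=8000, bits_per_sample=16):
    """Size of uncompressed waveform data for an utterance (assumed format)."""
    return duration_s * sample_rate_hz * bits_per_sample // 8


def coded_bytes(duration_s, bitrate_bps=8000):
    """Size of the same utterance after compression at an assumed bit rate."""
    return duration_s * bitrate_bps // 8


raw = pcm_bytes(2)       # 32000 bytes under the assumed format
coded = coded_bytes(2)   # 2000 bytes at the assumed coded bit rate
ratio = raw // coded     # 16
```

Under these assumptions the uncompressed waveform is sixteen times larger than the coded stream, which is the kind of difference that increases transmission time and cost.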
The second conventional server-client type speech recognition apparatus has a problem that, inasmuch as the server side apparatus detects the speech for recognition again in data which has already been subjected to the speech detection and the compression on the terminal side apparatus, the speech detection is performed redundantly. Further, the second conventional server-client type speech recognition apparatus has a problem that, inasmuch as the speech for recognition is detected on the server side, when the server side apparatus cancels the detection of the start of a short speech, the terminal side receives the information on the cancellation of the speech detection late, and the operation of the application is delayed.
The third conventional server-client type speech recognition apparatus has one problem that, inasmuch as the parameter used for recognition (the parameter after analysis) is determined in advance, a different, specific parameter cannot be used. Further, the third conventional server-client type speech recognition apparatus has another problem that, inasmuch as the analysis portion is placed in the terminal side apparatus, costs and time for providing a new analysis method to a terminal are increased.
In addition, the following documents are well known as related-art documents of the present invention.
Japanese Unexamined Patent Publication (JP-A) No. 2000-268047 discloses an “information providing system, client, information providing server, and information providing method” in which a server system determines the feeling and situation of an operator based on speech information from a call of the operator, positional information, time information, weather information, and organic information, and transmits to a client the information to be provided based on the feeling and situation. The information providing system disclosed in this Publication includes the client and the server system. The client comprises a communication unit for transmitting, to the server system via a network, operator information, which is information related to the operator, and an output unit for receiving the providing information via the network from the server system and outputting the received providing information. The server system comprises an analysis information storing unit for storing the providing information and analysis information for analyzing the operator information, a selecting server for selecting, from the providing information storing unit, the providing information suitable for transmission to the client based on the operator information transmitted from the client and on the analysis information, and an information providing server for transmitting, to the client via the network, the providing information selected by the selecting server.
As disclosed in Japanese Unexamined Patent Publication (JP-A) No. 2000-268047, the server system further comprises a speech recognition server. The speech recognition server receives the speech information transmitted from the client, and recognizes the speech in the received speech information by using an acoustic analysis unit, an acoustic model, a language model, and so on. The acoustic analysis unit is a processing unit for extracting a series of acoustic feature quantities from the input speech information. The acoustic model is information used for estimating the similarity between the speech and a pattern of part or all of the series of acoustic feature quantities, by using an estimation expression for the acoustic similarity together with the individual acoustic feature quantities extracted by the acoustic analysis unit. Further, the language model is information that supplies constraints on how the acoustic models may be connected.
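The roles of the acoustic analysis unit, the acoustic model, and the language model can be illustrated with a toy sketch. Everything in it is a simplifying assumption made for illustration and is not the method of the cited Publication: the feature quantity is the mean absolute amplitude of a frame, the acoustic model holds one reference value per unit, the similarity is a negative squared distance, the language model is a table of permitted successors, and the search is greedy.

```python
# Hypothetical acoustic model: one reference feature value per unit.
ACOUSTIC_MODEL = {"low": 0.0, "high": 4.0}

# Hypothetical language model: which unit may follow which
# (None marks the start of the utterance).
LANGUAGE_MODEL = {None: {"low", "high"}, "low": {"high"}, "high": {"low"}}


def acoustic_analysis(samples, frame_len=2):
    """Extract one feature quantity (mean absolute amplitude) per frame."""
    return [
        sum(abs(s) for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def similarity(feature, unit):
    """Estimation expression: higher means acoustically more similar."""
    return -(feature - ACOUSTIC_MODEL[unit]) ** 2


def recognize(features):
    """Greedy search: per frame, pick the most similar unit among those
    the language model allows after the previous unit."""
    result, prev = [], None
    for feature in features:
        allowed = sorted(LANGUAGE_MODEL[prev])
        best = max(allowed, key=lambda unit: similarity(feature, unit))
        result.append(best)
        prev = best
    return result


features = acoustic_analysis([0, 0, 4, -4, 4, 4])   # [0.0, 4.0, 4.0]
units = recognize(features)                          # ['low', 'high', 'low']
```

The third frame is acoustically closest to "high", but the language model permits only "low" after "high", so the constraint on connection overrides the acoustic similarity in this toy case.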
In addition, Japanese Unexamined Patent Publication (JP-A) No. 2000-194700 discloses an “information processing unit and method, and providing medium” in which the contents of speech recognition and machine translation are easily changed. As disclosed in this Publication, a terminal is a device having a phone function, and is connected to a network. A user can make a phone call (speak) via the terminal. The user has one of three translation service providing devices translate the contents of the speech. The translation service providing device is a server having a speech recognition function, a machine translation function, a speech synthesis function, and a history information storage function. The translation service providing device stores, by the history information storage function, the contents of the speech up to the present, executes the translation processing based on the stored contents, and provides the speech history information as needed to another translation service providing device.