A speech recognition technology using an information processing system is a technology for extracting linguistic information contained in input speech data. A system using the speech recognition technology can be used as a speech word processor if all of the speech data is converted into text, and can be used as a speech command input device if a keyword included in the speech data is extracted.
FIG. 7 illustrates an example of a related speech recognition system. The speech recognition system illustrated in FIG. 7 includes an utterance segment extraction unit, a feature vector extraction unit, an acoustic likelihood computation unit, a hypothesis search unit, and a database for speech recognition.
The speech recognition system including such components operates as follows.
A segment that involves actual utterance (speech segment) and a segment that does not (silent segment) coexist in a sound (speech) input to the speech recognition system, and hence the utterance segment extraction unit is used to take out only the speech segment therefrom.
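The separation of speech segments from silent segments can be illustrated with a minimal sketch. It assumes a simple frame-power threshold, which is only one of several possible criteria; the source does not specify how the utterance segment extraction unit decides. The function name `extract_speech_segments`, the frame length, and the threshold are hypothetical values chosen for illustration.

```python
def extract_speech_segments(samples, frame_len=4, threshold=0.01):
    """Return (start, end) sample-index pairs for runs of frames whose
    mean power exceeds `threshold` (assumed criterion, not from the source)."""
    segments = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        power = sum(x * x for x in frame) / frame_len
        if power > threshold:
            if start is None:          # a speech segment begins here
                start = i * frame_len
        elif start is not None:        # a speech segment ends here
            segments.append((start, i * frame_len))
            start = None
    if start is not None:              # speech ran to the end of the input
        segments.append((start, n_frames * frame_len))
    return segments


# Toy input: silence, a short burst of "speech", then silence again.
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8
segments = extract_speech_segments(signal)
```

Only the samples inside the returned index ranges would be passed on to the subsequent stage.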
Subsequently, the speech data within the extracted segment is input to the feature vector extraction unit, which extracts a feature vector by taking out various features included in the speech at regular time intervals (frames). Features that are often used include, for example, cepstrum, power, and Δ power. A combination of a plurality of features is handled as a sequence (vector) and may be referred to as a “feature vector”.
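As a rough illustration of per-frame feature extraction, the following sketch computes two of the features named above, power and its frame-to-frame change, and combines them into one vector per frame. It is written under stated assumptions: the power is taken as the logarithm of the mean squared sample value, the change in power is approximated by a first difference between adjacent frames (practical systems often use a regression over several frames), and the cepstrum is omitted to keep the example short. The function names are hypothetical.

```python
import math


def frame_log_power(samples, frame_len):
    """Log of the mean squared amplitude of each non-overlapping frame."""
    powers = []
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        powers.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return powers


def feature_vectors(samples, frame_len=4):
    """One (power, delta power) vector per frame.

    Delta power is a simple first difference here (an assumption);
    the first frame has no predecessor, so its delta is set to 0.
    """
    powers = frame_log_power(samples, frame_len)
    vectors = []
    for t, p in enumerate(powers):
        prev = powers[t - 1] if t > 0 else p
        vectors.append((p, p - prev))
    return vectors


# Two frames: the second is twice the amplitude of the first.
vectors = feature_vectors([1.0] * 4 + [2.0] * 4)
```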
The extracted feature vector of the speech is sent to the acoustic likelihood computation unit to obtain the likelihood (acoustic likelihood) thereof with respect to each of a plurality of phonemes that are given in advance. The acoustic likelihood often used is a similarity to a model of each phoneme recorded in an acoustic model of the database. The similarity is generally expressed as a “distance” (magnitude of deviation) from the model, and hence “acoustic likelihood computation” is also referred to as “distance calculation”. The phonemes are obtained intuitively by dividing a phonetic unit into a consonant and a vowel, but even the same phoneme exhibits different acoustic features when the preceding phoneme or the following phoneme differs. It is therefore known that modeling such cases separately increases the precision of the recognition. A phoneme modeled by thus taking the preceding and following phonemes into consideration is referred to as a “triphone” (trio of phonemes). The acoustic model widely used today expresses state transitions within each phoneme by a Hidden Markov Model (HMM). Accordingly, the acoustic model is a set of HMMs provided on a triphone-by-triphone basis. In most implementations, each triphone is assigned an ID (hereinafter referred to as “phoneme ID”), and is handled wholly by the phoneme ID in processing in the subsequent stages.
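The relationship between “distance” and acoustic likelihood can be shown with a small sketch. It assumes, purely for illustration, that each phoneme ID is modeled by a single diagonal-covariance Gaussian rather than by a full HMM, so that the log-likelihood is a negatively weighted squared distance from the model mean plus a constant. The table `ACOUSTIC_MODEL`, its triphone labels, and its parameter values are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x.

    The (xi - m)**2 / v term is the squared, variance-scaled distance
    from the model; a larger distance gives a lower likelihood.
    """
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


# Hypothetical acoustic model: phoneme ID -> triphone label and parameters.
ACOUSTIC_MODEL = {
    0: {"triphone": "k-a+t", "mean": [1.0, 0.0], "var": [1.0, 1.0]},
    1: {"triphone": "s-a+t", "mean": [-1.0, 0.5], "var": [1.0, 1.0]},
}


def acoustic_likelihoods(feature_vector, model=ACOUSTIC_MODEL):
    """Return {phoneme ID: log-likelihood} for one feature vector."""
    return {
        pid: log_gaussian(feature_vector, entry["mean"], entry["var"])
        for pid, entry in model.items()
    }


scores = acoustic_likelihoods([0.9, 0.1])
best_id = max(scores, key=scores.get)  # phoneme ID closest to the input
```

Subsequent stages would refer only to the phoneme IDs and their scores, not to the triphone labels.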
The hypothesis search unit refers to a language model and, based on the acoustic likelihood obtained by the acoustic likelihood computation unit, searches for the word string having the highest likelihood. The language model may be considered as being divided into a dictionary and a strict language model. Here, the dictionary provides a list of the vocabulary that can be handled by the (broad-sense) language model. In general, each word entry within the dictionary is assigned a phoneme string (phoneme ID string) of the corresponding word and a representation character string thereof. Meanwhile, the strict language model includes information obtained by modeling the likelihood (language likelihood) that a given group of words within the vocabulary appears consecutively in a given order. Grammars and N-grams are most often used as the strict language model today. A grammar directly describes the admissibility of given word concatenations by using words, attributes of words, categories to which words belong, and the like. Meanwhile, an N-gram is obtained by statistically computing the appearance likelihood of each word concatenation formed of N words based on the actual appearance frequency within a large corpus (text data for learning). In general, each entry of the dictionary is assigned an ID (hereinafter referred to as “word ID”), and the (strict) language model serves as a function that returns a language likelihood when a word ID string is input thereto. In summary, the search processing performed by the hypothesis search unit is processing for obtaining the likelihood (acoustic likelihood) of phonemes from a feature vector string, determining whether the phoneme ID string can be converted into a word ID, obtaining the appearance likelihood (language likelihood) of the word string from the word ID string, and finally finding the word string having the highest likelihood.
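The idea of a strict language model as a function from a word ID string to a language likelihood can be sketched with a bigram (N = 2) model. The sketch makes several assumptions that are not in the source: the dictionary contents and phoneme IDs are invented, the corpus is a toy list of word ID strings, and add-one smoothing is used so that unseen word pairs receive a small non-zero likelihood. It covers only the language likelihood step, not the full hypothesis search.

```python
import math
from collections import Counter

# Hypothetical dictionary: word ID -> (representation string, phoneme ID string).
DICTIONARY = {
    1: ("open", (10, 11, 12)),
    2: ("close", (13, 14, 15)),
    3: ("file", (16, 17)),
    4: ("window", (18, 19, 20)),
}
BOS = 0  # pseudo word ID marking the start of a sentence


def train_bigram(corpus):
    """Count how often each word ID follows each history word ID."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        history = BOS
        for word_id in sentence:
            history_counts[history] += 1
            bigram_counts[(history, word_id)] += 1
            history = word_id
    return history_counts, bigram_counts


def language_likelihood(word_ids, history_counts, bigram_counts, vocab_size):
    """Log language likelihood of a word ID string (add-one smoothed bigram)."""
    logp = 0.0
    history = BOS
    for word_id in word_ids:
        numerator = bigram_counts[(history, word_id)] + 1
        denominator = history_counts[history] + vocab_size
        logp += math.log(numerator / denominator)
        history = word_id
    return logp


# Toy corpus of word ID strings: "open file", "open window", "close window", ...
corpus = [[1, 3], [1, 4], [2, 4], [1, 3]]
history_counts, bigram_counts = train_bigram(corpus)
V = len(DICTIONARY)

seen = language_likelihood([1, 3], history_counts, bigram_counts, V)    # "open file"
unseen = language_likelihood([3, 1], history_counts, bigram_counts, V)  # "file open"
```

A word order that appears in the corpus scores higher than one that does not, which is the property the hypothesis search relies on when ranking candidate word strings.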
A typical example of those kinds of speech recognition systems is that described by T. Kawahara, A. Lee, T. Kobayashi, K. Takeda, N. Minematsu, S. Sagayama, K. Itou, A. Ito, M. Yamamoto, A. Yamada, T. Utsuro, and K. Shikano in “Free software toolkit for Japanese large vocabulary continuous speech recognition,” in Proc. Int'l Conf. on Spoken Language Processing (ICSLP), Vol. 4, pp. 476-479, 2000 (Non Patent Literature 1).
Note that there are limitations on the vocabulary and phrases that can be modeled by a single language model. If a larger vocabulary and more diverse phrases are modeled beyond those limitations, ambiguity in the hypothesis search increases, which results in a decrease in recognition speed and a deterioration in recognition precision. Further, it is impossible in the first place to collect the entirety of an enormous vocabulary. Accordingly, the language model is generally customized depending on the task or domain in which the speech recognition technology is to be used. For example, to use the speech recognition technology for speech command input, a language model formed of only the commands that can be received is created. Alternatively, if the speech recognition technology is used to support dictation of minutes from recorded speech in a meeting, a language model is constructed by modeling only the words and phrases that appeared in the written minutes of meetings held in the past and in the speech of those meetings, along with their related words and phrases. This enables the vocabulary specific to a particular task or domain to be collected and enables appearance patterns thereof to be modeled.
Further, the acoustic model is generally obtained by applying machine learning technology to a large amount of labeled speech data (a set of speech data annotated with information as to which segment of the speech data corresponds to which phoneme). In general, such speech data, which is costly to collect, is not customized for each user but is instead prepared separately so as to suit the general properties of each expected use scene. For example, an acoustic model learned from labeled telephone speech data is used for telephone speech recognition. An optimization processing function suited to the speech of an individual user (generally referred to as a “speaker learning” function or an “enrollment” function) is sometimes provided, which learns difference information between the acoustic model shared by users and the speech of that user, but the basic acoustic model itself is rarely constructed for each user.
Speech recognition is widely applicable to various purposes, but has a problem in that it requires a correspondingly large amount of calculation, particularly in the above-mentioned hypothesis search processing. The speech recognition technology has been developed by reconciling the mutually contradictory objects of increasing the recognition precision and reducing the calculation amount, but even today problems remain, for example, limitations on the size of the vocabulary that can be handled by a cellular telephone terminal and the like. In order to realize highly precise speech recognition with a high degree of freedom, it is more effective to execute the speech recognition processing on a remote server that can handle a large amount of calculation. For such reasons, an implementation form in which the speech recognition processing is executed on the remote server and only a recognition result (or some action based on the result) is received on a local terminal (client-server speech recognition form) has been under active development in recent years.
Japanese Unexamined Patent Application Publication (JP-A) No. 2003-5949 (Patent Document 1) discloses an example of the speech recognition system having the implementation form described above. As illustrated in FIG. 8, a speech recognition system disclosed in Patent Document 1 includes a client terminal and a server that communicate with each other via a network. The client terminal includes a speech detection unit (utterance extraction unit) for detecting a speech segment from an input speech, a waveform compression unit for compressing the speech data of the detected segment, and a waveform transmission unit for transmitting compressed waveform data to the server. Further, the server includes a waveform reception unit for receiving the compressed waveform data transmitted from the client terminal, a waveform decompression unit for decompressing the received compressed speech, and an analysis unit and a recognition unit for analyzing the decompressed waveform and subjecting the waveform to the speech recognition processing.
The speech recognition system of Patent Document 1 including such components operates as follows. That is, a sound (speech) taken into the client terminal is divided into speech segments and non-speech segments by the speech detection unit. Of those, the speech segments are compressed by the waveform compression unit and then transmitted to the server by the waveform transmission unit. The waveform reception unit of the server receives the compressed data and sends it to the waveform decompression unit. The server causes the analysis unit to extract features from the waveform data decompressed by the waveform decompression unit, and finally causes the recognition unit to execute the speech recognition processing.
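The compress, transmit, and decompress flow described above can be sketched as follows. This is not the method of Patent Document 1: it assumes 16-bit PCM samples and uses the general-purpose lossless `zlib` codec as a stand-in for the waveform compression unit, and it replaces the network transfer with a direct function call. The function names are hypothetical.

```python
import struct
import zlib


def client_compress(speech_samples):
    """Client side: pack 16-bit PCM samples and compress the bytes.

    zlib is an assumed stand-in for a real waveform codec.
    """
    raw = struct.pack("<%dh" % len(speech_samples), *speech_samples)
    return zlib.compress(raw)


def server_decompress(payload):
    """Server side: decompress the payload and unpack the 16-bit samples."""
    raw = zlib.decompress(payload)
    return list(struct.unpack("<%dh" % (len(raw) // 2), raw))


# A detected speech segment (toy data); only this part is transmitted.
speech_segment = [0, 1200, -1200, 800, -800, 0] * 50

payload = client_compress(speech_segment)   # would be sent over the network
restored = server_decompress(payload)       # input to analysis and recognition
```

Because the stand-in codec is lossless, the server recovers exactly the samples the client detected; a lossy speech codec would trade some fidelity for a smaller payload.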
Also in a client-server speech recognition technology, the operation of the speech recognition unit itself is essentially the same as that on a single host. In the invention disclosed in Patent Document 1, the processing of FIG. 7 up to the step performed by the utterance extraction unit is executed by the client terminal, and the subsequent steps are executed by the server. In addition, there exists a client-server speech recognition technology in which the processing up to the step corresponding to the feature vector extraction unit is performed on the client terminal.
The client-server speech recognition technology has been developed mainly on the assumption of use on mobile terminals (such as cellular telephones, PDAs, PHSs, and netbooks). As described above, its original object is to overcome the problem that speech recognition is difficult to perform on mobile terminals having poor processing performance because the calculation amount involved in the speech recognition processing is large. In recent years, the processing performance of mobile terminals has improved and the speech recognition technology has become more refined, and hence a client-server speech recognition system is not always necessary. Nevertheless, the client-server speech recognition system is drawing increasing attention. This is based on a trend (so-called software as a service (SaaS)) in which various functions heretofore realized in the local terminal are now provided over the network in consideration of an increase in network bandwidth, management costs, and the like. In a case where the speech recognition technology is provided as a network service, a system therefor is constructed on the basis of the client-server speech recognition technology.