The present invention relates to a speech recognition method, a speech recognition system, and a server thereof, and more particularly to a speech recognition method, a speech recognition system, and a server thereof, each having an improved speech recognition rate.
Because conference participants are usually requested to record the minutes of the conference proceedings during the conference, much labor and care are needed to avoid listening and writing errors. Therefore, various speech recognition techniques have been proposed so far that output speech recognition results as text data.
For example, JP-A No. 2006-50500 discloses such a speech recognition related technique for recording minutes of conference proceedings. FIG. 1 shows the system for recording minutes of conference proceedings disclosed in JP-A No. 2006-50500. This system includes a plurality of client units 907 and a conference server 905 that sends speech information to those client units and controls the overall conference proceedings. This system works as follows. Each client unit 907 receives speeches of conference participants through an input unit as text or voice data in real time. Each speech received in this way is translated into text data in a speech recognition process by a speech-text translation unit. Then, a speech text editing/management unit displays at least part of the translated text data to the conference participants and the person in charge of the conference, while accepting corrections to or approval of the text data from the participants or the person in charge in real time during the conference. Because this system supports the conference proceedings in real time, it improves the quality of the conference, including the judgments and speeches of the participants, reduces the conference time, and enables the minutes of the conference proceedings to be recorded efficiently.
On the other hand, JP-A No. 2005-284209 discloses a speech recognition related technique for improving the recognition rate by updating the language model. FIG. 2 shows the speech recognition system disclosed in JP-A No. 2005-284209. This system includes a correlation unit 911, an important word extraction unit 914, a text DB 916, and a language model learning unit 915. The correlation unit 911 correlates an input speech with an acoustic model 913 by using a language model 912. The important word extraction unit 914 extracts an important word representing a conference subject from the correlation result. The text DB 916 stores text data related to each important word. The language model learning unit 915 searches the text DB 916 for target text data based on the important word extracted by the important word extraction unit 914. This speech recognition system then learns and generates a language model on the basis of the text data found by the search.
The speech recognition system shown in FIG. 2 works as follows. The correlation unit 911 correlates an input speech with an acoustic model 913, which represents the characteristics of speech, by using a language model 912. The initial language model 912 is generated by learning news items, etc. The correlation unit 911 obtains a recognition result consisting of a word string having a very high correlation score and sends the result to the important word extraction unit 914. The important word extraction unit 914 then extracts an important word representing the conference subject from the recognition result received from the correlation unit 911 and sends the extracted important word and its degree of importance to the language model learning unit 915. The language model learning unit 915 searches the text DB 916 for target text data by using, as a keyword, the important word extracted by the important word extraction unit 914, obtains the related text data, and then calculates the connection possibility of each word on the basis of the obtained text data to learn a language model. The language model learning unit 915 updates the language model 912 with the language model generated by learning. This speech recognition system uses the updated language model 912 and the acoustic model 913 to perform the next speech recognition. In this way, the system extracts text data related to the relevant conference subject, learns the language model, and performs speech recognition by using that language model, thereby improving the accuracy of the text to be used.
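The learning step described above can be illustrated with a minimal sketch. It is not the implementation of JP-A No. 2005-284209: it assumes that the "connection possibility of a word" is a maximum-likelihood bigram probability, that the text DB is a simple mapping from keyword to documents, and that text is tokenized on whitespace. The names search_text_db and learn_bigram_model are hypothetical and introduced only for illustration.

```python
from collections import Counter, defaultdict


def search_text_db(text_db, keyword):
    """Return the documents stored under a keyword (hypothetical text DB layout)."""
    return text_db.get(keyword, [])


def learn_bigram_model(documents):
    """Estimate P(next word | previous word) by relative frequency.

    Assumption: the "connection possibility" is modeled as a bigram
    probability; the cited document may use a different formulation.
    """
    pair_counts = defaultdict(Counter)
    for doc in documents:
        words = doc.lower().split()
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1

    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {word: count / total for word, count in counter.items()}
    return model


# Toy text DB keyed by important word (illustrative data only).
text_db = {
    "budget": ["the budget review is due", "the budget plan is approved"],
}

documents = search_text_db(text_db, "budget")
model = learn_bigram_model(documents)
```

In this toy data, "budget" is followed once by "review" and once by "plan", so each continuation receives half of the probability mass. A practical system would also need smoothing for unseen word pairs, which this sketch omits.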
JP-A No. 2002-091477 also discloses a speech recognition related technique for improving the recognition rate by updating a language model. FIG. 3 shows the speech recognition system disclosed in JP-A No. 2002-091477. This speech recognition system is composed of an acoustic model management server 952, a language model management server 953, and a speech recognition unit 951. In the fourth embodiment of that invention, the system is characterized by further including a user utilizing text storing means and a user utilizing text depending language model building-up means. In this fourth embodiment, the system refers to a user utilizing text and the latest updated language data 934 to build up a language model suited to the user utilizing text. This speech recognition system works as follows. Upon receiving a language model update instruction 932, the user utilizing text obtaining means of the language model management server 953 scans a file and a directory specified by the user in advance to read the text files referred to or written by the user. The user utilizing text storing means stores the texts collected by the user utilizing text obtaining means. The user utilizing text depending language model building-up means refers to the user utilizing text and the updated language data 934 to build up a language model so as to improve the recognition rate. In the process of building up a language model by using a user utilizing text, for example, the user utilizing text is regarded as a text to be identified, and a language model that depends on the user utilizing text is built up. Because a language model built up in this way reflects the characteristics of texts referred to by the user or of existing texts, it includes the language characteristics with which the user is more likely to speak, and thereby enables recognition results to be obtained with higher accuracy.
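The idea of a language model that depends on the user's texts can be sketched as follows. This is only an illustration under stated assumptions, not the method of JP-A No. 2002-091477: it assumes that user texts are plain ".txt" files under one directory, that the model is a unigram distribution, and that the user model and a base model are combined by linear interpolation. The names collect_user_texts, unigram_probs, and interpolate, and the weight of 0.5, are hypothetical.

```python
import tempfile
from collections import Counter
from pathlib import Path


def collect_user_texts(directory):
    """Read every .txt file under the directory (assumed user text location)."""
    paths = sorted(Path(directory).rglob("*.txt"))
    return [path.read_text(encoding="utf-8") for path in paths]


def unigram_probs(texts):
    """Relative word frequencies over whitespace-tokenized texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


def interpolate(base, user, user_weight=0.5):
    """Linear interpolation of two unigram models (an assumed combination rule)."""
    vocabulary = set(base) | set(user)
    return {
        word: (1 - user_weight) * base.get(word, 0.0) + user_weight * user.get(word, 0.0)
        for word in vocabulary
    }


# Demo with a temporary directory standing in for the user-specified directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.txt").write_text("codec latency codec", encoding="utf-8")
    user_model = unigram_probs(collect_user_texts(tmp))

base_model = {"news": 0.5, "codec": 0.5}  # toy base model, illustrative only
adapted = interpolate(base_model, user_model, user_weight=0.5)
```

In this toy case, words that appear in the user's text ("codec", "latency") gain probability at the expense of words that appear only in the base model ("news"), which is the effect the paragraph above attributes to a user-text-dependent language model.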
As described above, the system according to JP-A No. 2006-50500 is able to learn a speech recognition dictionary and is provided with a related document DB, a technical terms dictionary DB, and a conference keywords DB so as to store information required for speech recognition. The system according to JP-A No. 2005-284209 is provided with a text DB for storing text data related to important words and is able to search the text DB for text data and learn a language model. Further, JP-A No. 2002-091477 builds up a language model that depends on user utilizing texts.
However, none of those patent documents describes any means, in connection with the means for adding and updating information of each language model, for keeping dictionary data accurate and at a constant amount. Therefore, if the dictionary data exceeds a certain amount, the speech recognition speed and recognition rate are lowered. If the conference subject or contents are changed after a language model is created, the user is required to add or update the necessary data manually. The language model is also required to be improved through learning in conferences. Otherwise, recognition results are not displayed correctly. In addition, none of those patent documents describes how to cope with participants who speak different dialects in the same conference.
Thus, according to any of the above described speech recognition related techniques, if created dictionary data is registered in a language model, the recognition speed and recognition rate are lowered as the vocabulary grows to cover a wide range of conference subjects. This has been a problem. Also, even when a plurality of language models are prepared, much time and labor have been required to switch among those language models, and this can even result in switching errors. The recognition rate also depends on which dialect is spoken. Even when the language and acoustic models are switched to dialect-specific ones, much labor and time are required for manual switching, and this often causes switching errors.
Under such circumstances, it is an object of the present invention to provide a speech recognition method, a speech recognition system, and a server thereof, each improved in recognition rate by optimizing the language and acoustic models.