As a method of supplementing word information about a voice recognition target when a language model for voice recognition is generated, there is a method of collecting information similar to the content of the target from web pages on the Internet and generating the language model from the collected information.
Generally, when a retrieval system searches the World Wide Web (WWW) based on a designated retrieval word, the page obtained as the retrieval result lists links to web pages arranged according to a rank decided by a predetermined evaluation criterion. Examples of evaluation indices include the appearance frequency of the retrieval word, metadata of the hypertext markup language (HTML), the number of page links, and the presence or absence of a link from a page having many user references. In order to generate the language model, the web pages at the link destinations linked from the retrieval result page are acquired. However, although some of the link-destination web pages include content similar to the voice recognition target, in most cases they include a plurality of topics or deal with specialized fields. Thus, if the language model is generated without selecting which web pages to acquire, the accuracy of voice recognition is lowered.
For this reason, various techniques for selecting web pages and for extracting the words used in selecting them have been proposed.
For example, Non-Patent Literature 1 discloses a technique for extracting words whose part-of-speech information represents a noun from a word string obtained as a result of voice recognition, retrieving a news site on the Internet using the extracted words as retrieval words, and collecting similar web pages. In a technique disclosed in Non-Patent Literature 2, in order to collect a medical-related corpus, only the word “medical” is used as the retrieval word, and information is collected down to two layers below the link destinations of the retrieval result. In a technique disclosed in Non-Patent Literature 3, the words whose appearance frequencies in the recognition result rank in the top five are extracted as the retrieval words.
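The two word-selection approaches described above (keeping only nouns, and keeping the most frequent words) can be illustrated with a minimal Python sketch. It is not the implementation of any of the cited literature. It assumes that the recognition result is already available as a list of (word, part-of-speech) pairs, and the function names and the tag label "NOUN" are hypothetical names introduced only for this illustration.

```python
from collections import Counter

def extract_nouns(tagged_words, noun_tag="NOUN"):
    """Keep only the words tagged as nouns.

    tagged_words is assumed to be a list of (word, part_of_speech) pairs;
    the tag label "NOUN" is an assumption of this sketch.
    """
    return [word for word, pos in tagged_words if pos == noun_tag]

def top_frequency_words(words, k=5):
    """Return the k words with the highest appearance frequency.

    Words with equal counts keep the order in which they first appear.
    """
    counts = Counter(words)
    return [word for word, _ in counts.most_common(k)]

# Hypothetical recognition result, already tagged with parts of speech.
recognition_result = [
    ("heart", "NOUN"), ("beats", "VERB"), ("surgery", "NOUN"),
    ("heart", "NOUN"), ("patient", "NOUN"), ("surgery", "NOUN"),
    ("quickly", "ADV"), ("heart", "NOUN"), ("doctor", "NOUN"),
    ("nurse", "NOUN"), ("ward", "NOUN"),
]

nouns = extract_nouns(recognition_result)
retrieval_words = top_frequency_words(nouns, k=5)
print(retrieval_words)
```

In this sketch the two selections are chained for brevity, whereas the cited techniques apply them separately. Either selection alone yields retrieval words taken from the recognition result, which does not by itself guarantee that the retrieved pages are similar to the recognition target.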
Further, Patent Literature 1 discloses a technique of preventing a concatenation of words that includes a word with a high appearance frequency from having an unreasonably high language probability when the language model is generated. Patent Literature 2 discloses a technique of changing retrieval priority according to the background color of a character string inside an image, in a system that retrieves information on a network using the character string. In a user interface design tool disclosed in Patent Literature 3, which is capable of designing voice recognition, voice rule synthesis, or the like, a designer can set a character recognition part and set a recognition mode to “hiragana” or the like.