1. Technical Field
The present invention relates to a speech recognition device and a speech recognition method for recognizing speech and providing a response or performing processing in accordance with the recognition result, as part of human interface technology applied to vending machines, home electrical appliances, household fixtures, in-vehicle devices (navigation devices and the like), mobile terminals, and the like. The invention further relates to a semiconductor integrated circuit device or the like for use in such a speech recognition device.
2. Related Art
Speech recognition is a technology for analyzing an input speech signal, checking a feature pattern obtained as a result of the analysis against a standard pattern (also referred to as a “template”) that is prepared in a speech recognition database on the basis of a pre-recorded speech signal, and thereby obtaining a recognition result. However, if no restrictions are imposed on the range of standard patterns to be checked against, the number of combinations of the feature pattern and standard patterns to be compared becomes huge. It therefore takes a long time to obtain a recognition result, and the recognition rate tends to be low because the number of words or sentences that have similar standard patterns also increases.
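The checking step described above can be sketched as a nearest-template search. The following is a minimal illustration only, assuming feature patterns are fixed-length numeric vectors compared by squared Euclidean distance; all names and data here are hypothetical, not from the original text.

```python
# Minimal sketch of checking an input feature pattern against every
# standard pattern (template) in a speech recognition database.

def distance(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def recognize(feature, database):
    # database maps a word to its standard pattern (template).
    # With no restriction on the range of templates, every entry
    # must be checked, so cost grows with the database size.
    best_word, best_dist = None, float("inf")
    for word, template in database.items():
        d = distance(feature, template)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word

templates = {"play": [0.9, 0.1, 0.2], "stop": [0.1, 0.8, 0.3]}
print(recognize([0.85, 0.15, 0.25], templates))  # → play
```

The sketch makes the stated problem visible: without restricting the candidate range, every template is compared, and the more templates lie close together, the more easily the minimum-distance choice goes wrong.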
As an example of related art, JP-A-2005-85433 (Abstract) discloses a playback device that is designed to enable designation of a content to be played, such as a music track, by voice without the need for advance preparation or a large-scale dictionary. This playback device loads title data contained in TOC data stored on a CD, converts the title data into the same format as the speech recognition results, and retains the converted data as candidate data. When a title of a track is input by voice, the playback device performs speech recognition processing, compares the speech recognition result with the candidate data, and plays the track corresponding to the best-matching candidate data. Consequently, the track to be played can be designated by the user's voice, and the number of user operations, such as display confirmations and button operations, is thus reduced.
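The flow described for this playback device can be sketched as follows. The conversion into the recognition-result format (here, simply lower-cased strings) and the string-similarity matching are assumptions for illustration only; the actual device's representations are not specified here.

```python
# Illustrative sketch of the playback flow: TOC titles are converted
# into the same representation as speech recognition results and
# retained as candidates; the spoken title is matched against them.
import difflib

def build_candidates(toc_titles):
    # Convert each title into the "recognition result" format,
    # keeping a link back to the original title.
    return {title.lower(): title for title in toc_titles}

def pick_track(recognized_text, candidates):
    # Return the original title whose candidate data best matches
    # the speech recognition result.
    match = difflib.get_close_matches(recognized_text.lower(),
                                      candidates.keys(), n=1, cutoff=0.0)
    return candidates[match[0]] if match else None

candidates = build_candidates(["Moonlight Sonata", "Clair de Lune"])
print(pick_track("moonlight sonata", candidates))  # → Moonlight Sonata
```

Restricting the candidates to the tracks on the CD keeps the comparison small, which is the point of the related-art device.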
According to the playback device disclosed in JP-A-2005-85433 (Abstract), options in speech recognition are limited to the tracks stored on the CD, and the title data, which is text information, is converted into candidate data in the same format as the speech recognition results. However, the processing load for converting text information such as title data into candidate data is considerable, and it is difficult for devices that perform a wide variety of information processing, typified by navigation devices, to quickly perform operations associated with this conversion processing, such as creation or updating of a speech recognition dictionary, in parallel with other information processing already being performed. For this reason, a problem arises in that the speech recognition processing is prolonged.
JP-A-2011-39202 (paragraphs 0004 to 0010) discloses an in-vehicle information processing device that is designed to enable speech recognition to be performed while a speech recognition dictionary used for speech recognition is being updated. This in-vehicle information processing device includes: a connecting unit to which an information terminal having information data and attribute data containing identification information for specifying that information data is connected; a speech recognition dictionary creating unit that creates a speech recognition dictionary by acquiring the attribute data in the information terminal, converting a part of that attribute data into speech recognition information, and associating the speech recognition information with the identification information; a dictionary storing unit that stores the created speech recognition dictionary; a speech recognition processing unit that performs speech recognition processing of processing input speech and detecting, from the speech recognition dictionary, the identification information associated with the speech recognition information corresponding to that speech; and an information data acquiring unit that, as a result of the detected identification information being set therein, acquires the information data in the information terminal on the basis of that identification information. The in-vehicle information processing device outputs information that is based on the acquired information data.
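The dictionary flow just described can be sketched as follows. The data shapes and the conversion into speech recognition information (here, mere text normalization standing in for, e.g., a phonetic reading) are assumptions for illustration, not details from JP-A-2011-39202.

```python
# Sketch: attribute data from the information terminal is partly
# converted into speech recognition information and associated with
# identification information, which is later used to fetch the
# corresponding information data.

def to_recognition_info(text):
    # Stand-in for converting text into a recognizable form.
    return text.strip().lower()

def create_dictionary(attribute_data):
    # attribute_data: list of (identification_info, text) pairs.
    # The dictionary maps speech recognition information to the
    # identification information specifying the information data.
    return {to_recognition_info(text): ident
            for ident, text in attribute_data}

def acquire(recognized, dictionary, terminal_data):
    # Detect identification info for the recognized speech, then
    # acquire the corresponding information data from the terminal.
    ident = dictionary.get(to_recognition_info(recognized))
    return terminal_data.get(ident)

dictionary = create_dictionary([(101, "Morning News"), (102, "Jazz Hour")])
print(acquire("jazz hour", dictionary, {101: "news.mp3", 102: "jazz.mp3"}))
# → jazz.mp3
```

The sketch also shows where the timing problem arises: until `create_dictionary` finishes for newly acquired attribute data, any lookup necessarily runs against an older dictionary.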
The in-vehicle information processing device disclosed in JP-A-2011-39202 (paragraphs 0004 to 0010) includes a judgement unit that, while the speech recognition dictionary creating unit is in the process of creating a speech recognition dictionary, causes speech recognition processing to be performed with a speech recognition dictionary stored in the dictionary storing unit and judges whether or not the identification information detected by that speech recognition processing matches the identification information in the information terminal. The identification information to be set in the information data acquiring unit is changed depending on whether or not they match, so that different information data is acquired. However, there are cases where, after new attribute data has been acquired, a favorable speech recognition result cannot be obtained even if speech recognition processing is performed with the not-yet-updated speech recognition dictionary stored in the dictionary storing unit.
Moreover, in speech recognition, the degree of exactness or fuzziness in the recognition accuracy required when a word or sentence is to be recognized on the basis of a speech signal is set constant, irrespective of the number of words or sentences having similar standard patterns.
As an example of related art, JP-A-2008-64885 (paragraphs 0006 to 0010) discloses a speech recognition device that is designed to recognize a user's speech accurately even if the user's speech is ambiguous. This speech recognition device determines a control content of a control object on the basis of a recognition result regarding input speech and includes a task type determination unit that determines a task type indicating the control content on the basis of a given determination input, and a speech recognition unit that recognizes the input speech using, as a judgement object, a task of the type determined by the task type determination unit.
According to the speech recognition device disclosed in JP-A-2008-64885 (paragraphs 0006 to 0010), when the user's words are favorably recognized on the basis of a speech signal, even if what is to be controlled is not specified in the user's words, the control content of the control object can be determined by limiting the recognition object in accordance with an indication of how the control object is to be controlled. However, the degree of exactness or fuzziness in the recognition accuracy required when the user's words are to be recognized on the basis of a speech signal is set constant, and the recognition rate of speech recognition cannot be improved.
Generally, option information for use in speech recognition is contained in a speech recognition dictionary, but updating the speech recognition dictionary takes time, and it has therefore been difficult to update the option information during execution of speech recognition processing. For example, in the case where a plurality of questions are asked in order to judge a speaker's purpose from the replies to the respective questions, even if speech recognition scenarios in which the plurality of questions and a plurality of options for those questions are set are prepared in advance, it has been difficult to change the option information indicating the plurality of options for the plurality of questions that follow a great number of scenarios. Thus, an advantage of some aspects of the invention is to facilitate updating of the option information in speech recognition, thereby appropriately restricting the range of the option information and improving the recognition rate, or enabling a deep speech recognition hierarchical menu to be handled.
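A speech recognition scenario of the kind described above, in which each question carries its own option information, can be sketched as follows. The scenario content and the string-based matching are purely illustrative assumptions; a real recognizer would compare feature patterns against standard patterns rather than strings.

```python
# Sketch: each question in a scenario has its own option information,
# so the candidate range for recognition can be restricted to the
# options of the current question only.

scenario = [
    {"question": "Coffee or tea?", "options": ["coffee", "tea"]},
    {"question": "Hot or iced?",   "options": ["hot", "iced"]},
]

def recognize_reply(spoken, options):
    # Restrict checking to the current question's options.
    return spoken if spoken in options else None

replies = []
for step, spoken in zip(scenario, ["tea", "iced"]):
    replies.append(recognize_reply(spoken, step["options"]))
print(replies)  # → ['tea', 'iced']
```

Keeping the option information per question is what makes the candidate range small at each step; the difficulty noted above is that changing these per-question option lists across many such scenarios has required rebuilding the speech recognition dictionary.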
As described above, in speech recognition, the degree of exactness or fuzziness in the recognition accuracy required when a word or a sentence is recognized on the basis of a speech signal is set constant, irrespective of the number of words or sentences that have similar standard patterns. For this reason, speech recognition is performed under the same recognition conditions both in cases where the number of options is large and in cases where it is small, or both in cases where the options include a large number of similar words and in cases where they include only a few. Thus, there has been a problem in that the recognition rate of speech recognition does not improve.