The present invention relates to a man-machine system having speech recognition functions, and more specifically, to a man-machine system in which a user can input desired instructions in a simple manner at the user side, and in which desired processes can be performed properly in accordance with the user instructions at the man-machine system side.
Although the concept of man-machine systems initially concerned a system (device) constructed for enhancing the respective advantages of humans and computers, nowadays it is said that this concept also embraces systems which facilitate relationships between humans and more general machines (machines in a broader sense) as well as computers.
Man-machine systems, such as systems equipped with a speech recognition device in which a speaker (user) can convey (command) his or her intention through voice input, are known. For example, a navigation system using a GPS (Global Positioning System) cruising scheme for automobiles is known. In this navigation system, users can designate a destination, etc., through voice input.
When a user pronounces a desired destination, this navigation system speech-recognizes the destination; searches a driving path from the present location to the destination; and displays the searched driving path on a map through a display device.
For example, the navigation system recognizes the destination designated by the user by performing predetermined steps shown in FIG. 13. Suppose that the user wishes to know a driving path to “Meguro station”, which is the destination. First, a voice synthesizer in the speech recognition device generates a synthesized sound of “Please enter the name” in order to request the user to voice-input (pronounce) a specific destination name. If the user pronounces “Meguro station” in response, the speech recognition device extracts the characteristics of the voice of “Meguro station”, and temporarily stores the extracted characteristic parameters D1 in a memory part or the like. That is, at the first step, the speech recognition device only extracts the characteristics of the voice of “Meguro station” without performing final-stage recognition.
Next, at the second step, the voice synthesizer generates a synthesized sound of “Please enter a genre” in order to request the user to pronounce a genre, which is a higher level concept than the specific destination the user desires.
If the user pronounces “train station name” in response, the speech recognition device extracts the characteristics of this voice of “train station name” to generate the corresponding characteristic parameters D2. Further, the speech recognition device compares the characteristic parameters D2 with recognition reference vocabularies in a recognition word dictionary which has been pre-installed in the speech recognition device, and selects a recognition reference vocabulary LD2 which is most similar to the characteristic parameters D2, thereby conducting speech recognition of the voice of “train station name” pronounced by the user.
Next, at the third step, the voice synthesizer generates a synthesized sound of “Please enter a prefecture name” to request the user to pronounce a region name.
If the user pronounces “Tokyo” in response, the speech recognition device extracts the characteristics of this voice of “Tokyo” to generate the corresponding characteristic parameters D3. Further, the speech recognition device compares the characteristic parameters D3 with recognition reference vocabularies in the recognition word dictionary, and selects a recognition reference vocabulary LD3 which is most similar to the characteristic parameters D3, thereby conducting speech recognition of the voice of “Tokyo” pronounced by the user.
Next, at the fourth step, among recognition reference vocabularies in the recognition word dictionary, the speech recognition device narrows down recognition reference vocabularies to the ones belonging to the categories of the recognition reference vocabularies LD2 and LD3. Further, the speech recognition device compares the characteristic parameters D1 with the narrowed-down recognition reference vocabularies to select a recognition reference vocabulary LD1 which is most similar to the characteristic parameters D1, thereby conducting speech recognition of the voice of “Meguro station” pronounced at the first step.
That is, at the first step where the lower level concept of the name “Meguro station” is pronounced, it is in general difficult to identify the recognition reference vocabulary LD1 corresponding to “Meguro station”, which exists within the region the user desires.
Because of this difficulty, the characteristic parameters D1 of the pronounced voice of “Meguro station” are first stored in the memory part. Then, at the second through fourth steps, a searching range for recognition reference vocabularies in the recognition word dictionary is narrowed down by receiving voices of the genre and region name from the user. Then, by comparing the characteristic parameters D1 with the thus narrowed-down recognition reference vocabularies, the recognition reference vocabulary LD1 corresponding to “Meguro station” is relatively easily identified.
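The four steps described above can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the dictionary contents, the function names, and the use of text similarity (Python's `difflib.SequenceMatcher`) in place of a comparison of acoustic characteristic parameters are all hypothetical and are not taken from the navigation system itself.

```python
from difflib import SequenceMatcher

# Hypothetical recognition word dictionary: genre -> prefecture -> names.
DICTIONARY = {
    "train station name": {
        "Tokyo": ["Meguro station", "Mejiro station", "Tokyo station"],
        "Kyoto": ["Kyoto station", "Nijo station"],
    },
    "golf course": {
        "Tokyo": ["Meguro golf club"],
    },
}


def similarity(features, word):
    """Stand-in for acoustic similarity (assumption: text ratio, not real features)."""
    return SequenceMatcher(None, features.lower(), word.lower()).ratio()


def best_match(features, candidates):
    """Select the reference vocabulary most similar to the given features."""
    return max(candidates, key=lambda word: similarity(features, word))


def recognize_destination(name_input, genre_input, prefecture_input):
    d1 = name_input                                   # step 1: store D1, no recognition yet
    ld2 = best_match(genre_input, DICTIONARY.keys())  # step 2: recognize the genre
    ld3 = best_match(prefecture_input, DICTIONARY[ld2].keys())  # step 3: recognize the region
    ld1 = best_match(d1, DICTIONARY[ld2][ld3])        # step 4: match D1 in the narrowed range
    return ld2, ld3, ld1


result = recognize_destination("meguro staton", "train station", "tokyo")
```

Because D1 is compared only with the vocabularies under the recognized genre and prefecture, the final comparison runs over a much smaller candidate set than the whole dictionary.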
Finally, based upon the selected recognition reference vocabularies LD3 and LD1, a synthesized sound of “It is ◯◯◯ in ΔΔΔ, isn't it?” is generated to provide the user with the recognition result. That is, when the recognition reference vocabularies LD3 and LD1 are properly recognized as “Tokyo” and “Meguro station”, respectively, the synthesized sound of “It is ◯◯◯ in ΔΔΔ, isn't it?” becomes a synthesized sound of “It is Meguro station in Tokyo, isn't it?”, and is presented to the user as such.
Thus, the speech recognition device merely selects the recognition reference vocabularies LD1 to LD3 from the recognition word dictionary as the vocabularies most similar to the respective words pronounced by the user. Accordingly, there is an inevitable possibility that “Meguro station” is wrongly recognized as “Mejiro station”, or “Tokyo” is wrongly recognized as “Kyoto”, etc., in the case where the user's pronounced voice was not clear or in some other circumstances. If such misrecognition occurs, a synthesized sound of “It is Mejiro station in Kyoto, isn't it?” would be presented to the user. Thus, the synthesized sound is generated based on recognition reference vocabularies LD3 and LD1 in order to ask for the user's confirmation of the recognition results, as described above.
If, on hearing the synthesized voice thus presented, the user determines that the speech recognition is correct, the user pronounces “search start”, for example. Then, the speech recognition device recognizes this, and the navigation system receives a confirmation instruction and searches a driving path from the current location to Meguro station in Tokyo. The navigation system then displays the searched driving path on a map through a display device.
On the other hand, if the user determines that the recognition is wrong, the user indicates so by pronouncing “return”. Receiving that instruction, the speech recognition device restarts speech recognition, and repeats the speech recognition until it receives the instruction of “search start” from the user with respect to the re-presented recognition result.
As explained above, the navigation system possesses a superior functionality in that it enables conversational operations by the combination of a speech recognition device and a voice synthesizer.
Also, because the user is led to pronounce words, which become keywords, in the order which matches the user's thought characteristics, the system provides the user with improved convenience. In other words, in designating the desired destination, the user designates the most specific destination (Meguro station in the example above), and then designates its genre and the region name where that destination exists. Thus, the man-machine system matches the user's thought characteristics.
More specifically, this information search system employs an efficient information management scheme in which a category of the highest level concept is determined, and then information of an intermediate level concept and a lower level concept which relate to the higher level concept of the category is managed in a hierarchical manner. By adopting such a hierarchical structure, when a user searches specified information from a large amount of lower level concept information, the target information is narrowed down by utilizing the higher level concept and the intermediate level concept, thereby enabling rapid access to the desired specified information.
However, when a man-machine system is constructed using search procedures similar to, but different from, such an information search system, there are situations where the user's thought characteristics are not properly respected. An example of such cases is as follows. Referring to the navigation system, suppose that the higher level concept category, “genre”, is first asked of the user, and the user pronounces “train station name” in response; then the intermediate level concept, “prefecture name”, is asked of the user, and the user pronounces “Tokyo” in response; and finally the lower level concept, “specific train station name”, is requested of the user, and the user pronounces “Meguro station” in response. In this case, the inquiries are made in an order different from the user's thought characteristics, and as a result, the user is given an awkward feeling.
From this point of view, the conventional navigation system causes the user to input user desired items in an order which is not awkward, and accordingly provides improved convenience to users.
However, even in the conventional navigation system, there are cases where the following drawbacks occur due to the employment of the speech recognition scheme which matches the user's thought characteristics.
For example, in the case of FIG. 13, the pronounced sound of “Meguro station” is not speech-recognized at the first step. The sound of “Meguro station” is speech-recognized and the recognition result is presented only after the narrowing-down is performed at the second through fourth steps.
When a recognition error occurs, the instruction of “return” is received, and the speech recognition is repeated to correct the error.
However, the instruction, “return”, means a command: “return to the one-step previous process and reinitiate the process”. Because of this, if the destination “Meguro station” is wrongly recognized, the user must pronounce “return” three times to return to the first step from the fourth step in order to repeat the processes of the first through fourth steps shown in FIG. 13. This is a significant drawback because the user is forced to conduct cumbersome operations. Similarly, if “train station name” is wrongly recognized, the user must pronounce “return” twice to return to the second step from the fourth step in order to repeat the processes of the second through fourth steps shown in FIG. 13, thereby forcing the user to conduct cumbersome operations, which is undesirable.
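The cost of this one-step “return” behavior can be illustrated with a small sketch. The function name and the step numbering convention (1 through 4, following FIG. 13 as described above) are assumptions made only for illustration.

```python
def returns_needed(current_step, wrong_step):
    """Count the "return" utterances needed when each one steps back by one step."""
    count = 0
    step = current_step
    while step > wrong_step:
        step -= 1   # one "return" goes back to the one-step previous process
        count += 1
    return count


# Error in the name (first step), noticed after the fourth step.
name_error = returns_needed(4, 1)
# Error in the genre (second step), noticed after the fourth step.
genre_error = returns_needed(4, 2)
```

The number of utterances grows with the distance between the step at which the error is noticed and the step at which the wrongly recognized word was entered.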
Thus, the conventional navigation system responds to recognition errors by providing the function of rewriting (replacing) the previously voice-inputted information with newly voice-inputted information when “return” is pronounced. However, this function simply amounts to repeating the speech recognition, and does not provide for functions by which the user can instruct correction through simple operations.
Accordingly, it has the drawback of forcing users to perform cumbersome operations.
The present invention is provided to obviate the problems of the conventional art. An object of the present invention is to provide a man-machine system equipped with a speech recognition device, which enables users to conduct easy conversational operations (correction, etc., for example).
To achieve the object, the present invention provides a man-machine system equipped with a speech recognition device. The speech recognition device has one or more processing functions and performs the one processing function in a conversational manner using voice as an information communication medium. The speech recognition device comprises a control part, wherein the control part pre-stores control words which correspond to the respective processing functions, and wherein the control part presents the one processing function. When voice input information having instruction information which designates the one processing function is inputted from outside in response to the presentation, the control part recognizes the voice input information, and performs the one processing function in accordance with the control words which correspond to the instruction information.
Also, as a further construction, the control words may be a combination of a control command word instructing operation of the processing function and a controlled object word indicating an object to be processed by the control command word. Then, when the voice input information having the instruction information which indicates the controlled object word and the control command word is inputted from outside, the control part performs the processing function in accordance with the control words, which comprise the controlled object word and the control command word and which correspond to the instruction information.
Also, the control words may be a combination of a control command word instructing operation of the processing function and a controlled object word indicating an object to be processed by the control command word. The controlled object word may be determined by the instruction information included in the voice input information. Then, when the voice input information having instruction information of the control command word is inputted after the voice input information having instruction information of the controlled object word is inputted, the control part performs the processing function in accordance with the control words which comprise the controlled object word and the control command word.
According to these constructions, the control part receives instruction information included in the voice input information as corresponding to control words, and performs the processing function instructed by the instruction information based upon the control words. Thus, by registering various kinds of control command words which control respective processing functions in these control words, a variety of processes corresponding to instruction information can be performed.
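As a rough illustration of control words formed from a controlled object word and a control command word, consider the sketch below. The specific words (“genre”, “prefecture”, “name”, “correction”, “next candidate”), the comma-separated input form, and the function name are hypothetical examples, not words or formats registered in the system described here.

```python
# Hypothetical registered words (assumptions for illustration only).
CONTROLLED_OBJECT_WORDS = {"genre", "prefecture", "name"}
CONTROL_COMMAND_WORDS = {"correction", "next candidate"}


def parse_control_words(utterance, pending_object=None):
    """Return (controlled object word, control command word), or None if no command.

    pending_object models the case where the controlled object word was
    determined by earlier voice input and only the command word follows.
    """
    parts = [p.strip().lower() for p in utterance.split(",") if p.strip()]
    command = next((p for p in parts if p in CONTROL_COMMAND_WORDS), None)
    if command is None:
        return None
    obj = next((p for p in parts if p in CONTROLLED_OBJECT_WORDS), pending_object)
    return obj, command


combined = parse_control_words("prefecture, correction")
deferred = parse_control_words("correction", pending_object="name")
```

The first call models the construction in which both words arrive in one voice input; the second models the construction in which the command word arrives after the controlled object word has already been determined.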
In another aspect, the present invention provides a man-machine system equipped with a speech recognition device. The speech recognition device has one or more processing functions and performs the one processing function in a conversational manner with voice as an information communication medium. The speech recognition device includes a memory part pre-storing a plurality of reference information; and a control part having a recognition result storage region for storing one or more of reference information which have a similarity higher than a predetermined similarity standard as recognition information by comparing voice input information inputted by the voice with the reference information stored in the memory part. The control part further possesses control words which correspond to the respective processing functions. Here, the control part presents the one processing function, and when voice input information having instruction information which designates the one processing function is inputted from outside in response to the presentation, the control part performs the one processing function with respect to the recognition information stored in the recognition result storage region in accordance with the control words which correspond to the instruction information.
With this construction, similarly to above, by registering various kinds of control command words which control respective processing functions in the control words, a variety of processes corresponding to instruction information can be performed.
Further, the recognition information may be one or more of vocabulary information obtained by comparing the voice input information formed by vocabularies pronounced by a speaker with the reference information, and the control words may be instruction information for correcting the vocabulary information.
The instruction information for correcting the vocabulary information may be control information which designates and corrects one of the one or more vocabulary information.
The instruction information for correcting the vocabulary information may be control information which corrects the one or more vocabulary information by selecting the next candidate successively.
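Correction by successive selection of the next candidate might look like the following sketch. The class name, the similarity scores, and the threshold value are hypothetical; the sketch only assumes that the recognition result storage region keeps every candidate whose similarity exceeds the predetermined standard, ordered from most to least similar.

```python
class RecognitionResultStore:
    """Hypothetical recognition result storage region holding ranked candidates."""

    def __init__(self, scored_candidates, threshold):
        # Keep only candidates above the similarity standard, best first.
        kept = [(word, score) for word, score in scored_candidates if score > threshold]
        kept.sort(key=lambda pair: pair[1], reverse=True)
        self.candidates = [word for word, _ in kept]
        self.index = 0

    def current(self):
        return self.candidates[self.index] if self.candidates else None

    def next_candidate(self):
        """Advance to the next candidate; stay on the last one if none remain."""
        if self.index + 1 < len(self.candidates):
            self.index += 1
        return self.current()


# Illustrative scores only (not measured values).
store = RecognitionResultStore(
    [("Mejiro station", 0.90), ("Meguro station", 0.85), ("Tokyo station", 0.30)],
    threshold=0.5,
)
first = store.current()
second = store.next_candidate()
```

In this sketch the user does not need to pronounce the destination again: the instruction simply moves to the next stored candidate.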
Furthermore, the control words may be the instruction information of a combination of a control command word instructing an operation of correction corresponding to the processing function and a controlled object word corresponding to the vocabulary information which is an object to be processed by the control command word. In this case, when the voice input information having the instruction information is inputted from outside, the control part may perform the correction in accordance with the control words, which comprise the controlled object word and the control command word and which correspond to the instruction information.
Moreover, the memory part may store the plurality of reference information in a hierarchical structure which has plural classifications in terms of attribute categories ranging from an upper level concept to a lower level concept. The memory part may further have an information storage part for storing the voice input information formed of vocabularies pronounced by the speaker. The control part presents the processing function corresponding to two or more of the attributes, and in response thereto, stores the vocabularies corresponding to the two or more of the attributes in the recognition result storage region as the recognition information for the respective attributes. When an instruction to correct the recognition information having an attribute of the higher level concept is given by the instruction information thereafter, the control part performs correction of the recognition information having the attribute of the higher level concept, and compares the voice input information, which is stored in the information storage part and has an attribute which is a lower level concept than the higher level concept, with the reference information, which is stored in the memory part and which has an attribute of a lower level concept which depends from the higher level concept. The control part thereby re-stores one or more of reference information which has a similarity higher than a predetermined similarity standard in the recognition result storage region as recognition information.
Here, the control part may be constructed such that when the voice input information is re-compared, the control part refers to reference information, which is reference information having the same attribute as the recognition information of the lower level concept, and which excludes reference information identical to the recognition information of the lower level concept.
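The re-comparison described above can be sketched as follows. This is a minimal sketch under several assumptions: the reference data, the threshold, the function names, and the use of text similarity (`difflib.SequenceMatcher`) as a stand-in for comparing stored voice input with reference information are all hypothetical. The optional `exclude` argument models leaving out reference information identical to the earlier lower level recognition result.

```python
from difflib import SequenceMatcher

# Hypothetical hierarchical reference information: prefecture -> station names.
REFERENCES = {
    "Tokyo": ["Meguro station", "Mejiro station"],
    "Kyoto": ["Kyoto station", "Nijo station"],
}


def similarity(stored_input, reference):
    """Stand-in for acoustic similarity (assumption: text ratio)."""
    return SequenceMatcher(None, stored_input.lower(), reference.lower()).ratio()


def recompare(stored_input, corrected_upper, threshold=0.5, exclude=None):
    """Re-compare stored lower level input against references under the corrected upper level."""
    scored = [
        (similarity(stored_input, ref), ref)
        for ref in REFERENCES[corrected_upper]
        if ref != exclude  # skip the previously presented result, if any
    ]
    kept = sorted((pair for pair in scored if pair[0] > threshold), reverse=True)
    return [ref for _, ref in kept]


# The prefecture was corrected to "Tokyo"; the stored name input is reused.
after_correction = recompare("meguro station", "Tokyo")
without_previous = recompare("meguro station", "Tokyo", exclude="Mejiro station")
```

Because the voice input for the lower level concept is kept in the information storage part, the user is not asked to pronounce the name again after correcting the higher level concept.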
Also, the memory part may store the plurality of reference information in a hierarchical structure which has plural classifications in terms of attribute categories ranging from an upper level concept to a lower level concept. The memory part may further have an information storage part for storing the voice input information formed of vocabularies pronounced by the speaker, wherein the control part presents the processing function corresponding to two or more of the attributes, and in response thereto, stores the vocabularies corresponding to the two or more of the attributes in the recognition result storage region as the recognition information for the respective attributes. When an instruction to correct the recognition information having an attribute of the lower level concept by selecting a next candidate successively is given by the instruction information, the control part presents recognition information which is selected as the next candidate as new recognition information.
According to these constructions, the instruction information for correction to be inputted from outside is received as the command words having information regarding various correction processes. Then, the various correction processes are conducted based on the control words. In particular, when recognition information becomes meaningful only when the information is hierarchically organized, correction of recognition information having an attribute of a higher level concept affects recognition information having an attribute of a lower level concept. Conversely, correction of recognition information having an attribute of a lower level concept affects recognition information having an attribute of a higher level concept. Thus, appropriate processing becomes necessary. In that case, by providing control words with functions capable of performing appropriate correction operations, it becomes possible to perform speedy and appropriate correction processing.
For example, when a destination is inputted in a navigation system, and if recognition information as the recognition result is in error, or the destination needs to be changed, prompt correction processing is desirable. In such a case, by using control words, it becomes possible to conduct prompt conversational operations.