1. Technical Field
The present invention relates to a humanoid robot and a method of controlling the same. More particularly, the present invention relates to a technology applicable to, and effective in, realization of natural motion of the robot and enhancement of voice recognition accuracy therein.
2. Description of the Related Art
Humanoid robots have been developed in recent years and they are causing a stir. A humanoid robot is quite different from application-specific or function-specific robots, such as assembly robots or welding robots, used mainly in production lines or the like. The humanoid robot has a head, a body, and limbs modeled after a human being. The humanoid robot also includes sensors corresponding to acoustic, optical, and tactile senses. An attempt also has been made to allow the humanoid robot to perform voice recognition using an acoustic sensor corresponding to the acoustic sense.
When a humanoid robot performs voice recognition, it is expected to recognize voices arriving from an arbitrary direction. Voice recognition requires voice capture with microphones. Omnidirectional microphones are not preferable for this purpose because they also capture noise and untargeted sounds. It is therefore desirable to estimate the direction of a sound source by use of a microphone array, for example, and to vary directivity as desired through beam forming. Beam forming increases the gain of a sound from a targeted direction, which enhances, for example, the S/N ratio.
In general, a time difference (a phase difference) between signals captured by a plurality of microphones can be utilized for the directional estimation of a sound source with the microphone array. Specifically, as shown in FIG. 9, an assumption is made that an acoustic wave is incident at an angle θ with respect to the normal of the microphone array composed of a microphone 1 (MIC 1) and a microphone 2 (MIC 2). Assuming that the distance from the sound source is sufficiently large with respect to a space “d” between the microphone 1 and the microphone 2, the incident acoustic wave can be presumed to be a plane wave. Accordingly, when an acoustic signal captured with the microphone 1 is x1(t), an acoustic signal x2(t) captured with the microphone 2 is defined as:
x2(t)=x1(t−τs)  (Formula 1)
Here, τs is a time difference between x1(t) and x2(t). When the acoustic velocity is denoted as c, it is obvious from the drawing that:
τs=(d×sin θ)/c  (Formula 2)
Therefore, the direction θ of the sound source can be found by measuring the time difference τs, as defined by the following formula:
θ=sin⁻¹(c×τs/d)  (Formula 3)
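As an illustration of Formulas 2 and 3, the following minimal Python sketch converts between the arrival angle and the time difference. The function names, the speed-of-sound value of 343 m/s, and the clamping of the arcsine argument are assumptions made for this example and are not taken from the description above.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 degrees C


def delay_from_angle(theta_rad, mic_spacing_m, c=SPEED_OF_SOUND):
    """Formula 2: time difference for a plane wave arriving at angle theta."""
    return mic_spacing_m * math.sin(theta_rad) / c


def angle_from_delay(tau_s, mic_spacing_m, c=SPEED_OF_SOUND):
    """Formula 3: arrival angle recovered from a measured time difference."""
    ratio = c * tau_s / mic_spacing_m
    # Assumption: measurement error may push the ratio slightly outside
    # [-1, 1], so it is clamped to keep asin defined.
    ratio = max(-1.0, min(1.0, ratio))
    return math.asin(ratio)


# Round trip for a hypothetical 10 cm spacing and a 30 degree arrival angle.
tau = delay_from_angle(math.radians(30.0), 0.10)
theta_deg = math.degrees(angle_from_delay(tau, 0.10))
```

Because |sin θ| cannot exceed 1, the largest physically meaningful time difference is d/c, which is why the sketch clamps the ratio before taking the arcsine.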
The time difference τs can be found from a cross-correlation function of a plurality of captured signals or from a maximum value of power of a delay sum thereof. In the case of using the cross-correlation function, for example, a cross-correlation function φ12(τ) of x1(t) and x2(t) is defined as:
$$
\begin{aligned}
\phi_{12}(\tau) &= E\left[x_1(t)\cdot x_2(t+\tau)\right] \\
&= E\left[x_1(t)\cdot x_1(t+\tau-\tau_s)\right] \\
&= \phi_{11}(\tau-\tau_s)
\end{aligned}
\qquad \text{(Formula 4)}
$$
In other words, φ12(τ) is the autocorrelation function of x1(t) shifted by τs, which is expressed as φ11(τ−τs). Note that E[·] herein denotes an expected value.
Since the autocorrelation function φ11(τ) takes the maximum value at τ=0, the above-mentioned formula takes the maximum at τ=τs. Accordingly, τs can be obtained by calculating φ12(τ) from x1(t) and x2(t) and finding τ that gives the maximum value of φ12(τ).
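The peak search described above can be sketched in discrete time as follows. This is a minimal illustration under stated assumptions: the signals are sampled, the expected value E[·] is replaced by an average over the overlapping samples, the delay is a whole number of samples, and the function names and test signal are hypothetical.

```python
import random


def cross_correlation(x1, x2, lag):
    """Sample estimate of phi12(lag) = E[x1(t) * x2(t + lag)]."""
    total = 0.0
    count = 0
    for t in range(len(x1)):
        u = t + lag
        if 0 <= u < len(x2):  # use only samples where both signals exist
            total += x1[t] * x2[u]
            count += 1
    return total / count if count else 0.0


def estimate_delay_samples(x1, x2, max_lag):
    """Return the lag in [-max_lag, max_lag] that maximizes phi12."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cross_correlation(x1, x2, lag))


# Illustrative input: broadband noise and a copy delayed by 3 samples,
# that is, x2(t) = x1(t - 3) as in Formula 1.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
x1 = [rng.gauss(0.0, 1.0) for _ in range(400)]
true_delay = 3
x2 = [0.0] * true_delay + x1[:-true_delay]
estimated = estimate_delay_samples(x1, x2, max_lag=10)
```

Dividing the estimated lag by the sampling rate gives τs in seconds, which can then be substituted into Formula 3.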
As described above, it is possible to estimate the direction of the sound source by use of the microphone array. Moreover, beam forming is feasible by calculating the delay sum of the signals corresponding to the direction of the sound source and by using the power of the delay sum as a signal.
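A delay sum for two microphones might look like the sketch below. It assumes integer sample delays, zero padding at the signal edges, and an equal-weight average of the two channels; these choices and all names are made for illustration only.

```python
import random


def delay_and_sum(x1, x2, delay_samples):
    """Advance x2 by delay_samples so it lines up with x1, then average."""
    out = []
    for t in range(len(x1)):
        u = t + delay_samples
        s2 = x2[u] if 0 <= u < len(x2) else 0.0  # zero padding at the edges
        out.append(0.5 * (x1[t] + s2))
    return out


def power(signal):
    """Mean squared value of the signal."""
    return sum(v * v for v in signal) / len(signal)


def steer_by_max_power(x1, x2, max_lag):
    """Return the delay whose delay sum has the greatest power."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: power(delay_and_sum(x1, x2, lag)))


# Illustrative input: x2 is x1 delayed by 3 samples (Formula 1).
rng = random.Random(1)  # fixed seed so the sketch is repeatable
x1 = [rng.gauss(0.0, 1.0) for _ in range(400)]
x2 = [0.0] * 3 + x1[:-3]
best_delay = steer_by_max_power(x1, x2, max_lag=10)
beam_output = delay_and_sum(x1, x2, best_delay)
```

When the applied delay matches τs, the two channels add coherently and the output power is greatest; sounds from other directions add with mismatched delays and are attenuated.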
The beam forming can be performed by calculation in a high speed digital signal processor (DSP); therefore, the directivity can be varied rapidly in comparison with motion of a robot. Moreover, a directional beam needs to have proper hysteresis so as not to sensitively respond to sporadic sounds. However, as a direction of the directional beam is invisible, a speaker cannot recognize a directivity direction (the direction of voices to be recognized by the robot) of the robot. As a result, there are situations in which the robot recognizes voices from unexpected directions, or where the robot does not sufficiently recognize voices from the direction that the speaker expects (generally the direction along a visual line of the robot). Such aspects may cause discomfort to the speaker, who is expecting natural motion of the robot.
Moreover, accuracy of the above-described directional estimation of the sound source is restricted by a frequency bandwidth of the signal. In the above-described mode, the time difference τs is found by detecting a peak of the cross-correlation function. However, the peak of φ12 becomes gentle if the bandwidths of the signals x1 and x2 are narrow, and becomes sharp if the bandwidths are wide. Since a sharper peak can be located more accurately, the accuracy of the directional estimation of the sound source consequently varies with the signal bandwidths.
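The dependence of peak sharpness on bandwidth can be illustrated numerically. In the sketch below, white noise stands in for a wideband signal and a moving-average filter is used as a crude way to narrow its bandwidth; both choices, the filter width of 8, and the "flatness" measure φ11(1)/φ11(0) are assumptions introduced for this example.

```python
import random


def autocorrelation(x, lag):
    """Sample estimate of phi11(lag) for a non-negative lag."""
    n = len(x) - lag
    return sum(x[t] * x[t + lag] for t in range(n)) / n


def moving_average(x, width):
    """Crude low-pass filter that narrows the bandwidth of x."""
    return [sum(x[t:t + width]) / width for t in range(len(x) - width + 1)]


def peak_flatness(x, lag=1):
    """phi11(lag) / phi11(0): near 1 is a gentle peak, near 0 a sharp one."""
    return autocorrelation(x, lag) / autocorrelation(x, 0)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
wideband = [rng.gauss(0.0, 1.0) for _ in range(2000)]
narrowband = moving_average(wideband, 8)

flatness_wide = peak_flatness(wideband)
flatness_narrow = peak_flatness(narrowband)
```

For the narrowband signal the correlation one sample away from the peak stays close to the peak value, so a small amount of noise can shift the detected maximum; for the wideband signal the correlation falls off almost immediately.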
Although an increase in the number of the microphones or widening of the space “d” between the microphones may conceivably enhance the accuracy for the direction θ of the sound source, such a mode may incur an increase in a physical scale of the microphone array. Such increased size can be unsuitable for a small system.