1. Field of the Invention
The present invention relates to the technical field of speech recognition and, more particularly, to a system and a method for obtaining reliable speech recognition coefficients in a noisy environment.
2. Description of Related Art
Owing to progress in speech recognition technology, the use of speech recognition to control various machines has made daily life more convenient. For example, in an office environment, data input, identity recognition, computer control, and similar tasks can be performed correctly by speech recognition. However, in a noisy environment, such as in a car, recognition accuracy is seriously degraded as noise enters the recognition system. As a result, the performance of speech recognition applications is unsatisfactory.
In addition, according to actual driving tests, the use of speech recognition to control a car can effectively reduce the number of errors made by the driver. Furthermore, the combination of the car with a navigation system or an intelligent road safety system will be a major issue in the development of automotive and information technology. Therefore, obtaining network information conveniently and safely has become an important topic for the driver. Because speech is a direct and convenient means of communication, speech technology will doubtless play an important role in obtaining information from a mobile network. However, unlike the general office environment mentioned above, speech recognition in a car environment encounters a more severe noise problem. Moreover, for cost reasons, hardware resources are also restricted.
According to the prior art, the slope of the speech energy waveform is an important coefficient for speech recognition. With reference to FIG. 1, in a car environment, the contour of speech is completely destroyed by strong noise, resulting in an invalid contour identification. As known in the prior art, a typical speech energy $E_t$ can be expressed as follows:
$$E_t = \frac{1}{N}\sum_{i=1}^{N} x_t^2[i] \qquad (1)$$
where $N$ is the number of speech samples in a frame and $x_t[i]$ is the i-th speech sample. The frequently used first-order and second-order delta coefficients, which are dynamic features describing the rate of variation over time, can be expressed as follows:
$$\frac{d\log(E_t)}{dt} \cong \frac{1}{T_D}\sum_{i=-D}^{D} i\,\log(E_{t+i}) \qquad (2)$$
$$\frac{d^2\log(E_t)}{dt^2} \cong \frac{d\log(E_{t+1})}{dt} - \frac{d\log(E_{t-1})}{dt} \qquad (3)$$
where $D$ is the number of speech frames spanned on either side of the current frame and
$$T_D = \sum_{i=-D}^{D} i^2.$$
In a less noisy environment, combining the above dynamic features with coefficient vectors consisting of other spectrum coefficients can increase speech recognition accuracy. However, with reference to FIG. 2, in the car environment, a logarithmic energy contour is obtained by calculating
$$E_t = \frac{1}{N}\sum_{i=1}^{N} x_t^2[i].$$
This contour is not desirable because severe noise has completely destroyed the contour of the speech energy obtained from equation (1), resulting in an invalid contour identification.
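For illustration only, the prior-art quantities of equations (1) to (3) can be sketched in Python as follows. This is a minimal sketch and not an implementation taken from the prior art: the function names, the `floor` value used to avoid taking the logarithm of zero, and the repetition of the first and last contour values at the frame boundaries are all assumptions introduced here.

```python
import math


def frame_energy(frame):
    """Equation (1): mean of the squared samples x_t[i] in one frame."""
    return sum(x * x for x in frame) / len(frame)


def log_energy_contour(frames, floor=1e-12):
    """Return log(E_t) for every frame.

    `floor` is an assumption of this sketch; it prevents log(0) on an
    all-zero frame.
    """
    return [math.log(max(frame_energy(f), floor)) for f in frames]


def delta(contour, D=2):
    """Equation (2): first-order delta over frames t-D .. t+D.

    Frames outside the contour are replaced by the first or last value
    (an assumed boundary rule, not stated in the text).
    """
    T_D = sum(i * i for i in range(-D, D + 1))
    last = len(contour) - 1
    out = []
    for t in range(len(contour)):
        acc = 0.0
        for i in range(-D, D + 1):
            k = min(max(t + i, 0), last)  # clamp the index at the edges
            acc += i * contour[k]
        out.append(acc / T_D)
    return out


def delta_delta(first_order):
    """Equation (3): first-order delta at t+1 minus first-order delta at t-1."""
    last = len(first_order) - 1
    return [
        first_order[min(t + 1, last)] - first_order[max(t - 1, 0)]
        for t in range(len(first_order))
    ]
```

Under these assumptions, a speech signal split into frames would be passed to `log_energy_contour`, and the result to `delta` and then `delta_delta`. Away from the boundaries, a contour that rises by a constant amount per frame yields a first-order delta equal to that amount and a second-order delta of zero.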
Therefore, it is desirable to provide a novel system and method for obtaining reliable speech recognition coefficients in a noisy environment so as to increase the speech recognition accuracy.