1. Field
The following description relates to speech detection, and more particularly, to an apparatus and method for detecting speech to determine whether an input signal is a speech signal or a non-speech signal.
2. Description of the Related Art
Generally, voice activity detection (VAD) algorithms may be used to extract speech sections from a signal that includes a mix of speech and non-speech sections. A VAD algorithm extracts feature information, such as the energy of the input signal and changes in that energy, at regular time intervals, for example, every 10 ms, and divides the signal into speech sections and non-speech sections based on the extracted feature information. For example, according to G.729, which is one example of an audio codec standard, a speech section is detected using the extracted energy, a low-band energy, and a zero crossing rate (ZCR). The payload size for G.729 is 20 ms. Therefore, under the G.729 standard, the energy, the low-band energy, and the ZCR may be extracted from a signal over a time interval of 20 ms and used to detect a speech section in the signal.
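For illustration only, the frame-wise feature extraction described above may be sketched as follows. This is a minimal sketch, not the G.729 procedure: the function names, the 10 ms frame length, and the fixed energy threshold are assumptions chosen for the example, and the input is assumed to be a list of samples normalized to the range -1 to 1.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=10):
    """Split samples into non-overlapping frames of frame_ms milliseconds.

    A trailing partial frame is dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len < 1:
        raise ValueError("frame length must be at least one sample")
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def detect_speech(samples, sample_rate, energy_threshold=0.01, frame_ms=10):
    """Label each frame True (speech-like) or False using energy alone.

    The fixed threshold is a hypothetical value for this example; a
    practical detector would combine several features and adapt to noise.
    """
    frames = frame_signal(samples, sample_rate, frame_ms)
    return [frame_energy(f) > energy_threshold for f in frames]


# Example: 10 ms of silence followed by 10 ms of a 200 Hz tone at 8 kHz.
rate = 8000
silence = [0.0] * 80
tone = [0.5 * math.sin(2 * math.pi * 200 * t / rate) for t in range(80)]
decisions = detect_speech(silence + tone, rate)
```

In this example, the silent frame falls below the energy threshold and the tone frame exceeds it. The ZCR is computed separately so that it may be combined with the energy when low-energy unvoiced sounds need to be detected.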
A system for speech detection extracts feature information from individual frames, and determines whether each frame includes speech based on the extracted feature information. For example, feature information such as the energy of the signal or a ZCR of the signal may be used to detect an unvoiced speech signal. Unlike a voiced speech signal, whose periodicity is useful for speech detection, an unvoiced speech signal does not have periodicity. The feature information suited to detecting speech may also differ depending on the type of noise signal. For example, it may be difficult to detect speech using periodicity information when music sounds are input as noise. Therefore, feature information that is generally less affected by noise, for example, spectral entropy or a periodic/aperiodic component ratio, may be extracted and used. Also, a noise level or a feature of the noise may be estimated, for example, by a noise estimation module, and a model or parameters may be changed according to the estimated information.
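As one illustration of the noise-robust feature information mentioned above, spectral entropy may be computed per frame as sketched below. The description does not define the feature; this sketch assumes a common formulation, namely the Shannon entropy of the normalized power spectrum, scaled to the range 0 to 1. The function names and the direct (non-FFT) transform are assumptions made to keep the example self-contained.

```python
import cmath
import math


def power_spectrum(frame):
    """One-sided power spectrum from a direct DFT (bins 0 through n // 2)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_entropy(frame, eps=1e-12):
    """Normalized Shannon entropy of the frame's power spectrum.

    Values near 0 mean the energy sits in a few frequency bins, and
    values near 1 mean the energy is spread evenly across the bins.
    Returns 0.0 for a frame with negligible total power.
    """
    power = power_spectrum(frame)
    total = sum(power)
    if total <= eps or len(power) < 2:
        return 0.0
    probs = [p / total for p in power]
    entropy = -sum(p * math.log2(p) for p in probs if p > eps)
    return entropy / math.log2(len(power))


# Example: a pure tone concentrates power in one bin, whereas an
# impulse spreads power evenly over all bins.
n = 64
tone = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
impulse = [1.0] + [0.0] * (n - 1)
tone_entropy = spectral_entropy(tone)
impulse_entropy = spectral_entropy(impulse)
```

Because the entropy depends on the shape of the spectrum and not on its overall level, scaling the frame amplitude leaves the value unchanged, which is one reason such a feature may be less sensitive to changes in signal or noise level than raw energy.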