Speech recognition is a technical field that encompasses various techniques and systems for identifying human speech in received audio signals, and may also include identifying word sequences in those signals in order to determine the content of the detected speech. Automatic identification of words spoken by one or more speakers in an area in which audio signals are picked up is a challenging task, typically performed using sophisticated identification engines that can recognize phonemes, syllables and semi-syllables of a specific language.
A phoneme is defined as the smallest segment of sound that is produced when speaking. Most languages have far fewer possible phonemes than other speech segments, such as the possible syllables of the specific language, and therefore word identification is usually done using phoneme identification.
Speech recognition engines and/or algorithms usually receive an audio signal pattern of the detected sound and analyze the signal pattern both for speech recognition and for identification of phonemes.
In some speech recognition systems it is very difficult to identify each phoneme with high certainty, and often each segment in the audio pattern is associated with more than one optional phoneme that may fit the sequence. This is due to various factors that influence the identification quality, such as (i) the sound quality of the audio signal, which can depend on noise and the quality of noise reduction as well as on the number of speakers in the area speaking simultaneously; (ii) the language; (iii) the analysis algorithm; and the like. It is often very difficult to identify where one word ends and another begins, especially when several speakers are speaking simultaneously in continuous speech. Some word identification engines use probability calculations in which a phoneme is selected out of several optional ones according to its probability of following the previously detected phoneme or word. For example, if the already identified preceding word is “I” and the optional next phonemes are “ah” or “ee”, then it is much more likely that “ah” is the right next phoneme, making up the word “am”. These systems often use one or more data sources, such as vocabularies of words, syllables and phonemes, each of which must include an indication of the interrelations between probable sequential word segments (such as phonemes and/or syllables). This leads to many complicated algorithms that rely upon linguistic studies and statistics of phoneme, syllable and/or word combinations of each language. These algorithms therefore take up considerable storage space and calculation time, and often fail to output a word sequence that makes sense if the audio signal is noisy or if a slightly unusual word combination or phrasing is used.
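The probability-based selection described above can be illustrated with a minimal sketch. It is not the method of any particular engine: the transition table, its numeric values, the acoustic scores, and the function name `pick_phoneme` are all hypothetical, and multiplying an acoustic score by a transition probability is only one simple way such a choice might be made.

```python
# Hypothetical transition probabilities: P(next phoneme | preceding word).
# The numbers are illustrative only, not measured language statistics.
TRANSITION_PROB = {
    ("I", "ah"): 0.30,
    ("I", "ee"): 0.02,
}


def pick_phoneme(preceding, candidates, transition_prob, floor=1e-6):
    """Return the candidate with the highest combined score.

    candidates maps each optional phoneme to an assumed acoustic score
    in [0, 1]. The combined score is acoustic score * transition
    probability; unseen (preceding, phoneme) pairs fall back to a small
    floor probability instead of zero.
    """
    def combined(phoneme):
        return candidates[phoneme] * transition_prob.get((preceding, phoneme), floor)

    return max(candidates, key=combined)


# The acoustic evidence alone slightly favors "ee", but the context "I"
# makes "ah" (as in "am") the more probable continuation.
candidates = {"ah": 0.45, "ee": 0.55}
best = pick_phoneme("I", candidates, TRANSITION_PROB)
print(best)
```

Even in this toy form, the table must hold an entry for every context and candidate pair of interest, which hints at why real vocabularies of this kind consume considerable storage and calculation time.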
Most speech identification tools used for word identification are extremely sensitive to noise.
Therefore, noise reduction is often carried out prior to identifying the content of the speech segments in the audio signal. Voice Activity Detection (VAD) can be used to reduce noise that is unrelated to speech, providing noise-robust audio signals for word identification. For example, in an article by Tomas Dekens et al. (Dekens Tomas, Werner Verhelst, Francois Capman and Frederic Beaugendre, “Improved speech recognition in noisy environments by using a throat microphone for accurate voicing detection,” in 18th European Signal Processing Conference (EUSIPCO), Aalborg, Denmark, August 2010, pp. 23-27), audio signals are detected using one or more microphones that are connected to a speaker's vibrating body parts, such as the user's throat, mouth, etc. VAD is carried out on the audio signals detected by the microphones to identify speech in the audio signal. The non-speech parts identified in the VAD process are then cut out of the audio signal, resulting in audio files that represent only the identified speech parts thereof.
This technique may improve word identification in some cases, especially when the speaker pronounces short, separated words, such as when counting from 0 to 10. However, it may impair the word identification process, which involves complicated linguistic analysis, when the speaker speaks continuously and freely, since a major part of the information is lost when the audio signal is cut. Additionally, when a sentence is fragmented into very small parts, e.g. between words, information relating to articulation and cross-word context can be lost. Furthermore, the effectiveness of the language model decreases, since the relations between words are lost.
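The cutting step discussed above can be illustrated with a minimal sketch. It does not reproduce the throat-microphone voicing detection of the cited article; a simple frame-energy threshold is swapped in as the VAD decision, and the frame length, threshold, function names and synthetic test signal are all assumptions chosen for illustration.

```python
import math


def frame_energies(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's mean energy."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies


def cut_non_speech(samples, frame_len=160, threshold=0.01):
    """Keep only frames whose mean energy exceeds the threshold.

    Returns (kept_samples, speech_flags). Trailing samples that do not
    fill a whole frame are dropped. The energy threshold is a stand-in
    for a real VAD decision.
    """
    flags = [e > threshold for e in frame_energies(samples, frame_len)]
    kept = []
    for i, is_speech in enumerate(flags):
        if is_speech:
            kept.extend(samples[i * frame_len:(i + 1) * frame_len])
    return kept, flags


# Synthetic signal at an assumed 8 kHz rate: silence, then a 200 Hz tone
# standing in for voiced speech, then silence again.
rate = 8000
silence = [0.0] * 800
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(1600)]
signal = silence + tone + silence

kept, flags = cut_non_speech(signal)
print(len(signal), len(kept), sum(flags))
```

Note that the output `kept` retains no record of where the removed frames were or how long the pauses lasted, which is the kind of timing and cross-word information whose loss is described above.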