Social networking applications are now widely deployed on many types of computer systems, and an audio search is generally performed using one of the following two methods.
In the first method, each target audio input is converted into a corresponding target text input in word form using an automatic voice transcription technology. An index of the target text inputs is then created using a text search technology. During a search process, a search term is entered in text form and compared with each target text input. The target text inputs are sorted by their degree of similarity to the search term, so that the target text input most similar to the search term can be found, and the target audio input corresponding to that target text input is thereby identified. Alternatively, the search is performed using an audio input. The audio input is converted into a corresponding text input, which is then compared with each target text input. The target audio input corresponding to the target text input that is most similar to the converted text input can then be identified.
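The comparison and sorting step of the first method can be illustrated with a minimal Python sketch. It assumes the target audio inputs have already been transcribed; the `transcripts` mapping, the clip identifiers, and the function name `rank_by_text_similarity` are hypothetical, and character-level matching from the standard library's `difflib` stands in for whatever text search technology an actual system would use.

```python
from difflib import SequenceMatcher


def rank_by_text_similarity(query_text, transcripts):
    """Return (audio_id, score) pairs sorted from most to least similar.

    `transcripts` maps an audio identifier to the text produced for it by
    a transcription step that is assumed to have run beforehand.
    """
    query = query_text.lower()
    scored = [
        (audio_id, SequenceMatcher(None, query, text.lower()).ratio())
        for audio_id, text in transcripts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical transcripts of three target audio inputs.
transcripts = {
    "clip_a": "see you at the station tomorrow",
    "clip_b": "the weather is nice today",
    "clip_c": "happy birthday to you",
}

ranking = rank_by_text_similarity("weather is nice", transcripts)
best_audio_id = ranking[0][0]  # audio input whose transcript is most similar
```

When the search is performed with an audio input instead, the same function could be applied to the text obtained by transcribing that audio input. In either case the ranking is only as reliable as the transcripts it is given.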
In the second method, each target audio input is converted into a syllable/phoneme sequence. During a search process, a search term entered in text form or in audio form is likewise converted into a syllable/phoneme sequence. The target audio input most similar to the search term can then be obtained by calculating the similarity between the syllable/phoneme sequence of each target audio input and that of the search term, and comparing the results.
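The similarity calculation of the second method can be sketched in the same way. The source does not specify a similarity measure, so the example below assumes a normalized edit (Levenshtein) distance over phoneme symbols. The phoneme sequences, clip identifiers, and function names are hypothetical, and the conversion of audio or text into phonemes is assumed to have been done already.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(b) + 1))
    for i, sym_a in enumerate(a, start=1):
        curr = [i]
        for j, sym_b in enumerate(b, start=1):
            cost = 0 if sym_a == sym_b else 1
            curr.append(min(prev[j] + 1,          # delete sym_a
                            curr[j - 1] + 1,      # insert sym_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def phoneme_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the sequences are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def most_similar_audio(query_phonemes, target_phonemes):
    """Return the id of the target whose phoneme sequence is closest."""
    return max(
        target_phonemes,
        key=lambda audio_id: phoneme_similarity(
            query_phonemes, target_phonemes[audio_id]
        ),
    )


# Hypothetical phoneme sequences for three target audio inputs.
targets = {
    "clip_a": ["HH", "AH", "L", "OW"],
    "clip_b": ["W", "ER", "L", "D"],
    "clip_c": ["Y", "EH", "S"],
}
query = ["HH", "EH", "L", "OW"]  # search term after phoneme conversion
best = most_similar_audio(query, targets)
```

Here a single misrecognized phoneme lowers the similarity score but does not prevent a match, which is why sequence-level distances are a common choice for this kind of comparison.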
The foregoing two methods share the same disadvantage: both the target audio input and the search term, whether entered in text form or as an audio input, need to be converted into a word form or a syllable/phoneme form. Because natural speech varies in accent and is often recorded in complex environments with background noise, the voice conversion can be inaccurate, resulting in low accuracy of the audio search.