1. Technical Field
The present invention generally relates to communication devices and in particular to mechanisms and methodology for performing audio content search by voice query on communication devices.
2. Description of the Related Art
Cellular phones and other types of mobile communication devices are becoming increasingly pervasive in everyday use. Spurring the proliferation of these devices is the ability to conduct voice communication, which is a fundamental part of the daily communication that occurs on the devices. In addition to enabling voice communication (i.e., calls), many of these devices provide additional functionality, including the ability for the user to record and store pictures and video clips with voice-based (or speech-based) content. On such devices, the user is able to tag existing content (or currently recorded content), such as a photo, with a voice tag that is recorded as an audio file. Once the content is stored on the device, the user typically retrieves it by performing a manual search or some other form of search.
Thus, cellular phones and other communication devices typically provide a search function that supports searching within content that is stored/maintained on the device. These search functions can be performed using text-based search technology. In text-based search technology, “words” (or character combinations) play a critical role. These words may be manually entered into the device using the device's input mechanism (keypad, touch screen, and the like). It is well known, however, that entering text on mobile devices such as cell phones is a challenging task for the user. Therefore, it is desirable and more convenient for the words to be provided as audio data that is spoken by the user and detected by the device's microphone. In view of the following sections, it is also necessary that voice be usable as a query form in which the user can easily mimic the sound stored as content.
With existing technology, when a search is to be conducted on stored audio data, performing the search requires that both the audio data and the audio query be converted into their respective text representations, which are then utilized to complete the search via text matching. That is, the searching methodology is based on speech-to-text conversion, as in a dictation system, wherein speech is first converted into text using a dictionary of known spoken words/terms. One of the methods utilized relies on phonemes derived from the audio data to perform searches and is referred to as a phoneme-based approach (as opposed to a manually input, text-based approach). However, the process of discovering “words” from audio data input remains a challenging task on mobile communication devices. It is also a difficult task on a server-based computer system, because the performance of the speech recognition system depends on the language coverage and word coverage of the dictionaries and the language models.
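The dependence of text matching on dictionary coverage can be illustrated with a minimal Python sketch. It is not taken from any particular system: the `transcribe` function is a hypothetical stand-in for a dictation-style recognizer (it receives the spoken words directly and simply drops any word missing from its dictionary), and the clip names and vocabulary are invented for illustration.

```python
UNKNOWN = "<unk>"

def transcribe(spoken_words, dictionary):
    """Hypothetical stand-in for a dictation system: any word that is not
    in the dictionary cannot be recognized and becomes UNKNOWN."""
    return [w if w in dictionary else UNKNOWN for w in spoken_words]

def text_search(query_text, stored_texts):
    """Return ids of stored items whose transcript contains every
    recognized query word (plain text matching)."""
    terms = {w for w in query_text if w != UNKNOWN}
    if not terms:
        return []  # nothing recognizable in the query, so nothing to match
    return [item_id for item_id, words in stored_texts.items()
            if terms <= set(words)]

# Illustrative data (assumed): a small dictionary and three voice tags.
dictionary = {"birthday", "party", "beach", "trip"}
stored_audio = {
    "clip1": ["birthday", "party"],
    "clip2": ["beach", "trip"],
    "clip3": ["zanzibar", "trip"],  # "zanzibar" is out of vocabulary
}

# Both the stored audio and the spoken query pass through speech-to-text.
stored_texts = {k: transcribe(v, dictionary) for k, v in stored_audio.items()}
in_vocab_hits = text_search(transcribe(["beach"], dictionary), stored_texts)
oov_hits = text_search(transcribe(["zanzibar"], dictionary), stored_texts)
```

In this sketch the query "beach" finds its clip, while the query "zanzibar" finds nothing even though a matching voice tag is stored, because neither the tag nor the query survives conversion to text.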
Another recent phoneme-based approach to deciphering audio data (for searching) does not require actual word discovery. However, the approach makes use of very limited contextual information, such as one-phoneme or two-phoneme segments in the phoneme lattice as the feature vector, and must process the features of the audio data sequentially. The limited locality information also results in an expensive fine match.
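A rough Python sketch of this kind of short-context phoneme matching is given below. It is a simplification under stated assumptions, not the referenced approach itself: each item is represented by a single best phoneme sequence rather than a phoneme lattice, the features are counts of one-phoneme and two-phoneme segments, and the function names, phoneme labels, and clips are hypothetical.

```python
from collections import Counter

def ngram_features(phonemes, max_n=2):
    """Count one-phoneme and two-phoneme segments in a phoneme sequence."""
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(phonemes) - n + 1):
            feats[tuple(phonemes[i:i + n])] += 1
    return feats

def overlap(a, b):
    """Number of segments shared by two feature counters (multiset intersection)."""
    return sum((a & b).values())

def sequential_search(query, stored):
    """Score every stored item in turn against the query, best score first."""
    q = ngram_features(query)
    scores = [(overlap(q, ngram_features(p)), item_id)
              for item_id, p in stored.items()]
    return sorted(scores, reverse=True)

# Illustrative data (assumed): phoneme sequences for three stored voice tags.
stored = {
    "clip1": ["b", "iy", "ch"],
    "clip2": ["p", "iy", "ch"],
    "clip3": ["d", "ao", "g"],
}
ranked = sequential_search(["b", "iy", "ch"], stored)
```

Here every stored item is visited in sequence, and the near-miss "clip2" still shares three of the query's five short segments with the query. With so little context per feature, coarse scores of this kind separate candidates poorly, which is why a costly fine match is then needed.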