1. Technical Field
The present invention generally relates to communication devices and in particular to mechanisms and methodology for performing content search by voice query on communication devices.
2. Description of the Related Art
Cellular phones and other types of mobile communication devices are becoming increasingly pervasive in everyday use. Spurring the proliferation of these devices is the ability to conduct voice communication, which is a fundamental part of the daily communication that occurs on the devices, and to use data services such as web, email, and Multimedia Messaging Service (MMS). In addition to enabling voice and data communication (i.e., calls), many of these devices provide additional functionality, including the ability of the user to record and store pictures and video clips with voice (or speech) based content, and the ability to play back live video of special events, such as Olympic matches, or an IPTV feed on the device. On such devices, the user is able to tag existing content (or currently recorded content), such as a photo, with a voice tag that is recorded as an audio file. Once the content is stored on the device, the user typically retrieves it by performing a manual search or some other form of search.
Thus, cellular phones and other communication devices typically provide an on-device search function for performing searches within content that is stored/maintained on the device. The majority of these search functions are performed using text-based search technology. In text-based search technology, “words” (or character combinations) play a critical role. These words may be manually input into the device using the device's input mechanism (keypad, touch screen, and the like); however, in more advanced devices, the words are provided as audio data that is spoken by the user and detected by the device's microphone.
With existing technology, when a search is to be conducted on stored audio data, performing the search requires that both the audio data and the audio query be converted into their respective text representations, which are then utilized to complete the search via text matching. That is, the searching methodology is based on voice-to-text conversion, wherein spoken words are first converted into text using a dictionary of known spoken words/terms. The method commonly utilized relies on phonemes derived from the audio data to perform searches. The process of discovering “words” from audio data input remains a challenging task on mobile communication devices. It is also a difficult task on a server-based computer system because the performance of the speech recognition system is dependent on the language coverage and word coverage of the dictionaries and the language models.
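By way of illustration only, the dictionary-dependent text matching described above may be sketched as follows. The sketch assumes that the audio data has already been segmented into per-word phoneme sequences. The dictionary contents, the phoneme labels, and the names to_text and text_search are hypothetical and are not drawn from any particular speech recognition system.

```python
# Toy pronunciation dictionary (hypothetical): phoneme sequence -> known word.
DICTIONARY = {
    ("B", "IY", "CH"): "beach",
    ("P", "AA", "R", "T", "IY"): "party",
    ("D", "AO", "G"): "dog",
}


def to_text(phoneme_words):
    """Convert per-word phoneme tuples to text; words not in the dictionary are dropped."""
    return [DICTIONARY[p] for p in phoneme_words if p in DICTIONARY]


def text_search(query, voice_tags):
    """Return names of stored items whose transcribed tag contains every query word."""
    query_words = to_text(query)
    if not query_words:
        # Nothing in the query could be transcribed, so no text match is possible.
        return []
    hits = []
    for name, tag in voice_tags.items():
        tag_words = to_text(tag)
        if all(word in tag_words for word in query_words):
            hits.append(name)
    return hits


# Stored voice tags, each already segmented into per-word phoneme sequences (assumed).
voice_tags = {
    "photo_001": [("B", "IY", "CH"), ("P", "AA", "R", "T", "IY")],
    "photo_002": [("D", "AO", "G")],
    "photo_003": [("K", "AY", "L", "IY")],  # a spoken name missing from the dictionary
}

in_vocab = text_search([("B", "IY", "CH")], voice_tags)
out_of_vocab = text_search([("K", "AY", "L", "IY")], voice_tags)
```

In this sketch, the query for a dictionary word locates photo_001, while the query for the out-of-dictionary name returns no result even though photo_003 carries an identical spoken tag, which illustrates the dependence on word coverage noted above.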
A recent phoneme-based approach to deciphering audio data (for searching) does not need actual word discovery. However, the approach makes use of very limited contextual information and must sequentially process the features of the audio data, and the limited locality information results in an expensive fine match.
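By way of illustration only, one simple form of sequential phoneme matching may be sketched as follows. The sketch is an assumption made for explanatory purposes and is not the specific approach referred to above: it slides the query phonemes across the stored phoneme sequence and scores every window with an edit distance, so the cost grows with the length of the stored audio data. The names edit_distance and sequential_match and the phoneme labels are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def sequential_match(query, stored):
    """Score every query-length window of stored; return (best_distance, best_offset)."""
    n = len(query)
    best = (n + 1, -1)  # sentinel larger than any distance a window can produce
    last_offset = max(len(stored) - n, 0)
    for offset in range(last_offset + 1):
        distance = edit_distance(query, stored[offset:offset + n])
        if distance < best[0]:
            best = (distance, offset)
    return best


# Stored phonemes for a voice tag, and a query with one misrecognized phoneme (assumed).
stored = ["DH", "AH", "B", "IY", "CH", "P", "AA", "R", "T", "IY"]
query = ["B", "IY", "SH"]
best_distance, best_offset = sequential_match(query, stored)
```

No word discovery is needed in this sketch, but every window must be visited and fully scored, because no contextual information is available to rule out any offset in advance.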