The recent rise of speech-driven digital personal assistants has shown that speech recognition devices can become part of our daily lives. However, programming and training speech recognizers has so far been the sole domain of the companies providing them. Part of the reason is that programming and training a speech recognizer requires a speech recognition expert, as the tasks involve, among others, phonetization of sentences, training of acoustic models, training of language models, and designing natural language processing filters to correct spelling and check the viability of sentences.
Currently, commercially available speech-based personal assistants do not allow the user to configure the understood sentences or to change the call-sign. Some open-source speech-based personal assistants, like Jasper (https://jasperproject.github.io/), allow the keywords spotted by the underlying speech recognizer to be changed by directly modifying the Python code of Jasper. However, Jasper does not support changing the call-sign except by retraining the underlying speech recognizer. Additionally, Jasper does not allow for the recognition of full, temporally dependent sentences.
Further, state-of-the-art speech recognizer libraries like Sphinx, Pocketsphinx, or KALDI come with default settings and APIs but require extensive knowledge to be programmed. For example, a user needs to understand how a language model works and how to phonetize sentences. Earlier open-source libraries like micro speech, which pre-date Jasper, only allowed phoneme-level recognition based on very simple matching algorithms. Such implementations could not be called personal assistants, and they do not provide voice feedback.
As an example, hardware support allows EasyVR (https://www.sparkfun.com/products/13316) to extend the functionality to a selection of 36 built-in speaker-independent commands (available in US English, Italian, Japanese, German, Spanish, and French) for basic controls. It supports up to 32 user-defined speaker-dependent triggers by allowing the user to record their own waveform. Therefore, an overall number of up to 336 additional commands may be added. However, EasyVR does not support a call-sign or temporally dependent sentences. Programming is mostly performed through a serial interface command line. Further, EasyVR does not support voice feedback.
As another example, BitVoicer (http://www.bitsophia.com/Files/BitVoicer_v12_Manual_en.pdf) is a commercial speech recognition library that allows the training of both words and sentences. However, word dependencies and sentences need to be configured using a sophisticated graphical interface. Among other things, because BitVoicer can only do a ‘hard matching’ of sentences, it needs to account for all possible slight variations in sentences and generate anagrams, which makes the training process cumbersome. Further, BitVoicer does not support a call-sign or voice feedback.
In view of the foregoing, there is a need for improved methods and systems for facilitating construction of a voice dialog interface for electronic systems.