Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies enabling computers to recognize spoken language and convert it into text. It is also known as “automatic speech recognition” (ASR), “computer speech recognition”, or simply “speech-to-text” (STT). Early approaches to speech recognition relied heavily on enrollment, i.e., training the system on the voice of a particular speaker. Although enrollment improved recognition accuracy, it made speech input limited and cumbersome to use. Speaker-independent speech recognition systems, i.e., systems requiring no training, alleviated some of this concern, but were still limited.
Speech recognition was deployed on mobile wireless communications devices, but this application was constrained by the limited computational resources available on the mobile device. This constraint persisted until the wide deployment of high-bandwidth fourth-generation wireless networks. With the increased bandwidth and improved reliability of these networks, it became practical to quickly upload high-quality recorded speech to a central server with sufficient computational resources to perform speech recognition.
As cloud-based voice recognition became ubiquitous, several virtual assistants were deployed, for example, the Google Assistant (as available from Alphabet, Inc. of Mountain View, Calif.), Siri (as available from Apple, Inc. of Cupertino, Calif.), and Alexa (as available from Amazon, Inc. of Seattle, Wash.). The virtual assistants were quite helpful, but were confined to the mobile device of the user, which has limited battery power, an ill-suited microphone (i.e., ill-suited for receiving spoken commands from more than a few feet away from the mobile device), and a small speaker.
One approach to addressing these limitations is the smart speaker device. The smart speaker device includes a set of high-quality microphones, which permit receipt of voice commands spoken at a distance, a wireless transceiver to connect to the Internet, and a grid power source. In essence, the smart speaker provides a stationary interface to the available virtual assistants.
Due to privacy concerns, most smart speakers attempt to pass voice commands (i.e., audio recordings) to the cloud only when activated. The activation method is a spoken activation command, for example, “Alexa”, “Siri”, or “Hey Google”. Nevertheless, even when not activated, the smart speaker is always listening to ambient audio, with the understanding that recognition of the spoken activation command is performed entirely locally on the smart speaker. Thus, although the smart speaker provides convenient access to the virtual assistants, there remains great concern regarding user privacy.
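By way of illustration only, the activation gating described above may be sketched as follows. This is a minimal sketch under stated assumptions and does not represent the implementation of any particular smart speaker. The function gate_audio, its callbacks is_wake_word and is_end_of_command, the max_command_frames cutoff, and the use of text strings in place of audio frames are all hypothetical names and simplifications introduced here. The sketch shows that every frame is examined locally, and that only frames following a detected activation command, up to the end of the command, would be passed to the cloud.

```python
from typing import Callable, Iterable, List


def gate_audio(
    frames: Iterable[str],
    is_wake_word: Callable[[str], bool],
    is_end_of_command: Callable[[str], bool],
    max_command_frames: int = 50,
) -> List[str]:
    """Return only the frames that would be passed to the cloud.

    All names here are hypothetical. Frames are modeled as strings so the
    sketch stays self-contained; a real device would handle audio buffers.
    """
    uploaded: List[str] = []
    active = False  # True only between an activation command and the end of a command
    sent = 0        # frames forwarded during the current activation

    for frame in frames:
        if not active:
            # Local-only check: a frame that is not the activation command
            # is discarded and never leaves the device.
            if is_wake_word(frame):
                active = True
                sent = 0
            continue

        # While active, stop forwarding at the end of the command or once
        # the assumed safety cutoff has been reached.
        if is_end_of_command(frame) or sent >= max_command_frames:
            active = False
            continue

        uploaded.append(frame)
        sent += 1

    return uploaded


# Usage with stand-in detectors (assumptions for illustration only).
stream = [
    "tv noise", "alexa", "what", "time", "<silence>",
    "private chat", "alexa", "stop", "<silence>",
]
sent_to_cloud = gate_audio(
    stream,
    is_wake_word=lambda f: f == "alexa",
    is_end_of_command=lambda f: f == "<silence>",
)
# sent_to_cloud == ["what", "time", "stop"]
```

In this sketch the frames "tv noise" and "private chat" are heard by the device but never forwarded, which mirrors the stated understanding that recognition of the activation command is local. The privacy concern noted above arises because the user has to trust that the device actually behaves this way.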