1. Field of the Invention
The present invention relates to speech processing and, more specifically, to providing speech processing in a user interface of a client device via a common network node that receives and processes speech and returns text to the client device.
2. Introduction
The present disclosure generally relates to a desire and a need in the speech environment to improve the ability of individuals and companies to create voice enabled services over a network. For example, companies that use voice enabled services from providers such as Nuance and AT&T often need to invest a large amount of money in a customized system. In a standard spoken dialog system, many components need training and development in order to operate effectively, both to receive speech from a user and to generate an intelligent, conversational synthetic response. An automatic speech recognition (ASR) module converts a user's audible voice input into text. The text can be transmitted to a spoken language understanding (SLU) module, which seeks to identify the intent or purpose of the words spoken by the user. The output from the SLU module is communicated to a dialog management (DM) module, which processes the meaning identified by the SLU module and generates an appropriate response. The substance of the response is transmitted to a text-to-speech synthesis (TTS) module, which synthesizes an audio output that is communicated to and heard by the user. Various training data is used to train each of these modules so that the experience is as lifelike as possible for the user.
For many companies, there is a large barrier to entry for building voice enabled services. The barrier is high because of the degree of expertise needed to provide any service that uses features such as speech recognition or speech synthesis. The complex components include speech processing engines, hardware, a database of speech large enough to make the experience realistic for users, so that the service is both used and profitable, and so forth. A large investment of money and expertise is needed before any aspect of a voice enabled service generates revenue.
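The chain of modules in a standard spoken dialog system described above (ASR, then SLU, then DM, then TTS) can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in introduced only for illustration: the function names, the `Intent` type, the keyword rules, and the canned responses are assumptions, and the "audio" is simply UTF-8 bytes so the example runs without a speech engine. A real system would replace each stub with a trained component.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    """Hypothetical SLU output: a label for what the user wants."""
    name: str
    utterance: str


def asr(audio: bytes) -> str:
    """Stand-in ASR: treats the bytes as UTF-8 text instead of decoding audio."""
    return audio.decode("utf-8").strip().lower()


def slu(text: str) -> Intent:
    """Stand-in SLU: keyword matching in place of a trained model."""
    if "flight" in text or "airfare" in text:
        return Intent("book_flight", text)
    if "hotel" in text:
        return Intent("book_hotel", text)
    return Intent("unknown", text)


def dm(intent: Intent) -> str:
    """Stand-in DM: maps the identified intent to the substance of a response."""
    responses = {
        "book_flight": "Where would you like to fly?",
        "book_hotel": "Which city do you need a hotel in?",
    }
    return responses.get(intent.name, "Sorry, I did not understand that.")


def tts(text: str) -> bytes:
    """Stand-in TTS: returns encoded text in place of synthesized audio."""
    return text.encode("utf-8")


def spoken_dialog_turn(audio: bytes) -> bytes:
    """One turn of the dialog: ASR -> SLU -> DM -> TTS."""
    return tts(dm(slu(asr(audio))))
```

Calling `spoken_dialog_turn(b"I want a flight to Boston")` passes the input through all four stages in order. The sketch shows only the data flow between modules; the training, engines, and hardware that make each stage effective are what create the barrier discussed in this section.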
Because of this barrier, very few companies that do not own the engines or the servers are capable of affording and building voice enabled services. Companies that do not own speech processing engines, however, often have many profitable technologies that do not relate to voice enabled services. For example, many companies know how to build and deploy a messaging system, a communication system, or particular websites for performing a wide variety of web-based services. Websites such as Amazon.com and Travelocity.com have pioneered web-based processes for purchasing products online and reserving airfare, car rentals and hotel rooms.
What is needed in the art is an improved mechanism that enables companies with existing expertise in one particular area to build a voice component into their website or other user interface without spending a large amount of money to custom design, buy or license the complex engines and servers necessary for voice enabled services. Accordingly, what is needed generally in the art is an improved ability for users to easily implement voice enabled services, especially in the context of a browser on a desktop or laptop computer or on a mobile device.