1. Field of the Invention
The present invention relates to network-based talking heads and more specifically relates to an architecture to reduce the latency of talking head animation in a network environment.
2. Introduction
A growing number of websites use natural language interfaces to communicate with their customers, to guide them toward more successful self-service, and to enhance the user experience. Some sites display varying images of talking heads that express moods (happy, offended, sad) in addition to the text displayed in the browser window. As this technology progresses, animated talking faces for customer service and sales applications on the Internet further enhance the communication between an organization and its customers.
A web-based interaction through a natural language interface typically involves several major components and steps: (1) the client uses a regular web browser such as Internet Explorer or Netscape; (2) the user types text into a text box on a web page; (3) this text is sent to the server; (4) the server transmits the text to a dialog manager, which consists of several modules including natural language understanding, dialog control and natural language generation; and (5) the dialog manager transmits responsive text to the server, which forwards the text with the appropriate web page(s) to the client. Compared to simple websites that serve up web pages without further processing at the server, the latency of the server response as perceived by the client is increased by the response time of the dialog manager.
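The five steps above can be sketched as a chain of function calls. This is only a minimal illustration: the function names, the toy intent rules and the placeholder page are assumptions made for the example and do not come from the described system.

```python
def understand(text: str) -> str:
    """Toy natural language understanding: map user text to an intent label."""
    return "greeting" if "hello" in text.lower() else "unknown"


def decide(intent: str) -> str:
    """Toy dialog control: choose a response action for the intent."""
    return "greet_back" if intent == "greeting" else "ask_rephrase"


def generate(action: str) -> str:
    """Toy natural language generation: turn the action into responsive text."""
    responses = {
        "greet_back": "Hello! How can I help you?",
        "ask_rephrase": "Could you rephrase that?",
    }
    return responses[action]


def dialog_manager(text: str) -> str:
    """Step (4): run the three dialog manager modules in sequence."""
    return generate(decide(understand(text)))


def server_handle(user_text: str) -> dict:
    """Steps (3) to (5): receive text, consult the dialog manager, and
    return the responsive text together with a (placeholder) web page."""
    reply = dialog_manager(user_text)
    return {"page": "<html>...</html>", "reply": reply}
```

Because the server waits for `dialog_manager` before it can answer, the time spent there adds directly to the response latency seen by the client.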
In cases where the user interaction with the website further includes a talking face, two additional steps must occur: (1) speech needs to be synthesized using a text-to-speech (TTS) synthesizer; and (2) based on the phonemes created by the TTS, a renderer animates the face. While speech synthesis can be done faster than real time, the latency of a TTS system (time to first audio) usually exceeds 0.3 seconds. In web interactions, people are often exposed to considerable latencies due to slow download speeds; but as web interactions become more like face-to-face conversations, low latencies are essential. Delays above 0.3 seconds in response are noticeable and irritate the user.
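The effect of these additional steps on perceived latency can be illustrated with a small calculation. The sketch below assumes the stages run strictly one after another, which is a simplification; the function names and the sample stage timings are hypothetical, and only the 0.3 second threshold is taken from the text.

```python
# Threshold above which a response delay is noticeable, per the text.
NOTICEABLE_DELAY_S = 0.3


def time_to_first_frame(dialog_s: float, tts_first_audio_s: float,
                        render_first_video_s: float) -> float:
    """Sum the stage latencies, assuming each stage waits for the previous one."""
    return dialog_s + tts_first_audio_s + render_first_video_s


def is_noticeable(latency_s: float) -> bool:
    """Return True when the delay exceeds the noticeable threshold."""
    return latency_s > NOTICEABLE_DELAY_S


# Illustrative stage timings in seconds (assumed values, not measurements).
total = time_to_first_frame(dialog_s=0.1, tts_first_audio_s=0.3,
                            render_first_video_s=0.5)
```

Under this serial assumption, a TTS stage sitting right at the threshold leaves no budget for the dialog manager or the renderer.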
Based on the phonemes and related information from the TTS, the talking head is animated. While face animation can be done in real time, the face renderer also adds latency to the system. Depending on the face model, time to first video can exceed 0.5 seconds. High quality face animation systems use coarticulation models to compute the mouth shapes, so the current mouth shape depends on previous sounds. Furthermore, the mouth moves in anticipation of upcoming sounds, which adds to the latency of the face animation. The present invention solves these problems by introducing caches at the client and server side that can present talking head animations with a low latency while the server is generating new parts of the animation.
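One way to picture the caching idea is a lookup that plays an already rendered clip for the start of a response while the remainder is still being generated. The following is a sketch under stated assumptions, not the architecture of the invention: keying clips by phrase text, matching on the longest cached prefix, and the names `AnimationCache` and `plan_playback` are all hypothetical choices made for illustration.

```python
from typing import Dict, Optional, Tuple


class AnimationCache:
    """Hypothetical cache mapping a spoken phrase to a pre-rendered clip."""

    def __init__(self) -> None:
        self._clips: Dict[str, bytes] = {}

    def put(self, phrase: str, clip: bytes) -> None:
        self._clips[phrase] = clip

    def longest_cached_prefix(self, text: str) -> Tuple[str, Optional[bytes]]:
        """Return the longest cached phrase that starts `text`, with its clip."""
        best = ""
        for phrase in self._clips:
            if text.startswith(phrase) and len(phrase) > len(best):
                best = phrase
        return (best, self._clips[best]) if best else ("", None)


def plan_playback(text: str, cache: AnimationCache) -> dict:
    """Split a response into a clip that can play at once and the
    remaining text that still has to be synthesized and rendered."""
    prefix, clip = cache.longest_cached_prefix(text)
    remainder = text[len(prefix):].strip()
    return {"play_now": clip, "generate": remainder}


cache = AnimationCache()
cache.put("Hello,", b"clip-hello")  # placeholder bytes, not real video
plan = plan_playback("Hello, your order has shipped.", cache)
```

In this sketch the cached clip masks the synthesis and rendering delay for the remainder; when nothing is cached, the whole response falls through to normal generation.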
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The features and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth herein.