The present invention relates to animation systems and, more particularly, to a method and apparatus for generating an animated sequence having synchronized visual and audio components.
Existing technology related to Internet communication systems includes such applications as pre-animated greetings, avatars, e-mail, web-based audio delivery and video conferencing. Originally, e-mail messages were sent through the Internet as text files. However, commercial demand for more visual stimulus and advances in compression technology soon allowed graphics in the form of short pre-animated messages with embedded audio to be made available to the consumer. For example, software packages such as Microsoft Greetings Workshop allow a user to assemble a message with pre-existing graphics, short animations and sound. These are multimedia greeting cards that can be sent over the Internet but without the voice or gesture of the original sender.
Existing software in the area of video conferencing allows audio and video communication through the Internet. Connectix, Sony Funmail and Zap technologies have developed products that allow a video image with sound to be sent over the Internet. Video e-mail can be sent as an executable file that can be opened by the receiver of the message without the original software. However, video conferencing requires both sender and receiver to have the appropriate hardware and software. Although video e-mail and conferencing can be useful for business applications, many consumers have reservations about seeing their own image on the screen and prefer a more controllable form of communication.
In the area of prior art Internet messaging software, a variety of systems have been created. Hijinx Masquerade software allows text to be converted into synthetic voices and animated pictures that speak the voices. The system is designed to use Internet Relay Chat (IRC) technology. The software interface is complicated and requires the user to train the system to match text and image. The result is a very choppy animated image with mouth shape accompanied by a synthetic computer voice. The software is limited by its inability to relay the actual voice of its user in sync with a smooth animation. In addition, a Mitsubishi technology research group has developed a voice puppet, which allows an animation of a static image file to be driven by speech in the following manner. The software constructs a model using a limited set of the speaker's facial gestures, and applies that model to any 2D or 3D face, using any text, mapping the movements onto the new features. In order to learn to mimic someone's facial gestures, the software needs several minutes of video of the speaker, which it analyzes, maps and stylizes. This software allows a computer to analyze and stylize video images, but does not directly link a user's voice to animation for communication purposes. Geppetto software also aids professional animators in creating facial animation. The software helps professionals generate lip-sync and facial control of 3D computer characters for 3D games, real-time performance and network applications. The system inputs the movement of a live model into the computer using motion analysis and MIDI devices. Scanning and motion analysis hardware captures a face and gestures in real time and then records the information into a computer for animation of a 3D model.
Prior art software for Internet communication has also produced "avatars", which are simple characters that form the visual embodiment of a person in cyberspace and are used as communication and sales tools on the Internet. These animations are controlled by real-time commands, allowing the user to interact with others on the Internet. Microsoft's V-Chat software offers an avatar pack, which includes downloadable characters and backgrounds, and which can be customized by the user with a character editor. The animated character can be represented in 3D or as a 2D comic-strip-style graphic with speech bubbles. It uses the Internet Relay Chat (IRC) protocol and can accommodate private or group chats. The user is required to type the message on a keyboard and, if desired, choose an expression from a menu. Accordingly, while chatting the user must make a conscious effort to link the text with the appropriate character expression, since the system does not automatically perform this operation. In addition, the animated characters do not function with lip-synced dialogue generated by the user.
A number of techniques and systems exist for synchronizing the mouth movements of an animated character to a spoken sound track. These systems, however, are mainly oriented to the entertainment industry, since their operation generally requires much technical sophistication to ultimately produce the animated sequence. For example, U.S. Pat. No. 4,360,229 discloses a system in which a recorded sound track is encoded into a sequence of phoneme codes. This sequence of phoneme codes is analyzed to produce a sequence of visual images of lip movements corresponding to the sound track. These visual images can then be overlaid onto existing image frames to yield an animated sequence. Similarly, U.S. Pat. No. 4,913,539 teaches a system that constructs a synchronized animation based upon a recorded sound track. The system disclosed therein uses linear prediction techniques, instead of phoneme recognition devices, to code the sound track. This system, however, requires that the user "train" the system by inputting so-called "training utterances" into the system, which compares the resulting signals to the recorded sound track and generates a phonetic sequence.
Furthermore, speech-driven animation software has been developed to aid in the laborious task of matching specific mouth shapes to each phoneme in a spoken dialogue. LipSync Talkit and Talk Master Pro work as plugins for professional 3D animation programs such as 3D Studio Max and Lightwave 3D. These systems take audio files of dialogue, link them to phonemes and morph the 3D speech animation based on facial bone templates created by the animator. Then the animation team assembles the remaining animation. These software plugins, however, require other professional developer software to implement their functionality for complete character design. In addition, they do not function as self-contained programs for the purpose of creating speech-driven animations and sending these animations as messages through the Internet.
The user of prior art speech-driven animation software generally must have an extensive background in animation and 3D modeling. In light of the foregoing, a need exists for an easy-to-use method and system for generating an animated sequence having mouth movements synchronized to a spoken sound track inputted by a user. The present invention substantially fulfills this need and provides a tool for automated animation of a character that requires no prior knowledge of animation techniques from the end user.
The present invention provides methods, systems and apparatuses directed toward an authoring tool that gives users the ability to make high-quality, speech-driven animation in which the animated character speaks in the user's voice. Embodiments of the present invention allow the animation to be sent as a message over the Internet or used as a set of instructions for various applications, including Internet chat rooms. According to one embodiment, the user chooses a character and a scene from a menu, then speaks into the computer's microphone to generate a personalized message. Embodiments of the present invention use speech-recognition technology to match the audio input to the appropriate animated mouth shapes, creating a professional-looking 2D or 3D animated scene with lip-synced audio characteristics.
The present invention, in one embodiment, creates personalized animations on the fly that closely resemble the high quality of hand-finished products. For instance, one embodiment of the present invention recognizes obvious volume changes and adjusts the mouth size of the selected character to the loudness or softness of the user's voice. In another embodiment, while the character is speaking, the program initiates an algorithm that mimics common human gestures and reflexes, such as gesturing at an appropriate word or blinking in a natural way. In one embodiment, the user can also add gestures, facial expressions, and body movements to enhance both the natural look of the character and the meaning of the message. Embodiments of the present invention also include modular action sequences (such as running, turning, and jumping) that the user can link together and insert into the animation. The present invention allows several levels of personalization, from the simple addition of voice and message to control over the image itself. More computer-savvy users can scan in their own images and superimpose a ready-made mouth over their picture. The software can also accept user-created input from standard art and animation programs. More advanced audio controls incorporate pitch-shifting audio technology, allowing the sender to match their voice to a selected character's gender, age and size.
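By way of illustration only, the volume-dependent mouth size described above might be approximated as in the following Python sketch. The source does not specify how loudness is measured or mapped; the root-mean-square measure, the function names `rms` and `mouth_scale`, and all threshold and scale values are assumptions made for this example.

```python
import math

def rms(samples):
    """Root-mean-square level of one frame's audio samples (floats in [-1, 1])."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mouth_scale(samples, quiet=0.05, loud=0.5, min_scale=0.6, max_scale=1.4):
    """Map loudness to a mouth-size multiplier.

    All four parameters are hypothetical tuning values, not taken from the source.
    Levels at or below `quiet` give `min_scale`; levels at or above `loud`
    give `max_scale`; levels in between are interpolated linearly.
    """
    level = rms(samples)
    position = (level - quiet) / (loud - quiet)
    position = min(1.0, max(0.0, position))  # clamp to [0, 1]
    return min_scale + position * (max_scale - min_scale)
```

Under these assumptions, a compositing step could multiply the drawn size of the selected mouth shape by the returned value for each animation frame.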
The present invention combines these elements to produce a variety of communication and animation files. These include a deliverable e-mail message with synchronized video and audio components that a receiver of the message can open without the original program, an instruction set for real-time chat room communications, and animation files for web, personal animation, computer game play, video production, training and education applications.
In one aspect the present invention provides a method for generating an animated sequence having synchronized visual and audio characteristics. The method comprises (a) inputting audio data; (b) detecting a phonetic code sequence in the audio data; (c) generating an event sequence using the phonetic code sequence; and (d) sampling the event sequence. According to one embodiment, the method further comprises (e) constructing an animation frame based on the sampling step (d); and (f) repeating steps (d)-(e) a desired number of times to create an animation sequence.
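Steps (c) through (f) of the method above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: step (b) is taken as already done, with its output given as hypothetical (phonetic code, duration in milliseconds) pairs, and the `MOUTH_SHAPES` table, the function names and the frame rate are all invented for the example.

```python
from bisect import bisect_right

# Hypothetical phonetic-code-to-mouth-shape table; the source does not define one.
MOUTH_SHAPES = {"sil": "closed", "m": "closed", "aa": "open_wide", "iy": "smile"}

def build_event_sequence(phonetic_codes):
    """Step (c): convert (code, duration_ms) pairs into (start_ms, mouth_shape) events."""
    events, start_ms = [], 0
    for code, duration_ms in phonetic_codes:
        events.append((start_ms, MOUTH_SHAPES.get(code, "neutral")))
        start_ms += duration_ms
    return events, start_ms  # the final start_ms equals the total duration

def sample_event_sequence(events, time_ms):
    """Step (d): return the mouth shape of the latest event starting at or before time_ms."""
    starts = [start for start, _ in events]
    index = bisect_right(starts, time_ms) - 1
    return events[max(index, 0)][1]

def build_animation(phonetic_codes, fps=10):
    """Steps (e)-(f): sample once per frame interval and build one frame per sample."""
    events, total_ms = build_event_sequence(phonetic_codes)
    frame_count = total_ms * fps // 1000
    return [
        {"frame": i, "mouth": sample_event_sequence(events, i * 1000 // fps)}
        for i in range(frame_count)
    ]
```

For example, `build_animation([("m", 100), ("aa", 200), ("sil", 100)])` yields four frames at 10 frames per second. Integer milliseconds are used here only to keep the sampling arithmetic exact.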
In another aspect, the present invention provides an apparatus for generating an animated sequence having synchronized visual and audio characteristics. According to this aspect, the apparatus comprises a mouth shape database, an image frame database, an audio input module, an event sequencing module, a time control module, and an animation compositing module. According to the invention, the audio input module includes a phonetic code recognition module that generates a phonetic code sequence from audio data. The event sequencing module is operably connected to the audio input module and generates an event sequence based on the phonetic code sequence. The time control module is operably connected to the event sequencing module and includes a sampling module, which samples the event sequence. The animation compositing module is operably connected to the sampling module and to the mouth shape and image frame databases. According to the invention, the animation compositing module is responsive to the time control module to receive an event sequence value, retrieve a mouth shape from the mouth shape database and an image frame from the image frame database, and composite the mouth shape onto the image frame.
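The role of the animation compositing module can be illustrated with the following Python sketch. It is a simplified model under stated assumptions: the two databases are plain dictionaries, images are rows of text characters standing in for pixels, and the class name `AnimationCompositor`, its fields and the `mouth_origin` paste position are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

@dataclass
class AnimationCompositor:
    """Toy stand-in for the animation compositing module."""
    mouth_shapes: dict             # mouth shape database: shape id -> rows of characters
    image_frames: dict             # image frame database: frame id -> rows of characters
    mouth_origin: tuple = (1, 0)   # assumed (row, column) where the mouth is pasted

    def composite(self, event_value, frame_id):
        """Retrieve a mouth shape and an image frame, then overlay the mouth."""
        # Copy the stored frame so the database entry is left unchanged.
        frame = [list(row) for row in self.image_frames[frame_id]]
        mouth = self.mouth_shapes[event_value]
        top, left = self.mouth_origin
        for r, row in enumerate(mouth):
            for c, pixel in enumerate(row):
                frame[top + r][left + c] = pixel
        return ["".join(row) for row in frame]
```

In this sketch, `event_value` plays the part of the event sequence value delivered by the sampling module; a real system would composite bitmap or vector images rather than text rows.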