The present invention relates generally to computerized animation methods and, more specifically, to a method and apparatus for creation and control of random access sound-synchronized talking synthetic actors and animated characters.
It is well-known in the prior art to provide video entertainment or teaching tools employing time synchronized sequences of pre-recorded video and audio. The prior art is best exemplified by tracing the history of the motion picture and entertainment industry from the development of the "talkies" to the recent development of viewer interactive movies.
In the late nineteenth century, the first practical motion pictures were developed, comprising pre-recorded sequential frames projected onto a screen at 20 to 30 frames per second to give the effect of motion. In the 1920's, techniques were developed to synchronize a pre-recorded audio sequence or sound track with the motion picture. In the 1930's, animation techniques were developed to produce hand-drawn cartoon animations, including animated figures having lip movements synchronized with an accompanying pre-recorded sound track. With the advent of computers, more and more effort has been channeled towards the development of computer-generated video and speech, including electronic devices to synthesize human speech and speech recognition systems.
In a paper entitled "KARMA: A system for Storyboard Animation" authored by F. Gracer and M. W. Blasgen, IBM Research Report RC 3052, dated Sep. 21, 1970, an interactive computer graphics program which automatically produces the intermediate frames between a beginning and ending frame is disclosed. The intermediate frames are calculated using linear interpolation techniques and then produced on a plotter. In a paper entitled "Method for Computer Animation of Lip Movements", IBM Technical Disclosure Bulletin, Vol. 14 No. 10 Mar., 1972, pages 3039, 3040, J. D. Bagley and F. Gracer disclosed a technique for computer-generated lip animation for use in a computer animation system. A speech-processing system converts a lexical presentation of a script into a string of phonemes and matches it with an input stream of corresponding live speech to produce timing data. A computer animation system, such as that described hereinabove, given the visual data for each speech sound, generates intermediate frames to provide a smooth transition from one visual image to the next. Finally, the timing data is utilized to correlate the phonetic string with the visual images to produce accurately timed sequences of visually correlated speech events.
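By way of illustration only, the linear interpolation of intermediate frames described above may be sketched as follows. The sketch assumes that each frame is represented as a list of (x, y) control points and that both key frames contain the same number of points; the names interpolate_frames, closed_mouth and open_mouth are hypothetical and do not appear in the cited papers.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Frame = List[Point]  # assumed representation: one frame = list of control points


def interpolate_frames(start: Frame, end: Frame, n_intermediate: int) -> List[Frame]:
    """Return n_intermediate frames linearly interpolated between two key frames.

    The key frames themselves are not included in the result.
    """
    if len(start) != len(end):
        raise ValueError("key frames must have the same number of points")
    if n_intermediate < 0:
        raise ValueError("n_intermediate must be non-negative")

    frames: List[Frame] = []
    for i in range(1, n_intermediate + 1):
        # t runs strictly between 0 (start frame) and 1 (end frame).
        t = i / (n_intermediate + 1)
        frames.append([
            (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
            for (x0, y0), (x1, y1) in zip(start, end)
        ])
    return frames


# Hypothetical two-point "mouth" outlines used only for demonstration.
closed_mouth: Frame = [(0.0, 0.0), (4.0, 0.0)]
open_mouth: Frame = [(0.0, 2.0), (4.0, -2.0)]
in_betweens = interpolate_frames(closed_mouth, open_mouth, 3)
```

With three intermediate frames, the middle frame lies halfway between the two key frames, which is the behavior expected of a linear in-betweening scheme.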
Recent developments in the motion picture and entertainment industry relate to active viewer participation as exemplified by video arcade games and branching movies. U.S. Pat. Nos. 4,305,131; 4,333,152; 4,445,187 and 4,569,026 relate to remote-controlled video disc devices providing branching movies in which the viewer may actively influence the course of a movie or video game story. U.S. Pat. No. 4,569,026 entitled "TV Movies That Talk Back" issued on Feb. 4, 1986 to Robert M. Best discloses a video game entertainment system by which one or more human viewers may vocally or manually influence the course of a video game story or movie and conduct a simulated two-way voice conversation with characters in the game or movie. The system comprises a special-purpose microcomputer coupled to a conventional television receiver and a random-access videodisc reader which includes automatic track seeking and tracking means. One or more hand-held input devices, each including a microphone and visual display, are also coupled to the microcomputer. The microcomputer controls retrieval of information from the videodisc, processes viewers' commands input either vocally or manually through the input devices, and provides audio and video data to the television receiver for display. At frequent branch points in the game, a host of predetermined choices and responses are presented to the viewer. The viewer may respond using representative code words vocally, manually, or by a combination of both. In response to the viewer's choice, the microcomputer manipulates pre-recorded video and audio sequences to present a selected scene or course of action and dialogue.
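By way of illustration only, the selection of a pre-recorded scene at a branch point in response to a viewer's code word may be sketched as a simple table lookup. The scene names, code words, and the function next_scene below are hypothetical and are not taken from the cited patents; the sketch further assumes that an unrecognized code word leaves the current scene unchanged.

```python
from typing import Dict

# Hypothetical story graph: each scene maps viewer code words to the next scene.
STORY: Dict[str, Dict[str, str]] = {
    "cave_entrance": {"enter": "dark_tunnel", "leave": "forest_path"},
    "dark_tunnel": {"torch": "treasure_room", "run": "cave_entrance"},
}


def next_scene(current: str, code_word: str) -> str:
    """Return the scene selected by code_word at the current branch point.

    Assumption: an unrecognized code word keeps the viewer in the current scene.
    """
    choices = STORY.get(current, {})
    return choices.get(code_word.strip().lower(), current)
```

In such a sketch, the returned scene name would stand in for the address of the corresponding pre-recorded sequence on the random-access medium.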
In a paper entitled "Soft Machine: A Personable Interface", "Graphics Interface '84", John Lewis and Patrick Purcell disclose a system which simulates spoken conversation between a user and an electronic conversational partner. An animated person-likeness "speaks" with a speech synthesizer and "listens" with a speech recognition device. The audio output of the speech synthesizer is simultaneously coupled to a speaker and to a separate real-time formant-tracking speech processor computer, which analyzes it to provide timing data for lip synchronization and limited expression and head movements. A set of pre-recorded visual images depicting lip, eye and head positions is sequenced according to the timing data so that the animated person-likeness "speaks" or "listens". The output of the speech recognition device is matched against pre-recorded patterns until a match is found. Once a match is found, one of several pre-recorded responses is either spoken or executed by the animated person-likeness.
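By way of illustration only, the sequencing of pre-recorded visual images from timing data may be sketched as follows. The sketch assumes that the timing data is a list of (start time in milliseconds, speech-sound label) pairs sorted by start time, and that each label maps to one stored image; the labels, image names, and the functions image_at and frame_sequence are hypothetical and are not drawn from the cited paper.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical mapping from speech-sound labels to stored mouth images.
MOUTH_IMAGE: Dict[str, str] = {
    "M": "lips_closed",
    "AA": "jaw_open",
    "OO": "lips_rounded",
    "_": "rest",
}

Timing = List[Tuple[int, str]]  # (start_ms, label), sorted by start_ms


def image_at(timing: Timing, t_ms: int) -> str:
    """Return the image for the speech sound active at time t_ms."""
    starts = [start for start, _ in timing]
    i = bisect_right(starts, t_ms) - 1  # index of last event starting at or before t_ms
    if i < 0:
        return MOUTH_IMAGE["_"]  # before the first sound: rest position
    return MOUTH_IMAGE.get(timing[i][1], MOUTH_IMAGE["_"])


def frame_sequence(timing: Timing, duration_ms: int, fps: int = 30) -> List[str]:
    """Return one image name per displayed frame over duration_ms."""
    n_frames = duration_ms * fps // 1000
    return [image_at(timing, frame * 1000 // fps) for frame in range(n_frames)]


timing: Timing = [(0, "M"), (100, "AA"), (300, "OO")]
sequence = frame_sequence(timing, 400, fps=10)
```

At ten frames per second over 400 milliseconds, the sketch yields four frames, one per 100-millisecond interval, each showing the image of the sound active at the start of that interval.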
Both J. D. Bagley et al. and John Lewis et al. require a separate formant-tracking speech processor computer to analyze the audio signal to provide real-time data to determine which visual image or images should be presented to the user. The requirement for this additional computer adds cost and complexity to the system and introduces an additional source of error.