As computer hardware and software technologies progress rapidly, the accumulated knowledge of the human race is increasingly stored digitally, usually as multimedia such as text, audio, images, and video. The development of wired and wireless networks further removes the restrictions of time and geographical location on learning and knowledge delivery. The era of digital learning appears to have arrived. However, to promote digital learning, it is important to facilitate learning through natural interaction in addition to improving the technologies for knowledge categorization, lookup, and reference. This is especially true for behavior learning.
According to the social learning theory of Professor Bandura of Stanford University, an individual's learning process starts with observing a target model, then memorizing and storing the observed behavior for later mimicking. In other words, learners learn a behavior by watching how the target model behaves. However, because it is difficult for learners to distinguish the subtle differences between the observed behavior and their own mimicking behavior, learning effectiveness is usually poor if the observed model is not present to interact with the learners and give advice and assistance. Therefore, the present invention uses action analysis and synthesis technologies to develop a virtual tutor mechanism that assists learners in the self-learning process.
U.S. Pat. No. 6,807,535 disclosed an intelligent tutoring system 100, including a domain module 110 and a tutor module 120, constructed on a platform 130 with a processor and memory, as shown in FIG. 1. The tutor module 120 uses fuzzy logic to dynamically select appropriate knowledge from the domain module 110 to teach the learner in accordance with the learner's level of understanding. The main feature of the patent is the selection of the appropriate knowledge.
U.S. Publication No. 2005/0255434, Interactive Virtual Characters for Training Including Medical Diagnosis Training, disclosed an interactive training system 200, as shown in FIG. 2. The system analyzes the user's behavior to infer the user's intention, and then uses a computer-synthesized virtual character to respond accordingly. The system is applied to medical training. The synthesized patient 210 and the tutor 220 can interact with the medical trainee 240 on the screen 230.
U.S. Publication No. 2006/0045312 disclosed an image comparison device for providing real-time feedback to the user. In the training stage, a behavior sequence of the user 310 is recorded. In the test stage, another behavior sequence of the user is recorded. By comparing the recorded image sequences, the device can find the discrepancy between the user's behavior in the training and test stages.
Image-based videorealistic speech animation has drawn wide attention due to its high visual realism. The technique originated from the Video Rewrite technique of C. Bregler. A triphone, a concatenation of three phonemes, is taken as the basic unit for collecting facial images during the target's speech. During speech sequence synthesis, the image segments of the same triphone utterance are taken directly from the video corpus and concatenated.
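As a minimal sketch of the triphone unit described above, the following Python function splits a phoneme sequence into overlapping three-phoneme windows. The function name `to_triphones` and the silence label `"sil"` used to pad the utterance edges are illustrative assumptions, not part of the cited technique.

```python
def to_triphones(phonemes, boundary="sil"):
    """Return one (left, center, right) triple per phoneme.

    Assumption: the utterance edges are padded with a hypothetical
    silence label so that every phoneme has two neighbours.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    # Slide a window of width three over the padded sequence.
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


if __name__ == "__main__":
    # The word "cat" as three phonemes yields three triphones.
    print(to_triphones(["k", "ae", "t"]))
```

A concatenative system of the kind described would use each triple as a lookup key into the video corpus, which is why corpus coverage of all triphones matters.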
AT&T also developed a similar technique that uses the Viterbi dynamic programming algorithm to allow more flexibility in the length of the concatenated sequences for visual speech synthesis. Both approaches directly reuse the images in the pre-recorded video corpus without using any generative model for speech animation synthesis, which results in the following two problems. First, the effectiveness of both approaches depends on finding matching images in the pre-recorded video corpus. Therefore, a large video corpus is required to ensure the availability of every triphone-based phonetic combination in the novel sentence to be synthesized. Second, it is not possible to transfer the speaker to another person without collecting a large video corpus again. This imposes a large cost in video recording and processing time, as well as an economic burden for the storage space used.
Tony Ezzat et al. of MIT proposed trainable videorealistic speech animation, which uses machine learning to construct image-based videorealistic speech animation. Although this technique also requires collecting a facial video corpus of the specific person for training, only a compact learned model is kept for visual speech synthesis of novel sentences once training is complete. The following describes the two core techniques, namely the multidimensional morphable model (MMM) and trajectory analysis and synthesis.
MMM was proposed by M. Jones and T. Poggio of MIT in 1998. In MMM, the visual information of an image is represented by shape and texture components, and image analysis and recognition are performed based on the composite coefficients of these two components. In trainable videorealistic speech animation, however, MMM is used to parameterize images for image synthesis. First, a set of prototype images is automatically selected from the video corpus by the k-means algorithm. Then, each prototype is decomposed into a motion component, represented by optical flow, and a texture component. Each synthesized image can then be modeled as a linear combination of the motion and texture components of the selected prototype images.
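The prototype selection step can be sketched as follows, assuming each video frame has already been reduced to a short numeric feature vector. The function name `select_prototypes`, the evenly spaced seeding, and the fixed iteration count are assumptions made for illustration; the source states only that k-means is used.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_prototypes(frames, k, iterations=20):
    """Return the indices of the k frames closest to the k-means centroids."""
    if not 0 < k <= len(frames):
        raise ValueError("k must be between 1 and the number of frames")
    # Assumption: deterministic seeding with evenly spaced frames.
    centroids = [list(frames[(i * len(frames)) // k]) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every frame to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for frame in frames:
            nearest = min(range(k),
                          key=lambda c: squared_distance(frame, centroids[c]))
            clusters[nearest].append(frame)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # A prototype must be a real frame, so pick the frame nearest each centroid.
    return [min(range(len(frames)),
                key=lambda j: squared_distance(frames[j], centroids[c]))
            for c in range(k)]


if __name__ == "__main__":
    frames = [(0, 0), (2, 0), (1, 0), (10, 10), (12, 10), (11, 10)]
    print(select_prototypes(frames, k=2))
```

Returning frame indices rather than centroids reflects that the prototypes are actual images from the corpus, each of which is then decomposed into flow and texture.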
More formally, given a set of M prototype images $\{I_{P_i}\}_{i=1}^{M}$ and prototype flows $\{C_{P_i}\}_{i=1}^{M}$, each novel synthesized image can be modeled as:

$$C_{syn} = \sum_{i=1}^{M} \alpha_i C_{P_i}, \tag{1}$$

$$I_{syn} = \sum_{i=1}^{M} \beta_i I_{P_i}^{warped} = \sum_{i=1}^{M} \beta_i W_F\bigl(I_{P_i},\, W_F(C_{syn} - C_{P_i},\, C_{P_i})\bigr), \tag{2}$$

where $C_{syn}$ and $I_{syn}$ are the motion and texture components of the novel image, respectively, and $W_F(p,q)$ is the forward-warp operation that warps vectors p according to flow vector q. Conversely, given a set of MMM parameters $\{\alpha_i,\beta_i\}_{i=1}^{M}$, a new mouth image can be synthesized by warping and blending the prototype images.
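The order of operations in Eqs. (1) and (2) can be sketched as below. This is an illustration under stated assumptions: flows and textures are flattened into plain lists, and the forward warp is passed in as a caller-supplied function `warp(texture, prototype_flow, c_syn)`, a hypothetical stand-in for the nested $W_F$ term, which is not implemented here.

```python
def weighted_sum(weights, vectors):
    """Linear combination of equally sized vectors, one weight per vector."""
    if len(weights) != len(vectors):
        raise ValueError("need exactly one weight per prototype")
    return [sum(w * v[k] for w, v in zip(weights, vectors))
            for k in range(len(vectors[0]))]


def synthesize(alpha, beta, prototype_flows, prototype_textures, warp):
    """Return (c_syn, i_syn) following the structure of Eqs. (1) and (2)."""
    # Eq. (1): the synthesized flow is a blend of the prototype flows.
    c_syn = weighted_sum(alpha, prototype_flows)
    # Eq. (2), inner part: bring every prototype texture into the
    # geometry of the synthesized flow (warp is a hypothetical callable).
    warped = [warp(texture, flow, c_syn)
              for texture, flow in zip(prototype_textures, prototype_flows)]
    # Eq. (2), outer part: blend the warped textures.
    i_syn = weighted_sum(beta, warped)
    return c_syn, i_syn


if __name__ == "__main__":
    # Identity warp used only to exercise the blending arithmetic.
    identity = lambda texture, flow, c_syn: texture
    flows = [[0.0, 0.0], [2.0, 4.0]]
    textures = [[8.0, 8.0], [0.0, 4.0]]
    print(synthesize([0.5, 0.5], [0.25, 0.75], flows, textures, identity))
```

The flow must be blended first because the warp applied to each prototype texture depends on $C_{syn}$.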
The goal of trajectory analysis and synthesis is to learn a phoneme model and use it to synthesize novel speech trajectories in the MMM parameter space. The characteristics of the MMM parameters for each phoneme are examined from the corresponding image frames according to the audio alignment result. For simplicity, the MMM parameters for each phoneme are modeled as a multidimensional Gaussian distribution with mean vector $\mu_p$ and diagonal covariance matrix $\Sigma_p$. A trajectory of a novel speech sequence is derived by minimizing the following objective function:

$$E_s = (y-\mu)^T D^T \Sigma^{-1} D\,(y-\mu) + \lambda\, y^T W_k^T W_k\, y, \tag{3}$$

where the first term keeps the synthesized MMM parameter trajectory y close to the concatenated target mean vector $\mu$, with the distance weighted by the duration-weighting matrix D and the inverse of the covariance matrix $\Sigma$, and the second term, scaled by $\lambda$, enforces smooth transitions through the k-th order difference matrix $W_k$.
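Because Eq. (3) is quadratic in y, its minimizer can be written in closed form. The short derivation below is a worked step added for clarity; it assumes $\Sigma$ is symmetric (it is diagonal here) and that the combined matrix on the left-hand side is invertible.

```latex
\begin{align*}
\frac{\partial E_s}{\partial y}
  &= 2\,D^T \Sigma^{-1} D\,(y-\mu) + 2\lambda\, W_k^T W_k\, y = 0 \\
\Longrightarrow\quad
\bigl(D^T \Sigma^{-1} D + \lambda\, W_k^T W_k\bigr)\, y
  &= D^T \Sigma^{-1} D\,\mu
\end{align*}
```

The synthesized trajectory is therefore the solution of a single linear system, so a larger $\lambda$ pulls y toward a smoother curve while a smaller $\lambda$ lets it follow the phoneme means more closely.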
However, the synthesized MMM parameters tend to be under-articulated when the mean and the covariance for each phoneme are calculated directly from the pooled MMM parameters for that phoneme. To resolve this problem, gradient descent learning is employed to refine the phoneme model by iteratively minimizing the difference between the synthesized MMM trajectories y and the real MMM trajectories z. The error between the real and synthesized trajectories is defined by:

$$E_a = (z-y)^T (z-y), \tag{4}$$

and the phoneme model is refined by:

$$\mu_p^{new} = \mu_p^{old} - \eta\,\frac{\partial E_a}{\partial \mu_p}; \qquad \Sigma_p^{new} = \Sigma_p^{old} - \eta\,\frac{\partial E_a}{\partial \Sigma_p}, \tag{5}$$

where $\eta$ is a small learning rate parameter.
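The gradients in Eq. (5) follow from Eq. (4) by the chain rule, since $E_a$ depends on the phoneme model only through the synthesized trajectory y. The expansion below is a worked step under the assumption that vectors are columns, that $\partial y / \partial \mu_p$ denotes the Jacobian of y with respect to $\mu_p$, and that $\sigma_p$ (a symbol introduced here for illustration) is the vector of diagonal entries of $\Sigma_p$.

```latex
\begin{align*}
\frac{\partial E_a}{\partial y} &= -2\,(z-y), \\
\frac{\partial E_a}{\partial \mu_p}
  &= \Bigl(\frac{\partial y}{\partial \mu_p}\Bigr)^{T} \frac{\partial E_a}{\partial y}
   = -2\,\Bigl(\frac{\partial y}{\partial \mu_p}\Bigr)^{T} (z-y), \\
\frac{\partial E_a}{\partial \sigma_p}
  &= -2\,\Bigl(\frac{\partial y}{\partial \sigma_p}\Bigr)^{T} (z-y).
\end{align*}
```

Each update therefore moves the phoneme parameters in the direction that reduces the residual $z - y$, with the Jacobians determined by how the minimizer of Eq. (3) depends on $\mu$ and $\Sigma$.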
In summary, the trainable videorealistic speech animation requires two sets of parameters: a set of M prototype images and prototype flows to represent the texture and flow of the subject's mouth, and a set of phoneme models to model each phoneme in the MMM space using a Gaussian distribution for trajectory analysis and synthesis.