The present invention discloses methods of communication using electronic messages such as texts or emails where the composer of the electronic message selects an animation character to present the message to the recipient in spoken words. The electronic message may be an instant message, personal message, text message, e-mail, or voicemail. The animation character may be the image of oneself or a “selfie,” a well-known human character like the actor Clint Eastwood, a cartoon character like Jerry the Mouse, an animal character like a cat or a dog, or an avatar character like the character in the movie Avatar. The text of the message is converted into speech. The image of the animation character is animated using computer animation or CGI. The recipient's device displays the moving images or video of the animation character and outputs the speech. The speech is synthesized and may be in any language.
A device such as a smart phone can be used to implement the methods. A user can utilize the device, which includes a processing unit and program code, to compose an electronic message and select an animation character. The device converts the text of the electronic message into speech. The device further generates moving images of the animation character. The speech and the moving images are transmitted via the device. For instance, the user selects a photograph of herself, inputs the phrase “you complete me,” and the recipient of the electronic message is presented with moving images of the user uttering the words “you complete me.” In a preferred embodiment, the speech is synthesized in such a way that is characteristic of the voice of the actress “Renee Zellweger” who uttered the phrase “you complete me” in the movie “Jerry Maguire.”
In different embodiments, conversion of text to speech and generation of moving images can occur on different devices. The computer network system includes the sender's device, the recipient's device, servers, and a communication network. Servers include dedicated computer systems and associated software for speech synthesization and animation of the animation character, as specified herein. The communication network comprises at least one of the Internet, Wi-Fi, ground-based communication devices and software, routers, cables, interface software, air-based communication devices, satellites, and satellite transceivers. In one preferred embodiment, voice synthesization and image animation are performed by servers and the results are transmitted to recipients' devices.
In a preferred embodiment, the recipient's device receives the sender's text and animation character and performs the operations of voice synthesization and animation generation. For instance, the sender inputs the text “love means never having to say you're sorry” and selects the character “Ryan O'Neal.” The text and image (or in an alternative embodiment, an image identifier which identifies the image of Ryan O'Neal) are transmitted via the user's device through the communication network and are received by the recipient's device, such as an iPhone. The recipient's iPhone notifies the recipient, through badges, alerts, or banners, that the recipient has received an electronic message. An App, embodying this preferred embodiment, is activated and the recipient sees an animated Ryan O'Neal speaking the words “love means never having to say you're sorry.” In a preferred embodiment, the catch-phrase “love means never having to say you're sorry,” which was uttered by the actor Ryan O'Neal in the movie “Love Story,” is selectable and does not have to be inputted by the sender.
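By way of a non-limiting illustration, the alternative embodiment in which the message text travels together with an image identifier could be sketched as follows. The JSON encoding, the field names `text` and `character_id`, and the identifier value are assumptions made for this example only; the specification does not prescribe any wire format.

```python
import json

def encode_message(text, character_id):
    """Sender side: pack the message text and the image identifier into one payload."""
    # Field names are hypothetical; any agreed-upon format would serve.
    return json.dumps({"text": text, "character_id": character_id})

def decode_message(payload):
    """Recipient side: recover the text and the image identifier from the payload."""
    data = json.loads(payload)
    return data["text"], data["character_id"]

payload = encode_message("love means never having to say you're sorry", "ryan_oneal")
text, character_id = decode_message(payload)
```

In such a sketch, the recipient's device would pass `text` to its speech synthesizer and use `character_id` to look up the image to animate.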
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.
Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely “synthetic” voice output.
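As a minimal sketch of the concatenative approach described above, stored speech units can be looked up by name and joined end to end. The unit names and the sample values below are placeholders assumed for illustration; an actual database would hold recorded audio for each unit.

```python
# Hypothetical unit database: each diphone name maps to a short list of audio samples.
UNIT_DB = {
    "h-e": [0.0, 0.2, 0.4],
    "e-l": [0.4, 0.3],
    "l-o": [0.3, 0.1, 0.0],
}

def synthesize(unit_names, db=UNIT_DB):
    """Concatenate the stored samples for each requested unit, in order."""
    samples = []
    for name in unit_names:
        if name not in db:
            raise KeyError(f"no recorded unit for {name!r}")
        samples.extend(db[name])
    return samples

waveform = synthesize(["h-e", "e-l", "l-o"])
```

The sketch joins units without smoothing the boundaries between them, which is one reason small-unit systems may lack clarity.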
The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1990s.
A text-to-speech system (or “engine”) is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end—often referred to as the synthesizer—then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations), which is then imposed on the output speech.
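The text normalization task of the front-end can be illustrated with a simplified sketch. The abbreviation table and the digit-by-digit reading of numbers are assumptions chosen to keep the example short; a practical front-end would use much larger tables and context-dependent rules.

```python
import re

# Hypothetical, deliberately small lookup tables for illustration.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "mr.": "mister"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize(text):
    """Convert raw text into a list of written-out, lower-case words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            # Expand a known abbreviation into its full word.
            words.append(ABBREVIATIONS[token])
            continue
        stripped = token.strip(".,!?;:")
        if re.fullmatch(r"[0-9]+", stripped):
            # Read a number one digit at a time, e.g. "42" becomes "four", "two".
            words.extend(DIGITS[int(ch)] for ch in stripped)
        elif stripped:
            words.append(stripped)
    return words

words = normalize("Dr. Smith lives at 42 Elm St.")
```

The resulting word list would then be passed to grapheme-to-phoneme conversion, which this sketch does not attempt.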
The present invention implements these existing technologies (Text-to-Speech, or TTS systems) to synthesize the voice of the animation character. Synthesization of the animation character speech can be created by either concatenating pieces of recorded speech that are stored in a database, or by incorporating a model of the vocal tract and other animation character voice characteristics to create a completely synthetic voice output.
Computer animation or CGI animation is the process used for generating animated images by using computer graphics. The more general term computer-generated imagery encompasses both static scenes and dynamic images, while computer animation only refers to moving images. Modern computer animation usually uses 3D computer graphics, although 2D computer graphics are still used for stylistic, low bandwidth, and faster real-time renderings. Sometimes the target of the animation is the computer itself, but sometimes the target is another medium, such as film.
Computer animation is essentially a digital successor to the stop motion techniques used in traditional animation with 3D models and frame-by-frame animation of 2D illustrations. Computer-generated animation is more controllable than other, more physically based processes, such as constructing miniatures for effects shots or hiring extras for crowd scenes, and it allows the creation of images that would not be feasible using any other technology. It can also allow a single graphic artist to produce such content without the use of actors, expensive set pieces, or props. To create the illusion of movement, an image is displayed on the computer monitor and repeatedly replaced by a new image that is similar to it, but advanced slightly in time (usually at a rate of 24 or 30 frames/second). This technique is identical to how the illusion of movement is achieved with television and motion pictures. For 3D animations, objects (models) are built on the computer monitor (modeled) and 3D figures are rigged with a virtual skeleton.
For 2D figure animations, separate objects (illustrations) and separate transparent layers are used, with or without a virtual skeleton. Then the limbs, eyes, mouth, clothes, etc. of the figure are moved by the animator on key frames. The differences in appearance between key frames are automatically calculated by the computer in a process known as tweening or morphing. Finally, the animation is rendered.
For 3D animations, all frames must be rendered after modeling is complete. For 2D vector animations, the rendering process is the key frame illustration process, while tweened frames are rendered as needed. For pre-recorded presentations, the rendered frames are transferred to a different format or medium such as film or digital video. The frames may also be rendered in real time as they are presented to the end-user audience. Low bandwidth animations transmitted via the Internet (e.g. 2D Flash, X3D) often use software on the end-user's computer to render in real time as an alternative to streaming or pre-loaded high bandwidth animations.
The present invention implements these existing technologies (computer animation or CGI animation systems) to generate moving images of the animation character. Animation of the animation character can be in 2D or 3D which mimics the traditional frame-by-frame or stop-motion techniques, respectively.
In most 3D computer animation systems, an animator creates a simplified representation of a character's anatomy, analogous to a skeleton or stick figure. The position of each segment of the skeletal model is defined by animation variables, or Avars. In human and animal characters, many parts of the skeletal model correspond to actual bones, but skeletal animation is also used to animate other things, such as facial features (though other methods for facial animation exist). The character “Woody” in Toy Story, for example, uses 700 Avars, including 100 Avars in the face. The computer does not usually render the skeletal model directly (it is invisible), but uses the skeletal model to compute the exact position and orientation of the character, which is eventually rendered into an image. Thus by changing the values of Avars over time, the animator creates motion by making the character move from frame to frame.
There are several methods for generating the Avar values to obtain realistic motion. Traditionally, animators manipulate the Avars directly. Rather than set Avars for every frame, they usually set Avars at strategic points (frames) in time and let the computer interpolate or ‘tween’ between them, a process called keyframing. Keyframing puts control in the hands of the animator, and has roots in hand-drawn traditional animation. Accordingly, the present invention may implement this technique to create and change the values of Avars over time to generate moving images of the animation character.
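Keyframing can be illustrated with a short sketch in which one Avar is set at a few key frames and the value at every other frame is computed by interpolation. Linear interpolation, the function name `tween`, and the Avar name `jaw_open` are assumptions for this example; production systems commonly use smoother curves.

```python
from bisect import bisect_right

def tween(keyframes, frame):
    """Linearly interpolate an Avar value at `frame`.

    `keyframes` is a list of (frame_number, value) pairs sorted by frame number.
    """
    frames = [f for f, _ in keyframes]
    # Before the first key frame or after the last, hold the nearest key value.
    if frame <= frames[0]:
        return keyframes[0][1]
    if frame >= frames[-1]:
        return keyframes[-1][1]
    # Find the pair of key frames that surround the requested frame.
    i = bisect_right(frames, frame)
    (f0, v0), (f1, v1) = keyframes[i - 1], keyframes[i]
    t = (frame - f0) / (f1 - f0)
    return v0 + t * (v1 - v0)

# Hypothetical Avar: the jaw is closed at frame 0, open at frame 12, closed at frame 24.
jaw_open = [(0, 0.0), (12, 1.0), (24, 0.0)]
halfway = tween(jaw_open, 6)
```

Evaluating `tween` once per frame for every Avar yields the pose of the character at that frame, which is then rendered.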
In contrast, a newer method called motion capture makes use of live action. When computer animation is driven by motion capture, a real performer acts out the scene as if they were the character to be animated. His or her motion is recorded to a computer using video cameras and markers, and that performance is then applied to the animated character. Accordingly, the present invention may implement the motion capture technique to generate moving images of the animation character.
Each method has its advantages, and as of 2007, games and films are using either or both of these methods in productions. Keyframe animation can produce motions that would be difficult or impossible to act out, while motion capture can reproduce the subtleties of a particular actor. For example, in the 2006 film Pirates of the Caribbean: Dead Man's Chest, actor Bill Nighy provided the performance for the character Davy Jones. Even though Nighy himself doesn't appear in the film, the movie benefited from his performance by recording the nuances of his body language, posture, facial expressions, etc. Thus motion capture is appropriate in situations where believable, realistic behavior and action are required, but the types of characters required exceed what can be done through conventional costuming.
3D computer animation combines 3D models of objects and programmed or hand “keyframed” movement. Models are constructed out of geometrical vertices, faces, and edges in a 3D coordinate system. Objects are sculpted much like real clay or plaster, working from general forms to specific details with various sculpting tools. Unless a 3D model is intended to be a solid color, it must be painted with “textures” for realism. A bone/joint animation system is set up to deform the CGI model (e.g., to make a humanoid model walk). In a process called rigging, the virtual marionette is given various controllers and handles for controlling movement. Animation data can be created using motion capture, keyframing by a human animator, or a combination of the two. 3D models rigged for animation may contain thousands of control points; for example, the character “Woody” in Pixar's movie Toy Story uses 700 specialized animation controllers. Rhythm and Hues Studios labored for two years to create Aslan in the movie The Chronicles of Narnia: The Lion, the Witch and the Wardrobe, which had about 1851 controllers, 742 of them in the face alone. In the 2004 film The Day After Tomorrow, designers had to design forces of extreme weather with the help of video references and accurate meteorological facts. For the 2005 remake of King Kong, actor Andy Serkis was used to help designers pinpoint the gorilla's prime location in the shots, and his expressions were used to model “human” characteristics onto the creature. Serkis had earlier provided the voice and performance for Gollum in J. R. R. Tolkien's The Lord of the Rings trilogy. Accordingly, the present invention may implement 3D computer animation techniques to generate moving images of the animation character.
Computer animation can be created with a computer and animation software. Some impressive animation can be achieved even with basic programs; however, the rendering can take a lot of time on an ordinary home computer. Because of this, video game animators tend to use low resolution, low polygon count renders, such that the graphics can be rendered in real time on a home computer. Photorealistic animation would be impractical in this context. Professional animators of movies, television, and video sequences on computer games make photorealistic animation with high detail. This level of quality for movie animation would take tens to hundreds of years to create on a home computer. Many powerful workstation computers are used instead. Graphics workstation computers use two to four processors, and thus are a lot more powerful than a home computer, and are specialized for rendering. A large number of workstations (known as a render farm) are networked together to effectively act as a giant computer. The result is a computer-animated movie that can be completed in about one to five years (this process is not composed solely of rendering, however). A workstation typically costs $2,000 to $16,000, with the more expensive stations being able to render much faster, due to the more technologically advanced hardware that they contain. Professionals also use digital movie cameras, motion capture or performance capture, bluescreens, film editing software, props, and other tools for movie animation. Accordingly, the present invention may utilize these types of animation software and hardware to generate moving images of the animation character.
The realistic modeling of human facial features is one of the most challenging and sought-after elements in computer-generated imagery. Computer facial animation is a highly complex field where models typically include a very large number of animation variables. Historically speaking, the first SIGGRAPH tutorials on State of the art in Facial Animation in 1989 and 1990 proved to be a turning point in the field by bringing together and consolidating multiple research elements, and sparked interest among a number of researchers. The Facial Action Coding System (with 46 action units such as “lip bite” or “squint”), which had been developed in 1976, became a popular basis for many systems. As early as 2001, MPEG-4 included 68 Face Animation Parameters (FAPs) for lips, jaws, etc. The field has made significant progress since then, and the use of facial microexpression has increased. In some cases, an affective space such as the PAD emotional state model can be used to assign specific emotions to the faces of avatars. In this approach the PAD model is used as a high-level emotional space, and the lower-level space is the MPEG-4 Facial Animation Parameters (FAP). A mid-level Partial Expression Parameters (PEP) space is then used in a two-level structure: the PAD-PEP mapping and the PEP-FAP translation model. Accordingly, the present invention may incorporate these facial animation techniques in generating moving images of the animation character.
In 2D computer animation, moving objects are often referred to as “sprites.” A sprite is an image that has a location associated with it. The location of the sprite is changed slightly, between each displayed frame, to make the sprite appear to move. Computer animation uses different techniques to produce animations. Most frequently, sophisticated mathematics is used to manipulate complex three-dimensional polygons, apply “textures,” lighting and other effects to the polygons, and finally render the complete image. A sophisticated graphical user interface may be used to create the animation and arrange its choreography. Another technique called constructive solid geometry defines objects by conducting Boolean operations on regular shapes, and has the advantage that animations may be accurately produced at any resolution. Accordingly, the present invention may incorporate these computer animation techniques to generate moving images of the animation character.
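The sprite technique can be sketched in a few lines: an image handle is paired with a location, and the location is shifted by a small amount before each frame is displayed. The class name, the field names, and the constant per-frame step are assumptions made for this illustration.

```python
from dataclasses import dataclass

@dataclass
class Sprite:
    image_id: str  # hypothetical handle to the bitmap that would be drawn
    x: float
    y: float

def advance(sprite, dx, dy, frames):
    """Move the sprite by (dx, dy) per frame and return its location at each frame."""
    positions = []
    for _ in range(frames):
        sprite.x += dx
        sprite.y += dy
        positions.append((sprite.x, sprite.y))
    return positions

cat = Sprite("cat", 0.0, 0.0)
path = advance(cat, 2.0, 1.0, 3)
```

Drawing the sprite's image at each successive location in `path` is what makes it appear to move across the screen.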
Computer-assisted animation is usually classed as two-dimensional (2D) animation. The creator's drawings are either hand drawn (pencil to paper) or interactively drawn (drawn on the computer) using different assisting appliances, and are then placed into specific software packages. Within the software package, the creator places drawings into different key frames, which fundamentally create an outline of the most important movements. The computer then fills in all the “in-between frames,” a process commonly known as tweening. Computer-assisted animation essentially uses new technologies to cut down the time that traditional animation could take, while still retaining the elements of traditional drawings of characters or objects. Two examples of films using computer-assisted animation are Beauty and the Beast and Antz. Computer-generated animation is known as three-dimensional (3D) animation. Creators design an object or character with X, Y and Z axes. Unlike traditional animation, computer-generated animation does not rely on pencil-to-paper drawings. The object or character created is then taken into software. Key framing and tweening are also carried out in computer-generated animation, but many techniques are also used that do not relate to traditional animation. Animators can break physical laws by using mathematical algorithms to cheat the rules of mass, force and gravity. Fundamentally, time scale and quality are the two major things that are enhanced by computer-generated animation, which could make it a preferred way to produce animation. Another notable aspect of computer-generated animation is that a flock of creatures created as a group can be made to act independently. An animal's fur can be programmed to wave in the wind and lie flat when it rains, instead of programming each strand of hair separately. Three examples of computer-generated animation movies are Toy Story, The Incredibles and Shrek.
Accordingly, the present invention may incorporate computer-assisted and/or computer-generated animation techniques to generate moving images of the animation character.
Incorporating and applying these technologies to an electronic message can transform plain text into interesting short films. The present invention seeks to provide methods for communication using electronic messages where users can use their creativity to enhance their content. Images and videos have much greater impact than simple words. Transforming electronic messages according to the present invention is therefore desirable.
Although various systems have been proposed which touch upon some aspects of the above problems, they do not provide solutions to the existing limitations in providing methods of communication for users to compose electronic messages and select a character where the message is converted into voice and the character is displayed on a screen in motion uttering the message. For example, Coatta et al., U.S. Pat. App. No. 20140129650, discloses: “a wireless communications system that allows a mobile phone, tablet or personal computer user the ability to initiate the sending of a text message or email whereby the sender is able to include photographs, graphs, pie charts and the like within the flow of the actual word by word texting or email writing process, without depending on the traditional necessary step to ‘attach’ the photograph.” However, the disclosure does not provide methods where texts are transformed into moving characters uttering the text.
Mamoun, U.S. Pat. App. No. 20140082520, discloses: “instant messaging applications of all forms, ranging from standard short-message-service (SMS) text messaging to basic multimedia messaging incorporating sounds and images, to myriad ‘chat’ applications, have become a staple form of communication for millions or billions of phone, computer and mobile device users. The following invention is composed of a set of claims that comprise a novel method and system for an enhanced, more expressive system of messaging that combines text and multimedia (audio, images and video) with a gesture-driven, animated interface especially suited for the newest generation of touch-sensitive mobile device screens. An additional set of claims extends the gesture-driven interface to include ‘hands-free’ spatial-gesture-recognizing-devices which can read and interpret physical hand and body gestures made in the environment adjacent to the device without actual physical contact, as well as adaptations for less-advanced traditional computers with keyboard and mouse.” However, Mamoun does not disclose methods where texts are transformed into moving images uttering the text.
According to surveys, mobile use around the world has been increasing over the past years. While music, games, news and other factors all played a role in the growth, electronic messaging has been the main driver of the increased usage. The present invention offers a simple, yet efficient, alternative to existing technologies by incorporating methods of voice synthesization and image animation to transform texts into what might be called a very short film! It provides a platform where users may use their creativity to enhance the content of their electronic messages.