The concept of the video-telephone has long been anticipated, including in the serialized novel “Tom Swift and His Photo Telephone” (1914). An early working videophone system was exhibited by Bell Labs at the 1964 New York World's Fair. AT&T subsequently commercialized this system in various forms, under the Picturephone brand name. However, the Picturephone had very limited commercial success. Technical issues, including low resolution, lack of color imaging, and poor audio-to-video synchronization, affected the performance and limited the appeal. Additionally, the Picturephone imaged a very restricted field of view, basically amounting to a portrait format image of a participant. This can be better understood from U.S. Pat. No. 3,495,908 (Rea), which describes a means for aligning a user within the limited capture field of view of the Picturephone camera. Thus, the images were captured with little or no background information, resulting in a loss of context.
In the modern world, two-way video communications are now enabled by various technologies. As a first example, cellular phones, including phone-cameras, are widely used. While currently many cell phones include cameras for capturing still images, most cell phones still lack live video capture and display capability. However, companies such as Fotonation Ltd. (Ireland) are enabling new technologies for live video phone-cameras, such as face detection and recognition, as well as face tracking, which could enhance the user experience. As an example, U.S. Patent Publication No. 2005/0041840 (Lo) describes a camera phone with face recognition capability. While phone-cameras are easy to use, highly mobile, and have arguably become essential for modern life, the size and cost structure constraints limit their applicability for some uses.
Another realization of a device with these general video-phone like capabilities is the “web-cam,” where a computer, such as a lap-top unit, is equipped with a camera that often has pan, tilt, and zoom capabilities. Companies such as Creative Laboratories (Singapore) and Logitech (Switzerland) presently offer enhanced cameras as computer accessories for web-camera use. These web-cameras can have enhanced audio-capture capability, movement detection, face tracking, and other value-adding features. As an example, U.S. Patent Publication No. 2006/0075448 (McAlpine et al.) describes a system and method for mechanically panning, tilting, and/or zooming a webcam to track a user's face.
Apple Inc. (Cupertino, Calif., U.S.A.) has further extended the web-camera, with the “iSight” and “iChat” products, where the camera is integrated into a lap-top computer, and onboard image processing automatically adjusts the white balance, sharpness, color, focus and exposure and filters out noise to ensure that the transmitted picture provides bright, focused, and true-color imagery. The “iChat” function enables one-to-one chat, multi-way chat, or audio chat with up to ten people. While these video-camera-computer systems are enabling internet-based video-telephony, these technologies have not become ubiquitous like the cell phone has. Certainly, the differential increased cost and size are reasons for this. However, there are many issues related to the user experience with the web-camera that have not yet been adequately addressed. In particular, these systems are not fully optimized for easy use in dynamic environments, such as the home. To accomplish this, technology improvements around the user interface, image-capture, and privacy factors may be needed.
Notably, WebEx Communications (recently acquired by Cisco Systems) has adapted web-camera technology for the purpose of providing inexpensive web-based video-conferencing for conducting meetings, training sessions, webinars, for providing customer support, and for other business purposes. WebEx delivers applications over a private web-based global network purpose-built for real-time communications. However, the WebEx approach, while useful, does not anticipate the concerns and needs that people have when communicating by video on a personal basis.
As another alternative to the phone-camera or the web-cam, a video-phone having a larger screen, a more functional camera with zoom and tracking capability, enhanced audio, and multi-user capability, could provide an enhanced user experience. Such enhanced video-phone devices could be used in the home, office, or school environments, where mobility can be compromised for improved capture and display capabilities. Most simply, such a system could combine a camera and a television, and use a phone or Internet connection to transfer information from one location to another. U.S. Patent Publication No. 2005/0146598 (AbbiEzzi et al.) describes a basic home teleconferencing system with this construction. This system indeed contains the basic image capture and display elements for a residential teleconferencing system. Like the web-cameras, the system can capture and display a large field of view, which improves on the contextual capture over the original Picturephone. However, again there are many aspects of residential video-telephony, relative to managing privacy and personal context in a dynamic residential environment, that this system does not anticipate.
A system described in U.S. Pat. No. 6,275,258 (Chim) provides an enhanced teleconferencing system in which multiple microphones are used to enable enhanced subject tracking using audio signals. Chim '258 provides an audio-driven enhanced tracking process, which employs multiple microphones to localize and track an individual speaker in their local environment. An audio processor derives an audio tracking signal, which is used to drive a camera to follow the speaker. The field of view captured by the camera can be optimized, by both mechanical movement (pan, tilt, and zoom) and image cropping, to follow and frame a speaker or a sound emitting object in their environment. Thus, image framing in Chim '258 is keyed on including the speaker in the image, where the exemplary speaker is a single user sitting at a desk, engaged in a teleconferencing activity. Chim '258 does not consider the broader problem of following and framing the activities of one or more users under circumstances where speech is an insufficient cue. In particular, Chim '258 does not provide scene transition analysis and shot framing analysis to switch between shots that are appropriate for video capture of one or more users engaged in informal and unscripted activities within a broad, largely unconstrained, environment. Thus, while Chim '258 suggests that his system might be used in a residential environment, in most respects, the system is really targeted for the corporate conference room or office environments, as the privacy, contextual interpretation, and video capture management aspects are underdeveloped and insufficient for residential use.
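As a rough illustration of the kind of audio-driven tracking attributed above to Chim '258, the following Python sketch estimates a speaker's bearing from the arrival-time difference at two microphones and converts that bearing into a rate-limited pan update. The far-field two-microphone geometry, the function names, and all numeric values are assumptions made for illustration only; they are not taken from Chim '258.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def bearing_from_delay(delay_s, mic_spacing_m):
    """Estimate source bearing in degrees from broadside (far-field assumption).

    delay_s is the arrival-time difference between the two microphones.
    """
    # Path-length difference implied by the delay, as a fraction of the spacing.
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    # Clamp to [-1, 1] so measurement noise cannot push asin out of its domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


def pan_command(current_pan_deg, target_bearing_deg, max_step_deg=5.0):
    """Move the pan angle toward the target, limited to max_step_deg per update."""
    error = target_bearing_deg - current_pan_deg
    step = max(-max_step_deg, min(max_step_deg, error))
    return current_pan_deg + step
```

A sketch of this kind frames only whoever is producing sound, which is consistent with the observation above that speech alone can be an insufficient cue.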
As another approach to video communications, enhanced video-telephony has been realized by video-conferencing equipment, which is largely targeted for the corporate market. As an example, companies such as Cisco Systems (San Jose, Calif., U.S.A.); Digital Video Enterprises (Irvine, Calif., U.S.A.); Destiny Conferencing (Dayton, Ohio, U.S.A.); and Teleris (London, United Kingdom), are offering enhanced video-teleconferencing equipment targeted for use by corporate executives. Exemplary teleconferencing prior art patents associated with some of the above companies include U.S. Pat. Nos. 5,572,248 and 6,160,573 (both by Allen et al.), and U.S. Pat. Nos. 6,243,130 and 6,710,797 (both by McNelley et al.). The product offerings of these companies emphasize image and sound fidelity, environmental aesthetics and ergonomics, life-size images, eye contact image capture and display, and the seamless and secure handling of large data streams through networks. For example, improved eye contact is typically achieved by hiding a camera behind a screen or beam splitter, through which it unobtrusively peers.
Although video-conferencing systems are designed to handle multiple participants from multiple locations, the systems are optimized for use in highly controlled environments, rather than the highly variable environments typical of personal residences or schools. In particular, these systems assume or provide standard conference rooms with a central table, or more elaborate rooms, with congress-like seating. As image capture occurs in structured environments with known participants behaving in relatively formalized ways, these conference systems are not enabled with capabilities that could be desired in dynamic personal environments. These systems can also be equipped to extract the images of the local participants from their contextual backgrounds, so that when the image of that participant is seen remotely, the image appears contextually in the remote environment or in a stylized virtual environment. The cost of teleconferencing systems is often in excess of $100,000, which is not supportable by the residential market.
It is noted that some enhanced teleconferencing systems, which are adaptive to multi-person conversational dynamics, have been anticipated. In particular, a series of patents, including U.S. Pat. No. 6,894,714 (Gutta et al.), and U.S. Pat. Nos. 6,611,281 and 6,850,265 (both by Strubbe), which are all assigned to Philips Electronics (Eindhoven, Netherlands), suggest methods for teleconferencing under dynamic circumstances. As a first example, the Strubbe '281 patent proposes a video-conferencing system having a video locator and an audio locator whose output is used to determine the presence of all participants. In operation, the system focuses on a person who is speaking and conveys a close-up (preferably life size) view of that person based on the video and audio locator outputs. Thereafter, if the person speaking continues to speak or becomes silent for a predetermined time period, the system operates to adjust the camera setting to display other participants in sequence who are not speaking, or it zooms out the camera by a specified amount to include all participants. The system is also configured to capture a new person entering or an existing participant exiting the videoconference session. The videoconference scenario of FIG. 2 of the Strubbe '281 patent, which depicts a conference-room-like setting with participants sitting around a table, does seem particularly suited to handling a formal or semi-formal corporate meeting event, where the various participants are of relatively equal status, and a certain amount of decorum or etiquette can be expected. In such circumstances, the formalism of capturing and transmitting the non-speaking participants in sequence could be applicable and appropriate.
The Strubbe '265 and Gutta '714 patents basically expand upon the concepts of the Strubbe '281 patent, by providing adaptive means to make a videoconferencing event more natural. In the Strubbe '265 patent, the system applies a set of heuristic rules to the functionality provided by the camera, the audio locator, and the video locator. These heuristic rules attempt to determine whether the system should follow a current speaker or switch to a new speaker. Various factors, such as time gaps between speakers, and 5-degree co-location thresholds are measured and assessed against confidence level estimations to determine whether the system should switch to another individual or switch to wide field of view image capture. The Gutta '714 patent extends the concepts of dynamic videoconferencing further, as it identifies a series of behavioral cues from the participants, and analyzes these cues to predict who the next likely speaker is, and then pro-actively makes a seamless transition in shifting the video-capture from a first speaker to a second speaker. These behavioral cues include acoustic cues (such as intonation patterns, pitch and loudness), visual cues (such as gaze, facial pose, body postures, hand gestures, and facial expressions), or combinations of the foregoing, which are typically associated with an event. As depicted in the respective FIG. 1 of each patent, these patents basically anticipate enhanced video-conferencing appropriate for the conference room or for congress-like seating arrangements, where there is little movement or change of the participants. These patents also seem particularly suited to handling a formal or semi-formal corporate meeting event, where the various participants are of relatively equal status, and a certain amount of decorum or etiquette can be expected. Although the Gutta '714 patent suggests broader applicability, and modestly anticipates (see Col. 11 table) a situation with a child present, the systems proposed in the Strubbe '281, Strubbe '265, and Gutta '714 patents are not targeted to the residential environment. For example, the proposed rules for predicting and redirecting image framing to the next speaker would be undesirable in video capture of chaotic informal activities, where people often interrupt each other. Likewise, these patents do not consider how to frame shots, identify shot or scene transitions, and execute the resulting shot changes for unscripted events likely involving many people. Thus, these patents are not sufficiently adaptive to residential dynamics, and the privacy and context management aspects are also underdeveloped.
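To make the style of rule described for the Strubbe '265 patent more concrete, the following Python sketch shows one way a follow, switch, or wide-view decision might be expressed. Only the 5-degree co-location threshold comes from the description above; the function name, the minimum time gap, the confidence threshold, and the ordering of the tests are hypothetical choices made for illustration and are not drawn from the patent.

```python
def choose_shot(current_bearing_deg, new_bearing_deg, gap_s, confidence,
                colocation_deg=5.0, min_gap_s=1.0, min_confidence=0.7):
    """Return 'follow', 'switch', or 'wide' for a newly detected sound source.

    colocation_deg reflects the 5-degree threshold mentioned in the text;
    min_gap_s and min_confidence are assumed values.
    """
    # A sound within the co-location threshold is treated as the same speaker.
    if abs(new_bearing_deg - current_bearing_deg) <= colocation_deg:
        return "follow"
    # A different location, a confident estimate, and a long enough gap: switch.
    if confidence >= min_confidence and gap_s >= min_gap_s:
        return "switch"
    # Otherwise the situation is ambiguous, so fall back to a wide view.
    return "wide"
```

Rules of this form presume orderly turn-taking; with frequent interruptions the gap test would rarely be satisfied, so the sketch would fall back to the wide view most of the time.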
Teleconferencing or enhanced video communications has also been explored for the office and laboratory environments, as well as the conference room environment, to aid collaboration between colleagues. The first such example, the “media space”, which was developed in the 1980's at the Xerox Palo Alto Research Center, Palo Alto, Calif., U.S.A., provided office-to-office, always-on, real-time audio and video connections. As a related example, the “VideoWindow”, described in “The Video Window System in Informal Communications”, by Robert S. Fish, Robert E. Kraut, and Barbara L. Chalfonte, in the Proceedings of the 1990 ACM conference on Computer-Supported Cooperative Work, provided full duplex teleconferencing with a large screen, in an attempt to encourage informal collaborative communications among professional colleagues. Although such systems enabled informal communications as compared to the conference room setting, these systems were developed for work use, rather than personal use in the residential environment, and thus do not anticipate residential concerns.
Prototype home media spaces, for facilitating communications between a telecommuter and in-office colleagues have also been developed. For example, an always-on home media space is described in “The Design of a Context-Aware Home Media Space for Balancing Privacy and Awareness”, by Carman Neustaedter and Saul Greenberg, in the Proceedings of the Fifth International Conference on Ubiquitous Computing (2003). The authors recognize that personal privacy concerns are much more problematic for home users than for office based media spaces. The described system reduces risks of privacy loss using a variety of methods, including secluded home office locations, people counting, physical controls and gesture recognition, and visual and audio feedback mechanisms. However, this system was not optimized for personal communications by the residents and does not necessarily provide adequate privacy controls for home users.
Thus, there is a remaining need and opportunity, which is not anticipated in the prior art, for a residentially targeted system that is generally useful for aiding family video-conferencing or video communications with one or more remote individuals. Such a system should function as seamlessly as is reasonably possible while being adaptable to the dynamic situations present in a residence. In particular, the system should enable the users to readily manage and maintain their privacy, relative at least to image capture, recording, and transmission. This system should also manage the contextual information of the user and their environments, to provide an effective communication experience.
Of course, the enjoyment users experience with such a system will greatly depend on the quality of the images, relative to how they are captured and presented. Consumer captured images, whether still or video, often have uneven image quality and artistic characteristics. This occurs because most consumers are untutored amateur photographers shooting images of unscripted live events. Although image quality characteristics can be improved by auto-focus, auto-flash and aperture control, red-eye reduction, and other technologies, artistic inadequacies are harder to address. In contrast, people attend movies for the entertainment and artistic value, largely based on the acting, plot, and genre. However, the artistic portrayal of the characters and scenes is very much affected by the cinematography. When cinematographers shoot movies, they typically use a series of guidelines concerning shot framing (scale and centering, such as the rule of thirds), shot perspective (camera angle and placement), shot transitions, and camera moves (pan, tilt, zoom, and dolly). Certainly, the comparison of consumer photography to cinematography is not entirely fair, as the latter benefits from large budgets, tightly scripted events, complete control of one or more cameras, and the opportunity to re-shoot the scenes for enhanced effect.
However, it may be possible to adapt aspects of cinematography to consumer use, in residential or other environments. Rather than teach consumers to be cinematographers, a better approach would be to adapt cinematographic sensibilities or guidelines into consumer devices. While this has not yet been attempted per se, there are various prior attempts at automated cinematography, videography, or camera selection that are worth considering. For example, prior art U.S. Pat. No. 5,457,370 (Edwards) provides a computerized motion control apparatus for controlling a plurality of degrees of freedom of the positioning and orientation of a camera within a film studio. The apparatus includes a dolly, positionable along an extensible track, and a camera provided with motors that provide variable control to pan, tilt, and roll. As another example, prior art U.S. Pat. No. 5,900,925 (Navarro) anticipates a hybrid approach, in which a studio camera can be computer driven or operated by a cameraman. In this case, the system responds to the measured or sensed position of a camera support such as a crane or dolly, and determines the desired camera pan, tilt, roll, focus, zoom, etc., based on predetermined correspondence between camera location and camera parameters. U.S. Patent Publication No. 2002/0130955 (Pelletier) anticipates the adaptation of automated cinematographic techniques to television production. In particular, Pelletier '955 anticipates the real-time operation of one or more cameras, relative to panning and zooming, to enable the average or occasional user to produce high quality pleasantly viewable images of a scene without needing expert knowledge. None of these prior art references either explicitly or implicitly anticipate the limitations, differences, or requirements for adapting cinematographic principles for use in residential video communications.
It is also recognized that a few attempts have been made to transfer cinematographic rules to unscripted live events outside the studio environment. As a first example, the paper “Cinematographic Rules Applied to a Camera Network”, by P. Doubek et al., published at Omnivis2004, describes an algorithm-based approach in which an imaging system acts as a combination of a director and a cameraman to drive image capture using multiple low-end networked cameras, with the goal of producing an attractive video stream. The system detects humans using background segmentation and skin area modeling to locate hands and heads. The system then determines a best available view or framing to show a person using visibility measures based on object size, object velocity of a tracked object, or the detection of skin. It then provides an appropriate view using images captured from one of the available cameras, or an interpolated view from a virtual camera, which is synthesized from images collected by the multiple cameras. To maximize the artistic and dramatic effect, this system selects a long shot when a subject is moving, moves to a medium shot or close-up for stationary subjects, and then attempts to provide a subjective shot (showing what the subject sees from the subject's point of view) by image interpolation when the subject stops moving. This system also tries to apply other cinematographic conventions, including the action axis rule when changing camera selection, and providing shot sequences that are progressive, regressive, or contrasting to enhance the artistic effect. Doubek et al. does provide a resistance factor, to prevent fast successions of viewpoint changes for the viewer. Accordingly, the shot or framing is changed only if an available new view is better than the current view, based on various visibility measures.
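The resistance-factor idea described above can be sketched as a simple hysteresis rule: the current view is kept unless a rival view beats it by a margin. The Python sketch below is a minimal illustration under stated assumptions. The additive margin, the score scale, the speed threshold, and the function names are all hypothetical, and the actual formulation used by Doubek et al. is not reproduced here.

```python
def select_view(current_view, scores, resistance=0.2):
    """Pick a camera view, switching only when a rival clearly beats the current one.

    scores maps view name -> visibility score (higher is better); the additive
    resistance margin is an assumed form of the resistance factor.
    """
    best_view = max(scores, key=scores.get)
    # With no usable current view, simply take the best available one.
    if current_view not in scores:
        return best_view
    # Require the best candidate to exceed the current view by the margin.
    if scores[best_view] > scores[current_view] + resistance:
        return best_view
    return current_view


def shot_for_motion(speed, moving_threshold=0.1):
    """Long shot for a moving subject, tighter shot for a stationary one."""
    return "long" if speed > moving_threshold else "medium"
```

For example, with scores of 0.5 for the current view and 0.6 for a rival, the sketch keeps the current view, whereas a rival score of 0.8 triggers a change.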
While the approach of Doubek et al. has merit, it seeks an artistic rendering and does not adequately anticipate or provide video capture of real time unscripted human activities within a residential environment, potentially involving multiple individuals and limited camera options. Doubek et al. measures subject activity, using a velocity vector approach, which can be insufficient, as factors such as the frequency and distance of subject motion can also be important. Doubek et al. also does not consider other measures of subject activity, such as a reframing frequency, that can indicate a need for changing framing or shot selection. As a result, Doubek et al. does not really develop concepts and metrics to discriminate between changes in user activity consistent with the current scene or shot (an intra-scene transition) and changes in user activity that are too large for the current shot (an inter-scene transition). Thus, Doubek et al. does not provide a method of shot selection, shot framing, and shot transition management for different amounts of relative subject motion. This is understandable as the cinematography rules applied by Doubek et al. do not provide for imaging moving subjects with any shots tighter than a long shot. Thus, Doubek et al. does not anticipate that shot transition timing can depend on both the current shot selection and the amount of subject activity. Finally, Doubek et al. also does not adequately provide for shot selection, framing, and transitions with multiple subjects.
U.S. Pat. No. 7,349,008 (Rui et al.) anticipates an automated camera management system using videography rules that is targeted for use in recording presentations (such as lectures or classes). The application of the videography rules is based on the type (size) of the presentation rooms and the use of a multitude of cameras. The videography rules in Rui '008 cover placement of the multiple cameras, image capture of presenter behavior, image capture of generalized audience behavior, and image capture of audience members questioning the presenter. Rui '008 also anticipates shot transition rules based on shot duration and presenter activity. The videography rules and system of Rui '008 are premised around the formalized setting of a presentation room, and are not conceived to address the issues pertinent to personal video communications, with the spontaneity and limitations of the residential environment. In particular, Rui '008 is not constrained by limited camera placement, does not anticipate shot selection or framing problems for framing one or more moving subjects, and does not develop adequate shot (or scene) transition rules and supporting activity or probability metrics.
U.S. Patent Publication No. 2006/0251384 (Vronay et al.) also explores adapting cinematographic sensibilities to video applications. Vronay et al. '384 describes an automated video editor (AVE) that is principally used in processing pre-recorded video streams that are collected by one or more cameras to produce video with more professional (and dramatic) visual impact. Accordingly, each of the video streams is analyzed using a scene identification module to partition each stream into a sequence of scenes, with this identification based significantly on the determination of whether an individual is speaking or not. A shot identification module then analyzes each scene to identify and rank candidate shots. Each scene is also analyzed by a scene-parsing module to identify objects, people, or other cues that can affect final shot selection. The best-shot selection module applies the shot parsing data and cinematic rules regarding shot selection and shot sequencing to select the best shots for each portion of a scene. Finally, the AVE constructs a final video and each shot based on the best-shot selections determined from each video stream.
Vronay et al. '384 also anticipates the AVE being used for various purposes, including for live unscripted events such as teleconferencing and birthday parties. In the case of multiparty live teleconferencing, an example is given where the AVE understands the structure of the communication event in advance, and it applies cinematic rules to deal with adding another remote participant (a third location), to determine speaker selection, and to arrange picture-in-picture or split screen viewing. However, in the case of unscripted events lacking predefined scene structures, such as birthday parties, Vronay et al. '384 anticipates that the users pre-record the video using one or more cameras, and then the users provide input relative to scene selection, person identification, shot selection, and final review during a subsequent video editing process using the AVE. Notably, Vronay et al. '384 does not extend the AVE technology to enhancing the live video capture of unscripted real time events when the video structure is unknown in advance and user behavior is uncertain. However, these are indeed the conditions that can occur during personal video communications as users multi-task in their environments, user events change, or the number and identities of the users change. Therefore, Vronay et al. '384 does not anticipate the scene transition and shot framing concepts, as well as the supporting metrics, that enable automated video capture of real time unscripted and unstructured events.
As another example, somewhat similar to Vronay et al. '384, the paper “Cinematized Reality: Cinematographic 3D Video System for Daily Life Using Multiple Outer/Inner Cameras”, by Kim et al., published at the IEEE Computer Vision and Pattern Recognition Workshop (2006), describes a system in which cameras record video of “unexpected moments in people's lives”, and the video is post-processed using cinematographic principles to create movies that appear as if they were created as real expertly captured film footage. To accomplish this, Kim et al. populates a living space with a multitude of cameras, including multiple ceiling mounted cameras and an omni-directional camera. Each camera then captures video of the ensuing events, with synchronizing time code data. The video from each camera is then analyzed by an algorithm using cinematographic guidelines regarding shot selection, shot perspective, zooming, panning, indecisive cuts, and the action axis to classify the available shots, as well as potential shots synthesized for a virtual camera. The virtual camera shots are rendered using video from the omni-directional camera and the ceiling mounted cameras in combination as appropriate. Users (acting as the director) then select the preferred shots to compose a movie progressing from scene to scene using video from a real or synthesized virtual camera. While the method of Kim et al. enables cinematization of real time video of unscripted events, the use of multiple liberally distributed cameras and post-processing means this technique is inappropriate for real time video communications.
There are also various examples of cinematography application to virtual worlds (for animation, gaming, metaverses such as Second Life, etc.), including the principles outlined in the paper “Cinematographic User Models for Automated Realtime Camera Control in Dynamic 3D Environments”, by William H. Bares and James C. Lester, which was published in the Proceedings of the Sixth International Conference on User Modeling (1997). This paper describes a cinematographic user-modeling framework that provides user-sensitive real time camera control of animations in dynamic 3D environments. Planning camera shots and camera positions in virtual environments, while preserving continuity, requires solving precisely the same set of problems that are faced by cinematographers. Users can provide input by selecting the viewpoint style (informational or dramatic), the camera pacing (slow or fast), and the transition style (gradual or jump). The cinema algorithm of Bares and Lester then applies these user preferences to a user model that enables virtual cameras to track objects by executing cuts, pans, and zooms (pull-ins and pull-outs), and to make on-the-fly decisions about camera viewing angles, distances, and elevations, while statistically holding to the user preferences. Similarly, U.S. Pat. No. 6,040,841 (Cohen et al.) describes a hierarchical approach for applying cinematographic rules in real-time animation to create virtual scene cinematography using virtual cameras and virtual actors. While the methods of Bares and Lester and Cohen '841 adapt cinematography to real-time action, their optimization for virtual worlds means these approaches are not bound by the constraints appropriate to real-time video communications of spontaneous unscripted events with limited cameras in residential environments.
In summary, there is then an opportunity and need to provide a method for automating video capture, including image framing, for applications such as personal video communications, in which video images of real time unscripted events are captured by a constrained set of cameras. Current video communications systems and other automated videography systems do not satisfy this need, either individually or in combination.