Imaging systems introduced in the computer gaming and associated display control field have made a tremendous impact. Systems offered by Microsoft (KINECT) and Sony (PLAYSTATION MOVE) have disrupted the marketplace, generating massive sales of new gaming systems. The tremendous popularity of the KINECT system in particular can be traced to the expanded play and control capacity that its hardware provides within the game environment. Now, instead of a player simply manipulating a keypad, a character (avatar) within a game environment runs, jumps or dances in coordinated action with the player's own digitized body movements.
The PLAYSTATION MOVE is reported as a motion-sensing game controller platform for the PlayStation 3 (PS3). Based on the popular game play style of Nintendo's Wii console, the PlayStation Move uses a camera to track the position of a lighted wand with inertial sensors in the wand to detect its motion. Another wand/object tracking system for video game control is disclosed in U.S. Pat. Nos. 7,843,429 and 8,068,095.
Unique to the KINECT system is the ability to capture and control video (including video game) function by gesture recognition. The KINECT system is reported to employ a color camera and a depth sensor, where the depth sensor employs an infrared projector and a monochrome sensor. Various patent publications assigned to Microsoft Corp (e.g., US Publication Nos. 20120050157; 20120047468 and 20110310007) further detail applicable hardware and software for image analysis and capture applied for the purpose of computer or video game navigation or control. US Publication No. 20110301934 addresses gesture capture for the purpose of performing sign language translation. Further examples incorporated by reference in this last publication include: U.S. patent application Ser. No. 12/475,094, entitled “Environment and/or Target Segmentation,” filed on May 29, 2009; U.S. patent application Ser. No. 12/511,850, entitled “Auto Generating a Visual Representation,” filed on Jul. 29, 2009; U.S. patent application Ser. No. 12/474,655, “Gesture Tool,” filed on May 29, 2009; U.S. patent application Ser. No. 12/603,437, “Pose Tracking Pipeline,” filed on Oct. 21, 2009; U.S. patent application Ser. No. 12/475,308, “Device for Identifying and Tracking Multiple Humans Over Time,” filed on May 29, 2009; U.S. patent application Ser. No. 12/641,788, “Motion Detection Using Depth Images,” filed on Dec. 18, 2009; U.S. patent application Ser. No. 12/575,388, “Human Tracking System,” filed on Oct. 7, 2009; U.S. patent application Ser. No. 12/422,661, “Gesture Recognizer System Architecture,” filed on Apr. 13, 2009; and U.S. patent application Ser. No. 12/391,150, “Standard Gestures,” filed on Feb. 23, 2009.
Whether employing a form of stereo imaging, or using the aforementioned depth sensor to map various z-axis planes on a full color captured image, none of the referenced systems contemplate hardware and software systems as provided herein.
Indeed, given that the commercial embodiment of the KINECT relies on structured light projection technology, its 3D depth detection sensitivity is quite limited. The system requires large movements of the hands or body in order to render correct gesture recognition, as do time-of-flight based sensors.
Systems such as the KINECT, or others relying on a dynamic or passive stereoscopic arrangement, rely on feature matching and triangulation calculations to recognize the 3D coordinates of objects such as hands or arms. Furthermore, it is important to note that in order for stereo systems to perform the triangulation process, the optical axes of the imaging systems (e.g., dual cameras) and structured-light projection system must be fixed with respect to each other with known calibration parameters. Any deviation from these fixed angles would result in poor depth reconstruction and thus poor gesture recognition.
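The sensitivity to angular deviation noted above can be illustrated with a minimal sketch. It assumes an idealized, rectified, parallel-axis stereo pair in which depth follows Z = f·B/d (focal length f in pixels, baseline B, disparity d), and it approximates an uncompensated yaw of one camera as an image shift of f·tan(yaw) near the image center. The function names and all numeric values (600 px focal length, 75 mm baseline, 0.5 degree yaw) are hypothetical and are not taken from any of the referenced systems.

```python
import math


def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth Z = f * B / d for a rectified, parallel-axis stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def image_shift_from_yaw(focal_px, yaw_rad):
    """Approximate pixel shift near the image center caused by an
    uncompensated yaw rotation of one camera (assumed small-angle model)."""
    return focal_px * math.tan(yaw_rad)


# Hypothetical rig and scene values, chosen only for illustration.
focal_px = 600.0
baseline_m = 0.075
true_depth_m = 2.0

# Disparity the calibrated rig would measure for a point at the true depth.
true_disparity_px = focal_px * baseline_m / true_depth_m  # 22.5 px

# One camera drifts by half a degree after calibration; the stale
# calibration then attributes the extra shift to scene depth.
yaw_error_rad = math.radians(0.5)
measured_disparity_px = true_disparity_px + image_shift_from_yaw(focal_px, yaw_error_rad)

estimated_depth_m = depth_from_disparity(focal_px, baseline_m, measured_disparity_px)
depth_error_m = true_depth_m - estimated_depth_m

print(f"true depth {true_depth_m:.2f} m, estimated {estimated_depth_m:.2f} m")
```

Under these assumed values, a half-degree drift adds roughly 5 pixels to a 22.5 pixel disparity, so the estimated depth falls to about 1.6 m for a point actually at 2 m. The direction of the error depends on the direction of the drift.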
Consequently, the elements of the KINECT system are arranged in fixed positions, separated across the face of a bar-shaped housing. Likewise, other stereo imaging systems in which various hardware components are combined are designed to be connected so as to establish a fixed and predetermined relationship between the different camera components. See U.S. Pat. Nos. 7,102,686; 7,667,768; 7,466,336; 7,843,487; 8,068,095 and 8,111,239.
More generally, in a typical stereo imaging system, the cameras are fixed in positions known relative to one another. Over a limited range of angles, features are extracted (such as by SIFT programming) from each scene captured by the first and second cameras. The feature data is combined with calibration data to extract 3D coordinates for the features, which are then used to coordinate user interface/control functions based on detected motion or otherwise.
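As a rough illustration of the last two steps of that pipeline, the sketch below back-projects one already-matched feature into 3D using calibration data and then computes its motion between two frames. It assumes a rectified pinhole stereo pair; feature extraction and matching (e.g., by SIFT) are taken as already done. The calibration constants, the pixel coordinates and the function names are all hypothetical.

```python
def backproject(match, focal_px, baseline_m, cx, cy):
    """Return (X, Y, Z) in meters, in the left-camera frame, for one matched
    feature given as (u_left, v, u_right) pixel coordinates."""
    u_left, v, u_right = match
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("matched feature must have positive disparity")
    z = focal_px * baseline_m / disparity
    x = (u_left - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return (x, y, z)


def displacement(p_start, p_end):
    """Component-wise 3D displacement between two positions."""
    return tuple(b - a for a, b in zip(p_start, p_end))


# Hypothetical calibration data for the fixed camera pair.
FOCAL_PX = 500.0
BASELINE_M = 0.1
CX, CY = 320.0, 240.0

# The same hand feature matched in two successive frames (assumed values).
match_frame0 = (370.0, 240.0, 320.0)   # disparity 50 px
match_frame1 = (395.0, 240.0, 332.5)   # disparity 62.5 px

p0 = backproject(match_frame0, FOCAL_PX, BASELINE_M, CX, CY)
p1 = backproject(match_frame1, FOCAL_PX, BASELINE_M, CX, CY)
motion = displacement(p0, p1)

print("start:", p0, "end:", p1, "motion:", motion)
```

With these assumed inputs the feature moves from a depth of 1.0 m to 0.8 m, i.e., about 0.2 m toward the cameras, which a control layer could interpret as a push gesture. Every matched feature in every frame requires the same calculation.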
In another multi-camera system, U.S. Pat. No. 8,111,904 describes methods and apparatus for determining the pose, e.g., position along x-, y- and z-axes, pitch, roll and yaw (or one or more characteristics of that pose) of an object in three dimensions by triangulation of data obtained from multiple images of the object. In a method for 3D machine vision, during a calibration step, multiple cameras disposed to acquire images of the object from different respective viewpoints are calibrated to discern a mapping function that identifies rays in 3D space emanating from each respective camera's lens that correspond to pixel locations in that camera's field of view. In a training step, functionality associated with the cameras is trained to recognize expected patterns in images to be acquired of the object. A runtime step triangulates locations in 3D space of one or more of those patterns from pixel-wise positions of those patterns in images of the object and from the mappings discerned during the calibration step.
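Triangulation from calibrated rays can be sketched generically as follows. This is not the method of the '904 patent; it is a textbook midpoint construction that assumes calibration has already mapped a pattern's pixel location in each of two cameras to a ray (an origin and a direction) in a common 3D frame. The camera positions, the target point and the function names are hypothetical.

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def point_on_ray(origin, direction, t):
    """Point at parameter t along a ray."""
    return tuple(o + t * d for o, d in zip(origin, direction))


def triangulate_midpoint(origin1, dir1, origin2, dir2):
    """Midpoint of the shortest segment joining two rays.

    With noise-free rays that truly intersect, this is the intersection."""
    w = tuple(p - q for p, q in zip(origin1, origin2))
    a, b, c = dot(dir1, dir1), dot(dir1, dir2), dot(dir2, dir2)
    d, e = dot(dir1, w), dot(dir2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; no unique closest point")
    s = (b * e - c * d) / denom   # parameter along ray 1
    t = (a * e - b * d) / denom   # parameter along ray 2
    q1 = point_on_ray(origin1, dir1, s)
    q2 = point_on_ray(origin2, dir2, t)
    return tuple((u + v) / 2.0 for u, v in zip(q1, q2))


# Hypothetical camera centers and a target used only to build test rays.
camera1 = (-0.5, 0.0, 0.0)
camera2 = (0.5, 0.0, 0.0)
target = (0.2, 0.1, 2.0)

ray1 = tuple(t - c for t, c in zip(target, camera1))
ray2 = tuple(t - c for t, c in zip(target, camera2))

estimate = triangulate_midpoint(camera1, ray1, camera2, ray2)
print("triangulated point:", estimate)
```

Because the two test rays are constructed to pass through the same target, the estimate reproduces that target to within floating-point error. With real, noisy rays the midpoint is only an approximation, and recovering a full pose would require triangulating several such patterns.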
Various multi-camera and/or single-camera, multi-aperture “defocusing” imaging systems by the inventor hereof are also described in the patent literature. See U.S. Pat. Nos. 6,278,847; 7,006,132; 7,612,869 and 7,612,870. These operate in a manner such that the recorded positions of matched point/feature doublets, triplets, etc. are measured in relation to one another against a fixed calibration set or otherwise known relationship between/within the image capture means to generate Z-axis values from imaged X-Y coordinate information.
Each of the aforementioned imaging systems is limited in some fashion. Of the commercially-available systems, the PLAYSTATION MOVE requires a wand and the KINECT system offers limited resolution (as further defined herein) and depth of field. All of the stereo imaging approaches further require feature matching and depth extraction by triangulation or other known effect, all of which is computationally intensive. Whether based on stereo imaging triangulation or time-of-flight, such systems require careful calibration, complicated optics, and intensive computational resources to measure 3D objects in order to capture 3D gestures. Defocusing approaches can also be computationally intensive and may in some cases require marker feature application (e.g., by contrast, projected light, etc.) for best accuracy. The multi-camera pose-based approach described in the '904 patent is perhaps the most computationally intensive approach of all.
In addition to the above, because of their inherent constraints, current systems do not allow for the pairing of arbitrary cameras and display systems in order to render depth measurements. A user must therefore purchase a system in its entirety rather than create a gesture recognition system from separate hardware components. Without the teachings herein, it is not currently possible to build a gesture recognition system from separate camera hardware components, such as a computer and a smartphone, or a smart television and a networked camera, that can be set up in a matter of minutes with few limits on the placement of the various camera components.
Systems are provided that operate outside of stereo imaging principles, complex computation or calibration requirements and/or the noted hardware limitations. Thus, the systems offer advantages as described below and as may further be apparent to those with skill in the art upon review of the subject disclosure.