3D modelling, the use of fully or partially computer-generated (CGI, Computer Generated Imagery) objects, and various 3D effects are gaining popularity in all areas of film making. The procedure usually applied in the film industry is to supplement the film material recorded on location afterwards, even with 3D objects, and to use image manipulation techniques. This requires a 3D model of the real environment and, in this regard, knowledge of the current position and viewing direction of the original recording camera. These procedures are supported by several products on the market, but they have a limited use and/or are very expensive.
Nowadays, the film industry is increasingly developing in the direction of actively using the data and structure of the space (environment). The already available approaches may even offer stereo 3D vision to the spectator in both TV and the movies. Whether 2D or 3D imaging is involved, the recordings of course always reflect the 3D world, even if the given film or animation is about a virtual world. In this field, many technical solutions are known by which a dazzling vision can be conjured up in front of the spectators' eyes.
If during the shooting, not only pictures are recorded, but they are also synchronised with the spatial locations of the objects in the real world, as well as the current position and orientation of the camera in space, even virtual 3D, computer-generated and model-based, objects can be inserted simply and quickly, or other special effects may be applied during the post-production of the film. Of course, the manufacturers are making efforts to meet these requirements, but known solutions are burdened with a number of problems for which no complex solution has been found so far. The apparatuses developed for this purpose available on the market are very expensive, and therefore only used for high budget films, or they are inaccurate and therefore cannot be used for professional purposes. Furthermore, they generally require considerable post-production (which represents a high cost by way of example in the case of a film, because of the high number of post-production hours), and therefore they are not adapted for real-time processing (like, for example, a live TV report) either. In addition, the currently available apparatuses generally have a large (extensive) size. For the synchronised recording of the 3D data of the world, i.e. the environment surrounding the recording apparatus, the acquisition of spatial information (data collection) must be carried out on the one hand, and the on-going tracking of the spatial position of the moving camera must be provided for on the other hand. The treatment of these two problems can be combined using the analogy of the related problem family SLAM (Simultaneous Localisation and Mapping) known from robotics.
For mapping the space (the environment) and recording the spatial information, several solutions are known, and for this purpose a number of sensors based on various measuring principles are available. In this field, the most popular remote sensor is the so-called LIDAR (Laser Imaging Detection and Ranging) or laser ranging sensor, which generally measures the distance of the objects in the environment in the nodes of a two-dimensional grid, thereby providing a depth image from the environment. The resolution of laser ranging sensors is heavily limited, and most LIDARs only measure in one plane. Even the more developed and more expensive units (like for example laser ranging sensors with a rotating head) provide data about the environment with a resolution of max. 32 to 64 lines. Laser ranging does not provide data about the trajectory of the recording apparatus.
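The relation between a grid of range readings and the spatial points it represents can be illustrated with a short sketch. The function name, the use of None for a node without a return, and the axis convention (x forward, y left, z up) are assumptions made for this illustration only; they are not taken from any particular sensor.

```python
import math


def range_grid_to_points(ranges, azimuths_deg, elevations_deg):
    """Convert a grid of range readings into 3D points in the sensor frame.

    ranges[i][j] is the distance measured at elevations_deg[i] and
    azimuths_deg[j]; None marks a grid node with no return.
    Assumed axis convention: x forward, y left, z up.
    """
    points = []
    for i, elevation in enumerate(elevations_deg):
        for j, azimuth in enumerate(azimuths_deg):
            r = ranges[i][j]
            if r is None:
                continue  # no measurement at this node
            e = math.radians(elevation)
            a = math.radians(azimuth)
            # Spherical-to-Cartesian conversion of one range reading.
            x = r * math.cos(e) * math.cos(a)
            y = r * math.cos(e) * math.sin(a)
            z = r * math.sin(e)
            points.append((x, y, z))
    return points
```

With a single elevation line the result is a set of points in one plane, which corresponds to the single-plane measurement of most LIDARs mentioned above. The points are expressed relative to the sensor only, so nothing in this conversion recovers the trajectory of the recording apparatus.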
There are also so-called Kinect-based solutions (S. Izadi et al.: KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera, in Proceedings of the 24th annual ACM symposium on User Interface Software and Technology, pp. 559-568, 2011); the film industry has also discovered for itself the so-called Kinect sensor adapted also for identifying the space, but the film-related application of the apparatus according to the document and the solutions based on the apparatus are very much limited. The apparatus according to the document may only be used indoors and within a distance of approx. 5 to 8 meters. For generating the 3D world model, i.e. a static voxel model, and for tracking the camera position, it uses the depth data, and the optical information based distance estimation detailed above is not applied. The world model is developed and made more detailed by the continuously obtained fresh data. The world model may also be provided with texture. Virtual objects obtained e.g. through computer modelling may also be integrated into the world model generated by an algorithm according to the document. In the Kinect sensor, the RGB camera (colour video camera) and the depth sensor are arranged on different optical axes, shifted in relation to each other, and therefore the RGB image recorded by the Kinect sensor and the point cloud are not entirely in alignment. The Kinect sensor does not provide information about its own movement, and such systems available on the market only operate with a standing, fixed camera.
Several approaches are known also for tracking the spatial position of a moving camera. From a technical aspect, two kinds of spatial camera position identification methods are broadly used in the film industry. One method involves determining the camera position by the software analysis of its two-dimensional RGB image. This method does not require a lot of facilities, and therefore it is low priced, but it is very much labour intensive, and in many cases not sufficiently accurate. Another method involves determining the camera position independently of the camera, by means of fixed external sensors. This position identification method requires the preliminary installation and calibration of the sensors and the co-ordinated work of several people, and therefore the related cost is high.
The approach of performing a software analysis of the two-dimensional RGB image of a camera is described below. The two-dimensional images comprise well-identifiable points (for example corner points, contour lines) which can be seen and automatically tracked in many consecutively made images. By making use of their relative displacement, i.e. the parallax, and taking into consideration the optical characteristics of the camera, the orientation and movement trajectory of the camera (i.e. the 6 degrees of freedom trajectory, the position and the orientation as a function of time), as well as the spatial locations of the tracked points can be reconstructed. This approach involves the above mentioned image processing based spatiality measurement, because if the object locations are known, then the camera positions can also be calculated. The advantage of this method is that the execution only requires the normal RGB image recorded by the camera and the analysing software. However, a serious disadvantage is its slowness. In the case of an HD (high definition) image of 1920×1080 pixel resolution, depending on the speed of the computer used for execution of the method, for a simple recording comprising points which can be identified easily, a period equivalent to many times the length of the footage is necessary for reconstructing the trajectory of the camera. In the case of a more complicated recording for which considerable human intervention may also be required, the reconstruction period may even be several hundred times this figure. For the processing of one minute of recording, i.e. for determining the trajectory of the camera, even one day may be needed. A further disadvantage of the method is that only the picture of a camera moving in space can be used for it. It can be applied neither to the image of a fixed camera (for determining the fixed position) nor to that of a panning camera.
In a so-called ‘green screen’ studio environment, only limited use is possible, because there are no trackable and well-identified points in the homogeneous background. A further problem of the method is that it is necessary to remove from the image somehow, characteristically by a hand-drawn and animated mask, the elements (for example vehicles, people) which are in motion compared to the reference points, because if the points thereon are also tracked by the software, this will result in an erroneous trajectory of the camera. In addition, the so-called pseudo-feature points, i.e. the not well usable identified points like reflections, rounded edges or the so-called T-nodes can be a serious problem, and generally demand manual correction. The applied identified and tracked points only provide low resolution, ad hoc information about the space, and on this basis the model of the space may only be generated by considerable human intervention.
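The role of parallax in the image-based method can be illustrated with a deliberately simplified sketch. It assumes an ideal pinhole camera that translates purely sideways between two frames, without rotation, and a focal length given in pixels. The function and parameter names are hypothetical, and actual tracking software solves a far more general problem over many points and frames.

```python
def depth_from_parallax(x_first_px, x_second_px, focal_px, baseline_m):
    """Estimate the depth of a tracked point from its image displacement.

    Assumptions: pinhole camera, pure sideways translation by baseline_m
    between the two frames, no rotation. Image coordinates are measured
    in pixels from the image centre.
    Returns None when there is no parallax to measure.
    """
    disparity_px = x_first_px - x_second_px
    if baseline_m == 0 or disparity_px == 0:
        return None  # camera did not translate, or point is too far away
    return focal_px * baseline_m / disparity_px
```

For example, with an assumed focal length of 1000 pixels and a camera displacement of 0.5 m, a point whose image shifts by 100 pixels lies at a depth of 1000 × 0.5 / 100 = 5 m. With a fixed camera, or a camera that only rotates about its own position, the baseline is zero and the sketch returns no depth, which corresponds to the limitation noted above.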
For tracking the camera, fixed installed external sensors fitted independently of the camera can also be applied, and they may be of mechanical or optical type.
The mechanical methods are based on a mechanical sensor environment installed on mechanised camera manipulators, cranes, dollies, and camera heads. The sensors calculate the current position of the camera from displacements measured on the articulation joints of the camera manipulating apparatus. The use of these mechanical means is difficult, and they have a relatively low accuracy. Their use may not be matched to the application of hand-held cameras or cameras moved without special camera manipulators, and therefore the utilisation of a mechanical sensor environment is very expensive. The sensor environment does not provide information on the structure of the space.
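How a camera position may be derived from displacements measured on articulation joints can be sketched as a planar forward kinematics calculation. The restriction to a single plane, the function name and the representation of the manipulator as a chain of rigid links are assumptions made for illustration; they do not describe any specific crane or dolly.

```python
import math


def camera_position_from_joints(link_lengths_m, joint_angles_deg):
    """Planar forward kinematics of a chain of rigid links.

    Each joint angle is measured relative to the previous link (the first
    one relative to the x axis). Returns the (x, y) position of the end of
    the last link, where the camera is assumed to be mounted, and the
    heading of that link in degrees.
    """
    x = 0.0
    y = 0.0
    heading = 0.0
    for length, angle in zip(link_lengths_m, joint_angles_deg):
        heading += math.radians(angle)  # joint rotations accumulate
        x += length * math.cos(heading)
        y += length * math.sin(heading)
    return x, y, math.degrees(heading)
```

Because the joint rotations accumulate along the chain, a measurement error at a joint near the base is multiplied by the length of the arm beyond it. As an illustrative figure, an error of 0.1 degrees on a 5 m arm displaces the computed camera position by roughly 9 mm.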
The basis for the optical method adapted for tracking a camera is that several external optical sensors continuously monitor the markers fitted on the recording camera, i.e. their movement relative to the sensors is measured. The more optical sensors are used, the more accurately the position of the recording camera can be determined. In practice, this method supplies data in real time, it can be used with a stationary or panning camera, and also in a ‘green screen’ environment, since it does not require identified and tracked points in the environment. It is a disadvantage that implementation is very costly. Not only is the hardware environment applied in the method expensive, but the installation and calibration of the system also demand the work of several people, and therefore this solution is characteristically used in a studio environment, with fixed installation. When shooting on an external location, the sensor environment has to be built up and calibrated at each filming location, and dismounted afterwards. A further disadvantage of this solution is that it only monitors and specifies the camera position, does not register any data about the structure of the space, and demands that a specified part of the sensors see the tracked camera continuously.
In U.S. Pat. No. 8,031,933 B2 an apparatus for generating a three-dimensional model of the scanned space is disclosed. The recording unit of the apparatus comprises an RGB camera, a stereo camera, a depth sensor and a tilt sensor fixed to the recording unit. The apparatus according to the document generates the three-dimensional model of the space seen by it in a way that it uses information from several sensors of the recording unit. It synchronises the data originating from each sensor by means of timestamps, and furthermore tracks the position and the orientation of the camera, displaying the three-dimensional model and subjecting it to further analysis and post-processing.
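Synchronisation of data from several sensors by means of timestamps can be illustrated with a minimal sketch. The nearest-timestamp matching rule, the tolerance parameter and all names below are assumptions chosen for illustration; the cited document is not stated here to use this particular rule.

```python
from bisect import bisect_left


def match_by_timestamp(frame_times, sample_times, max_gap):
    """Pair each frame with the sensor sample closest in time.

    sample_times must be sorted in ascending order. For every entry of
    frame_times the result holds the index of the nearest sample, or None
    if the nearest sample is further away than max_gap.
    """
    matches = []
    for t in frame_times:
        pos = bisect_left(sample_times, t)
        # Only the neighbours on either side of t can be the nearest sample.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(sample_times)]
        best = min(candidates, key=lambda i: abs(sample_times[i] - t), default=None)
        if best is not None and abs(sample_times[best] - t) > max_gap:
            best = None  # no sample close enough to this frame
        matches.append(best)
    return matches
```

Integer timestamps, for example in milliseconds, avoid rounding effects in the comparison with max_gap. Frames without a sufficiently close sample are reported as None rather than paired with stale data.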
In US 2010/0118122 A1 an apparatus is disclosed for generating a three-dimensional model of the part of the space investigated by the sensors in a way that the depth information and the optical recordings are combined. In the apparatus according to the document, the optical camera and the depth sensor may be arranged along one optical axis. After rendering, the processed data are shown on a display.
The system described in U.S. Pat. No. 7,583,275 B2 generates a three-dimensional model of the environment on the basis of depth data, and while recording the depth data, it continuously tracks the position and orientation of the recording apparatus (the orientation by means of an inertial sensor), projecting the image obtained from the optical sensors to the three-dimensional model, making use of the data provided by the tracking, and displaying the so textured three-dimensional model.
The solution described in U.S. Pat. No. 6,160,907 is adapted for constructing a three-dimensional model from the real elements stemming from the environment detected by the recording units and from further virtual elements. Apparatuses adapted for generating a three-dimensional model are disclosed in US 2011/0134220 A1 and US 2008/0246759 A1. A similar apparatus is disclosed in U.S. Pat. No. 7,215,430 B2, in which the recording unit comprises an optical camera in addition to the LIDAR supplying depth data. A solution making use of depth data is described in U.S. Pat. No. 7,113,183 B1. In US 2008/0240502 A1 a solution is disclosed, in which a three-dimensional depth map is prepared on the basis of image information obtained optically. In US 2012/0013710 A1 a system adapted for generating a three-dimensional model is disclosed, which also comprises an interconnected space scanner and a two-dimensional sensor. According to the document, the three-dimensional model is generated on the basis of the data of the space scanner and the two-dimensional sensor. The distance data of the space scanner are supplemented and improved by distance data obtained from two further cameras.
Solutions adapted for generating three-dimensional models are disclosed in US 2008/0260238 A1, US 2009/0322745 A1, U.S. Pat. No. 7,822,267 B2 and U.S. Pat. No. 7,928,978 B2.
A solution aimed at tracking camera motion is disclosed in U.S. Pat. No. 7,956,862 B2. Solutions related to three-dimensional modelling are disclosed in U.S. Pat. Nos. 6,072,496, 6,124,864, 6,208,347 B1, 6,310,620 B1, 6,429,867 B1, 6,853,373 B2, 7,103,211 B1, 7,181,363 B2, US 2009/0080036 A1, U.S. Pat. No. 7,586,489 B2, US 2010/0209013 A1, U.S. Pat. No. 7,974,461 B2, US 2011/0115792 A1, US 2011/0274343 A1, U.S. Pat. No. 8,085,388 B2 and WO 2010/130245 A1.
In L. Heng, G. H. Lee, F. Fraundorfer and M. Pollefeys: Real-Time Photo-Realistic 3D Mapping for Micro Aerial Vehicles, in: 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4012-4019, 2011 a 3D modelling approach is disclosed in which depth information is constructed based on stereo images. During the development of a 3D model, the position and orientation of the camera are computed based on sensor information of the previous and current frames of the stereo images.
As the description above also shows, the film industry is strongly committed to 3D, i.e. it actively uses 3D modelling and 3D imaging. The need has therefore emerged for a compact and efficient system which is able to record the image and range information of the investigated space and preferably to display this information in almost real time, by which the recorded depth data can be handled and processed in synchronisation with the already recorded pictures and direct feedback can be given about them, and which solves the tasks above more efficiently than the prior art solutions.