In order to measure distance in a conventional visual system, two or more cameras with different perspectives are usually needed. Alternatively, multiple images of the same scene at different settings can be used. There are smartphone applications that perform distance estimation using a known height of the user and a determination of camera orientation to estimate the distance of the object, but such approaches are very sensitive to camera height and typically require a reference image to be captured at a known camera height and/or camera angle.
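The height-and-orientation approach mentioned above can be illustrated with a minimal sketch. It assumes a flat ground plane, an object whose base rests on that plane, and a camera tilt measured as a depression angle below the horizontal; the function name and parameters are hypothetical and are not taken from any particular application.

```python
import math


def ground_distance(camera_height_m, depression_angle_rad):
    """Horizontal distance to a point on flat ground (assumed geometry).

    camera_height_m: height of the camera above the ground plane.
    depression_angle_rad: angle below the horizontal at which the camera
        sights the base of the object.
    """
    if not 0.0 < depression_angle_rad < math.pi / 2:
        raise ValueError("depression angle must lie strictly between 0 and pi/2")
    # Right triangle: tan(angle) = height / distance.
    return camera_height_m / math.tan(depression_angle_rad)
```

Under this model the estimate scales linearly with the assumed camera height, so a relative error in the height carries over directly to the estimated distance.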
The determination of distance is useful in many applications, such as establishing the distance to a target in military applications, modeling of 3D structures (e.g. sizing a room or a building), and in sports (e.g. distance-to-hole estimation in golf, distance-to-target measurement in archery or distance-to-prey assessment in hunting). Some conventional distance measurement instruments use either laser or ultrasound as a mechanism for measuring the distance. The ultrasound-based instruments measure the time of flight of a sound pulse, while the laser-based systems can utilize either time-of-flight or phase-shift methods to measure distance. While ultrasound-based and laser-based instruments are accurate, each requires specialized equipment to perform the task.
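The time-of-flight and phase-shift principles described above reduce to short formulas, sketched below. The sketch assumes an idealized round trip (pulse or modulated beam travels to the target and back), a nominal speed of sound in air of about 343 m/s, and a phase measurement within one modulation period; the function names and constants are illustrative only.

```python
import math

SPEED_OF_SOUND_M_S = 343.0          # approximate, dry air near 20 degrees C
SPEED_OF_LIGHT_M_S = 299_792_458.0  # vacuum value, close enough for air


def tof_distance(round_trip_s, speed_m_s):
    """Distance from a round-trip time of flight: the wave covers 2 * d."""
    return speed_m_s * round_trip_s / 2.0


def phase_shift_distance(phase_rad, mod_freq_hz, speed_m_s=SPEED_OF_LIGHT_M_S):
    """Distance from the phase shift of an amplitude-modulated beam.

    A round trip of length 2 * d delays the modulation by
    phase = 2 * pi * mod_freq * (2 * d / speed), hence
    d = speed * phase / (4 * pi * mod_freq).
    The result is unambiguous only for d < speed / (2 * mod_freq).
    """
    return speed_m_s * phase_rad / (4.0 * math.pi * mod_freq_hz)
```

For example, `tof_distance(0.01, SPEED_OF_SOUND_M_S)` corresponds to an ultrasound echo received 10 ms after emission.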
Measuring distance using a single camera is an underdetermined problem; a single monocular image contains insufficient information to determine the distance of an object in the image. In order to measure distance using purely visual (i.e. camera) information, more than one image is needed. At least two images of a common scene, each with a different perspective angle and with known relative positions, are conventionally employed to allow for the determination of the distance of the object. Typically this is done with two cameras, but it can be accomplished using a single camera if the scene is sufficiently static, for example, via depth-from-defocus methods (which estimate the distance of an object to the camera based on its degree of defocus at different camera focal settings) or structure-from-motion methods (which estimate the 3D structure of an object from tracking information of multiple features representative of the object as the camera moves). These methods, however, are computationally expensive.
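The conventional two-view case described above can be sketched with the standard triangulation relation for a rectified stereo pair. This is a minimal illustration assuming two identical pinhole cameras with parallel optical axes separated by a known baseline; the function and parameter names are hypothetical.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Depth of a point seen by a rectified stereo pair (assumed setup).

    focal_length_px: focal length expressed in pixels.
    baseline_m: distance between the two camera centers.
    disparity_px: horizontal shift of the point between the two images.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    # Similar triangles: depth / baseline = focal_length / disparity.
    return focal_length_px * baseline_m / disparity_px
```

Because depth is inversely proportional to disparity, distant objects produce small disparities, which is one reason known relative camera positions matter for accuracy.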
Video compression is employed in applications where high-quality video transmission and/or archival is required. Video compression is achieved by exploiting two types of redundancies within the video stream: spatial redundancies amongst neighboring pixels within a frame, and temporal redundancies between adjacent frames. This modus operandi gives rise to two different types of prediction: intra-frame and inter-frame. These in turn result in two different types of encoded frames: reference and non-reference frames. Reference frames, or “I-frames,” are encoded in a standalone manner (intra-frame) using compression methods similar to those used to compress digital images. Compression of non-reference frames (e.g., P-frames and B-frames) entails using inter-frame or motion-compensated prediction methods, where the target frame is estimated or predicted from previously encoded frames in a process that typically entails three steps: (i) motion estimation, (ii) residual calculation, and (iii) compression. In motion estimation, motion vectors are estimated using previously encoded frames. The target frame is segmented into pixel blocks called target blocks, and an estimated or predicted frame is built by stitching together the blocks from previously encoded frames that best match the target blocks. Motion vectors describe the relative displacement between the location of the original blocks in the reference frames and their location in the predicted frame. While motion compensation of P-frames relies only on previous frames, previous and future frames are typically used to predict B-frames. In residual calculation, the error between the predicted and target frame is calculated. In compression, the error residual and the extracted motion vectors are quantized, compressed and stored. Since video compression is typically performed at the camera end prior to transmission over the network, real-time hardware implementations of popular algorithms such as H.264 and MPEG-4 are commonplace.
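The motion estimation step described above can be illustrated with a minimal block-matching sketch. It assumes grayscale frames stored as lists of rows, an exhaustive search over a small window, and the sum of absolute differences (SAD) as the matching cost; real encoders such as H.264 implementations use faster search strategies and other refinements, so this is an illustration only and all names are hypothetical.

```python
def sad(target, reference, ty, tx, ry, rx, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(target[ty + dy][tx + dx] - reference[ry + dy][rx + dx])
    return total


def estimate_motion_vector(target, reference, ty, tx, size, search):
    """Exhaustive block matching for the target block at (ty, tx).

    Returns (dy, dx, cost), where (dy, dx) is the displacement from the
    target block position to the best-matching block in the reference
    frame, searched within +/- `search` pixels.
    """
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = ty + dy, tx + dx
            # Skip candidate blocks that fall outside the reference frame.
            if ry < 0 or rx < 0 or ry + size > height or rx + size > width:
                continue
            cost = sad(target, reference, ty, tx, ry, rx, size)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best[1], best[2], best[0]


def residual_block(target, reference, ty, tx, dy, dx, size):
    """Prediction error for one block after motion compensation."""
    return [
        [target[ty + y][tx + x] - reference[ty + dy + y][tx + dx + x]
         for x in range(size)]
        for y in range(size)
    ]
```

In this sketch the residual of a perfectly matched block is all zeros, which is what makes the quantized residual cheap to store alongside the motion vector.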
There is a need in the art for systems and methods that facilitate performing single camera distance estimation by leveraging information extracted in the real-time video compression process while overcoming the aforementioned deficiencies.