Vision-based systems have become increasingly popular, driven mainly by the growing processing power of devices and by new capabilities for information storage. Such systems are often employed to automatically extract and analyze useful information from images and videos.
Considering the high resolution of recent digital cameras and the limitations of network bandwidth, it is very important to develop solutions that reduce the amount of data that must be transferred through the network. On top of that, having less data also reduces the storage requirements of any system. Reducing the images' spatial resolution is not an option in this scenario, because low-resolution images make most computer vision techniques much less precise. For instance, a minimum resolution is required in order to perform visual sentiment analysis on face images, i.e., to determine facial expressions.
Many scenarios have infrastructure limitations, including poor Internet connections/bandwidth and little space for storing files. Even when there are no concerns about infrastructure and bandwidth, the transmission and storage of entire raw videos is a challenge, possibly making some systems infeasible in practice because of the large amount of data to be transmitted and stored. As an example, consider a school scenario in which the students' faces need to be extracted from images for later identification. Current face recognition software recommends that each face be represented by 30 to 40 pixels horizontally. Faces recorded between 5 and 10 meters away from the camera, with a video resolution of 1920×1080, are represented in the final frame with 65 to 30 pixels horizontally, i.e., critically close to the lowest resolution required for identification tasks. Therefore, a video resolution of 1920×1080 would be the minimum required and, in this application scenario, a 30-minute class would need at least 4 GB of storage space. Considering that multiple classes would be recorded daily and simultaneously, this represents a considerable amount of information to be transmitted and stored. Clearly, this huge amount of generated video is a problem not only in the school scenario.
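The figures above follow from simple arithmetic. The sketch below reproduces them under assumed parameters: a high-quality 1080p stream at roughly 18 Mbps and a pinhole-camera model with a ~60° horizontal field of view and ~16 cm face width; these values are illustrative assumptions, not measurements from any actual recording setup.

```python
import math

def storage_gb(bitrate_mbps: float, duration_min: float) -> float:
    """Storage needed for a recording, in decimal gigabytes."""
    bits = bitrate_mbps * 1e6 * duration_min * 60
    return bits / 8 / 1e9

def face_width_px(dist_m: float, face_m: float = 0.16,
                  img_w: int = 1920, hfov_deg: float = 60.0) -> float:
    """Approximate horizontal face width in pixels (pinhole model,
    assumed field of view and face width)."""
    f_px = img_w / (2 * math.tan(math.radians(hfov_deg / 2)))
    return f_px * face_m / dist_m

print(storage_gb(18, 30))    # ~4 GB for a 30-minute class at ~18 Mbps
print(face_width_px(5))      # a few tens of pixels at 5 m
print(face_width_px(10))     # roughly half that at 10 m
```

With these assumed parameters the model yields values in the same range as the text: about 4 GB for the 30-minute class, and face widths that shrink toward the 30-pixel identification threshold as the distance approaches 10 meters.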
Current solutions do not address the entire process of optimized creation and compression of images/videos depending on the desired context. Tiled streaming and Region-of-Interest (RoI) video encoding are two related solutions. In order to reduce bandwidth, tiled streaming methods can encode a video sequence by dividing its frames into a grid of independent tiles. An image/video can be initially divided into tiles and then scalably encoded and stored. This content can then be streamed with a spatial or quality resolution compatible with the available bandwidth. For instance, a lower resolution version of the sequence can be initially transmitted until a user zooms in and, after that, only the tiles covering the RoI selected by the user are transferred in higher resolution. The well-known image codec JPEG-XR is an example of a scalable codec that enables tiling. In RoI video encoding methods, foreground-background identification is conducted so that background regions are more heavily compressed at the encoding step, reducing bandwidth consumption.
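For illustration only, the tile-selection step of tiled streaming can be sketched as follows; the tile size and RoI coordinates below are arbitrary assumptions, not values taken from any cited method:

```python
def tiles_covering_roi(roi, tile_w, tile_h):
    """Return (row, col) indices of the uniform grid tiles that overlap
    a region of interest given as (x, y, width, height) in pixels."""
    x, y, w, h = roi
    c0, c1 = x // tile_w, (x + w - 1) // tile_w
    r0, r1 = y // tile_h, (y + h - 1) // tile_h
    return [(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

# Only the 6 tiles under the zoomed area need to be sent in high resolution:
tiles = tiles_covering_roi((100, 100, 200, 50), 128, 128)
print(tiles)
```

Only the listed tiles are then streamed at full resolution, which is how such methods keep bandwidth proportional to the zoomed area rather than to the whole frame.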
As most vision-based systems require high-resolution images/videos to work properly, compression alone is not sufficient. An interesting alternative for saving storage while still keeping enough resolution for computer vision tasks is to create images/videos containing only the objects of interest, and then properly encode these images/videos. By initially generating such images/videos, the subsequent encoding step takes advantage of the similarity and proximity of the objects of interest to perform an even more efficient compression. Therefore, there is a double gain: one related to the content generation and another related to the optimized compression.
In the present invention, as will be further detailed, images/videos of objects of interest are encoded with a normalized spatial resolution and a specific quality resolution depending on the context. The normalized spatial resolution is achieved by up-sampling and down-sampling techniques, and the different quality resolutions are achieved by appropriate encoding parameters (e.g., different quantization parameters) selected during the compression process. Therefore, the present invention is an interesting solution for compression while keeping enough resolution for vision-based computing systems.
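A minimal sketch of this normalization-and-composition idea, assuming grayscale crops, nearest-neighbor resampling, an illustrative 64×64 target resolution, and a square-grid layout; none of these specific choices is mandated by the invention:

```python
import math
import numpy as np

def normalize(obj: np.ndarray, size: int = 64) -> np.ndarray:
    """Up- or down-sample an object crop to a common spatial resolution
    (nearest-neighbor, for illustration only)."""
    h, w = obj.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return obj[np.ix_(rows, cols)]

def compose_grid(objects, size: int = 64) -> np.ndarray:
    """Pack the normalized objects of interest side by side in a square
    grid, so the encoder can exploit their similarity and proximity."""
    n = math.ceil(math.sqrt(len(objects)))
    canvas = np.zeros((n * size, n * size), dtype=objects[0].dtype)
    for i, obj in enumerate(objects):
        r, c = divmod(i, n)
        canvas[r * size:(r + 1) * size,
               c * size:(c + 1) * size] = normalize(obj, size)
    return canvas

# Three detected objects of different sizes become one 2x2 grid frame:
frame = compose_grid([np.ones((30, 20)), np.full((120, 90), 2), np.full((8, 8), 3)])
print(frame.shape)  # (128, 128)
```

The resulting single frame, containing only normalized objects of interest, is what would then be passed to a standard codec for the optimized compression step.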
The paper titled “Region of Interest Encoding in Video Conference Systems”, published by C Bulla et al. in: The Fifth International Conferences on Advances in Multimedia (MMedia), 2013, presents a region-of-interest encoding system for video conference applications. The system is divided into two modules: sender and receiver. The sender comprises a face detector to detect faces in videos as regions of interest (RoIs), a tracking method to track each RoI across the frames, and a RoI encoding scheme which encodes the RoIs at high quality and the background at low quality. The encoded video stream is transmitted to all receiving clients, or receivers, which can decode it, crop out the regions of interest, and render them. The last rendering step is called “Scene Composition” and is achieved by showing only the detected people. Each person is scaled and placed side by side at the receiving client. Differently from the paper of C Bulla et al., the present invention performs the “scene composition” locally, i.e., it groups the regions of interest in a frame before transmitting the video, which permits savings in data transmission. In the paper of C Bulla et al., the scene composition is done at the receiver, meaning that the complete frames are transmitted over the network. The second difference is that the scene composition in the paper of C Bulla et al. depends on visualization parameters, while in the present invention it depends on parameters defined by the user and influenced by the target application, making it broader. The third difference is related to the target application. In the paper of C Bulla et al., the final video is seen by users and, to this end, the scene composition must be visually pleasant, with spatial alignment, spaces between the faces, etc. In the present invention, the objects of interest can be organized in a square grid, for example, to better exploit similarities and consequently obtain better compression.
Moreover, the method presented in the paper of C Bulla et al. is applicable only to video conferences; all of its details were designed to achieve better results in that scenario. The system in the paper of C Bulla et al. works only for faces, while the present invention can work with any object of interest. The present invention is much more generic in the sense that it can be applied to several other scenarios.
The patent document US 2013/0107948 A1, titled “Context Based Encoding and Decoding”, published on May 2, 2013, describes a codec that takes into consideration similar regions of interest across frames to produce better predictions than block-based motion estimation and compensation. Similar object instances are associated across frames to form tracks that are related to specific blocks of video data to be encoded. Differently from document US 2013/0107948 A1, the present invention does not propose a new codec, but rather presents a data organization scheme that enables current codecs to produce more efficient results.
The patent document WO 2014/025319 A1, titled “System and Method for Enabling User Control of Live Video Stream(S)”, published on Feb. 13, 2014, describes a system that enables multiple users to control live video streams independently, e.g., to request independent zooming of areas of interest. It considers that a current stream is acquired and stored in a number of video segments in different resolutions. Each frame of the video segments is encoded with a virtual tiling technique where each frame of the encoded video segments is divided into an array of tiles, and each tile comprises an array of slices. Upon a user request to zoom in on a specific area of interest, the tiles corresponding to that area, in an adequate video segment with higher resolution, are transferred to be displayed to the user. The slices outside the area of interest are removed before the display. The present invention differs from the document WO 2014/025319 A1 in many aspects. First, the present invention creates a single image or video containing only objects of interest represented with a normalized spatial resolution to be transmitted and stored, rather than storing several images/videos with different resolutions. In the document WO 2014/025319 A1, the region of interest, i.e., the area that will have higher resolution, is defined in real time by the user, and the resolution of that area is also chosen based on the user request. In the method of the present invention, objects of interest can be detected by applying an object detection algorithm depending on the user specification. The creation of the final image/video containing objects with normalized resolution is done only once, and the result is then transmitted and stored. Another difference is the final application. The solution presented in document WO 2014/025319 A1 has a specific application that relates to displaying an area of interest with a specific resolution.
The method of the present invention creates a final image/video with objects represented with normalized resolution to be analyzed by a vision-based system. Therefore, it is clear that the method of the present invention has broader application since its parameters are not limited to specific user requests to control video streams.
The paper titled “Supporting Zoomable Video Streams with Dynamic Region-of-Interest Cropping”, published by NQM Khiem et al. in the ACM conference on Multimedia Systems (MMSys), 2010, presents two methods for streaming an arbitrary region of interest (RoI) from a high-resolution video to support zooming and panning: tiled streaming and monolithic streaming. The first method relates to the present invention because it divides each frame of a video into a grid of tiles. Differently, however, in that method the tiles are encoded and stored as independent streams at their highest resolution, whereas in the present invention all tiles are represented with the same spatial resolution. In the paper of NQM Khiem et al., a user receives from the server a scaled-down version of a video and requests a zoom into a specific area; the tile streams which overlap with the RoI are then sent to the user in a higher resolution. In the approach of the present invention, the final image/video is transmitted to the server to be further stored and analyzed by a vision-based system.
The paper titled “Adaptive Encoding of Zoomable Video Streams Based on User Access Pattern”, published by NQM Khiem, G Ravindra and W T Ooi in the ACM conference on Multimedia Systems (MMSys), 2011, presents a method to create zoomable videos, allowing users to selectively zoom and pan into regions of interest within the video for viewing at higher resolutions. The idea is the same as in the previous paper of NQM Khiem et al., but instead of dividing each frame into a fixed grid of tiles, user access patterns are taken into consideration. Considering users' historical access patterns to regions of a video, the method creates a heat map with the probability of a region being accessed (zoomed in on) by users. The paper of NQM Khiem et al. provides a greedy algorithm to create a tile map so that each tile contains a probable region of interest. Each tile of the high-resolution video in the same position, considering all frames, is then encoded as an independent stream. When a user requests a RoI, the overlapping tiles are sent to be displayed with minimum bandwidth, because the RoI will probably be entirely inside a tile. The differences from the present invention, besides the ones discussed for the previous paper, are: in the paper of NQM Khiem et al., the tiles are adaptive; the tiles of the present invention are not encoded as different streams; and the tiles of the present invention are related to target objects extracted from the input frames.
The paper titled “Adaptive Resolution Image Acquisition Using Image Mosaicing Technique from Video Sequence”, published by S Takeuchi et al. in Proceedings of the International Conference on Image Processing, 2000, describes a layered image mosaicing method that acquires an adaptive-resolution image from a video sequence. The method considers as input a video sequence captured with a camera which zooms in on certain regions where fine textures are present. Each frame is classified into a layer, depending on its zoom level. The images on each layer are then registered to create a single image. By doing this, the method creates a layered image in which each layer represents an image with a different resolution. Differently, the method of the present invention composes a final image using a grid containing the objects of interest at a desired resolution.
The patent document U.S. Pat. No. 8,184,069 B1, titled “Systems and Methods for Adaptive Transmission of Data”, published on Apr. 22, 2012, describes a system and method for transmitting, receiving, and displaying data. It provides a constant data transmission rate to a device and controls bandwidth by presenting information directed to an area of interest to a user. For example, bandwidth can be lowered by presenting high-resolution information directed to an area of interest (e.g., an area to which the user is looking) and lower-resolution data directed to other areas. To determine the area of interest, the method utilizes a heads-up display used by the user and prioritizes data transmission based on this information. Differently, the present invention does not need any user device to detect areas of interest. Furthermore, the document U.S. Pat. No. 8,184,069 B1 does not disclose any specific method to compose the final frames.
The patent document U.S. Pat. No. 8,665,958 B2, titled “Method and Apparatus for Encoding and Decoding Video Signal Using Motion Compensation Based on Affine Transformation”, published on Mar. 4, 2014, presents a video encoding method that can determine whether a block includes an object with an affine transformation. If so, the method generates a prediction block by performing affine transformation-based motion compensation on the current block, achieving high video encoding/decoding efficiency. The present invention extracts objects from the input frames and creates tiles from them without considering any such transformation, only adjusting their resolution. The invention of document U.S. Pat. No. 8,665,958 B2 cannot by itself produce the same outputs obtained by the proposed solution, but it could be applied as an additional/complementary (yet optional) module.