A video conference is a communication session where participants can see and hear each other using video screens, microphones and cameras as schematically illustrated in FIG. 1. Examples of captured video in native format are illustrated in FIG. 2A and FIG. 2B.
When displaying participants in a video conference, participants often manually adjust camera viewing angles and camera zoom levels in order to capture one or more participants for the video conference. Some existing techniques try to automate this manual adjustment by using microphones and an image receiver to scale the image as illustrated in FIG. 3A and FIG. 3B, by automatically and digitally cropping as illustrated in FIG. 4, or by controlling the pan, tilt, zoom and focus settings of the camera to follow the active speaker. Such solutions are disclosed by e.g. patent documents U.S. Pat. No. 6,275,258B1, U.S. Pat. No. 6,469,732B1, and U.S. Pat. No. 8,314,829B2.
Patent documents U.S. Pat. No. 8,488,840B2 and WO2010141023A1 further disclose solutions for detecting and cropping the region(s) of interest from a video stream of the participant and arranging the cropped video stream of the participant with additional cropped video streams of additional participants for display in the video conference. An example of such an arrangement is given in FIG. 4. WO2010141023A1 describes a video conference system that determines a participant distance and aligns a region of interest on the participant using one or more depth cameras; creates a cropped video stream of the participant by cropping the region of interest from a video stream of the participant; and arranges the cropped video stream of the participant with additional cropped video streams of additional participants for display in the video conference.
Disadvantages of this solution are that the video is cropped and thus the field of view of the camera is severely restricted, that the viewer is prevented from seeing the other parts of the video that might be of interest to the user (non-detected regions of interest), and that it only considers a single rectangular region of interest, which is restrictive. If two persons are “of interest” on the left and right parts of the video respectively, the detected region of interest will cover almost the complete video since the method uses a rectangular cropping.
Other techniques aim at recognizing important parts in the image and preserving those regions while scaling the image. In a video conference, that would mean scaling the images so that the people remain displayed while less important parts of the images are hidden, as schematically illustrated in FIG. 5.
Patent document EP2417771A1 discloses a method for performing a vector retargeting process on a video frame. The method involves identifying one or more objects within a vector video frame, determining importance values for the identified objects, and retargeting the video frame based on the importance values corresponding to the identified objects.
However, the disclosed non-uniform image rescaling is only possible on vector images, not raster/matrix images. The proposed workaround of converting raster/matrix images to vector images results in limited quality, especially for natural videos such as conferencing videos. On a raster image, the image is segmented (background, objects) and each segment is non-uniformly scaled or simplified according to a spatial budget. A further disadvantage is that the method cannot apply a finer-grained scaling within one segment. Also, even though the method detects important objects in the (vector or raster) video, it does not detect and treat differently the active speakers.
In patent document EP2218056B1 and the literature “A System for Retargeting of Streaming Video”, Philipp Krähenbühl et al., SIGGRAPH 2009, two content-aware resizing algorithms for images and videos are presented. These algorithms are, among others, different versions of non-uniform video retargeting. A disadvantage of these solutions is that they are not optimized for video conference applications: they do not take into account the active speakers, although during a video conference the active speakers are the most important region of interest, and they do not preserve participants' bodies, although in a video conference the body language is very important.
In a video conference, another problem concerns screen size adaptation. There exist various screen sizes and several aspect ratios (4:3, 16:9, etc.), and if one wants to display content acquired at a certain aspect ratio on a display having a different aspect ratio, one has to adapt the video stream to the display aspect ratio. Most of the time, video players linearly scale the video up or down to adjust to the screen size and either insert black borders on the top and bottom of the display as illustrated in FIG. 3A, or crop the top/bottom parts of the video in order to fix the aspect ratio issue.
Inserting black borders results in a reduced field of view and thus induces a lower quality of experience. On the other hand, cropping completely removes parts of the video that might be of interest to the user and thus might induce an even lower quality of experience.
Below will follow some definitions and descriptions of existing technology:
Image Cropping, Uniform Scaling
Cropping refers to the removal of the outer parts of an image to improve framing, accentuate subject matter or change the aspect ratio. The character * denotes multiplication.
Let us define I, an image of size W*H.
Cropping consists in extracting a rectangular region R(Xr, Yr, Wr, Hr) of the image I:

Icropped = I(x, y), for all x: Xr < x < Xr + Wr and for all y: Yr < y < Yr + Hr.
Linear, uniform scaling consists in resizing the image I to a new size W2*H2.
Iscaled(x, y) = sample(I, x*W/W2, y*H/H2), for (x, y) in the target image of size W2*H2.
Where sample( ) is a function that linearly samples the image, such as for instance bilinear interpolation, which is an extension of linear interpolation for interpolating functions of two variables (e.g., x and y) on a regular 2D grid.
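These two operations can be sketched in NumPy (helper names are our own; the resampling uses the inverse-mapping convention, i.e. each target pixel looks up its source coordinate):

```python
import numpy as np

def crop(img, xr, yr, wr, hr):
    # Extract the rectangular region R(Xr, Yr, Wr, Hr) from image I.
    return img[yr:yr + hr, xr:xr + wr]

def bilinear_scale(img, w2, h2):
    # Uniformly resample img (H x W) to size h2 x w2 with bilinear interpolation.
    h, w = img.shape[:2]
    # Source coordinates for each target pixel centre, clamped to the image.
    ys = np.clip((np.arange(h2) + 0.5) * h / h2 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w2) + 0.5) * w / w2 - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    fy = (ys - y0)[:, None]; fx = (xs - x0)[None, :]
    # Blend the four neighbouring samples.
    top = img[np.ix_(y0, x0)] * (1 - fx) + img[np.ix_(y0, x1)] * fx
    bot = img[np.ix_(y1, x0)] * (1 - fx) + img[np.ix_(y1, x1)] * fx
    return top * (1 - fy) + bot * fy
```

Because bilinear interpolation reproduces linear intensity ramps exactly, scaling a smooth gradient image up or down introduces no distortion beyond edge clamping.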
Content-Aware Image/Video Retargeting
Video retargeting aims at non-uniformly adapting a video stream in a context-sensitive and temporally coherent manner to a new target resolution, e.g. to be able to resize, change the aspect ratio, or zoom into one or several parts of the video at the same time, while scaling away unimportant parts. We are trying to find a spatio-temporal warp wt: R2 -> R2, i.e., a mapping from coordinates in It (image I at time t) to new coordinates such that It*wt = Ot represents an optimally retargeted output frame with respect to the desired scaling factors and additional constraints.
Image warping is a non-linear deformation which maps every point in one image to a point in another image.
The following approach of “A System for Retargeting of Streaming Video”, Philipp Krähenbühl et al., SIGGRAPH 2009, is a good example of video retargeting. Given a current frame It of the video stream, the system automatically estimates visually important features in a map (Fs) based on image gradients, saliency, motion, or scene changes. The saliency map (Fs) is estimated in order to detect where the content can be distorted and where distortion should be avoided. Next, a feature-preserving warp wt to the target resolution is computed by minimizing an objective function Ew which comprises different energy terms derived from a set of feature constraints. The optimal warp is the one minimizing a combined cost function (a.k.a. energy) Ew such that:

Ew = Eg + λu*Eu + λb*Eb + λs*Es + λc*Ec
Where Eg is the global scale energy, Eu the uniform scale constraint containing the saliency map values, Eb the bending energy, Es the edge sharpness energy and Ec the bilateral temporal coherence energy. The equations are further defined by Krähenbühl et al. These energies measure local quality criteria such as the uniformity of scaling of feature regions, the bending or blurring of relevant edges, or the spatio-temporal smoothness of the warp.
Finding the best warp wt is then obtained by solving the following problem:

wt = argminw(Ew),

where all energies are written in a least-squares manner and the system is solved using a non-linear least-squares solver. Also, a different number and type of energies may be used.
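As an illustrative sketch only, and not the cited system, the one-dimensional, axis-aligned special case of such a least-squares retargeting can be written down directly: each output column width trades off a saliency-weighted "keep original width" term against a uniform-scale prior, with the total target width enforced as a heavily weighted soft constraint. The function name, the weights and the constraint weighting are all our own assumptions.

```python
import numpy as np

def retarget_widths(saliency, w_target, lam=0.1):
    # Solve, in least squares, for per-column output widths d_i where
    #  - salient columns prefer their original width of 1 (no distortion),
    #  - every column softly prefers the uniform width w_target / W,
    #  - the widths must sum to w_target (heavily weighted soft constraint).
    w = len(saliency)
    s = np.asarray(saliency, dtype=float)
    A = np.vstack([
        np.diag(np.sqrt(s)),          # feature preservation term
        np.sqrt(lam) * np.eye(w),     # uniform-scale prior
        np.full((1, w), 1e3),         # total-width constraint
    ])
    b = np.concatenate([
        np.sqrt(s) * 1.0,
        np.full(w, np.sqrt(lam) * w_target / w),
        [1e3 * w_target],
    ])
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d   # cumsum(d) gives the warped column positions
```

With a 4-column frame shrunk to width 3, two highly salient columns keep nearly their full width while the two non-salient columns absorb almost all of the shrinking, which is exactly the non-uniform behaviour the energy formulation above aims for.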
There exist different video retargeting methods, such as seam carving, many of them described in the survey “A survey of image retargeting techniques”, Daniel Vaquero, Matthew Turk, Kari Pulli, Marius Tico, Natasha Gelfand, 2010.
Sound Source Localization
Sound source localization aims at locating the sound or speaker in a video conferencing scenario based on a set of microphones.
Traditionally, algorithms for sound source localization rely on an estimation of Time Difference of Arrival (TDOA) at microphone pairs through the GCC-PHAT (Generalized Cross Correlation Phase Transform) method. When several microphone pairs are available the source position can be estimated as the point in the space that best fits the set of TDOA measurements by applying Global Coherence Field (GCF), also known as SRP-PHAT (Steered Response Power Phase Transform), or Oriented Global Coherence Field (OGCF). The point can be estimated in a 3D space if the microphones are not aligned.
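A minimal NumPy sketch of GCC-PHAT TDOA estimation for one microphone pair follows; the function name and the small regularization constant are our own, while the method itself is the standard phase-transform weighting of the cross-spectrum:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    # Generalized Cross Correlation with Phase Transform: whiten the
    # cross-spectrum so only phase (i.e. delay) information remains,
    # then pick the lag that maximizes the inverse transform.
    n = len(sig) + len(ref)          # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    r = SIG * np.conj(REF)
    r /= np.abs(r) + 1e-15           # PHAT weighting
    cc = np.fft.irfft(r, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-centre so negative lags precede positive lags.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                # TDOA in seconds
```

Passing a plausible `max_tau` (microphone spacing divided by the speed of sound) restricts the search to physically possible delays and makes the estimator more robust to reverberation.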
FIG. 6A illustrates the geometry used for calculating sound direction based on interaural delay. Calculation of the interaural time difference (ITD) between two microphones specifies a hyperbolic locus of points upon which the corresponding sound source may reside. For target distances (DL and DR) much greater than the microphone spacing DM, the target bearing angle may be approximated as
θ ≅ sin⁻¹((DL − DR)/DM)
Rewriting the difference in target distance in terms of the interaural time delay, one obtains
θ ≅ sin⁻¹((Vsound·ITD)/DM)

where Vsound for a comfortable indoor environment is approximately 344 m/s.
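The far-field bearing formula reduces to a couple of lines; the function name is ours, and the clamp guards against |Vsound·ITD| exceeding DM due to measurement noise:

```python
import math

def bearing_from_itd(itd_s, mic_spacing_m, v_sound=344.0):
    # Far-field bearing angle theta = asin(v_sound * ITD / D_M),
    # valid when the source distance is much greater than the spacing.
    x = v_sound * itd_s / mic_spacing_m
    x = max(-1.0, min(1.0, x))   # clamp against noisy ITD estimates
    return math.asin(x)          # radians; 0 means broadside to the pair
```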
Several types of ITD features may be extracted from a microphone pair. One technique is Cross-Correlation.
The windowed cross-correlation rlr(d) of digitally sampled sound signals l(n) and r(n) is defined as
rlr(d) = Σ (n = N1 to N2) l(n)·r(n − d)
where N1 and N2 define a window in time to which the correlation is applied. The value of d which maximizes rlr(d) is chosen as the interaural delay, in samples. Cross-correlation provides excellent time delay estimation for noisy sounds such as fricative consonants. For voiced consonants, vowel sounds, and other periodic waveforms, however, cross-correlation can present ambiguous peaks at intervals of the fundamental frequency. It also provides unpredictable results when multiple sound sources are present. Finally, sound reflections and reverberation often found in indoor environments may corrupt the delay estimation.
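A direct, brute-force sketch of this windowed cross-correlation delay estimator follows; the function name and the out-of-range handling are our own:

```python
import numpy as np

def interaural_delay(l, r, n1, n2, max_lag):
    # Windowed cross-correlation r_lr(d) = sum_{n=N1}^{N2} l(n) * r(n - d);
    # the lag d maximizing r_lr is chosen as the interaural delay, in samples.
    best_d, best_val = 0, -np.inf
    for d in range(-max_lag, max_lag + 1):
        val = sum(l[n] * r[n - d] for n in range(n1, n2 + 1)
                  if 0 <= n - d < len(r))    # skip samples outside r
        if val > best_val:
            best_d, best_val = d, val
    return best_d
```

For long windows an FFT-based correlation is preferable, but the loop above mirrors the definition term by term.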
Another formulation of the positioning problem is described in the paper “Robust Sound Source Localization Using a Microphone Array on a Mobile Robot”, Jean-Marc Valin, François Michaud, Jean Rouat, Dominic Létourneau:
Once TDOA estimation is performed, it is possible to compute the position of the source through geometrical calculations. One technique is based on a linear equation system, but sometimes, depending on the signals, the system is ill-conditioned and unstable. For that reason, a simpler model based on a far-field assumption is used, where it is assumed that the distance to the source is much larger than the array aperture.
FIG. 6C illustrates the case of a 2 microphone array with a source in the far-field. Using the cosine law, we can state that:
cos φ = ({right arrow over (u)}·{right arrow over (x)}ij)/(|{right arrow over (u)}| |{right arrow over (x)}ij|) = ({right arrow over (u)}·{right arrow over (x)}ij)/|{right arrow over (x)}ij|
where {right arrow over (x)}ij is the vector that goes from microphone i to microphone j and {right arrow over (u)} is a unit vector pointing in the direction of the source. From the same figure, it can be stated that:
cos φ = sin θ = cΔTij/|{right arrow over (x)}ij|
where c is the speed of sound. When combining the two equations, we obtain:

{right arrow over (u)}·{right arrow over (x)}ij = cΔTij
which can be re-written as:

u(xj−xi) + v(yj−yi) + w(zj−zi) = cΔTij
where {right arrow over (u)}=(u, v, w) and {right arrow over (x)}ij=(xj−xi, yj−yi, zj−zi), the position of microphone i being (xi, yi, zi). Considering N microphones, we obtain a system of N−1 equations:
[ (x2−x1)  (y2−y1)  (z2−z1) ]   [ u ]   [ cΔT12 ]
[ (x3−x1)  (y3−y1)  (z3−z1) ] · [ v ] = [ cΔT13 ]
[    ⋮        ⋮        ⋮    ]   [ w ]   [   ⋮   ]
[ (xN−x1)  (yN−y1)  (zN−z1) ]           [ cΔT1N ]
In the case with more than 4 microphones, the system is over-constrained and the solution can be found using the pseudo-inverse, which can be computed only once since the matrix is constant. Also, the system is guaranteed to be stable (i.e., the matrix is non-singular) as long as the microphones are not all in the same plane.
The linear system expressed above is theoretically valid only for the far-field case. In the near-field case, the main effect on the result is that the direction vector {right arrow over (u)} found has a norm smaller than unity. By normalizing {right arrow over (u)} it is possible to obtain results for the near-field that are almost as good as for the far-field. Simulating an array of 50 cm×40 cm×36 cm shows that the mean angular error is reasonable even when the source is very close to the array, as shown by FIG. 6D. Even at 25 cm from the center of the array, the mean angular error is only 5 degrees. At such distance, the error corresponds to about 2-3 cm, which is often larger than the source itself. For those reasons, we consider that the method is valid for both near-field and far-field. Normalizing {right arrow over (u)} also makes the system insensitive to the speed of sound, because the relation {right arrow over (u)}·{right arrow over (x)}ij = cΔTij shows that c only has an effect on the magnitude of {right arrow over (u)}. That way, it is not necessary to take into account the variations in the speed of sound.
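The far-field system above can be sketched in a few lines of NumPy; the function name and argument layout are our own:

```python
import numpy as np

def source_direction(mic_pos, tdoa, c=344.0):
    # Far-field source direction from N microphone positions:
    # each row (x_j - x_1, y_j - y_1, z_j - z_1) . u = c * dT_1j, j = 2..N,
    # solved with the pseudo-inverse (computable once, since the matrix
    # is constant), then u normalized to a unit vector, which also removes
    # the sensitivity to the exact speed of sound.
    p = np.asarray(mic_pos, dtype=float)
    A = p[1:] - p[0]                        # (N-1) x 3 constant matrix
    b = c * np.asarray(tdoa, dtype=float)   # c * dT_1j
    u = np.linalg.pinv(A) @ b
    return u / np.linalg.norm(u)
```

As long as the microphones are not all coplanar, `A` has full column rank and the pseudo-inverse solution is stable.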
Face Detection
A face detection algorithm aims at locating faces in an image or video. The output of this type of algorithm is often a set of rectangles {R(Xr, Yr, Wr, Hr)} positioned exactly onto the detected faces and centered onto the nose, wherein Xr and Yr denote the coordinates in the X and Y plane, Wr the width and Hr the height of the rectangle.
A fast and efficient method is called Haar face detection. Haar-like features are digital image features used in object recognition. They owe their name to their intuitive similarity with Haar wavelets and were used in the first real-time face detector.
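The integral-image trick that makes Haar-like features fast to evaluate can be sketched as follows; the helper names are our own, and a real detector such as the Viola-Jones cascade evaluates thousands of such features through boosted stages:

```python
import numpy as np

def integral_image(img):
    # Summed-area table with a zero border: ii[y, x] = sum of img[:y, :x],
    # so any rectangle sum costs only four lookups.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    # Sum of pixels in the rectangle (x, y, w, h) via four table lookups.
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect_vertical(ii, x, y, w, h):
    # Classic two-rectangle Haar-like feature: upper half minus lower half
    # (e.g. bright forehead region vs. darker eye region); h must be even.
    return rect_sum(ii, x, y, w, h // 2) - rect_sum(ii, x, y + h // 2, w, h // 2)
```

Because each feature costs a constant number of lookups regardless of its size, the detector can scan every position and scale of the image in real time.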
Body Detection
A body detection algorithm aims at locating not only faces, but also parts of or the whole body in an image or video.
Body Detector/Tracker
A body detector is any device that can localize the static or moving body of a person (shape) over time. It may also be called body sensor or body tracker, or only tracker.
Video Conference
It is well-known that in a video conference application the active speakers are the most important region of interest that is likely to be observed/focused on by a viewer, and that the body language is an important factor of communication and thus one has to avoid altering it.