Public venues such as shopping centres, parking lots and train stations are increasingly subject to surveillance using large-scale networks of video cameras. Application domains of large-scale video surveillance include security, safety, traffic management and business analytics.
A key task in many of these applications is rapid and robust object matching across multiple camera views. In one example called “hand-off”, object matching is applied to persistently track multiple objects across a first and second camera with overlapping fields of view. In another example called “re-identification”, object matching is applied to locate a specific object of interest across multiple cameras in the network with non-overlapping fields of view. In the following discussion, the term “object matching” will be understood to include the terms “hand-off”, “re-identification”, “object identification”, and “object recognition”.
A camera captures an image at a time. An image is made up of visual elements. The terms “pixel”, “pixel location” and “image location” are used interchangeably throughout this specification to refer to one of the visual elements in a captured image. Each pixel of an image is described by one or more values characterising a property of the scene captured in that particular pixel. In one example, a single intensity value characterises the brightness of the scene at the pixel location. In another example, a triplet of values characterises the colour of the scene at the pixel location. Furthermore, a “region”, “image region” or “cell” in an image refers to a collection of one or more spatially adjacent visual elements.
A “bounding box” refers to a rectilinear image region enclosing an object in an image. In the present disclosure, the bounding box encompasses the object of interest, which is usually a pedestrian in the application of video surveillance.
A common approach for object matching includes the steps of extracting an “appearance signature” for each object and using the appearance signature to compute a similarity between different objects. Throughout this description, the term “appearance signature” refers to a set of values summarizing the appearance of an object or region of an image, and will be understood to include the terms “appearance model”, “feature descriptor” and “feature vector”.
One of the steps of obtaining the appearance signature is segmenting the captured image into one region that belongs to the object itself (also known as the foreground) and another region that belongs to the scene (also known as the background). This process is commonly known as “foreground segmentation”, or “foreground background classification”.
One commonly used tool for performing such an analysis is an artificial neural network (ANN). An artificial neural network includes a set of nodes, a set of weights, and a set of edges, also referred to as connections. Each of the edges is weighted and connects two nodes of the ANN. The weighted edges are also referred to as weighted connections. The artificial neural network is trained using a set of training input and output instances. For example, the training input could be the RGB pixels of an image, and the output could be the likelihood for each pixel to be part of the foreground. Hereinafter, such an output is called a “foreground mask”.
One type of artificial neural network is the Convolutional Neural Network (CNN). A CNN arranges the ANN nodes along with the weights into layers. Operators such as “convolution”, “max pooling”, “Rectified Linear Unit (ReLU)” and “softmax” are performed by one or more layers (also called sub-networks) of the CNN. Each layer or sub-network calculates the node input values of the next layer or sub-network, respectively, of the CNN.
In an example where each layer performs an operation, the first layer is the input to the CNN, which could be, for example, the image data. Through each operator (i.e., layer in this example), the CNN calculates the node input values of the next layer. The last layer is the output layer, which could be the likelihood of each pixel to be part of the foreground of the image data (which is the input to the first layer). A CNN for foreground segmentation commonly uses the “deconvolution” operator in addition to the operators above. The CNN may be trained using one dataset prior to being trained using another dataset. This process is commonly known as “pre-training”. Pre-training provides better initial weights for the subsequent training and ultimately for the foreground segmentation of an image.
The following describes some of the operations that the CNN can perform.
Convolution is a commonly known filter operation, which is illustrated in FIG. 10. FIG. 10 shows a “conv3×3” operation, which means a 3×3 linear filter 1010 that is being applied to a given two-dimensional layer 1020. The application of the 3×3 linear filter 1010 to the two-dimensional layer 1020 results in the forming of a new two-dimensional layer 1030.
For example, let I(x,y) be a two-dimensional layer 1020 with coordinates (x,y), and let f(u,v) (u=−1,0,1, v=−1,0,1) be a “3×3 kernel” 1010. The values of f(u,v) are also known as the “weights” of the kernel. The output of applying conv3×3 to the layer 1020, denoted by (I*f), is:

$$(I*f)(x,y)=\sum_{u=-1}^{1}\sum_{v=-1}^{1} I(x-u,\,y-v)\,f(u,v) \qquad \text{Eq. 1}$$
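As an illustration of Eq. 1, the following is a minimal sketch in plain Python rather than an implementation taken from this disclosure. The function name `conv3x3`, the list-of-rows layout, and the choice to treat I(x,y) as zero outside the layer are all assumptions made for the example; Eq. 1 itself does not specify the boundary handling.

```python
def conv3x3(layer, kernel):
    """Apply Eq. 1 to a 2-D layer stored as a list of rows.

    kernel[u + 1][v + 1] holds f(u, v) for u, v in {-1, 0, 1}.
    Assumption: I(x, y) is taken to be zero outside the layer.
    """
    height, width = len(layer), len(layer[0])

    def I(x, y):
        # x indexes columns and y indexes rows; out-of-range reads give zero.
        if 0 <= x < width and 0 <= y < height:
            return layer[y][x]
        return 0

    return [
        [
            sum(I(x - u, y - v) * kernel[u + 1][v + 1]
                for u in (-1, 0, 1) for v in (-1, 0, 1))
            for x in range(width)
        ]
        for y in range(height)
    ]


layer = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # only f(0, 0) = 1
shift = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]     # only f(-1, -1) = 1

same = conv3x3(layer, identity)   # reproduces the input layer
shifted = conv3x3(layer, shift)   # output(x, y) = I(x + 1, y + 1)
```

Because Eq. 1 indexes the layer at (x−u, y−v), a kernel whose only non-zero weight is f(−1,−1) reads the element at (x+1, y+1), which is why the second example shifts the layer towards the origin.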
It is possible to have a convolution kernel of a size other than 3×3. Further, it is possible to apply convolution to a three-dimensional layer:

$$(I*f)(x,y,z)=\sum_{u=-1}^{1}\sum_{v=-1}^{1}\sum_{w=1}^{C} I(x-u,\,y-v,\,w)\,f(u,v,w,z) \qquad \text{Eq. 2}$$

where the input three-dimensional layer has size W×H×C.
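The three-dimensional case of Eq. 2 can be sketched in the same way. This is an illustrative example only: the name `conv3x3_3d`, the nested-list layout, the 0-based channel index, and the zero treatment of out-of-range positions are assumptions, not details given in this disclosure.

```python
def conv3x3_3d(layer, kernel):
    """Apply Eq. 2 to a 3-D layer.

    layer[y][x][w] holds I(x, y, w), with the channel index w counted from 0.
    kernel[u + 1][v + 1][w][z] holds f(u, v, w, z).
    Assumption: I is taken to be zero outside the W x H extent.
    """
    height, width, channels = len(layer), len(layer[0]), len(layer[0][0])
    out_channels = len(kernel[0][0][0])

    def I(x, y, w):
        if 0 <= x < width and 0 <= y < height:
            return layer[y][x][w]
        return 0

    return [
        [
            [
                sum(I(x - u, y - v, w) * kernel[u + 1][v + 1][w][z]
                    for u in (-1, 0, 1)
                    for v in (-1, 0, 1)
                    for w in range(channels))
                for z in range(out_channels)
            ]
            for x in range(width)
        ]
        for y in range(height)
    ]


# A 2 x 2 layer with C = 2 channels.
layer3d = [[[1, 10], [2, 20]],
           [[3, 30], [4, 40]]]

# A 3 x 3 x 2 x 1 kernel that is zero except at the centre (u = v = 0),
# where it adds the two input channels into one output channel.
kernel3d = [[[[0] for _ in range(2)] for _ in range(3)] for _ in range(3)]
kernel3d[1][1][0][0] = 1
kernel3d[1][1][1][0] = 1

summed = conv3x3_3d(layer3d, kernel3d)
```

In this sketch the sum over w collapses all C input channels into each output channel z, so the number of output channels is set by the last dimension of the kernel.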
Deconvolution, denoted by deconv, is a commonly known filter operation. One example of the deconvolution operation is illustrated in FIG. 11. Given an input two-dimensional layer 1110, zero padding is inserted between its elements to form the two-dimensional layer 1120. The value of each element 1111 is simply copied across to a new position 1121, with zero-valued elements in between. A convolution (see FIG. 10) is then applied to the padded layer to form the deconvolved layer 1130. Kernels of different sizes and different numbers of padding elements could be applied. In particular, one could obtain a deconvolved layer 1130 that is exactly twice the size of the input layer 1110 by padding an extra row and column of zero elements in the two-dimensional layer 1120.
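The zero-insertion and convolution steps described above can be sketched as follows. This is a hedged illustration, not the arrangement of FIG. 11: the function names, the placement of the extra zero row and column at the bottom and right, the zero treatment of out-of-range positions in the convolution, and the particular interpolating kernel are all assumptions chosen for the example.

```python
def zero_insert(layer):
    """Copy element (r, c) to (2r, 2c) with zeros in between.

    Assumption: the extra row and column of zeros are appended at the
    bottom and right, giving a padded layer of exactly twice the size.
    """
    height, width = len(layer), len(layer[0])
    padded = [[0] * (2 * width) for _ in range(2 * height)]
    for r in range(height):
        for c in range(width):
            padded[2 * r][2 * c] = layer[r][c]
    return padded


def conv3x3(layer, kernel):
    """Same-size 3 x 3 convolution; reads outside the layer give zero."""
    height, width = len(layer), len(layer[0])

    def I(x, y):
        return layer[y][x] if 0 <= x < width and 0 <= y < height else 0

    return [[sum(I(x - u, y - v) * kernel[u + 1][v + 1]
                 for u in (-1, 0, 1) for v in (-1, 0, 1))
             for x in range(width)]
            for y in range(height)]


def deconv(layer, kernel):
    """Zero insertion followed by convolution."""
    return conv3x3(zero_insert(layer), kernel)


# Hypothetical kernel that linearly interpolates between copied elements.
interp = [[0.25, 0.5, 0.25],
          [0.5, 1.0, 0.5],
          [0.25, 0.5, 0.25]]

small = [[1, 2],
         [3, 4]]
padded = zero_insert(small)
upsampled = deconv(small, interp)   # 4 x 4 layer from a 2 x 2 input
```

With this kernel the copied elements keep their values and the inserted zeros are replaced by averages of their neighbours; in a trained CNN the kernel weights would instead be learned.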
Maxpooling is a filter that shrinks a two-dimensional layer. Assuming a filter of 2×2, the maxpooling operation divides the two-dimensional layer into adjacent, non-overlapping 2×2 regions. The maximum element in each region forms the corresponding element of the resultant two-dimensional layer. The resultant layer has half the width and half the height of the input layer.
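A minimal sketch of 2×2 maxpooling follows. The function name `maxpool2x2` and the requirement that the layer has an even number of rows and columns are assumptions made to keep the example short.

```python
def maxpool2x2(layer):
    """Take the maximum of each non-overlapping 2 x 2 region.

    Assumption: the layer has an even number of rows and columns.
    """
    height, width = len(layer), len(layer[0])
    return [
        [
            max(layer[r][c], layer[r][c + 1],
                layer[r + 1][c], layer[r + 1][c + 1])
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


grid = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [7, 0, 3, 3],
        [1, 8, 2, 9]]
pooled = maxpool2x2(grid)   # 4 x 4 input, 2 x 2 output
```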
A fully connected layer, commonly denoted by fc, is a filter that applies a linear operation. Let x(i) and y(j) (i=1,2,3, . . . ,I, j=1,2,3, . . . ,J) be the input and output vectors, respectively. Let w(j,i) and b(j) be the weights and bias, respectively. The output y(j) is:

$$y(j)=\sum_{i=1}^{I} w(j,i)\,x(i)+b(j) \qquad \text{Eq. 3}$$
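The fully connected operation can be sketched as a weighted sum per output element. The function name `fully_connected` and the storage of w(j,i) as `weights[j][i]` (one row of weights per output element) are assumptions made for this example.

```python
def fully_connected(x, weights, bias):
    """Compute y(j) = sum over i of w(j, i) * x(i) + b(j).

    weights[j][i] holds w(j, i); bias[j] holds b(j).
    """
    return [
        sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j
        for row, b_j in zip(weights, bias)
    ]


x = [1, 2, 3]                   # input vector, I = 3
weights = [[1, 0, 0],           # y(1) picks out x(1)
           [1, 1, 1]]           # y(2) sums all inputs
bias = [0.5, -1]
y = fully_connected(x, weights, bias)   # output vector, J = 2
```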
Even armed with tools like the CNN, foreground segmentation is still a challenging problem. One challenge is adapting the CNN to different context information, as a CNN trained for foreground segmentation in one environment may not function well in a different environment. Context information (also referred to as context in the present disclosure) refers to additional information that is related to an image, but is not part of the visual elements or metadata of that image. The term context or context information will be discussed in detail below.
One conventional method for crowd counting uses an adaptive convolutional neural network (ACNN) to adapt to multiple contexts. The ACNN has convolution layers (or sub-networks) that can change the weights of convolution according to the context information. However, it is difficult for these convolution layers to be pre-trained without context information because the weights are controlled by the context information.
Computational cost is another challenge. A CNN incurs a large computational cost when processing a large number of images. The problem becomes more challenging if the CNN needs to be embedded in a portable product, e.g. a camera, which has fewer processing resources than a desktop computer.