Detecting from video footage when two or more people interact with each other, or when a person interacts with an object of interest, is a technically important and yet challenging task. Detecting interactions from video data has applications in areas such as sports analytics, surveillance, and safety and security monitoring.
In the present disclosure, the term action refers to an act of doing something in order to make something happen, and the term interaction refers to a reciprocal act to an action, involving more than one person, or a person and one or more objects. For example, in a soccer game, players interact with the ball, for example by kicking the ball with a player's foot or trapping the ball with a player's chest, and players also interact with each other by passing the ball between them.
For instant interactions, that is when the duration of the interaction is smaller than that discernible by a monitoring system under consideration, such as someone hitting an object, determining “time localisation” of the interaction refers to determining the time at which the interaction occurs. For continuing interactions, that is when the duration of the interaction is non-trivial, such as someone holding an object, determining time localisation of the interaction refers to determining the times at which the interaction starts and ends. The determined time localisation may be in the form of a relative time of the interaction compared to a reference starting time, such as the start of recording. When a corresponding video recording of the scene also exists, time localisation may also be expressed as the frame number at which the interaction occurs. In the present disclosure, determining time localisation of an interaction is referred to as ‘temporal localisation’.
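The mapping between a relative time localisation and a frame number can be sketched as follows. This is an illustrative helper only, assuming a video with a fixed, known frame rate; the function names and the default frame rate are hypothetical, not part of the present disclosure:

```python
def time_to_frame(t_seconds, fps=25.0):
    """Map a relative interaction time (seconds since the start of
    recording) to the corresponding video frame number, assuming a
    fixed frame rate."""
    return int(round(t_seconds * fps))

def localise_interval(start_s, end_s, fps=25.0):
    """Temporal localisation of a continuing interaction, expressed
    as a (start_frame, end_frame) pair."""
    return time_to_frame(start_s, fps), time_to_frame(end_s, fps)
```

An instant interaction at 2.0 s in a 25 fps recording maps to frame 50; a continuing interaction from 1.0 s to 3.0 s maps to frames 25 through 75.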
Action or interaction detection may also include classifying the interaction. Classification (also called categorisation) is the task of assigning a class label to an input instance of the action or interaction. For example, ‘successful pass’ and ‘failed pass’ are two examples of class labels in a sport analysis application, and ‘meeting’, ‘passing an object’ and ‘separating’ are examples of class labels in a surveillance application. Action or interaction classification accuracy typically improves significantly if the temporal localisation is accurate, since irrelevant background content could behave as noise and adversely affect the accuracy of pre-trained models. Similarly, when some parts of the action or interaction are not included in the input instance due to imperfect segmentation and localisation, the classification accuracy would typically be lower.
Temporal localisation of actions and interactions is a challenging task, as interactions often occur quickly. Detecting interactions in video recordings of scenes is also challenging due to the limited field of view of each camera capturing the scenes, substantial occlusions, and the visual similarity of different actions and interactions, especially when fine-grained detection is required. Fine-grained interaction detection refers to temporal localisation and/or classification of interactions that are visually similar, such as distinguishing between a successful pass and a failed pass in a soccer game.
A known technique for temporal action or interaction localisation in video content trains an action/interaction classifier using pre-segmented interaction instances. At the recall/test stage, the pre-trained classifier is applied to fixed-length, and often overlapping, temporal segments of the video. The pre-trained classifier localises the action using greedy localisation techniques such as non-maximum suppression. Multiple cameras are often required to cover a large scene, such as a rugby field, a soccer field, or a fairly large surveillance site. Existing techniques for temporal localisation from video content are relatively slow and inefficient in such multi-camera systems, as the video generated by each camera is generally processed independently and the final detections are generated by fusing multiple single-camera-view detections. In addition to computational inefficiency, the existing temporal interaction localisation solutions can have low accuracy, as the whole interaction may not be visible in any single video.
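The greedy localisation step referred to above can be sketched as one-dimensional non-maximum suppression over scored temporal segments. The following is a minimal illustrative implementation, not the method of any particular prior system; the segment boundaries and confidence scores are assumed to come from a sliding-window classifier:

```python
def temporal_iou(a, b):
    """Temporal intersection-over-union of two (start, end, score) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def temporal_nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring segment,
    discard any segment overlapping a kept one too much, and repeat."""
    kept = []
    for seg in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(temporal_iou(seg, k) < iou_threshold for k in kept):
            kept.append(seg)
    return kept
```

For example, given overlapping detections (0, 10, 0.9) and (2, 12, 0.8) plus a distant (20, 30, 0.7), the lower-scoring overlapping segment is suppressed and the other two are kept.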
An alternative known technique for temporal action/interaction localisation, also from video content, is based on a proposal-classification approach. Instead of applying a pre-trained classifier in a sliding window (the technique described above), proposal-classification techniques include an action proposal part. The action proposal part is usually a deep artificial neural network trained to generate class-agnostic action or interaction candidates. The generated candidates are further evaluated with a pre-trained classifier for the action or interaction classes of interest. Existing proposal-classification approaches are designed for, and generally applied to, single-view videos. A computationally efficient extension of existing proposal-classification techniques to multiple camera views has not been developed or published. Additionally, action proposal techniques are usually designed for visually distinct actions (e.g., running vs. high jump vs. climbing vs. golf swing), and the techniques are not efficient or accurate for temporal localisation of fine-grained interactions.
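The class-agnostic proposal stage can be illustrated in a deliberately simplified form. Here, per-frame “actionness” scores are assumed to be given (in real systems they are produced by a deep network); candidate segments are formed by grouping consecutive frames above a threshold, and each candidate would then be passed to the pre-trained classifier. The function name and threshold are hypothetical:

```python
def propose_segments(actionness, threshold=0.5):
    """Class-agnostic proposal stage: group consecutive frames whose
    actionness score is at or above the threshold into candidate
    (start_frame, end_frame) segments, end exclusive."""
    proposals, start = [], None
    for t, score in enumerate(actionness):
        if score >= threshold and start is None:
            start = t                      # a candidate segment begins
        elif score < threshold and start is not None:
            proposals.append((start, t))   # the candidate segment ends
            start = None
    if start is not None:                  # segment runs to the last frame
        proposals.append((start, len(actionness)))
    return proposals
```

For example, actionness scores [0.1, 0.8, 0.9, 0.2, 0.7] yield two class-agnostic candidates, frames 1–3 and frame 4, which a classifier would then label or reject.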
Yet another known technique for action or interaction localisation from videos uses a temporal sequence model, such as a neural network with a recurrent architecture, for example a long short-term memory (LSTM) network. However, a computationally efficient extension of temporal sequence model techniques to multiple camera views has not been developed or published. Further, temporal sequence model techniques are not efficient for fine-grained interaction localisation.
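The temporal sequence model approach can be illustrated with a deliberately minimal recurrent cell: a single-unit Elman-style recurrence standing in for an LSTM, with fixed illustrative weights rather than learned ones. The hidden state carries context from earlier frames, so each frame's interaction score depends on the preceding sequence rather than on the current frame alone:

```python
import math

def recurrent_frame_scores(features, w_x=1.0, w_h=0.5, bias=0.0):
    """Per-frame interaction scores from a single-unit recurrent cell.
    The hidden state h summarises earlier frames, so the score at
    frame t reflects temporal context, not just the current input."""
    h = 0.0
    scores = []
    for x in features:
        h = math.tanh(w_x * x + w_h * h + bias)     # recurrent state update
        scores.append(1.0 / (1.0 + math.exp(-h)))   # sigmoid score in (0, 1)
    return scores
```

With inputs [0.0, 1.0, 1.0, 0.0], the score at the final frame stays above the neutral 0.5 even though its input is zero, because the hidden state still carries evidence from the earlier active frames; an LSTM adds gating to control how long such evidence persists.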
Thus, there is a need for an efficient and accurate interaction localisation technique which can be used when visual content is not available, as well as in systems with multiple cameras covering a large scene.