Classification (also called categorisation) is the task of assigning an input to a certain group (also called class or category). The output of classification is the label of the group that the input has been assigned to. The assignment of an input to a class is generally based on certain characteristics of the input which are called features. When classes are formed based on some ontology, the classification provides semantic understanding. Semantic classes are often arranged into a hierarchical structure. For example, a taxonomy is a set of classes arranged in a tree structure.
In one approach to a classification, a label of each test instance (e.g., a video or a segment of a video) is determined independently of labels of all other test instances. However, such an approach fails to exploit logical or statistical interdependencies between labels of multiple instances, resulting in reduced classification accuracy. Classification approaches that exploit logical or statistical interdependencies are called joint classifications. Structured classification is another term commonly used for joint classification.
In machine learning, a probabilistic classifier is a classifier that is able to provide, given a sample input, a probability distribution over a set of predicted classes. Probabilistic classifiers represent a classification task as a random variable (e.g., Y) and the result of a classification process (i.e., the label inferred for a test instance) is the value of the random variable; e.g. Y=y means the outcome of classification, modelled as Y, is the state (i.e., label) y. A probabilistic classifier may be considered as a conditional distribution P(Y|x), meaning that for a given input x∈X, a probability is assigned to each y∈Y. A classification method may use a probabilistic classifier to determine a classification by choosing the label, y, which the probabilistic classifier assigns the highest conditional probability. This is known as the maximum a posteriori (MAP) solution to the joint probabilistic model. The MAP solution to a probabilistic model is a state (y*) that maximises the posterior probability distribution (Y|x); i.e., y*=argmaxy P(Y=y|x). The variable x is often called observed variable or feature.
In one approach, probabilistic joint classification is performed using a probabilistic graphical model. A probabilistic graphical model is a probabilistic model for which a graph expresses the conditional interdependencies between random variables. A probabilistic graphical model breaks up the joint probability distribution into smaller factors, each over a subset of random variables. The overall joint distribution is then defined as the normalised product of these factors. The function modelling the dependencies between the random variables in a factor is called potential function.
Types of probabilistic graphical models include Bayesian networks and Markov networks, also called Markov Random fields (MRF). An MRF conditioned on the value of observed variables is called a conditional Markov random field (CRF). The distinction between CRF models and MRF models is that a CRF model is conditioned on an input observed variable while an MRF is not. Once all input observed variables of a CRF model are accounted for, the CRF model is an MRF model. For that reason, this disclosure makes no distinction between a CRF model and an MRF model. Thus any use of the term MRF is understood to mean CRF or MRF.
An MRF consists of an undirected graph in which the nodes represent random variables, and the edges represent factors or potential functions over a pair of variables. A potential function including any number of observed variables and only one non-observed (i.e., output) variable is called unary potential function. A potential function including two output variables is called a pair-wise potential function. A pair-wise potential function may also be called a binary potential function. A potential function (i.e., factor) including more than two variables is often called a high order clique. An MRF is a tree-structured MRF when the dependency modelled in the probability distribution can be shown with a tree structure graph. In graph theory, a tree is an undirected graph in which any two vertices are connected by exactly one path.
Any MRF with positive potential functions can be converted into a log-linear representation in which the potential functions are represented as exponential of linear combination of feature functions. Each feature, similar to factors, has a scope. Different features can have the same scope.
To construct an MRF model, the number of random variables and the corresponding observed feature values must be known prior to the use of the MRF model. MRF models capture interdependencies between labels of multiple instances, but the interdependences are undirected (e.g., non-causal). For example, in computer vision, MRF models are used in object detection to capture correlation between labels of objects in an image.
Known methods do not teach how to construct a probabilistic model so that the model may be used to, efficiently and accurately, jointly infer classifications for a composite action (also known as complex action) and the sequence of its constituent primitive actions (also known as action units). Known methods fail to exploit the dependencies between the different time scales to achieve accurate classification.
One approach is to model sequential data using a CRF model. A sequence memorizer is used for modelling long term dependencies as the transition potentials. This approach performs an approximate inference only, and this approach does not model dependencies at different scales. This modelling approach may lead to lower accuracy as it disregards interdependencies between classifications at different time scales.
Another approach is to model composite actions as temporally structured processes using a combination of context free grammar (CFG) and Hidden Markov Models (HMM). The HMMs are used to classify action units. Context free grammar is used to recognise the class of the composite action from detected action units. This modelling approach requires all possible compositions of action units into composite actions to be defined a priori. The lack of robustness to changes in the actions results in lower classification accuracy.
Thus, there exists a need for accurate classification of composite actions and action units in a video which is robust to different time scales and which is robust to changes in the composition of action units, while enabling efficient and accurate inference.