Classification (also called categorisation) is the task of assigning an input to a certain group (also called class or category). The output of classification is the label of the group that the input has been assigned to. The assignment of an input to a class is generally based on certain characteristics of the input which are called features. When classes are formed based on some ontology, the classification provides semantic understanding. Semantic classes are often arranged into a hierarchical structure. For example, a taxonomy is a set of classes arranged in a tree structure.
In one approach to a classification, a label of each test instance (e.g., a video or a segment of a video) is determined independently of labels of all other test instances. However, such an approach fails to exploit logical or statistical interdependencies between labels of multiple instances, resulting in reduced classification accuracy. Classification approaches that exploit logical or statistical interdependencies are called joint classifications. Structured classification is another term commonly used for joint classification.
In machine learning, a probabilistic classifier is a classifier that is able to provide, given a sample input, a probability distribution over a set of predicted classes. Probabilistic classifiers represent a classification task as a random variable (e.g., Y) and the result of a classification process (i.e., the label inferred for a test instance) is the value of the random variable; e.g. Y=y means the outcome of classification, modelled as Y, is the state (i.e., label) y. A probabilistic classifier may be considered as a conditional distribution P(Y|x), meaning that for a given input x∈X, a probability is assigned to each y∈Y. A classification method may use a probabilistic classifier to determine a classification by choosing the label, y, which the probabilistic classifier assigns the highest conditional probability. This is known as the maximum a posteriori (MAP) solution to the joint probabilistic model. The MAP solution to a probabilistic model is a state (y*) that maximises the posterior probability distribution (Y|x); i.e., y*=argmaxy P(Y=y|x). The variable x is often called observed variable or feature.
In one approach, probabilistic joint classification is performed using a probabilistic graphical model. A probabilistic graphical model is a probabilistic model for which a graph expresses the conditional interdependencies between random variables. Types of probabilistic graphical models include Bayesian networks and Markov networks, also called Markov Random fields (MRF). An MRF conditioned on the value of observed variables is called a conditional Markov random field (CRF). The distinction between CRF models and MRF models is that a CRF model is conditioned on an input observed variable while an MRF is not. Once all input observed variables of a CRF model are accounted for, the CRF model is an MRF model. For that reason, this disclosure makes no distinction between a CRF model and an MRF model. Thus any use of the term MRF is understood to mean CRF or MRF.
A Bayesian network is a directed acyclic probabilistic graphical model which represents a set of random variables and their conditional dependencies via a directed acyclic graph. Bayesian networks that model sequences of random variables are called dynamic Bayesian networks (DBN). DBNs are used for time sequence modelling and temporal pattern recognition such as speech, handwriting, gesture and action recognition. Hidden Markov models (HMM) is a simple and common form of DBN. Common DBN models such as HMM and 2 time slice Bayesian networks (2TBN) are first order Markov models, meaning that they model a predicted state of a system at time t, conditioned on an understood state of the system at a previous time t−1. Adding higher order temporal dependencies (which go beyond the previous time slice) makes the inferencing of a DBN model intractable. A classification of a static concept must remain the same over time. The use of static concepts imposes long-term temporal constraints that cannot be efficiently incorporated into a DBN model.
An MRF consists of an undirected graph in which the nodes represent random variables, and the edges represent interdependencies between random variables. The interdependencies are represented as ‘potential functions’. To construct a MRF model, the number of random variables and the corresponding observed feature values must be known prior to the use of the MRF model. MRF models capture interdependencies between labels of multiple instances, but the interdependences are undirected (e.g., non-causal). For example, in computer vision, MRF models are used in object detection to capture correlation between labels of objects in an image.
A solution space of an MRF model consists of all possible states of a probabilistic model of the MRF model, each state being a combination of possible classification labels for each random variable of the MRF model. A solution space of an MRF model grows exponentially as the number of variables increases. Thus, using an exhaustive search for MAP inferencing is intractable when the number of variables is not small. However, efficient exact inference algorithms exist for tree-structured MRF models and for binary MRF models that have so called ‘submodular’ potential functions. Markov random fields can represent arbitrary interdependences among random variables. Designing an MRF model, however, generally requires making decisions that trade-off across (i) having a simple model, (ii) having higher order interdependencies, and (iii) having additional interdependencies. Having a simple model may lead to inaccuracies in inferred classifications. Having higher order interdependencies or additional interdependencies may require cycles in the graph which may lead to the inferencing time to be too long or necessitate the use of an approximate inferencing method. An approximate inferencing method may be fast but may be too inaccurate. In general, approximate MAP inference of an MRF model with densely connected random variables has significantly higher computational cost compared to an efficient MAP inferencing algorithm for an MRF model with tree or a chain structured graph of connections.
Known methods do not teach how to construct a probabilistic model of one or more time sequences so that the model may be used to, efficiently and accurately, jointly infer classifications for static concepts and dynamic concepts. One modelling approach is to disregard the temporal dependency of dynamic concepts and instead to jointly model a snapshot of a time sequence and thus ignore interdependencies between snapshots. This modelling approach may lead to lower accuracy as it disregards temporal interdependencies of classifications in a time sequence.
Thus, there exists a need for accurate classification of static and dynamic concepts in a video which accounts for temporal and non-temporal interdependencies while enabling efficient and accurate inference.