Gesture recognition is an example of an application that uses efficient temporal segmentation, the task of finding gestures within a flow of human motion, as a pre-processing step. Usually performed in an unsupervised manner, temporal segmentation facilitates the subsequent recognition of gestures.
Gesture recognition and segmentation can be performed either simultaneously or sequentially. For example, machine learning frameworks capable of modeling temporal aspects directly, such as hidden Markov models (HMMs), continuous-time recurrent neural networks (CTRNNs), dynamic Bayesian networks (DBNs), or conditional random fields (CRFs), can be used for simultaneous gesture recognition and segmentation. Temporal segmentation has also been studied independently of its recognition counterpart. In that setting, two main approaches predominate: temporal clustering and change-point detection.
Temporal clustering (TC) refers to the factorization of multiple time series into a set of non-overlapping segments that belong to k temporal clusters. Being inherently offline, the approach benefits from a global view of the data and provides cluster labels, as in ordinary clustering. However, because it needs the full sequence, temporal clustering may not be suitable for real-time applications.
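The idea of assigning non-overlapping segments to k clusters can be illustrated with a deliberately simplified sketch. It is not a published temporal clustering algorithm: it assumes fixed-length candidate segments, summarizes each one by its mean frame, groups the summaries with plain k-means, and merges adjacent segments that share a label. Real methods typically optimize segment boundaries and labels jointly. All function names and the seg_len parameter are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def segment_means(frames, seg_len):
    """Cut the series into consecutive non-overlapping windows and average each."""
    dim = len(frames[0])
    bounds = [(s, min(s + seg_len, len(frames)))
              for s in range(0, len(frames), seg_len)]
    means = [[sum(f[d] for f in frames[s:e]) / (e - s) for d in range(dim)]
             for s, e in bounds]
    return bounds, means


def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def temporal_clusters(frames, k, seg_len=5):
    """Return (start, end, label) triples; end is exclusive."""
    bounds, means = segment_means(frames, seg_len)
    labels = kmeans(means, k)
    merged = []
    for (start, end), label in zip(bounds, labels):
        if merged and merged[-1][2] == label:
            # Same cluster as the previous segment: extend it.
            merged[-1] = (merged[-1][0], end, label)
        else:
            merged.append((start, end, label))
    return merged


# Toy one-dimensional "motion": rest, movement, rest.
toy = [[0.0]] * 10 + [[5.0]] * 10 + [[0.0]] * 10
print(temporal_clusters(toy, k=2))
```

The sketch needs the whole series before it can label anything, which reflects the offline nature of the approach; in return, the two rest periods in the toy data receive the same cluster label.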
Change-point methods rely on various tools from signal theory and statistics to locate frames where the pattern changes abruptly within the flow of motion. Change-point methods can be restricted to univariate series with parametric distribution assumptions, which do not hold when analyzing human motion. The recent use of kernel methods has relaxed some of these limitations, and change-point methods have recently been applied to the temporal segmentation problem. Unlike temporal clustering, the change-point approach often results in unsupervised online algorithms, which can run in real time by relying on local patterns in the time series.
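A kernel-based change-point detector working on local windows can be sketched as follows. This is a minimal illustration, not the method of any particular paper: it assumes a Gaussian (RBF) kernel, scores each frame by a biased estimate of the squared maximum mean discrepancy (MMD) between the window just before and the window starting at that frame, and reports local maxima above a threshold. The function names and the window, gamma, and threshold values are hypothetical choices.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length frames."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(past, future, gamma=1.0):
    """Biased estimate of the squared MMD between two windows of frames."""
    def mean_kernel(xs, ys):
        return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))
    return (mean_kernel(past, past) + mean_kernel(future, future)
            - 2.0 * mean_kernel(past, future))


def change_scores(frames, window=5, gamma=1.0):
    """Score frame t by comparing frames [t-window, t) with frames [t, t+window)."""
    scores = [0.0] * len(frames)
    for t in range(window, len(frames) - window + 1):
        scores[t] = mmd2(frames[t - window:t], frames[t:t + window], gamma)
    return scores


def detect_change_points(frames, window=5, gamma=1.0, threshold=0.5):
    """Return indices whose score is a local maximum above the threshold."""
    scores = change_scores(frames, window, gamma)
    return [t for t in range(1, len(scores) - 1)
            if scores[t] > threshold
            and scores[t] >= scores[t - 1]
            and scores[t] > scores[t + 1]]


# Toy two-dimensional "motion" whose pattern switches at frame 20.
toy = [[0.0, 0.0]] * 20 + [[3.0, 3.0]] * 20
print(detect_change_points(toy))
```

Because the kernel operates on whole frames, the same code handles multivariate data without a parametric model. The score at frame t only needs the frames within one window on either side, so a streaming version could report change points with a delay of window frames.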
Although significant progress has been made in temporal segmentation, the problem remains inherently challenging due to viewpoint changes, partial occlusions, and spatio-temporal variations.