Classifiers are statistical models, typically implemented as computer programs executed on computer systems, used to classify real world events based on a set of features of a real world event. A real world event is an instance of any entity or event in the real world. An instance of a person and an instance of a hockey game are both real world events. Real world events can also be works of imagination, such as a book of fiction, a fake news story, an abstract painting, or a computer-generated digital image. Each of these events is still an instance of its respective type.
Videos are one type of real world event that can be classified based on a set of features. Videos have various features, which can be based on attributes or elements of the video. An attribute is a numerical or qualitative aspect of a video. For example, a video can have attributes such as an average number of shots, an average pitch, an average luminance, a texture parameter, or the like. An element refers to a sub-part of a video. Elements of a video could include a frame, a sequence of frames, or a sound bite.
In video classification, statistical models are generated which reflect the probability that a video belongs to a class of videos based on its set of features. Videos may be labeled according to any system which creates distinct classes of videos that can be characterized by a set of features. Classes can be based on the type of event depicted within the video, a person in one or more frames of the video, the genre of the video, or the style of the video. Classes may also be based on a type of content contained in the video. For instance, videos may be classified as to whether they contain inappropriate content such as adult content, violent content, or hateful content based on features within the videos that characterize this type of content. The statistical models generated in classification identify and apply the features with the strongest discriminative value in the differential determination of classes of events. The discriminative value of a feature is a function of the feature's association with a class and the ability to discriminate members of the class based on the feature.
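As an illustration only, the notion of discriminative value can be sketched with a Fisher-style score, which compares how far apart two classes sit on a feature against how widely each class spreads on it. The score is one common choice and is an assumption here, not the measure described above; the function name and the sample values are hypothetical.

```python
from statistics import mean, pvariance

def fisher_score(positive, negative):
    """Squared gap between class means divided by the summed class variances."""
    mu_p, mu_n = mean(positive), mean(negative)
    spread = pvariance(positive) + pvariance(negative)
    if spread == 0:
        # No within-class spread: perfectly separating or entirely uninformative.
        return float("inf") if mu_p != mu_n else 0.0
    return (mu_p - mu_n) ** 2 / spread

# Hypothetical per-video feature values for two classes of videos.
peak_pitch_in_class = [0.9, 0.8, 0.95, 0.85]
peak_pitch_out_of_class = [0.2, 0.3, 0.25, 0.15]
luminance_in_class = [0.5, 0.4, 0.6, 0.5]
luminance_out_of_class = [0.45, 0.55, 0.5, 0.6]

pitch_score = fisher_score(peak_pitch_in_class, peak_pitch_out_of_class)
luminance_score = fisher_score(luminance_in_class, luminance_out_of_class)
```

In this made-up data, the peak pitch feature separates the two classes far better than the luminance feature, so a model that ranks features by such a score would favor it.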
Features used in video classification are time series features, meaning they are generated and evaluated over a series of time points either sampled from the video or determined continuously for the video. The manipulation and comparison of time series feature data creates several challenges in the classification of videos and other time series events. One problem associated with the representation of features over a series of time points is that features which have strong discriminative value for a class can be found at multiple different time scales of a video or other time series event. For instance, some features with a strong discriminative value may occur for only a small time interval or scale (e.g. at the millisecond scale), and other features with strong discriminative value may occur over a larger time interval or scale (e.g. at a scale of minutes or the entire duration of the time series event). As an example, a maximum value over a small interval of time (e.g. a high sound pitch caused by a scream in a horror movie) may have the same discriminatory value as an average feature value taken over several minutes of a video (e.g. the number of different shots in a video showing a sporting event).
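The effect of time scale can be sketched by summarizing the same series with windows of different sizes. The following is a minimal illustration, assuming non-overlapping windows and a hypothetical pitch track; the function names and window sizes are not taken from the description above.

```python
def window_means(series, size):
    """Mean of each non-overlapping window of `size` samples.

    A trailing partial window is dropped.
    """
    return [sum(series[i:i + size]) / size
            for i in range(0, len(series) - size + 1, size)]

def multiscale_features(series, sizes):
    """For each window size, report the peak and the average of the window means."""
    features = {}
    for size in sizes:
        means = window_means(series, size)
        if not means:
            continue  # window is longer than the series
        features[size] = {"peak": max(means), "average": sum(means) / len(means)}
    return features

# Hypothetical pitch track: mostly flat, with one brief spike (a "scream").
pitch = [1.0] * 20
pitch[7] = 9.0

feats = multiscale_features(pitch, sizes=[1, 5, 20])
```

At a window size of 1 the spike dominates the peak value, while at a window size of 20 it is almost averaged away. A feature that is discriminative at one scale can therefore be nearly invisible at another.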
Inappropriate videos are videos which contain content that is inappropriate for public viewing, for example at a video website, due to objectionable or unacceptable content. Inappropriate content includes but is not limited to: hate speech, violence and pornography. Generally, the provider of the videos (e.g. an administrator of a video website) establishes a set of criteria or guidelines for determining what types of videos and subject matter are deemed inappropriate and prohibiting these videos. Based on this set of criteria, the provider selects a set of videos to train a statistical classifier to recognize inappropriate content. This set of criteria corresponds to features in the video that distinguish the videos as containing inappropriate content. For instance, the provider's specification of punching or kicking as violent acts may correspond to motion models of the same which distinguish the video as inappropriate content.
The order of the time series values over time creates additional problems in inappropriate video classification. Time series features are typically represented as an ordered vector of values corresponding to features over time or space. While order is important in determining time series features, features with high discriminatory value for a label can often occur in different portions of the video. For instance, inappropriate or adult content is often spliced into videos at different time points, making it more difficult to detect using time series features that are bound to a temporal model.
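One way to see the ordering problem is to compare an ordered vector with a summary that ignores position. The sketch below uses a simple value histogram as the order-independent summary; this is an illustrative assumption rather than the approach described above, and the per-frame scores are hypothetical.

```python
def value_histogram(series, edges):
    """Count samples per bin, ignoring where in the series each sample occurs.

    `edges` are ascending bin boundaries; the last bin includes its upper edge.
    Samples outside the range of `edges` are not counted.
    """
    counts = [0] * (len(edges) - 1)
    for value in series:
        for b in range(len(counts)):
            is_last_bin = b == len(counts) - 1
            in_bin = edges[b] <= value < edges[b + 1]
            if in_bin or (is_last_bin and value == edges[-1]):
                counts[b] += 1
                break
    return counts

# Hypothetical per-frame scores; the same segment is spliced in at two positions.
clean = [0.1, 0.2, 0.1, 0.2, 0.1, 0.2]
segment = [0.9, 0.8]
spliced_early = clean[:1] + segment + clean[1:]
spliced_late = clean[:5] + segment + clean[5:]

edges = [0.0, 0.5, 1.0]
hist_early = value_histogram(spliced_early, edges)
hist_late = value_histogram(spliced_late, edges)
```

The two ordered vectors differ element by element, yet their histograms are identical, so the spliced segment is visible to the histogram regardless of where it was inserted.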
Other problems in classifying inappropriate videos based on time series features are caused by periodicity and sparseness of the time series features. Certain features may have discriminative value based on their periodicity or recurrence over semi-regular time intervals. For instance, inappropriate videos containing hate speech may exhibit only the periodic, recurrent use of language that is distinctive of hate speech, which acts as a recurrent and periodic event that can be used to discriminate these types of videos from other types of videos. Other time series features may be sparse, meaning that the occurrence of the time series feature is sporadic over the video or other time series event and/or occurs over a brief interval of time.
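Periodicity and sparseness can each be given a rough numerical form. The sketch below, offered as an assumption for illustration, estimates periodicity with a normalized autocorrelation and sparseness with the fraction of samples in which the feature is active. The function names and the detector outputs are hypothetical.

```python
def autocorrelation(series, lag):
    """Normalized autocorrelation of `series` at `lag` (0.0 for a constant series)."""
    n = len(series)
    mu = sum(series) / n
    denom = sum((x - mu) ** 2 for x in series)
    if denom == 0:
        return 0.0
    num = sum((series[i] - mu) * (series[i + lag] - mu) for i in range(n - lag))
    return num / denom

def dominant_lag(series, max_lag):
    """Lag in 1..max_lag with the highest autocorrelation (smallest lag wins ties)."""
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(series, lag))

def occupancy(series, threshold=0.0):
    """Fraction of samples above `threshold`; a low value indicates a sparse feature."""
    return sum(1 for x in series if x > threshold) / len(series)

# Hypothetical detector outputs: 1 where a distinctive term is detected, else 0.
periodic = [1 if i % 5 == 0 else 0 for i in range(40)]  # recurs every 5 samples
sparse = [0] * 40
sparse[17] = 1                                          # occurs exactly once

period = dominant_lag(periodic, max_lag=10)
periodic_occupancy = occupancy(periodic)
sparse_occupancy = occupancy(sparse)
```

For the recurring series, the autocorrelation peaks at the lag that matches the recurrence interval. The single-occurrence series has a very low occupancy, which illustrates why a sparse feature is easy to miss when values are averaged over the whole event.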