With the growth of video data generated by many devices, such as cameras, smartphones, and closed-circuit televisions (CCTVs), together with the Internet as a fast-spreading venue, smart and continuous content filtering becomes paramount. In this context, classification of sensitive media content (e.g., pornography, violence, crowd) receives considerable attention because of its applications: it can be used for detecting inappropriate behavior via surveillance cameras; blocking undesired content from being uploaded to (or downloaded from) general-purpose websites (e.g., social networks, online learning platforms, content providers, forums), or from being viewed in certain places (e.g., schools, workplaces); preventing children from accessing adult content on personal computers, smartphones, smart glasses, tablets, cameras, Virtual Reality devices, or smart TVs; and preventing improper content from being distributed over phones, for instance by sexting.
Sensitive media content may be defined as any material that represents threats to its audience. Regarding digital video, the typical sensitive representatives include pornography and violence, but they may also cover disgusting scenes and other types of abstract concepts. Therefore, automatically filtering sensitive content is a hard and challenging task, because of its high-level conceptual nature.
Most of the recent approaches on classification of sensitive content are typically composed of three steps: (i) low-level visual feature extraction, (ii) mid-level feature extraction, and (iii) high-level supervised classification. The low-level features are extracted from the image pixels. They are still purely perceptual, but aim at being invariant to viewpoint and illumination changes, partial occlusion, and affine geometrical transformations. Mid-level features aim at combining the set of low-level features into a global and richer image representation of intermediate complexity. The mid-level features may be purely perceptual or they may incorporate semantic information from the classes, the former case being much more usual in the literature. Finally, the goal of supervised classification is to learn a function which assigns (discrete) labels to arbitrary images. That step is intrinsically semantic, since the class labels must be known during the training/learning phase.
Bag-of-Visual-Words (BoVW) is the most popular mid-level image representation and the most widely used for the sensitive content classification problem. Inspired by the Bag-of-Words model from textual Information Retrieval, where a document is represented by the set of words it contains and their frequencies, the BoVW representation describes an image as a histogram of the occurrence rate of “visual words” in a “visual vocabulary” induced by quantizing the space of a local feature (e.g., Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), or Histogram of Oriented Gradients (HOG)). The visual vocabulary of k visual words, also known as visual codebook or visual dictionary, is usually obtained by unsupervised learning (e.g., the k-means clustering algorithm) over a sample of local descriptors from the training/learning data.
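As an illustration of the BoVW encoding described above, the following minimal sketch assigns each local descriptor to its nearest visual word and builds the normalized occurrence histogram. It assumes that a codebook is already available (e.g., from k-means); the function names and the toy two-dimensional descriptors are hypothetical and are not taken from any of the cited works.

```python
import math
from collections import Counter

def nearest_word(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: math.dist(descriptor, codebook[k]))

def bovw_histogram(descriptors, codebook):
    """Quantize each local descriptor to its nearest visual word and
    return the normalized occurrence histogram (one bin per visual word)."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    total = len(descriptors)
    return [counts[k] / total if total else 0.0 for k in range(len(codebook))]

# Toy 2-D "descriptors" for illustration; real descriptors such as SIFT
# are high-dimensional vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
histogram = bovw_histogram(descriptors, codebook)
```

The hard assignment to a single nearest word is what introduces quantization error, and the histogram discards where in the image each descriptor was found.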
The BoVW representation has important limitations (such as quantization error and loss of spatial information), and several alternatives have been developed. One of the best mid-level aggregate representations currently reported in the literature, the Fisher Vector, is based on the Fisher kernel framework with Gaussian mixture models (GMM) estimated over the training/learning data. For sensitive media content classification, no commercial solution has taken advantage of that state-of-the-art mid-level aggregate representation.
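To make the Fisher Vector idea more concrete, the sketch below computes the gradients of the log-likelihood with respect to the means and standard deviations of a diagonal-covariance GMM, following the formulation commonly used in the literature. It is only a sketch under stated assumptions: the GMM parameters are taken as given (they would normally be estimated over training data), the usual power and L2 normalizations are omitted, and all names are hypothetical.

```python
import math

def gaussian_diag(x, mean, sigma):
    """Density of a diagonal-covariance Gaussian evaluated at x."""
    p = 1.0
    for xi, mi, si in zip(x, mean, sigma):
        p *= math.exp(-0.5 * ((xi - mi) / si) ** 2) / (si * math.sqrt(2.0 * math.pi))
    return p

def fisher_vector(descriptors, weights, means, sigmas):
    """Concatenate first-order (mean) and second-order (standard deviation)
    statistics of the descriptors relative to each GMM component."""
    T, K, D = len(descriptors), len(weights), len(means[0])
    grad_mu = [[0.0] * D for _ in range(K)]
    grad_sigma = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        likes = [weights[k] * gaussian_diag(x, means[k], sigmas[k]) for k in range(K)]
        total = sum(likes)
        for k in range(K):
            gamma = likes[k] / total  # soft assignment of x to component k
            for d in range(D):
                z = (x[d] - means[k][d]) / sigmas[k][d]
                grad_mu[k][d] += gamma * z
                grad_sigma[k][d] += gamma * (z * z - 1.0)
    fv = []
    for k in range(K):
        fv.extend(g / (T * math.sqrt(weights[k])) for g in grad_mu[k])
    for k in range(K):
        fv.extend(g / (T * math.sqrt(2.0 * weights[k])) for g in grad_sigma[k])
    return fv

# Toy GMM with K = 2 components over 2-D descriptors (illustrative values only).
weights = [0.5, 0.5]
means = [(0.0, 0.0), (4.0, 4.0)]
sigmas = [(1.0, 1.0), (1.0, 1.0)]
descriptors = [(0.2, -0.1), (3.9, 4.2), (0.0, 0.3)]
fv = fisher_vector(descriptors, weights, means, sigmas)  # length 2 * K * D
```

Unlike the BoVW histogram, which only counts assignments, the resulting vector has 2·K·D entries and records how the descriptors deviate from each codebook element.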
Other approaches have employed audio features (e.g., Mel-frequency cepstral coefficients (MFCC), loudness, pitch) to improve the classification of sensitive videos. Adding audio analysis to sensitive media detection can be critical for challenging cases that are much harder to detect using visual features only (e.g., breastfeeding, hentai movies, gunshots). In addition, most visual local descriptors are static (i.e., they do not take temporal information into account). Audio features, on the other hand, are purely temporal, since no spatial information is available to be analyzed. Therefore, audio features can outperform static visual descriptors when the nature of the sensitive content is fundamentally temporal (e.g., blows in a fight). However, in the context of sensitive media analysis, the audio information should only be used along with the visual information. Audio features alone can be misleading and unreliable, since they often do not correspond to what is being visually displayed. This is very common in movies, for example, where background music sometimes overlaps the action that is taking place visually. Despite its importance and faster processing time, no commercial solution has taken advantage of the audio information.
In addition, some approaches are based on multimodal fusion, exploiting both auditory and visual features. Usually, the fusion of different modalities is performed at two levels: (i) feature level or early fusion, which combines the features before classification and (ii) decision level or late fusion, which combines the scores from individual classifier models.
The feature level fusion is advantageous in that it requires only one learning phase on the combined feature vector. However, in this approach it is hard to represent the time synchronization between the multimodal features. In addition, the increase in the number of modalities makes it difficult to learn the cross-correlation among the heterogeneous features.
The strategy of the decision level fusion has many advantages over feature fusion. For instance, unlike feature level fusion, where the features from different modalities (e.g., audio and visual) may have different representations, the decisions (at the semantic level) usually have the same representation. Therefore, the fusion of decisions becomes easier. Moreover, the strategy of the decision level fusion offers scalability in terms of the modalities used in the fusion process, which is difficult to achieve in the feature level fusion. Another advantage of late fusion strategy is that it allows us to use the most suitable methods for analyzing each single modality, such as hidden Markov model (HMM) for audio and support vector machine (SVM) for image. This provides much more flexibility than the early fusion.
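The difference between the two fusion levels can be sketched as follows. Early fusion is shown as a plain concatenation of per-modality feature vectors, and late fusion as a weighted average of per-modality classifier scores. The weighted average is only one possible combination rule, chosen here for illustration; the function names, feature values, and weights are hypothetical.

```python
def early_fusion(audio_features, visual_features):
    """Feature-level fusion: concatenate the modality features into a
    single vector that is then fed to one classifier."""
    return list(audio_features) + list(visual_features)

def late_fusion(scores, weights=None):
    """Decision-level fusion: weighted average of the scores produced by
    independent per-modality classifiers (uniform weights by default)."""
    if weights is None:
        weights = {modality: 1.0 for modality in scores}
    total = sum(weights[m] for m in scores)
    return sum(weights[m] * scores[m] for m in scores) / total

# Illustrative values only.
fused_features = early_fusion([0.2, 0.7], [0.1, 0.4, 0.9])
fused_score = late_fusion({"audio": 0.30, "visual": 0.80},
                          {"audio": 1.0, "visual": 3.0})
```

Adding a further modality to the late-fusion function only requires one more entry in the score dictionary, whereas early fusion changes the dimensionality of the input and requires retraining the single classifier.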
The method of the present invention relies on multimodal fusion of visual, auditory, and textual features for a fine-grained classification of sensitive media content in video snippets (short temporal series of video frames).
Patent document US 2013/0283388 A1, titled “Method and system for information content validation in electronic devices”, published on Oct. 24, 2013 by SAMSUNG ELECTRONICS CO., LTD, proposes a method and system for content filtering in mobile communication devices. A possible similarity with the method of the present invention is that the method of document US 2013/0283388 analyzes the information content, including image, video, and audio information, in real time. However, this approach does not go into details. For example, for analyzing image/video content, the authors only mention that an image analysis engine comprises an “Image and Video Filtering module from IMAGEVISION located at Anna, Tex., 75409, U.S.A”. Nothing is specified in document US 2013/0283388 for analyzing audio information. Moreover, in contrast with the present invention, document US 2013/0283388 does not fuse information for classification and does not classify sensitive content within a video timeline.
Patent document US 2012/0246732 A1, titled “Apparatus, Systems and Methods for Control of Inappropriate Media Content Events”, published on Sep. 27, 2012, by BURTON DAVID ROBERT, proposes systems and methods to prevent presentation of inappropriate media content. The media content analysis logic of document US 2012/0246732 A1 may comprise audio recognition logic, object recognition logic, text recognition logic, and/or character recognition logic. However, it is not clear how sensitive content is analyzed. In contrast to the method proposed in document US 2012/0246732, the method of the present invention exploits and evaluates a plurality of different characteristics (i.e., it fuses information) and also classifies sensitive content within a video timeline.
Patent document US 2009/0274364 A1, titled “Apparatus and Methods for Detecting Adult Videos”, proposes apparatus and methods for analyzing video content to determine whether a video is adult or non-adult. Using a key frame detection system, the method of document US 2009/0274364 generates one or more models for adult video detection. According to the inventors, any suitable key frame features may be extracted from each key frame; 17 image/video analysis techniques are described (including spatial and/or color distribution features and texture features). One drawback relative to the present invention is that such techniques are typically not robust to changes in video resolution, frame rate, cropping, orientation, or lighting. In contrast, the present invention proposes an end-to-end BoVW-based framework, which preserves more information while remaining robust to such changes in video. Moreover, in contrast with the present invention, document US 2009/0274364 does not use audio and/or textual content, does not fuse information, and does not classify sensitive content within a video timeline.
Patent document U.S. Pat. No. 8,285,118 B2, titled “Methods and Systems for Media Content Control”, published on Jan. 14, 2010, by NOVAFORA, INC, proposes methods and systems to control the display of media content on a media player. The video media content is analyzed by extracting only visual information, namely local feature descriptors such as SIFT, spatio-temporal SIFT, or SURF descriptors. A possible similarity with the method of the present invention is that the method of U.S. Pat. No. 8,285,118 computes the video signature using a BoVW-based mid-level representation. However, while U.S. Pat. No. 8,285,118 proposes to match the BoVW signatures against a database of signatures (which is time consuming and does not generalize), the present invention proposes to classify video signatures according to a mathematical model learned from the training/learning dataset (which is fast and generalizes). Furthermore, in contrast with the present invention, document U.S. Pat. No. 8,285,118 does not use audio and/or textual content and does not fuse information.
Patent document US 2014/0372876 A1, titled “Selective Content Presentation Engine”, published on Dec. 18, 2014, by AMAZON TECHNOLOGIES, INC, proposes a method for suppressing content portions (e.g., audio portions that include profane language, video portions that include lewd or violent behavior, etc.) at an electronic device. In document US 2014/0372876 A1, the selective content presentation engine may determine whether the content portion is to be presented by the electronic device based on the user preferences, using visual or audio recognition. However, document US 2014/0372876 A1 only mentions the different types of classifiers that recognize images or audio segments; it does not state how the visual or audio content may be analyzed. In contrast with the present invention, document US 2014/0372876 does not fuse information and does not use a mid-level aggregate representation.
Patent document US 2014/0207450 A1, titled “Real-time Customizable Media Content Filter”, published on Jul. 24, 2014, by INTERNATIONAL BUSINESS MACHINES CORPORATION, proposes a method for content filtering (e.g., violence, profanity) in real time, with customizable preferences. Textual information, extracted from subtitles, closed captions, and the audio stream, is analyzed by matching the textual content against one or more blacklist table entries. In contrast, the method proposed in the present invention analyzes textual content using a robust BoW-based framework. Additionally, the present invention uses visual information, fuses information, and classifies sensitive content within a video timeline.
Patent document US 2003/0126267 A1, titled “Method and Apparatus for Preventing Access to Inappropriate Content Over a Network Based on Audio or Visual Content”, published on Jul. 3, 2003, by KONINKLIJKE PHILIPS ELECTRONICS N.V, proposes a method and apparatus for restricting access to electronic media objects having objectionable content (such as nudity, sexually explicit material, violent content, or bad language), based on an analysis of the audio or visual information. For example, image processing, speech recognition, or face recognition techniques may be employed to identify inappropriate content. In contrast with the present invention, document US 2003/0126267 does not fuse information, does not use a mid-level aggregate representation, and does not classify sensitive content within a video timeline.
Finally, patent document CN 104268284 A, titled “Web Browse Filtering Soft dog Device Special for Juveniles”, published on Jan. 4, 2015, by HEFEI XINGFU INFORMATION TECHNOLOGY CO., LTD, provides a web browser filtering softdog (USB dongle) device which comprises a pornographic content analysis module. As defined in document CN 104268284, the pornographic content analysis unit includes a text analysis module, an image analysis module, and a video analysis module. However, it is not clear how pornographic content is analyzed. Furthermore, in contrast with the present invention, document CN 104268284 does not use audio content, does not fuse information, does not use a mid-level aggregate representation, and does not classify sensitive content within a video timeline.
The following summarizes the scientific papers on the two most important types of sensitive content considered in this invention: pornography and violence.
Pornography Classification
The first efforts to detect pornography conservatively associated pornography with nudity, whereby the solutions tried to identify nude or scantily-clad people (Paper “Automatic detection of human nudes”, D. Forsyth and M. Fleck, International Journal of Computer Vision (IJCV), vol. 32, no. 1, pp. 63-77, 1999; Paper “Statistical color models with application to skin detection”, M. Jones and J. Rehg, International Journal of Computer Vision (IJCV), vol. 46, no. 1, pp. 81-96, 2002; Paper, “Naked image detection based on adaptive and extensible skin color model”, J.-S. Lee, Y.-M. Kuo, P.-C. Chung, and E.-L. Chen, Pattern Recognition (PR), vol. 40, no. 8, pp. 2261-2270, 2007.). In such works, the detection of human skin played a major role, followed by the identification of body parts.
The presence of nudity is not a good conceptual model of pornography. There are non-pornographic situations with plenty of body exposure. Conversely, there are pornographic scenes that involve very little exposed skin. Nevertheless, nudity detection is related to pornography detection, with a vast literature of its own.
The clear drawback of using skin detectors to identify pornography is the high false-positive rate, especially in situations of non-pornographic body exposure (e.g., swimming, sunbathing, boxing). Therefore, Deselaers et al. (Paper “Bag-of-visual-words models for adult image classification and filtering”, T. Deselaers, L. Pimenidis, and H. Ney, in International Conference on Pattern Recognition (ICPR), pp. 1-4, 2008) proposed, for the first time, to pose pornography detection as a Computer Vision classification problem (similar to object classification), rather than a skin-detection or segmentation problem. They extracted patches around difference-of-Gaussian interest points, and created a visual codebook using a Gaussian mixture model (GMM), to classify images into different pornographic categories. Their Bag-of-Visual-Words (BoVW) model greatly improved the effectiveness of the pornography classification.
More recently, Lopes et al. developed a Bag-of-Visual-Words (BoVW) approach, which employed the HueSIFT color descriptor, to classify images (Paper “A bag-of-features approach based on hue-SIFT descriptor for nude detection”, A. Lopes, S. Avila, A. Peixoto, R. Oliveira, and A. Araujo, in European Signal Processing Conference (EUSIPCO), pp. 1152-1156, 2009) and videos (Paper “Nude detection in video using bag-of-visual-features”, A. Lopes, S. Avila, A. Peixoto, R. Oliveira, M. Coelho, and A. Araujo, in Conference on Graphics, Patterns and Images (SIBGRAPI), pp. 224-231, 2009) of nudity. For video classification, they proposed a majority-voting scheme over the video frames. Similarly, Dewantono and Supriana (Paper “Development of a real-time nudity censorship system on images”, S. Dewantono and I. Supriana, in International Conference on Information and Communication Technology (IcoICT), pp. 30-35, 2014) proposed a BoVW-based image/video nudity detection approach using a skin filtering method and SVM classifiers.
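A majority-voting scheme over video frames, as mentioned above, can be sketched in a few lines. This is a generic illustration and not the implementation of any cited work; the label strings and the function name are hypothetical, and the per-frame labels are assumed to come from a frame-level classifier.

```python
from collections import Counter

def majority_vote(frame_labels):
    """Return the label assigned to the largest number of frames.
    On a tie, the label that was seen first wins (Counter ordering)."""
    if not frame_labels:
        raise ValueError("at least one frame label is required")
    label, _count = Counter(frame_labels).most_common(1)[0]
    return label

# Per-frame decisions for one video (illustrative values only).
frame_labels = ["nude", "non-nude", "nude", "nude", "non-nude"]
video_label = majority_vote(frame_labels)
```

Because the vote yields a single label for the whole video, it does not indicate where in the timeline the flagged frames occur.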
Moving from nudity detection towards pornography classification raises the challenge of defining the notion of pornography. Many scientific papers have adopted the definition of pornography proposed by Short et al. (Paper “A review of internet pornography use research: methodology and content from the past 10 years”, M. Short, L. Black, A. Smith, C. Wetterneck, and D. Wells, Cyberpsychology, Behavior, and Social Networking, vol. 15, no. 1, pp. 13-23, 2012): “any explicit sexual matter with the purpose of eliciting arousal”, which, while still subjective, establishes a set of criteria for deciding the nature of the material (sexual content, explicitness, goal to elicit arousal, purposefulness).
Avila et al. proposed an extension to the BoVW formalism, BossaNova (Paper “Pooling in image representation: the visual codeword point of view”, S. Avila, N. Thome, M. Cord, E. Valle, and A. Araujo, Computer Vision and Image Understanding (CVIU), vol. 117, pp. 453-465, 2013), with HueSIFT descriptors to classify pornographic videos using majority voting. Recently, Caetano et al. (Paper “Representing local binary descriptors with BossaNova for visual recognition”, C. Caetano, S. Avila, S. Guimarães, and A. Araujo, in Symposium on Applied Computing (SAC), pp. 49-54, 2014) achieved similar results by using BossaNova, binary descriptors, and majority voting. In (Paper “Pornography detection using BossaNova video descriptor”, in European Signal Processing Conference (EUSIPCO), pp. 1681-1685, 2014), Caetano et al. improved their previous results by establishing a single bag for the entire target video, instead of a bag for each extracted video frame. A possible similarity to the method of the present invention is that the method of Caetano et al. calculates the signature using a mid-level video representation based on BoVW. However, while Caetano et al. propose the BossaNova mid-level video representation, which extends the BoVW method with a more information-preserving pooling operation based on the distribution of distances to the codewords, the present invention proposes applying the Fisher Vector representation, one of the best mid-level aggregate representations currently described in the literature, which extends the BoVW method to encode the first- and second-order average differences between the local descriptors and the codebook elements. It is important to mention that, to the best of our knowledge, no commercial solution for sensitive media content classification uses this mid-level representation.
Additionally, the article “Pornography Detection Using BossaNova Video Descriptor” lacks most of the advantages offered by the present invention: while the method of Caetano et al. detects pornographic content only, the present invention proposes a unified framework that is easily extended to handle any kind of sensitive content; while the method of Caetano et al. focuses only on the visual signal, the present invention provides a high-level multimodal fusion method exploiting auditory, visual, and/or textual features; while the method of Caetano et al. classifies pornographic content in the video as a whole, the present invention proposes a fine-grained method for classifying sensitive media content in video snippets (short temporal series of video frames). Furthermore, in contrast to the present invention, the method of Caetano et al. cannot be performed in real time or on mobile platforms.
Most prior art works rely on bags of static features. Few works have applied spatio-temporal features or other motion information to the classification of pornography. Valle et al. (Paper “Content-based filtering for video sharing social networks”, E. Valle, S. Avila, F. Souza, M. Coelho, and A. Araujo, in Brazilian Symposium on Information and Computational Systems Security (SBSeg), pp. 625-638, 2012) proposed the use of spatio-temporal local descriptors (such as the STIP descriptor) in a BoVW-based approach for pornography classification. In the same direction, Souza et al. (Paper “An evaluation on color invariant based local spatiotemporal features for action recognition”, F. Souza, E. Valle, G. Camara-Chavez, and A. Araujo, in Conference on Graphics, Patterns and Images (SIBGRAPI), pp. 31-36, 2011) improved Valle et al.'s results by applying ColorSTIP and HueSTIP, color-aware versions of the STIP detector and descriptor, respectively. Both works established a single bag for the entire target video, instead of keeping a bag for each video frame, prior to voting schemes.
Very recently, M. Moustafa (Paper “Applying deep learning to classify pornographic images and videos”, M. Moustafa, in Pacific-Rim Symposium on Image and Video Technology (PSIVT), 2015) proposed a deep learning system that analyzes video frames to classify pornographic content. This work focused on the visual cue only and applied a majority-voting scheme.
In addition, other approaches have employed audio analysis as an additional feature for the identification of pornographic videos. Rea et al. (Paper “Multimodal periodicity analysis for illicit content detection in videos”, N. Rea, G. Lacey, C. Lambe, and R. Dahyot, in European Conference on Visual Media Production (CVMP), pp. 106-114, 2006) combined skin color estimation with the detection of periodic patterns in a video's audio signal. Liu et al. (Paper “Fusing audio-words with visual features for pornographic video detection”, Y. Liu, X. Wang, Y. Zhang, and S. Tang, in IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 1488-1493, 2011) demonstrated improvements by fusing visual features (color moments and edge histograms) with “audio words”. In a similar fashion, Ulges et al. (Paper “Pornography detection in video benefits (a lot) from a multi-modal approach”, A. Ulges, C. Schulze, D. Borth, and A. Stahl, in ACM International Workshop on Audio and Multimedia Methods for Large-scale Video Analysis, pp. 21-26, 2012) proposed a late-fusion approach that combines motion histograms with audio words.
In addition to those scientific results, there are commercial software packages that block web sites with pornographic content (e.g., K9 Web Protection, CyberPatrol, NetNanny). Additionally, there are products that scan a computer for pornographic content (e.g., MediaDetective, Snitch Plus, NuDetective). MediaDetective and Snitch Plus are off-the-shelf products that rely on the detection of human skin to find pictures or movies containing nude people. The work of Polastro and Eleuterio (a.k.a., NuDetective, Paper “NuDetective: A forensic tool to help combat child pornography through automatic nudity detection”, M. Polastro and P. Eleuterio, in Workshop on Database and Expert Systems Applications (DEXA), pp. 349-353, 2010) also adopts skin detection, and it is intended for the Federal Police of Brazil, in forensic activities.
Violence Classification
Over the last few years, progress in violence detection has been quantifiable thanks to the MediaEval Violent Scenes Detection (VSD) task, which provides a common ground truth and standard evaluation protocols. MediaEval is a benchmarking initiative dedicated to evaluating new algorithms for multimedia access and retrieval. Organized annually from 2011 to present, the MediaEval VSD task poses the challenge of automatically detecting violent scenes in Hollywood movies and web videos. The targeted violent scenes are those “one would not let an 8 years old child see in a video because they contain physical violence”.
The violence detection pipeline is typically composed of three steps: (i) low-level feature extraction from audio, visual or textual modalities, (ii) mid-level feature extraction using bag-of-visual-words (BoVW) representation or extensions, and (iii) supervised classification by employing support vector machines (SVM), neural networks, or hidden Markov models (HMM).
In the 2014 edition of the VSD task, for instance, all proposed techniques employed this three-step pipeline, except for one team, which used the provided violence-related concept annotations as mid-level features. In the low-level step, most of the approaches explored both auditory (e.g., MFCC features) and visual information (e.g., dense trajectories). Avila et al. (Paper “RECOD at MediaEval 2014: Violent scenes detection task”, S. Avila, D. Moreira, M. Perez, D. Moraes, I. Cota, V. Testoni, E. Valle, S. Goldenstein, and A. Rocha, in Working Notes Proceedings of the MediaEval 2014 Workshop, 2014) additionally incorporated textual features extracted from the Hollywood movie subtitles. In the mid-level step, the low-level features were frequently encoded using a Fisher Vector representation. Finally, in the last step, SVM classifiers were the most used for classification.
Before the MediaEval campaign, several methods were proposed to detect violent scenes in video. However, due to the lack of a common definition of violence, allied with the absence of standard datasets, the methods were developed for a very specific type of violence (e.g., gunshot injury, war violence, car chases) and, consequently, the results were not directly comparable. In the following, we overview some of those works for the sake of completeness.
One of the first proposals for violence detection in video was introduced by Nam et al. (Paper “Audio-visual content-based violent scene characterization”, J. Nam, M. Alghoniemy, and A. Tewfik, in International Conference on Image Processing (ICIP), pp. 353-357, 1998). They combined multiple audio-visual features to identify violent scenes in movies, in which flames and blood are detected using predefined color tables, and sound effects (e.g., beatings, gunshots, explosions) are detected by computing the energy entropy. This approach of combining low-level features with specialized detectors for high-level events (such as flames, explosions, and blood) is also applied in Paper “A multimodal approach to violence detection in video sharing sites”, T. Giannakopoulos, A. Pikrakis, and S. Theodoridis, in International Conference on Pattern Recognition (ICPR), pp. 3244-3247, 2010; and Paper “Violence detection in movies with auditory and visual cues”, J. Lin, Y. Sun, and W. Wang, in International Conference on Computational Intelligence and Security (ICCIS), pp. 561-565, 2010.
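For illustration, one common way to define the energy entropy of an audio frame is to split the frame into sub-frames, normalize the sub-frame energies so that they sum to one, and take the Shannon entropy of the result. The sketch below follows that definition as an assumption; it is not necessarily the exact formulation used by Nam et al., and the function name and the number of sub-frames are hypothetical.

```python
import math

def energy_entropy(frame, num_subframes=4):
    """Shannon entropy (in bits) of the normalized sub-frame energies
    of one audio frame given as a sequence of samples."""
    size = len(frame) // num_subframes
    energies = [sum(s * s for s in frame[i * size:(i + 1) * size])
                for i in range(num_subframes)]
    total = sum(energies)
    if total == 0.0:
        return 0.0  # silent (or too short) frame: no energy to distribute
    probs = [e / total for e in energies]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

# A steady signal spreads its energy evenly; an abrupt burst concentrates it.
steady = energy_entropy([1.0] * 8)
burst = energy_entropy([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0])
```

Under this definition, abrupt sounds such as gunshots or explosions concentrate the energy in few sub-frames and therefore tend to yield low entropy values.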
Although most current approaches to violence detection are multimodal, previous works (before MediaEval) mainly focused on single modalities. For instance, using motion trajectory information and orientation information of a person's limbs, Datta et al. (Paper “Person-on-person violence detection in video data”, A. Datta, M. Shah, and N. Lobo, in International Conference on Pattern Recognition (ICPR), pp. 433-438, 2002) addressed the problem of detecting human violence such as fist fighting and kicking. Nievas et al. (Paper “Violence detection in video using computer vision techniques”, E. B. Nievas, O. D. Suarez, G. B. García, and R. Sukthankar, in International Conference on Computer Analysis of Images and Patterns (CAIP), pp. 332-339, 2011) employed a BoVW framework with MoSIFT features to classify ice hockey clips. By exploiting audio features, Cheng et al. (Paper “Semantic context detection based on hierarchical audio models”, W.-H. Cheng, W.-T. Chu, and J.-L. Wu, in International Workshop on Multimedia Information Retrieval (MIR), 190-115, 2003) recognized gunshots, explosions, and car braking using a hierarchical approach based on GMM and HMM. Pikrakis et al. (Paper “Gunshot detection in audio streams from movies by means of dynamic programming and bayesian networks”, A. Pikrakis, T. Giannakopoulos, and S. Theodoridis, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 21-24, 2008) proposed a gunshot detection method based on statistics of audio features and Bayesian networks.