This invention generally relates to the analysis of content items provided by online systems to users, and in particular to semantic analysis and classification of content items provided by an online system based on machine learning, for example, using neural networks.
In many online systems, such as social networking systems, users are able to connect to and communicate with other users of the online system. For example, an online system may allow users to share content with other users of the online system by providing content items to the online system for presentation to the other users. In addition, content publishers may be able to submit content items to the online system for presentation to users of the online system. The content items may comprise text data, as well as image data, audio data, video data, and/or any other type of content that may be communicated to a user of the online system.
To ensure a high-quality user experience, an online system may remove certain types of content items, or prevent them from being displayed to users, based on text data associated with each content item. The types of content items that can be displayed to users of the online system may be restricted by one or more policies. For example, a particular online system may have a policy that disallows display of content items having text associated with certain categories of content (e.g., adult content, illegal content, and/or the like).
The online system may maintain a review process to identify content items having text that violates one or more policies and that are thus unsuitable for display to users. For example, human reviewers may manually review received content items in order to determine their suitability for display. An online system may receive a large number of content items to be reviewed, for example, hundreds of thousands of content items in a few days or a week. Manual review by human reviewers is a slow and expensive process. Existing automatic techniques, for example, searching for offensive keywords, are often unable to identify complex policy violations. Therefore, conventional techniques for identifying content items that violate policies of the online system are ineffective, expensive, and time-consuming.