This disclosure relates generally to digital content publishing, and more specifically to ranking domains of sources of digital content for digital magazines using trained domain classifiers.
Digital distribution channels disseminate a wide variety of digital content, including text, images, audio, links, videos, and interactive media (e.g., games, collaborative content), to users. Users often interact with content items provided by various sources or content providers, such as social networking systems, online publishers, and blogs. A content item provided by a source is often based on the content of a resource on the Internet identified by a uniform resource locator (URL). Part of the URL for the content item is a domain name, which is a text-based label that identifies the source of the content item. For example, an article on the Internet located at the URL http://www.example.net/index.html is associated with the domain name “example.net,” which identifies the source of the article.
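As an illustration only, the relationship between a URL and its domain name can be sketched in a few lines of Python. The helper name `domain_of` is hypothetical and not part of this disclosure. Removing a leading "www." is a simplifying assumption; determining the registered domain in general would require a public suffix list.

```python
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lowercased host of `url` with a leading 'www.' removed.

    Hypothetical helper for illustration. Returns an empty string when the
    input has no host component.
    """
    # urlsplit(...).hostname is already lowercased, or None if no host exists.
    host = urlsplit(url).hostname or ""
    # Simplifying assumption: treat "www.example.net" and "example.net" alike.
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(domain_of("http://www.example.net/index.html"))  # example.net
```

Grouping content items by the value returned from such a helper is one plausible way to associate articles with their source domains before any domain-level analysis.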
Content items available for users to view in digital magazines are not immune to online spamming, where unsolicited articles or messages (also known as “spam”) are provided by spammers using various domain names. For example, data collected over 48 hours by a digital magazine service, FLIPBOARD™, show 228,884 new articles from 39,620 domains, where at least 23% of the new articles are spam and about 7-11% of the articles are from known spam domains. Existing techniques to identify spam domains include manual identification and spam identification at the content item level (e.g., during the processing of content items). However, both manual identification and identification at the content item level are slow and costly, which degrades the user experience of consuming the content items provided by digital magazines.