Documents may be classified as being members of one or more groups or classes using a number of probabilistic techniques based on the textual content and semantics of each document. These types of classifications are often made based on the presence of specific words that are observed in documents belonging to the class. For example, if an email message contains the words “Nigeria” and “million,” these facts may contribute to the probability that the message is junk mail.
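By way of illustration only, word-presence classification of this kind may be sketched as a small naive Bayes classifier. The sketch below is not drawn from any particular system: the training messages, the labels “junk” and “ok,” and the function names `train`, `log_scores`, and `classify` are all hypothetical, and add-one smoothing is assumed so that unseen words do not produce a zero probability.

```python
import math
from collections import Counter

# Hypothetical toy training data, invented purely for illustration.
TRAINING = [
    ("transfer ten million dollars from nigeria", "junk"),
    ("claim your million dollar prize now", "junk"),
    ("meeting agenda for monday", "ok"),
    ("please review the attached report", "ok"),
]


def train(examples):
    """Count word occurrences per class and documents per class."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def log_scores(text, word_counts, doc_counts):
    """Return an unnormalized log-probability for each class."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        # Start from the class prior, then add evidence from each word.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing, an assumption of this sketch.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return scores


def classify(text, word_counts, doc_counts):
    """Pick the class with the highest score."""
    scores = log_scores(text, word_counts, doc_counts)
    return max(scores, key=scores.get)


word_counts, doc_counts = train(TRAINING)
label = classify("million from nigeria", word_counts, doc_counts)
```

In this toy setting, each occurrence of a word previously observed in junk messages raises the score of the “junk” class relative to the other class, which is the sense in which the presence of a word contributes to the probability of a classification.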
Such classifications may not work as well for email messages and other documents belonging to more nuanced classes, such as “marketing.” While some marketing emails may be identifiable via the presence of words such as “coupon,” “promotion,” or “newsletter,” many emails that should be classified as “marketing” contain nothing other than hyperlinked images, thus defying text-based semantic classification. Other marketing emails that do bear text, particularly those generated by online retailers, may include a grid of products in which the individual products change each time the email is sent out. If the name of a given product is observed in only a single email, for example, a traditional classifier would have no prior context with which to associate that email with the “marketing” classification.
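The limitation described above may be illustrated with a minimal sketch, offered under stated assumptions only. The vocabulary `MARKETING_VOCAB`, the helper `evidence_tokens`, and the sample product names are hypothetical. The helper simply reports which words of a message a word-based classifier has previously associated with the class.

```python
# Hypothetical vocabulary a text classifier has learned for "marketing".
MARKETING_VOCAB = {"coupon", "promotion", "newsletter"}


def evidence_tokens(text, vocabulary):
    """Return the words in `text` that appear in the learned vocabulary."""
    return [token for token in text.lower().split() if token in vocabulary]


# A text-bearing marketing email containing familiar words yields evidence.
familiar = evidence_tokens("Weekly newsletter with a coupon inside", MARKETING_VOCAB)

# An email consisting only of hyperlinked images has no extractable text.
image_only = evidence_tokens("", MARKETING_VOCAB)

# A product grid whose (invented) product names were never seen before.
product_grid = evidence_tokens("Zephyr-9 Trailrunner Quillon Lamp", MARKETING_VOCAB)
```

In the latter two cases the list of evidence is empty, so a classifier relying solely on previously observed words would have nothing on which to base a “marketing” classification.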
It is with respect to these and other considerations that the disclosure made herein is presented.