This invention relates generally to identifying content in messages, and in particular to training computer models for identifying hashtags in a message.
Hashtags have become a popular way for users to add topics, keywords, or ideas to a message. For example, a user may insert various hashtags in a message: “Watching the #olympics and the #100mswim, #goteam #lovetheolympics.” As shown by this example, users may, and frequently do, add several hashtags to a single message, and in some cases users attach a hashtag to every word in a message: “#this #is #the #bestalbum #ever.” As a result, hashtag data in messages is very noisy, and correctly analyzing hashtags in messages is difficult. In particular, it is challenging to determine which hashtags may be successfully predicted using a classifier or computer model. In addition, hashtags often correspond to terms that would otherwise be components of the message but that are not accounted for in a feature set describing the message. Because such feature sets are often sparse, trained classifiers often fail to provide an effective indication of whether their output accurately describes the probability of a message belonging to a hashtag.