This invention relates generally to identifying content in messages, and in particular to training computer models for identifying keywords for web pages linked in a message.
Users may add links to external webpages in posts to a social networking system. These webpages may relate to a variety of different topics. For example, a user may insert a link to a webpage discussing a recent sports game or the user's favorite team. The topics of the linked webpages may or may not be represented in the social networking system. Without an understanding of the topics or other terms relating to a webpage, the social networking system may not be able to associate the webpage with concepts of the social networking system. While certain webpages may identify “keywords” associated with the webpage, typically used for search engine analysis, many webpages have no keywords. Webpage keywords may be designated by operators of the webpage and are often unreliable. As a result, keyword data for a webpage is very noisy, and correctly analyzing keywords for association with a message in a social networking system is difficult. In particular, it is challenging to determine which keywords may be successfully predicted using a classifier or computer model.