The present disclosure relates to data processing, and more particularly, to methods, systems and computer program products for data-dependent clustering of geospatial words.
Geotagging is the process of adding geographical identification metadata to various media, such as photographs, videos, websites, SMS messages, QR codes, RSS feeds, or social media posts. In other words, geotagging assigns geographical information to existing objects. Due to limited reliable geographical information (e.g., GPS-labelled data), many geotagging systems in social media (e.g., Twitter) rely on text messages to infer geographical locations. For instance, a Twitter post stating, “yinz need to meet these folks—http://luv-water.co/—they are also a CMU startup and super nice” suggests that the message refers to Pittsburgh, Pa., because “yinz” and “CMU” are primarily used in Pittsburgh. Modelling the geospatial pattern of such words may help to disambiguate different locations. One challenging issue of such an approach is that social media contains millions of unique token types beyond the words found in a typical English dictionary (e.g., hashtags (#GreatBarrierReef), word combinations (lolmythesis), and user handles (@melb)), which leads to computational issues.