The present disclosure relates generally to the field of geotagging unstructured text.
With the continued advance of social network services, such as TWITTER, FACEBOOK and FOURSQUARE, a tremendous amount of unstructured textual data has been generated. One of the most popular forms of such unstructured text is a short text message from TWITTER, called a “tweet”. Each tweet has up to 140 characters. TWITTER users post tweets about almost everything, from daily routines, breaking news and score updates of various sporting events to political opinions and flashmobs (see A. Kavanaugh, S. Yang, S. D. Sheetz, and E. A. Fox. Microblogging in crisis situations: Mass protests in Iran, Tunisia, Egypt, in CHI 2011; and K. Starbird and L. Palen. (How) will the revolution be retweeted?: information diffusion and the 2011 Egyptian uprising, in CSCW 2012). Hundreds of millions of such tweets are generated daily.
Furthermore, more and more business organizations recognize the importance of TWITTER and provide their customer services through TWITTER, such as receiving feedback about products and responding to customers' questions using tweets (see Twitter: A New Age for Customer Service—Forbes. http://onforb.es/VqqTxa).
Tweets can be much more valuable when tagged with their location information because such geo-tagged tweets can open new opportunities for many ubiquitous applications. For example, if a user posts a tweet tagged with her current location, nearby local stores can immediately send her customized coupons based on the context of the tweet or her profile (assuming that she is a subscriber of such location-based advertisement services). Similarly, local news and places of interest can be recommended based on the location, the context of the tweet and the past experiences of her friends on a social network. Geo-tagged tweets can also be used to report or detect unexpected events, such as earthquakes (see T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors, in WWW 2010), robberies or gunshots, and to instantly notify the right people of the event, including those who are close to its location.
On one hand, like most social network services, TWITTER recognizes the value of tagging tweets with location information and provides a geo-tagging feature to all its users. On the other hand, this opt-in geo-tagging feature faces several challenges. First, TWITTER users have been lukewarm about adopting the geo-tagging feature. According to a recent statistical analysis of over 1 billion tweets spanning three months (discussed in more detail below), only 0.58% of tweets include their fine-grained location. With such a tiny fraction of geo-tagged tweets, it would be very hard to realize the many social and business opportunities such as those mentioned above. Second, even among the limited number of tweets tagged with geometric coordinates, a fair number cannot be used effectively because their geometric coordinates cannot serve as reliable indicators of useful semantic locations, such as points of interest and places where events of interest may happen or have happened. This location sparseness problem makes it very challenging to identify the types of tweets whose location information, i.e., the location where the tweet was written, can be inferred. In order to derive new value and insights from the huge number of tweets generated daily by TWITTER users and to better serve those users with location-based services, it is important to have more geo-tagged tweets with semantically meaningful locations.
For the purposes of this disclosure, various conventional techniques are grouped into four categories: 1) location prediction in TWITTER-like social networks, 2) topic and user group prediction in TWITTER-like social networks, 3) analysis of FOURSQUARE check-ins, and 4) location prediction using other online content.
Referring first to conventional location prediction in social networks, these techniques address either the problem of predicting the location of each TWITTER user (see Z. Cheng, J. Caverlee, and K. Lee. You are where you tweet: a content-based approach to geo-locating twitter users, in CIKM 2010; B. Hecht, L. Hong, B. Suh, and E. H. Chi. Tweets from justin bieber's heart: the dynamics of the location field in user profiles, in CHI 2011; and J. Mahmud, J. Nichols, and C. Drews. Where is this tweet from? inferring home locations of twitter users, in ICWSM 2012) or the problem of predicting the location of each tweet (see Y. Ikawa, M. Enoki, and M. Tatsubori. Location inference using microblog messages, in WWW 2012 Companion; and W. Li, P. Serdyukov, A. P. de Vries, C. Eickhoff, and M. Larson. The where in the tweet, in CIKM 2011). Concretely, Z. Cheng, J. Caverlee, and K. Lee, You are where you tweet: a content-based approach to geo-locating twitter users proposes a technique to predict the city-level location of each TWITTER user. It builds a probability model for each city using the tweets of users located in that city. It then estimates the probability of a new user being located in each city using that city's probability model and assigns the city with the highest probability as the city of the new user. To increase the accuracy of the location prediction, it utilizes local words and applies some smoothing techniques. B. Hecht, L. Hong, B. Suh, and E. H. Chi, Tweets from justin bieber's heart: the dynamics of the location field in user profiles uses a Multinomial Naive Bayes model to predict the country and state of each TWITTER user. It also utilizes selected region-specific terms to increase the prediction accuracy. J. Mahmud, J. Nichols, and C. Drews, Where is this tweet from? inferring home locations of twitter users presents an algorithm for predicting the home location of TWITTER users. It builds a set of different classifiers, such as statistical classifiers using the words, hashtags or place names of tweets and heuristic classifiers using the frequency of place names or FOURSQUARE check-ins, and then creates an ensemble of the classifiers to improve the prediction accuracy. These coarse-grained location prediction methods rely heavily on the availability of a large training set. For example, the number of tweets from the users in the same city can be quite large and comprehensive. In contrast, embodiments of the disclosure predict the location of tweets (short unstructured text) at a fine granularity.
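By way of illustration only, the general idea of a per-city probability model with smoothing can be sketched as follows. This is a minimal sketch and not the method of any of the cited works: it assumes a unigram word model per city with additive (Laplace) smoothing, and the function names train_city_models and predict_city, the alpha parameter and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_city_models(labeled_tweets):
    """Count word occurrences per city from (city, text) pairs."""
    counts = defaultdict(Counter)
    for city, text in labeled_tweets:
        counts[city].update(text.lower().split())
    return counts


def predict_city(counts, text, alpha=1.0):
    """Return the city whose smoothed unigram model scores the text highest.

    Assumption: additive smoothing with one extra vocabulary slot reserved
    for words never seen in training.
    """
    vocab = {word for city_counts in counts.values() for word in city_counts}
    vocab_size = len(vocab) + 1  # +1 for unseen words
    best_city, best_score = None, -math.inf
    for city, word_counts in counts.items():
        total = sum(word_counts.values())
        score = 0.0
        for word in text.lower().split():
            # Counter returns 0 for words absent from this city's tweets.
            prob = (word_counts[word] + alpha) / (total + alpha * vocab_size)
            score += math.log(prob)
        if score > best_score:
            best_city, best_score = city, score
    return best_city


# Toy training data (hypothetical).
TRAINING = [
    ("new_york", "stuck on the subway in brooklyn again"),
    ("new_york", "bagel and coffee near times square"),
    ("houston", "rodeo tonight then tacos downtown houston"),
    ("houston", "rockets game and bbq"),
]

if __name__ == "__main__":
    models = train_city_models(TRAINING)
    print(predict_city(models, "late subway to times square"))
```

With only a handful of training tweets per city, the smoothing term dominates the estimates, which is consistent with the observation above that such coarse-grained methods depend on a large training set.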
Y. Ikawa, M. Enoki, and M. Tatsubori, Location inference using microblog messages, in WWW 2012 Companion, and W. Li, P. Serdyukov, A. P. de Vries, C. Eickhoff, and M. Larson, The where in the tweet, in CIKM 2011, focus on predicting the location of each tweet. W. Li, P. Serdyukov, A. P. de Vries, C. Eickhoff, and M. Larson, The where in the tweet builds a model for each POI (Place of Interest), assuming that a set of POIs is given, using a set of tweets and web pages returned by a search engine. For a query tweet, it generates a language model of the tweet and then compares it with the model of each POI using the KL divergence to rank the POIs. Since it uses only 10 POIs and a small test set for its evaluation, it is unclear how effective the approach is in a real-world environment, in which there are many POIs and a huge number of tweets, and in which many tweets contain noisy text irrelevant to any POI. Y. Ikawa, M. Enoki, and M. Tatsubori, Location inference using microblog messages extracts a set of keywords for each location using tweets from location-sharing services, such as FOURSQUARE check-in tweets, and other general expression tweets posted during a similar time frame. To predict the location of a new tweet, it generates a keyword list of the tweet and compares it with the extracted keywords of each location using cosine similarity. A clear problem with this work is that it treats all tweets equally in the context of location prediction. Thus, it suffers from a high error rate in its prediction results, especially for location-neutral tweets.
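The two comparison measures mentioned above, KL divergence between language models and cosine similarity between keyword lists, can be sketched as follows. This is an illustrative sketch under stated assumptions, not a reproduction of either cited work: it assumes whitespace tokenization, smoothed unigram models over a shared vocabulary, and raw term counts as keyword weights. All function names, the alpha parameter and the sample texts are hypothetical.

```python
import math
from collections import Counter


def unigram_model(text, vocab, alpha=0.5):
    """Smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def rank_pois_by_kl(tweet, poi_texts):
    """Order POI names from most to least similar (smallest KL first)."""
    vocab = set(tweet.lower().split())
    for text in poi_texts.values():
        vocab.update(text.lower().split())
    tweet_model = unigram_model(tweet, vocab)
    scores = {
        poi: kl_divergence(tweet_model, unigram_model(text, vocab))
        for poi, text in poi_texts.items()
    }
    return sorted(scores, key=scores.get)


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical text associated with two POIs.
POI_TEXTS = {
    "stadium": "game tickets stadium crowd score",
    "museum": "exhibit art gallery museum paintings",
}

if __name__ == "__main__":
    print(rank_pois_by_kl("great game at the stadium tonight", POI_TEXTS))
    print(cosine_similarity("coffee latte morning", "coffee latte"))
```

Note that both measures always return some best match, even for a tweet that relates to no location at all, which illustrates why treating all tweets equally leads to errors on location-neutral tweets.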
Reference will now be made to conventional topic and user group prediction in social networks. In addition to location prediction from TWITTER data, research efforts have sought to infer other types of information from TWITTER data. J. Lin, R. Snow, and W. Morgan, Smoothing techniques for adaptive online language models: topic tracking in tweet streams, in KDD'11, proposes a framework to predict the topics of each tweet. It builds a language model for each topic using the hashtags of tweets and evaluates various smoothing techniques. M. Pennacchiotti and A.-M. Popescu, Democrats, republicans and starbucks afficionados: user classification in twitter, in KDD'11, proposes a social network user classification approach, which consists of a machine learning algorithm and a graph-based label updating function. L. Barbosa and J. Feng, Robust sentiment detection on twitter from biased and noisy data, in COLING 2010, proposes an approach to predict the sentiments of tweets. F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida, Detecting spammers on twitter, in CEAS 2010, presents a technique to classify TWITTER users as either spammers or non-spammers. Most of the techniques in this category build their language-based classification model using supervised learning and utilize some external knowledge to initialize the classification rules, such as spam or non-spam. In contrast to this line of work, various embodiments focus on location detection of tweets rather than TWITTER user classification.
Reference will now be made to conventional analysis of FOURSQUARE check-ins. Z. Cheng, J. Caverlee, K. Lee, and D. Sui, Exploring millions of footprints in location sharing services, in ICWSM 2011; and A. Noulas, S. Scellato, C. Mascolo, and M. Pontil, An empirical study of geographic user activity patterns in foursquare, in ICWSM 2011, analyze various aspects of FOURSQUARE check-in history. Z. Cheng, J. Caverlee, K. Lee, and D. Sui, Exploring millions of footprints in location sharing services shows the spatial and temporal (daily and weekly) distribution of FOURSQUARE check-ins. It also analyzes the spatial coverage of each user and its relationship with city population, average household income, etc. A. Noulas, S. Scellato, C. Mascolo, and M. Pontil, An empirical study of geographic user activity patterns in foursquare also shows spatiotemporal patterns of FOURSQUARE check-ins and calculates the transition probabilities among location categories.
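For illustration, transition probabilities among location categories can be estimated from ordered check-in sequences by counting consecutive pairs and normalizing per source category. The following is a minimal sketch under that assumption and is not taken from the cited study. The function name transition_probabilities and the sample sequences are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(checkin_sequences):
    """Estimate P(next category | current category) from ordered check-ins.

    Each sequence is one user's check-in categories in chronological order.
    """
    counts = defaultdict(Counter)
    for sequence in checkin_sequences:
        # Pair each check-in with the one that immediately follows it.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        current: {
            following: n / sum(followers.values())
            for following, n in followers.items()
        }
        for current, followers in counts.items()
    }


# Hypothetical category sequences for three users.
SEQUENCES = [
    ["home", "cafe", "office"],
    ["home", "cafe", "gym"],
    ["home", "office"],
]

if __name__ == "__main__":
    for current, row in transition_probabilities(SEQUENCES).items():
        print(current, row)
```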
Reference will now be made to conventional location prediction using other online content. Many studies have been conducted to infer the geographical origin of online content such as photos (see P. Serdyukov, V. Murdock, and R. van Zwol. Placing flickr photos on a map, in SIGIR 2009), web pages (see E. Amitay, N. Har'El, R. Sivan, and A. Soffer. Web-a-where: geotagging web content, in SIGIR 2004) and web search query logs (see R. Jones, R. Kumar, B. Pang, and A. Tomkins. “i know what you did last summer”: query logs and user privacy, in CIKM 2007). P. Serdyukov, V. Murdock, and R. van Zwol, Placing flickr photos on a map builds a language model for each location (a grid cell) using the terms people use to describe images. E. Amitay, N. Har'El, R. Sivan, and A. Soffer, Web-a-where: geotagging web content identifies geographical terms in web pages using a gazetteer to infer a geographical focus for the entire page. R. Jones, R. Kumar, B. Pang, and A. Tomkins, “i know what you did last summer”: query logs and user privacy utilizes geo-parsing software, which returns a list of locations for web search query logs, to infer the location of users (at the zip code level).
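As a simplified illustration of gazetteer-based geotagging, the sketch below looks up each token of a page in a small place-name dictionary and takes the most frequently mentioned place as the focus of the page. This is a sketch under stated assumptions only: the cited work is not reproduced here, place-name disambiguation is omitted, and the GAZETTEER contents and the function name geographic_focus are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy gazetteer mapping lowercase place names to canonical places.
GAZETTEER = {
    "paris": "Paris, France",
    "lyon": "Lyon, France",
    "austin": "Austin, Texas, USA",
}


def geographic_focus(page_text, gazetteer=GAZETTEER):
    """Return the most frequently mentioned gazetteer place, or None.

    Assumption: single-word place names and no disambiguation of
    ambiguous names (for example, a person named Paris).
    """
    tokens = re.findall(r"[a-z]+", page_text.lower())
    mentions = Counter(gazetteer[t] for t in tokens if t in gazetteer)
    if not mentions:
        return None
    return mentions.most_common(1)[0][0]


if __name__ == "__main__":
    page = "Visiting Paris in spring: Paris museums, then a day trip to Lyon."
    print(geographic_focus(page))
```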