Tagging data with geographic information is becoming increasingly common. Geo-tagging involves adding information about geographical positions to any kind of data to indicate the relationship between the data and a physical place. Geo-tagging is useful for many purposes. Examples include searching for data in a geographical context, such as finding pictures taken close to a particular location, and visualizing data on a map, such as displaying newspaper articles about a part of a city on a map of that city.
Geo-tagging can be performed manually, where a person tags data with the geographical areas that relate to the data. This can be done when the data is created or afterwards by going through existing data and tagging it. Manual geo-tagging gives high accuracy, meaning that the data is typically associated with a relevant geographical position.
Geo-tagging can also be performed automatically, where a machine controlled by an algorithm analyzes the data and tags it with geographical positions that relate to the data.
Several methods can be used when analyzing data to find any geographical positions related to the data. Some of these methods include: (1) content based classification, (2) domain based classification and (3) location determined by an IP (internet protocol) address.
In content based classification, text is analyzed and matched to a list of geographical positions such as city names, country names, street names, etc. If a city name is included in a text, for example, the text is determined to be related to that city.
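The matching step described above can be sketched as follows. This is a minimal illustration only, assuming a small hypothetical list of place names; the names `GAZETTEER` and `tag_by_content` and the approximate coordinates are assumptions made for the example and are not taken from any existing solution.

```python
import re

# Hypothetical list of place names mapped to approximate
# (latitude, longitude) pairs; illustrative values only.
GAZETTEER = {
    "stockholm": (59.33, 18.07),
    "boston": (42.36, -71.06),
    "new york": (40.71, -74.01),
}


def tag_by_content(text):
    """Return the listed places whose names occur as whole words in text."""
    lowered = text.lower()
    matches = {}
    for name, position in GAZETTEER.items():
        # \b keeps "boston" from matching inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            matches[name] = position
    return matches
```

A text such as "A concert in Stockholm tonight" would be tagged with the position listed for Stockholm, while a text that uses only a name missing from the list would receive no tag.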
Under domain based classification, content hosted at a particular website is assumed to relate to the country whose country code appears in the address of the website. For example, websites in Sweden typically include the country code .se in the web address.
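As a rough sketch of this approach, the country code can be read from the last label of the host name and looked up in a table. The table `CC_TLDS` below is a hypothetical excerpt and the function name `country_from_url` is an assumption made for the example; the URL parsing uses Python's standard `urllib.parse` module.

```python
from urllib.parse import urlsplit

# Hypothetical excerpt of a country-code table; a real table
# would cover all assigned country codes.
CC_TLDS = {"se": "Sweden", "dk": "Denmark", "no": "Norway"}


def country_from_url(url):
    """Guess a country from the last label of the URL's host name."""
    host = urlsplit(url).hostname  # None if the URL has no host part
    if not host:
        return None
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    # Generic domains such as .com are not in the table and yield None.
    return CC_TLDS.get(tld)
```

Under these assumptions, an address such as https://www.example.se/news would be associated with Sweden, whereas an address under a generic domain such as .com would not be associated with any country.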
In location determined by IP address, use is made of the fact that IP addresses are allocated in ranges that correspond to geographical areas. Therefore, by looking at the IP address of the host that serves the data, it is possible to make assumptions about the geographical area to which the data corresponds.
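The range lookup can be sketched with Python's standard `ipaddress` module. The address blocks below are reserved documentation ranges and the region labels are placeholders, so both are assumptions made for the example rather than real allocations; the names `IP_BLOCKS` and `region_from_ip` are likewise hypothetical.

```python
import ipaddress

# Hypothetical mapping of address blocks to regions. The blocks are
# reserved documentation ranges, not real geographic allocations.
IP_BLOCKS = [
    (ipaddress.ip_network("192.0.2.0/24"), "Region A"),
    (ipaddress.ip_network("198.51.100.0/24"), "Region B"),
]


def region_from_ip(address):
    """Return the region whose block contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for network, region in IP_BLOCKS:
        if ip in network:
            return region
    return None
```

A host whose address falls in one of the listed blocks is assigned the region of that block; an address outside every block yields no region.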
Each of these methods has certain drawbacks. Content based classification, for example, relies on a machine being able to automatically determine whether a word is an indicator of a geographical position. In order to make this determination, existing solutions rely on comparing words to a list of place names. A problem with this approach is that such lists most often include only the official name of a place. Alternative names such as slang or popular references are not part of such lists (e.g. “Sergels Torg” in Stockholm is sometimes referred to as “Plattan” and “New York” is sometimes referred to as the “Big Apple”). Consequently, existing solutions have problems identifying these types of geographic words.
Many texts contain information relating to several different places. For example, a section of text could include the words “Boston”, “Stockholm”, “Copenhagen”, “Södermalm”, and “Skanstull”. Each of these words refers to a place, and judging from the separate words alone it is difficult for an algorithm that matches words to positions to determine which place the text as a whole relates to.
Some words are geographically ambiguous, that is, they may refer to several different places. For example, “Vasastan” may refer to a place in Göteborg, Sweden or to a place in Stockholm, Sweden. Therefore, it is difficult for an algorithm that matches words to positions to determine which of these places the text relates to.