1. Field of Invention
This invention relates to geocoding and, more specifically, to natural language parsers/splitters for normalizing addresses.
2. Description of Related Art
Geocoding is the process of finding associated geographic coordinates (often expressed as latitude and longitude) from other geographic data, such as street addresses or zip codes. With geographic coordinates, the features can be mapped and entered into Geographic Information Systems (GIS), or the coordinates can be embedded into media such as digital photographs via geotagging. Generally, a geocoder is a piece of software or a (web) service that helps in this process.
Yet, there are many different addressing schemes and languages in the world. Hence, there is a need for a system that could understand those addressing schemes and languages, as well as all the different ways a human might write or input an address into a computer. The latter is referred to as “natural language” (or “ordinary language”), which is a language that is spoken, written, or signed by humans for general-purpose communication and often includes informal and/or abbreviated syntax, and relaxed adherence to grammatical rules. For example, when a user inputs an address, that input often does not adhere to standardized address formats processed by machines.
U.S. Pat. No. 7,039,640 to Miller et al., the disclosure of which is incorporated by reference herein in its entirety, discloses a system and method for geocoding diverse address formats. A single geocoding engine is taught that is allegedly capable of handling various address formats in use in different countries and jurisdictions. This engine uses country/jurisdiction specific parsers for isolating generic address components, e.g., street number, street, city, country, and postal code.
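By way of illustration only, the general idea of routing an address to a country/jurisdiction specific parser that isolates generic address components may be sketched as follows. This is a minimal sketch and is not the implementation of the cited patent. The names AddressComponents, PARSERS, and parse_address, as well as the two regular expressions, are hypothetical and assume simplified "number street, city postal code" (US-style) and "street number, postal code city" (German-style) formats.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Pattern


@dataclass
class AddressComponents:
    """Generic address components shared by all country parsers (hypothetical)."""
    street_number: str
    street: str
    city: str
    postal_code: str
    country: str


# Assumed, simplified US-style format: "<number> <street>, <city> <5-digit code>"
_US = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<street>[^,]+),\s*(?P<city>[^,]+?)\s+(?P<postal>\d{5})\s*$"
)
# Assumed, simplified German-style format: "<street> <number>, <5-digit code> <city>"
_DE = re.compile(
    r"^\s*(?P<street>[^,]+?)\s+(?P<number>\d+)\s*,\s*(?P<postal>\d{5})\s+(?P<city>[^,]+?)\s*$"
)


def _parse_with(pattern: Pattern[str], country: str, text: str) -> Optional[AddressComponents]:
    """Apply one country-specific pattern; return None when the input does not fit."""
    m = pattern.match(text)
    if m is None:
        return None
    return AddressComponents(
        street_number=m.group("number"),
        street=m.group("street").strip(),
        city=m.group("city").strip(),
        postal_code=m.group("postal"),
        country=country,
    )


# Registry mapping a country code to its specific parser.
PARSERS: Dict[str, Callable[[str], Optional[AddressComponents]]] = {
    "US": lambda text: _parse_with(_US, "US", text),
    "DE": lambda text: _parse_with(_DE, "DE", text),
}


def parse_address(text: str, country: str) -> Optional[AddressComponents]:
    """Dispatch to the parser registered for the given country code."""
    parser = PARSERS.get(country)
    if parser is None:
        raise KeyError(f"no parser registered for country {country!r}")
    return parser(text)
```

In this sketch, an input such as "Unter den Linden 77, 10117 Berlin" is split into components by the "DE" parser but is rejected by the "US" parser, which shows why each addressing scheme needs its own parser.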
Conventionally, country/jurisdiction specific parsers are generated either by hand, or by manually describing the grammar and using a parser generator to construct a parser from the context-free grammar. The former is extremely tedious and prone to errors. As changes are made to improve hand-crafted parsers, care must be taken not to break addresses that previously parsed correctly. Manually describing the grammar as a context-free grammar has its limitations as well: ambiguous input (which is very common with street addresses) is not easily handled by this technique, and as such the hit rate, i.e., the rate of matches between addresses input by a user and addresses accepted and known to a computer, is much lower.
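The ambiguity noted above may be illustrated with a minimal sketch. It assumes a deliberately simple, hypothetical grammar in which an address is a number followed by one or more street words and then one or more city words, with no comma separating street from city. The function name candidate_parses is an assumption made for illustration and does not correspond to any particular parser generator.

```python
from typing import List, Tuple


def candidate_parses(text: str) -> List[Tuple[str, str, str]]:
    """Enumerate every (number, street, city) split allowed by the toy grammar

        address -> NUMBER street city
        street  -> WORD+
        city    -> WORD+

    The grammar itself gives no basis for choosing among the results.
    """
    tokens = text.split()
    # The toy grammar needs a leading number and at least two further words.
    if len(tokens) < 3 or not tokens[0].isdigit():
        return []
    number, words = tokens[0], tokens[1:]
    # Every boundary position between street words and city words is a valid parse.
    return [
        (number, " ".join(words[:i]), " ".join(words[i:]))
        for i in range(1, len(words))
    ]


if __name__ == "__main__":
    for parse in candidate_parses("12 Park Lane Hill Valley"):
        print(parse)
```

For an input with four words after the number, this grammar admits three equally valid parses, e.g., street "Park Lane" with city "Hill Valley", or street "Park Lane Hill" with city "Valley". Selecting the intended parse requires information beyond the grammar, such as knowledge of actual street and city names.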