There are applications within processing systems that identify street addresses in human language text. An IBM product, IBM Identity Resolution, is an example of a program that identifies individuals and their associated street addresses. Street addresses are human language text that is not immediately understandable by a computer for further information processing. One reason is that addresses may be written in different languages, e.g., English, Arabic, Turkish, etc.
Furthermore, street addresses may have different structures depending on the country or region they come from. For example, although the U.K. and the USA both use English-language addresses, their address structures differ substantially.
A conventional algorithm utilized in many applications for parsing addresses is referred to as a pattern matching algorithm. In this type of algorithm, an address is compared to a finite number of possible patterns, and the first pattern that matches is taken as the correct interpretation of the address. Since the patterns are compared in a previously determined order, an address may be assigned to an earlier pattern even though a later pattern would fit it better. In addition, addresses that were not entered correctly may match the wrong pattern or not match any pattern at all.
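The order dependence described above can be illustrated with a minimal Python sketch. The pattern list, the pattern names, and the function parse_first_match are hypothetical and chosen only for illustration; they are not taken from any particular product.

```python
import re

# Hypothetical, ordered pattern list. The order is fixed in advance,
# and a real system may need hundreds of such patterns.
PATTERNS = [
    ("number_name_type",
     re.compile(r"^(?P<number>\d+) (?P<name>.+) (?P<type>Street|Avenue|Road)$")),
    ("number_direction_name_type",
     re.compile(r"^(?P<number>\d+) (?P<direction>North|South|East|West) "
                r"(?P<name>.+) (?P<type>Street|Avenue|Road)$")),
]


def parse_first_match(address):
    """Return (pattern_name, fields) for the first matching pattern, else None."""
    for label, pattern in PATTERNS:
        match = pattern.match(address)
        if match:
            # The search stops here, even if a later pattern fits better.
            return label, match.groupdict()
    return None


# The more specific second pattern is never tried for this address,
# so the directional "North" ends up inside the street name.
print(parse_first_match("12 North Main Street"))

# A mistyped street type matches no pattern at all.
print(parse_first_match("12 North Main Stret"))
```

In this sketch, reordering the list would fix the first address but could break others, which is one reason such pattern lists tend to be hard to maintain.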
Furthermore, the number of patterns that a processing system needs in order to match street addresses accurately may grow exponentially as new types of addresses are encountered. Hundreds of patterns may be required to represent all possible addresses in a country like the United States. In addition, some addresses are ambiguous and may be difficult to parse with a pattern matching algorithm. An example is the US street address 1234 East West Street, in which "East" may be read either as a directional prefix or as part of the street name. Finally, cities with long histories (for example, in Europe) tend to have addresses that are not very well structured. The number of patterns for such cities may be huge.
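The ambiguity of 1234 East West Street can be made concrete with a short Python sketch that lists the plausible readings instead of stopping at the first one. The word lists and the function candidate_parses are hypothetical, and the two readings it produces are an illustrative assumption about how such an address might be interpreted.

```python
# Hypothetical vocabularies, kept tiny for illustration.
DIRECTIONS = {"North", "South", "East", "West"}
STREET_TYPES = {"Street", "Avenue", "Road"}


def candidate_parses(address):
    """List every reading of a '<number> <words...> <type>' address."""
    tokens = address.split()
    if len(tokens) < 3 or not tokens[0].isdigit() or tokens[-1] not in STREET_TYPES:
        return []
    number, middle, street_type = tokens[0], tokens[1:-1], tokens[-1]

    # Reading 1: every middle word belongs to the street name.
    candidates = [{"number": number, "direction": None,
                   "name": " ".join(middle), "type": street_type}]

    # Reading 2: a leading direction word is a prefix, the rest is the name.
    if len(middle) >= 2 and middle[0] in DIRECTIONS:
        candidates.append({"number": number, "direction": middle[0],
                           "name": " ".join(middle[1:]), "type": street_type})
    return candidates


for parse in candidate_parses("1234 East West Street"):
    print(parse)
```

A first-match pattern list returns only one of these readings and gives no indication that the other exists.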
Accordingly, conventional pattern matching algorithms have the disadvantages of being slow, due to numerous lookups per address, and of being difficult to implement for a new language or region. What is desired, therefore, is a method and system for parsing street addresses that overcomes the above-identified issues. The present invention addresses such a need.