The present invention relates to a method for the standardization of a database containing address data relevant to individuals and businesses.
In a business that has a customer database for mass mailing or other address related applications, it is advantageous to have all the address data in the database standardized. In a standardized address database, an address can be organized in a certain way so that the various components of the address data can be easily identified by a machine. There are many instances where an address must be broken down into different components so that each component is used in a different way. For example, if it is required to fill out a form where only the last name, the city name and the zipcode are separately put in different boxes of the form, one could use a machine to pick out the correct components in a standardized address and put them in the proper places. Moreover, using a standardized address database, address duplication can be easily discovered and eliminated.
Standardized address databases can be used by private institutions such as banks, insurance companies, credit bureaus, hospitals, fund raising organizations, and mail-order businesses. Such databases are also useful for various government agencies such as the Social Security Administration, the Internal Revenue Service, and so forth.
It is advantageous to provide a method for the standardization of address data.
It is the primary objective of the present invention to provide an effective method to standardize the address data in a database.
The basic steps for accomplishing the above-identified objective begin with breaking up (parsing) a set of address data into lines and then breaking up each line into words. Each word is looked up in a word dictionary wherein the word dictionary has a large quantity of stored words and each stored word is associated with a field type. The field type of each word is identified by comparing the word in the line to the stored words in the word dictionary. After the field types are identified, a line pattern is formed for each line using the field types of the words in the line. A pattern type is then identified for each line by comparing the pattern of the line to the stored patterns contained in a pattern dictionary. The line patterns for each line are then returned to the address data.
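By way of illustration only, the basic steps above may be sketched in Python as follows. The dictionary contents, the field type names (NUMBER, WORD, SUFFIX, and so on), the pattern type names, and the function names are all assumptions made for this sketch and are not taken from the embodiment. In particular, treating any five-digit number as a Zipcode is a simplification.

```python
# Hypothetical word dictionary: stored word -> field type (illustrative entries only).
WORD_DICTIONARY = {
    "AVE": "SUFFIX", "AVENUE": "SUFFIX", "BLVD": "SUFFIX",
    "NM": "STATE", "NY": "STATE",
    "INC": "FIRM", "COMPANY": "FIRM",
}

# Hypothetical pattern dictionary: line pattern -> pattern type.
PATTERN_DICTIONARY = {
    ("NUMBER", "WORD", "SUFFIX"): "STREET_ADDRESS",
    ("WORD", "STATE", "ZIP"): "CITY_STATE_ZIP",
    ("WORD", "FIRM"): "FIRM_NAME",
}


def field_type(word):
    """Identify the field type of one word by dictionary lookup."""
    key = word.strip(".,").upper()
    if key in WORD_DICTIONARY:
        return WORD_DICTIONARY[key]
    if key.isdigit():
        # Simplifying assumption: a five-digit number is a Zipcode.
        return "ZIP" if len(key) == 5 else "NUMBER"
    return "WORD"


def line_pattern(line):
    """Form the line pattern from the field types of the words in the line."""
    return tuple(field_type(word) for word in line.split())


def classify(address):
    """Parse an address into lines and look up each line pattern.

    Returns (line, pattern, pattern_type) tuples; pattern_type is None
    when the pattern is not found in the pattern dictionary.
    """
    results = []
    for line in address.splitlines():
        if not line.strip():
            continue
        pattern = line_pattern(line)
        results.append((line, pattern, PATTERN_DICTIONARY.get(pattern)))
    return results
```

A line whose pattern type comes back as None is one whose pattern is missing from the pattern dictionary, which is the situation the additional steps described below are meant to address.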
In some instances when a line pattern cannot be found in the pattern dictionary, it is necessary to include an additional step of splitting the line into two. For example, in a line that contains both the street address and the P.O. Box number, it is preferable to break up the line into a street address line and a P.O. Box line.
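A minimal Python sketch of such a splitting step is given below, assuming that a P.O. Box can be recognized by a simple regular expression. The expression and the function name `split_po_box` are hypothetical, and the sketch splits unconditionally rather than only after a failed pattern lookup.

```python
import re

# Assumed spelling variants: "P.O. Box 456", "PO Box 456", "P O Box 456".
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*BOX\s+\w+", re.IGNORECASE)


def split_po_box(line):
    """Split a line into a street address line and a P.O. Box line."""
    match = PO_BOX.search(line)
    if match is None:
        return [line]
    # Whatever surrounds the P.O. Box is kept as the street address line.
    rest = line[:match.start()] + " " + line[match.end():]
    street = " ".join(rest.strip(" ,").split())
    box = match.group(0)
    return [street, box] if street else [box]
```

For example, `split_po_box("123 Main Ave, P.O. Box 456")` yields a street address line and a P.O. Box line, while a line without a P.O. Box is returned unchanged.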
In some instances when a line pattern cannot be found in the pattern dictionary, it is necessary to include an additional step of joining two lines into one. For example, when the City/State/Zip line has been broken up such that the Zipcode is separated from the city name and state name, it is preferable to put the Zipcode behind the state name on the same line.
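The joining step may be sketched in Python as follows, under the assumption that a separated Zipcode appears alone on the line immediately below a line ending in a state abbreviation. The small `STATES` set and the function name `join_zip_line` are illustrative only.

```python
import re

# Illustrative subset of state abbreviations (hypothetical contents).
STATES = {"NM", "NY", "TX"}

# A line consisting solely of a 5-digit or ZIP+4 code.
ZIP_ONLY = re.compile(r"^\d{5}(-\d{4})?$")


def join_zip_line(lines):
    """Append a stand-alone Zipcode line to the preceding city/state line."""
    joined = []
    for line in lines:
        text = line.strip()
        prev_words = joined[-1].split() if joined else []
        ends_with_state = (
            bool(prev_words) and prev_words[-1].strip(".,").upper() in STATES
        )
        if ends_with_state and ZIP_ONLY.match(text):
            # Put the Zipcode behind the state name on the same line.
            joined[-1] = joined[-1].rstrip() + " " + text
        else:
            joined.append(line)
    return joined
```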
In some instances when a line pattern cannot be found in the pattern dictionary, it is necessary to include an additional step of dropping one or more words in a line. For example, in a personal name line containing financial words such as “Trustee”, “Deceased”, “Minor”, and/or “Retiree”, it is preferable to delete those words in the address data.
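The word-dropping step may be illustrated by the following Python sketch. The set `DROP_WORDS` contains only the four example words named above, and the function name `drop_words` is an assumption of the sketch.

```python
# The four example words from the text; a real list would be longer.
DROP_WORDS = {"TRUSTEE", "DECEASED", "MINOR", "RETIREE"}


def drop_words(line):
    """Remove droppable words from a personal name line."""
    kept = [
        word for word in line.split()
        if word.strip(".,()").upper() not in DROP_WORDS
    ]
    # Trim punctuation left dangling by a removed trailing word.
    return " ".join(kept).rstrip(" ,")
```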
It is also preferred that the word dictionary contain both words that are considered standard in an address and words that are non-standard, so that the non-standard words in an address can be identified and then replaced by the standard words. For example, it is preferable to include in the word dictionary the full state names (e.g. New Mexico) as well as the abbreviated state names (e.g. NM) for identification purposes, but the full state names will be replaced by the corresponding abbreviated state names in the standardized address database.
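As a rough illustration of the replacement of non-standard words, the following Python sketch maps full state names to abbreviations. The table `STANDARD_FORMS` and the function name `standardize` are hypothetical. The sketch also ignores field types, so it would abbreviate a city named after its state (e.g. New York, New York); a dictionary-driven method would restrict the replacement to words identified as state names.

```python
import re

# Hypothetical table: non-standard form -> standard form (two entries shown).
STANDARD_FORMS = {"NEW MEXICO": "NM", "NEW YORK": "NY"}

# Match any non-standard form as a whole phrase, ignoring case.
_NON_STANDARD = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in STANDARD_FORMS) + r")\b",
    re.IGNORECASE,
)


def standardize(line):
    """Replace each non-standard phrase in the line with its standard form."""
    return _NON_STANDARD.sub(
        lambda match: STANDARD_FORMS[match.group(0).upper()], line
    )
```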
When a set of address data is broken up into lines, it is preferable to discern the line type. For example, it is preferable to know whether a line is a street address line or a firm name line. Accordingly, the field types of the stored words in the word dictionary are assigned in accordance with the line types. For line identification purposes, a procedure called Blockscan is used. The Blockscan procedure starts from the bottom line of an address and works upward until the topmost line has been identified, and each line is identified using the usual address cues appearing in the line. For example, if the bottom line contains a state name as the second or a later word, the line is identified as a city/state line. In the next line up, if there is a street suffix word such as Ave., Avenue, or Blvd. in the line, the line is identified as a street address line. Similarly, if a word that is commonly used as part of a firm name, such as CO, Company, Inc., or Industry, appears in a line above the street address, the line is identified as a firm line.
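The bottom-up scan described above may be sketched in Python as follows. This is an illustration under stated assumptions and not the Blockscan procedure of the embodiment: the cue word sets are small hypothetical samples, the line type names are invented for the sketch, and only the three cues mentioned in the text are checked.

```python
# Small, hypothetical cue word sets; CO is treated only as a firm word here.
STATE_WORDS = {"NM", "NY", "TX"}
SUFFIX_WORDS = {"AVE", "AVENUE", "BLVD"}
FIRM_WORDS = {"CO", "COMPANY", "INC", "INDUSTRY"}


def _words(line):
    """Upper-case the words of a line and strip trailing punctuation."""
    return [word.strip(".,").upper() for word in line.split()]


def blockscan(lines):
    """Assign a line type to each line, scanning from the bottom line upward."""
    types = [None] * len(lines)
    street_seen = False
    for i in range(len(lines) - 1, -1, -1):
        words = _words(lines[i])
        is_bottom = i == len(lines) - 1
        if is_bottom and any(w in STATE_WORDS for w in words[1:]):
            # State name as the second or a later word of the bottom line.
            types[i] = "CITY_STATE"
        elif not street_seen and any(w in SUFFIX_WORDS for w in words):
            types[i] = "STREET_ADDRESS"
            street_seen = True
        elif street_seen and any(w in FIRM_WORDS for w in words):
            # Firm words count only in lines above the street address.
            types[i] = "FIRM"
        else:
            types[i] = "UNKNOWN"
    return types
```

Lines reported as UNKNOWN, such as a personal name line, would need further cues beyond the three shown here.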
The present invention is more clearly described in the Detailed Description of the Embodiment.