1. Field of Art
The present invention generally relates to the field of data processing, and more specifically, to methods of training spelling checkers for use in a domain having structured corpora of data, such as geographic name data.
2. Background of the Invention
Users of computer systems often input misspelled data. For example, in the domain of geographic entity names, users frequently enter misspelled queries for locations, such as a query for map data associated with the city of “New Yrk” or of “Amstedam, the Netherlands.”
Some conventional spelling checkers for use in other data domains store all combinations of validly spelled words, along with the associated probabilities that the various combinations are what the user intended. The set of combinations is deemed a language model that represents the permissible combinations of words in the relevant data domain. Although such an approach is feasible in certain domains, such as those of natural languages, where the number of valid word combinations is constrained by comparatively restrictive grammatical rules governing word ordering, it is not practically possible in other domains.
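By way of illustration only, the conventional approach described above can be sketched as a table that explicitly lists every permissible word combination together with a probability. The table contents, the probabilities, the function names, and the use of Python's `difflib.SequenceMatcher` as a similarity measure are all assumptions made for this sketch and are not taken from any particular spelling checker.

```python
from difflib import SequenceMatcher

# Hypothetical toy language model: every permissible word combination is
# stored explicitly, with an assumed probability that a user intends it.
LANGUAGE_MODEL = {
    ("new", "york"): 0.6,
    ("new", "jersey"): 0.3,
    ("york", "street"): 0.1,
}


def similarity(typed, valid):
    """Return a string similarity in [0, 1] (a stand-in for an error model)."""
    return SequenceMatcher(None, typed, valid).ratio()


def correct(query):
    """Return the stored combination that best explains the query, or None."""
    words = tuple(query.lower().split())
    best, best_score = None, 0.0
    for combo, prob in LANGUAGE_MODEL.items():
        # Only combinations with the same number of words are considered.
        if len(combo) != len(words):
            continue
        score = prob
        for typed, valid in zip(words, combo):
            score *= similarity(typed, valid)
        if score > best_score:
            best, best_score = combo, score
    return " ".join(best) if best else None


print(correct("New Yrk"))
```

In this sketch, a query can be corrected only if its intended combination has been enumerated in the table in advance, so the table grows with the number of valid word combinations in the domain.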
For example, in the domain of geographic entity names, there is a very large vocabulary of possible names, such as tens of thousands of distinct cities (e.g., “New York” or “Paris”), streets (e.g., “Main Street”), and landmarks (e.g., “Eiffel Tower”), hundreds of countries (e.g., “United States” or “France”), and so on. A geographic query can contain any number of the different types (e.g., street, city, or country), and the types can be arranged in different orders based on the different language conventions of the possible users (e.g., “STREET, CITY, STATE” in American English, and “CITY, DISTRICT, STREET” in Russian). Additionally, the different geographic entities of a geographic query can have multiple names, such as various abbreviations (e.g., “U.S.” as well as “United States”) and/or various spellings in different languages (e.g., “Etats Unis” in French and “United States” in English). Similarly, a translation of a geographic name in one language (e.g., Chinese) can have multiple equally valid spellings in another language (e.g., English), and thus an entire set of associated misspellings. Thus, conventional techniques cannot create useful language models for domains such as that of geographic entity names, where the number of valid word combinations is extremely large.