The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
The increasing internationalization of business has coincided with rapid growth in the volume of digital data that is managed using distributed computer systems. Large volumes of documents are constantly harvested, produced, exchanged, and analyzed. Processing such large volumes of documents with a computer system that has limited computational resources is challenging. The challenge is exacerbated when the documents are associated with different locales. “Locale,” in this context, may mean different languages; different formatting of dates, numbers, and other data; and different local use of terms, either in a particular language or after translation. That is, a particular word, when translated into different languages, may have a substantially different meaning in local usage.
In some instances, the schemas of documents also may be localized. Localized document schemas make automatic processing of the documents especially hard. For example, CSV (Comma-Separated Values) documents may have field names in a foreign language in addition to having foreign-language records within the document. Even in documents based on semi-structured data, such as XML (eXtensible Markup Language) documents, the tags, elements, and attributes may be defined in a local foreign language. Computer systems that receive and process such documents may fail to recognize the schemas without prior knowledge of the locale and may require manual configuration to accurately process the data in the documents. Therefore, in many kinds of distributed systems, accurate operation requires accurately determining the locale of each document.
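For illustration only, the following Python sketch shows how a localized CSV header might defeat a parser that expects English field names, and how a guess at the header's locale could allow the fields to be mapped to a canonical schema. The alias table `FIELD_ALIASES`, the functions `guess_header_locale` and `read_normalized`, and the sample data are hypothetical names and values introduced here for the example. They are assumptions and do not describe any particular system.

```python
import csv
import io

# Hypothetical alias table: localized field name -> canonical field name.
# A practical system would need far broader coverage than this.
FIELD_ALIASES = {
    "en": {"name": "name", "date": "date", "amount": "amount"},
    "de": {"name": "name", "datum": "date", "betrag": "amount"},
    "fr": {"nom": "name", "date": "date", "montant": "amount"},
}


def guess_header_locale(header):
    """Return the locale whose known field names match the most header cells.

    Returns None when no header cell is recognized in any locale.
    """
    cells = [cell.strip().lower() for cell in header]
    best_locale, best_hits = None, 0
    for locale, aliases in FIELD_ALIASES.items():
        hits = sum(1 for cell in cells if cell in aliases)
        if hits > best_hits:
            best_locale, best_hits = locale, hits
    return best_locale


def read_normalized(text, delimiter=","):
    """Parse CSV text and rename recognized fields to canonical names."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    locale = guess_header_locale(header)
    if locale is None:
        raise ValueError("schema not recognized for any known locale")
    aliases = FIELD_ALIASES[locale]
    # Unrecognized cells keep their original spelling.
    canonical = [aliases.get(cell.strip().lower(), cell) for cell in header]
    return locale, [dict(zip(canonical, row)) for row in reader]


# Sample document with German field names and a semicolon delimiter.
SAMPLE = "Name;Datum;Betrag\nMüller;31.12.2023;1.234,56\n"
locale, rows = read_normalized(SAMPLE, delimiter=";")
```

In this sketch only the field names are normalized. The record values, such as the date `31.12.2023` and the amount `1.234,56`, remain in their localized form, so their interpretation would still depend on knowing the locale.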