As interest increases in reviewing, translating, transcribing, scanning, searching or otherwise analyzing or processing natural language texts, whether for business, scientific or academic endeavors, national security or other reasons, and whether in English, Japanese, French, Arabic or other languages, there exists a need for improved methods, systems, devices, structures and software products for enabling efficient and accurate processing of text.
Given the increasingly global nature of business and other enterprises, it is not unusual to receive bodies of text written in more than one language. Applications that process input data in today's global environment must be capable of processing data in languages from all over the world. Often, valuable information enters an organization as unspecified text from disparate, unstructured sources such as e-mail, HTML pages, legacy systems, and external data feeds. Enabling an enterprise's critical information applications to handle this information is a significant challenge.
As the number of systems and applications for analyzing text increases, it would be useful to support and enhance such applications by enabling them to detect boundaries between different languages in a body of text. This could enable, for example, the dynamic optimization of processing between text sections of different languages.
The prior art contains methods for determining the language of a body of text, on the assumption that the text is written in a single language. Approaches typical of the prior art employ statistical and heuristic methods to make this determination. (See, e.g., Cavnar, W. and Trenkle, J., “N-Gram-Based Text Categorization”.) Thus, even though it is increasingly common to receive multi-lingual bodies of text, conventional language detecting and processing methods and software are generally adapted for texts written in a single language.
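By way of illustration only, the following is a minimal sketch of a statistical, single-language detector in the spirit of the n-gram approach cited above. It is not taken from the cited reference or from any particular product. The function names, the profile size, the n-gram lengths and the toy training samples are all assumptions made for this example. Each text is reduced to a ranked list of its most frequent character n-grams, and the language whose ranked list differs least from that of the input (a rank-displacement, or "out-of-place", distance) is selected.

```python
from collections import Counter

# Toy training samples (assumed for illustration; real profiles need far more text).
SAMPLES = {
    "english": "the quick brown fox walks over the lazy dog every week",
    "french": "le renard brun saute par-dessus le chien paresseux chaque été",
}

def ngram_profile(text, n_max=3, top_k=300):
    """Return character n-grams (length 1..n_max) ranked by descending frequency."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank displacements; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )

def guess_language(text, samples):
    """Return the single language whose profile is closest to that of the text."""
    doc = ngram_profile(text)
    profiles = {lang: ngram_profile(sample) for lang, sample in samples.items()}
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

A call such as `guess_language(some_text, SAMPLES)` always returns exactly one language label for the whole input. A detector of this kind therefore has no means of reporting that one part of the text is in one language and another part in a different language, which is the single-language assumption noted above.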
The prior art also includes methods for determining logical boundaries between units of text, such as words or sentences. An example is set forth in the Unicode Standard Annex #29, “Text Boundaries” (available at: http://www.unicode.org/reports/tr29/tr29-4.html). The method disclosed in that Annex is referred to below as the method of “UAX #29”, and is incorporated herein by reference as if set forth in its entirety.
However, the prior art does not describe an efficient, automated way to detect or identify boundaries between areas of different languages in a body of text containing multiple languages.