1. Technical Field
This invention relates to automated language translation. More particularly, the invention relates to the automated removal of noise words and foreign words from language text data used to train automatic language translators.
2. Background Information
Throughout this application, various publications, patents and published patent applications are referred to by an identifying citation. The disclosures of the publications, patents and published patent applications referenced in this application are hereby incorporated by reference into the present disclosure.
Most data used to train automated language identification systems, such as the Rosette® Language Identifier (RLI) (Basis Technology Corp., Cambridge, Mass.), are collected from the World Wide Web and contain English or other noise words. These noise words may lead to misidentification of the given language as well as reduced accuracy rates of (e.g., English) text detection.
Heretofore, there have generally been no efficient techniques for removing these unwanted words, other than having a human reviewer go through the text word by word. This approach tends to be undesirably labor intensive, as typical training text data may be megabytes in size and include millions of words.
A need therefore exists for enabling the automated cleaning of relatively large amounts of training text.