1. Field of the Invention
The invention relates in general to statistical language modeling. More particularly, the invention relates to a method for automatically filtering a corpus of documents, containing both textual and non-textual information in the natural language to be modeled, in order to obtain a corpus of documents that is highly representative of that natural language. The invention also relates to an apparatus for carrying out such a method.
2. Description of Related Art
Textual information is commonly formatted for the human eye and intermingled with non-textual information such as tables, graphics, etc. When such textual information needs to be processed by a machine (e.g. for delivery to a human through speech synthesis or for translation purposes), it becomes necessary to separate what really constitutes text (i.e. a succession of words and punctuation) from the non-textual information.
One such requirement applies to the elaboration of text corpora for statistical language modeling. Present statistical models used in Natural Language Processing (NLP) systems, such as speech recognition systems, require the analysis of large bodies of documents.
These documents, collectively referred to as a corpus, need to be as “true-to-life” as possible and are therefore collected from a wide variety of sources. As a consequence, together with the desired textual information (the “wheat”), those corpora usually contain a large amount of non-exploitable data (the “chaff”), such as binary attachments, images, logos, headers, footers, tables, line drawings and so on.
Thus, prior to running a meaningful statistical analysis on such a corpus of documents, the corpus needs to be cleaned up so that only the “real” textual portions are kept.
Up to now, the above “cleaning” operation on a corpus of documents has commonly been performed manually, that is, each document is edited by a person on a display screen and “filtered” upon visual inspection.
As a typical document corpus contains tens of millions of words, manual editing and filtering is extremely labor-intensive and costly. It is also error-prone and can have dramatic consequences, e.g. if a corpus is damaged beyond repair by an over-enthusiastic use of the delete function.
In order to reduce the time necessary to achieve such a visual filtering of a corpus of documents, some software tools have been developed to assist people in performing this task. These software tools were designed to automate visual rules based on heuristics and “ad-hoc” observations.
Such rules are for instance: “Delete lines that contain less than 20% lowercase characters”, or “Delete lines that are more than 256 characters long”. Other rules were defined, based on visual inspection of the documents, such as: “Delete all the text that appears between two lines formed by ‘-------’” (when this is the way a table of numbers is presented in a given corpus).
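For illustration only, the three example rules quoted above could be automated roughly as follows. This is a minimal sketch of the prior-art heuristics, not of the invention; the function names, the constant names and the assumption that a separator line consists solely of seven or more dashes are hypothetical and do not come from any particular tool.

```python
import re

# Thresholds taken from the example rules quoted above.
MIN_LOWERCASE_RATIO = 0.20
MAX_LINE_LENGTH = 256
# Assumption: a separator line contains nothing but seven or more dashes.
SEPARATOR = re.compile(r"^\s*-{7,}\s*$")


def lowercase_ratio(line):
    """Return the fraction of characters in the line that are lowercase."""
    if not line:
        return 0.0
    return sum(ch.islower() for ch in line) / len(line)


def filter_lines(lines):
    """Apply the three example rules and return the lines that survive."""
    kept = []
    in_table = False
    for line in lines:
        if SEPARATOR.match(line):
            # Each separator line opens or closes a table region.
            in_table = not in_table
            continue
        if in_table:
            continue  # text between two separator lines is deleted
        if len(line) > MAX_LINE_LENGTH:
            continue  # more than 256 characters long
        if lowercase_ratio(line) < MIN_LOWERCASE_RATIO:
            continue  # less than 20% lowercase (also drops blank lines)
        kept.append(line)
    return kept


sample = [
    "The quick brown fox jumps over the lazy dog.",
    "-------",
    "12 34 56",
    "78 90 12",
    "-------",
    "TOTAL 1234 5678",
    "x" * 300,
    "Another ordinary sentence.",
]
print(filter_lines(sample))
```

In this sketch a single unpaired separator line causes everything after it to be discarded, which shows how easily a rule tuned by eye on one source can damage a corpus drawn from another.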
All of the above rules, even when implemented in a computer program, rely on visual inspection of the corpus and on human intervention. With such a “manual” filtering procedure, the cost of a sequence of filtering operations is commonly estimated to range, on average, from one to two man-weeks, depending on the corpus size and the number of different sources it encompasses.
Thus, as underlined above, given the considerable time required by present corpus filtering methods and the high risk of error that human intervention entails, there is a real need for a corpus filtering method that improves on such empirical filtering of a large corpus of documents. This need is addressed by the invention disclosed herein.