1. Field of Invention
The present invention relates to the identification of text regions within an image file of a document.
2. Description of Related Art
Optical character recognition, or OCR, is a broad term applied to the general field of using machines to recognize human-readable glyphs, such as alphanumeric text characters and Chinese written characters, or more generally, Asian written characters. There are many approaches to optical character recognition, such as discussed in U.S. Pat. No. 5,212,741. However, an integral part of the field of OCR is a step to first identify, i.e. classify, regions of an image as text or non-text. Image regions identified as text may then be further processed to identify specific text characters or Asian written characters.
Various approaches to distinguishing text regions from non-text regions of an image have also been proposed. For example, U.S. Pat. No. 6,038,527 suggests searching a document image for word-shape patterns to identify text regions.
It would be helpful if a machine could determine for itself how to identify text characters and/or Asian written characters. This leads to the field of machine learning, in which the ideal is for a machine to learn by itself how to identify human-readable glyphs.
Data classifiers are associated with the field of machine learning, and are typically applied in areas that require sorting through large data samples, such as the data mining technique described in U.S. Pat. No. 7,640,219. Data classifiers have also been applied to the field of OCR, as demonstrated by U.S. Pat. No. 5,640,492, which describes the use of a soft-margin classifier in text recognition.
Generally, in data classification various positive samples and negative samples are provided to a machine in a training phase to establish positive and negative references, and thereby to establish two classes. Once training is complete, the machine is asked to assign a newly provided sample to one of the two classes based on what it has learned.
For example, if each data point in an existing sample of data points can be designated as belonging to one of two classes, a goal could be for a machine to determine for itself to which class a newly provided data point should belong.
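The training and classification phases described above may be illustrated with a minimal sketch. It assumes a nearest-centroid rule as a simple stand-in for the learned classifier; that rule, the function names train and classify, and the sample values are hypothetical and are chosen only for illustration.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def train(positives, negatives):
    """Training phase: summarize each class by the mean of its samples."""
    def centroid(samples):
        n = len(samples)
        return tuple(sum(coords) / n for coords in zip(*samples))
    return centroid(positives), centroid(negatives)


def classify(model, sample):
    """Assign a new sample to the class whose reference point is nearer."""
    positive_ref, negative_ref = model
    if dist(sample, positive_ref) <= dist(sample, negative_ref):
        return "positive"
    return "negative"


# Hypothetical two-dimensional training samples for the two classes.
positive_samples = [(2.0, 3.0), (3.0, 3.5), (2.5, 4.0)]
negative_samples = [(0.0, 0.5), (1.0, 0.0), (0.5, 1.0)]

model = train(positive_samples, negative_samples)
print(classify(model, (2.8, 3.2)))
print(classify(model, (0.2, 0.3)))
```

The model here is nothing more than two reference points, one per class. A support vector machine, discussed next, instead learns a separating boundary between the classes.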
In the case of support vector machines, each data point may be viewed as a p-dimensional vector (i.e., a list of p numbers), and the goal is to determine whether such points can be separated with a (p−1)-dimensional hyperplane. This may be termed linear classification.
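As a minimal sketch of linear classification, a hyperplane in p dimensions may be represented by a normal vector w and an offset b, and a point x may be assigned to a class according to the sign of w·x + b. The function name side_of_hyperplane and the numeric values below are assumptions made only for illustration.

```python
def side_of_hyperplane(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane w.x + b = 0 on which x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical 2-dimensional case: the hyperplane is the line x1 = x2.
w = (1.0, -1.0)
b = 0.0

print(side_of_hyperplane(w, b, (3.0, 1.0)))  # point below the line
print(side_of_hyperplane(w, b, (1.0, 3.0)))  # point above the line
```

Points that fall exactly on the hyperplane are assigned to the +1 side in this sketch; that tie-breaking rule is an arbitrary choice.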
A hyperplane is a concept in geometry that generalizes the concept of a plane to other dimensions. Just as a plane defines a two-dimensional subspace in a three-dimensional space, a hyperplane defines a (q−1)-dimensional subspace within a q-dimensional space. A line, for example, is a one-dimensional hyperplane in a two-dimensional space.
The main idea in using a hyperplane in data analysis is to construct a divide (i.e. a hyperplane) that separates clusters of data points, or vectors (i.e. separates data points into different classes). These separated data point clusters can then be used for data classification purposes. There may be many different hyperplanes that divide the data points into separate clusters, but some hyperplanes will provide better divisions of data points than others. Intuitively, a good choice of hyperplane is one that provides a good separation. That is, the best choice of hyperplane would be the hyperplane that has the largest distance (i.e. functional margin) to the nearest training data points of the different classes. This is because, typically, the larger the functional margin, the lower the generalization error of the classifier. Thus, although there might be many hyperplanes that classify the data (i.e. may separate the data into classifications, or data clusters), one hyperplane may offer optimal separation.
For example, FIG. 1 shows a 2-dimensional space with eighteen data points (or vectors) separated into two clusters of nine data points each. A first data cluster of nine data points is shown as black data points, and a second data cluster of nine data points is shown as white data points. For illustrative purposes, three candidate hyperplanes 11, 13, and 15 (i.e. three lines in the present 2-dimensional example) are shown, each dividing the eighteen data points into two groups, or classes, of data points, but only one of the three candidate hyperplanes offers the best data-point separation.
In the present example, hyperplane 13 separates four black data points on its left (side A) from five black data points and nine white data points on its right (side B). In order to obtain meaningful information, however, it is helpful to divide the data points into data clusters, since the data points in each data cluster are likely to have some similar attributes. In the present case, it is readily apparent that hyperplane 13 does not provide meaningful information regarding similarities or differences between the black and white data points, since hyperplane 13 does not accurately differentiate between the two data clusters.
Hyperplane 11 does separate the first data cluster (consisting of nine black data points) on its upper side (side C) from the second data cluster (consisting of nine white data points) on its lower side (side D), but does not provide the optimal separation between the first and second data clusters.
In order to provide meaningful information, it is preferable that the hyperplane that separates the two data clusters provide a maximum separation between the two data clusters. The objective is to choose the hyperplane in which the functional margin (i.e. the distance from the hyperplane to the nearest data point along a line normal to the hyperplane) on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and such a linear classifier is known as a maximum margin classifier.
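The selection of a maximum-margin hyperplane among candidates may be sketched as follows. The sketch computes, for each candidate, the smallest distance from the hyperplane to any data point along the normal, with a negative value indicating that at least one point lies on the wrong side. The data points and the three candidates are hypothetical and are not those of FIG. 1; the function name margin is likewise an assumption.

```python
from math import sqrt


def margin(w, b, points, labels):
    """Smallest signed distance from the hyperplane w.x + b = 0 to any point.

    A negative result means at least one point is on the wrong side.
    """
    norm = sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Hypothetical data: label +1 for "black" points, -1 for "white" points.
points = [(1, 3), (2, 4), (1, 4), (3, 1), (4, 2), (4, 1)]
labels = [1, 1, 1, -1, -1, -1]

# Hypothetical candidate hyperplanes, each given as (w, b).
candidates = [
    ((-1.0, 1.0), 0.0),   # the line x2 = x1
    ((0.0, 1.0), -2.5),   # the horizontal line x2 = 2.5
    ((1.0, 0.0), -1.5),   # the vertical line x1 = 1.5 (fails to separate)
]

margins = [margin(w, b, points, labels) for w, b in candidates]
best = max(range(len(candidates)), key=lambda i: margins[i])
print(best, margins)
```

Comparing a fixed list of candidates is only an illustration of the selection criterion. A support vector machine arrives at the maximum-margin hyperplane by solving an optimization problem rather than by enumerating candidates.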
In the present example of FIG. 1, margin line 16 defines the border of the first data cluster of black data points with reference to hyperplane 15, and margin line 18 defines the border of the second data cluster of white data points with reference to hyperplane 15. The data points (or vectors) along margin lines 16 or 18 are typically called support vectors. The bias from the origin to hyperplane 15 is shown as bias term b. The functional margin w of hyperplane 15 to margin lines 16 and 18 is likewise shown. In the present case, hyperplane 15 would be the maximum margin classifier since it has the largest functional margin among the three candidate hyperplanes 11, 13, 15.
As shown, classifiers are effective at sorting data into two classes, such as text regions and non-text regions in an image sample, but they are generally very computationally expensive, requiring substantial computing resources. Furthermore, text regions may have a multiplicity of patterns in various orientations, and may be made distinguishable from non-text regions by a multitude of geometric, luminance, and chromatic properties, such that separation into two classes may not be computationally practical. This is particularly the case when considering both Western alphanumeric characters and Asian written characters, which may be written in different fonts and colors.
What is needed is a method of making use of the classifying strength of classifiers, while simplifying their application and reducing their computational requirements.