Data compression techniques, applicable to text, graphics and other representations of information, have been used in many areas of communications such as voice, video, telemetry transmission and storage and retrieval of voluminous data. Of the techniques developed, adaptive data compression is one of the most attractive, because of its ability to increase the bandwidth utilization efficiency for data by reducing the data redundancy. C.A. Andrews et al., in "Adaptive Data Compression", Proc. I.E.E.E., Vol. 55 (1967) pp. 267-277, have noted that data compression techniques can be divided into four categories: (1) direct data compression techniques, which include variable rate compressors such as interpolators, polynomial predictors and bit-plane encoding, and fixed rate compressors such as optimum prediction, differential coding, probabilistic coding and adaptive sampling; (2) linear and nonlinear transformation compression techniques that use pre-process filters, logarithmic amplifiers, filters, limiters/clippers, companders, Fourier filters and Karhunen-Loeve optimum discrete compression filters; (3) parameter extraction compression techniques, in which one or more parameters associated with or derivable from the signal are used to represent the signal; and (4) selective monitoring compression techniques that monitor the data and select a portion thereof for transmission or storage.
The efficient representation and storage of collections of ideographic characters or symbols from languages such as Chinese, Japanese and Hebrew is of particular interest here because of the large number of characters required in any reasonable language set. For example, a system of Chinese character patterns should contain 2500-4000 characters in order to adequately represent at least 99.5% of the characters that appear in ordinary text in that language. Chinese characters are also used in Japan, but the total number of characters is indefinite. For example, the Japanese Ministry of Education has identified 881 characters to be learned in elementary and middle schools and an additional 969 characters that should be known for ordinary daily use. Daily newspapers in Japan use about 4000 characters, and one standard code system for such characters contains 6349 characters. In another standard Japanese character set, 6802 characters appear. The most elaborate dictionary for Chinese characters in Japan contains approximately 50,000 characters. According to statistics accumulated on character use, a selected set of about 3000 characters covers about 99.9% of all the characters that appear in newspapers and journals in Japan. Even if one retreats to this smaller number, the task of representing such a large set of characters, each described by a rectangular pattern of M dots by N dots, is daunting. If, instead, one concentrates on a larger and more adequate set of 6349 characters or 6802 characters that includes many specialized professional and scholarly characters, the task becomes more daunting.
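The scale of the storage problem can be illustrated with a short calculation. The sketch below assumes an uncompressed bit map of one bit per dot; the function name raw_bitmap_bytes and the 24 by 24 dot pattern size are hypothetical choices for illustration only, since the text leaves M and N unspecified.

```python
def raw_bitmap_bytes(num_chars, m, n):
    """Bytes needed to store num_chars uncompressed M-by-N dot patterns,
    at one bit per dot, rounded up to a whole byte."""
    bits = num_chars * m * n
    return (bits + 7) // 8


# Assumed 24 x 24 dot patterns (hypothetical; M and N are not fixed in the text).
print(raw_bitmap_bytes(3000, 24, 24))   # 216000
print(raw_bitmap_bytes(6349, 24, 24))   # 457128
```

Under these assumptions, even the smaller 3000-character set occupies over 200 kilobytes before any compression is applied.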
Crane et al., in U.S. Pat. No. 4,718,102, disclose separation of recognition of complex characters, such as Kanji, into an algorithmic technique, which serves to identify a first set of all possible characters that are consistent with a given observed pattern of pixels, and a disambiguation technique, which serves to remove the ambiguity or possible confusion among all the characters of the first set by use of additional features or parameters associated with the target character. The inventors observe that, statistically, a Kanji character having more than 20 strokes or fewer than 5 will be much easier to distinguish, as compared with a Kanji character having approximately 10 strokes. Stroke characteristics are relied upon for the algorithmic portion of pattern recognition here, with statistics on different categories of statistics being accumulated and analyzed.
Use of pixel neighborhoods surrounding or adjacent to a pixel for optical character recognition purposes is disclosed by Casey et al. in U.S. Pat. No. 4,831,657. A probability table is constructed for recognition of characters expressed in a new font, based on the probabilities associated with characters expressed in a known font. A decision tree is generated and used to analyze the new font. This approach requires the use of a reference font, or something similar, for recognition of characters expressed in a new font.
In U.S. Pat. No. 4,850,026, Jeng et al. disclose extraction of all useful features of a set of characters expressed in a given font, as a character feature database. The particular database features discussed here are vertical, horizontal and diagonal character strokes within each of a sequence of rectangular groups of pixels that cumulatively cover all pixels on the screen.
Several techniques have been proposed for data compression of Chinese or Japanese character patterns. M. Nagao, in "Data Compression of Chinese Character Patterns", Proc. I.E.E.E., Vol. 68 (1980) pp. 818-829, reviews several techniques that have been proposed for such compression, using statistics of the patterns and other approaches. Two-dimensional predictive coding has been proposed in which a character pattern is divided into a sequence of rectangular pixels and the black versus white value of a particular pixel is predicted by use of the pattern of four nearest neighbor pixels. Pattern coding by m-by-n sub-blocks has also been used, relying on the fact that Chinese characters are primarily straight lines. Other techniques include stroke representation, where the strokes are straight line segments represented by vertical, horizontal and ±45° strokes on a mesh grid. Contour coding has been used to account for the fact that some portions of Chinese characters are curvilinear rather than being straight line segments. Weighted sums of four adjacent surrounding points have been used for pixel prediction as well.
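The two-dimensional predictive coding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme of any cited work: the four neighbors are taken to be the left, upper-left, upper and upper-right pixels (all already visited in a raster scan), the predictor is a simple majority-vote table, and the names context, train_predictor, encode and decode are hypothetical.

```python
def context(img, r, c):
    """Pack four already-scanned neighbours of pixel (r, c) into a 4-bit number."""
    def px(rr, cc):
        # Pixels outside the pattern are treated as white (0).
        inside = 0 <= rr < len(img) and 0 <= cc < len(img[0])
        return img[rr][cc] if inside else 0
    return ((px(r, c - 1) << 3) | (px(r - 1, c - 1) << 2)
            | (px(r - 1, c) << 1) | px(r - 1, c + 1))


def train_predictor(images):
    """For each of the 16 contexts, predict the pixel value seen most often."""
    counts = [[0, 0] for _ in range(16)]
    for img in images:
        for r in range(len(img)):
            for c in range(len(img[0])):
                counts[context(img, r, c)][img[r][c]] += 1
    return [1 if ones > zeros else 0 for zeros, ones in counts]


def encode(img, table):
    """Prediction-error map: 1 wherever the predictor guessed wrong."""
    return [[img[r][c] ^ table[context(img, r, c)]
             for c in range(len(img[0]))] for r in range(len(img))]


def decode(err, table):
    """Rebuild the pattern in raster order from the prediction-error map."""
    img = [[0] * len(err[0]) for _ in range(len(err))]
    for r in range(len(err)):
        for c in range(len(err[0])):
            img[r][c] = err[r][c] ^ table[context(img, r, c)]
    return img


glyph = [[0, 0, 0, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 0, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 0, 0, 0]]
table = train_predictor([glyph])
assert decode(encode(glyph, table), table) == glyph
```

The error map is exactly invertible, so any compression would come from a later entropy or run-length coding stage applied to the error map, which is not shown here.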
In "Machine Recognition of Printed Chinese Characters Via Transformation Algorithms", Pattern Recognition, vol. 5 (1973), pp. 303-321, Wang and Shiau identify 63 characteristic sub-patterns on the left side of Chinese characters and an unspecified number of sub-patterns on the right side thereof, which together make up whole characters. Their general pattern recognition system includes: (1) a receptor module that represents each Chinese character received as a rectangular matrix of pixels; (2) a pre-processor module that uses a Fourier, Hadamard, Rapid or other two-dimensional transform technique to transform the character to a form that is more easily recognized and processed; (3) a classifier module that examines each pixel pattern and assigns it to one of a number of categories based on a decision rule such as minimum-distance-to-mean of a reference character or feature; and (4) a memory module to store each of the classified characters for later retrieval. The classification step appears to introduce some loss of information here, and a character is force-fitted into one of the reference character categories so that the character may be incorrectly recognized and categorized.
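A minimum-distance-to-mean decision rule of the kind mentioned for the classifier module can be sketched as follows. The feature vectors, the squared Euclidean distance and the names class_means and classify are assumptions made for illustration; the cited paper's actual features and distance measure are not reproduced here.

```python
def class_means(samples):
    """Map each category label to the mean of its training feature vectors."""
    means = {}
    for label, vectors in samples.items():
        n = len(vectors)
        means[label] = [sum(column) / n for column in zip(*vectors)]
    return means


def classify(vector, means):
    """Return the label whose mean is nearest (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(means, key=lambda label: sqdist(vector, means[label]))


means = class_means({"A": [[0, 0], [0, 2]], "B": [[10, 10], [12, 10]]})
print(classify([1, 1], means))      # A
print(classify([100, 100], means))  # B
```

The second call illustrates the force-fitting noted above: an input far from every reference category is still assigned to the nearest one.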
Yamamoto and Mori, in "Recognition of Handprinted Characters By An Outermost Point Method," Pattern Recognition, vol. 12 (1980), pp. 229-236, use a 64×64 pixel pattern, with each pixel having any of 16 levels of darkness, and construct the convex hull of each character examined. A hole, which arises from a plurality of dark pixels that completely surround one or more light pixels, is treated separately. The convex hull of each character is expressed as a mask, and the collection of masks forms a dictionary for character recognition.
F-H. Cheng et al., in "Recognition of Hand Written Chinese Characters by Modified Hough Transform Techniques", I.E.E.E. Trans. on Pattern Analysis and Machine Intelligence, Vol. 11 (1989) pp. 429-439, use a modified Hough transform technique plus dynamic programming to characterize and recognize hand written Chinese characters. In the Hough transform technique, a new two-dimensional coordinate space is generated in which all points that lie on a straight line segment will map into a single point in the Hough transform space. The Hough transform technique has also been applied to printed and hand written Hebrew characters by M. Kushnir et al., in "An Application of the Hough Transform to the Recognition of Printed Hebrew Characters", Pattern Recognition, Vol. 16 (1983) pp. 183-191, and in "Recognition of Hand Printed Hebrew Characters Using Features Selected in the Hough Transform Space", Pattern Recognition, Vol. 18 (1985) pp. 103-114.
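The mapping of collinear points to a single point in the transform space can be illustrated with the standard textbook form of the Hough transform, in which a line is parameterized as rho = x cos(theta) + y sin(theta). The sketch below is not the modified transform of Cheng et al.; the function name hough_votes, the one-degree angle step and the rounding of rho to an integer are assumptions made for the example.

```python
import math
from collections import Counter


def hough_votes(points, theta_step_deg=1):
    """Accumulate votes in (theta, rho) space for a set of dark-pixel points.

    Each point votes once per sampled angle, for the line through that
    point whose normal makes the angle theta with the x axis.
    """
    votes = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(theta_deg, rho)] += 1
    return votes


# Ten dark pixels on the vertical line x = 3.
points = [(3, y) for y in range(10)]
votes = hough_votes(points)
print(votes[(0, 3)])  # 10: every point votes for theta = 0 degrees, rho = 3
```

Because of the coarse quantization, cells at neighboring angles may collect the same number of votes, so a practical detector would need some form of peak selection beyond a simple maximum.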
Siromoney et al., in "Computer Recognition of Printed Tamil Characters," Pattern Recognition, vol. 10 (1979), pp. 243-247, use a run length encoding approach, applied to each line of a digitized character, to recognize and distinguish between Tamil characters.
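Run length encoding of a single line of a digitized character can be sketched as follows. This is a generic illustration, not the specific encoding of Siromoney et al.; the convention that every row begins with a (possibly empty) run of light pixels, and the names rle_encode_row and rle_decode_row, are assumptions.

```python
def rle_encode_row(row):
    """Encode a row of 0/1 pixels as alternating run lengths.

    The first run always counts light (0) pixels, so a row that starts
    with a dark pixel begins with a run of length 0.
    """
    runs = []
    current, length = 0, 0
    for pixel in row:
        if pixel == current:
            length += 1
        else:
            runs.append(length)
            current, length = pixel, 1
    runs.append(length)
    return runs


def rle_decode_row(runs):
    """Expand alternating run lengths back into a row of 0/1 pixels."""
    row, value = [], 0
    for n in runs:
        row.extend([value] * n)
        value ^= 1
    return row


print(rle_encode_row([0, 0, 1, 1, 1, 0]))  # [2, 3, 1]
print(rle_encode_row([1, 1, 0]))           # [0, 2, 1]
```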
Chinnuswamy et al., in "Recognition of Handprinted Tamil Characters," Pattern Recognition, vol. 12 (1980), pp. 141-152, apply stroke characterization techniques to linear and curvilinear segments that make up a Tamil character and use computed correlation coefficients for character recognition.
In "Automatic Recognition of Farsi Texts," Pattern Recognition, vol. 14 (1982), pp. 395-403, Parhami et al. discuss five difficulties in recognition of Farsi text and disclose a method that combines digitization, line separation, sub-word and character separation and geometrical characterization for Farsi character and text recognition.
In "Computer Recognition of Arabic Cursive Scripts," Pattern Recognition, vol. 21 (1988), pp. 293-302, El-Sheikh et al. use segmentation of words to obtain individual characters and use truncated Fourier analysis to obtain descriptors of each Arabic character.
Yhap et al. disclose the use of 72 constituent shapes or stroke combinations for Chinese character recognition in "An On-line Chinese Character Recognition System," IBM Jour. Res. Develop. vol. 25 (1981), pp. 187-195. About 2200 characters can be recognized by this method, but not all characters are described solely in terms of these constituent shapes.
Spivey, in "Data Compression Technique for APA Printer (Change Block Skipping)", IBM Tech. Disclos. Bull. vol. 23 (1981), pp. 5464-5467, compares each scan line of pixels representing an image with the preceding scan line, noting only the changes, if any, in each corresponding group of four or eight consecutive pixels. The net compression achievable in the example given by Spivey would probably disappear when applied to a complex shape such as a Kanji character. The following article by Spivey, ibid., pp. 5468-5470, also discusses application of Change Block Skipping.
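The general idea of comparing each scan line with its predecessor, block by block, can be sketched as follows. This is a loose illustration and not Spivey's actual data format: the all-light line assumed before the first scan line, the (block index, block contents) output tuples and the names cbs_encode and cbs_decode are assumptions.

```python
def cbs_encode(rows, block=8):
    """For each scan line, record only the pixel blocks that differ from
    the same blocks of the preceding line (an all-light line is assumed
    before the first one)."""
    width = len(rows[0])
    prev = [0] * width
    encoded = []
    for row in rows:
        changes = []
        for start in range(0, width, block):
            blk = row[start:start + block]
            if blk != prev[start:start + block]:
                changes.append((start // block, tuple(blk)))
        encoded.append(changes)
        prev = row
    return encoded


def cbs_decode(encoded, width, block=8):
    """Rebuild the scan lines by patching changed blocks into a copy of
    the preceding line."""
    prev = [0] * width
    rows = []
    for changes in encoded:
        row = list(prev)
        for index, blk in changes:
            row[index * block:index * block + len(blk)] = blk
        rows.append(row)
        prev = row
    return rows


rows = [[0] * 16, [0] * 8 + [1] * 8, [0] * 8 + [1] * 8]
encoded = cbs_encode(rows)
print([len(changes) for changes in encoded])  # [0, 1, 0]
assert cbs_decode(encoded, 16) == rows
```

When most blocks change from one scan line to the next, as may happen with a dense character pattern, nearly every block is emitted and little is saved, consistent with the observation above.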
K. Toraichi et al., in "Handprinted Chinese Character Database", published in Computer Recognition and Human Production of Handwriting, ed. by R. Plamondon et al., World Scientific Publishing Co., 1988, pp. 131-148, have analyzed 48,000 characters, divided into 12 sets of 4,000 categories each, of handprinted Chinese characters and have determined statistical profiles of each category (numbers of connected components, "holes", contours, etc.). They have also determined the "horizontal complexity" and "vertical complexity", and "contour gradients", as defined therein, of each category. Much data are presented, but the significance of several of the statistical parameters is not made clear.
Scan-oriented methods of character recognition and encoding scan the original character, for example, line by line in a horizontal or vertical direction, in a predetermined path that is independent of the character. These approaches are the easiest and least expensive to implement but often produce only modest data compression. Fitch and Spivey, in "Font Data Reduction by Scan Compression for Ink Jet Printers", IBM Technical Disclosure Bulletin, vol. 23 (1981), pp. 5471-5472, disclose use of a run length encoding scheme, change block skipping, in which only the positions of changes in pixel values (dark-to-light or light-to-dark) are encoded. In "Compression/Decompression of Font Patterns", IBM Technical Disclosure Bulletin, vol. 28 (1986), pp. 3563-3564 (anonymous), consecutive scan lines of all-light pixels are represented by a single number, and only scan lines with one or more dark pixels therein are represented by full-detail bit patterns. Each of these approaches achieves a modest reduction in the amount of bit map information required to represent a character. However, as noted above, pure scan-oriented methods may produce no reduction when applied to complex characters such as those drawn from a Kanji or Hebrew character set.
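The second scheme described above, in which consecutive all-light scan lines collapse to a single count while any line containing a dark pixel is kept in full, can be sketched as follows. The token format (an integer for a run of blank lines, a tuple for a full bit pattern) and the names encode_blank_runs and decode_blank_runs are assumptions, not the format of the cited disclosure.

```python
def encode_blank_runs(rows):
    """Replace each run of all-light scan lines with its length; keep
    every scan line that has a dark pixel as a full bit pattern."""
    tokens = []
    blank = 0
    for row in rows:
        if any(row):
            if blank:
                tokens.append(blank)
                blank = 0
            tokens.append(tuple(row))
        else:
            blank += 1
    if blank:
        tokens.append(blank)
    return tokens


def decode_blank_runs(tokens, width):
    """Expand counts back into all-light scan lines of the given width."""
    rows = []
    for token in tokens:
        if isinstance(token, int):
            rows.extend([0] * width for _ in range(token))
        else:
            rows.append(list(token))
    return rows


rows = [[0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(encode_blank_runs(rows))  # [2, (0, 1, 0), 1]
```

A character whose strokes touch nearly every scan line yields few or no blank-line tokens, which illustrates why such a scheme may offer little reduction for complex characters.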
Horizontal, vertical and slanted strokes (line segments) are used by Sugita et al. for Kanji character recognition and encoding in "Multi-font Kanji Generator", Trans. I.E.C.E., vol. E66 (1983), pp. 377-382. The two pixel end points of the line segment are specified, and intermediate (dark) pixels are determined by interpolation. This is a variation on a one-dimensional scan-oriented approach in which scan lines in any and all directions are used. A change of the font used, for example, from the well known Mincho font to another style, is implemented by changing the interpolation rules.
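Storing only the two end points of a stroke and recovering the intermediate dark pixels by interpolation can be sketched as follows. The interpolation rule used here is Bresenham's line algorithm, chosen as a familiar stand-in; it is not claimed to be the rule used by Sugita et al., and the name interpolate_stroke is hypothetical.

```python
def interpolate_stroke(p0, p1):
    """Return the pixels of a straight stroke from p0 to p1 (inclusive),
    using Bresenham's integer line algorithm."""
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    pixels = []
    while True:
        pixels.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:      # step in x
            err += dy
            x0 += sx
        if e2 <= dx:      # step in y
            err += dx
            y0 += sy
    return pixels


print(interpolate_stroke((0, 0), (3, 0)))  # [(0, 0), (1, 0), (2, 0), (3, 0)]
print(interpolate_stroke((0, 0), (2, 2)))  # [(0, 0), (1, 1), (2, 2)]
```

Under this representation a stroke of any length costs two coordinate pairs, and substituting a different interpolation rule would change the rendered style without changing the stored end points.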
Maeder, in "Local Block Pattern Methods for Binary Image Encoding", Proceedings of the 1988 Ausgraph Conference, discloses use of a neighborhood expansion approach in which each of a collection of dark and light pixel neighborhoods is expanded one row or line at a time, using a collection of like-row or like-column expansion rules that are only partly enumerated in the paper. Applied to a complex Kanji character, this approach would likely produce a large number of small neighborhoods with little or no similarity to one another. However, this approach does attempt to exploit two-dimensional similarity in character recognition and encoding.
Many of these techniques produce some characters that are either incomplete, contain extra line segments, are not esthetically pleasing, or offer relatively little reduction in the information required to be stored in memory to represent each character in the character set. What is needed is an approach that will provide a 30-70 percent reduction in the amount of information required to exactly represent each character in a character set and will provide the same amount of resolution, upon decompression, as is used to represent each of the characters in the original images.