Reference is made to U.S. Pat. No. 5,020,112, and to a related publication by one of us (Chou) in SPIE, Vol. 1199, Visual Communications and Image Processing IV, pages 852-863 (1989), whose contents are herein incorporated. The patent describes a background which equally applies here, and explains the difference between object recognition with and without distinguishing the underlying structure in the image, the former being referred to as intelligent image recognition. Both the patent and the paper describe intelligent recognition of a bitmapped binary image file generated by any commercial scanning device, and go on to describe a method, including the code in an appendix of the patent, for processing this binary image to intelligently recreate the hard copy source document from which the binary image file was generated.
The paper in particular describes how such binary images are decoded, using a stochastic type of grammar that has proved of value in speech recognition. The conclusion reached by the author was that, for this particular application of recognition systems, i.e., image instead of speech, a stochastic grammar of the context-free type is most suitable, and regular stochastic grammars, despite their inherently shorter parsing time, are unsuitable.
It will be appreciated that any graphics system is based on an imaging model, which is the set of rules that determine how the image of an object is generated from a description of the object's underlying structure, and the formal grammars used to parse the resultant pixel image are based on that same imaging model.
Previous attempts to use formal grammars to describe a 2-dimensional (2-d) image structure, as in the referenced patent and paper (see also Tomita, ACM International Workshop on Parsing Technologies, 1989), have all taken the approach of generalizing 1-dimensional (1-d) formalisms by replacing the notion of a 1-d phrase with the notion of a 2-d rectangular region. The resulting grammar rules typically describe how a region corresponding to some phrase is formed by combining a pair of horizontally or vertically abutting subregions. Rectangular subregions may be combined only if they do not overlap and if their dimensions and relative positions are such that the composite region is also rectangular. One disadvantage of this approach is that the 2-d counterparts to regular (finite-state) string grammars are not particularly useful for image modeling, with the result that only context-free 2-d grammars have been investigated. As mentioned in the paper, the computational consequence of using a context-free grammar is that parsing time, in general, is O(n^3) in the number of terminal symbols (e.g. pixels), compared with O(n) for regular grammars. As a result, applying context-free grammars directly to image pixels does not produce a particularly practical system.
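The rule for combining rectangular subregions described above can be illustrated with a minimal sketch. It is not the code of the referenced patent; the `Rect` type and the `combine` function are hypothetical names, and it is assumed that coordinates are integer pixel positions with exclusive right and bottom edges.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rect:
    """Axis-aligned pixel region (hypothetical type); right/bottom exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def combine(a: Rect, b: Rect) -> Optional[Rect]:
    """Return the composite of a followed by b, or None if not permitted.

    The pair combines only when the regions abut without overlapping and
    their union is itself a rectangle. The order is significant, as in a
    grammar rule: a is the left (or upper) subregion, b the right (or lower).
    """
    # Horizontal abutment: b begins where a ends and both span the same rows.
    if a.right == b.left and a.top == b.top and a.bottom == b.bottom:
        return Rect(a.left, a.top, b.right, a.bottom)
    # Vertical abutment: b begins below a and both span the same columns.
    if a.bottom == b.top and a.left == b.left and a.right == b.right:
        return Rect(a.left, a.top, a.right, b.bottom)
    # Overlapping, separated, or misaligned regions cannot be combined.
    return None


side_by_side = combine(Rect(0, 0, 10, 20), Rect(10, 0, 25, 20))
overlapping = combine(Rect(0, 0, 10, 20), Rect(8, 0, 25, 20))
misaligned = combine(Rect(0, 0, 10, 20), Rect(10, 5, 25, 20))
```

Under these assumptions only the first pair yields a composite region; the overlapping and the misaligned pairs are both rejected.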
Moreover, an imaging model based on the requirement that regions do not overlap is a minor problem in applying this approach to images of text or equations, since characters (e.g. `j`) may have negative sidebearings, and is a significant impediment to applying this approach to more complex graphical images, such as music notation. Another disadvantage of these previous attempts is that a recognition grammar is typically validated by using it to recognize examples of the images being modeled. This process can be time-consuming and inconvenient if recognition time is long.
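The sidebearing problem noted above can be made concrete with a small sketch. The glyph metrics below are invented for illustration and are not taken from any real font; `Glyph`, `ink_spans`, and `spans_overlap` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Simplified horizontal metrics for one character (illustrative only)."""
    name: str
    left_bearing: int  # offset from glyph origin to left edge of ink; may be negative
    ink_width: int     # width of the ink bounding box
    advance: int       # distance from this origin to the next glyph's origin


def ink_spans(glyphs, start_x=0):
    """Return (name, left, right) ink extents for glyphs set on one line.

    The right edge is exclusive.
    """
    spans = []
    x = start_x
    for g in glyphs:
        left = x + g.left_bearing
        spans.append((g.name, left, left + g.ink_width))
        x += g.advance
    return spans


def spans_overlap(s, t):
    """True if two (name, left, right) extents share at least one pixel column."""
    return s[1] < t[2] and t[1] < s[2]


# Made-up metrics: the 'j' has a negative left sidebearing.
line = [Glyph("a", 1, 8, 10), Glyph("j", -2, 5, 6)]
spans = ink_spans(line)
collides = spans_overlap(spans[0], spans[1])
```

With these assumed metrics the ink of the `j` begins at column 8 while the ink of the preceding character extends through column 8, so the two bounding regions overlap and cannot be joined by a rule that requires non-overlapping rectangles.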