The following relates to the graphical information processing arts. It is described with example reference to processing and utilization of page description language (PDL) graphical content. However, the following is amenable to processing and utilization of graphical content in other formats, and to other like applications.
Documents commonly include textual content and graphical content. In portable document format (PDF), PostScript, scalable vector graphics (SVG), or other existing document representation formats, textual content is typically represented by a suitable character-based code along with optional text attributes such as font type, font size, and so forth, while graphical content is typically represented by a vector-based language in which objects are specified by coordinates and optional attributes. For example, a line segment object may be represented by starting and ending coordinates and color and line width attributes, while a filled square object may be represented by coordinates of two opposite corners and a color attribute.
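As an illustration only, the two example objects just described might be modeled as follows. This is a minimal sketch: the class names `LineSegment` and `FilledSquare` and their field names are hypothetical and do not correspond to the syntax of PDF, PostScript, SVG, or any other particular format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineSegment:
    """Hypothetical vector object: two endpoints plus optional attributes."""
    x0: float
    y0: float
    x1: float
    y1: float
    color: str = "black"
    width: float = 1.0


@dataclass(frozen=True)
class FilledSquare:
    """Hypothetical vector object: two opposite corners plus a fill color."""
    x0: float
    y0: float
    x1: float
    y1: float
    color: str = "black"


# A horizontal red line segment and a blue filled square, each specified
# by a handful of coordinates and attributes.
segment = LineSegment(0.0, 0.0, 10.0, 0.0, color="red", width=0.5)
square = FilledSquare(0.0, 0.0, 4.0, 4.0, color="blue")
```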
Document analysis is typically performed with respect to textual content of documents. For example, portions of text duplicated in multiple documents are readily detectable and can be used to identify and correlate related documents. Text can also be structured based on content, for example by converting the text to structured XML in which the abstract, headings, paragraphs, and so forth are resolved into structures. These and other types of document analysis are useful for creating searchable knowledge bases for organizing and locating documents of interest.
Document analysis with respect to graphical content is not as well developed. Graphical document analysis is difficult because visually similar or identical graphical content can typically be represented in a multiplicity of different ways. For example, a line segment of length L can be constructed using a single line segment, or using two abutting parallel line segments of lengths L/3 and 2L/3, respectively, or by using two overlapping parallel line segments of length 2L/3 each with an overlap of L/3, or so forth. Similarly, a filled square graphic can be represented as a single filled square object, or as two adjacent filled triangular objects, or as four adjacent smaller filled square objects, or as two overlapping filled square objects, or so forth.
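The line segment example can be checked with a short sketch. It assumes, for simplicity, that all segments lie on one common line, so that each segment reduces to a one-dimensional interval (start, end). The helper `covered_intervals` is a hypothetical name introduced here for illustration, and L is arbitrarily set to 3 units.

```python
from fractions import Fraction


def covered_intervals(segments):
    """Merge 1-D intervals (start, end) into sorted, disjoint intervals.

    Abutting or overlapping intervals are fused into a single interval.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Touches or overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


L = Fraction(3)  # illustrative length; exact fractions avoid rounding error

single = [(Fraction(0), L)]                               # one segment of length L
abutting = [(Fraction(0), L / 3), (L / 3, L)]             # lengths L/3 and 2L/3
overlapping = [(Fraction(0), 2 * L / 3), (L / 3, L)]      # 2L/3 each, overlap L/3

# All three object lists cover the same extent, although their
# object-level representations differ.
same_extent = (
    covered_intervals(single)
    == covered_intervals(abutting)
    == covered_intervals(overlapping)
)
```

A direct comparison of the three object lists would report them as different, while the merged extents are identical.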
Because of the multiplicity of possible representations for visually similar or identical graphical content, identifying similar graphical content, identifying graphical objects of interest in graphical content, or performing other types of graphical document analysis is challenging.
One approach for facilitating graphical document analysis is to raster-process the vector-based graphical content to form a dot-matrix representation. However, this approach has substantial disadvantages. The underlying groupings of graphical objects (such as into filled polygons, line segments, or so forth) are lost, making analysis difficult in a dot-matrix representation. Dot-matrix representation of graphical content is also inefficient. For example, in a vector-based representation, a two-dimensional line segment is suitably represented by four numeric values indicating x- and y-coordinates of the endpoints and perhaps an additional one, two, or a few numeric values to represent the line color, line width, or so forth. When converted to a dot-matrix representation, this same line segment occupies a two-dimensional portion of the dot-matrix, with each point represented by intensity and color values. The data needed to represent the line in the dot-matrix thus increases substantially over the vector-based representation. Still further, conversion of graphical content to a dot-matrix representation is usually lossy, as the graphical content is converted to the resolution of the dot-matrix.
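The size increase and the loss of precision can both be illustrated with a rough sketch. It assumes a horizontal stroke, a square pixel grid, and three stored values per pixel. The function `rasterize_horizontal`, the coordinates, and the constants are hypothetical choices made for illustration, not part of any particular raster image processor.

```python
import math


def rasterize_horizontal(x0, x1, y, width, pixel_size=1.0):
    """Return the set of (col, row) pixels touched by a horizontal stroke."""
    col_start = math.floor(min(x0, x1) / pixel_size)
    col_end = math.ceil(max(x0, x1) / pixel_size)
    row_start = math.floor((y - width / 2) / pixel_size)
    row_end = math.ceil((y + width / 2) / pixel_size)
    return {
        (col, row)
        for col in range(col_start, col_end)
        for row in range(row_start, row_end)
    }


# Vector form: four coordinates plus two attributes (color, width).
x0, x1, y, width = 0.25, 100.75, 10.0, 2.0
vector_values = 4 + 2

# Dot-matrix form: every touched pixel stores its own values.
VALUES_PER_PIXEL = 3  # assumption: e.g. three color channels per pixel
pixels = rasterize_horizontal(x0, x1, y, width)
raster_values = len(pixels) * VALUES_PER_PIXEL

# Lossiness: the horizontal extent recovered from the pixels is snapped
# to the grid, so the original endpoints 0.25 and 100.75 are not recoverable.
cols = [col for col, _ in pixels]
recovered_extent = (min(cols) * 1.0, (max(cols) + 1) * 1.0)
```

Under these assumptions the stroke touches 202 pixels, so the dot-matrix form holds 606 values against 6 in the vector form, and the recovered extent is (0.0, 101.0) rather than (0.25, 100.75).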