The present invention relates to processing electronic documents.
Written or printed language is typically represented as a sequence of characters. A character is an abstract symbol that can be used as a building block for representing more complex concepts, e.g., text. An electronic document, i.e., a named electronic data collection, can represent text using coded or non-coded representations.
A coded representation specifies a sequence of code values, where each code value represents a character of the text. A coded representation is based on a character encoding that defines a character collection and, for each character in the collection, a code value identifying the character. The character encoding therefore provides a mapping between the code values in the coded representation and the corresponding characters in the collection. Coded representations are typically based on a standard character encoding, such as the American Standard Code for Information Interchange (“ASCII”) or the Unicode Standard (“Unicode”). The ASCII encoding defines a character collection that includes 128 characters corresponding to letters, numbers, punctuation marks, and other abstract symbols used to represent printed English text. Unicode is a language-independent encoding that defines a character collection including more than 65,000 characters. Alternatively or in addition, coded representations can be based on a character encoding that is manufacturer-specific or application-specific.
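As an illustration only, the mapping between code values and characters can be sketched in a few lines of Python. The sample code values and the helper name `decode_code_values` are assumptions made for this sketch and do not come from any particular document format; only Python's built-in ASCII codec is relied on.

```python
# Illustrative sketch: a coded representation is a sequence of code values
# that a character encoding maps to characters.

# Hypothetical coded representation: three ASCII code values.
code_values = [72, 105, 33]


def decode_code_values(values, encoding="ascii"):
    """Map each code value to its character under the named encoding."""
    return bytes(values).decode(encoding)


# Code values -> characters.
text = decode_code_values(code_values)

# Characters -> code values (the mapping runs in both directions).
round_trip = list(text.encode("ascii"))

print(text)
print(round_trip)
```

The same code values would identify different characters, or none at all, under a manufacturer-specific or application-specific encoding, which is why a coded representation is meaningful only together with its character encoding.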
In addition to code values, a coded representation typically specifies one or more fonts. A font associates glyphs with code values that represent characters according to a character encoding. A glyph is a graphical representation that specifies a visual appearance of the abstract symbol of a character, and typically includes one or more bitmap or vector graphics objects. For example, standard fonts can specify glyphs with different designs, called typefaces, such as Times Roman, Helvetica, or Courier. The association between the glyphs and the code values can be direct or indirect, explicit or implicit. In any case, when the coded representation is printed or displayed, the characters of the text are rendered using the glyphs associated with the code values. Typically, there is a semantic relation between a glyph and the character represented by the associated code value, i.e., the glyph recognizably represents the character to a human reader. Optionally, however, a font can associate a default glyph with code values that represent characters for which the font has no glyph with a proper semantic relation.
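The association between code values and glyphs, including the optional default glyph, can be sketched as a simple lookup table. This is a minimal model under stated assumptions: the `Glyph` and `Font` classes, the 3x3 bitmaps, and the `.notdef` name for the default glyph are all hypothetical choices for illustration and do not describe any real font format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Glyph:
    """Hypothetical glyph: a name plus a tiny bitmap ('1' = ink, '0' = blank)."""
    name: str
    bitmap: Tuple[str, ...]


# Assumed default glyph, used when the font has no proper glyph for a code value.
DEFAULT_GLYPH = Glyph(".notdef", ("111", "101", "111"))


class Font:
    """Hypothetical font: a direct, explicit map from code values to glyphs."""

    def __init__(self, glyphs: Dict[int, Glyph], default: Glyph = DEFAULT_GLYPH):
        self._glyphs = dict(glyphs)
        self._default = default

    def glyph_for(self, code_value: int) -> Glyph:
        # Fall back to the default glyph for unmapped code values.
        return self._glyphs.get(code_value, self._default)


def render(code_values: List[int], font: Font) -> List[Glyph]:
    """Select the glyph used to draw each code value, in order."""
    return [font.glyph_for(cv) for cv in code_values]


toy_font = Font({
    ord("I"): Glyph("I", ("010", "010", "010")),
    ord("L"): Glyph("L", ("100", "100", "111")),
})

# "Z" has no glyph in the toy font, so the default glyph is selected.
glyph_names = [g.name for g in render([ord("I"), ord("L"), ord("Z")], toy_font)]
print(glyph_names)
```

In this sketch the association is direct and explicit; an indirect association could be modeled by adding an intermediate table between code values and glyph identifiers.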
A non-coded representation specifies a visual representation, i.e., an image, of a text without reference to a character encoding. For example, a non-coded representation can include text scanned from a printed document. The scanned text is typically represented as bitmap graphics objects without any coding information about the characters in the text. Similarly, several applications (e.g., AutoCAD® design software, available from Autodesk, Inc. of San Rafael, Calif.) print text to electronic documents (e.g., in Adobe® Portable Document Format, “PDF”) using only graphics objects representing the glyphs, without specifying the corresponding code values. While a human can read text in an image that is specified by a non-coded representation, text processing applications, such as word processors or text layout programs, cannot manipulate the text without coding information that identifies the characters represented by the non-coded representation.
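The practical difference between the two kinds of representation can be sketched as follows. The `TextRun` structure and the `extract_text` helper are hypothetical and are used only to show why a text processing application has nothing to work with when the code values are absent; they are not taken from PDF or from any other real format.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class TextRun:
    """Hypothetical run of text in an electronic document.

    glyph_images: the graphics objects that are drawn (tiny bitmaps here).
    code_values:  the character codes, or None for a non-coded representation.
    """
    glyph_images: Sequence[Tuple[str, ...]]
    code_values: Optional[Sequence[int]] = None


def extract_text(run: TextRun) -> Optional[str]:
    """Return the characters of a run, or None if no coding information exists."""
    if run.code_values is None:
        # Only the image is available; the characters cannot be identified
        # without some further recognition step.
        return None
    return "".join(chr(cv) for cv in run.code_values)


letter_i = ("010", "010", "010")
letter_l = ("100", "100", "111")

coded_run = TextRun(glyph_images=[letter_i, letter_l],
                    code_values=[ord("I"), ord("L")])
non_coded_run = TextRun(glyph_images=[letter_i, letter_l])

print(extract_text(coded_run))
print(extract_text(non_coded_run))
```

Both runs would look identical when displayed, since they carry the same graphics objects, but only the coded run can be searched, edited, or reflowed as text.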