In addition to producing physical renderings of digital documents, e.g. paper prints, exchanging and archiving the digital documents themselves play an increasing role in business as well as private communications. In order to facilitate exchange and provide universal access regardless of computer system and application, general page description languages are used instead of native word processor formats for exchanging digital documents. In order to reuse the text contents of digital documents for archiving, indexing, searching, editing, and other purposes which are not related to producing a visual rendering of the page, it is desirable to convert the text using some standard character identification (encoding).
Since digital documents may contain characters from arbitrary scripts and languages in any combination, a preferred choice for such a character identification is the Unicode standard, almost identical to ISO 10646. Unicode is widely recognized as the only universal standard capable of encoding all characters which are in use world-wide. The Unicode sequence corresponding to a given text string provides the semantics of the text. Mapping the text contents of a digital document to Unicode is highly advantageous for all processes which rely on the text semantics, such as searching, editing, or converting to other formats, such as XML.
In addition, the ability to create a semantically equivalent text version of a graphically rendered page may facilitate the accessibility of PDF (Portable Document Format) documents for physically impaired users (e.g. software for reading the text to blind users). If only a graphical representation is available, without proper semantics, other forms of usage are impossible.
The importance of preserving the semantics of a digital document by providing proper Unicode mappings for the text contained in the document is emphasized by the forthcoming ISO 19005-1 standard for PDF in Archiving, or PDF/A. PDF/A strives to define a subset of PDF which is suited for long-term preservation and archival in order to make sure that PDF documents can be used decades from now, even using software systems and applications which are completely different from those in use today. The conditions stated by PDF/A eliminate all ambiguous constructs which may thwart faithful rendition of the document in the future.
In addition, the “full conformance level” of PDF/A mandates the availability of complete and correct Unicode mapping information for all text contents. As opposed to the “minimum conformance level,” which guarantees only faithful graphical representation, the full conformance level also guarantees preservation of the underlying semantics of the document, which is a highly advantageous aspect of long-term preservation.
Digital document formats such as the PDF (Portable Document Format) use a variety of data structures for representing textual content. The use of various font formats, encoding schemes, and combinations thereof results in a variety of methods for mapping the bytes in a page description to readable text on the page. While these methods generally allow faithful visual rendition, Unicode mappings (and therefore the semantics of the text) are not always available in the digital document. In some cases Unicode mappings are provided explicitly in the PDF document, sometimes they can be derived indirectly using well-known methods, and in some cases substantial effort may be required to provide Unicode mappings.
PDF documents can use various techniques and data structures for representing text on a page. The choice of font and encoding, as well as the kind and volume of information for Unicode mapping, typically depends on the software creating the PDF. Many considerations (ease of development, project requirements, internationalization issues, schedules) may influence the font output created by a particular program for creating PDF, and therefore the degree and reliability of Unicode mappings. While in recent years the awareness of the importance of proper Unicode mappings among developers of PDF-creating software has increased and, as a result, more products create PDF output with reliable explicit Unicode mappings, a large number of existing (“legacy”) PDF documents do not contain explicit or complete information for Unicode mapping.
In the following description the terms “character” and “glyph” are used; it is important to distinguish these concepts. “Characters” are the smallest units which convey information in a language. Common examples are the letters of the Latin alphabet, Chinese ideographs, and Japanese syllables. Characters have a meaning; they are semantic entities. The Unicode standard encodes characters. “Glyphs” are different graphical variants which represent one or more particular characters. Glyphs have an appearance; they are representational entities. Fonts are used to produce visual representations of glyphs. There is no one-to-one relationship between characters and glyphs. For example, a ligature is a single glyph which corresponds to two or more separate characters.
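The distinction between characters and glyphs can be illustrated with a minimal sketch. The glyph names and the table below are hypothetical and are not taken from any particular font; they merely show that several glyphs may represent the same character and that one glyph (a ligature) may represent several characters.

```python
# Hypothetical glyph inventory: glyph name -> the characters it represents.
# Names and contents are illustrative assumptions, not data from a real font.
GLYPH_TO_CHARACTERS = {
    "a": "a",
    "a.swash": "a",   # a second glyph (graphical variant) for the same character
    "f": "f",
    "i": "i",
    "fi": "fi",       # ligature: one glyph, two characters
}


def glyphs_to_text(glyph_names):
    """Return the character sequence (semantics) for a sequence of glyphs."""
    return "".join(GLYPH_TO_CHARACTERS[name] for name in glyph_names)


# Two different glyph sequences carry the same text semantics.
with_ligature = glyphs_to_text(["fi", "a.swash"])
without_ligature = glyphs_to_text(["f", "i", "a"])
```

Both glyph sequences yield the same three characters, although the first uses two glyphs and the second uses three.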
Certain classes of fonts are supported in PDF. For “simple” fonts (e.g., PostScript Type 1, TrueType, and Type 3 fonts), each glyph on the page is identified by an 8-bit value which is used to index the encoding vector, an array containing up to 256 glyph names. The glyph name in turn is used to locate the glyph outline description within the font data to draw the glyph shape. The encoding can explicitly or implicitly be specified in the PDF file. Some simple fonts don't have an explicit encoding entry with glyph names, but use a “builtin” encoding. The builtin encoding is part of the font outline data which may be embedded in the PDF document, or may be available from an external source such as from the operating system or from an external file.
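The lookup chain for simple fonts (8-bit code, encoding vector, glyph name) and the derivation of Unicode values from standardized glyph names might be sketched as follows. The function names, the sample encoding vector, and the glyph name "G01" are assumptions made for illustration; the glyph name table is only a four-entry excerpt in the style of the Adobe Glyph List, and the handling of "uniXXXX" names is simplified to a single four-digit value.

```python
import re

# Small excerpt of standardized glyph names (illustrative, far from complete).
STANDARD_GLYPH_NAMES = {
    "A": "\u0041",
    "space": "\u0020",
    "adieresis": "\u00e4",
    "Euro": "\u20ac",
}


def glyph_name_to_unicode(name):
    """Return the Unicode string for a glyph name, or None if it is non-standard."""
    if name in STANDARD_GLYPH_NAMES:
        return STANDARD_GLYPH_NAMES[name]
    match = re.fullmatch(r"uni([0-9A-F]{4})", name)  # simplified "uniXXXX" form
    if match:
        return chr(int(match.group(1), 16))
    return None  # e.g. algorithmically generated names such as "G01"


def decode_simple_font(data, encoding_vector):
    """Map each 8-bit code to (code, glyph name, Unicode string or None)."""
    return [
        (code, encoding_vector[code], glyph_name_to_unicode(encoding_vector[code]))
        for code in data
    ]


# Hypothetical encoding vector with 256 slots.
encoding_vector = [".notdef"] * 256
encoding_vector[0x41] = "A"
encoding_vector[0x80] = "Euro"
encoding_vector[0x81] = "G01"      # non-standard name: no Unicode value derivable
encoding_vector[0x82] = "uni00E4"

decoded = decode_simple_font(b"\x41\x80\x81\x82", encoding_vector)
```

In this sketch the code 0x81 resolves to a glyph name, so the glyph can still be drawn, but no Unicode value can be derived from the name.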
For CID fonts, each glyph is identified by a sequence of one or more 8-bit values, where the sequences may have varying lengths. Using a mapping scheme called CMap (Character Map), these sequences are mapped to a CID value (Character ID). These CIDs can refer to predefined tables, so-called “character collections.” For example, Adobe Systems, Inc., the developer of PDF, makes available character collections for Chinese, Japanese, and Korean. The combination of CID (a numerical code) and a named character collection uniquely identifies the glyph. Since the character collections are well-known, Unicode mappings for all character collections can be prepared in advance, and are actually made publicly available by Adobe Systems, Inc. The availability of these mapping tables facilitates Unicode mappings for the well-known character collections and predefined CMaps. However, some CID fonts do not refer to a predefined character collection, but to some other mapping scheme which is internal to the font (e.g., Identity-H and Identity-V CMaps). Unlike CID fonts with predefined CMaps, CID fonts with Identity CMaps do not allow Unicode mapping using predefined tables.
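The two-stage mapping for CID fonts (byte sequence to CID via a CMap, then CID to Unicode via a table prepared for the character collection) might be sketched as follows. All names and values are assumptions: the collection name, the code-to-CID entries, and the CID-to-Unicode entries are made up for illustration and do not reproduce any published character collection, and the rule for deciding the code length is a simplification of real CMap codespace ranges.

```python
# Hypothetical character collection and tables (all values are made up).
DEMO_COLLECTION = "Demo-Collection-0"
DEMO_CMAP = {0x41: 34, 0x8140: 633}              # code -> CID
UNICODE_TABLES = {                               # collection -> (CID -> Unicode)
    DEMO_COLLECTION: {34: "A", 633: "\u3000"},
}


def split_codes(data):
    """Split bytes into codes of varying length.

    Simplified assumption: a first byte up to 0x7F is a one-byte code,
    any other first byte starts a two-byte code.
    """
    codes, i = [], 0
    while i < len(data):
        length = 1 if data[i] <= 0x7F else 2
        codes.append(int.from_bytes(data[i:i + length], "big"))
        i += length
    return codes


def codes_to_cids(codes, cmap):
    """Apply a CMap; cmap=None models an Identity CMap (CID equals code)."""
    return [code if cmap is None else cmap.get(code, 0) for code in codes]


def cids_to_unicode(cids, collection):
    """Look up Unicode values in the table prepared for a known collection."""
    table = UNICODE_TABLES.get(collection)
    if table is None:              # no predefined collection: nothing to look up
        return [None] * len(cids)
    return [table.get(cid) for cid in cids]


# Predefined CMap and known collection: Unicode values are available.
cids = codes_to_cids(split_codes(b"\x41\x81\x40"), DEMO_CMAP)
text = cids_to_unicode(cids, DEMO_COLLECTION)

# Identity CMap: the CIDs are font-internal, so no predefined table applies.
identity_cids = codes_to_cids([0x0003, 0x0011], None)
identity_text = cids_to_unicode(identity_cids, None)
```

The second case reflects the situation described above: the CIDs can still be used to select glyphs for rendering, but the predefined tables offer no route to Unicode.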
Regardless of the font class (simple font or CID font), additional optional data structures may provide Unicode mappings for some or all of the glyphs in a font (e.g., ToUnicode CMap; not to be confused with the CMaps used for CID fonts) or some instances of text on the page (e.g., ActualText for Tagged PDF). However, such additional data structures are not necessarily present. If a ToUnicode CMap is present, the PDF-generating software usually creates it from information provided in the corresponding font outline file.
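A ToUnicode CMap associates glyph codes with Unicode strings, which are written as hexadecimal UTF-16BE values. The following minimal sketch reads only the "bfchar" sections of such a CMap; range sections ("bfrange") and all other CMap syntax are deliberately ignored, and the function name and the sample CMap fragment are assumptions made for illustration.

```python
import re


def parse_bfchar(cmap_text):
    """Return {code: Unicode string} for all bfchar entries in a ToUnicode CMap."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(
            r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section
        ):
            # The destination is a UTF-16BE string and may hold several characters.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


# Hypothetical fragment: code 0x01 maps to "A", code 0x02 (a ligature glyph)
# maps to the two characters "f" and "i".
SAMPLE = """
2 beginbfchar
<01> <0041>
<02> <00660069>
endbfchar
"""

to_unicode = parse_bfchar(SAMPLE)
missing = to_unicode.get(0x03)   # a code not covered by the CMap yields None
```

The entry for code 0x02 shows how a single glyph can be mapped to more than one character.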
The actual font outline descriptions (descriptions of the geometric glyph shapes) may optionally be embedded in the PDF document in several formats, such as PostScript Type 1, TrueType, and OpenType. All or parts of the original font file can be embedded with or without modifications. Partial embedding (i.e., font subsets) offers space advantages since only the outline descriptions of those glyphs are embedded which are actually used in the document.
As explained above, in many cases the corresponding Unicode values for the text semantics can be deduced either from the code mapping scheme itself (e.g. standardized glyph names or codes according to a well-known code page) or from some auxiliary data structure, such as the ToUnicode CMap in PDF. However, digital documents are not guaranteed to contain explicit information for creating Unicode mappings for the text they contain.
Therefore, the known Unicode mapping methods fail if a particular font does not have a ToUnicode CMap (or has only an incomplete one) and one of the following conditions is true:
    It is a simple font which uses non-standard glyph names. For example, glyph names may have been created algorithmically instead of chosen by a human.
    It is a simple font with builtin encoding.
    It is a CID font with one of the Identity-H or Identity-V CMaps.
In these cases, the known methods do not provide any Unicode mappings.
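The conditions listed above can be expressed as a simple predicate. This is a minimal sketch: the FontInfo record and its field names are hypothetical and do not correspond to the data model of any particular PDF library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FontInfo:
    """Hypothetical summary of the properties relevant to Unicode mapping."""
    kind: str                          # "simple" or "cid"
    tounicode: str                     # "complete", "incomplete", or "missing"
    standard_glyph_names: bool = True  # simple fonts only
    builtin_encoding: bool = False     # simple fonts only
    cmap_name: Optional[str] = None    # CID fonts only


def known_methods_fail(font):
    """Return True if the known mapping methods leave glyphs unmapped."""
    if font.tounicode == "complete":
        return False
    if font.kind == "simple":
        return font.builtin_encoding or not font.standard_glyph_names
    if font.kind == "cid":
        return font.cmap_name in ("Identity-H", "Identity-V")
    return False


examples = [
    FontInfo("simple", "missing", standard_glyph_names=False),
    FontInfo("simple", "missing", builtin_encoding=True),
    FontInfo("cid", "incomplete", cmap_name="Identity-H"),
    FontInfo("cid", "missing", cmap_name="90ms-RKSJ-H"),   # predefined CMap
    FontInfo("simple", "complete", standard_glyph_names=False),
]
results = [known_methods_fail(font) for font in examples]
```

The first three examples correspond to the three conditions; the last two are covered by a predefined CMap and by a complete ToUnicode CMap, respectively.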
In addition, in some situations the required data structures are present in the PDF, but correct and complete Unicode mapping is impossible nevertheless. The present inventors have determined that although the data structures for Unicode mapping are available, they may provide wrong or useless results. For example, simple fonts may use glyph names from a well-known set, but the name assignments can be wrong. Similarly, the present inventors have determined that situations exist where the PDF-generating software may have created a ToUnicode CMap which contains wrong Unicode mappings because proper Unicode information was not available at the time when the PDF was created. As an example for useless Unicode mapping data, a ToUnicode CMap may provide values in Unicode's Private Use Area (PUA) which do not have any intrinsic semantics, and are therefore unusable for general data processing and exchange. (PUA values are actually quite common since many font developers assign PUA values to some of the glyphs in their fonts.) Furthermore, the present inventors have determined that situations exist where even if the data structures for Unicode mapping are available, they may be incomplete; while Unicode mappings are available for most glyphs of a font, some glyph mappings may be missing. For example, the ToUnicode CMap is not required to cover all codes which are actually used in the document; some glyph codes may be missing from the ToUnicode CMap.
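Two of the defects described above, Private Use Area values and missing entries, can be detected mechanically, as the following minimal sketch suggests. The function names and the sample data are assumptions made for illustration. Wrong mappings, by contrast, cannot be detected by inspecting the mapping table alone, and this sketch does not attempt to do so.

```python
# The three Private Use ranges defined by the Unicode standard.
PUA_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(ch):
    """Return True if the character lies in one of the Private Use ranges."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in PUA_RANGES)


def classify_mappings(codes_used, to_unicode):
    """Label each glyph code used in the document by the state of its mapping."""
    report = {}
    for code in codes_used:
        text = to_unicode.get(code)
        if not text:
            report[code] = "missing"
        elif any(is_private_use(ch) for ch in text):
            report[code] = "private-use"
        else:
            report[code] = "usable"
    return report


# Hypothetical ToUnicode data: code 2 maps into the PUA, code 3 has no entry.
report = classify_mappings([1, 2, 3], {1: "A", 2: "\uf020"})
```

A mapping labeled "usable" here is merely well-formed; whether it is also correct has to be established by other means.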
Therefore, it is an object of the present invention to provide correct Unicode mappings in more cases than the methods known in the art, especially where these methods do not produce Unicode mappings, or where these mappings are wrong or incomplete. Further, it is an object of this invention to provide a general solution for all such situations.