1. Field of the Invention
The invention relates to the field of document management and more specifically relates to systems and methods to improve operation of a document indexing user interface to index documents that contain shift out and shift in (“SOSI”) embedded control codes such as for double-byte code page encodings.
2. Statement of the Problem
Document management systems provide a flexible, configurable, centralized management structure for controlling access to documents of an enterprise and for managing revisions to documents. In general, document management systems provide security mechanisms to allow an administrator to clearly define allowed and disallowed access to particular documents in accordance with a wide variety of access parameters such as user ID, group associations for a particular user ID, physical or logical location of a particular user, etc. In general, such a document management system includes a database used for indexing the content of all documents submitted to the document management system. The content-oriented database is a repository for indexing all documents in the document management system based on textual content of the document as well as other attributes of the document (such as name, author, etc.). An example of a common, commercially available document management system is IBM's DB2 Content Manager OnDemand for Multi-Platforms. Information regarding this exemplary document management system is readily available to those of ordinary skill in the art, for example, at www.ibm.com.
Indexing a document in the document management system database requires identifying index values from the textual information contained in each document submitted to the document management system. Where a document is structured in accordance with high-level structured document standards, the defined structures of the standards often rigorously define the beginning and end of textual fields as well as various attributes and parameters of such identified textual fields. Locating fields of information to be indexed in the case of such structured document standards is a simpler, well-defined process readily understood by those of ordinary skill in the art. Examples of such structured document standards are IBM's Advanced Function Presentation (“AFP”) architecture and Adobe's Portable Document Format (“PDF”). These and other well-known, commercially available structured document formats permit the document indexing system to readily and rigorously identify fields of textual information useful for index values in the document management database structures.
However, a large class of documents may simply contain line data. As used herein, “line data” refers to data formatted so that it can be printed on a line printer. Line printers are typically older, legacy printing systems that were adapted only for receipt of simple encoded characters and simple formatting controls such as “new line”, “index line feed”, “top of page”, etc. Often the text and controls are encoded according to the EBCDIC or ASCII standards defining certain 8-bit values as printable characters and other 8-bit values as control codes. In addition to the simple formatting control codes, some line printers are capable of processing single byte character sequences (“SBCS”) as well as double-byte character sequences (“DBCS”). Double-byte character sequences are common in languages with a substantial number of characters for encoding words and phrases. For example, Chinese and Japanese utilize character or phrase symbols numbering in the thousands. Thus, two bytes are required for encoding code points representing individual glyphs or symbols for Chinese and Japanese language line printers. Still other languages require three or more bytes for encoding various symbols native to the language.
In standard EBCDIC line data, a transition from SBCS text encoding to DBCS text encoding is indicated by a shift out control code (“SO”, encoded as hexadecimal value 0x0E). The transition from DBCS encoding back to SBCS encoding is marked by the shift in control code (“SI”, encoded as hexadecimal value 0x0F). When transmitted to a suitably adapted line printer, the SO and SI control codes are processed by the line printer to cause appropriate imaging of corresponding single or double-byte character values on the printable medium of the line printer.
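By way of illustration only, the role of the SO and SI control codes may be sketched as a routine that splits one line data record into alternating SBCS and DBCS segments. The function name `split_segments` and the sample byte values are hypothetical and do not come from any particular product. The sketch assumes that a record begins in SBCS mode and that every DBCS character occupies exactly two bytes.

```python
SO = 0x0E  # shift out: switch from SBCS to DBCS encoding
SI = 0x0F  # shift in: switch from DBCS back to SBCS encoding


def split_segments(record: bytes):
    """Split a record into (mode, payload) tuples, dropping the SO/SI bytes.

    Assumption: the record starts in SBCS mode and DBCS characters are
    exactly two bytes wide, so DBCS text is scanned two bytes at a time.
    """
    segments = []
    mode, start, i = "SBCS", 0, 0
    while i < len(record):
        byte = record[i]
        if mode == "SBCS" and byte == SO:
            if i > start:
                segments.append(("SBCS", record[start:i]))
            mode, start = "DBCS", i + 1
            i += 1
        elif mode == "DBCS" and byte == SI:
            if i > start:
                segments.append(("DBCS", record[start:i]))
            mode, start = "SBCS", i + 1
            i += 1
        else:
            # Advance by one character: 1 byte in SBCS, 2 bytes in DBCS.
            i += 1 if mode == "SBCS" else 2
    if start < len(record):
        segments.append((mode, record[start:]))
    return segments


# Hypothetical record: two SBCS bytes, SO, two DBCS characters, SI, one SBCS byte.
sample = bytes([0xC1, 0xC2, SO, 0x42, 0x41, 0x42, 0x42, SI, 0xC3])
segments = split_segments(sample)
```

In this sketch the nine-byte record yields three segments whose payloads total seven bytes. The two control codes occupy byte positions in the record but carry no printable content.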
When indexing standard structured documents such as an AFP or PDF structured document, the structural elements defined by the standards help the indexing program identify the start of a document, the start of a page, a particular field in the document, etc. By contrast, indexing an EBCDIC line data document containing only line data is a more complicated task for the indexing system. Typically, the document indexing system must define a “trigger” parameter for indexing. The trigger parameter is defined by a string (i.e., a group of bytes) that can be found only at a specified location of the line data document. Any “field” to be indexed in such a document may then be defined by a byte or column offset from a corresponding trigger. Thus, the triggers serve as anchor point definitions for subsequent definition of an index field location within similar line data documents to be indexed. For example, in a line data document containing customer invoices, a useful indexing parameter may identify a trigger string such as the vendor's name known to be present at a fixed position in the invoice. A customer name or account number could then be defined as an index field by identifying its location relative to the vendor name in the invoice line data.
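The trigger and field relationship described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the parameter syntax of any actual indexing product. The names `find_trigger` and `extract_field`, the sample invoice text, and the use of ASCII bytes rather than EBCDIC are all hypothetical choices made for readability. The sketch further assumes that a field is located by a line offset from the trigger line together with a starting column and a length.

```python
def find_trigger(lines, trigger: bytes, column: int):
    """Return the index of the first line holding `trigger` at `column`.

    `column` is a 0-based byte offset within the line. Returns None when
    no line contains the trigger at exactly that position.
    """
    for number, line in enumerate(lines):
        if line[column:column + len(trigger)] == trigger:
            return number
    return None


def extract_field(lines, trigger_line: int, line_offset: int,
                  column: int, length: int) -> bytes:
    """Return `length` bytes at `column` on the line `line_offset` lines
    after the trigger line."""
    return lines[trigger_line + line_offset][column:column + length]


# Hypothetical invoice: the vendor name anchors the page, and the customer
# name and account number sit at fixed columns on the following line.
invoice = [
    b"ACME SUPPLY CO.     INVOICE",
    b"CUSTOMER: JANE DOE  ACCT: 004217",
]

anchor = find_trigger(invoice, b"ACME SUPPLY CO.", column=0)
customer = extract_field(invoice, anchor, line_offset=1, column=10, length=8)
account = extract_field(invoice, anchor, line_offset=1, column=26, length=6)
```

Because both helper routines address the data purely by byte position, any byte that is present in the stored document but absent from the view in which the positions were chosen will shift the extracted field.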
It is useful in an indexing system to provide a graphical user interface for a user to simplify definition of triggers and index field locations within a line data document to be indexed (or a group of related documents to be indexed based on similar index field locations). Such a graphical user interface may be used, for example, to permit the user to select particular text to be defined as a “trigger” or to be defined as a “field” (e.g., defining the location of an index field as relative offset from the anchor position of a trigger found in the document).
A problem arises in effectively utilizing a graphical user interface to define triggers and field locations within a line data document that contains a mixture of SBCS and DBCS sequences. For such a GUI definition of parameters for indexing, the document is first presented on a display screen for the user to select strings of interest as indexing parameters. As presently practiced, a line data document is typically formatted for display by converting it to an equivalent ASCII encoded string, a more common encoding for presentation on a display screen. This conversion process typically strips any shift out (“SO”) or shift in (“SI”) control codes since there is no equivalent code in ASCII for the EBCDIC SO and SI characters. Thus it is a problem in a GUI aspect of the document indexing system to select columns or strings of characters from a display of a line data document to accurately determine a location of an index field relative to a trigger string. The selected strings will not include the stripped SO or SI control codes that indicate shifts between single-byte and double-byte character sequences. Hence, the selected triggers and fields will not accurately reflect the proper byte position in the original EBCDIC line data document file to later locate the selected text when indexing the document. When the selected trigger and field definitions are later used to locate corresponding text in line data documents, the column/byte locations specified by the selected information on the GUI display may not match the location of those fields in a document being indexed.
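The positional mismatch described above may be illustrated with a short sketch. It models only the stripping of the SO and SI control codes and omits the character set conversion itself. The names `strip_shift_codes` and `offset_drift` and the sample record are hypothetical. The sketch assumes that the byte values 0x0E and 0x0F never occur inside the DBCS character data, and it uses ASCII letters for the SBCS text purely for readability.

```python
SO = 0x0E  # shift out control code
SI = 0x0F  # shift in control code


def strip_shift_codes(record: bytes) -> bytes:
    """Model the display conversion by removing every SO and SI byte.

    Assumption: 0x0E and 0x0F never appear inside DBCS character data.
    """
    return bytes(b for b in record if b not in (SO, SI))


def offset_drift(record: bytes, original_offset: int) -> int:
    """Count the SO/SI bytes that precede `original_offset` in the record.

    This is the amount by which an offset chosen on the stripped display
    falls short of the true byte offset in the original record.
    """
    return sum(1 for b in record[:original_offset] if b in (SO, SI))


# Hypothetical record: SBCS text, one bracketed DBCS run, then the field.
record = b"AB" + bytes([SO, 0x88, 0x99, 0x88, 0xAA, SI]) + b"ACCT"
displayed = strip_shift_codes(record)

true_offset = record.index(b"ACCT")          # position in the stored file
displayed_offset = displayed.index(b"ACCT")  # position the user would select
drift = offset_drift(record, true_offset)
```

In this sketch the field begins at byte 8 of the stored record but at byte 6 of the stripped display, so a field definition captured from the display would point two bytes too early. The drift grows with each additional SO and SI pair that precedes the field on the line.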
It is evident from the above discussion that a need exists for an improved method and associated systems for permitting accurate selection of text to define triggers and index field locations for indexing of line data documents having double-byte character sequences therein.