Embodiments of the present invention relate to using automated processes to determine changes between different versions of document files comprising text information, and to indicate the determined changes to a user in a useful manner.
It is known to use programmable device applications to compare different versions of document files to determine changes in document content. Often the determined changes are indicated to a user by inserting mark-ups directly in a merged version that combines two different document versions, the mark-ups having a format that indicates the nature of the change, for example showing moved or deleted text items in a strikethrough font, and added or inserted items in an underlined font. Mark-ups are also often depicted in different font colors, so that they are more readily recognized by color contrast with the font of the unchanged document items.
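By way of a non-limiting illustration, the merged mark-up approach described above may be sketched as follows. The sketch assumes a word-level comparison using the Python standard library `difflib` module; the function name `merged_markup` and the plain-text markers `[-...-]` (standing in for strikethrough) and `{+...+}` (standing in for underline) are hypothetical choices made for this example and are not taken from any particular comparison product.

```python
from difflib import SequenceMatcher


def merged_markup(old: str, new: str) -> str:
    """Return one merged text with deletions and insertions marked inline.

    Deleted words are wrapped in [-...-] (a stand-in for strikethrough) and
    inserted words in {+...+} (a stand-in for underline).
    """
    old_words, new_words = old.split(), new.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged words are copied through without mark-up.
            out.extend(old_words[i1:i2])
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)


if __name__ == "__main__":
    print(merged_markup("the quick brown fox", "the slow brown fox jumps"))
```

A word replaced between versions appears in the merged output as a deletion immediately followed by an insertion, which mirrors the strikethrough and underline convention described above.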
While mark-up processes may be straightforward and efficient in noting relative changes in text content, documents comprising constituent components organized in a logical structural arrangement or schema present additional challenges in efficient document comparison. Document schemas define methods for machine-to-machine communication of structured data, in one aspect enabling end user display means to display document content with specified emphasis (bold, italic font, etc.) or table structures. Schemas support interoperable interaction within a given network or service domain to enable consistent replication of a desired document display format across a variety of end user display applications and devices.
One commonly used schema is Darwin Information Typing Architecture (DITA), an Extensible Markup Language (XML) data model designed for capturing, authoring and publishing document content. DITA provides opportunities to link processes for authoring, producing and delivering information with underlying information technology infrastructures that support content-related activities. In contrast to book or chapter hierarchies, DITA document content is mapped through links to pluralities of small topic items which may be reused in other documents. DITA topics are organized in the sequence in which they are intended to appear in a finished document, wherein a DITA map defines a table of contents for deliverables. Relationship tables in DITA maps can also specify which topics link to each other.
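For illustration only, the manner in which a DITA map orders topics may be sketched as follows. The sketch assumes a simplified map containing only `map`, `title` and nested `topicref` elements with `href` attributes; the sample map content, the file names, and the function name `topic_order` are hypothetical and are provided solely for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified DITA map: each topicref links to a separate topic
# file, and nesting expresses the hierarchy of the table of contents.
SAMPLE_MAP = """<map>
  <title>User Guide</title>
  <topicref href="intro.dita"/>
  <topicref href="install.dita">
    <topicref href="prereqs.dita"/>
  </topicref>
</map>"""


def topic_order(map_xml: str) -> list:
    """Return (depth, href) pairs in the order the topics appear in the map."""
    root = ET.fromstring(map_xml)
    order = []

    def walk(elem, depth):
        for child in elem:
            if child.tag == "topicref":
                order.append((depth, child.get("href")))
                walk(child, depth + 1)

    walk(root, 0)
    return order


if __name__ == "__main__":
    for depth, href in topic_order(SAMPLE_MAP):
        print("  " * depth + href)
```

Because the map holds only links, the same topic file may be referenced from several maps, which is the reuse property noted above.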
Thus, DITA enables the reuse of modular topics in different deliverables over a large variety of content contexts. However, the topic-orientation of DITA documents renders effective automated document comparison based on text comparison problematic, for example often generating large numbers of unimportant or even spurious mark-ups due to changes in document structure, which may obscure the document content changes actually of interest.
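The effect described above may be illustrated with a minimal sketch, assuming a line-based text comparison performed with the Python standard library `difflib` module. The two topics, their text, and the helper name `changed_lines` are hypothetical. The two document versions contain identical topic content and differ only in the order of the topics, yet the text comparison still reports mark-ups.

```python
from difflib import ndiff


def changed_lines(old_lines, new_lines):
    """Return only the diff lines that a text comparison would mark up."""
    return [
        line
        for line in ndiff(old_lines, new_lines)
        if line.startswith(("- ", "+ "))
    ]


# Two hypothetical topics whose content is unchanged between versions.
topic_a = ["Installing", "Run the installer."]
topic_b = ["Configuring", "Edit the settings file."]

old_doc = topic_a + topic_b  # version 1: topic A, then topic B
new_doc = topic_b + topic_a  # version 2: same topics, reordered

marks = changed_lines(old_doc, new_doc)

if __name__ == "__main__":
    print("\n".join(marks))
```

In this sketch every reported mark-up results from the structural reordering alone, since no topic text was edited.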