An unstructured document migrator migrates an unstructured document to the Darwin Information Typing Architecture (DITA). DITA is an industry content standard for creating and managing diverse types of documentation. It is a write-once, publish-many technology used for a variety of applications, including stand-alone documents, large online content collections, embedded user assistance, and customized run-time generation of help systems.
DITA is not limited to a specific XML dialect; instead, it provides an extensible model for topic-based content authoring and management. DITA permits the creation of custom content topic types, called specializations, which are derived from the main Topic archetype through schema inheritance mechanisms. Examples of child topic types derived from Topic include Concept, Task, and Reference. Additional topic types, such as an API topic type, a message topic type, or a white paper topic type, can be derived from these. Strong topic typing enforces content consistency from writer to writer, enables reuse, and preserves the investment in content. It also preserves the investment in existing processing, which is likewise extensible. Because it is based on XML, DITA provides semantically rich, intelligent content that machines can process predictably and flexibly.
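The derivation of topic types from the Topic archetype can be pictured as a simple parent map. The following is a minimal sketch in Python, not DITA's actual schema machinery. The names `PARENT`, `ancestry`, and `is_a` are hypothetical, and the placement of the API, message, and white paper types under particular parents is an assumption made only for illustration.

```python
# Hypothetical parent map of topic types. Concept, Task, and Reference derive
# from Topic as stated in the text; the parents chosen for "api", "message",
# and "whitepaper" are assumptions for illustration only.
PARENT = {
    "topic": None,
    "concept": "topic",
    "task": "topic",
    "reference": "topic",
    "api": "reference",        # assumed parent
    "message": "reference",    # assumed parent
    "whitepaper": "concept",   # assumed parent
}


def ancestry(topic_type):
    """Return the chain from topic_type up to the root Topic archetype."""
    chain = []
    current = topic_type
    while current is not None:
        if current not in PARENT:
            raise KeyError(f"unknown topic type: {current}")
        chain.append(current)
        current = PARENT[current]
    return chain


def is_a(topic_type, ancestor):
    """True if topic_type is the ancestor itself or is derived from it."""
    return ancestor in ancestry(topic_type)
```

Under these assumptions, `ancestry("api")` walks from the API type through Reference to Topic, which is why processing written for a parent type can in principle still apply to a type derived from it.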
DITA provides a framework to create, build, and deliver complex technical information sets, with flexible features that enable extensive reuse and repurposing. The open DITA standard has given rise to a large and growing commercial marketplace of DITA tools and services. These tools and services reduce the cost and complexity of DITA and enable cross-industry content interchange among DITA adopters, whereas traditional solutions are typically an expensive roll-your-own proposition.
A DITA documentation library is created by dividing information into collections of topics (and nested collections), which are then processed to create a variety of outputs. The three most common DITA topic types, differentiated by their schema, are task, concept, and reference. Each DITA topic file contains a title element, a short description element, elements to contain metadata, and a body element that holds the information specific to the topic. The metadata in DITA includes information and attributes about the topic that make the topic easier to locate.
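The topic structure described above (title, short description, metadata, body) can be sketched with Python's standard `xml.etree.ElementTree` module. This is an illustrative sketch only: the helper name `build_topic` and the sample values are hypothetical, and the sketch covers just the generic topic and the concept type, since task and reference bodies follow more constrained content models that are not shown here.

```python
import xml.etree.ElementTree as ET

# Body element name for each topic type handled by this sketch.
BODY_ELEMENT = {"topic": "body", "concept": "conbody"}


def build_topic(topic_type, topic_id, title, short_description, keywords, paragraphs):
    """Build a minimal topic: title, short description, metadata, and body."""
    root = ET.Element(topic_type, {"id": topic_id})
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "shortdesc").text = short_description

    # Metadata lives in the prolog; keywords help make the topic easier to locate.
    prolog = ET.SubElement(root, "prolog")
    metadata = ET.SubElement(prolog, "metadata")
    keyword_list = ET.SubElement(metadata, "keywords")
    for word in keywords:
        ET.SubElement(keyword_list, "keyword").text = word

    body = ET.SubElement(root, BODY_ELEMENT[topic_type])
    for text in paragraphs:
        ET.SubElement(body, "p").text = text
    return root


# Hypothetical sample content.
concept = build_topic(
    "concept",
    "about_migration",
    "About content migration",
    "Migration moves legacy content into typed topics.",
    ["migration", "legacy content"],
    ["Legacy content often records only how text looks."],
)
print(ET.tostring(concept, encoding="unicode"))
```

The root element name carries the topic type, so the same four-part layout yields a differently typed topic simply by changing the root and body element names.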
DITA is a broadly accepted industry standard with a vibrant community that has deep skills and expertise in DITA implementation across industries such as telecom, banking, insurance, medical instrumentation, software, and even academia. DITA's content reuse and single-sourcing capabilities are strengths that enterprises and organizations, large and small, have leveraged.
Compared to What You See Is What You Get (WYSIWYG) publishing, DITA is complex. The complexity of DITA, both real and perceived, is an inhibitor to adoption. Despite this complexity, enterprises choose to standardize on DITA to manage complex publishing demands that often include multi-channel and multi-format (omni-channel) publishing, distributed authoring, extensive reuse, and in-line version control. However, enterprises with limited budgets and tight deadlines must often migrate large volumes of legacy content to DITA from various sources, such as word processing, desktop publishing, and legacy SGML formats. Conversion scripts are typically employed for this migration. These scripts cannot automatically determine the source and target topic types unless additional metadata is manually added to the source content, and adding that metadata is often prohibitively time-consuming. Typical approaches involve the manual assignment of named paragraph styles or markers in unstructured word processing source files, the manual addition of metadata in structured source files, or the insertion of eye-catcher text. All of these methods are labor- and time-intensive. For unstructured content, the challenge is compounded by the fact that the source typically identifies only how the content looks, not what it is.
As a result, attempts to use heuristic methods yield poor, partial conversions that require excessive manual cleanup afterward. Because of these limitations, organizations often choose to convert their content to the generic DITA topic type, or to convert everything to a single specialized topic type. These approaches require less complex conversion scripts. However, the resulting absence of strong and accurate topic typing defeats one of the main benefits of DITA. Post-conversion retrofitting from one topic type to another is often costly and complex because of the constraints between topics; thus, post-conversion topic typing is rarely undertaken, despite best intentions.
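The dependence of conventional conversion scripts on manually assigned paragraph styles, and the fallback to the generic topic type, can be sketched as follows. This is a hypothetical illustration rather than any particular conversion tool: the style names in `STYLE_TO_TOPIC_TYPE` and the functions `assign_topic_type` and `count_generic` are assumptions introduced here.

```python
# Hypothetical mapping from manually assigned paragraph styles to topic types.
# A section is typed correctly only if someone tagged it with one of these styles.
STYLE_TO_TOPIC_TYPE = {
    "TaskHeading": "task",
    "ConceptHeading": "concept",
    "ReferenceHeading": "reference",
}

GENERIC_TOPIC = "topic"


def assign_topic_type(paragraph_style):
    """Return the topic type for a style, or the generic type if the style is unknown."""
    return STYLE_TO_TOPIC_TYPE.get(paragraph_style, GENERIC_TOPIC)


def count_generic(sections):
    """Count sections that end up as generic topics.

    sections is a list of (heading text, paragraph style) pairs.
    """
    return sum(1 for _, style in sections if assign_topic_type(style) == GENERIC_TOPIC)


# Hypothetical legacy document: only the first heading was manually tagged.
legacy_sections = [
    ("Installing the product", "TaskHeading"),
    ("Overview", "Heading 1"),
    ("Error codes", "Heading 2"),
]
typed = [(heading, assign_topic_type(style)) for heading, style in legacy_sections]
```

In this sketch, visual styles such as "Heading 1" say how a section looks but not what it is, so every untagged section falls through to the generic topic type and loses the benefit of strong typing.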