Two general approaches are known for working with documents in different formats when developing Natural Language Processing (NLP) systems, for example machine translation systems. The first approach is based on integration with applications that use the various formats. In this approach, external programs (such as Internet Explorer and Microsoft Word™) and their APIs (application programming interfaces), which comprise collections of standard procedures (functions, methods), are used to develop application software that works with data in specific formats. An API defines a level of abstraction that allows working with a family of related formats supported by a single application. A dedicated application or library can therefore be used to work with a specific format. For example, if the *.DOC format must be supported, one can use Microsoft Word™, which provides an API through which software can read and modify Microsoft Word™ documents. However, that does not allow source texts to be transferred from the format of one editor to the format of another.
This first approach has at least the following shortcomings:
    it cannot be used for every format and application;
    it requires an outside native application;
    processing in automated mode, such as on a server, is made more difficult or impossible;
    adding functions to the editor, such as dynamic highlighting of variant translations in a machine translation system, is made more difficult or impossible; and
    conversion into a different format is impossible or limited to the other formats that the application supports.
Another limitation of the first approach is that if the source format, such as .PDF, cannot be edited, the user or the system cannot add or change anything.
An outside application can be avoided if a custom library is able to work with the specific format. The specification of the format, however, must be accessible, and supporting editing while retaining all of the document's data is very labor-intensive. A general shortcoming of this approach is that an individual solution is needed for each format, which is inconvenient for both the developer and the end user.
Another approach is to represent source documents as text with tags. An example format that uses this approach is the XLIFF format. This approach is also used in developing NLP products. Using this approach, documents of various formats are transformed into a common representation: text annotated with tags. The composition and content of the tags are determined by the source format of the document. The tags store the data needed to recover the document and may carry formatting or structural data. Some tags cannot be changed, while others can be edited together with the text to which they correspond. Modification is usually done in semi-automatic mode, with the user manually tracking and correcting the text that contains tags. The advantage over the previous approach is that the solution is uniform for all formats. One shortcoming is that the document-editing capabilities are severely limited: automatic modification is cumbersome, and correcting text by hand is inconvenient.
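The text-with-tags representation described above can be illustrated with a minimal sketch. It assumes an HTML-like source in which anything between angle brackets is a tag that must be preserved unchanged; the names `Segment`, `to_segments`, `edit_text`, and `restore` are hypothetical and chosen only for illustration, not taken from any particular product.

```python
import re
from dataclasses import dataclass

# Assumption: a "tag" is anything between angle brackets, as in HTML or XML.
TAG_RE = re.compile(r"(<[^>]+>)")


@dataclass
class Segment:
    content: str
    is_tag: bool  # True for markup that must be kept to recover the document


def to_segments(source):
    """Split a source string into an ordered list of tag and text segments."""
    parts = TAG_RE.split(source)
    return [Segment(p, bool(TAG_RE.fullmatch(p))) for p in parts if p]


def edit_text(segments, transform):
    """Apply transform to text segments only; tag segments pass through unchanged."""
    return [s if s.is_tag else Segment(transform(s.content), False) for s in segments]


def restore(segments):
    """Recover a document string by concatenating the segments in order."""
    return "".join(s.content for s in segments)


if __name__ == "__main__":
    doc = "<p>Hello <b>world</b></p>"
    edited = edit_text(to_segments(doc), str.upper)
    print(restore(edited))  # prints <p>HELLO <b>WORLD</b></p>
```

The sketch also hints at the limitation noted above: the editing step sees only isolated runs of text, so any change that crosses a tag boundary has to be tracked and corrected by hand.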
An example of such a format is XLIFF (https://www.oasis-open.org/committees/xliff/faq.php#WhatIsXLIFF). XLIFF is an open, XML-based standard for describing documents. The XLIFF standard alone, however, does not solve the problem of converting from one particular format to another. The standard also provides no capability for displaying and editing a document in What You See Is What You Get (WYSIWYG) mode. The shortcomings of the second approach therefore include at least the following:
    the selection of editing tools is insufficient and there is no WYSIWYG mode;
    it can be impossible to convert to another format; and
    it is well suited to tag-based formats such as HTML or XML, but not very useful for binary formats such as DOC.
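As a rough illustration of what an XLIFF file carries, the following sketch reads a small hand-written XLIFF 1.2 fragment with Python's standard `xml.etree.ElementTree` module. The sample document and the helper name `read_units` are assumptions made for this example. The sketch extracts the translatable text and the identifiers of the inline tags, and nothing in it renders or converts the document.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2 documents.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"

# Hypothetical sample: one translation unit with one inline <g> tag.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="sample.html" source-language="en" datatype="html">
    <body>
      <trans-unit id="1">
        <source>Hello <g id="1" ctype="bold">world</g></source>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def read_units(xliff_text):
    """Return (unit id, plain text, inline tag ids) for each trans-unit."""
    root = ET.fromstring(xliff_text)
    units = []
    for unit in root.iter(f"{{{XLIFF_NS}}}trans-unit"):
        source = unit.find(f"{{{XLIFF_NS}}}source")
        # itertext() yields the text around and inside the inline tags.
        text = "".join(source.itertext())
        tag_ids = [g.get("id") for g in source.iter(f"{{{XLIFF_NS}}}g")]
        units.append((unit.get("id"), text, tag_ids))
    return units


if __name__ == "__main__":
    print(read_units(SAMPLE))  # prints [('1', 'Hello world', ['1'])]
```

The inline `<g>` element records only that some formatting applies to "world"; how that formatting should be displayed or written back into the original format is left to whatever tool processes the file.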
Office suites such as Microsoft Office™ or OpenOffice™ can be used to open and save files in various formats.
A document editing application supports a specific type of document. For example, if the "type" is "text document," then the possible formats include the Microsoft Word™ format, Rich Text Format (RTF), and OpenDocument Text. These formats are supported by a variety of applications, such as Microsoft Word™, OpenOffice™, and AbiWord™. Some applications can open only particular document formats. For example, it is impossible to open presentation (PowerPoint) files in Microsoft Word™. Even when a document of the same type is opened in different editing applications, it may be displayed differently: formatting and data elements may be partially lost or distorted.