Many systems and databases contain data in incompatible formats. One of the most time-consuming challenges for developers has been exchanging data between incompatible systems over the Internet. XML permits data to be exchanged between such systems. Converting data to XML format can greatly reduce this complexity and create data that can be read by many different types of applications. Because of this, XML has become a standard format for information exchange in IT applications and systems. However, the number of documents available or generated in XML format remains fairly low compared to documents in other formats, for two reasons. First, converting documents from other formats into XML is often difficult and time-consuming. Second, because of the verbosity and length of XML documents, creating new XML documents is also a time-consuming process. Creating an XML document requires continually interleaving document content (textual data) with semantic tags and attributes according to a Document Type Definition or DTD (a DTD defines the legal elements and structure of an XML document), a process that is frequently tedious and error-prone.
The appearance of various XML editors helps the designer partially reduce document generation overhead by offering an advanced graphical interface with menu-based selection of elements and attributes, and the possibility of aligning document generation with a corresponding DTD by validating entire files or their fragments. Although DTDs serve well for document validation, they provide little help during document editing or creation. The main reason for this is that most DTDs are designed by humans before any valid XML documents are created; as a result, many DTDs either contain errors or are too general, that is, they allow a much greater degree of ambiguity than the actual documents exhibit. Moreover, suggesting tree-like patterns with DTDs is simply impossible, since most element definitions are regular expressions describing infinite sets of possible element contents, while document authoring is a sequence of instantiations of the element definitions. What is needed is a method of easily converting a document from one format into a structured document, such as an XML document.
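To illustrate why an element definition describes an infinite set of contents, the following minimal sketch treats one DTD content model as a regular expression over the sequence of child element names. The declaration `<!ELEMENT section (title, para+)>` and the names `SECTION_MODEL` and `is_valid_section` are hypothetical, chosen only for illustration; they do not come from any particular DTD or editor.

```python
import re

# Hypothetical DTD declaration (an assumption for illustration):
#   <!ELEMENT section (title, para+)>
# The same content model written as a regular expression over the
# space-separated sequence of child element names.
SECTION_MODEL = re.compile(r"title( para)+")


def is_valid_section(children):
    """Return True if the list of child element names matches the model."""
    return SECTION_MODEL.fullmatch(" ".join(children)) is not None


# One title followed by any number (>= 1) of paragraphs is accepted,
# so the set of valid contents is infinite.
print(is_valid_section(["title", "para"]))
print(is_valid_section(["title", "para", "para", "para"]))
print(is_valid_section(["title"]))
```

The check can say whether a finished sequence is valid, but because `para+` admits arbitrarily many paragraphs, the model alone gives no indication of which concrete continuation an author is likely to want next.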
The need for strongly structured documents increases with the development of new software applications (such as the semantic web) and new standards (SGML, XML, etc.). Structured documents can be viewed as composed of two components: the content part and the (tree-like) structure part. Authoring assistants have been developed, especially for helping authors create the structural markup of their documents, the most widely used being the DTD or XML-Schema checker for checking XML documents. Some tools also allow tagging of textual components semi-automatically using tagging/parsing techniques. Many structured documents repeat the same content components at various locations throughout the document. What is needed is a method of predicting both repeated structure components and repeated content components during document authoring.
Text prediction is a widely developed art. Historically, one of the first studies on text prediction was published by C. Shannon (Claude E. Shannon, “Prediction and Entropy of Printed English”, Bell System Technical Journal, pp. 50-64, 1951), which presented his game (the “Shannon game”). The purpose of the Shannon game is to predict the next element of text (letters, words) using the preceding context. Shannon used this technique to estimate bounds on the entropy of English.
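The idea of predicting the next element from the preceding context can be sketched with a simple character n-gram counter. This is only an illustrative stand-in: Shannon's experiments used human guessers, and the function names `train_ngram` and `guesses`, the context length of two characters, and the sample text are all assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def train_ngram(text, order=2):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        counts[context][text[i + order]] += 1
    return counts


def guesses(counts, context, order=2):
    """Return candidate next characters for `context`, most frequent first.

    Only the last `order` characters of the context are used; an unseen
    context yields an empty list.
    """
    ranked = counts.get(context[-order:])
    return [ch for ch, _ in ranked.most_common()] if ranked else []


# Sample text (an assumption for illustration).
model = train_ngram("the theory of the thing")
print(guesses(model, "th"))  # prints ['e', 'i']
```

In the sample text the context "th" is followed by "e" three times and by "i" once, so "e" is the first guess. The position of the correct character in such a ranked list is the kind of quantity from which entropy bounds can be estimated.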
Many applications propose word/text completion using simple techniques such as MRU (Most Recently Used) and lookup in some files (these files can be the current file, the buffer, the clipboard, specific lexicons, databases, etc.). More sophisticated prediction systems have been developed, such as (Multilingual) Natural Language Authoring (Marc Dymetman, Veronika Lux and Aarne Ranta, “XML and Multilingual Document Authoring: Convergent Trends”, Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pp. 243-249, Saarbruecken, 2000) and form completion (some application programs such as MS Excel propose content for a cell). Hermens and Schlimmer (L. A. Hermens and J. Schlimmer, “A machine learning apprentice for the completion of repetitive forms”, New York, N.Y.: Cambridge University Press, 1993) propose a learning approach (decision trees) that suggests text for form fields. They also apply machine learning (ML) algorithms in order to predict what the user of an electronic organizer is going to write, but the system only allows predictions from a pre-defined structure (forms).
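A most-recently-used completion scheme of the kind mentioned above might look like the following minimal sketch. The class name `MRUCompleter`, its methods, and the `limit` parameter are hypothetical; no specific application's implementation is being described.

```python
class MRUCompleter:
    """Suggest completions for a prefix, most recently used words first."""

    def __init__(self):
        self._recent = []  # words ordered from most to least recently used

    def record(self, word):
        """Note that `word` was just typed, moving it to the front."""
        if word in self._recent:
            self._recent.remove(word)
        self._recent.insert(0, word)

    def complete(self, prefix, limit=3):
        """Return up to `limit` recorded words that start with `prefix`."""
        return [w for w in self._recent if w.startswith(prefix)][:limit]


completer = MRUCompleter()
for word in ["document", "definition", "data", "document"]:
    completer.record(word)

print(completer.complete("d"))   # prints ['document', 'data', 'definition']
print(completer.complete("de"))  # prints ['definition']
```

Such a scheme relies only on recency and a shared prefix; it has no knowledge of document structure, which is the limitation the more sophisticated systems cited above try to address.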
Foster et al. (George Foster, Philippe Langlais, Elliott Macklovitch, and Guy Lapalme, “TransType: Text Prediction for Translators. Demonstration Description” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July, 2002) describe a technique for translation completion. The aim of the TransType project is to develop a new kind of interactive tool to assist translators. The proposed system observes a translator as s/he types a text and periodically proposes extensions to it, which the translator may accept as is, modify, or ignore. The system takes into account not only the source text, but also the already-established part of the target text.