The number of documents available in electronic format has grown rapidly. As this number continues to increase, it is important to be able to search the available electronic documents quickly and accurately. In addition, it is desirable to be able to store data into electronic documents and to generate new electronic documents which are similar in structure to existing electronic documents. Hence, tools which assist in querying electronic documents, creating electronic documents, and storing data into electronic documents are desirable.
Electronic documents for display over the Internet and/or an Intranet are commonly stored in a format based on the Standard Generalized Markup Language (SGML). SGML is a standard for specifying a document markup language or tag set. SGML is not in itself a document language, but a description of how to specify one. The SGML format provides for the inclusion of a document type definition (DTD), which specifies how the data within a document should be organized. One SGML-derived format for storing data within electronic documents which is becoming increasingly popular is eXtensible Markup Language (XML). XML is rapidly emerging as the new standard for representing and exchanging data on the World Wide Web (web). An XML document may be accompanied by a DTD. For example, in an XML document, the DTD may specify the tags which can be used, the order in which the tags appear, how the tags are nested, and the attributes each tag may carry. Thus, the DTD plays an important role in storing data to the XML document, in generating similar documents, and in increasing the efficiency of queries of the XML document. Query efficiency is improved by using knowledge of the structure of the data to eliminate from consideration elements that cannot satisfy the query.
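To make the role of a DTD concrete, the following is a minimal sketch in Python. It assumes a hypothetical four-element DTD (the `article`, `title`, `author`, and `section` names are illustrative only), and it hand-translates each element's content model into a regular expression over child tag names. The standard-library parser used here does not validate against DTDs, so the check is written out explicitly; it ignores attributes and text content.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical DTD, shown for reference only; it is never parsed below.
DTD_TEXT = """
<!ELEMENT article (title, author+, section*)>
<!ELEMENT title   (#PCDATA)>
<!ELEMENT author  (#PCDATA)>
<!ELEMENT section (#PCDATA)>
"""

# Hand-written translation of the content models above. Each model is a
# regular expression over the element's child tags, where every tag name
# is followed by a single space.
CONTENT_MODELS = {
    "article": r"(title )(author )+(section )*",
    "title": r"",    # text only, no child elements
    "author": r"",
    "section": r"",
}


def conforms(element):
    """Return True if element and all its descendants match CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False  # tag is not declared in the DTD
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False  # wrong order, wrong nesting, or wrong repetition
    return all(conforms(child) for child in element)


good = ET.fromstring(
    "<article><title>T</title><author>A</author>"
    "<author>B</author><section>S</section></article>"
)
bad = ET.fromstring("<article><author>A</author><title>T</title></article>")
```

Here `conforms(good)` holds because the children appear as one `title`, one or more `author` elements, then any number of `section` elements, while `conforms(bad)` fails because `author` precedes `title`. A query processor that knows, for instance, that `section` can only occur under `article` could use the same structural knowledge to skip elements that cannot satisfy a query.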
Although DTDs are helpful in the storage, generation, and retrieval of data related to an XML document, DTDs are not mandatory. As a result, many XML documents exist which do not contain DTDs. In addition, since only a small portion of the electronic documents in existence today is in XML format, the majority of XML documents will, at least initially, likely be generated automatically from pre-existing non-XML documents. In many instances, these automatically generated XML documents will not contain DTDs. Therefore, a tool for automatically generating DTDs is desirable for improving data storage and retrieval.
Others have attempted to automatically generate DTDs with varying degrees of success. One such system is IBM's Data Descriptors by Example (DDbE) system. The goal of DDbE is to give users a good start at creating DTDs for their own applications. However, this system and other available systems do not produce highly accurate DTDs for all XML documents, especially complex XML documents. Since accurate DTDs enable efficient storage and retrieval of data, improved methods for extracting accurate DTDs from XML documents are desirable.
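The difficulty of producing an accurate DTD can be illustrated with a deliberately naive sketch in Python. It is not the method used by DDbE or by any system described here; it is a hypothetical baseline (the `infer_naive_dtd` name and the sample documents are assumptions made for illustration) that simply records every child sequence observed for each element and emits their disjunction. Mixed content and attributes are ignored.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def infer_naive_dtd(xml_texts):
    """Return ELEMENT declarations listing exactly the child sequences seen."""
    sequences = defaultdict(set)
    for text in xml_texts:
        root = ET.fromstring(text)
        for element in root.iter():
            # Record the ordered child tags of this occurrence.
            sequences[element.tag].add(tuple(child.tag for child in element))

    declarations = []
    for tag in sorted(sequences):
        observed = sorted(seq for seq in sequences[tag] if seq)
        if not observed:
            model = "(#PCDATA)"  # never seen with child elements
        else:
            parts = ["(" + ", ".join(seq) + ")" for seq in observed]
            model = parts[0] if len(parts) == 1 else "(" + " | ".join(parts) + ")"
            if () in sequences[tag]:
                model += "?"  # element was also seen with no children
        declarations.append("<!ELEMENT %s %s>" % (tag, model))
    return declarations


docs = [
    "<article><title>T</title><author>A</author></article>",
    "<article><title>T</title><author>A</author><author>B</author></article>",
]
for line in infer_naive_dtd(docs):
    print(line)
```

For these two documents the sketch declares `article` as `((title, author) | (title, author, author))` rather than the more general and more intuitive `(title, author+)`. Such enumerated output grows with every new example, rejects a document with three authors, and may not even satisfy XML's requirement that content models be deterministic, which is one way to see why more careful generalization is needed.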