Storing document-based information in the memory of a computer installation requires a predesigned system of document order or control. The software discipline that provides such order is referred to generally as a database management system (DBMS). DBMSs store information as records carrying facts as to document content and structure. The manner of retrieving information from a collection of records, such as a database, differs among system designers. Document-based databases traditionally have been stored in a hierarchy that resembles an inverted family tree, with relationships of root (or parent), child, and the like. Alternatively, a network database, while resembling a hierarchical model, establishes links between various levels of one or more tree structures. A popular system typically employed with spreadsheet forms of programs is a relational database.
In order to develop a hierarchically designed, document-type database, it is necessary to compile and simplify information as to document structure. This information can be called a grammar or document type definition (DTD). Creating a DTD is non-trivial; substantial effort and corresponding cost are invested in its development. For example, one major database service organization developed over 10,000 DTDs representing over 10,000 different types of documents in building its databases. The creation of DTDs often has been compared to writing a computer program. See generally, the following publications:
(1) Joan Knoerdel. SGML is more than generic coding. EPSIG News, pp. 3-4, December 1987.
(2) SGML: A usage overview. Electronic Documents, 2(10):23-32.
(3) Editor's rebuttal. EPSIG News, p. 4, May 1988.
(4) Ludo VanVooren. Implementing SGML: Where do you start? <TAG>, The SGML Newsletter, pp. 5-7, February 1990.
(5) Information processing--text and office systems--standard generalized markup language (SGML). International Organization for Standardization. Ref. No. ISO 8879-1986, September 1969.
(6) Steven J. DeRose, David G. Durand, Elli Mylonas, and Allen H. Renear. What is text, really? Journal of Computing in Higher Education, 1(2):3-26, 1990.
(7) Michael Farrell. Text markup, SGML, and text databases. EPSIG News, 5(4):19-20, 1993.
(8) Erik Naggum. SGML FAQ.0.0. ftp://ftp.ifi.uio.no/pub/SGML/FAQ.0.0, January 1992.
(9) Erik Naggum. SGML general information. ftp://ftp.ifi.uio.no/pub/SGML/general-info, January 1992.
(10) Haviland Wright. SGML frees information. Byte, pp. 279-287, June 1992.
(11) Eric van Herwijnen. Practical SGML. Kluwer Academic Publishers, Boston/Dordrecht/London, second edition, 1994.
(12) Robin Cover. Standard Generalized Markup Language ISO 8879:1986 (SGML) annotated bibliography and list of resources. ftp://ftp.ifi.uio.no/pub/SGML/bibliography, January 1992.
Several vendors now offer consulting services to write and maintain DTDs.
The hierarchical DTD, in general, is a condensed assemblage of AND, OR, and element rules functioning to identify document structure elements including, for example, title, author, and paragraph. These structural components typically are identified or "marked up" utilizing a somewhat complex meta-language referred to as Standard Generalized Markup Language or "SGML". Utilizing SGML methodology, document publishers insert start and end tags to identify the structural components of the documents they publish from electronic media. Text material similarly is identified. Without such tagging, only human intervention and reading of the "raw" text will find structural components such as author, title, and the like. With SGML techniques, a start tag typically is formed of a < symbol followed by a tag name which, in turn, is followed by a > symbol. Correspondingly, the end tag has the same structure with the addition of a slash following the < symbol. An example of a simple SGML mark-up may be shown as follows:
    <record>
      <name>Keith Shafer</name>
      <title>Research Scientist</title>
      <mailcode>MC 410</mailcode>
      <ext>x5049</ext>
    </record>
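The way start and end tags expose structural components to a program, rather than to a human reader, can be illustrated with a minimal Python sketch. It is not a conforming SGML parser: it assumes every element carries an explicit start and end tag, with no attributes or tag minimization. The names RECORD, TOKEN, and extract_fields are hypothetical and introduced only for this illustration.

```python
import re

# The tagged record from the example above, reproduced for self-containment.
RECORD = (
    "<record> <name>Keith Shafer</name> <title>Research Scientist</title> "
    "<mailcode>MC 410</mailcode> <ext>x5049</ext> </record>"
)

# A start tag is "<name>", an end tag is "</name>"; anything else is text.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")


def extract_fields(marked_up):
    """Return (tag, text) pairs for each element that directly contains text."""
    fields = []
    open_tags = []  # stack of the element names that are currently open
    for match in TOKEN.finditer(marked_up):
        slash, name, text = match.groups()
        if name and not slash:
            # Start tag: the element becomes the innermost open element.
            open_tags.append(name)
        elif name and slash:
            # End tag: it must close the innermost open element.
            if not open_tags or open_tags[-1] != name:
                raise ValueError(f"unexpected end tag </{name}>")
            open_tags.pop()
        elif text and text.strip():
            # Non-blank text belongs to the innermost open element.
            if not open_tags:
                raise ValueError("text outside any element")
            fields.append((open_tags[-1], text.strip()))
    if open_tags:
        raise ValueError(f"unclosed element <{open_tags[-1]}>")
    return fields


if __name__ == "__main__":
    for tag, text in extract_fields(RECORD):
        print(f"{tag}: {text}")
```

Applied to the record above, the function pairs name with "Keith Shafer", title with "Research Scientist", mailcode with "MC 410", and ext with "x5049", without any reading of the raw text.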
The corresponding grammar for the above tagged material is as follows:
    <RECORD> ::= (<NAME><TITLE><MAILCODE><EXT>);
    <NAME> ::= <#PCDATA>;
    <TITLE> ::= <#PCDATA>;
    <MAILCODE> ::= <#PCDATA>;
    <EXT> ::= <#PCDATA>;
Note that adjacent to the identification of record there are the ANDed grammar elements name, title, mailcode, and extension. Components containing text are tagged, for example, with the notation "#PCDATA". A large grammar or a compilation, described herein as a "corpus grammar", may have thousands of rules and hundreds of grammar elements. Even though the corpus grammar is so extensive, it is still of value to those needing to develop a reduced grammar or DTD.
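The correspondence between a tagged record and its grammar rules can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the technique of any particular system: the input is taken to be a single well-formed document with explicit start and end tags, and a repeated element simply keeps the children of its last occurrence. The names RECORD, TOKEN, derive_rules, and format_rules are hypothetical.

```python
import re

# The tagged record from the text, reproduced so the sketch is self-contained.
RECORD = (
    "<record> <name>Keith Shafer</name> <title>Research Scientist</title> "
    "<mailcode>MC 410</mailcode> <ext>x5049</ext> </record>"
)

TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")


def derive_rules(marked_up):
    """Map each element name to the ordered (ANDed) list of its children."""
    rules = {}  # element name -> children, in order of first appearance
    stack = []  # child lists of the elements that are currently open
    for match in TOKEN.finditer(marked_up):
        slash, name, text = match.groups()
        if name and not slash:
            element = name.upper()
            if stack:
                stack[-1].append(element)  # record as child of the parent
            children = []
            rules[element] = children
            stack.append(children)
        elif name and slash:
            if not stack:
                raise ValueError(f"unexpected end tag </{name}>")
            stack.pop()
        elif text and text.strip() and stack:
            stack[-1].append("#PCDATA")  # text content of the open element
    return rules


def format_rules(rules):
    """Render the rules in the '<X> ::= ...;' notation used in the text."""
    lines = []
    for element, children in rules.items():
        body = "".join(f"<{child}>" for child in children)
        if len(children) > 1:
            body = f"({body})"
        lines.append(f"<{element}> ::= {body};")
    return lines


if __name__ == "__main__":
    print("\n".join(format_rules(derive_rules(RECORD))))
```

For the record shown earlier, the sketch yields one rule for RECORD listing its four ANDed children and one #PCDATA rule for each of NAME, TITLE, MAILCODE, and EXT. Accumulating such rules over many documents is what makes a corpus grammar grow to thousands of rules.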
The SGML standard is available as publication (5) listed above.
Several overviews of SGML, SGML resources and related textual needs are available. See, for instance, publications (6) through (10) listed above.
Several books have been written about SGML, a more popular one of which is publication (11) listed above.
The reader's attention additionally is directed to a bibliography of SGML papers, books, products and the like, which also includes abstracts and opinions on many of the listed items.
This bibliography is identified as publication (12) listed above.
A DTD of the corpus grammar shown above may be provided as follows:
    <!DOCTYPE RECORD [
    <!ELEMENT RECORD (NAME, TITLE, MAILCODE, EXT)>
    <!ELEMENT NAME #PCDATA>
    <!ELEMENT TITLE #PCDATA>
    <!ELEMENT MAILCODE #PCDATA>
    <!ELEMENT EXT #PCDATA>
    <!ENTITY #DEFAULT " *** UNDEFINED ENTITY REFERENCE***">
    ]>
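The step from grammar rules to a DTD is largely mechanical, as the following Python sketch suggests. It mirrors the declaration style shown above rather than the full SGML standard; a conforming SGML parser may require further detail, such as tag-minimization indicators, and the default entity declaration is omitted here. The names RULES and to_dtd are hypothetical, and the rule table is written out by hand for this illustration.

```python
# Grammar rules for the example record: element name -> ordered children.
RULES = {
    "RECORD": ["NAME", "TITLE", "MAILCODE", "EXT"],
    "NAME": ["#PCDATA"],
    "TITLE": ["#PCDATA"],
    "MAILCODE": ["#PCDATA"],
    "EXT": ["#PCDATA"],
}


def to_dtd(root, rules):
    """Render grammar rules as DTD element declarations for the given root."""
    lines = [f"<!DOCTYPE {root} ["]
    for element, children in rules.items():
        # ANDed children become a comma-separated sequence group.
        model = ", ".join(children)
        if len(children) > 1:
            model = f"({model})"
        lines.append(f"<!ELEMENT {element} {model}>")
    lines.append("]>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_dtd("RECORD", RULES))
```

Each rule yields exactly one element declaration, so a corpus grammar with thousands of rules produces a DTD of comparable size unless the grammar is reduced first.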
From the foregoing, substantial relief in terms of labor requirements and costs can be foreseen in the electronic information industry with the development of a technique for automatically generating a corpus grammar and, additionally, for reducing a corpus grammar or an overly extensive DTD to a grammar of practical size.