1. Field of the Invention
The present invention relates to the field of data processing. More specifically, the present invention relates to the storage of semi-structured data.
2. Background Information
Increasingly, information pages, such as web pages, are being constructed using the Extensible Stylesheet Language (XSL) and semi-structured data, such as Extensible Markup Language (XML) encoded data, because of the rich functionality and extensibility these technologies offer.
“Semi-structured data” refers to data that has structure, but where the contents of particular structural elements need not be consistent. To facilitate this characteristic, the data are “self-describing.” For example, in a “person” application, a person can be validly defined by semi-structured data with only a subset of all possible data associated with a person, e.g., by only a last name and a telephone number, or a first name, last name, and address, or some other combination. Or, a person may be defined with additional data not previously seen, such as an employer name, an employer address, and an employer telephone number. Thus, each semi-structured “person” definition may vary.
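The variability described above can be illustrated with a minimal sketch. The element names (people, person, first, last, phone, address, employer) and the sample values are hypothetical and chosen only for illustration; they are not taken from any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical "person" records; each one carries a different set of fields.
DOC = """
<people>
  <person><last>Smith</last><phone>555-0100</phone></person>
  <person><first>Ana</first><last>Lopez</last><address>12 Elm St</address></person>
  <person><last>Chen</last><employer><name>Acme</name><phone>555-0199</phone></employer></person>
</people>
""".strip()

def describe(person):
    """Return the set of child element names that a person record carries."""
    return {child.tag for child in person}

root = ET.fromstring(DOC)
shapes = [describe(p) for p in root.findall("person")]

# Each record is valid, yet no two records share the same set of fields.
for shape in shapes:
    print(sorted(shape))
```

Because each record names its own fields through its tags, a reader of the data can discover the shape of every record without a schema known in advance.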
“XML” is one example of a language for representing semi-structured data. XML is a subset of the Standard Generalized Markup Language (SGML), a system promulgated by the International Organization for Standardization (ISO) for organizing and tagging elements of a document. Interpretation of the XML tags, however, is left to an interpreter. Thus, XML is adaptable to a wide variety of purposes. In particular, since XML is based on plain text, XML-based documents can be received and processed by many different computing platforms, such as by a platform-independent browser or other networked application programs.
“Structured data,” in contrast, refers to data formats such as those used for databases, spreadsheets, address books, and so forth, where in each case, the data format is well-defined by a schema and essentially inflexible. For example, in the database context, a database can be defined to store data according to some data-storage requirements. The storage requirements, e.g., the schema or nature of valid input, are known in advance, and the database is defined according to the structure of the potential input data. If the database were storing information about a person, such as first name, last name, address, telephone number, and employer, every person record in the database would have space allocated for the information being tracked. Hence, the database is structured.
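By way of contrast, the following minimal sketch shows a fixed schema in practice. It assumes an in-memory SQLite database and a hypothetical person table with five columns; the table name, column names, and sample values are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is fixed in advance: every person row has exactly these columns.
conn.execute(
    """CREATE TABLE person (
           first_name TEXT,
           last_name  TEXT,
           address    TEXT,
           telephone  TEXT,
           employer   TEXT
       )"""
)

# A person known only by last name and telephone still has all five columns;
# the fields that were not supplied are stored as NULL.
conn.execute(
    "INSERT INTO person (last_name, telephone) VALUES (?, ?)",
    ("Smith", "555-0100"),
)
row = conn.execute("SELECT * FROM person").fetchone()

# A field that the schema did not anticipate is rejected outright.
try:
    conn.execute(
        "INSERT INTO person (last_name, employer_address) VALUES (?, ?)",
        ("Chen", "1 Main St"),
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True
```

In this sketch, accommodating the new field would require altering the table definition itself, which is the inflexibility described above.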
An example of a structured database is the relational database, also referred to as a SQL database, after its query language, the Structured Query Language (SQL). Unlike semi-structured data, SQL databases are not only “inflexible” but also platform dependent. Nevertheless, because of their availability and robustness, recent research has turned towards using structured databases, such as a SQL database, to store and retrieve information defined by semi-structured data.
One significant issue, however, is how to convert from semi-structured data, such as XML encoded data, to structured data storage, such as a SQL database. Towards this end, various approaches have been proposed. For example, Florescu et al., A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database, Rapport de Recherche No. 3680, INRIA, Rocquencourt, France (May 1999), discusses techniques and issues related to using directed graphs to represent semi-structured data. And Shanmugasundaram et al., Relational Databases for Querying XML Documents: Limitations and Opportunities, Proceedings of the 25th VLDB Conference, Edinburgh, Scotland (1999), discusses using XML document type descriptors (DTDs) to convert XML documents to relational tuples.
A significant limitation of these and other current conversion approaches is that the mapping between semi-structured and structured data formats is performed by applying a fixed set of “rules.” That is, in each of these techniques, one and only one mapping is possible. For a given semi-structured input, the conversion rules alone determine the corresponding structured database output. The conversion is not flexible.
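A fixed-rule conversion of this kind can be sketched as follows. The single rule used here (each leaf element becomes one row of person index, element path, and text) is a hypothetical simplification chosen for illustration and is not the specific scheme of either cited reference; the function names and sample data are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical input with two differently shaped "person" records.
DOC = (
    "<people>"
    "<person><last>Smith</last><phone>555-0100</phone></person>"
    "<person><last>Chen</last><employer><name>Acme</name></employer></person>"
    "</people>"
)

def walk(elem, path, person_id, rows):
    """Append one (person_id, path, text) row for every leaf below elem."""
    for child in elem:
        child_path = f"{path}/{child.tag}" if path else child.tag
        if len(child) == 0:
            # Leaf element: the fixed rule turns it into exactly one row.
            rows.append((person_id, child_path, (child.text or "").strip()))
        else:
            walk(child, child_path, person_id, rows)

def to_rows(xml_text):
    """Convert the document to rows using the one built-in rule."""
    rows = []
    root = ET.fromstring(xml_text)
    for person_id, person in enumerate(root.findall("person")):
        walk(person, "", person_id, rows)
    return rows

rows = to_rows(DOC)
for r in rows:
    print(r)
```

The caller has no means of requesting a different layout, for example one column per field or a separate employer table; the rule embedded in the converter fully determines the output for a given input.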
Thus, a more flexible approach to handling semi-structured data in a structured manner is desired.