1. Field of the Invention
The present invention relates to the field of data processing. More specifically, the present invention relates to the storage of semi-structured data.
2. Background Information
Increasingly, information pages, such as web pages, are being constructed using the extensible style language (XSL) and semi-structured data, such as extensible markup language (XML) encoded data, because of their richness in functions and their extensibility.
“Semi-structured data” refers to data that has structure, but where the contents of particular structural elements need not be consistent. To facilitate this characteristic, data are “self-describing”. For example, in a “person” application, a person can be validly defined by semi-structured data with only a subset of all possible data associated with a person, e.g., by only a last name and a telephone number, or a first name, last name, and address, or some other combinations. Or, a person may be defined with additional data not previously seen, such as an employer name, an employer address, and an employer telephone number. Thus, each semi-structured “person” definition may vary.
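By way of illustration only, the following minimal Python sketch shows three “person” definitions that each carry a different set of fields, yet all remain valid. The element names (`lastname`, `phone`, `employer`, and so forth) and the sample values are hypothetical and are not taken from any particular application.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the element names and values are illustrative only.
PEOPLE_XML = """
<people>
  <person><lastname>Smith</lastname><phone>555-0100</phone></person>
  <person><firstname>Ann</firstname><lastname>Lee</lastname><address>1 Main St</address></person>
  <person><lastname>Diaz</lastname><employer><name>Acme</name><phone>555-0199</phone></employer></person>
</people>
""".strip()


def fields_of(person):
    """Return the set of child tag names that this person record carries."""
    return {child.tag for child in person}


root = ET.fromstring(PEOPLE_XML)

# Each record describes itself through its own tags, so the field sets differ.
field_sets = [fields_of(p) for p in root.findall("person")]
for fields in field_sets:
    print(sorted(fields))
```

Because each record names its own fields, a reader of the data need not know in advance which fields a given person will have.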
“XML” is one example of a language for representing semi-structured data. XML is a subset of the Standard Generalized Markup Language (SGML), a system promulgated by the International Organization for Standardization (ISO) for organizing and tagging elements of a document. Interpretation of the XML tags, however, is left to an interpreter. Thus, XML is adaptable to a wide variety of purposes. In particular, since XML is based on plain text, XML-based documents can be received and processed by many different computing platforms, such as by a platform-independent browser or other networked application programs.
“Structured data”, in contrast, refers to data formats such as those used for databases, spreadsheets, address books, and so forth, where in each case, the data format is well-defined by a schema and essentially inflexible. For example, in the database context, a database can be defined to store data according to some data-storage requirements. The storage requirements, e.g., the schema or nature of valid input, are known in advance, and the database is defined according to the structure of the potential input data. If the database were storing information about persons, such as first name, last name, address, telephone number, and employer, every person record in the database would have space allocated for each item of information being tracked. Hence, the database is structured.
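The contrast can be sketched, under stated assumptions, with Python's standard `sqlite3` module. The table name `person` and its column names are hypothetical. The sketch shows that every record is allocated the same columns whether or not a value is supplied, and that a field not anticipated by the schema is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is fixed in advance (hypothetical table and column names).
conn.execute(
    """CREATE TABLE person (
           first_name TEXT,
           last_name  TEXT,
           address    TEXT,
           telephone  TEXT,
           employer   TEXT
       )"""
)

# A person known only by last name and telephone still occupies all five columns.
conn.execute(
    "INSERT INTO person (last_name, telephone) VALUES (?, ?)",
    ("Smith", "555-0100"),
)
row = conn.execute("SELECT * FROM person").fetchone()
print(row)  # unsupplied columns are stored as NULL (None in Python)

# A field the schema did not anticipate cannot be stored.
try:
    conn.execute(
        "INSERT INTO person (last_name, employer_address) VALUES (?, ?)",
        ("Diaz", "2 Elm St"),
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True
print("unknown column rejected:", rejected)
```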
An example of a structured database is the relational database, also referred to as a SQL database, where SQL (Structured Query Language) is the name of its query language. In addition to being “inflexible”, SQL databases, unlike semi-structured data, are also platform dependent. Nevertheless, because of their availability and robustness, recent research has turned towards using structured databases, such as SQL databases, to store and retrieve information defined by semi-structured data.
One significant issue, however, is how to convert from semi-structured data, such as XML encoded data, to structured data storage, such as a SQL database. Towards this end, various approaches have been proposed. For example, Florescu et al., A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database, Rapport de Recherche No. 3680, INRIA, Rocquencourt, France (May 1999), discusses techniques and issues related to using directed graphs to represent semi-structured data. Similarly, Shanmugasundaram et al., Relational Databases for Querying XML Documents: Limitations and Opportunities, Proceedings of the 25th VLDB Conference, Edinburgh, Scotland (1999), discusses using XML document type descriptors (DTDs) to convert XML documents to relational tuples.
A significant limitation of these and other current conversion approaches is that the mapping between structured and semi-structured data formats is performed by applying a fixed set of “rules”. That is, in each of these techniques, one and only one mapping is possible. For a given semi-structured input, the conversion rules fully determine the corresponding structured database output. The conversion is not flexible.
Thus, a more flexible approach to handling semi-structured data in a structured manner is desired.
In accordance with a first aspect of the present invention, a mapper generates a structured organization to store a collection of semi-structured data. Collaterally, the mapper also generates a description of how the semi-structured data are stored under the structured organization.
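Purely as an illustration of the two outputs named in this aspect, and not as the mapper of the present invention, the following Python sketch derives one possible structured organization (a single table definition) from a collection of semi-structured records, together with a description of where each element is stored. The function name `generate_organization`, the one-table layout, and the sample data are all assumptions made for this sketch. It further assumes that element names are simple identifiers usable as column names.

```python
import xml.etree.ElementTree as ET


def generate_organization(xml_text, record_tag, table_name):
    """Hypothetical sketch: return (table definition, storage description).

    Assumes every child tag of a record is a valid SQL column identifier.
    """
    root = ET.fromstring(xml_text)

    # Collect every child tag seen in any record, preserving first-seen order.
    columns = []
    for record in root.iter(record_tag):
        for child in record:
            if child.tag not in columns:
                columns.append(child.tag)

    # Output 1: a structured organization able to hold the collection.
    ddl = "CREATE TABLE {} ({})".format(
        table_name, ", ".join(f"{c} TEXT" for c in columns)
    )

    # Output 2: a description of how each element is stored in that organization.
    description = {tag: f"{table_name}.{tag}" for tag in columns}
    return ddl, description


# Hypothetical sample collection.
SAMPLE = (
    "<people>"
    "<person><lastname>Smith</lastname><phone>555-0100</phone></person>"
    "<person><firstname>Ann</firstname><lastname>Lee</lastname></person>"
    "</people>"
)

ddl, description = generate_organization(SAMPLE, "person", "person")
print(ddl)
print(description)
```

Because the description is produced alongside the organization rather than fixed in advance, a different layout could be generated for the same input along with a correspondingly different description. This sketch produces only one such layout.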
In accordance with a second aspect of the present invention, a mapper generates a semi-structured data organization for a collection of structured data. In like manner, the mapper also collaterally generates a description of correspondence between the semi-structured data organization and the structured data.