Throughout the world of computer data systems, a lack of standardization is often problematic. Different operating systems, applications, and file formats create barriers to the transfer of data between systems. These barriers are particularly relevant to the field of computerized data representation and storage.
Data can be stored, represented, and displayed in many different formats. These diverse formats can be loosely grouped into three broad categories: highly structured documents, unstructured documents, and indecipherable documents. Highly structured documents, such as a set of tables in a relational database, an XML document governed by a DTD or Schema, or data coming out of SAP, contain data that is easily readable by both humans and machines. This data can therefore be extracted and transferred between documents with dissimilar formats with little or no difficulty.
Conversely, documents in the indecipherable category, such as an Adobe PostScript file generated by printing a Microsoft Word document to a file, or a JPEG image that happens to contain text, may contain data that is easily readable by humans. Data from these indecipherable documents, however, is virtually impossible to put into a machine-readable format, which greatly hinders the transfer of this data between documents.
The increasing prevalence of documents created for display on the World Wide Web, however, means that the largest and most interesting category is that of unstructured documents. These documents contain data that is human readable, but they impose no well-defined structure on that data. Nonetheless, data can be extracted from these documents and transferred between them, despite the lack of a well-defined structure. Examples of such unstructured documents are HTML pages with hierarchical header tags, HTML pages with tables, and XML documents with no DTD or Schema.
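As an illustration of how data might be recovered from an HTML page with hierarchical header tags, the following is a minimal sketch that uses Python's standard html.parser module. The class name HeadingOutline, the function extract_outline, and the sample markup are hypothetical and chosen only for this example; they are not taken from any particular system.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collects (level, text) pairs for the h1..h6 elements of a page."""

    def __init__(self):
        super().__init__()
        self.outline = []      # extracted (level, heading text) pairs
        self._level = None     # level of the heading currently open, if any
        self._text = []        # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


def extract_outline(html):
    """Return the heading hierarchy implied by the header tags in html."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    return parser.outline


# Hypothetical sample page: no schema, but the header tags imply a hierarchy.
sample = (
    "<h1>Products</h1><p>intro</p>"
    "<h2>Widgets</h2><p>Price: 5</p>"
    "<h2>Gadgets</h2>"
)
outline = extract_outline(sample)
```

The heading levels give a loose hierarchy even though no DTD or Schema governs the page, which is the sense in which such a document is unstructured yet still extractable.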
It is often desirable to transfer data from one or more data sources to a destination. To transfer data between a source and a destination, the data must first be extracted from the source that contains it and represented in some manner. The destination to which the data is to be transferred must also be represented. Once both the source and the destination are represented, a mapping must be constructed between each item of data in the source and the corresponding location in the destination to which that item is to be transferred.
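The three steps described above (representing the source, representing the destination, and mapping between them) can be sketched as follows. This is only an illustrative sketch under simple assumptions: the source is represented as a dictionary of path-like keys, the destination as a list of field names, and the mapping as a dictionary from source keys to destination fields. All of the names and sample values are hypothetical.

```python
# Hypothetical representation of data extracted from a source document:
# each item is addressed by a path-like key.
source = {
    "order/customer": "Acme Corp",
    "order/total": "149.90",
    "order/notes": "deliver before noon",
}

# Hypothetical representation of the destination: the fields it can hold.
destination_fields = ["customer_name", "amount"]

# The mapping pairs each source item with its location in the destination.
mapping = {
    "order/customer": "customer_name",
    "order/total": "amount",
}


def transfer(source, mapping, destination_fields):
    """Copy mapped items from the source into a new destination record."""
    record = {field: None for field in destination_fields}
    for source_key, destination_field in mapping.items():
        if destination_field not in record:
            raise KeyError(f"unknown destination field: {destination_field}")
        if source_key in source:
            record[destination_field] = source[source_key]
    return record


record = transfer(source, mapping, destination_fields)
```

Source items with no entry in the mapping, such as the notes item here, are simply not transferred, and destination fields with no matching source item remain empty.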
Because of the pervasiveness of unstructured documents, and the volume of data that they contain, there is a need for a means of transferring data from these unstructured sources to destinations where the data can be better manipulated and utilized. Moreover, because these unstructured documents contain such a large volume of data, there is a need for such a means to be simple and easy to use.