Databases play an integral role in the information systems in most major organizations. Databases may take many forms, and play many roles such as a mailing list, an accounting spreadsheet, or statistical sales projections. After using several generations of technology, most organizations of any size have data stored in many different systems and formats. However, the increasing pace of competition is putting the onus on the organizations to build seamless bridges that combine the dizzying array of data sources quickly and cost-effectively into meaningful information.
In addition, increased opportunities brought by the World Wide Web (the xe2x80x9cWebxe2x80x9d) add to the pressure of providing access to that information in a useful and efficient manner. For example, organizations may need to transform raw data and stage it to separate, redundant web servers for quick access via multi-tier application architectures designed for thin clients. Alternatively, organizations may need to couple their systems via XML across the web with the systems of other organizations.
At the heart of nearly every Web-based business is the need to transform and integrate data. Because of the wide range of formats and applications within which business objects, transactions, catalog content and log files may be stored, data integration is perhaps the most painful and complex challenge facing business persons and application developers alike. The pain is most sharp in common scenarios that involve multiple trading partners with each partner having different internal production systems and different protocols for exchanging data.
Regardless of the particular need, organizations desiring to participate in any sort of e-commerce venture will likely have to deal with staging data from disparate sources. For example, an organization may need to combine information from multiple Internet systems along with external text feeds to build a customer relationship management system; or to integrate the organization""s systems with those of customers and suppliers across the value chain. To be successful, an organization must transform data into useable formats for internal departments, partners, and customers.
To make data available and meaningful for different recipients, data transformation is often necessary. Data transformation generally refers to a sequence of operations that transforms a set of input data into a set of output data. Though the term data conversion has a slightly different technical connotation, it is often used synonymously with the term data transformation. Data transformation allows for the changing of the content, format, or data structure. Common content changes include adding, deleting, aggregating, concatenating, and otherwise modifying existing data. Common data formats include binary files, sequential files, embedded binary data, EBCDIC data from mainframes, common file types created by C, COBOL, FORTRAN, Basic, RPG, Pascal, and other languages, arrays, ISAMs and other record managers, PC-based databases, accounting applications, and Web-based data reachable through SQL/ODBC. Common data source structures may include spreadsheets, contact managers, mail list software, and statistical packages.
The process of converting data becomes increasingly complicated with each increase in the number of input data sources, the number of output data sources, the content of the data sources, the format of the data sources, and the complexity of data structures. For example, different data storage systems use data structures with different structures. For example, mainframe systems typically use a hierarchical data storage method, whereas client-server systems often use a relational database storage method.
Current data transformation techniques are generally expensive to implement, are not portable, and difficult to adapt to new or changing circumstances. For example, point-to-point links are generally hand-coded customized data transformation programs. Customized code is typically written in-house and is specific to a single application or DBMS environment. On the positive side, such solutions generally provide exactly what is needed and no more, and address requirements for which there may be no off-the-shelf products. In-house development, testing and debugging also narrows the focus, and tends to produce a workable, if non-versatile, solution. On the other hand, because these routines are usually specific to a particular source or target database, they are difficult to port to other environments. These routines may also be difficult to repeat because the routines are generally unique to each situation and because there is typically no infrastructure in place to manage the processes. Finally, building custom routines robs in-house DBAs of time better spent on other tasks. In addition, custom coded solutions require continued maintenance because they must be modified every time a new requirement is added to the system Further, custom code may take a relatively long time to implement with some legacy migration projects tying up critical IT staff for weeks, months and even years.
Consultants and customized tools are also used by organizations with increasing frequency today. Outside consultants typically have acquired extensive experience in building data models, designing movement and transformation methodologies and developing transformation tools. Such tools tend to be more portable, since they have been developed with multi-platform DBMS environments in mind. Because database consultants have had to become knowledgeable about business operations as well, these tools also tend to address business processes adequately. However, all application expertise leaves along with the consultant. In addition, because these routines are specific to single aspects of the business, they are difficult to recreate for other branches or divisions.
A common alternative to point-to-point links involves streaming data through a conduit into a universal structure, transforming the data in a central hub, then streaming the data through another conduit to the target format. Transforming the data may happen in real time but requires downloading the structure into memory to make it possible to apply a consistent set of visually defined transformation capabilities to an intermediate data stream regardless of the data""s original format or transformed format. This method generally requires less programming code than point-to-point links. However, the architecture tends to limit transformation operations, and less efficient versions increase latency and complicate scalability. Further, loading the entire source structure into memory may become prohibitive with large and complex structures.
The use of an iterative method addresses this last concern. Iterative data transformation methods do not require the use of large amounts of memory that loading the entire structure requires, because source data is examined one record at a time. However, with current iterative methods, it is difficult to keep track of the relationship between records.
Another relatively new technology, XML, has lured application developers with the promise of an easier way to integrate data between applications and between organizations over the Internet. However, as organizations rush to adopt progressive e-business infrastructures such as XML to gain an edge over the competition, they are stumbling upon an unsettling reality. Since the W3C released the public specifications for XML in 1998, vertical industries and major corporations have already implemented hundreds of disparate XML xe2x80x9cstandards.xe2x80x9d Thus, a big hurdle for trading partners, developers, and net market makers that need to interface with multiple production systems and organizations is the wide range of XML standards (as well as other data formats) that they will likely encounter.
Data transformation tools currently in use are generally expensive, time-consuming to implement, programming-intensive, and inflexible. An ideal data transformation tool requires a minimum amount of custom programming, has the flexibility to work with varied data formats, contents, and structures, and possesses the capability to adapt to different sources and targets.
The following illustrative explanations are provided to facilitate understanding of certain terms used frequently herein, particularly in the Description of Illustrative Embodiments. The explanations are provided as a convenience and are not limitative of the invention.
columnxe2x80x94the container for a single item of information in a row. Also known as field.
databasexe2x80x94one or more structured sets of persistent data, usually associated with software to update and query the data. A simple database might be a single file containing many records, each of which contains the same set of fields where each field is a certain fixed width.
Data Definition Language (DDL)xe2x80x94those text commands that define, as opposed to manipulate, data. For example, the SQL statements CREATE TABLE, CREATE INDEX, GRANT, and REVOKE.
Data Manipulation Language (DML)xe2x80x94those text commands that manipulate, as opposed to define, data. For example, the SQL statements INSERT, UPDATE, DELETE, and SELECT.
database management system (DBMS)xe2x80x94a software system for managing databases. These systems typically provide mechanisms for creating databases, defining objects in the database (tables, views, etc.), manipulating data (inserts, updates, queries), and controlling access to data (security).
electronic data interchange (xe2x80x9cEDIxe2x80x9d)xe2x80x94in common usage, generally refers to either the ANSI standard (X12) or the ISO EDIFACT (Electronic Data Interchange for Administration Commerce and Transport) standard formats.
eventxe2x80x94an occurrence of significance to a particular task.
fieldxe2x80x94sometimes defined as the smallest addressable piece of a record, for both defining record layouts or structures and for access data in a data record.
filexe2x80x94a named collection of data which is stored or processed as a single logical unit. Files are generally subdivided into one or more types of records, which are themselves subdivided into fields. See also field, and records.
file layoutxe2x80x94the organization of the data within a file. Describes the types of records used in the file. See also file, and records.
fixed-length data typexe2x80x94a data type that is always stored in the same number of bytes, such as a two-byte integer. See also variable-length data type.
foreign keyxe2x80x94a column or columns in a table that match the primary key in another table.
keyxe2x80x94a column or columns whose values identify a row.
HDSxe2x80x94HDS is a file-based hierarchical data system designed for the storage of a wide variety of information. It is particularly suited to the storage of large multi-dimensional arrays (with their ancillary data) where efficient access is needed. HDS organizes data into hierarchies, broadly similar to the directory structure of a hierarchical filing system, but contained within a single HDS container file. The structures stored in these files are self-describing and flexible; HDS supports modification and extension of structures previously created, as well as deletion, copying, renaming, etc. All information stored in HDS files is portable between the machines on which HDS is implemented. Thus, format transformation problems when moving between machines are minimized.
Hypertext Markup Language (xe2x80x9cHTMLxe2x80x9d)xe2x80x94an application of SGML that uses tags to mark elements, such as text or graphics, in a document to indicate how Web browsers should display these elements to the user and should respond to user actions such as activation of a link by means of a key press or mouse click. HTML is used for documents on the World Wide Web. HTML 2.0, defined by the Internet Engineering Task Force (xe2x80x9cIETFxe2x80x9d), includes features of HTML common to all Web browsers as of 1995, and was the first version of HTML widely used on the World Wide Web. Future HTML development will be carried out by the World Wide Web Consortium (xe2x80x9cW3Cxe2x80x9d). HTML 3.2, the latest proposed standard, incorporates features widely implemented as of early 1996. A description of SGML and HTML features is given in Bradley, N., The Concise  less than SGML greater than  Companion, Addison Wesley Longman, New York, 1997, which is incorporated herein by reference.
Indexed Sequential Access Methods (xe2x80x9cISAMsxe2x80x9d)xe2x80x94a record management system that provides support both for sequential access and for indexed retrieval of the records by key values.
metadataxe2x80x94data about data that informs users how data arrived in the data store, how the data is laid out, where it came from, and/or how it was transformed.
object-oriented database (OOBD)xe2x80x94a system offering database management facilities in an object-oriented programming environment. Data is stored as objects and can be interpreted only using the methods specified by its class. The relationship between similar objects is preserved (inheritance) as are references between objects.
queryxe2x80x94a user""s (or agent""s) request for information, generally as a formal request to a database or a search engine. SQL is the most common database query language.
recordxe2x80x94a collection of data items arranged for processing by a program. Multiple records are contained in a file or data set. The organization of data in the record is usually prescribed by the programming language that defines the record""s organization and/or by the application that processes it. Typically, records can be of fixed-length or variable length with the length information contained within the record.
relational databasexe2x80x94a database based on the relational model developed by E. F. Codd. A relational database allows the definition of data structures, storage and retrieval operations, and integrity constraints. In such a database, the data and relations between them are organized in tables. A table is a collection of records and each record in a table contains the same fields. Certain fields may be designed as keys, which means that searches for specific values of that field will use indexing to speed them up. Records in different tables may be linked if they have the same value in one particular field in each table.
rowxe2x80x94a set of related columns that describe a specific entity. Also known as a record.
SAP Intermediate business Document (xe2x80x9cIDOCxe2x80x9d)xe2x80x94an interchange format based on EDI used by SAP AG applications such as SAP R/3.
schemaxe2x80x94a database object that contains one or more tables, often created by a single user.
sessionxe2x80x94an OLE DB object that serves as the context for a transaction.
Standard Generalized Markup Language (xe2x80x9cSGMLxe2x80x9d)xe2x80x94an information management standard adopted by the International Organization for Standardization (xe2x80x9cISOxe2x80x9d), as ISO 8879:1986, as a means for providing platform-independent and application-independent documents that retain content, indexing, and linked information. SGML provides a grammar-like mechanism for users to define the structure of their documents and the tags they will use to denote the structure in individual documents. A complete description of SGML is provided in Goldfarb, C. F., The SGML Handbook, Oxford University Press, Oxford, 1990, and McGrath, S., Parseme.1st: SGML for Software Developers, Prentice Hall PTR, New Jersey, 1998, which are incorporated herein by reference.
standard query language (SQL)xe2x80x94a language which provides a user interface to relational database management systems, developed by IBM in the 1970s for use in System R. SQL is the de facto standard, as well as being an ISO and ANSI standard. It is often embedded in other programming languages.
transactionxe2x80x94an atomic unit of work. The work in a transaction must be completed as a whole; if any part of the transaction fails, the entire transaction fails.
transaction isolationxe2x80x94the act of isolating one transaction from the effects of all other transactions.
transaction isolation levelxe2x80x94a measure of how well a transaction is isolated.
truncatexe2x80x94to discard one or more bytes of variable-length data or non-significant digits of numeric data. Truncation results in a warning condition when getting data and a error condition when setting data.
Uniform Resource Identifier (URI)xe2x80x94the generic term for all types of names and addresses that refer to objects on the World Wide Web. A URL is one kind of URI.
valuexe2x80x94a data value.
variable-length data typexe2x80x94a data type for which the length of the data can vary, such as a string. See also fixed-length data type.
eXtensible Markup Language (xe2x80x9cXMLxe2x80x9d)xe2x80x94a subset of SGML defined by W3C as a method for putting structured data into a text file.
In one broad respect, the present invention is directed to a system for data transformation comprising one or more read spokes with each read spoke configured to connect to one or more data sources, each data source having one or more data structures referred to collectively as source structures; one or more modeless write spokes with each modeless write spoke configured to connect to one or more data targets, each data target having one or more data structures referred to collectively as target structures; and a transformation engine operatively coupled to the one or more read spokes for retrieving data from the one or more data sources, and coupled to the one or more modeless write spokes for storing data in the one or more data targets, with the transformation engine comprising a transformation map that comprises one or more mappings that relates one or more source structures to one or more target structures and an event list comprising one or more event actions, each with a corresponding triggering event, and the transformation engine is configured to iterate through the data sources and detect occurrences of triggering events and execute the respective one or more event actions from the event action list in response to the detection of a triggering event. In a narrow respect, the transformation engine further comprises a query language preprocessor operable to review the data transformation map and evaluate embedded expressions in the one or more mappings. In another narrow respect, at least one of the event actions (called a xe2x80x9ctransformation event actionxe2x80x9d) comprises retrieving at least source structure from the data source, transforming data from at least one source structure (called xe2x80x9ctransformed source dataxe2x80x9d), and storing the transformed source data into one or more target structures, with the transformation engine operable, in response to a transformation event action, to transform data specified by the transformation event action in a manner described by the data transformation map. In another narrow respect, a user interface is configured to allow a user to define the one or more data sources, and to define data structures in each of the one or more source databases. In another respect, a user interface is configured to allow a user to define the one or more data targets, and to define data structures in each of the one or more target databases. In yet another respect, a user interface is configured to allow a user to define the relationship between one or more data sources and one or more data targets. In a narrower respect, the user interface allows the user to relate source data structures to target data structures. In particular, the user interface comprises a display configured to graphically depict the relation between the source structures and the target structures specified in the transformation map. In another narrower respect, the user interface is configured to define the relationship between one or more data sources and one or more data targets as a logical expression and/or as a numeric expression. In another narrow respect, the transformation engine includes a display configured to show the contents of the data source and the contents of the data structure. In other respects, the triggering event is a generic source event, a generic target event, a generic transformation event, or a specific source record event. In yet another narrow respect, the transformation engine may also be configured to filter the data retrieved from the data 11 source, referred to as filtered source data, and can be further configured to iterate only through the filtered source data. In narrower respects, the transformation engine filters the data using predetermined sampling parameters governing a range or sample, or using predetermined logical extraction criteria. In yet other respects, the read spokes connect to the one or more data sources by utilizing a raw sequential mode such that an intuitive visual parser reconstructs record layouts, or by utilizing a compatible physical file format allowing the transformation engine to physically read from the data sources using the native internal storage format.