The way data is usually organised in databases and the way data is processed is often not optimal for efficiency and processing speed. An example of a simple database in a common format is shown in FIG. 1. Suppose there is a newly installed database comprising four small tables as shown in FIG. 1. The first table FIG. 1a is called CLIENT and the column called CLIENT_NAME is the table's primary key (see the exclamation mark). Initially this table contains two records: one for Apple and one for Microsoft. The second column called CLIENT_ADDRESS contains the addresses of the clients. The second table, shown in FIG. 1b, called PRODUCT has a column PRODUCT_NAME, which is the table's primary key. Initially this table contains two records: one for ‘mouse’ and one for ‘disk’. The second column called PRODUCT_PRICE contains the corresponding price of the ‘mouse’ and ‘disk’. The third table, which is shown in FIG. 1c, is called CLIENT_ORDER and stores the orders made by clients. Column ORDER_ID is the table's primary key and column ORDER_CLIENT is to contain the corresponding client registered in table CLIENT. The table is empty, as no orders have been made by the clients in the first example. The fourth table, shown in FIG. 1d, is called ORDER_ITEM and may store the quantities of products commissioned in a particular order, e.g. a client might commission 3 mice and 5 disks in a single order, which would result in two records added to the table—one for the mice and one for the disks. The column ITEM_ORDER contains the order the product was commissioned in, while column ITEM_PRODUCT contains the name of the commissioned product. The table is empty, as no orders have been made by the clients yet.
In this example, on the 8th of April Apple makes the first order with ORDER_ID=1 and they want to buy 7 disks and 4 mice. The database is changed as shown in FIG. 2. The tables CLIENT_ORDER and ORDER_ITEM are populated with the corresponding data.
On the very same day, Microsoft send in their order, which order will have identifier ORDER_ID=2, and they want to buy 12 disks. FIG. 3 shows how the tables CLIENT_ORDER and ORDER_ITEM are updated with the corresponding data.
Four days later the last order is placed by Apple, who orders 9 disks. FIG. 4 shows how the tables CLIENT_ORDER and ORDER_ITEM are updated with the corresponding data.
At the end of month the clients are charged for what they have ordered and a detailed invoice needs to be generated for each client. Examples of invoices are shown in FIGS. 5 and 6. The products are listed under the corresponding orders. The structure of the invoice is an example of a final format for presenting the data and the data need to be transformed from the database structure as shown in FIGS. 1 to 4 into the structure of the invoice as shown in FIGS. 5 and 6
The data stored in the database is not organized in the way the invoice is organised. Instead, each type of object has an individual table that stores the objects of that type (i.e. one object in one row). The relationships between the objects are maintained by keeping a unique identifier of the referred object within the row of the referring object—e.g. order of ID “3” refers to the client of name “Apple”. FIG. 7 illustrates with arrows 2 how the relationships between the objects are maintained.
To retrieve the data from the database in the form that is closest to the form of the invoice, the database needs to be queried using an SQL “select” statement as illustrated in FIG. 8. Looking at the SQL Select statement in FIG. 8, we can see that the first part of the statement defines the output fields (Client_name, Client_address, Order_ID etc) which will appear in the stream of output records shown in the table of FIG. 9. The next part of the Select statement defines the manner in which the tables are joined to produce the output data.
The SQL engine in the database server executes the statement and returns a stream of records made up of combinations of correlated rows taken from the database tables, as shown in FIG. 9.
As FIG. 9 shows, each record in the stream is a result of matching or chaining four correlated rows—one row from each of the four tables.
For example, the first record (the first row) in the stream of FIG. 9 is generated by the following sequence of matches:
1. matching the row “Apple” from table CLIENT with row “1” from table CLIENT_ORDER (as CLIENT.CLIENT_NAME=CLIENT_ORDER.ORDER_CLIENT),
2. matching row “1” from table CLIENT_ORDER with row “1”-“mouse” from table ORDER_ITEM (as ORDER_ITEM.ITEM_ORDER=CLIENT_ORDER.ORDER_ID)
3. “1”-“mouse” from table ORDER_ITEM with row “mouse” from table PRODUCT (as ORDER_ITEM.ITEM_PRODUCT=PRODUCT.PRODUCT_NAME)
The second record in the stream is generated by the following sequence of matches:
1. matching row “Apple” from table CLIENT with row “1” from table CLIENT_ORDER—as previously (as CLIENT.CLIENT_NAME=CLIENT_ORDER.ORDER_CLIENT),
2. matching row “1” from table CLIENT_ORDER with row “1”-“disk” from table ORDER_ITEM (as ORDER_ITEM.ITEM_ORDER=CLIENT_ORDER.ORDER_ID),
3. “1”-“disk” from table ORDER_ITEM with row “disk” from table PRODUCT (as ORDER_ITEM.ITEM_PRODUCT=PRODUCT.PRODUCT_NAME).
An important consequence of the fact that the “select” statement delivers data in the form presented above is the occurrence of redundancies in the delivered data. In FIG. 9 the gray fields contain redundant information: the “Apple”-“Cupertino”-“1”-“08.04.2010” foursome is present twice in the stream, simply because the order of ID=1 contains two items. If there were three items in the order, the “Apple”-“Cupertino”-“1”-“08.04.2010” would be tripled. This is how the execution of the SQL select statement works and that cannot be changed.
SQL statements always return flat structures and this is the reason that the redundancy occurs. It is necessary to replicate data when the original data is hierarchical. This can be seen in FIG. 9 which shows the flat structure while the invoices of FIGS. 5 and 6 are presented in a hierarchical structure.
Retrieving data from a database takes up memory in a computer which in turn increases the amount of processing needed; therefore, redundancies need to be avoided to optimise processing speed. The example given above is small and more realistic examples would comprise many entries, such that the problem of processing speed becomes more acute. It is not uncommon for computer programs to run out of memory when retrieving data from large databases.
The example of an SQL database is given, but the problems may occur when retrieving data from other storage systems such as operating system files.
There is a need to transform the flat and redundant data structure into a hierarchical and non-redundant data structure.