A composite document contains not only text but also graphics, spreadsheet data, sound, video and other information. For example, files such as the message history files and emoticon files of instant messaging (IM) clients can be stored as composite documents. The longer an IM tool is in use, the larger the corresponding composite documents become.
FIG. 1 is a schematic diagram of the logical structure of the storages and streams of a composite document. The logical structure of a composite document is very similar to that of a file system: each document includes a root storage, and each storage can contain zero or more storages or streams. Each storage and stream has a name, which usually consists of 16-bit Unicode characters and has a maximum length of 31 characters. Storages or streams in the same storage must have different names, whereas storages or streams in different storages can have the same name.
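The naming rules above can be illustrated with a minimal sketch. The `Storage` class and its `add` method are hypothetical names introduced only for illustration; they model the logical tree, not the on-disk layout, and the name length is counted in 16-bit code units as an assumption about how the 31-character limit is measured.

```python
class Storage:
    """Hypothetical in-memory model of a storage: a named container
    holding child storages and streams."""

    MAX_NAME_LEN = 31  # maximum name length given in the text

    def __init__(self, name):
        self.name = name
        self.children = {}  # child name -> Storage, or bytes for a stream

    def add(self, name, item):
        # Assumption: the limit is counted in 16-bit (UTF-16) code units.
        units = len(name.encode("utf-16-le")) // 2
        if units > self.MAX_NAME_LEN:
            raise ValueError("name longer than 31 characters")
        # Names must be unique only within the same storage.
        if name in self.children:
            raise ValueError("duplicate name in the same storage")
        self.children[name] = item
        return item


root = Storage("root")
a = root.add("A", Storage("A"))
b = root.add("B", Storage("B"))
a.add("data", b"stream in A")
b.add("data", b"stream in B")  # same name, different storage: allowed
```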
Except for the header, all data of the composite document is organized in the form of streams. Every stream of the composite document is divided into smaller data blocks called sectors. A sector can contain control data or user data. The whole composite document consists of a header followed by a series of sectors. All sectors have the same size, which is set in the header.
The sectors are numbered in the order in which they appear in the document. The index of a sector (starting from 0) is called its sector identifier (SID), which is a 32-bit signed integer. A SID that is not smaller than 0 must point to an existing sector; a negative SID carries a special meaning.
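Because the header is followed by equally sized sectors, the position of a sector follows directly from its SID. The sketch below assumes a 512-byte header and little-endian byte order; neither value is stated in the text, and the function names are hypothetical.

```python
import struct

HEADER_SIZE = 512  # assumption: header length in bytes, not given in the text


def read_sid(raw, pos=0):
    """Decode one SID as a 32-bit signed integer (little-endian assumed)."""
    return struct.unpack_from("<i", raw, pos)[0]


def sector_offset(sid, sector_size, header_size=HEADER_SIZE):
    """Return the byte offset of the sector with the given SID."""
    if sid < 0:
        raise ValueError("negative SIDs are special markers, not sectors")
    return header_size + sid * sector_size
```

With 512-byte sectors, `sector_offset(3, 512)` evaluates to 2048 under these assumptions.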
The linked list formed by all sectors of a stream is called a sector chain. Sectors that are adjacent in the sector chain are not necessarily physically adjacent in the document. To indicate the relative position of each sector in the sector chain, the concept of a sector identifier chain is introduced. The sector identifier chain is an array of sector identifiers: it records the SIDs of the sectors of the stream in order, starting with the SID of the first sector of the stream and ending with the end-of-chain marker (−2).
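As a minimal sketch of how a sector identifier chain is used, the function below reassembles a stream from a list of sectors. The in-memory `sectors` list and the name `read_stream` are assumptions made for illustration.

```python
END_OF_CHAIN = -2  # chain terminator given in the text


def read_stream(sectors, sid_chain):
    """Concatenate the sectors named by sid_chain, stopping at the terminator."""
    data = bytearray()
    for sid in sid_chain:
        if sid == END_OF_CHAIN:
            break
        data += sectors[sid]
    return bytes(data)


# Sectors 2, 0 and 3 form one stream although they are not physically adjacent.
sectors = [b"AA", b"BB", b"CC", b"DD"]
stream = read_stream(sectors, [2, 0, 3, END_OF_CHAIN])
```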
The streams of the composite document can be divided by purpose into inner controlling streams and user data streams. The inner controlling streams include a directory stream, a master sector allocation table (MSAT), a sector allocation table (SAT), a short sector allocation table (SSAT) and a short stream container stream.
The master sector allocation table is a SID array that lists, in order, the SIDs of the sectors used to store the sector allocation table. The size of the MSAT is equal to the number of sectors used to store the SAT, and this size is stored in the header.
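The relationship between the MSAT and the SAT can be sketched as follows: the SAT is obtained by reading the sectors listed in the MSAT, in order, and decoding each as an array of SIDs. Little-endian 32-bit encoding and the name `load_sat` are assumptions; the tiny 8-byte sectors are used only to keep the example short.

```python
import struct


def load_sat(sectors, msat):
    """Build the SAT by decoding the sectors that the MSAT lists, in order."""
    sat = []
    for sid in msat:
        raw = sectors[sid]
        count = len(raw) // 4  # each SID occupies 4 bytes
        sat.extend(struct.unpack("<%di" % count, raw))
    return sat


# Four 8-byte sectors: sectors 0 and 2 hold the SAT, sectors 1 and 3 hold data.
sectors = [
    struct.pack("<2i", -3, 3),   # SAT entries for sectors 0 and 1
    b"datadata",
    struct.pack("<2i", -3, -2),  # SAT entries for sectors 2 and 3
    b"moredata",
]
sat = load_sat(sectors, msat=[0, 2])
```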
The SAT is a sector identifier array that holds the sector identifier chains of all the user data streams and the inner controlling streams. The size of the SAT is equal to the number of sectors existing in the whole composite document. The index of an element of the SAT array is the SID of the sector that the element represents, and the value of the element is the SID of the next sector in the sector chain to which that sector belongs. An element can also hold one of four special values. If the element value is the Free SID (−1), the sector corresponding to the element index is not used by any stream; the Free SID can appear at any position in the SAT. If the element value is the End of Chain SID (−2), the sector is the last sector of a stream. If the element value is the SAT SID (−3), the sector is used to store the SAT. If the element value is the MSAT SID (−4), the sector is used to store the MSAT.
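A minimal sketch of how the sector identifier chain of a stream can be recovered from the SAT, given the SID of the first sector. The function name and the validation steps are assumptions added for robustness rather than something the text prescribes.

```python
FREE = -1          # sector not used by any stream
END_OF_CHAIN = -2  # last sector of a stream
SAT_SECTOR = -3    # sector stores part of the SAT
MSAT_SECTOR = -4   # sector stores part of the MSAT


def sid_chain(sat, first_sid):
    """Follow the next-sector links in the SAT starting at first_sid."""
    chain = []
    seen = set()
    sid = first_sid
    while sid != END_OF_CHAIN:
        if not 0 <= sid < len(sat):
            raise ValueError("chain leaves the document: %d" % sid)
        if sid in seen:
            raise ValueError("chain loops back to sector %d" % sid)
        seen.add(sid)
        chain.append(sid)
        sid = sat[sid]  # the element value is the next sector of the chain
    chain.append(END_OF_CHAIN)
    return chain


# Sector 0 stores the SAT, sectors 2 and 5 are free, and one stream
# occupies sectors 1 -> 3 -> 4.
sat = [SAT_SECTOR, 3, FREE, 4, END_OF_CHAIN, FREE]
chain = sid_chain(sat, 1)
```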
The short stream container stream is stored in the same way as an ordinary user stream whose length is not smaller than that of a standard stream. The SID of the first sector in the sector chain of the short stream container stream is stored in the directory entry of the root storage. The sector identifier chain of the short stream container stream can be obtained from the SAT.
The SSAT is another SID array, and it contains the sector identifier chains of all the short streams. As an inner controlling stream, the SSAT is built in the same way as ordinary streams. The SID of the first sector of the SSAT is stored in the header. As a sector allocation table, the SSAT functions very much like the SAT; the only difference is that the sector identifiers in the SSAT point to short sectors rather than ordinary sectors.
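The interplay between the SSAT and the short stream container stream can be sketched as follows. The sketch assumes that the container stream has already been read into a byte string, that short sectors are laid out back to back inside it, and that the stream length is known from elsewhere; the 64-byte default short sector size and the function name are likewise assumptions.

```python
END_OF_CHAIN = -2


def read_short_stream(container, ssat, first_sid, size, short_sector_size=64):
    """Read a short stream by following the SSAT through the container stream."""
    data = bytearray()
    sid = first_sid
    steps = 0
    while sid != END_OF_CHAIN:
        if not 0 <= sid < len(ssat) or steps >= len(ssat):
            raise ValueError("broken short-sector chain")
        start = sid * short_sector_size  # short sectors are indexed inside the container
        data += container[start:start + short_sector_size]
        sid = ssat[sid]
        steps += 1
    return bytes(data[:size])  # drop padding in the last short sector


# Three 4-byte short sectors; the stream uses short sectors 0 -> 2.
container = b"AAAABBBBCCCC"
ssat = [2, -1, END_OF_CHAIN]
short = read_short_stream(container, ssat, first_sid=0, size=6, short_sector_size=4)
```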
The directory stream is an inner controlling stream that contains an array of directory entries. Each directory entry points to a storage or stream in the composite document. The index of a directory entry in the directory stream, starting from 0, is called its directory entry identifier (DID).
The aforementioned composite document does not control the allocation of sectors, which results in a large number of fragments; I/O operations keep jumping across the entire composite document, which seriously degrades performance. The MSAT, SAT, SSAT and directory entries are distributed throughout the entire composite document, which seriously affects the performance of operations such as opening, traversing, reading and writing the composite document. For streams and short streams, an allocation unit that is too small, together with uncontrolled sector allocation, also results in a large number of fragments.
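One way to make the fragmentation problem concrete is to count how many physically contiguous runs a sector identifier chain breaks into. This metric and the name `count_fragments` are illustrative assumptions, not something defined in the text.

```python
def count_fragments(sid_chain):
    """Count the runs of physically consecutive sectors in a chain."""
    sids = [s for s in sid_chain if s >= 0]  # ignore the terminator
    if not sids:
        return 0
    # Every step that does not go to the next physical sector starts a new run.
    jumps = sum(1 for a, b in zip(sids, sids[1:]) if b != a + 1)
    return 1 + jumps


contiguous = count_fragments([0, 1, 2, -2])
scattered = count_fragments([5, 2, 3, 9, -2])
```

A chain stored in one contiguous run needs a single seek, whereas each additional fragment forces the I/O position to jump elsewhere in the document.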