The present invention relates to data storage, and in particular, but not exclusively, to methods and apparatus for encoding or formatting data and for storing the data to, for example, a magnetic medium such as tape.
Taking data storage to tape as an example, a host computer system typically writes data to a storage apparatus, such as a tape drive, on a per Record basis. Further, the host computer may separate the Records themselves using Record separators such as FILE MARKs or SET MARKs. Record length, and the order in which the Records and the Record separators are received, are determined by the host computer.
Typically, Records comprise user data, for example, the data which makes up word-processor documents, computer graphics pictures or databases. In contrast, Record separators, such as FILE MARKs, are used by a host computer to indicate, for example, the end of one word-processor document and the beginning of the next. In other words, Record separators typically separate groups of related Records.
By way of example, the diagram in FIG. 1(a) illustrates a logical sequence of user data and separators that an existing type of host computer might write to a tape storage apparatus. Specifically, the host computer supplies five fixed-length Records, R1 to R5, in addition to three FILE MARKs, which occur after R1, R2 and R5.
It is known for a storage apparatus such as a tape drive to receive host computer data, arrange the data Records into fixed-sized groups independently of the Record structure, and represent the Record structure, in terms of Record and FILE MARK position, in an index forming part of each group. Such a scheme forms the basis of the DDS (Digital Data Storage) data format standard for tape drives defined in ISO/IEC Standard 10777:1991 E. EP 0 24 542 describes one example of a DDS tape drive which implements this scheme. Once the groups of data are formed, the tape drive stores the groups to tape, typically after applying some form of error detection/correction coding.
The diagram in FIG. 1(b) illustrates the organisation into DDS groups of the host computer data shown in FIG. 1(a). Typically, the host computer data Records are encoded or compressed to form a continuous encoded data stream in each group. FILE MARKs are intercepted by the tape drive, and information that describes the occurrence and position of the FILE MARKs in the encoded data stream is generated by the tape drive and stored in the index of the respective group. In the present example, Records R1, R2 and a part of Record R3 are compressed into an encoded data stream and are stored in the first group, and information specifying the existence and position in the encoded data stream of the records and the first and second FILE MARKs is stored in the index of the first group. Then, the remainder of Record R3, and Records R4 and R5, are compressed into a continuous encoded data stream and are stored in the second group, and information specifying the existence and position in the encoded data stream of the records and the third FILE MARK is stored in the index of the second group.
In such a scheme, a tape drive reading the stored data relies on information in the index to reconstruct the original host computer data for return to a host computer.
FIG. 2 illustrates very generally the form of the indexes for both groups shown in FIG. 1(b). As shown, each index comprises two main data structures, namely a block access table (BAT) and a group information table (GIT). The number of entries in the BAT is stored in a BAT entry field in the GIT. The GIT also contains various counts, such as a FILE MARK count (FMC), which is the number of FILE MARKs written since the beginning of Recording (BOR) mark, including any contained in the current group, and a Record count (RC), which is the number of Records written since the BOR mark, including any contained in the current group. The values for the entries in this simple example are shown in parentheses. The GIT may contain other information such as the respective numbers of FILE MARKs and Records which occur in the current group only.
The BAT describes, by way of a series of entries, the contents of a group and, in particular, the logical segmentation of the Record data held in the group (that is, it holds entries describing the length of each Record and the position of each separator mark in the group). The access entries in the BAT follow in the order of the contents of the group, and the BAT itself grows from the end of the group inwardly to meet the encoded data stream of the Record data.
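The group-and-index arrangement described above can be illustrated with a minimal Python sketch. It is not the DDS on-tape layout: the names `Group` and `build_groups`, the BAT entry labels, the group capacity and the 4-byte Record length are all hypothetical, compression is omitted, and a Record that spans two groups is assumed to be counted in the group where it ends.

```python
from dataclasses import dataclass, field

@dataclass
class Group:
    data: bytearray = field(default_factory=bytearray)  # Record data held in the group
    bat: list = field(default_factory=list)             # entries in order of group contents
    file_mark_count: int = 0   # FMC: FILE MARKs written so far, including this group
    record_count: int = 0      # RC: Records completed so far, including this group

def build_groups(items, capacity):
    """items: sequence of ("record", bytes) or ("filemark", None) tuples."""
    groups = [Group()]
    fmc = rc = 0
    for kind, payload in items:
        if kind == "filemark":
            # A FILE MARK occupies no space in the data; it is only indexed.
            fmc += 1
            groups[-1].bat.append(("file_mark", 0))
            groups[-1].file_mark_count = fmc
            continue
        offset = 0
        while offset < len(payload):
            g = groups[-1]
            if len(g.data) == capacity:
                # Current group is full: start a new one, carrying the running counts.
                g = Group(file_mark_count=fmc, record_count=rc)
                groups.append(g)
            chunk = payload[offset:offset + capacity - len(g.data)]
            g.data += chunk
            offset += len(chunk)
            if offset == len(payload):
                rc += 1
                g.record_count = rc
                g.bat.append(("record_end", len(chunk)))
            else:
                g.bat.append(("record_part", len(chunk)))
    return groups

# The sequence of FIG. 1(a): R1, FM, R2, FM, R3, R4, R5, FM (4-byte Records assumed).
items = [("record", b"AAAA"), ("filemark", None), ("record", b"BBBB"),
         ("filemark", None), ("record", b"CCCC"), ("record", b"DDDD"),
         ("record", b"EEEE"), ("filemark", None)]
groups = build_groups(items, capacity=10)
```

With these assumed sizes, the first group holds R1, R2 and the first half of R3 together with two FILE MARK entries, and the second group holds the rest of R3, R4, R5 and the third FILE MARK entry, mirroring the split described for FIG. 1(b).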
The applicant's co-pending patent application "Data Encoding Method and Apparatus", filed on the same date as the present application, describes an invention wherein the requirement for a BAT is removed by embedding special, reserved codewords representing Record boundaries and Record separators, such as FILE MARKs, into the encoded data stream. Therein, Record boundaries and FILE MARKs can be located by the respective embedded codewords.
Another of the applicant's co-pending patent applications, "Data Encoding Scheme With Switchable Compression" (EP application number 97308778.6), describes an invention, which may be used in addition to the invention of the above-mentioned co-pending application, in which both compressed data and non-compressed data can be encoded into the same continuous, encoded data stream. For that invention, preferably the non-compressed data is simply passed through the encoder and is stored in unencoded form.
In order to implement the inventions of the two co-pending applications at the same time, there is a need to encode reserved codewords into an encoded data stream even when data compression is not being applied to the input data.
In addressing the problem of combining the inventions of the two aforementioned co-pending patent applications, the applicants have arrived at a particularly advantageous solution.
In accordance with a first aspect, the present invention provides a method of formatting host data, including the step of:
encoding members of a pre-defined group of data with m-bit codewords and encoding other data with codewords in excess of m-bits long to produce an encoded data stream, wherein all other data are encoded with codewords which have a common m-bit root sequence, and wherein the common m-bit root sequence is not itself representative of any of the members of the pre-defined group of data.
In accordance with the invention, there can be at least 2^m codewords, 2^m − 1 of which are free to represent members of the group. In other words, the m-bit root sequence is not free to represent a member of the group. In a practical embodiment, the m-bit root sequence is detected during data decoding as being the start of a reserved codeword that has a length greater than m bits. The number of bits following the root sequence for reserved codewords determines how many reserved codewords there can be in the format.
In accordance with one embodiment, there are 2^m members in the first group of data and the reserved m-bit root sequence forms part of a longer, p-bit codeword, which is reserved to represent the remaining one of the 2^m possible members.
The advantage here is that there are 2^m codewords available to represent members of the group. For example, where m=8, 2^8 − 1 (i.e. 255) of the codewords are 8 bits long and the 2^8th (i.e. 256th) codeword is, for example, 9 bits long. The state of the 9th bit determines whether the 9-bit codeword is the 256th character, or whether the 9-bit codeword is a further root sequence for other reserved codewords, which can be 10 or more bits long.
In the preferred embodiment to be described, the length of reserved codewords (including the root sequence), n, is 13. Then, the state of the 9th bit, which immediately follows the root sequence, either indicates that the 9 bits represent the 2^m-th member of the group, or that the next 4 bits (i.e. 13 bits in total) represent one of 16 possible reserved codewords.
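The codeword structure just described, with m=8 and n=13, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the format of the preferred embodiment: the root value 0xFF, the polarity of the 9th bit (0 for the 256th group member, 1 for a reserved codeword), the `("reserved", k)` notation and the use of a '0'/'1' string as the bit stream are all hypothetical choices.

```python
M = 8               # length of ordinary codewords, in bits
ROOT = 0xFF         # assumed m-bit root sequence (hypothetical choice of value)
RESERVED_BITS = 4   # bits after the 9th bit, so reserved codewords are 13 bits long

def encode(items):
    """Encode ints 0..255 (group members) and ("reserved", k) items, k in 0..15.

    Returns the encoded stream as a string of '0'/'1' characters."""
    out = []
    for item in items:
        if isinstance(item, tuple):
            _, k = item
            if not 0 <= k < 2 ** RESERVED_BITS:
                raise ValueError("reserved codeword index out of range")
            out.append(f"{ROOT:08b}" + "1" + f"{k:04b}")   # 13-bit reserved codeword
        elif item == ROOT:
            out.append(f"{ROOT:08b}" + "0")                # 9-bit codeword for the 256th member
        else:
            out.append(f"{item:08b}")                      # 8-bit codeword, data passed through
    return "".join(out)

def decode(bits):
    """Inverse of encode(): the root sequence signals that a longer codeword follows."""
    out = []
    pos = 0
    while pos < len(bits):
        value = int(bits[pos:pos + M], 2)
        pos += M
        if value != ROOT:
            out.append(value)
        elif bits[pos] == "0":
            out.append(ROOT)
            pos += 1
        else:
            out.append(("reserved", int(bits[pos + 1:pos + 1 + RESERVED_BITS], 2)))
            pos += 1 + RESERVED_BITS
    return out

# Two ordinary bytes, one reserved codeword (e.g. a hypothetical FILE MARK code 3),
# and the byte that collides with the root sequence.
stream = [0x41, ("reserved", 3), 0xFF, 0x42]
encoded = encode(stream)
```

Here `encoded` is 8 + 13 + 9 + 8 = 38 bits long and `decode(encoded)` returns the original sequence. Only the byte equal to the root value costs an extra bit, so the other 255 byte values pass through unchanged.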
Embodiments of the invention are particularly advantageous for encoding host data to which data compression is not applied. For example, if the host data is ASCII-based, the group members could be the ASCII character set and all but one of the codewords could be simple copies of the ASCII character set. Indeed, all but one of the codewords could be formed by simply passing the ASCII character data through the encoder without further operation by the encoder. The one remaining codeword could then be used as the root of codewords representative of other data, such as Record boundaries or FILE MARKs, or data format control data. In this example, the ASCII character which would otherwise be represented by the root codeword needs to be represented by a longer codeword.
Other aspects of the present invention, particularly apparatus to enact the method, which are claimed herein, will become apparent from the following description. Further, some aspects of the invention relate to methods and apparatus for reading and/or decoding data which has been formatted or encoded as described herein.