Recent increases in regulation, for example by bodies such as the Securities and Exchange Commission (SEC) and the National Association of Securities Dealers (NASD), by their counterparts in other countries, and through legislation such as the Sarbanes-Oxley Act in the United States, have significantly increased the requirement to retain copies of electronic communications, such as electronic mail and instant messages, and of other document types.
Coupled with the significant growth in the use of electronic communications and documents, this has caused the storage and management costs of retaining such data to escalate enormously, particularly for large organizations with many thousands of employees.
Recent studies show that a typical corporate employee currently handles around 7 megabytes of electronic mail per day. This number is forecast to increase to around 14 megabytes by 2007. Assuming an average 5-year retention requirement, this means that the storage needs of a 25,000-user organization could grow to over 300 terabytes for electronic mail alone.
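The arithmetic behind this estimate can be checked with a minimal sketch. The text does not say how many days per year are counted, which daily volume is used, or whether decimal or binary units are meant, so the 365-day year, the decimal conversion, and the function name `archive_size_tb` are all assumptions made here for illustration.

```python
def archive_size_tb(users, mb_per_user_per_day, retention_years, days_per_year=365):
    """Estimate retained e-mail volume in terabytes.

    Assumptions (not stated in the text): mail arrives every day of the
    year, and 1 terabyte = 1,000,000 megabytes (decimal units).
    """
    total_mb = users * mb_per_user_per_day * days_per_year * retention_years
    return total_mb / 1_000_000


# 25,000 users, 5-year retention, at the current and the forecast daily volume.
current_tb = archive_size_tb(25_000, 7, 5)    # 319.375
forecast_tb = archive_size_tb(25_000, 14, 5)  # 638.75
print(f"at 7 MB/day: {current_tb:.1f} TB, at 14 MB/day: {forecast_tb:.1f} TB")
```

Under these assumptions the current volume alone already exceeds 300 terabytes, and the forecast volume roughly doubles it. Counting only working days would lower both figures.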
In the past, many of the long-term document retention needs of an organization have been satisfied using technologies such as magnetic tape and, in some cases, write-once optical devices. For many organizations, however, these are no longer viable options, because of an increasing need to retrieve individual documents rapidly and to search the content of all stored documents in a matter of seconds. Slow access times make this impossible in the case of magnetic tape, and impractical in the case of optical devices, which also suffer from relatively low capacity.
Currently, the only viable option is high capacity on-line data storage, which is produced by a relatively small number of large scale storage manufacturers and costs in the region of US$100,000 per terabyte. This is in sharp contrast to the ‘commodity’ end of the market, where disk drive prices have fallen sharply while capacities have increased. A 30 gigabyte capacity hard disk can be purchased for as little as US$50, equivalent to around US$1,700 per terabyte. One must be cautious when making comparisons of this type, because large scale systems include infrastructure such as power supplies and controllers, and provide manageability and resilience, thus lowering the effective cost of ownership. Nevertheless, such comparisons do illustrate that there is a wide pricing gap between the two.
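The per-terabyte comparison can be reproduced with a short calculation. This is a sketch only: it assumes decimal units (1 terabyte = 1,000 gigabytes), the helper name `cost_per_tb` is hypothetical, and, as noted above, it compares raw drive prices with complete systems.

```python
def cost_per_tb(price_usd, capacity_gb):
    """Price per terabyte, assuming 1 TB = 1,000 GB (decimal units)."""
    return price_usd / capacity_gb * 1_000


commodity = cost_per_tb(50, 30)   # a US$50, 30 GB commodity drive
enterprise = 100_000              # quoted price per terabyte of large scale storage

print(f"commodity disk: about US${commodity:,.0f} per TB")
print(f"pricing gap: roughly {enterprise / commodity:.0f}x")
```

The commodity figure works out to about US$1,667 per terabyte, which the text rounds to US$1,700, giving a gap of roughly sixty times before infrastructure and management are taken into account.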
The industry has also generally been slow to recognize that the requirements for long-term on-line archival storage are fundamentally different from those of ‘normal’ on-line storage applications. In particular, in a traditional on-line storage environment, data ‘read’ activity generally exceeds ‘write’ activity by several, if not many, times. Consequently, on-line storage is usually optimized for reading, by the use of techniques such as caching.
In contrast, archiving, particularly of transient data such as email and instant messages, demands very high write performance in order to rapidly process the constant stream of incoming data. Otherwise, the messages must either be delayed or a temporary copy of them must be made, placing additional demands on the messaging system.
Storage technology has evolved considerably in recent years, with three new storage technologies emerging.
Network Attached Storage (NAS) couples one or more hard disk drives with a processor, memory and network connection to provide an inexpensive server dedicated to data storage, which can be accessed by any other networked system.
NAS has become a commodity item, available from a wide range of manufacturers at a cost below US$10,000 per terabyte. It has a relatively low capacity, however: currently a maximum of 1 terabyte in a single physical device.
Storage Area Networking (SAN) adopts a different approach, providing a pool of physical storage devices connected by a dedicated high speed network. Application and file servers connect to this dedicated network and share the available storage pool, which can be easily expanded without needing to take servers off-line.
SAN is available from a number of manufacturers, but requires purpose-designed high speed networking components, which increase its cost considerably over NAS. It supports a large pool of storage devices, but typically does not share logical storage volumes between servers.
Content Addressable Storage (CAS) is specifically designed for archiving static documents, i.e. those which do not change over their storage lifetime. An application passes a copy of the document it wants stored and is returned a unique ‘token’, which it subsequently uses to retrieve the document. CAS provides the advantage that multiple copies of the same document are automatically identified and only a single copy is stored.
Since CAS manages its own operating system, its capacity is practically unlimited, but the requirement to calculate the unique token for every file written creates an overhead that limits write performance. In terms of capital cost, however, CAS is currently at the top of the scale.
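The token-based store and retrieve cycle described for CAS can be illustrated with a minimal in-memory sketch. The text does not specify how the token is calculated, so deriving it from a SHA-256 digest of the content is an assumption made here, as are the class name `ContentStore` and its `put` and `get` methods; this is not any particular product's interface.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (illustrative only)."""

    def __init__(self):
        self._blobs = {}  # token -> document bytes

    def put(self, document: bytes) -> str:
        # Assumption: the token is the SHA-256 digest of the content.
        # Hashing every document on write is the per-write overhead.
        token = hashlib.sha256(document).hexdigest()
        # Identical content yields the same token, so a repeat is not stored again.
        self._blobs.setdefault(token, document)
        return token

    def get(self, token: str) -> bytes:
        return self._blobs[token]

    def __len__(self):
        return len(self._blobs)


store = ContentStore()
first = store.put(b"quarterly report")
second = store.put(b"quarterly report")  # duplicate copy of the same document
print(first == second, len(store))        # same token, one stored copy
print(store.get(first))
```

Because the token depends only on the content, duplicate detection comes for free, but every write pays for a full pass over the document to compute the digest, which is consistent with the write-performance limitation noted above.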