1. Field of Invention
The present invention relates to encoding and decoding digital content for reliable distribution and storage within a networked cluster of storage systems and more particularly to a system that evenly distributes the resource requirements across the cluster of storage systems.
2. Description of Related Art
Network Protocols:
Communication between systems across the Internet is generally accomplished through the Internet Protocol (IP). This transmission protocol supports two higher-level protocols: the Transmission Control Protocol (TCP/IP), which is a streaming point-to-point protocol, and the User Datagram Protocol (UDP/IP), which is a connectionless protocol.
TCP/IP has been compared to a telephone conversation where the two parties are connected via a dedicated circuit, with the correctness of the data transmitted being guaranteed by the protocol. In TCP/IP, data is transmitted and received as a stream and, while the sequence of bytes is preserved, it is not guaranteed to arrive all at once, as there are no protocol-defined packet boundaries. TCP/IP requires one dedicated socket at each end for the duration of the connection. Thus, data to be transmitted to multiple recipients requires multiple socket connections at the source. This can be a limitation as most operating systems have a finite pool of sockets. Once the pool is exhausted, no new connections can be made until some connections are terminated and their sockets are released back into the pool. Further, data to be transmitted to multiple recipients requires retransmission for each additional recipient, thereby using more network bandwidth.
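Because a TCP/IP stream carries no message boundaries, an application that exchanges discrete messages must add its own framing. The following is a minimal sketch of one common approach, length-prefix framing. The 4-byte header and the names `frame` and `FrameDecoder` are assumptions made for illustration and do not come from the protocol or from this disclosure.

```python
import struct


class FrameDecoder:
    """Reassembles length-prefixed messages from an arbitrary byte stream."""

    HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian length

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, chunk):
        """Add received bytes; return the messages completed by this chunk."""
        self._buffer.extend(chunk)
        messages = []
        while len(self._buffer) >= self.HEADER.size:
            (length,) = self.HEADER.unpack_from(self._buffer)
            end = self.HEADER.size + length
            if len(self._buffer) < end:
                break  # the message body has not fully arrived yet
            messages.append(bytes(self._buffer[self.HEADER.size:end]))
            del self._buffer[:end]
        return messages


def frame(message):
    """Prefix a message with its length so the receiver can find its end."""
    return FrameDecoder.HEADER.pack(len(message)) + message
```

The decoder yields the same messages regardless of how the stream is split into chunks on arrival, which is the property the stream itself does not provide.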
UDP/IP is a packet-oriented protocol that has been compared to sending letters via the post office with the correctness of the data being the responsibility of the application and not the UDP/IP protocol. There is very little management of UDP/IP packets, so they can arrive in the wrong order, they can be duplicated, or not arrive at all. Packet loss in UDP/IP could be due to network congestion, operating system socket buffer overflow, etc. In UDP/IP individual packets arrive complete as the protocol does define packet boundaries. UDP/IP does not require a dedicated socket per connection, as the protocol does not manage the state of the transmission. Instead, one socket can be used to send packets to any number of hosts with each datagram specifying a different network address. UDP/IP is generally faster than TCP/IP but it lays upon the application the responsibility for error detection and recovery, as there is no inherent acknowledge and retransmit capability.
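Since UDP/IP leaves ordering, duplication and loss to the application, a receiver typically tags each datagram with a sequence number. The sketch below illustrates that bookkeeping only. The `(sequence_number, payload)` layout and the function name `reassemble` are hypothetical and are not part of UDP/IP.

```python
def reassemble(datagrams, expected_count):
    """Order payloads by sequence number, drop duplicates, report gaps.

    Each datagram is a (sequence_number, payload) pair. This layout is an
    assumption of the sketch; UDP itself carries no sequence numbers.
    """
    received = {}
    for seq, payload in datagrams:
        received.setdefault(seq, payload)  # keep first copy, ignore duplicates
    missing = [seq for seq in range(expected_count) if seq not in received]
    ordered = [received[seq] for seq in sorted(received) if seq < expected_count]
    return ordered, missing
```

The `missing` list is what an application would use to decide whether to request retransmission or to repair the gap by other means.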
UDP/IP defines three ways of transmitting packets: unicast, multicast and broadcast. Unicast transmits datagrams (packets) for delivery to a single recipient. Multicast transmits datagrams for delivery to a well-defined group of recipients. Broadcast transmits datagrams for delivery to every potential recipient on the network. The usage of broadcast is limited due to the heavy load it places on the network.
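For IPv4, the three delivery modes can be told apart from the destination address alone. The following sketch uses Python's standard `ipaddress` module. The function name `delivery_mode` and the example subnet are assumptions chosen for illustration.

```python
import ipaddress


def delivery_mode(destination, network):
    """Classify an IPv4 destination as unicast, multicast or broadcast."""
    address = ipaddress.IPv4Address(destination)
    subnet = ipaddress.IPv4Network(network)
    if address.is_multicast:  # addresses in 224.0.0.0/4
        return "multicast"
    limited_broadcast = ipaddress.IPv4Address("255.255.255.255")
    if address == limited_broadcast or address == subnet.broadcast_address:
        return "broadcast"
    return "unicast"


# Example: a host on the (assumed) subnet 192.168.1.0/24.
modes = [delivery_mode(a, "192.168.1.0/24")
         for a in ("192.168.1.20", "239.1.2.3", "192.168.1.255")]
```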
Transmission Errors:
Any time data is transferred across a medium from its source to its destination, there is the possibility that errors will be introduced, causing packet loss. The errors can be introduced at many steps during the transmission. Some errors are due to physical conditions (such as weather, interference, etc.) affecting the transmission medium (such as satellite, antenna, etc.). Other errors are due to software/hardware conditions (such as network congestion, buffer overflow, CPU busy, etc.) affecting the source and destination systems' ability to send/receive data packets.
Error Detection and Correction:
Transmission failures fall into two categories: “errors” occur when the data received is corrupted and “erasures” occur when the data is not received. The TCP/IP and UDP/IP protocols ensure that the destination system will not receive corrupted data. However, erasures can occur when packets are entirely missed, such as when they are not received within an application-defined period of time. This can easily occur in UDP/IP due to network congestion, and it can happen in both UDP/IP and TCP/IP when the source system dies. There are two methods for correcting the errors: Backward Error Correction (BEC) and Forward Error Correction (FEC). In BEC, the destination system detects that an error has occurred in a packet (e.g., through a single checksum, etc.) and requests that the source system retransmit that packet. The implementation is relatively simple, but the performance is poor as the same packet could be re-transmitted many times due to errors. Additionally, the protocol overhead is great, since the destination must request a re-transmission upon error detection and otherwise send an acknowledgement for each packet. Standard FEC coding improves the reliability of transmission by introducing checksum symbols into the data prior to transmission. These checksum symbols enable the receiving system to detect and correct transmission errors, when the error count is within the encoding parameters, without requesting the retransmission of the original data.
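The detect-and-retransmit cycle of BEC can be sketched as follows. CRC-32 stands in for the checksum, and the `channel` callable models an unreliable link. These choices, and all function names, are assumptions made for illustration rather than a description of any particular protocol.

```python
import zlib


def make_packet(payload):
    """Append a CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_packet(packet):
    """Return the payload if the checksum matches, else None."""
    payload, checksum = packet[:-4], packet[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == checksum:
        return payload
    return None


def send_with_retransmit(payload, channel, max_attempts=5):
    """Resend until the receiver accepts; return (payload, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        received = check_packet(channel(make_packet(payload)))
        if received is not None:  # the receiver would acknowledge here
            return received, attempt
        # otherwise the receiver would request a retransmission
    raise RuntimeError("packet not delivered")
```

Every corrupted attempt costs a full round trip plus a second copy of the packet, which is the overhead the passage attributes to BEC.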
Forward Error Correction (FEC):
One of the criteria by which a FEC coding method is gauged is the number of failures/erasures that it can tolerate. There exist many FEC codes whose implementations are of varying complexity depending upon the flexibility and performance required. High performance parity-based coding methods (e.g., Hamming, etc.) usually compute the checksum symbols using the bitwise exclusive-or (XOR) of the data. These are inadequate, as they can tolerate no more than two errors at a time in some error combinations. A system is needed that can tolerate more than two systems failing simultaneously within a cluster. Coding methods with that capability (e.g., generic Reed-Solomon, etc.) often have poor performance when used to encode/decode large data sets, which makes them inapplicable to most real-world problems. Another consideration is whether the coding method allows the sequential decoding of the data. Sequential decoding retrieves the data in the order in which it appeared in the original content before encoding. Streaming of audio/video content is not possible without the ability to decode sequentially, as the entire data content would have to be decoded before streaming could commence. This is impractical, as it requires that the decoded content be stored locally, which may exceed the system's storage capacity and/or violate the content's copyright/licensing, as well as entail a long delay while the decoding is proceeding before the streaming can begin. For content that does not have a sequential nature (e.g., databases, etc.), a coding method that allows random access into the encoded representation is necessary. The requirement is to encode/decode a specific area of the data set without first encoding/decoding from the start down to the specific area. Performance is an issue for those encoding methods that have this capability, while other encoding methods lack this capability altogether.
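One way to picture the random-access requirement is to assume that the content is encoded as independently decodable, fixed-size blocks, so that a request for a byte range touches only the blocks that cover it. That layout is an assumption of this sketch, not a statement of how any particular code works, and the name `blocks_for_range` is hypothetical.

```python
def blocks_for_range(offset, length, block_size):
    """Return the indices of the fixed-size blocks covering a byte range."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))


# Reading 5,000 bytes at offset 10,000 with (assumed) 4,096-byte blocks
# touches only the blocks that overlap the range, not everything before it.
needed = blocks_for_range(10_000, 5_000, 4_096)
```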
Storage Medium:
The vast majority of on-line content is stored on hard disk drives (HDD). Near-line content, though mostly stored on tape, is migrating to HDD as the cost of the latter continues to come down and their capacity expands. Off-line content is usually stored on tape. Some storage mediums have inherent limitations that preclude some functionality (e.g., linear tape requires sequential access, volatile memory is not capable of retaining data, etc.). Other storage mediums have no such limitations and allow the application of the full functionality of this invention (e.g., HDD's, Flash memory, etc.).
HDD's are most interesting at the present because the growth in their capacity has far outpaced their ability to store and retrieve data. This asymmetry is such that entirely reading or writing a one-terabyte HDD would require many days.
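The asymmetry is simple arithmetic: the time for one full pass over a drive is its capacity divided by its sustained throughput. The sketch below makes this concrete. Both throughput figures are illustrative assumptions, not measurements from this disclosure; the lower one stands for a drive that is also serving other, non-sequential requests.

```python
def full_pass_hours(capacity_bytes, throughput_bytes_per_s):
    """Hours needed to read or write every byte once at a sustained rate."""
    return capacity_bytes / throughput_bytes_per_s / 3600


ONE_TERABYTE = 10**12

# Assumed rates, for illustration only.
sequential_hours = full_pass_hours(ONE_TERABYTE, 60 * 10**6)  # 60 MB/s
contended_hours = full_pass_hours(ONE_TERABYTE, 2 * 10**6)    # 2 MB/s
contended_days = contended_hours / 24
```

Under these assumptions, doubling capacity without a matching gain in throughput doubles the duration of every full pass.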
Another limitation of HDD's is their reliability. No matter what their Mean Time Between Failures (MTBF), HDD's can fail, thereby losing their contents. In order to improve their reliability, HDD's are sometimes grouped into a Redundant Array of Independent Disks (RAID) configuration so that the loss of a single member of the disk group will not interrupt the operations of the RAID. When the defective disk is replaced with a new (empty) disk, the RAID will “rebuild” the data that belongs on the new disk. This is an operation that can take several hours depending upon the size of the disk and how busy the RAID is. Starting from the time the disk failure was first detected and until the time the replacement disk is “rebuilt”, the RAID is said to be “naked.” The term naked indicates that the RAID no longer offers any protection, as the loss of a second member of the disk group is fatal to the RAID since it is incapable of correcting more than one failure.
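The single-failure limit can be seen in a minimal sketch of XOR parity, the scheme used by single-parity RAID levels. The in-memory byte strings standing in for disks, and the function names, are assumptions made for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(disks, parity):
    """Recover the single missing disk (marked None) from the survivors."""
    missing = [i for i, disk in enumerate(disks) if disk is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one lost disk")
    survivors = [disk for disk in disks if disk is not None]
    restored = list(disks)
    # XOR of the parity with all surviving disks yields the lost disk.
    restored[missing[0]] = xor_blocks(survivors + [parity])
    return restored
```

With one disk missing, the XOR of the survivors and the parity reproduces it. With two missing, the single parity equation has two unknowns and the data cannot be recovered, which is why the array is unprotected during a rebuild.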
Virtual File System:
A Virtual File System (VFS) provides a unified view of multiple networked file systems. Conventional VFS technology is essentially a networked file system in that only the real file systems know and manage their contents. A VFS is not a real file system as it relies on real file systems to manage the files on their local disks (thus, the “virtual” aspect). Therefore accessing a file through a VFS amounts to forwarding file I/O commands (e.g., open, close, read, write, etc.) via the network to a remote file system. One advantage of a VFS is that it can provide a common layer over heterogeneous file systems. The main benefit is the translation of file path syntax between the different file systems. Thus, an application running under one file system can access files on a remote file system through a seemingly native file path.
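The path translation performed by a VFS can be sketched as a lookup in a mount table followed by a rewrite into the remote file system's native syntax. The mount table, host names and function name below are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical mount table: virtual prefix -> (host, native root, separator).
MOUNTS = {
    "/vfs/media": ("nas1.example", "/export/media", "/"),
    "/vfs/archive": ("win2.example", "D:\\archive", "\\"),
}


def resolve(virtual_path):
    """Translate a virtual path into (host, native path on that host)."""
    for prefix, (host, root, separator) in MOUNTS.items():
        if virtual_path == prefix or virtual_path.startswith(prefix + "/"):
            parts = [p for p in virtual_path[len(prefix):].split("/") if p]
            return host, separator.join([root] + parts)
    raise FileNotFoundError(virtual_path)
```

Each virtual path resolves to exactly one host, so the file I/O commands for that path are forwarded to a single remote file system.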
One of the limitations of current VFS technology is that it can only represent files that are entirely contained within a single file system.
Scalability:
The amount of data to store is growing at a tremendous rate with no indication of tapering off any time soon. This has resulted in ever-greater capacity and performance requirements for storage servers. The latter have grown to manage terabytes of data, which has exacerbated the I/O throughput problems. Storage Area Networks (SAN) were created to provide high performance Fibre Channel networks connecting servers and large RAID's. SAN's are highly complex and very expensive.
Redundancy:
Powerful servers service many simultaneous data transfers and therefore have a severe impact when they become unavailable. A failure in a non-redundant server will cause immediate disruptions. Redundancy is often added to servers to minimize down time by avoiding single points of failure. Server sub-systems can be made redundant by duplicating their major components, including the hard disk drives (RAID), the host bus adapters (HBA), RAID controllers, CPUs, network interfaces, memory, routers, switches, power supplies, etc. SAN's have the same reliability requirements, so they are made redundant, which requires cross strapping of all the connecting cables from all the servers, RAID's, SAN's, etc. For all this added hardware, most servers and SAN's provide protection against only a single point of failure, as a second failure within the same sub-system will usually cause disruptions. Most fully redundant systems still cause disruptions when their failed components are repaired (e.g., when a memory bank, CPU or I/O controller is replaced). The failed components must be repaired as soon as possible because their unavailability increases the vulnerability of the systems. Thus, fully redundant systems do not eliminate disruptions on failures; they simply afford some time to schedule the disruption. Embodiments of the present invention are inherently able to withstand multiple concurrent failures as well as having repairs performed while operational without disruptions.
Server Failure:
When a storage server or RAID fails, its content becomes unavailable and all of its sessions (data transfers) are aborted. At best a replacement becomes available and the clients re-issue their requests so that the sessions restart. This does not result in the resumption of the sessions at the point of interruption; the sessions have lost their previous context. In some cases, the massive spike of activity due to the hundreds of re-issued requests can by itself overwhelm the new server. Some requestors will not re-issue the request and incomplete content may remain on their systems. It is hoped that the new server has access to the same data as the server that failed; without that access, the new requests will fail. As a rule, the error recovery process is more complex than the transfer process for both the clients and the servers. Typically, a backup/stand-by server detects a server failure and a fail-over procedure is initiated, culminating in the backup server taking over. The new server has no knowledge of on-going transactions that were aborted, as no context is retained during the fail-over procedure. The client systems must recover from the failure and attempt to reconnect to the new server when it completes its take-over for the failed server. It is the burden of the client to keep state and clean up any interrupted transactions. Long database transactions may be impossible to roll back and the clients may no longer have the data to restart the transactions. The solution is usually to restart the client, which requires operator intervention. A very large industry has developed to provide software solutions as workarounds. These are very complex and expensive, yet cannot avoid disruptions either when the failure occurs or when the repair occurs.
Load Balancing:
Load balancing is a major problem for enterprise data centers. Load balancing encompasses all of the resources of a server including the CPU(s), the memory, the storage capacity, the storage I/O bandwidth, the network bandwidth, etc. The availability of large RAID Storage Systems and powerful servers is not sufficient to ensure load balancing. The location of the most requested data determines which servers and their network segments bear the greatest load. Thus, a data center could have several RAID Storage Systems of identical capacity with very different loads based upon usage patterns. Typically, adjusting the load is a complex endeavor requiring service disruption due to the need to move data, take RAID systems off-line in order to re-stripe them, etc.
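The dependence of load on data placement, rather than on capacity, can be illustrated with a small sketch. The placement map, the request mix and the function name are all made-up assumptions used only to show the effect.

```python
from collections import Counter


def load_per_server(placement, requests):
    """Count the requests landing on each server, given where items live."""
    load = Counter({server: 0 for server in set(placement.values())})
    for item in requests:
        load[placement[item]] += 1
    return dict(load)


# Two (hypothetical) RAID systems each hold two items of equal size,
# but one item is far more popular than the others.
placement = {"a": "raid1", "b": "raid1", "c": "raid2", "d": "raid2"}
requests = ["a"] * 90 + ["b"] * 5 + ["c"] * 3 + ["d"] * 2
load = load_per_server(placement, requests)
```

Although both systems store the same amount of data, nearly all requests fall on the one that holds the popular item, and correcting this requires moving data between them.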