In computing, specifically computer storage, a redundant array of independent (or inexpensive) drives (or disks) (RAID) is an umbrella term for data storage schemes that divide and/or replicate data among multiple hard drives. They offer, depending on the scheme, increased data reliability and/or throughput.
Fundamentally, RAID combines multiple hard disks into a single logical unit. There are two ways this can be done: in hardware and in software. Hardware combines the drives into a logical unit in dedicated hardware which then presents the drives as a single drive to the operating system. Software does this within the operating system and presents the drives as a single drive to the users of the system. RAID is typically used on servers but can be used on workstations. This is especially true in storage-intensive computers such as those used for video and audio editing.
Three of the concepts used in RAID are mirroring, striping and error correction. Using mirroring, data is copied to more than one disk. Using striping, data is split across more than one disk. For error correction, redundancy coding methods, such as XOR parity and/or Reed-Solomon codes, are used to protect data in case of device failure. Different RAID levels use one or more of these techniques, depending on the system requirements.
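The striping and XOR parity concepts described above can be sketched as follows. This is a minimal illustration, not a description of any particular RAID level or controller: the function names `xor_blocks`, `stripe_with_parity` and `rebuild`, the single-stripe layout, and the zero padding are all assumptions made for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data, n_data_disks, block_size):
    """Split data into one stripe of n_data_disks blocks plus one XOR parity block.

    Hypothetical layout: a single stripe, zero-padded to full width.
    """
    capacity = n_data_disks * block_size
    if len(data) > capacity:
        raise ValueError("data does not fit in one stripe")
    padded = data.ljust(capacity, b"\x00")
    blocks = [padded[i * block_size:(i + 1) * block_size]
              for i in range(n_data_disks)]
    return blocks + [xor_blocks(blocks)]


def rebuild(stripe, failed_index):
    """Recompute the block of one failed disk from all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    stripe = stripe_with_parity(b"RAIDDATA", n_data_disks=2, block_size=4)
    # Losing any single block (data or parity) is recoverable.
    print(rebuild(stripe, failed_index=0))
```

Because XOR is its own inverse, the same routine regenerates either a lost data block or the lost parity block, which is why a single parity block tolerates exactly one device failure per stripe.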
A hardware implementation of RAID requires at a minimum a special-purpose RAID controller. On a desktop system, this may be a PCI (Peripheral Component Interconnect) expansion card, or might be a capability built into the motherboard. In industrial applications the controller and drives are provided as a stand-alone enclosure. The drives may be IDE/ATA, SATA, SCSI (Small Computer System Interface), SSA, Fibre Channel (FC), or any combination thereof. The host system can be directly attached to the controller or, more commonly, connected via a storage area network (SAN). The controller hardware handles the management of the drives, and performs any parity calculations required by the chosen RAID level.
The RAID controller is best described as a device in which servers and storage intersect. The controller can be internal to the server, in which case it is a card or chip. Alternatively, it can be external to the server, in which case it is an independent enclosure, such as a NAS (network-attached storage). In either case, the RAID controller manages the physical storage units in a RAID system and delivers them to the server in logical units (e.g., six physical disks may be used to ensure that one drive stays correctly backed up, but the server sees only one drive).
When the primary functional blocks of a RAID controller are integrated in a System on a Chip (SOC) VLSI device, the device is often called a RAID controller chip, or RAID-on-a-Chip (ROC).
FIG. 1 illustrates the conceptual diagram of a typical architecture of a ROC 10. Many of the current RAID controller SOCs implement variants or a subset of this basic architecture, including but not limited to the following devices: LSI 1078; Intel IOP333, Intel Sunrise Lake IO processor; AMCC PPC440.
The key components of this class of architecture include: a single CPU 12, typically a RISC-based embedded processor core; Host bus interface 14, typically PCI, PCI-X, or PCIe interfaces; Memory interface or memory controller 16, typically DDR or DDR-II interfaces; Storage Interface or Storage Protocol controller 18, typically SCSI, ATA, SATA, SAS, or Fibre Channel Interfaces; XOR or RAID6 engine 20; direct memory access (DMA) logic 22; Peripheral interfaces or buses 24 including UART, I2C, GPIO, SGPIO; and Message Unit 26 for communication between host CPU and the ROC CPU.
Within the ROC memory space there are two basic data types. The first type of data (Low Latency) is used for the control of data movement and includes ROC CPU instructions & data, DMA Engine status & control structures, Messaging Unit messages, and Storage Interface Controller status & control structures. The second data type (High Bandwidth) is Host Application Data. The second data type is never accessed by the ROC CPU. For disk drive status, drive control, and drive metadata, either a Low Latency or High Bandwidth path may be used, but both the producer and consumer of the data must use the same path.
An interconnect mechanism is required to handle the data movement/access of the two types of data structures, or data types, to satisfy the performance requirements for both types of data structures. Known architectures are all based on a system bus interconnect mechanism 28 to provide inter-block communication and to provide access to the common memory spaces, including local (DDR) and host (PCI) memory space. Because many of the data structures and control structures are quite complex and significant in size, they are typically stored in DDR or PCI memory space.
The system bus interconnect mechanism 28 provides a method of arbitration to allow multiple blocks (including various hardware engines and the CPU) to access shared data and control structures. Typically, the system bus provides a peer-to-peer communication mechanism, wherein some system blocks (e.g. the ROC CPU) perform an exclusively master function, generating transaction requests to an address space. Some blocks (e.g. the DDR memory controller) perform an exclusively slave function, receiving and responding to transaction requests. Other blocks (e.g. SAS controller, XOR engine) perform primarily a master function for data movement and for accessing control structures, but also have an associated slave function for control and configuration purposes (e.g. control registers for the DMA engine).
Some approaches employ a common internal fabric, in which the same fabric is used to support Low Latency traffic and High Bandwidth traffic, without treating these two types of access differently. This approach is typically used in earlier ROC implementations or in systems with lower performance requirements. There are a number of ways to segregate Low Latency traffic from High Bandwidth traffic. Three such approaches are described briefly below.
1) Multiple internal fabrics—a portion is dedicated for Low Latency traffic and a portion is dedicated for High Bandwidth traffic. Masters of one fabric type would not be allowed to talk to slaves of the other type.
2) A more hybrid method is to have multiple fabrics, wherein portions are dedicated for Low Latency and other portions are dedicated for High Bandwidth. The determination of the particular fabric targeted is typically based on address or attributes.
3) The last is to have an internal bus structure where multiple virtual channels are created: one for High Bandwidth traffic and the other for Low Latency traffic. For this architecture to work, the following features should be present: small data packet size (less than 256 bytes); support for deep pipelining (due to the small packet size); support for out-of-order reads; support for high-priority Low Latency traffic arbitration; and a reserved pipelining depth that always allows Low Latency requests to be made (the reservation depth should be programmable). Selection of the targeted virtual channel or fabric is based on address or attributes.
Regardless of how the Low Latency and High Bandwidth requirements are met, targets or slaves that handle both types of traffic (PCI Modules & Memory Controller) must distinguish the two traffic types and give priority to Low Latency requests. Additionally, they should reserve resources for Low Latency requests. Both of these techniques ensure that a Low Latency request does not get delayed due to a large number of queued High Bandwidth requests.
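The two techniques named above, priority for Low Latency requests and reserved resources for them, can be sketched in software as a request queue at a shared target. This is only an illustrative model under stated assumptions: the class name `TargetQueue`, the `"LL"`/`"HB"` tags, and the slot-counting policy are hypothetical and are not taken from any specific controller.

```python
from collections import deque


class TargetQueue:
    """Hypothetical model of a shared target (e.g. a memory controller) queue.

    depth       -- total number of request slots
    reserved_ll -- slots that High Bandwidth (HB) requests may never occupy,
                   so a Low Latency (LL) request can always be accepted
    """

    def __init__(self, depth, reserved_ll):
        if not 0 <= reserved_ll <= depth:
            raise ValueError("reserved_ll must be between 0 and depth")
        self.depth = depth
        self.reserved_ll = reserved_ll
        self.ll = deque()
        self.hb = deque()

    def submit(self, kind, request):
        """Try to enqueue a request; return True if it was accepted."""
        used = len(self.ll) + len(self.hb)
        if used >= self.depth:
            return False
        if kind == "LL":
            self.ll.append(request)
            return True
        if kind == "HB":
            # HB traffic is capped so that the reserved slots stay free for LL.
            if len(self.hb) >= self.depth - self.reserved_ll:
                return False
            self.hb.append(request)
            return True
        raise ValueError("kind must be 'LL' or 'HB'")

    def next_request(self):
        """Serve LL requests first, then HB; return None when idle."""
        if self.ll:
            return self.ll.popleft()
        if self.hb:
            return self.hb.popleft()
        return None
```

With `depth=4` and `reserved_ll=1`, a burst of High Bandwidth requests fills at most three slots, so a later Low Latency request is still accepted and is served ahead of the queued High Bandwidth requests.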
The architecture of FIG. 1 has difficulties scaling towards higher RAID system performance requirements, as the speed of interfaces increases according to new protocol standards such as SAS-2 and PCIe Gen2. The main limitations include:
1. Lack of multi-processor support, which limits the RAID stack firmware performance as CPU cycles become saturated.
2. Even when multiple processors are added to the ROC architecture, CPU performance is still hampered by fabric contention as the processors compete for access to DDR or other common resources.
3. Bus contention also occurs between the CPUs and the hardware engine DMA operations. This serializes many operations that could otherwise be executed in parallel.
4. Bus fabric typically causes Head of Line (HOL) blocking issues on concurrent transfers over different physical interfaces. For example, suppose a DMA master handles a top-of-the-queue request to transfer data from DDR. If the DMA master faces temporary contention on DDR, subsequent transfers that request data from PCI are blocked even though the PCI interface is completely idle in the meantime.
5. The fabric interconnected ROC architecture lacks a shared buffer resource that is accessible by all subsystems. This often leads to distributed small local buffer memories, one required by each subsystem. For example, each Phy of the SAS controller requires a staging buffer for receiving and transmitting frames. The distributed buffers result in low memory utilization, low memory density (due to the small size of each buffer RAM), large overhead for the logic that handles DFT (Design For Testability) and ECC, and a lack of flexibility for sharing buffer resources across local function block boundaries.
6. In a conventional ROC architecture, the off-chip DRAM (or host memory on PCI) is the only memory that is commonly accessible by all hardware engines and CPUs and that provides sufficient capacity for data buffers or control structures. Consequently, the off-chip memory is typically used as the operational space for the CPUs and the hardware subsystems. In other words, the majority of control functions (such as exchange of descriptors and hardware context) and data functions (such as receiving frames, performing XOR computation, or encryption computation) are defined to be off-chip memory oriented (they use off-chip memory as the storage space for descriptors, source data, and result data). In order for any of those types of operations to be performed, the CPU or the hardware subsystems involved need to access off-chip memory. This intensive reliance on off-chip memory puts a heavy load on the memory bus, and consequently the system performance is bounded by how much memory bandwidth can be provided by the specific implementation. The contention for off-chip memory bandwidth often leads to greater latency, and consequently to idle waiting time in the CPU and the hardware engines. Through performance profiling, it was concluded that DRAM bandwidth is the typical system bottleneck in most conventional ROC architectures. Given that the DRAM bandwidth available to a ROC device is limited by the current state of the art of DRAM technology and by the width of the memory bus, which is dictated by the cost point of the ROC solution, it is impractical to increase DRAM bandwidth beyond a certain established threshold at given design constraints. Hence the memory bottleneck becomes the ultimate limiter of scaling ROC system performance.
7. A further limitation of the off-chip memory oriented processing by the hardware engines is the lack of support of multi-pass processing of data. In a storage application, it is a common requirement for a block of data to be processed by multiple engines in a pipeline fashion. An example of such multi-pass processing is that data is fetched from DRAM buffer, a T10 DIF checksum and tag are computed and the AES encryption is applied to the data before the device writes the data into a target disk. To perform such a multi-pass operation, the conventional ROC architecture requires the data to be fetched from off-chip memory, processed, and then the intermediate results are written back to off-chip memory for each of the intermediate steps of the processing sequence. This results in data movement on and off-chip multiple times, consuming the critical memory bandwidth resource. The net result is lower system performance and higher power consumption for a given processor/memory speed.
8. A further common characteristic of the conventional ROC architecture is the lack of a centralized DMA service that provides data movement, address translation (such as handling of scatter-gather), and management of context switching (on transfers that involve only part of a scatter-gather buffer). Since most of the hardware processing subsystems perform operations on data blocks from external memory buffers, and write back results to external memory buffers, having a DMA function in each hardware processing subsystem (for directly accessing the external memory interface) is a common design approach. This distributed DMA approach has the following shortcomings:
a. Repeated logic leads to greater gate count and higher cost of the ROC system. In particular, the complexity of handling address translation (scatter-gather) and the context saving/switching for partial scatter-gather buffer handling is repeated in each hardware subsystem that requires a DMA function. The logic for handling misaligned addresses, paging of data in and out of memory, endianness conversion, etc. is also repeated. Bug fixes in one subsystem may or may not be propagated to the DMAs of other subsystems that have a similar design.
b. Distributed DMAs compete for access to external memory interfaces autonomously, leading to inefficient usage and sometimes unfair sharing of external memory bandwidth.
c. The distributed DMA approach often leads to divergence in the DMA features mentioned in a) as the design evolves, resulting in inconsistent programming interfaces for the various hardware engines and in difficulty maintaining the hardware and software source code.
d. Distributed DMAs often require arbitration in the internal fabric, which leads to higher latency, and larger overhead for accessing external interfaces. This often leads to lower performance, and higher cost of buffering.
e. Conventional DMA operations are based on a physical address or the physical addressing of scatter-gather list without the notion of virtual address to physical address translation. The lack of virtual address DMA seriously complicates the design of the DMA master, particularly in the handling of context saving/switching among partial transfers.
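The scatter-gather address translation and the context saving for partial transfers referred to in item 8 can be sketched as follows. This is a simplified software model, assuming a scatter-gather list of `(physical_address, length)` pairs; the names `SGContext` and `next_chunks` are hypothetical and do not correspond to any real DMA programming interface.

```python
from dataclasses import dataclass


@dataclass
class SGContext:
    """Saved position inside a scatter-gather list (hypothetical structure)."""
    index: int = 0   # current scatter-gather element
    offset: int = 0  # bytes already consumed within that element


def next_chunks(sg_list, ctx, nbytes):
    """Translate the next nbytes of a logical buffer into physical segments.

    sg_list is a list of (physical_address, length) pairs. Returns the list of
    (physical_address, length) segments to transfer and the context to save,
    so that a later partial transfer can resume where this one stopped.
    """
    segments = []
    index, offset = ctx.index, ctx.offset
    remaining = nbytes
    while remaining > 0 and index < len(sg_list):
        address, length = sg_list[index]
        take = min(remaining, length - offset)
        if take > 0:
            segments.append((address + offset, take))
        remaining -= take
        offset += take
        if offset == length:
            # Element exhausted: advance to the next one.
            index += 1
            offset = 0
    return segments, SGContext(index, offset)


if __name__ == "__main__":
    sg = [(0x1000, 100), (0x8000, 50)]
    first, saved = next_chunks(sg, SGContext(), 120)
    second, saved = next_chunks(sg, saved, 100)
    print(first, second, saved)
```

In the distributed approach criticized above, logic equivalent to `next_chunks` and the saved `SGContext` would have to be duplicated in every hardware subsystem that masters the external memory interface.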
To meet the increasing requirement of RAID on a Chip applications, the ROC architecture needs to evolve to address the following needs:
1. To scale up the raw CPU processing speed as represented by the number of instructions the ROC device can execute per second. Since the speeds of storage interfaces and PCI bus interfaces are accelerating at a faster rate than the increase in CPU processing speed offered by Moore's Law in recent years, a single embedded processor can no longer keep pace with the requirement of RAID processing. Consequently, the ROC architecture should support multiple processors on-chip.
2. To support emerging application feature requirements for data security, data integrity, and more advanced RAID processing such as RAID6, the ROC architecture should support advanced hardware acceleration functions for data encryption, DIF (Data Integrity Field) as specified by T10 SBC standards, compression, hashing, etc. at very high data throughput.
3. To provide an efficient and uniform DMA structure that presents a consistent representation of DMA operations and scatter/gather buffer definitions, and that is capable of handling a large number of concurrent buffers and fast context switching among partial buffer transfers. Ideally the DMA operation should be consistent no matter which type of hardware and software acceleration/operation functions are applied to the data. The DMA should also be non-blocking in order to fully utilize the bandwidth of system buses, storage interfaces, and on-chip and off-chip memory interfaces.
4. The ROC architecture should minimize the data traffic moving across external interfaces, specifically the off-chip memory interfaces, to relieve the system performance bottleneck on memory bandwidth.
5. To have an efficient on-chip system interconnect for the CPUs and the hardware engines, the DMAs should allow all processing engines and firmware programs to execute concurrently without blocking each other while competing for access to the system interconnect.
6. To support RAID processing without using an external DRAM interface. Ideally this can be done without significantly changing the software architecture of the RAID stack. This should be achieved either with on-chip memory resources or host memory resource in the PCI space.
7. To support multi-pass processing, wherein the data is processed by multiple hardware and/or firmware entities without transferring the intermediate results off-chip and on-chip again between the processing steps, with flexibility in the type and sequence of the operations applied, and with the capability to context switch among partial data processing operations that are associated with independent processing threads or I/O processes.
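The memory bandwidth cost of multi-pass processing can be estimated with a short calculation. The sketch below is a rough model under simplifying assumptions that are not stated in the text: every stage reads and writes a block of the same size (the few bytes added by a DIF field are ignored), and the function name `offchip_bytes` is hypothetical.

```python
def offchip_bytes(block_bytes, stages, chained_on_chip):
    """Estimate off-chip memory traffic for a multi-stage processing pipeline.

    Assumed model: without on-chip chaining, every stage reads its input from
    off-chip memory and writes its output back. With chaining, the data is
    read once and the final result is written once.
    """
    if stages < 1:
        raise ValueError("stages must be at least 1")
    passes = 1 if chained_on_chip else stages
    return 2 * block_bytes * passes


if __name__ == "__main__":
    # Two stages (e.g. DIF computation followed by encryption) on a 4 KiB block.
    print(offchip_bytes(4096, stages=2, chained_on_chip=False))
    print(offchip_bytes(4096, stages=2, chained_on_chip=True))
```

Under this model the off-chip traffic of the conventional approach grows linearly with the number of stages, while chaining the stages on-chip keeps it constant.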
It is, therefore, desirable to provide an improved system on a chip architecture.