1. Field
Example embodiments of the following description relate to a processor, and more particularly, to a processor for processing stream data at a high speed.
2. Description of the Related Art
Processors that handle a large amount of data, for example, an SRP, mainly use batch-based processing. In batch-based processing, the input data and/or output data required for an operation are collected in an L1 memory in a predetermined amount, and the collected data is then processed.
First, input data received from an external source may be collected, in a designated amount, in an input data buffer of the L1 memory. While an operation is performed on the collected input data, the resulting output data may be collected in an output data buffer of the L1 memory. Subsequently, the collected output data may be transmitted to an external destination. These operations may be performed simultaneously or sequentially. Because every batch must be staged in the L1 memory, batch-based processing inevitably requires a high-cost L1 memory with a large input and/or output (I/O) bandwidth and a large storage capacity.
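The collect, operate, and transmit sequence described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the processor's actual behavior: the function name `batch_process`, the list-based buffers standing in for the L1 input and output data buffers, and the element-wise operation `op` are all hypothetical.

```python
from typing import Callable, Iterable, List


def batch_process(stream: Iterable[int], batch_size: int,
                  op: Callable[[int], int]) -> List[List[int]]:
    """Hypothetical model of batch-based processing.

    Input words are collected in an input buffer until a full batch is
    present, the operation is applied to the whole batch, the results are
    collected in an output buffer, and the output batch is transmitted.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    transmitted: List[List[int]] = []  # stands in for the external destination
    input_buffer: List[int] = []       # stands in for the L1 input data buffer
    for word in stream:
        input_buffer.append(word)
        if len(input_buffer) == batch_size:
            # Operate on the collected batch; results fill the output buffer.
            output_buffer = [op(x) for x in input_buffer]
            transmitted.append(output_buffer)  # send the collected output
            input_buffer = []
    if input_buffer:  # flush a final, partially filled batch
        transmitted.append([op(x) for x in input_buffer])
    return transmitted
```

For example, `batch_process(range(7), 3, lambda x: x * 2)` yields three output batches, `[0, 2, 4]`, `[6, 8, 10]`, and `[12]`. Note that both buffers must be large enough to hold a whole batch, which is why the batch size drives the required L1 capacity.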
Hereinafter, conventional batch-based processing will be further described with reference to FIG. 1.
FIG. 1 illustrates a diagram of a structure of a conventional processor.
Referring to FIG. 1, the conventional processor may include a memory 110 and a functional unit 120. The functional unit 120 may perform an operation, and the memory 110 may store I/O data of the operation.
To achieve high performance in the conventional processor, both a high-speed functional unit 120 and a high-speed memory 110 (for example, an L1 memory) may be required. The functional unit 120 may directly access the memory 110 to store the I/O data. The memory 110 may include, for example, a cache memory or a scratch pad memory (SPM).
When a processor is used to process a large amount of data, for example, for multimedia or scientific computation, the functional unit 120 needs to handle the required amount of computation and, at the same time, the memory 110 needs to provide sufficient data bandwidth.
In the conventional processor, each of an input buffer 111 and an output buffer 112 may use double buffering with buffers A and B, so that an operation of the functional unit 120, input data loading by an external data producer 101, and output data fetching by an external data consumer 102 may be performed simultaneously.
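The double buffering of the input buffer 111 can be illustrated with a small step-by-step model. In this sketch, which is an assumption-laden simplification rather than the described hardware, the external data producer 101 loads one buffer while the functional unit 120 processes the other, and the roles of buffers A and B swap every step. The function name `double_buffered_run` and the per-step trace format are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def double_buffered_run(
    batches: Sequence[List[int]], op: Callable[[int], int]
) -> Tuple[List[List[int]], List[Tuple[Optional[str], Optional[str]]]]:
    """Hypothetical ping-pong (A/B) model of a double-buffered input buffer.

    Returns the processed batches and, for each step, a pair
    (buffer loaded by the producer, buffer processed by the functional unit).
    """
    buffers: List[Optional[List[int]]] = [None, None]  # index 0 = A, 1 = B
    results: List[List[int]] = []
    trace: List[Tuple[Optional[str], Optional[str]]] = []
    for step in range(len(batches) + 1):
        fill = step % 2   # buffer the producer loads during this step
        work = 1 - fill   # buffer the functional unit reads during this step
        # In hardware both actions happen in the same step; they touch
        # different buffers, so modeling them one after the other is safe.
        computed: Optional[str] = None
        if buffers[work] is not None:
            results.append([op(x) for x in buffers[work]])
            buffers[work] = None  # buffer is free for the next load
            computed = "AB"[work]
        loaded: Optional[str] = None
        if step < len(batches):
            buffers[fill] = list(batches[step])
            loaded = "AB"[fill]
        trace.append((loaded, computed))
    return results, trace
```

With three batches, the trace is `("A", None)`, `("B", "A")`, `("A", "B")`, `(None, "A")`: after the first fill step, loading and computation overlap on every step. The cost of this overlap is that the input buffer must hold two batches instead of one.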
The memory 110 of the conventional processor needs to simultaneously satisfy the following I/O bandwidth requirements:
1. Write to the input buffer
2. Read from the input buffer
3. Random access to the L1 memory
4. Write to the output buffer
5. Read from the output buffer
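Because all five accesses listed above may be active in the same cycle, their bandwidth demands add up at the memory 110. The following sketch works through that arithmetic. The per-stream rates, the port width, and the names `StreamRates`, `required_l1_bandwidth`, and `ports_needed` are hypothetical illustration values, not figures from the conventional processor.

```python
from dataclasses import dataclass


@dataclass
class StreamRates:
    """Words per cycle each access stream needs (hypothetical values)."""
    input_write: int    # 1. external producer writes the input buffer
    input_read: int     # 2. functional unit reads the input buffer
    random_access: int  # 3. functional unit randomly accesses the L1 memory
    output_write: int   # 4. functional unit writes the output buffer
    output_read: int    # 5. external consumer reads the output buffer


def required_l1_bandwidth(r: StreamRates) -> int:
    """Aggregate words/cycle the memory must sustain with all streams active."""
    return (r.input_write + r.input_read + r.random_access
            + r.output_write + r.output_read)


def ports_needed(r: StreamRates, words_per_port: int) -> int:
    """Minimum number of memory ports of a given width (ceiling division)."""
    if words_per_port <= 0:
        raise ValueError("words_per_port must be positive")
    total = required_l1_bandwidth(r)
    return -(-total // words_per_port)
```

For instance, with assumed rates of 4, 4, 2, 4, and 4 words per cycle, the memory must sustain 18 words per cycle, which takes at least 5 ports that are each 4 words wide.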
As the conventional processor is required to process larger amounts of data with higher performance, the I/O bandwidth requirements may increase. However, implementing a memory of considerable capacity entirely as a multi-port, wide-I/O memory incurs a high H/W area cost and a heavy design burden, so either performance or cost must be sacrificed. Here, the memory must support high-speed operation with a high I/O bandwidth in order to provide batch-based processing.
Additionally, when the functional unit 120 uses an H/W pipeline or an S/W pipeline to consecutively process a series of operations, it is efficient, once the maximum throughput is reached, to process as large an amount of data as possible (for example, a full batch) at one time. When the data is instead processed in several separate portions, bubbles may occur in the pipeline, reducing efficiency.
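The cost of pipeline bubbles can be estimated with a simple model. Assume, hypothetically, a pipeline of depth `d` that accepts one data item per cycle and must be filled and drained for each separately processed portion. Streaming `n` items at once then takes `n + d - 1` cycles, while splitting them into `k` portions takes `n + k(d - 1)` cycles. The function names below are illustrative only, and the model ignores all other stalls.

```python
def pipeline_cycles(num_items: int, depth: int, portions: int = 1) -> int:
    """Cycles to process num_items in a pipeline of the given depth, assuming
    one item enters per cycle and the pipeline fills and drains once per
    separately processed portion (each drain costs depth - 1 bubble cycles)."""
    if num_items <= 0 or depth <= 0 or portions <= 0:
        raise ValueError("arguments must be positive")
    portions = min(portions, num_items)  # a portion holds at least one item
    return num_items + portions * (depth - 1)


def utilization(num_items: int, depth: int, portions: int = 1) -> float:
    """Fraction of cycles in which the pipeline accepts a new item."""
    return num_items / pipeline_cycles(num_items, depth, portions)
```

Under these assumptions, 1024 items in a depth-8 pipeline take 1031 cycles as a single batch (about 99% utilization) but 1136 cycles when processed as 16 portions (about 90%). Larger batches amortize the fill and drain cost, which is why a larger batch, and therefore a larger memory, improves pipeline efficiency.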
However, a large batch size requires a large-capacity memory, so the H/W area cost may increase in proportion to the capacity of the memory.
Accordingly, there is a desire for a stream I/O interface architecture that may process a large amount of data more efficiently by overcoming the limitations of conventional batch-based processing that relies on an L1 memory.