1. Field of the Invention
The present invention relates to a memory system, and more particularly to a conflict-free memory system for a Single-Instruction Multiple-Data (SIMD) stream processor having pq Processing Elements (PE's). The memory system reduces memory access time by supporting simultaneous access to pq data elements of various subarray types, namely four directional blocks (p×q) and eight directional lines, with a constant interval, at any location within an M×N data array.
2. Description of the Related Art
There have been certain application areas for a Single-Instruction Multiple-Data (SIMD) stream processor which consists of a common control unit, a shared memory module which stores an SIMD command program, and a large number of processing elements (PE's) with a conflict-free (CF) memory system, which can be attached to a host computer with a main memory as shown in FIG. 1.
Some of those application areas are image processing operations, two-way merge sort, successive-doubling Fast Fourier Transform, recursive doubling, and basic matrix operations, M. J. B. Duff, “Computing Structures for Image Processing,” Academic Press, 1983; J. L. Potter, IEEE Computer, vol. 16, No. 1, pp. 62-67, January 1983; K. Preston, Jr., IEEE Computer, vol. 16, No. 1, pp. 36-47, January 1983; T. J. Fountain, K. N. Matthews, and M. J. B. Duff, IEEE Trans. PAMI, vol. 10, No. 3, pp. 310-319, May 1988; H. S. Wallace and M. D. Howard, IEEE Trans. PAMI, vol. 11, No. 3, pp. 227-232, March 1989; L. A. Schmitt and S. S. Wilson, IEEE Trans. PAMI, vol. 10, No. 3, pp. 320-330, May 1988; V. Dixit and D. I. Moldovan, IEEE Trans. PAMI, vol. 9, No. 1, pp. 153-160, January 1987; L. Uhr, “Parallel Computer Vision,” Academic Press, 1987; A. Rosenfeld, “Multiresolution Image Processing and Analysis,” Springer-Verlag, 1984; G. Y. Kim, “Parallel Memory Architectures for Image Processing and Wavelet-based Video Coding,” Ph.D. Thesis, Korea Advanced Institute of Science and Technology (1999); G. A. Baxes, “Digital Image Processing,” Prentice-Hall (1984); H. E. Burdick, “Digital Imaging,” McGraw-Hill (1997); E. Horowitz and S. Sahni, “Data Structures in Pascal,” Computer Science Press (1984); J. W. Cooley, P. A. W. Lewis, and P. D. Welch, IEEE Trans. Educ., vol. E-12, No. 1, pp. 27-34, 1969; D. T. Harper III and D. A. Linebarger, “Storage Schemes for Efficient Computation of a Radix 2 FFT in a Machine with Parallel Memories,” in Proc. 1988 Int. Conf. Parallel Processing, 1988; H. S. Stone, “High-performance Computer Architecture,” Addison Wesley (1993); D. J. Kuck and R. A. Stokes, IEEE Trans. Comput., vol. C-31, pp. 362-376, May 1982, etc.
For those applications, the common control unit of the SIMD computer first issues a command to retrieve a set of data elements, specified by a subarray type, base coordinates, and a constant interval between the data elements, from the conflict-free memory system and to assign them to the PE's. The control unit then issues a command to the PE's to perform the same operation on the different data elements. Therefore, for the efficient utilization of the PE's of the SIMD processor, the important goals of the memory system are as follows:
(1) Various subarray types and constant intervals: The memory system should support simultaneous access to various types of data elements that are related by a constant interval, which is a positive integer.
(2) Simultaneous access with no restriction on the location: The position of the data elements to be accessed simultaneously can be anywhere within a given data array.
(3) Simple and fast address calculation and routing circuitry: The address calculation and routing should be simple and fast.
(4) Simple and fast data routing circuitry: The data routing should be simple and fast.
(5) No burden to the PE's: The address calculation and routing, and the data routing, should be performed in the memory system without burdening the PE's, so that the only interface between the memory system and the PE's is a data register.
(6) Small number of memory modules: The number of memory modules of the memory system should be as small as possible while remaining greater than or equal to the number of PE's.
For some time, there has been much research on storage schemes that increase the utilization of a memory system with multiple memory modules. One approach is to overlap the memory access times of the modules by issuing memory requests sequentially (D. T. Harper III, IEEE Trans. Parallel Distrib. Syst., vol. 2, pp. 43-51, January 1991; D. T. Harper III, IEEE Trans. Comput., vol. C-41, pp. 227-230, February 1992), which is not suitable for the SIMD processor mentioned above.
A simple storage scheme of the memory system for the SIMD processor is memory interleaving, which maps an address a to the memory module (a mod m), where m is the number of memory modules in the memory system. Because the number of memory modules in the interleaved storage scheme is the same as the number of PE's of the SIMD processor, the address calculation and the routing of the data elements are simple to implement, and this scheme has been incorporated in many SIMD processors. Unfortunately, the performance of the interleaved storage scheme is relatively low because of conflicts at the memory modules: the interleaved scheme does not support simultaneous access to various subarray types of data elements that are related by a constant interval, W. Oed and O. Lange, IEEE Trans. Comput., vol. C-34, pp. 949-957, October 1985; D. Bailey, IEEE Trans. Comput., vol. C-36, pp. 293-298, March 1987.
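The conflict behavior of interleaving can be illustrated with a minimal Python sketch. The function names, the choice of m = 4 modules, and the assumption that the array is stored in row-major order with rows of 8 elements (so that a column walk has stride 8) are illustrative assumptions, not details taken from the cited schemes.

```python
from collections import Counter


def interleaved_module(address, m):
    """Memory interleaving: address a is stored in module (a mod m)."""
    return address % m


def max_conflicts(base, stride, count, m):
    """Largest number of simultaneous requests that land on one module."""
    modules = [interleaved_module(base + k * stride, m) for k in range(count)]
    return max(Counter(modules).values())


m = 4  # assumed: 4 memory modules serving 4 PE's

# Four requests issued at the same time with different constant intervals.
unit_stride = max_conflicts(base=0, stride=1, count=4, m=m)
even_stride = max_conflicts(base=0, stride=2, count=4, m=m)
column_walk = max_conflicts(base=0, stride=8, count=4, m=m)
```

A result of 1 means every request reaches a different module. Under these assumptions only the unit stride is conflict-free; the stride-8 column walk sends all four requests to the same module, which serializes the access.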
Meanwhile, there was a proposal to improve the average performance over the interleaved memory system, but this attempt is not very useful for the SIMD processor because any conflict of the memory requests on the same module delays all of the operations of the PE's of the SIMD processor.
Another class of storage schemes is that of nonlinear schemes investigated by several researchers. Most nonlinear skewing schemes are based on bitwise XOR operations, which were first considered by Batcher, K. Batcher, IEEE Trans. Comput., vol. 26, no. 1, pp. 174-177, 1977. The XOR scheme, which was generalized by Frailong et al., computes the storage location as a dot product of the address of data element and a transformation matrix. In the XOR scheme, it is simple to calculate addresses of data elements and to route them to memory modules, where the number of memory modules is a power of two, but it restricts the subarray types of data elements, constant intervals between the data elements, or the location of the data elements.
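A minimal Python sketch of an XOR-type mapping follows. The transformation matrix T, the 4×4 array size, and the row-major address 4i+j are assumptions chosen for illustration; they are not the matrices used by Batcher or Frailong et al. Each row of T is written as a bit mask, and each bit of the module number is the parity (XOR) of the address bits selected by that mask, which is a dot product over GF(2).

```python
def xor_module(address, matrix_rows):
    """Module number as T * address over GF(2); each row of T is a bit mask."""
    module = 0
    for bit_index, row_mask in enumerate(matrix_rows):
        # Parity of the selected address bits is their XOR.
        parity = bin(address & row_mask).count("1") % 2
        module |= parity << bit_index
    return module


N = 4                    # assumed: 4x4 array, row-major address = N*i + j
T = [0b0101, 0b1010]     # assumed matrix: module = (column bits) XOR (row bits)


def modules_for(cells):
    return [xor_module(N * i + j, T) for i, j in cells]


row_modules = modules_for([(1, j) for j in range(4)])
column_modules = modules_for([(i, 2) for i in range(4)])
diagonal_modules = modules_for([(k, k) for k in range(4)])
```

With this particular matrix, a row and a column each touch four different modules, while the main diagonal maps entirely to one module. This is one concrete instance of the restriction on subarray types noted above.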
The other class of storage schemes is the linear skewing scheme, which maps the data element located at (i, j) of a matrix to the memory module (ai+bj) mod m, where a and b are constants and m is the number of memory modules. The linear skewing scheme was first considered by Budnik and Kuck (P. Budnik and D. J. Kuck, IEEE Trans. Comput., vol. C-20, pp. 1566-1569, December 1971), and the properties of the schemes were investigated by Shapiro, and Wijshoff and Van Leeuwen, H. Shapiro, IEEE Trans. Comput., vol. C-27, no. 5, pp. 421-428, May 1978; H. Wijshoff and J. Van Leeuwen, IEEE Trans. Comput., vol. C-34, no. 6, pp. 501-505, June 1985; H. Wijshoff and J. Van Leeuwen, IEEE Trans. Comput., vol. C-36, no. 2, pp. 233-239, February 1987. Budnik and Kuck, Shapiro, Wijshoff and Van Leeuwen, and Lawrie proved that a memory system can access data elements simultaneously without conflicts within a block, a row, a column, a forward-diagonal, or a backward-diagonal subarray if the number of memory modules is a prime number greater than the number of data elements, D. H. Lawrie, IEEE Trans. Comput., vol. C-24, no. 12, pp. 1145-1155, December 1975. The drawback of the linear skewing schemes is one of the following: either the address calculation, address routing, and data routing are complex and slow when the address calculation is incorporated with the m memory modules, or the number of memory modules must be twice the number of data elements accessed simultaneously in order to avoid the slow modulo (m) calculations when the number of data elements is a power of two.
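The prime-number condition can be checked by brute force in a short Python sketch. The constants a = q and b = 1, the values p = q = 2, and the orientation chosen for the forward and backward diagonals are illustrative assumptions; the sketch is not the module assignment function of any cited scheme.

```python
from itertools import product


def skew_module(i, j, a, b, m):
    """Linear skewing: element (i, j) is stored in module (a*i + b*j) mod m."""
    return (a * i + b * j) % m


def subarrays(i, j, p, q):
    """Unit-interval subarrays of p*q elements based at (i, j)."""
    n = p * q
    return {
        "block": [(i + x, j + y) for x in range(p) for y in range(q)],
        "row": [(i, j + k) for k in range(n)],
        "column": [(i + k, j) for k in range(n)],
        "forward_diagonal": [(i + k, j + k) for k in range(n)],
        "backward_diagonal": [(i + k, j - k) for k in range(n)],
    }


def all_conflict_free(p, q, a, b, m, size=12):
    """True if every subarray at every base position hits distinct modules."""
    for i, j in product(range(size), repeat=2):
        for cells in subarrays(i, j, p, q).values():
            modules = {skew_module(ci, cj, a, b, m) for ci, cj in cells}
            if len(modules) != len(cells):
                return False
    return True


p, q = 2, 2
prime_ok = all_conflict_free(p, q, a=q, b=1, m=5)       # prime m > pq
same_count_ok = all_conflict_free(p, q, a=q, b=1, m=4)  # m == pq
```

Under these assumptions the check succeeds for m = 5 and fails for m = 4, where a column access produces the module sequence 0, 2, 0, 2.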
Tables 1(a) and 1(b) compare the previous memory systems with the memory system proposed in the present invention, which uses a linear skewing scheme to fulfill the goals (1)-(6) for the SIMD processor. The tables show the class of storage scheme, linearity, routing method, subarray types, constant intervals, hardware implementation, simultaneity, location of access, burden to the PE's, and number of memory modules. The numbered schemes in the tables refer to the following:
(1) D. T. Harper III, “Block, Multistride Vector, and FFT Accesses in Parallel Memory Systems,” IEEE Trans. Parallel Distrib. Sys., vol. 2, pp. 43-51, January 1991;
(2) D. T. Harper III, “Increased Memory Performance during Vector Accesses through the Use of Linear Address Transformations,” IEEE Trans. Comput., vol. C-41, pp. 227-230, February 1992;
(3) W. Oed and O. Lange, “On the Effective Bandwidth of Interleaved Memories in Vector Processing Systems,” IEEE Trans. Comput., vol. C-34, pp. 949-957, October 1985;
(4) D. T. Harper III and J. R. Jump, “Vector Access Performance in Parallel Memories Using a Skewed Storage Scheme,” IEEE Trans. Comput., vol. C-36, pp. 1440-1449, December 1987;
(5) R. Raghavan and J. P. Hayes, “On Randomly Interleaved Memories,” Supercomputing '90, pp. 49-58, 1990;
(6) K. Batcher, “The Multidimensional Access Memory in STARAN,” IEEE Trans. Comput., vol. 26, no. 1, pp. 174-177, 1977;
(7) J. Frailong, W. Jalby, and J. Lenfant, “XOR-schemes: A Flexible Data Organization in Parallel Memories,” in Proc. Int. Conf. Parallel Processing, pp. 276-283, 1985;
(8) A. Norton and E. Melton, “A Class of Boolean Linear Transformations for Conflict-free Power-of-two Stride Access,” in Proc. Int. Conf. Parallel Processing, pp. 247-254, 1987;
(9) D. Lee, “Scrambled Storage for Parallel Memory Systems,” in Proc. Int. Symp. on Comp. Architecture, pp. 232-239, 1988;
(10) K. Kim and V. K. P. Kumar, “Perfect Latin Squares and Parallel Array Access,” in Proc. Int. Symp. on Comp. Architecture, pp. 372-379, 1989;
(11) C. S. Raghavendra and R. Boppana, “On Methods for Fast and Efficient Parallel Memory Access,” in Proc. Int. Conf. Parallel Processing, pp. 76-83, 1990;
(12) D. T. Harper III, “A Multiaccess Frame Buffer Architecture,” IEEE Trans. Comput., vol. C-43, pp. 618-622, May 1994;
(13) P. Budnik and D. J. Kuck, “The Organization and Use of Parallel Memories,” IEEE Trans. Comput., vol. C-20, pp. 1566-1569, December 1971;
(14) H. Wijshoff and J. Van Leeuwen, “On Linear Skewing Schemes and d-ordered Vectors,” IEEE Trans. Comput., vol. C-36, no. 2, pp. 233-239, February 1987;
(15) D. H. Lawrie, “Access and Alignment of Data in an Array Processor,” IEEE Trans. Comput., vol. C-24, no. 12, pp. 1145-1155, December 1975;
(16) D. C. Van Voorhis and T. H. Morrin, “Memory Systems for Image Processing,” IEEE Trans. Comput., vol. C-27, pp. 113-125, February 1978;
(17) J. W. Park, “An Efficient Memory System for Image Processing,” IEEE Trans. Comput., vol. C-35, pp. 669-674, July 1986; “An Efficient Memory System for Image Processing,” Korean Patent No. 32719 (1990); “Memory System for Image Processing Having Address Calculating Circuitry Permitting Simultaneous Access to Block, Horizontal Sequence and Vertical Sequence Subarrays of an Array Data,” U.S. Pat. No. 4,926,386 (1990);
(18) D. H. Lawrie and C. R. Vora, “The Prime Memory System for Array Access,” IEEE Trans. Comput., vol. C-31, pp. 435-442, May 1982;
(19) D. T. Harper III and D. A. Linebarger, “Conflict-free Vector Access Using a Dynamic Storage Scheme,” IEEE Trans. Comput., vol. C-40, no. 3, pp. 276-283, March 1991;
(20) A. Deb, “Multiskewing—A Novel Technique for Optimal Parallel Memory Access,” IEEE Trans. Parallel Distrib. Syst., vol. 7, No. 6, pp. 595-604, June 1996;
(21) J. W. Park and D. T. Harper III, “Memory Architecture Support for the SIMD Construction of a Gaussian Pyramid,” IEEE Symp. Parallel and Distributed Processing, pp. 444-451, December 1992;
(22) J. W. Park and D. T. Harper III, “An Efficient Memory System for the Construction of a Gaussian pyramid,” IEEE Trans. Parallel Distrib. Syst., vol. 7, No. 8, pp. 855-860, August 1996;
(23) J. W. Park, “Efficient Image Analysis and Processing Memory System,” Korean Patent No. 58542 (1993); “Efficient Image Analysis and Processing Memory System,” Japanese Patent No. 2884815 (2000);
(24) J. W. Park, “Multi-access Memory System with the Constant Interval for Image Processing,” Korean Patent No. 121295 (1997).
TABLE 1(a)
| Storage Scheme | Linearity | Routing Method | Subarray Type² | Constant Interval | HW¹ | Simultaneity | Location of Access | Burden to PE's | # Memory Modules |
|---|---|---|---|---|---|---|---|---|---|
| Overlapping Scheme by Harper (1) | Nonlinear | Not Mentioned | B, R, C | Arbitrary positive integer* | Yes | No | Arbitrary | Address Calculation | pq |
| XOR Scheme by Harper (2) | Nonlinear | Not Mentioned | R | Arbitrary positive integer* | Yes | No | Arbitrary | No | pq |
| Interleaved Scheme by Oed and Lange (3) | Linear | Shifting | R | Arbitrary positive integer* | Yes | Yes | Arbitrary | No | pq |
| Skewed Storage Scheme by Harper, et al. (4) | Linear | Not Mentioned | R | Arbitrary positive integer* | Yes | No | Arbitrary | No | pq |
| Scheme by Raghavan, et al. (5) | Linear | Not Mentioned | R | Arbitrary positive integer* | No | No | Arbitrary | No | pq |
| XOR Scheme by Batcher (6) | Nonlinear | Perfect Shuffle | R, C | 1 | Yes | Yes | Arbitrary | No | pq |
| XOR Scheme by Frailong, et al. (7) | Nonlinear | Omega | R, C, B | Arbitrary positive integer* (row, column), 1 (block) | No | Yes | Restricted | No | pq |
| XOR Scheme by Norton, et al. (8) | Nonlinear | Inverted Baseline | R | Power of two | Yes | Yes | Restricted | Address Calculation | pq |
| Lee (9) | Nonlinear | Theta | R, C, SB, DB | 1 | No | Yes | Arbitrary | No | pq |
| Kim and Kumar (10) | Nonlinear | Not Mentioned | R, C, FD | 1 | Yes | Yes | Restricted | No | |
| Raghavendra and Boppana (11) | Nonlinear | Omega | R, C, FD, SB | 1 | No | Yes | Restricted | Address Calculation | pq |
| XOR Scheme by Harper (12) | Nonlinear | Crossbar | R, C, Various types of blocks | 1 | Yes | Yes | Restricted | No | pq |
¹HW denotes hardware implementation; *Arbitrary positive integers are used except for the integral multiples of the number of memory modules.
TABLE 1(b)
| Storage Scheme | Linearity | Routing Method | Subarray Type | Constant Interval | HW¹ | Simultaneity | Location of Access | Burden to PE's | # Memory Modules |
|---|---|---|---|---|---|---|---|---|---|
| Budnik and Kuck (13) | Linear | Not Mentioned | R, C, B, FD, BD | 1 | No | Yes | Arbitrary | No | Prime (>pq) |
| Wijshoff (14) | Linear | Not Mentioned | R, C, FD, BD | Not mentioned | No | Yes | Arbitrary | Not mentioned | Prime (>pq) |
| Lawrie (15) | Linear | Omega | R, C, FD, BD | 1 | No | Yes | Arbitrary | No | 2N (for N×M matrix) |
| Van Voorhis and Morrin (16) | Linear | Multiplexing and Rotation | R, C, B | 1 | Yes | Yes | Arbitrary | No | pq+1, pq², 2pq |
| Park (17) | Linear | Multiplexing and Rotation | R, C, B | 1 | Yes | Yes | Arbitrary | No | pq+1 |
| Lawrie and Vora (18) | Linear | Crossbar | R, C, FD, BD | Arbitrary positive integer* | Yes | Yes | Arbitrary | No | Prime (>pq) |
| Harper and Linebarger (19) | Linear | No need | R | Arbitrary positive integer* | Yes | No | Arbitrary | No | pq |
| Skewed Storage Scheme by Deb (20) | Linear | Shifting | R, C, FD | 1 | No | Yes | Arbitrary | No | pq |
| Park and Harper (21) | Linear | Multiplexing and Rotation | R, C, B | 2 (row), 1 (column), 1 (block) | Yes | Yes | Arbitrary | No | pq+1 |
| Park (22) | Linear | Multiplexing and Rotation | R, C, B | Power of two (row), 1 (column), 1 (block) | Yes | Yes | Arbitrary | No | pq+1 |
| Park (23) | Linear | Multiplexing and Rotation | B, 8-DL, line of interval multiple of 5° | 1 | Yes | Yes | Arbitrary | No | pq+1 |
| Park (24) | Linear | Multiplexing and Rotation | B, R, C, FD, BD | Arbitrary* | Yes | Yes | Arbitrary | No | pq+1 |
| Present Invention | Linear | Multiplexing and Rotation | 4-DB, 8-DL | Arbitrary positive integer* | Yes | Yes | Arbitrary | No | Prime (>pq) |
¹HW denotes hardware implementation; *Arbitrary positive integers are used except for the integral multiples of the number of memory modules.
For the following image processing operations, a simultaneous access to image points within the block, row, column, forward-diagonal, or backward-diagonal subarray is required for the SIMD processor in order to reduce the overall memory access time, G. Y. Kim, “Parallel Memory Architectures for Image Processing and Wavelet-based Video Coding,” Ph.D. Thesis, Korea Advanced Institute of Science and Technology (1999). For the point operations such as arithmetic and logic operations, it is required that PE's are assigned to the corresponding image points within a block, row, or column subarray in order to perform the arithmetic and logic operations in parallel, G. A. Baxes, “Digital Image Processing,” Prentice-Hall, 1984. For the wavelet transform, a memory system that accesses image points within a row and a column subarray is required, K. R. Castleman, “Digital Image Processing,” Prentice-Hall, 1996; S. Mallat, IEEE Trans. PAMI, vol. 11, No. 7, pp. 674-693, July 1989. For the neighborhood operations such as edge detection, convolution, or low (high)-pass filters on an image of 2^n1×2^n1 performed by 2^n2×2^n2 PE's, where n1>n2, by using 3×3 or 4×4 edge masks, it is required that each PE is assigned to one 3×3 or 4×4 block and that a memory system accesses a 2^n2×2^n2 block with a constant interval of 3 or 4, G. A. Baxes, “Digital Image Processing,” Prentice-Hall, 1984; H. E. Burdick, “Digital Imaging,” McGraw-Hill, 1997; J. R. Parker, “Algorithms for Image Processing and Computer Vision,” John Wiley & Sons, 1997; D. H. Ballard and C. M. Brown, “Computer Vision,” Prentice-Hall, 1982; J. S. Lim, “Two-dimensional Signal and Image Processing,” Prentice-Hall, 1990. For the 8×8 Discrete Cosine Transform of an image of 2^n1×2^n1 that is performed by 2^n2×2^n2 PE's, where n1>n2, it is required that each PE is assigned to one 8×8 block and that a memory system accesses a 2^n2×2^n2 block with a constant interval of 8, K. R. Rao and P. Yip, “Discrete Cosine Transform,” Academic Press, 1990. For the fast comparison of one 16×16 block within the previous image and the other 16×16 block within the reference one for the motion estimation, it is required that a memory system access block subarrays whose constant intervals are 1, 2 and 4, J. S. Lim, “Two-dimensional Signal and Image Processing,” Prentice-Hall, 1990. For the progressive transmission using a 2×2 or 3×3 subsampling method, a simultaneous access to data elements within a block subarray whose constant interval is 2^l or 3·2^l is required in order to reduce the memory access time, where l is a positive integer, W. Y. Kim, P. T. Balsara, D. T. Harper, and J. W. Park, IEEE Trans. Circuits and Systems for Video Technology, vol. 5, No. 1, pp. 1-13, February 1995.
Another example is a fast construction of a Gaussian pyramid or the Hierarchical Discrete Correlation window function, which is useful for compression, texture analysis, or motion analysis, A. Rosenfeld, “Multiresolution Image Processing and Analysis,” Springer-Verlag, 1984; P. J. Burt, Comput. Vision, Graphics, Image Processing 16, pp. 20-51, 1981; P. J. Burt, Comput. Vision, Graphics, Image Processing 21, pp. 368-382, 1983. The value of every other node of the previous level or every 2^l-th node of the level 0 should be assigned to each PE in the SIMD processor in order to compute the level k recursively or directly, respectively. Therefore, a conflict-free memory system that supports simultaneous access to the image points within a block or a row with constant interval 2^l, l≥0, at an arbitrary position is required for the reduction of the overall memory access time for the construction of a Gaussian pyramid, J. W. Park and D. T. Harper III, IEEE Symp. Parallel and Distributed Processing, pp. 444-451, December 1992; J. W. Park and D. T. Harper III, IEEE Trans. Parallel Distrib. Syst., vol. 7, No. 8, pp. 855-860, August 1996. For the case that the number of nodes of a row of the target level image is less than the number of PE's, a simultaneous access to image points within the block subarray is more useful than access within the row subarray. Also, for the fast rotation of a subimage by a multiple of 90° or for a mirror image, it is required that a memory system accesses image points within four directional blocks, which is further explained below. For the fast rotation of a subimage by a multiple of 45° or by a multiple of 5°, it is required that a memory system accesses image points within an eight-directional line, J. W. Park, “Efficient Image Analysis and Processing Memory System,” Korean Patent No. 58542 (1993); “Efficient Image Analysis and Processing Memory System,” Japanese Patent No. 2884815 (2000); J. W. Park, S. R. Maeng, and J. W. Cho, Int. J. of High Speed Computing, vol. 2, No. 4, pp. 375-385, December 1990.
For the following two-way merge sort algorithm, successive-doubling FFT algorithm, recursive doubling algorithm, and matrix and signal processing operations, a memory system which supports simultaneous access to data elements within the block, row, column, forward-diagonal, and backward-diagonal subarray is required for the SIMD processor in order to reduce the overall memory access time.
The two-way merge sort begins with the input of n sorted data files, each of length 1. These n data files are merged to obtain n/2 files of size 2. These n/2 data files are then merged, and so on, until only one data file is left. Therefore, a memory system which accesses data elements within a row of constant interval 2^l, l>0, is useful for the fast two-way merge sort by the SIMD processor.
For the successive-doubling FFT algorithm by the SIMD processor, a memory system that can access data elements within a row of constant interval 2^l, l>0, is useful in order to reduce the memory access time, J. W. Cooley, P. A. W. Lewis, and P. D. Welch, IEEE Trans. Educ., vol. E-12, No. 1, pp. 27-34, 1969; D. T. Harper III and D. A. Linebarger, in Proc. 1988 Int. Conf. Parallel Processing, 1988.
For the recursive doubling algorithm to perform addition, multiplication, maximum, minimum, AND, OR, and XOR operations, a memory system that can access data elements within a row of constant interval 2^l, l>0, is useful in order to reduce the memory access time, H. S. Stone, “High-performance Computer Architecture,” Addison Wesley, 1993. The matrix operations of addition, multiplication, and determinant benefit from a simultaneous access to data elements within block, row, column, forward-diagonal, and backward-diagonal subarrays. Also, for the speedup of various matrix operations for signal processing by the SIMD processor, a conflict-free memory system which supports a simultaneous access to data elements within a block, row, column, forward-diagonal, or backward-diagonal subarray is required in order to reduce the overall memory access time, J. S. Lim, “Two-dimensional Signal and Image Processing,” Prentice-Hall, 1990; D. H. Johnson and D. E. Dudgeon, “Array Signal Processing,” Prentice-Hall, 1993.
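The power-of-two access pattern of recursive doubling can be seen in a short Python sketch. It uses one common formulation (a running combination in which each step pairs every element with the element one stride to its left); the function name and the sequential inner loop standing in for the PE's are assumptions for illustration, not the formulation given in the cited reference.

```python
import operator


def recursive_doubling_prefix(values, op):
    """Combine values with op so that result[k] = values[0] op ... op values[k].

    Each pass pairs element k with element k - stride, and the stride doubles
    after every pass, so the operands of a pass are separated by a constant
    interval that is a power of two.
    """
    result = list(values)
    strides = []
    stride = 1
    while stride < len(result):
        strides.append(stride)
        previous = result[:]  # all PE's read before any PE writes
        for k in range(stride, len(result)):
            result[k] = op(previous[k - stride], previous[k])
        stride *= 2
    return result, strides


sums, strides_used = recursive_doubling_prefix([1, 2, 3, 4, 5, 6, 7, 8], operator.add)
maxima, _ = recursive_doubling_prefix([3, 1, 4, 1, 5, 9, 2, 6], max)
```

In this sketch the first pass uses an interval of 1 and the later passes use intervals 2 and 4. The same routine works for any associative operation, such as the maximum shown here.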
In an SIMD processor with pq PE's and a memory system with m memory modules, pq requests are presented at the same time, each to a different memory module. After a memory access time, the pq requests are completed and all memory modules are freed to operate on subsequent requests; the requests should not place any restriction on the locations of data elements within a data array. Consequently, in order to reduce the memory access time of the memory system and to speed up the image processing operations, two-way merge sort algorithm, successive-doubling FFT algorithm, recursive doubling algorithm, and many matrix and signal processing operations, the memory system should support simultaneous access to pq data elements within a block, row, column, forward-diagonal, or backward-diagonal subarray with a constant interval r in a data array I(*,*). If the constant interval can be a positive or a negative integer, it is convenient to consider the following 12 subarray types with base coordinates (i, j) and a positive constant interval r: four directional blocks, namely South-East Block (SEB), South-West Block (SWB), North-West Block (NWB), and North-East Block (NEB); and eight directional lines, namely East Line (EL), South-East Line (SEL), South Line (SL), South-West Line (SWL), West Line (WL), North-West Line (NWL), North Line (NL), and North-East Line (NEL):
\begin{align}
\mathrm{SEB}(i,j,r) &= \{\, I(i+ar,\ j+br) \mid 0 \le a < p,\ 0 \le b < q \,\}, & 0 \le i \le M-rp,\ 0 \le j \le N-rq \tag{1}\\
\mathrm{SWB}(i,j,r) &= \{\, I(i+ar,\ j-br) \mid 0 \le a < p,\ 0 \le b < q \,\}, & 0 \le i \le M-rp,\ rq \le j \le N \tag{2}\\
\mathrm{NWB}(i,j,r) &= \{\, I(i-ar,\ j-br) \mid 0 \le a < p,\ 0 \le b < q \,\}, & rp \le i \le M,\ rq \le j \le N \tag{3}\\
\mathrm{NEB}(i,j,r) &= \{\, I(i-ar,\ j+br) \mid 0 \le a < p,\ 0 \le b < q \,\}, & rp \le i \le M,\ 0 \le j \le N-rq \tag{4}\\
\mathrm{EL}(i,j,r) &= \{\, I(i,\ j+ar) \mid 0 \le a < pq \,\}, & 0 \le i \le M,\ 0 \le j \le N-rpq \tag{5}\\
\mathrm{SEL}(i,j,r) &= \{\, I(i+ar,\ j+ar) \mid 0 \le a < pq \,\}, & 0 \le i \le M-rpq,\ 0 \le j \le N-rpq \tag{6}\\
\mathrm{SL}(i,j,r) &= \{\, I(i+ar,\ j) \mid 0 \le a < pq \,\}, & 0 \le i \le M-rpq,\ 0 \le j \le N \tag{7}\\
\mathrm{SWL}(i,j,r) &= \{\, I(i+ar,\ j-ar) \mid 0 \le a < pq \,\}, & 0 \le i \le M-rpq,\ rpq \le j \le N \tag{8}\\
\mathrm{WL}(i,j,r) &= \{\, I(i,\ j-ar) \mid 0 \le a < pq \,\}, & 0 \le i \le M,\ rpq \le j \le N \tag{9}\\
\mathrm{NWL}(i,j,r) &= \{\, I(i-ar,\ j-ar) \mid 0 \le a < pq \,\}, & rpq \le i \le M,\ rpq \le j \le N \tag{10}\\
\mathrm{NL}(i,j,r) &= \{\, I(i-ar,\ j) \mid 0 \le a < pq \,\}, & rpq \le i \le M,\ 0 \le j \le N \tag{11}\\
\mathrm{NEL}(i,j,r) &= \{\, I(i-ar,\ j+ar) \mid 0 \le a < pq \,\}, & rpq \le i \le M,\ 0 \le j \le N-rpq \tag{12}
\end{align}
where the constant interval r is a positive integer.
The 12 subarray types (1)-(12) with a constant interval r and the base coordinates (i,j) are represented in FIGS. 2(a) and 2(b). FIG. 2(a) shows the four directional blocks (SEB, SWB, NWB, NEB), and FIG. 2(b) shows the eight directional lines (EL, SEL, SL, SWL, WL, NWL, NL, NEL).
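The definitions (1)-(12) can be written compactly as a sign pair for each block type and a step pair for each line type. The following Python sketch enumerates the element coordinates of a subarray under that reading; the function name, the dictionary names, and the example values p = q = 2 are illustrative assumptions, and the range checks on the base coordinates are omitted.

```python
# Sign of the (row, column) offsets for the four directional blocks, eqs. (1)-(4).
BLOCK_SIGNS = {"SEB": (1, 1), "SWB": (1, -1), "NWB": (-1, -1), "NEB": (-1, 1)}

# Unit (row, column) step for the eight directional lines, eqs. (5)-(12).
LINE_STEPS = {
    "EL": (0, 1), "SEL": (1, 1), "SL": (1, 0), "SWL": (1, -1),
    "WL": (0, -1), "NWL": (-1, -1), "NL": (-1, 0), "NEL": (-1, 1),
}


def subarray_coordinates(kind, i, j, r, p, q):
    """Coordinates of the p*q elements of subarray `kind` based at (i, j)."""
    if kind in BLOCK_SIGNS:
        si, sj = BLOCK_SIGNS[kind]
        # a indexes rows of the block and b indexes columns, as in eqs. (1)-(4).
        return [(i + si * a * r, j + sj * b * r)
                for a in range(p) for b in range(q)]
    di, dj = LINE_STEPS[kind]
    # A line holds pq elements spaced r apart along its direction.
    return [(i + di * a * r, j + dj * a * r) for a in range(p * q)]


seb = subarray_coordinates("SEB", 14, 15, 2, p=2, q=2)
swb = subarray_coordinates("SWB", 14, 15, 3, p=2, q=2)
el = subarray_coordinates("EL", 0, 0, 2, p=2, q=2)
```

Here the row index grows toward the south and the column index grows toward the east, which is the orientation implied by the signs in equations (1)-(12).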
FIG. 3 shows a block diagram of a general design of the conflict-free memory system with an SIMD processor, where the interface is a data register. When a subarray is stored, the components of the memory system perform the following operations sequentially under the control of the control circuitry according to the request of the SIMD processor:
(1) The t, i, j and r registers are set by the SIMD processor to indicate the subarray type, base coordinates, and constant interval of the required subarray (1)-(12), and the subarray itself is placed in the data register in a modified row major order, where k=a·q+b, 0≤k<pq. For example, the four data elements of SEB(14,15,2) in a modified row major order are I(14,15), I(14,17), I(16,15), I(16,17); the four data elements of SWB(14,15,3) in a modified row major order are I(14,15), I(14,12), I(17,15), I(17,12); and, following definition (10), the four data elements of NWL(14,15,3) in a modified row major order are I(14,15), I(11,12), I(8,9), I(5,6).
(2) The address calculation and routing circuitry computes the m addresses of the data elements within the subarray, and routes them to the m memory modules;
(3) The memory module selection circuitry enables the pq memory modules to be accessed;
(4) The data routing circuitry routes the subarray in the data register to the m memory modules; and
(5) A WRITE signal causes the pq data elements within the subarray to be stored in the pq enabled memory modules.
The operations (2), (3) and (4) are performed in parallel.
Similarly, when a subarray is to be retrieved from the memory system, the components of the memory system perform the following operations sequentially:
(1) The t, r, i and j registers are set;
(2) The address calculation and routing circuitry computes the m addresses of the data elements within the subarray, and routes them to the m memory modules;
(3) The memory module selection circuitry enables the pq memory modules to be accessed;
(4) A READ signal causes the pq data elements within the subarray to be retrieved from the pq enabled memory modules; and
(5) The data routing circuitry routes the data elements from the m memory modules to the data register, and arranges the data elements within one of the subarrays (1)-(12) in the modified row major order.
The operations (2) and (3) are performed in parallel.
In order to distribute the data elements of the M×N array I(*,*) among the m memory modules, a memory module assignment function must place in distinct memory modules the data elements that are to be accessed simultaneously. Also, an address assignment function must allocate different addresses to data elements assigned to the same memory module.
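The store and retrieve sequences and the two assignment requirements can be sketched together as a small software model in Python. Everything specific here is an assumption for illustration: the sizes p = q = 2 and M = N = 32, the choice of m = 5 modules, the module assignment function (q·i + j) mod m, the address assignment function, and the class and function names. None of these is asserted to be the function or circuitry of the present invention.

```python
import math

P, Q = 2, 2                  # assumed PE arrangement, pq = 4
M_ROWS, N_COLS = 32, 32      # assumed array size
NUM_MODULES = 5              # assumed: a prime greater than pq
ROW_SPAN = math.ceil(N_COLS / NUM_MODULES)  # addresses per array row per module

BLOCKS = {"SEB": (1, 1), "SWB": (1, -1), "NWB": (-1, -1), "NEB": (-1, 1)}
LINES = {"EL": (0, 1), "SEL": (1, 1), "SL": (1, 0), "SWL": (1, -1),
         "WL": (0, -1), "NWL": (-1, -1), "NL": (-1, 0), "NEL": (-1, 1)}


def coordinates(t, i, j, r):
    """Element coordinates of subarray type t in the order k = a*Q + b."""
    if t in BLOCKS:
        si, sj = BLOCKS[t]
        return [(i + si * a * r, j + sj * b * r)
                for a in range(P) for b in range(Q)]
    di, dj = LINES[t]
    return [(i + di * k * r, j + dj * k * r) for k in range(P * Q)]


def module_of(i, j):
    """Assumed module assignment function (a linear skewing scheme)."""
    return (Q * i + j) % NUM_MODULES


def address_of(i, j):
    """Assumed address assignment function: distinct within each module."""
    return i * ROW_SPAN + j // NUM_MODULES


class ToyMemorySystem:
    def __init__(self):
        self.modules = [dict() for _ in range(NUM_MODULES)]

    def _targets(self, t, i, j, r):
        # Address calculation and module selection for the whole subarray.
        cells = coordinates(t, i, j, r)
        targets = [(module_of(ci, cj), address_of(ci, cj)) for ci, cj in cells]
        if len({mod for mod, _ in targets}) != len(targets):
            raise ValueError("module conflict: access is not simultaneous")
        return targets

    def store(self, t, i, j, r, data_register):
        # Data routing followed by WRITE on the enabled modules.
        for (mod, addr), value in zip(self._targets(t, i, j, r), data_register):
            self.modules[mod][addr] = value

    def retrieve(self, t, i, j, r):
        # READ on the enabled modules, returned in the order k = a*Q + b.
        return [self.modules[mod][addr] for mod, addr in self._targets(t, i, j, r)]


memory = ToyMemorySystem()
for bi in range(0, M_ROWS, P):          # fill the array one 2x2 block at a time
    for bj in range(0, N_COLS, Q):
        block = [100 * ci + cj for ci, cj in coordinates("SEB", bi, bj, 1)]
        memory.store("SEB", bi, bj, 1, block)

swb = memory.retrieve("SWB", 14, 15, 3)
sel = memory.retrieve("SEL", 2, 3, 4)
```

Each stored value encodes its own coordinates as 100·i + j, so a retrieved subarray can be checked against equations (1)-(12) directly. In this model an interval that is a multiple of the number of modules, such as r = 5, raises the conflict error, which corresponds to the exception noted in the footnote of Tables 1(a) and 1(b).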