
Accelerating Numerical Linear Algebra Kernels on a Scalable Runtime Reconfigurable Platform

Prasenjit Biswas, Pramod P Udupa, Rajdeep Mondal, Keshavan Varadarajan, Mythri Alle and S.K. Nandy
CADL, IISc, Bangalore {prasenjit, pramod, rajdeep, keshavan, mythri, nandy}@cadl.iisc.ernet.in

Ranjani Narayan
Morphing Machines, Bangalore, India ranjani.narayan@morphingmachines.com

Abstract: Numerical Linear Algebra (NLA) kernels are at the heart of all computational problems. These kernels require hardware acceleration for increased throughput. NLA solvers for dense and sparse matrices differ in the way the matrices are stored and operated upon, although they exhibit similar computational properties. While ASIC solutions for NLA solvers can deliver high performance, they are not scalable, and hence are not commercially viable. In this paper, we show how NLA kernels can be accelerated on REDEFINE, a scalable runtime reconfigurable hardware platform. Compared to a software implementation, the Direct Solver (Modified Faddeev's algorithm) on REDEFINE shows a 29x improvement on average and the Iterative Solver (Conjugate Gradient algorithm) shows a 15-20% improvement. We further show that the solution on REDEFINE is scalable over larger problem sizes without any notable degradation in performance.

Fig. 2. Operations of the Boundary Processor and Internal Processor in a 2 × 2 systolic array. Mode 1 is used for triangularization and Mode 2 for nullification; the Boundary Processor performs divisions (Out1 = Xin/P or P/Xin) while the Internal Processor performs multiply-accumulate updates (Out2 = Xin + C*P or P + C*Xin).

I. INTRODUCTION

The major categories of Numerical Linear Algebra (NLA) solvers, namely Direct and Iterative Solvers, find use in several applications. Direct solvers are predominantly required in domains such as DSP and estimation algorithms like the Kalman filter, where operations need to be performed on dense matrices which are either small or medium sized. Faddeev's Algorithm (FA) [1] is used for solving dense linear systems of equations. The algorithm computes the Schur complement of a compound matrix M composed of four matrices A, B, C, D of sizes (n × n), (n × l), (m × n) and (m × l) respectively, provided A is non-singular [2]. By appropriate choice of the matrices A, B, C and D, different possible results can be obtained (figure 1). A variant of this algorithm that is amenable to realization in hardware was proposed by Nash et al. [3]. This is referred to as the Modified Faddeev's algorithm (MFA). MFA involves a two-step process, i.e., triangularization of matrix A and nullification of the elements of matrix C [4]. The triangularization and nullification can be performed using LU decomposition or QR factorization.
Fig. 1. Different possible matrix solutions using MFA, obtained through appropriate choices of A, B, C and D (e.g., A⁻¹B, A⁻¹, and the Schur complement CA⁻¹B + D).
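Before turning to the hardware realization, a minimal sequential C sketch of MFA is given here for reference. It forms the compound matrix [A B; -C D] and applies Gaussian elimination to the first block of columns, which triangularizes A, nullifies C and leaves the Schur complement CA⁻¹B + D in the lower-right block. The function name mfa_schur, the fixed block sizes and the absence of pivoting are illustrative assumptions; this is a plain software rendering, not the systolic or REDEFINE implementation.

    #define NA 2   /* A is NA x NA */
    #define LB 2   /* B is NA x LB */
    #define MC 2   /* C is MC x NA, D is MC x LB */

    void mfa_schur(double A[NA][NA], double B[NA][LB],
                   double C[MC][NA], double D[MC][LB], double S[MC][LB])
    {
        double W[NA + MC][NA + LB];

        /* Build the compound matrix [A B; -C D]. */
        for (int i = 0; i < NA; i++) {
            for (int j = 0; j < NA; j++) W[i][j] = A[i][j];
            for (int j = 0; j < LB; j++) W[i][NA + j] = B[i][j];
        }
        for (int i = 0; i < MC; i++) {
            for (int j = 0; j < NA; j++) W[NA + i][j] = -C[i][j];
            for (int j = 0; j < LB; j++) W[NA + i][NA + j] = D[i][j];
        }

        /* Eliminate the first NA columns: triangularization of A and
           nullification of C.  No pivoting, so A is assumed non-singular
           with non-zero leading minors. */
        for (int k = 0; k < NA; k++) {
            for (int i = k + 1; i < NA + MC; i++) {
                double f = W[i][k] / W[k][k];      /* boundary-PE style division */
                for (int j = k; j < NA + LB; j++)  /* internal-PE style MAC      */
                    W[i][j] -= f * W[k][j];
            }
        }

        /* The lower-right block now holds the Schur complement C*inv(A)*B + D. */
        for (int i = 0; i < MC; i++)
            for (int j = 0; j < LB; j++)
                S[i][j] = W[NA + i][NA + j];
    }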

MFA is typically realized in hardware as a systolic array [5]-[9]. The systolic structure for computing the Schur complement of a 2 × 2 matrix is shown in figure 2, along with the different modes of operation of the Processing Elements (PEs).

The core computations performed are floating point division and multiply-accumulate (MAC). These PEs are connected in a fixed¹ mesh topology. Systolic arrays realize the functionality in the most optimal manner; however, they are not scalable due to their rigid structure.

Large unstructured sparse matrices occur in several scientific computational applications. These are generally solved using Iterative Solvers. Sparse Matrix Vector Multiplication (SMVM), i.e., A·x, is the core computation performed in an iterative solver (where A is the sparse matrix and x is the vector with which it is multiplied). It is worthwhile improving the performance of this kernel, since 90% of the time is spent in SMVM. The SMVM kernel has a large number of memory references. A General-Purpose Processor (GPP) used for SMVM computation attains only 10% [10] of its peak floating point performance. According to one study [11], the performance of SMVM degrades mainly due to 1) indirect memory references and 2) memory dominated operations. For certain matrices, performance is also degraded due to short row lengths and irregular accesses to vector x. In the literature, a number of hardware based solutions [12], [13] and software based solutions [14], [15] are reported. The software based solutions [14], [15] try to optimize sparse matrix computations by reordering data to reduce memory bandwidth (blocking methods), modifying algorithms to reuse the data and performing compiler optimizations. However, these optimizations depend on the structure of the sparse matrix. Hence, hardware based methods are more suitable if memory access latency is handled correctly, making the solution independent of the structure of the sparse matrix. In hardware based implementations on FPGAs, the size of memory available on the chip restricts the problem size that can be solved. One key observation is that elements of the sparse matrix A are utilized only once (as in the case of
¹ Not programmable.

the Conjugate Gradient (CG) algorithm). A mechanism that could stream elements of the matrix directly to the compute units would eliminate the need to store the matrix. The hardware implementation of the SMVM kernel uses a MAC as the main processing element, along with a local memory to store the vector x.

Hand-crafted special hardware support, one for Direct Solvers and another for Iterative Solvers, is an expensive proposition and is not scalable. A hardware solution which could be reconfigured at runtime for either solver would be an ideal choice. As can be observed from MFA and SMVM, the core computation is a MAC. However, these two kernels have very different datapaths and storage requirements. REDEFINE, as proposed in [16], [17], is an architecture framework that can be customized for an application domain to meet the desired performance. In this paper we specifically design and implement a domain specialization of REDEFINE for NLA kernels. We introduce a custom function unit (CFU) for performing MAC. Additional storage for intermediate results is introduced in each compute element (CE) in the form of scratch-pad memory (SPM). These enhancements are described in detail in the subsequent sections.

The rest of the paper is organized as follows. Section II provides a brief introduction to the REDEFINE architecture, followed by the NLA specific enhancements made to it. Details of the realization of MFA on REDEFINE appear in section III. Section IV covers the implementation details of Iterative Solvers. We summarize the contributions of this paper in section V.

II. NLA SPECIFIC SPECIALIZATIONS ON REDEFINE

REDEFINE is an execution engine in which multiple tiles are connected through a toroidal honeycomb packet-switched network [17]. Each tile comprises a Compute Element (CE) and a Router. Each CE has a general purpose ALU along with the necessary logic to execute operations according to the static dataflow paradigm [18]. The global memory is connected to the periphery of the fabric through special routers called access routers. Communication between global memory and CEs is facilitated by the Load-Store Units (LSUs). All load/store requests from the CEs to the LSU go over the Network-on-Chip (NoC).

An application written in a high level language (C) is transformed into coarse grain operations called HyperOps [19] by RETARGET², the compiler for REDEFINE. In addition, the compiler partitions each HyperOp into several pHyperOps and each pHyperOp is assigned to a CE. The compiler generated Compute Metadata specifies the computation to be performed by a CE, while Transport Metadata specifies the communication requirements of a CE. In order to tailor REDEFINE for a specific application domain, compiler directives may be used to force partitioning and assignment of HyperOps. Further, domain specific Custom Function Units (CFUs), which are microarchitectural hardware assists, may be hand-crafted to work in tandem with the ALU [17]. In the following sub-sections we elaborate the NLA-specific enhancements to REDEFINE needed to meet expected performance goals in a scenario where inputs are streamed, namely to:
² RETARGET uses the LLVM [20] front end and generates HyperOps containing basic operations defined by the virtual ISA.

- Reduce delays due to accesses to global memory
- Address rate-mismatch between producer and consumer CEs
- Improve performance

Reduction of global memory access delays: Each load/store request incurs a long round trip delay, depending on the placement of the CE making the request. Further, these latencies are non-deterministic in nature due to the use of the NoC. When streaming inputs are needed, if a separate request has to be made for every data element, then memory access latencies determine the performance of the kernel. This delay can be reduced if the CE makes one single request and global data gets streamed in; in other words, a push model, in which the global memory volunteers data to the CEs. Another enhancement to decrease global data access overheads is to distribute and pre-load global data to CEs, provided the CEs have local storage. Further, the delay associated with indirect references can be reduced if the local memory has associated logic to resolve these references. The scratch-pad memory (SPM) serves as the local memory within each CE, and the scratch-pad memory controller (SPMC) has the additional logic for indirect address calculation; a toy software model of this access is sketched at the end of this section.

Rate-mismatch between producers and consumers: Rate mismatch between a producer and a consumer is addressed by additional logic that makes the consumer request the producer for data once the consumer completes execution of the operations assigned to it. In other words, chaining of several producers and consumers enables loss-free transmission of intra- and inter-HyperOp data.

Performance improvement: The ALU in the CE reported in [17] is capable of performing all instructions from the Virtual ISA of LLVM [20] in an unpipelined fashion. If the CE has to satisfy the throughput requirements for streaming inputs, the ALU has to efficiently process both unit-cycle and multi-cycle operations. Towards this, we logically partition the ALU into two units: one that performs unit-cycle operations and the other that performs multi-cycle operations.

Core computations of both solvers use the same CFU, but they differ in their NoC usage. The systolic realization of MFA is achieved by chaining CEs, whereas dataflow parallelism is exploited in the CG algorithm. We show that REDEFINE gives good performance, scalability and reconfigurability for both Direct and Iterative Solvers. The increase in area due to CFUs and local storage is offset by the overall performance of the application. Domain specific specializations to REDEFINE will be addressed in the subsequent sections in the context of NLA solvers.
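The following toy C model illustrates the kind of indirect access the SPMC resolves locally: the CE hands over a column index and the SPMC returns the corresponding element of vector x from scratch-pad memory, avoiding a round trip to global memory. The names spmc_t and spmc_read_x are illustrative assumptions and not part of REDEFINE's actual interface.

    /* Toy software model of SPMC-resolved indirect access (illustrative). */
    typedef struct {
        double *spm_base;   /* vector x pre-loaded into the scratch-pad memory */
        int     length;     /* number of x elements held locally               */
    } spmc_t;

    static inline double spmc_read_x(const spmc_t *s, int col_index)
    {
        /* Address calculation performed by the SPMC logic: base + column index. */
        return s->spm_base[col_index];
    }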

III. REALIZATION OF DIRECT SOLVERS ON REDEFINE

Systolic array implementations are the most efficient way of realizing MFA in hardware. As indicated previously, this implementation uses a mesh interconnection of processing elements. To emulate this on REDEFINE, we treat two neighbouring tiles as a single logical entity, as shown in figure 3. We map a portion of the systolic array, i.e., a sub-array, onto a pair of CEs on REDEFINE. Figure 4(a) is the dependence graph for computing the Schur complement of a 4 × 4 matrix. The formation of HyperOps, and the assignment of pHyperOps to CEs,

Fig. 3. Shaded rectangles in the figure show two neighbouring tiles logically bound together in a mesh interconnection.
Fig. 4. Mapping of operations (a) and HyperOp and pHyperOp formation (b) for the 4 × 4 systolic structure; HyperOps 1 and 2 are partitioned into pHyperOps 1-4, assigned to CE1-CE4.

Fig. 5. Mapping of systolic structures on REDEFINE. Grey regions depict the mapping of the systolic structure for an 8 × 8 matrix; hatched regions depict the mapping for a 16 × 16 matrix. The HyperOp sizes for these two matrix sizes are 4 × 4 and 8 × 8 respectively.


are shown in figure 4(b). Figure 5 shows the mapping of the systolic sub-array for computing the Schur complement of 8 × 8 and 16 × 16 matrices on the REDEFINE fabric. Grey regions in the figure show the mapping for the 8 × 8 matrix, while the hatched regions depict the mapping for the 16 × 16 matrix. The HyperOp sizes for these two matrix sizes are 4 × 4 and 8 × 8 respectively. Since sub-arrays of the systolic array are HyperOps, which are in turn mapped to CEs, REDEFINE can potentially scale to realize large systolic arrays. This is achieved by mapping and scheduling HyperOps on the execution fabric in space and time. It is to be noted that the same fabric can be used as a solution for mapping a systolic array of any size (theoretically) at the cost of a slow-down. This slow-down is proportional to the number of nodes of the systolic array that are mapped to one CE-pair.

As shown in figure 2, division and MAC are the core computations of MFA. A hand-crafted CFU specifically realized to perform these operations efficiently is introduced in each CE; it appears in figure 6 (denoted FP-CFU). The floating point MAC operation supported by the FP-CFU serves the common computational need of both solvers, i.e., MFA and SMVM. The FP-CFU is a 2-stage pipelined unit that interfaces with the scratch-pad memory (SPM). A register called the Sticky Counter, loaded with the number of times a HyperOp needs to be executed, is used to make a HyperOp persistent for repeated execution [17]. Further, a Mode Change Register is used to change the nature of the operations executed after a certain number of iterations. These registers are initialized with values indicated by the Compute Metadata generated by the compiler. Buffer requirements of a systolic solution are realized on the SPM. The FP-CFU shown in figure 6 is runtime reconfigurable, in that it can

Fig. 6. Realization of the FP-CFU and Memory-CFU in the Compute Element. The CE integrates the ALU, the FP-CFU, the Scratch Pad Memory (SPM) with its FSM and controller (SPMC), the Operation Store supplying Operands 1-3, and the Sticky Counter; it is driven by Compute Metadata and Transport Metadata, with results forwarded through the Transporter and a Bypass Channel to the Router.

also perform matrix-vector multiplication without any change to the hardware. The datapaths taken within the CE are, however, different. Operands for the division and MAC operations required by Faddeev's algorithm are supplied as Operand 1 (from the Operation Store), Operand 2 (from the Operation Store) and Operand 3 (from the SPM). The output of the computation is appropriately forwarded to the dependent instructions. If the outputs serve as input operands to operations held by the same CE, the bypass channel delivers them within that CE. Routers are used to deliver the outputs if they are destined for operations held by other CEs.

A Kalman Filter can be realized as a sequence of MFA stages as described in [21]. For any k-state Kalman Filter, we need to perform MFA on a compound matrix of size 2k × 2k. When k ≤ 16, this can be realized as two parallel sequences of four MFAs, where each MFA is realized as shown in figure 5. For k > 16, the MFAs of the Kalman Filter need to be realized sequentially, because two instances of the MFA cannot be simultaneously accommodated on REDEFINE.

A. Results for MFA

The number of CE pairs used to map a given systolic array depends on the throughput requirements. Higher throughput is obtained when a larger number of CE pairs is assigned

TABLE I
COMPARISON OF PERFORMANCE WITH GPP AND SYSTOLIC SOLUTIONS

Output        Systolic Solution    Realization in REDEFINE   Work     Time taken(a) by GPP        Speed Up in REDEFINE
Matrix Size   PEs    Cycles(a)     CEs     Cycles(a)         Ratio    running at 2.2 GHz (in us)  running at 50 MHz
2 x 2         7      6             4       79                7.524    8                           5
4 x 4         26     14            4       429               4.714    85                          10
4 x 4         26     14            8       241               5.297    85                          17
6 x 6         57     22            8       613               3.911    356                         29
8 x 8         100    30            8       1508              4.021    1278                        42
8 x 8         100    30            14      896               4.181    1278                        71

(a) The cycle count and time taken reported here are for the computation of one Schur complement.

for computations. In case the number of CEs is less than this optimal number, the computation can be realized by folding multiple sub-arrays onto one CE. However, this comes at the cost of throughput. Note that the number of PEs used in the systolic array realization is O(n²), whereas the number of CEs used in REDEFINE is 3(n/k)² + n/k for k² ≤ 2s and (3/2)(n²/s) + (n/2)(k/s) for k² > 2s, where n × n is the application size, k × k is the substructure size and s is the size of the operation store in a CE.

The performance comparison of REDEFINE with respect to a GPP is given in Table I. The compiler performs a semi-automatic partitioning and mapping of the full array into sub-arrays. We obtained the execution latencies of different MFA kernels for different matrix sizes on an Intel Pentium 4 processor running at 2.2 GHz. The total time taken by the function was determined using the Intel VTune Performance Analyzer. The execution latency numbers indicate that REDEFINE, running at 50 MHz, provides solutions several times faster than traditional GPP solutions. Realization of larger matrices gives a higher performance enhancement because of the higher computation-to-communication ratio. For comparison with systolic solutions, we define Work Ratio as:
Work Ratio = (No. of CEs × No. of cycles in REDEFINE) / (No. of PEs × No. of cycles in systolic array)

For example, for the 2 × 2 case in Table I, Work Ratio = (4 × 79)/(7 × 6) ≈ 7.52.

As seen in Table I, the low variance in Work Ratio justifies the scalability of the solution.

IV. REALIZATION OF ITERATIVE SOLVERS ON REDEFINE

The key issues in speeding up Iterative Solvers are presented here.

A. SMVM realization on REDEFINE

The key computation in SMVM is the multiplication of the non-zero matrix elements A[i, j] by the corresponding vector elements x[j] and the accumulation of the resulting products. The challenges are:

Efficient representation of matrix A: The challenge associated with access to a sparse matrix A is two-fold: one is recognition of the non-zero elements of A and the other is the latency involved in accessing the elements of A. The first

challenge is addressed by storing only the non-zero elements of A using well known formats. We choose the Compressed Sparse Row (CSR) format, since this format enables us to reduce communication overheads and increase throughput, and we need storage only for the final product. Non-zero elements of matrix A, along with their column indices, are streamed in from global memory to the CEs using the NoC. Rate mismatch between the source (global memory) and the destination (the CE) is addressed by chaining the producer and the consumer. The column index is used to fetch the corresponding element of vector x.

Accesses to vector x: In order to decrease the overhead of fetching elements of vector x from global memory, we use the SPM to hold these elements, along with the associated logic for address calculation that enables uniform latency for irregular accesses to elements of vector x.

Computing the dot products: We adopt a Single Instruction Multiple Data (SIMD) scheme in which the inner products of a row are computed sequentially by one CE (enhanced with the FP-CFU) and multiple CEs compute the dot products of different rows in parallel; a sequential sketch of this per-row computation in CSR form is given below. The larger the matrix size, the more parallelism is extracted, since a large number of CEs operate on rows of the matrix. However, due to resource limitations, multiple rows of matrix A may be assigned to a single CE; this results in sequentially computing the dot products of all the rows assigned to that CE. Figure 6 shows the CE enhanced with the FP-CFU, the unit that supports the floating point computational needs of the iterative solvers. The datapath within the CE for SMVM is as follows. Operand 1 is the initial value of the inner product of a row of matrix A with the vector x. An element of matrix A is supplied as Operand 2. The address of the corresponding element of vector x (Operand 3) is given to the SPM controller, which fetches the element from the SPM and gives it to the FP-CFU (as Operand 3). The result of the operation is an inner product, which serves as Operand 1, to be accumulated with the next inner product. This operation (taking the datapath explained above) is repeated until all the (non-zero) elements of a row have been multiplied by the corresponding elements of the vector. The dot product (the accumulated inner products) is then transported to the destination CE. The CE continues to compute the dot products of as many rows of matrix A as are streamed into it.

Accumulation of dot products for inner product computations: SMVM is a regular application, where dot products are repeatedly computed and partial results are accumulated. Figure 7 shows how SMVM is mapped to REDEFINE. Elements of matrix A are streamed into CEs directly connected to Access Routers (CE #2 and CE #4). (Note: Access Routers enable the toroidal interconnection of CEs.) Here, the vector which is already stored in the SPM is called p, and the vector which is generated after multiplication with the matrix A is called q. This convention is chosen so that we can use the same variable names while describing the CG algorithm. Dot products (of the rows that are streamed into a CE) are stored in a neighbouring CE (e.g., CE #3 and CE #5). This is illustrated by the grey blocks shown in Figure 7. The figure also shows the tiles which result in an effective mapping of this operation.
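For concreteness, a minimal sequential C sketch of the per-row CSR computation follows. The array names row_ptr, col_idx and val follow common CSR conventions and, like the function name smvm_csr, are illustrative assumptions; the sketch mirrors the Operand 1/2/3 datapath described above but is not the REDEFINE implementation itself.

    /* q = A*p with A stored in CSR form (illustrative sketch). */
    void smvm_csr(int n_rows, const int *row_ptr, const int *col_idx,
                  const double *val, const double *p, double *q)
    {
        for (int i = 0; i < n_rows; i++) {       /* in the SIMD scheme, rows go to different CEs  */
            double acc = 0.0;                    /* Operand 1: running inner product              */
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                acc += val[k] * p[col_idx[k]];   /* Operand 2 (A element) * Operand 3 (p element) */
            q[i] = acc;                          /* accumulated dot product of row i              */
        }
    }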

Fig. 7. Grey blocks depict the mapping of SMVM on REDEFINE; hatched blocks depict templates of the CG mapping on REDEFINE. Elements of the matrix are streamed from the LSU using Access Routers. Each template is a topological subset of the honeycomb interconnection of tiles that provides near-neighbour communication among the vectors p, q, r and x (Algorithm IV.1). Note: Access Routers, labelled A, enable the toroidal interconnection.

Fig. 8. Mapping of the CG algorithm on REDEFINE. Among the various templates shown in Figure 7, Type I and Type III are used. Note: Access Routers, labelled A, enable the toroidal interconnection.

Any CE which has a direct link to an access router (marked A in figure 7) and a one-hop link to a neighbouring CE can be used for this computation. It is important to note that only a set of rows of A is streamed into a CE. Therefore, each CE produces dot products only for the rows assigned to it. For example, if elements of rows 1 to m are streamed into CE #2, and elements of rows m+1 to 2m are streamed into CE #4, then the dot products of rows 1 to m are stored in CE #3 and the dot products of rows m+1 to 2m are stored in CE #5. In contrast to [12], we compute the dot products in a distributed manner and communicate the results only at the end of an iteration.

B. Realization of the CG algorithm on REDEFINE

The CG algorithm, whose pseudocode appears in Algorithm IV.1, is the Iterative Solver solution we have chosen to implement. In this algorithm, A is a symmetric positive-definite sparse matrix, x is the solution vector and b is the right hand side vector of the equation Ax = b. p, q and r are vectors used internally by the algorithm [22], and τ is the iteration number. The algorithm is said to converge when the norm of the vector r^(τ+1), which is called δ, is less than a small pre-determined quantity (the accuracy needed determines this quantity).

All dot products computed by the various CEs, as illustrated in figure 7, need to be communicated to each other, since the next iteration of the algorithm needs the updated values. We parallelize the algorithm by distributing different rows of A to different CEs, so that each CE contains only partial results. The algorithm needs summation of all dot products at two places (for the calculation of α and β), which necessitates collating results in one place. This is a one-element communication; however, the algorithm also requires a complete update of the p vector at the end of each iteration, which is a multi-element exchange, and as the problem size increases this communication overhead increases. The other constraint is that, for efficient execution, all the vectors have to be mapped within a one-hop distance on the fixed topology network (honeycomb).

Below we present a mapping strategy that minimizes global communication.


Algorithm IV.1: CG ALGORITHM(A, x, b)

    x^(0) ← x_0
    p^(0) ← r^(0) ← b − A x^(0)
    τ ← 0
    while (δ is too big) do
        q ← A p^(τ)
        α ← (r^(τ), r^(τ)) / (p^(τ), q)
        x^(τ+1) ← x^(τ) + α p^(τ)
        r^(τ+1) ← r^(τ) − α q
        β ← (r^(τ+1), r^(τ+1)) / (r^(τ), r^(τ))
        p^(τ+1) ← r^(τ+1) + β p^(τ)
        τ ← τ + 1
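A sequential C reference sketch of Algorithm IV.1 is given below, assuming the smvm_csr() routine sketched in section IV-A; the names dot(), cg_csr() and the parameters are illustrative. It is a plain software rendering, not the parallel REDEFINE mapping, but it marks the two reductions (for α and β) at which, on REDEFINE, partial results from different CEs must be collated.

    #include <math.h>

    /* CSR matrix-vector multiply sketched in Section IV-A. */
    void smvm_csr(int n_rows, const int *row_ptr, const int *col_idx,
                  const double *val, const double *p, double *q);

    static double dot(int n, const double *a, const double *b)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    void cg_csr(int n, const int *row_ptr, const int *col_idx, const double *val,
                const double *b, double *x, double tol, int max_iter)
    {
        double r[n], p[n], q[n];                      /* C99 variable-length arrays  */
        smvm_csr(n, row_ptr, col_idx, val, x, q);     /* q = A x^(0)                 */
        for (int i = 0; i < n; i++) r[i] = p[i] = b[i] - q[i];
        double delta = dot(n, r, r);                  /* delta = (r, r)              */
        for (int tau = 0; tau < max_iter && sqrt(delta) > tol; tau++) {
            smvm_csr(n, row_ptr, col_idx, val, p, q); /* q = A p^(tau)               */
            double alpha = delta / dot(n, p, q);      /* first collated reduction    */
            for (int i = 0; i < n; i++) {
                x[i] += alpha * p[i];
                r[i] -= alpha * q[i];
            }
            double delta_new = dot(n, r, r);          /* second collated reduction   */
            double beta = delta_new / delta;
            for (int i = 0; i < n; i++)               /* p update: the multi-element */
                p[i] = r[i] + beta * p[i];            /* exchange discussed above    */
            delta = delta_new;
        }
    }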

Figure 7 shows the various templates that can be used to realize one iteration of the CG algorithm. In figure 8, a mapping of an iteration of the CG algorithm onto REDEFINE is shown in terms of the Type I and Type III templates. The algorithm uses four Type I templates³, along with the toroidal connections which connect all of them in a circular manner, and four Type III templates⁴. The vectors p, q and r are mapped in such a way that they are in the close neighbourhood of the CE producing vector x, so that communication latency is low. This way of mapping ensures that no global communication is needed within one iteration of the CG algorithm. However, vector p needs to be updated at the end of each iteration, which necessitates global communication. Exchange of partial results among the Type I templates takes two 2-hop transfers, whereas it takes three 3-hop transfers and one 1-hop transfer among the Type III templates (the number of cycles depends upon the number of data elements transferred). Since each router can send and receive data at the same time, most of the communication can happen simultaneously among the template types, which reduces the overall communication time. An updated vector p is obtained
³ E.g., CE #1, 2, 3 and 11 constitute a Type I template.
⁴ E.g., CE #17, 18, 19 and 24 constitute a Type III template.

TABLE II
COMPARISON OF PERFORMANCE FOR SMVM (nnz INDICATES THE NUMBER OF NON-ZERO ELEMENTS IN THE SPARSE MATRIX)

Matrix Size (nnz)      Execution Time on GPP (ms)   Execution Time on REDEFINE at 50 MHz (ms)
47 x 47 (247)          5.9                          2.3
1000 x 1000 (3750)     19.9                         12.5
332 x 332 (3920)       23.9                         18
537 x 537 (37945)      65.6                         58

REFERENCES

[1] D. K. Faddeev and V. N. Faddeeva, Computational Methods of Linear Algebra, vol. 54. Leningrad: Nauka, Leningrad. Otdel., 1975.
[2] A. Ghosh and P. Paparao, "Performance of modified Faddeev algorithm on optical processors," Optoelectronics, IEE Proceedings J, vol. 139, pp. 325-330, Oct 1992.
[3] J. G. Nash and S. Hassen, "Modified Faddeev Algorithm for Matrix Manipulation: an overview," in SPIE: Real Time Signal Processing IV, pp. 39-45, 1984.
[4] M. Zajc, R. Sernec, and J. Tasic, "An efficient linear algebra SoC design: implementation considerations," in Electrotechnical Conference, 2002. MELECON 2002. 11th Mediterranean, pp. 322-326, 2002.
[5] F. Gaston and G. Irwin, "Systolic Kalman filtering: an overview," Control Theory and Applications, IEE Proceedings D, vol. 137, pp. 235-244, Jul 1990.
[6] A. El-Amawy, "A systolic architecture for fast dense matrix inversion," IEEE Trans. Comput., vol. 38, no. 3, pp. 449-455, 1989.
[7] F. Gaston, D. Brown, and J. Kadlec, "A parallel predictive controller," in Control 96, UKACC International Conference on (Conf. Publ. No. 427), vol. 2, pp. 1070-1075, Sept. 1996.
[8] A. El-Amawy and K. Dharmarajan, "Parallel VLSI algorithm for stable inversion of dense matrices," Computers and Digital Techniques, IEE Proceedings E, vol. 136, pp. 575-580, Nov 1989.
[9] A. Bigdeli, M. Biglari-Abhari, Z. Salcic, and Y. T. Lai, "A new pipelined systolic array-based architecture for matrix inversion in FPGAs with Kalman filter case study," EURASIP J. Appl. Signal Process., vol. 2006, pp. 75-75.
[10] R. Vuduc, Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, 2003.
[11] G. Goumas, K. Kourtis, N. Anastopoulos, V. Karakasis, and N. Koziris, "Understanding the performance of sparse matrix-vector multiplication," in 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2008, (Washington, DC, USA), pp. 283-292, IEEE Computer Society, 2008.
[12] V. Prasanna and G. Morris, "Sparse matrix computations on reconfigurable hardware," Computer, vol. 40, pp. 58-64, March 2007.
[13] M. deLorimier and A. DeHon, "Floating-point sparse matrix-vector multiply for FPGAs," in FPGA '05: Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, (New York, NY, USA), pp. 75-85, ACM, 2005.
[14] A. Bik, Compiler Support for Sparse Matrix Computations. PhD thesis, Leiden University, Netherlands, 1996.
[15] V. Kotlyar, K. Pingali, and P. Stodghill, "Compiling parallel code for sparse matrix applications," in Supercomputing '97: Proceedings of the 1997 ACM/IEEE Conference on Supercomputing (CDROM), (New York, NY, USA), pp. 1-18, ACM, 1997.
[16] M. Alle, K. Varadarajan, A. Fell, R. C. Reddy, N. Joseph, S. Das, P. Biswas, J. Chetia, A. Rao, S. K. Nandy, and R. Narayan, "REDEFINE: Runtime reconfigurable polymorphic ASIC," ACM Trans. Embed. Comput. Syst., vol. 9, no. 2, pp. 1-48, 2009.
[17] A. Fell, M. Alle, K. Varadarajan, P. Biswas, S. Das, J. Chetia, S. K. Nandy, and R. Narayan, "Streaming FFT on REDEFINE-v2: an application-architecture design space exploration," in CASES '09: Proceedings of the 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, (New York, NY, USA), pp. 127-136, ACM, 2009.
[18] J. B. Dennis and G. R. Gao, "An efficient pipelined dataflow processor architecture," in Supercomputing '88: Proceedings of the 1988 ACM/IEEE Conference on Supercomputing, (Los Alamitos, CA, USA), pp. 368-373, IEEE Computer Society Press, 1988.
[19] M. Alle, K. Varadarajan, A. Fell, S. K. Nandy, and R. Narayan, "Compiling techniques for coarse grained runtime reconfigurable architectures," in ARC '09: Proceedings of the 5th International Workshop on Reconfigurable Computing: Architectures, Tools and Applications, (Berlin, Heidelberg), pp. 204-215, Springer-Verlag, 2009.
[20] C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation," in CGO '04: Proceedings of the International Symposium on Code Generation and Optimization, (Palo Alto, California), 2004.
[21] M. A. Bayoumi, P. Rao, and B. Alhalabi, "VLSI parallel architecture for Kalman filter: an algorithm specific approach," J. VLSI Signal Process. Syst., vol. 4, no. 2-3, pp. 147-163, 1992.
[22] J. Shewchuk, "An introduction to the conjugate gradient method without the agonizing pain," http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.ps, 1994.


TABLE III
COMPARISON OF PERFORMANCE FOR THE CG ALGORITHM (nnz INDICATES THE NUMBER OF NON-ZERO ELEMENTS IN THE SPARSE MATRIX)

Matrix Size (nnz)      Execution Time on GPP (ms)   Execution Time on REDEFINE at 50 MHz (ms)
47 x 47 (247)          17.9                         5
1000 x 1000 (3750)     57.6                         46.4
332 x 332 (3920)       75.5                         62
537 x 537 (37945)      314.1                        292

at the end of an iteration by global exchanges among the Type I and Type III templates, which take eight 5-hop⁵ transfers and two 3-hop transfers, resulting in a total of 56 hops for updating vector p in all the templates.

C. Results for Iterative Solvers

Table II presents the comparison of execution times on REDEFINE with those on a GPP for the SMVM kernel. The GPP used here is the same as that used for comparison in the Direct Solver. The time taken depends on the number of non-zero elements present in the matrix; the entries in the table are arranged in increasing order of the number of non-zero elements. Better performance over a range of sparse matrices is achieved because of the addition of specialized hardware (FP-CFU and SPM). We use this kernel in the implementation of the CG algorithm and compare the results with the GPP (Table III), which shows improved performance. The performance benefits come from the parallelized CG algorithm and the efficient management of global communication⁶, which happens only at the end of each iteration.

V. CONCLUSION

In this paper, design solutions for Direct and Iterative Numerical Linear Algebra (NLA) solvers on REDEFINE have been presented. We further explored the scalability and runtime reconfigurability of these solutions. The methodology used to implement Modified Faddeev's Algorithm (MFA) can be generalized for the realization of other systolic solutions on REDEFINE. We achieved enhanced performance over GPP solutions by providing the custom hardware support needed to execute the core computations in the NLA application domain.
ACKNOWLEDGMENTS

This project was partly funded by the Ministry of Communication and Information Technology, Govt. of India.
⁵ Due to lack of space, derivations of the number of hops are not presented in the paper.
⁶ Communication between different template types.
