This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/LCA.2015.2458318, IEEE Computer Architecture Letters
Abstract—Large-scale workloads often exhibit parallelism at different levels, which offers acceleration potential for clusters and parallel processors. Although processors such as GPGPUs and FPGAs achieve good speedup, there is still a vacancy for a low-power, high-efficiency and dynamically reconfigurable alternative, and coarse-grained reconfigurable architecture (CGRA) is one possible choice. In this paper, we introduce how we use our CGRA fabric Chameleon to realize dynamically reconfigurable acceleration of MapReduce-based (MR-based) applications. An FPGA-shell-CGRA-core (FSCC) architecture is designed for the acceleration PCI-Express board, and a programming model with a compilation flow for the CGRA is presented. With the above support, a small evaluation cluster running the Hadoop framework is set up, and experiments on compute-intensive applications show that the programming process is significantly simplified, with a 30-60x speedup offered under low power.
S. Liang, S. Yin, L. Liu and S. Wei are with the Institute of Microelectronics, Tsinghua University, Beijing, China, 100084. S. Yin is the corresponding author. E-mail: yinsy@tsinghua.edu.cn
Y. Guo is with the Department of Computing, Imperial College London, UK.

1 INTRODUCTION

2 CGRA-CORED ACCELERATOR BOARD DESIGN

2.1 Architecture and working mechanism of our CGRA

Based on the typical CGRA architecture shown in Figure 1, we implemented a prototype chip called Chameleon with TSMC 65nm LP1P8M CMOS technology, which has an area of
1556-6056 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
TABLE 1
MapReduce models for the benchmark applications

Function    | MM            | KMC                           | CONV
Mapper      | Replicate     | Euclid distance, Compare, Sum | Multiply
Reducer     | Multiply, Sum | Euclid distance, Sum, Mean    | Sum
Complexity  | O(MNP)        | O(NKD)                        | O(K^2 N^2)
Tested size | M=N=P=128     | N=10e5, K=10, D=2             | K=7, N=224

function. The reduce function searches for the elements with the same key value, and yields vector inner products in parallel. With the map and reduce functions in place, we then consider how to compile the compute-intensive parts of the functions and offload them onto CGRAs. As shown in Figure 4, the compute-intensive parts (CIPs) can always be presented as control data flow graphs (CDFGs). With hardware parameters such as PE number and memory bandwidth, the original CDFG is optimized by transformations [5] to suit the Chameleon array. The CDFG is then transformed into a series of subgraphs, and key parameters such as kernel pattern, subgraph iteration number, iteration dependence and input data address are abstracted. With these parameters, configuration contexts are generated and packed into a parametric execution package. The above procedure can be called a compilation flow for Chameleon; it can be carried out manually or by a custom compiler, and we have developed an LLVM-based compiler [9][3] that goes through the procedure automatically.

We write a C driver for access to the underlying hardware according to the register arrangement, and use the Java Native Interface (JNI) to call the C libraries and link them with the MapReduce applications. As shown in Figure 4, given the source and end pointers, the exe_CGRA function realizes the CIPs of the original mapper/reducer on CGRAs.

4 EVALUATION SYSTEM

To verify the effectiveness of CGRA acceleration, we need to set up a distributed environment. Since Hadoop [2] is a widely used MapReduce framework that offers abundant tools and libraries for development, we choose to set up a Hadoop-based cluster. The overall system workflow is given in Figure 5. We employ five IBM x3650 M4 servers, each with a Xeon E5-2650 2GHz CPU and 8GB ECC DDR3 memory, connected over a local network. In the Hadoop framework, the Namenode is the manager of the whole system, and the Datanodes work under the control of the Namenode, providing status feedback for scheduling reference. In our cluster, four servers are configured as Datanodes, each fitted with one Chameleon accelerator board through a PCI-Express slot.

5 EXPERIMENTAL RESULTS

Our experiments mainly focus on three aspects: the simplicity of CGRA programming, the standalone performance improvement on CGRAs, and the overall improvement in the MR-based prototype cluster. The benchmark applications we have chosen are matrix multiplication (MM), K-means clustering (KMC), and 2-D convolution (CONV) in convolutional neural networks; they all show compute-intensive characteristics and are easy to parallelize. The MapReduce models, the timing complexities and the tested sizes of these applications are listed in Table 1.

We extract the CIPs in the MapReduce functions for acceleration. To compare the strengths and weaknesses of FPGA and CGRA implementations, the CIPs are realized in both HDL and C++. We generate configurations using the Xilinx ISE Design Suite v14.7 for the FPGA and our LLVM-based C compiler for Chameleon. For the configuration generation time (CGT), the FPGA takes 10-14 minutes to generate bitstream files of 90.3-112.1 Mb, while the CGRA takes only 23-37 seconds to generate execution packages of 1.65-5.15 Kb. The main difference comes from the elimination of the structural description of PEs and the reduction in the number and combinations of interconnections. Consequently, FPGAs can only be statically configured before the chip actually works, whereas our Chameleon CGRA can be dynamically configured given a much simpler parametric description of the workload.
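To make the parametric execution package concrete, a minimal sketch in C follows. The field names and the software fallback are our assumptions for illustration, not the actual Chameleon driver interface; on the real board, exe_CGRA would ship the package over PCI-Express, and a Hadoop mapper or reducer would reach it through JNI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parametric execution package; the field names are
 * illustrative assumptions, not the actual Chameleon format. */
typedef struct {
    int           kernel_pattern;   /* which compiled subgraph to run */
    size_t        iterations;       /* subgraph iteration number      */
    int           loop_carried_dep; /* iteration dependence flag      */
    const double *src_a;            /* input data addresses           */
    const double *src_b;
    double       *dst;              /* output address                 */
} cgra_exec_package;

/* Software stand-in for the driver call: on real hardware this would
 * hand the package to the Chameleon board; here the CIP (the
 * inner-product loop of the MM reducer) simply runs on the CPU. */
void exe_CGRA(const cgra_exec_package *p)
{
    double acc = 0.0;
    for (size_t i = 0; i < p->iterations; ++i)
        acc += p->src_a[i] * p->src_b[i];
    *p->dst = acc;
}

/* Example: offload a length-3 inner product as one package. */
double demo_inner_product(void)
{
    static const double a[] = {1.0, 2.0, 3.0};
    static const double b[] = {4.0, 5.0, 6.0};
    double out = 0.0;
    cgra_exec_package pkg = {0, 3, 0, a, b, &out};
    exe_CGRA(&pkg);
    return out; /* 1*4 + 2*5 + 3*6 = 32 */
}
```

On the Java side, the mapper would declare the corresponding `native` method and load the C library with `System.loadLibrary`, so a call like the one above replaces the CIP loop in the original mapper/reducer.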
TABLE 2
Comparison between CPU, FPGA and CGRA

     |                 | Xeon @ 2GHz             | Virtex-6 LX550T @ 200MHz                   | Chameleon @ 200MHz
Func | # of operations | Latency(ms) Efficiency* | Cycles  Latency(ms)  Speedup  Efficiency*  | Cycles  Latency(ms)  Speedup  Efficiency*
MM   | 4.19e6          | 14.41       59.96       | 16385   8.19e-2      175.89   4.59e3       | 77824   3.89e-1      37.03    1.09e5
KMC  | 8.10e6          | 12.29       135.77      | 41032   2.05e-1      59.90    3.49e3       | 53125   2.66e-1      46.27    3.08e5
CONV | 4.74e6          | 20.07       48.68       | 47526   2.37e-1      84.46    1.77e3       | 72604   3.63e-1      55.29    1.32e5

* Here we set the energy efficiency as Efficiency = (# of operations)/(Power(mW) × Execution time(ms))
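The latency and speedup columns of Table 2 can be cross-checked from the cycle counts: at a clock frequency f in MHz, latency(ms) = cycles / (f × 10^3), and speedup is the ratio of the CPU latency to the accelerator latency. A small sketch of these relations, plus the footnote's efficiency metric (the helper names are ours):

```c
#include <assert.h>
#include <math.h>

/* Latency in milliseconds for a given cycle count at f MHz. */
double latency_ms(double cycles, double f_mhz)
{
    return cycles / (f_mhz * 1e3);
}

/* Speedup of an accelerator over the CPU baseline. */
double speedup(double cpu_ms, double accel_ms)
{
    return cpu_ms / accel_ms;
}

/* Energy efficiency as defined in the Table 2 footnote:
 * (# of operations) / (Power(mW) * Execution time(ms)). */
double efficiency(double ops, double power_mw, double time_ms)
{
    return ops / (power_mw * time_ms);
}
```

For the MM row, the Virtex-6 needs 16385 cycles at 200MHz, i.e. latency_ms(16385, 200) ≈ 8.19e-2 ms and speedup(14.41, 8.19e-2) ≈ 175.9; Chameleon's 77824 cycles likewise give ≈ 3.89e-1 ms and ≈ 37.0, matching the table.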
For the standalone performance, we implement three versions on CPU, FPGA and CGRA respectively. The detailed comparison is given in Table 2. The power consumption of the Xeon CPU is measured with the Powerstat tool [1] and a wattsUp PRO watt meter. As we can see, both the FPGA and the CGRA show significant speedup and power advantages compared with the CPU. However, due to its limited PE number (512), Chameleon shows a lower speedup than the Virtex-6 FPGA, while offering an efficiency almost 30-90 times that of the FPGA. We should mention that Chameleon's technology is 65nm while the Virtex-6's is 40nm, and that the area of Chameleon is only a quarter of the Virtex-6's. By the scaling principle [7], with a reduction of the CGRA I/O number, more CGRA PEs can be integrated on board, which will bring an even better timing improvement.

Finally, we test the applications in a CGRA-accelerated cluster environment. We rewrite the applications in MapReduce form, and make another version with the loops (the CIPs) compiled into CGRA execution packages and called through the exe_CGRA function with the key parameters defined. We take the non-accelerated single-node case as the baseline, and test the applications with the Datanode number varying from one to four. We process 10^4 copies of the applications in order to raise the datasets to the gigabyte level. The normalized speedup of the cluster is shown in Figure 6. We observe that the relationship between node number and speedup is nonlinear, because the I/O communication ratio rises as the single-node compute time drops. However, the node number does not much affect the benefit of CGRA acceleration, because the PCI-Express bus provides fast communication and the parallelism on the CGRAs remains stable.

Fig. 6. The speedup of three applications with different numbers of nodes: (a) non-accelerated; (b) CGRA-accelerated.

6 CONCLUSION

In this paper, we present how we bring a CGRA accelerator into an MR-based system, with the hardware architecture and software programming model described. Dynamic programmability has been achieved, and a considerable speedup and remarkable energy efficiency are realized in both the standalone and cluster cases. The power saving will be even more tempting as the cluster scale grows. Future work lies in CGRA mapping optimization and a real-time scheduler with CGRA status feedback.

ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China (No. 61274131), the International S&T Cooperation Project of China (No. 2012DFA11170), the Tsinghua Indigenous Research Project (No. 20111080997) and the China National High Technologies Research Program (No. 2012AA012701).

REFERENCES

[1] Powerstat for desktop. http://sourceforge.net/projects/powerstatfordes/.
[2] D. Borthakur. The Hadoop distributed file system: Architecture and design. Hadoop Project Website, 11:21, 2007.
[3] Y. Chongyong, Y. Shouyi, L. Leibo, and W. Shaojun. Compiler framework for reconfigurable computing architecture. IEICE Transactions on Electronics, 92(10):1284-1290, 2009.
[4] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.
[5] V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim. DySER: Unifying functionality and parallelism specialization for energy-efficient computing. IEEE Micro, (5):38-51, 2012.
[6] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: a MapReduce framework on graphics processors. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 260-269. ACM, 2008.
[7] M. Horowitz, E. Alon, D. Patil, S. Naffziger, R. Kumar, and K. Bernstein. Scaling, power, and the future of CMOS. In Electron Devices Meeting, 2005. IEDM Technical Digest. IEEE International, pages 7 pp. IEEE, 2005.
[8] I. Kuon and J. Rose. Measuring the gap between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(2):203-215, 2007.
[9] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on, pages 75-86. IEEE, 2004.
[10] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 13-24. IEEE, 2014.
[11] O. Segal, M. Margala, S. R. Chalamalasetti, and M. Wright. High level programming for heterogeneous architectures. arXiv preprint arXiv:1408.4964, 2014.
[12] D. Soderman and Y. Panchul. Implementing C designs in hardware: a full-featured ANSI C to RTL Verilog compiler in action. In Verilog HDL Conference and VHDL International Users Forum, 1998. IVC/VIUF. Proceedings., 1998 International, pages 22-29. IEEE, 1998.
[13] J. A. Stuart and J. D. Owens. Multi-GPU MapReduce on GPU clusters. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 1068-1079. IEEE, 2011.
[14] B. Sukhwani, H. Min, M. Thoennes, P. Dube, B. Brezzo, S. Asaad, and D. E. Dillenberger. Database analytics: A reconfigurable-computing approach. IEEE Micro, 34(1):19-29, 2014.
[15] K. H. Tsoi and W. Luk. Axel: a heterogeneous cluster with FPGAs and GPUs. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 115-124. ACM, 2010.
[16] J. H. Yeung, C. Tsang, K. H. Tsoi, B. S. Kwan, C. C. Cheung, A. P. Chan, and P. H. W. Leong. Map-reduce as a programming model for custom computing machines. In Field-Programmable Custom Computing Machines, 2008. FCCM'08. 16th International Symposium on, pages 149-159. IEEE, 2008.