
High Performance Reconfigurable Multi-Processor-Based Computing on FPGAs

Diana Göhringer (3rd year PhD student)


Fraunhofer IOSB Ettlingen, Germany e-mail: dgoehringer@fom.fgan.de
Abstract- Multi-processor architectures are a promising solution for providing the computational performance required by applications in the area of high performance computing. Multi- and many-core Systems-on-Chip make it possible to host an application, partitioned into a number of tasks, on the different cores of one silicon die. Unfortunately, partitioning the tasks close to the performance optimum is the main challenge in this domain and often a show-stopper for the success of multi- and many-core hardware. What these architectures lack is runtime adaptivity of the underlying hardware, which allows the hardware to be tailored to the application so that it meets the task mapping coming from top-down development. This Meet-in-the-Middle solution is offered by the novel hardware and software approach of RAMPSoC, which is described in this paper.

Keywords- Multiprocessor System-on-Chip (MPSoC); Dynamic and Partial Reconfiguration; Field Programmable Gate Array (FPGA); Design Methodology; Image Processing

Jürgen Becker (Advisor)


Karlsruhe Institute of Technology (KIT) Karlsruhe, Germany e-mail: becker@kit.edu

I. INTRODUCTION AND MOTIVATION

To fulfill the real-time requirements of high performance computing applications such as image processing, Multi-Processor Systems-on-Chip (MPSoCs) are a promising solution. Nowadays, several different MPSoC architectures are available from industry and academic institutions. These architectures vary from homogeneous multi-core architectures [1], which are mainly used for general purpose applications, to heterogeneous MPSoCs [2], which are mostly optimized for a specific application. While a broad variety of hardware architectures exists, one of the main challenges in R&D is programmability. This means achieving a well-balanced workload distribution that targets the performance optimum, and therefore an optimal speed-up, by efficiently mapping the given application onto the hardware architecture. While it may be feasible for some applications to find an optimal allocation at design-time, an optimal task allocation at runtime is highly difficult. The motivation for investigating this is that runtime adaptive hardware makes it possible to tailor the processing platform to the current requirements of the application and can therefore reduce power consumption and optimize performance. Current hardware lacks this capability because the architecture, e.g. the processors, the memory and the communication infrastructure, is fixed and cannot be adapted to runtime requirements.

To solve this issue, a novel runtime adaptive MPSoC, called RAMPSoC [3], has been developed. The hardware architecture of RAMPSoC can be optimized at design-time and at runtime to the requirements of the application, leading to the novel Meet-in-the-Middle design flow. For the runtime adaptation, the dynamic and partial reconfiguration feature [4], which is supported for example by Xilinx FPGAs, is used. This feature allows the configurations of the die that correspond to specific hardware modules to be exchanged at runtime, while the other modules stay operative and are not affected by this process. This way, for example, one of the processors can be removed or exchanged at runtime without interrupting the execution of the other processors on the chip. Furthermore, the RAMPSoC approach supports the adaptation of the processors, the communication infrastructure, the hardware accelerators and the instruction set of the processors to meet constraints regarding performance, area and power consumption more efficiently.

To make the novel RAMPSoC hardware architecture programmable with reasonable time and complexity, a software toolchain is required. This toolchain hides the complexity of the underlying hardware architecture from the programmer. Moreover, a specialized operating system (OS) is needed to manage the hardware resources at runtime as well as the scheduling and allocation of the tasks onto the multiprocessor system. These topics, among others, in which RAMPSoC clearly differs from existing MPSoC architectures, are the challenges of this research work.

In order to integrate the RAMPSoC approach into the state-of-the-art research, a new taxonomy scheme has been introduced [5] to classify both static and reconfigurable single- and multi-processor systems-on-chip. This new taxonomy scheme makes it possible to compare different classes of processor systems against each other with respect to their supported temporal and spatial parallelism. It also illustrates the flexibility and the high degree of freedom given by the RAMPSoC approach.

The outline of the paper is the following: In Section II related work and its shortcomings are presented. Section III describes the RAMPSoC approach in detail. The current status of the proposed approach and the results obtained so far are presented in Section IV. Finally, conclusions are drawn and remaining objectives and challenges are given in Section V.

II. RELATED WORK

In [5] multiple examples of static and dynamic single- and multi-processor systems-on-chip are presented and compared against each other with respect to their degree of

978-1-4244-6534-7/10/$26.00 ©2010 IEEE

Authorized licensed use limited to: PONDICHERRY ENGG COLLEGE. Downloaded on June 14,2010 at 07:09:25 UTC from IEEE Xplore. Restrictions apply.

flexibility. The shortcoming of static multiprocessor systems, such as [1], the IBM CELL Broadband Engine, or Nvidia General Purpose Graphics Processing Units, is that the user has to adapt and modify the application to map it onto these architectures. As the number and type of available processors, their communication infrastructure, e.g. bus-based or Network-on-Chip (NoC), and the local and global memory bandwidth are fixed, the application can only be mapped suboptimally onto these architectures. This leads to an unbalanced workload, where some of the processors remain idle while others need to compute most of the tasks.

To achieve a better tradeoff between performance and power consumption, several research groups investigate the usage of reconfigurable hardware in single- and multiprocessor systems. Burke et al. [6] are working on the RAMP Blue system, which consists of multiple FPGA boards with multiple homogeneous processors. This system is used to investigate different application mapping strategies for future many-core systems. Due to the power consumption of these multiple FPGA boards, it cannot be used for embedded high performance systems. Also, a runtime adaptation of the system is not supported so far. There also exist approaches for runtime adaptive MPSoCs, like the MORPHEUS approach [7], consisting of an ARM processor and three heterogeneous and reconfigurable accelerators. Bobda et al. [8] and Claus et al. [9] have both presented FPGA-based MPSoCs with reconfigurable accelerators. The shortcoming of these three approaches is that they only support the runtime adaptation of the accelerators, while neither the communication infrastructure nor the processors are reconfigurable.

To the best of our knowledge, no holistic approach comparable to RAMPSoC exists. Other approaches focus either on static single-/multi-processor systems or on systems that support the runtime adaptation of the hardware accelerators only.
RAMPSoC provides a higher degree of freedom, because not only the accelerators, but also the communication infrastructure and whole processors can be adapted, added or removed on demand at runtime as well as at design-time. This way, requirements such as performance and power consumption can be met more efficiently.

III. THE RAMPSOC APPROACH

To abstract the complexity of the underlying hardware from the application programmer, four abstraction layers have been introduced in [10]: 1. MPSoC-, 2. Communication-, 3. Processor- and 4. Physical-level. As shown in Fig. 1, the hardware architecture and the software toolchain conform to these abstraction layers. The MPSoC-level is the level with the highest degree of abstraction from the physical realization. This level represents the whole RAMPSoC hardware system architecture and is used by the application programmer. The communication-level represents the various communication infrastructures, which are not fixed to a single realization option; for example, different topologies and protocols are supported. The different types of processors and the hardware accelerators belong to the processor-level. The physical implementation of the MPSoC on the FPGA is handled at the physical-level. The communication-, the processor- and the physical-level are hidden from the programmer and are handled by the software development toolchain.

At the MPSoC-level the programmer implements the application in C or C++. The software development toolchain transforms this into a task graph for analysis and partitioning into code fragments for the different processors. This results in an adjacency matrix specifying the needed communication infrastructure at the communication-level and leads to the physical instantiation. At the processor-level, the C-code fragments are profiled, and then a Hardware/Software Co-design together with a partitioning in time and space is done. The processors used are all state-of-the-art processors, which means that their design tools (e.g. compiler, linker, etc.) already exist.
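The step from a partitioned task graph to an adjacency matrix can be illustrated with a minimal Python sketch. It is not the RAMPSoC toolchain itself: the function name `adjacency_matrix`, the task names, the data volumes and the task-to-processor assignment are all hypothetical, and the sketch assumes that each task-graph edge carries a communication volume and that the partitioning has already been decided.

```python
def adjacency_matrix(task_edges, task_to_pe, num_pes):
    """Aggregate task-level communication into a processor-level matrix.

    task_edges: {(src_task, dst_task): data volume} (hypothetical units)
    task_to_pe: {task: index of the processor the task is mapped to}
    """
    matrix = [[0] * num_pes for _ in range(num_pes)]
    for (src, dst), volume in task_edges.items():
        a, b = task_to_pe[src], task_to_pe[dst]
        if a != b:  # tasks on the same processor need no interconnect link
            matrix[a][b] += volume
    return matrix


# Hypothetical image processing task graph and partitioning.
task_edges = {
    ("read", "filter"): 64,
    ("filter", "edges"): 64,
    ("edges", "merge"): 16,
    ("read", "merge"): 4,
}
task_to_pe = {"read": 0, "filter": 1, "edges": 1, "merge": 2}

matrix = adjacency_matrix(task_edges, task_to_pe, num_pes=3)
for row in matrix:
    print(row)
```

A non-zero entry in row i, column j indicates that processor i sends data to processor j, so the communication-level has to provide a path between them; the magnitude of the entry could guide the choice between, for example, a shared bus and a dedicated point-to-point link.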
Figure 1. The four abstraction layers of RAMPSoC (hardware system architecture and software toolchain side by side: MPSoC-level with task graph; communication-level with NoC, point-to-point and bus infrastructures and adjacency matrix; processor-level with microprocessors, accelerators and dataflow graph; physical level on the FPGA)

A. Hardware Architecture
The hardware architecture of RAMPSoC consists of a processor set, communication infrastructures and hardware accelerators. The supported IP is included in an IP library, which is used by the software toolchain. So far, several 32-bit processors (PowerPC, Xilinx MicroBlaze and Leon Sparc) and one 8-bit processor (Xilinx PicoBlaze) are supported. Additionally, the communication infrastructure can be chosen from several different types, such as bus-based (PLB, OPB, XPS, AMBA), point-to-point (Xilinx Fast Simplex Links (FSL)), NoCs (see [11], [12] and [13]) or a combination of these. Several accelerators (e.g. Gauss, Sobel, SAD, 2D normalized correlation, hotspot, coldspot and median filter) are also included for speeding up image processing algorithms. The diversity of the library supports the implementation of homogeneous and heterogeneous multiprocessor architectures on the System-on-Chip (SoC).

For the demands of image processing applications a novel NoC, called Star-Wheels Network-on-Chip [13], was developed. This network has a heterogeneous topology, as shown in Fig. 2. It uses the so-called wheel topology within the subnets and a star topology between the different subnets. The communication protocol is a synergy of packet- and circuit-switching. Packet-switching is used to establish a communication channel between two communication partners. Moreover, it is used for control purposes, e.g. to check regularly whether the communication partners still exist or whether their addresses have changed. This way, it supports the exchange, the addition or the deletion of processors at runtime. The advantages of the Star-Wheels NoC are a high throughput and a low latency. Also, it was shown that it is deadlock free and area efficient for implementation on FPGAs. Additional benefits are scalability, the support of different clock domains and the integration into a high level design tool from Xilinx.
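The mixed topology can be sketched as a plain link list in Python. This is only an illustration under stated assumptions: it assumes that a wheel is a hub switch (the superswitch) with spokes to a ring of subswitches, as in a graph-theoretic wheel, and that the star is formed by a root switch connected to every superswitch. The function name, the node names and the switch counts are hypothetical; the actual wiring is defined in [13].

```python
def star_wheels(num_subnets, subswitches_per_subnet):
    """Return the set of bidirectional links of an assumed Star-Wheels layout."""
    if subswitches_per_subnet < 3:
        raise ValueError("a ring needs at least three subswitches")

    links = set()

    def link(a, b):
        links.add(frozenset((a, b)))

    for s in range(num_subnets):
        hub = f"super{s}"
        link("root", hub)  # star topology between the subnets
        rim = [f"sub{s}.{i}" for i in range(subswitches_per_subnet)]
        for i, node in enumerate(rim):
            link(hub, node)                      # spoke of the wheel
            link(node, rim[(i + 1) % len(rim)])  # rim of the wheel
    return links


links = star_wheels(num_subnets=4, subswitches_per_subnet=7)
print(len(links))  # 4 star links + 4 * (7 spokes + 7 rim links) = 60
```

Under these assumptions, two processing elements in the same subnet can reach each other over the rim or through their superswitch, while traffic between subnets always passes through the root switch.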
Figure 2. Star-Wheels Network-on-Chip (legend: subswitch, superswitch, rootswitch; PE: Processing Element)

Fig. 3 shows an example of such a RAMPSoC system at one point in time. In this example the processing elements (PEs) are connected over a switch-based NoC. As PEs, RAMPSoC supports processors with or without accelerators, or a Finite State Machine (FSM) combined with a pure hardware function. The figure further shows a special network component called Virtual-IO. It was designed to receive the incoming data from a video camera and to divide the images into several tiles. In parallel, it collects the computed results from the processors and sends them out to a screen.

Figure 3. RAMPSoC system at one point in time (FPGA with external memory; switches connecting microprocessors of type 1 and type 2, accelerators, an FSM with a hardware function and the Virtual-I/O; one microprocessor runs the user applications on top of CAP-OS and an RTOS and accesses the ICAP)

One of the processors is used to communicate with the user. It receives the applications represented as task graphs and partial bitstreams. The task graphs and the partial bitstreams have been generated using the software toolchain of RAMPSoC, which is described in detail in subsection B. The processor that communicates with the user runs a special purpose operating system (OS) called Configuration Access Port-OS (CAP-OS) [14], which is described in more detail in subsection C. It is responsible for the runtime management of the hardware resources and for the scheduling and allocation of the tasks of the task graph.

B. Software Toolchain
Together with the RAMPSoC architecture, a semi-automatic software toolchain [15] was designed. The tool suite of this toolchain consists of a combination of commercial and custom tools. At the current development state some manual steps are also required, but these will be reduced in future versions of the toolchain. The toolchain consists of three phases, as shown in Fig. 4.

Figure 4. Semi-automatic software toolchain of RAMPSoC (Phase 1: analysis of the C/C++ program by profiling, tracing and communication analysis, followed by iterative SW/SW partitioning; results: suggested partitioning for the application and suggested MPSoC architecture, i.e. number of processors and communication infrastructure. Phase 2: profiling and HW/SW partitioning into HW code and SW code; results: identified hotspots for each processor. Phase 3: C-to-FPGA compiler, HW synthesis, compiler and system integration; results: partitioned application and FPGA bitstream for the complete MPSoC, i.e. number of processors, communication infrastructure and hardware accelerators, including software executables for each processor. Tool suite: commercial tools; custom tools Tracing, SW/SW Partitioning, HClustering, HW/SW Partitioning and Profile_Analyser; manual steps)

Phase 1 receives the sequential C/C++ application as an input. The individual functions of the application are then profiled and the communication overhead between them is analyzed. Tracing with a set of common input data is used to extract the call graph of the application. These steps yield the parameters required for the clustering decision of the hierarchical clustering algorithm, which is used to do a Software/Software partitioning. This way the system architecture and a set of application modules, one for each processor, are defined. This phase therefore happens at the MPSoC- and the communication-level.

The processor-level is handled in Phase 2. Here, each application module is profiled on a code line basis. From this profile, one or several compute intensive loops or blocks are extracted and suggested for implementation as a hardware accelerator.

Finally, the physical-level is managed in Phase 3. Here the executables for the application modules are generated. The hardware accelerators proposed in Phase 2 can be generated using C-to-Gates tools, like ImpulseC, if they are not within the IP library. Finally, the Xilinx tools are used to generate the full and partial bitstreams. To ease the design of the partial bitstreams, an additional tool called GenerateRCS has been developed, which is described in detail in [16].
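The Software/Software partitioning of Phase 1 can be illustrated with a minimal agglomerative clustering sketch in Python. It is a simplification and not the algorithm of [15]: it assumes that communication volume between functions is the only closeness metric, whereas the toolchain also uses profiling results, and the function names and volumes below are hypothetical.

```python
def cluster_functions(functions, comm, num_clusters):
    """Merge the two most heavily communicating clusters until
    num_clusters clusters (one per processor) remain."""
    clusters = [frozenset([f]) for f in functions]

    def traffic(a, b):
        # Total data exchanged between two clusters, in both directions.
        return sum(comm.get((x, y), 0) + comm.get((y, x), 0)
                   for x in a for y in b)

    while len(clusters) > num_clusters:
        _, i, j = max((traffic(a, b), i, j)
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if i < j)
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical call graph of an image processing pipeline.
functions = ["capture", "gauss", "sobel", "threshold", "display"]
comm = {
    ("capture", "gauss"): 120,
    ("gauss", "sobel"): 100,
    ("sobel", "threshold"): 10,
    ("threshold", "display"): 50,
}

modules = cluster_functions(functions, comm, num_clusters=2)
for module in modules:
    print(sorted(module))
```

With these volumes the sketch groups `capture`, `gauss` and `sobel` into one application module and `threshold` and `display` into the other, so that only the low-volume edge between `sobel` and `threshold` crosses processor boundaries.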


C. Special Purpose Runtime Operating System
To manage the hardware resources at runtime and to schedule and allocate the tasks to the individual PEs, a special purpose OS, as shown in Fig. 3, was integrated on one of the processors. This OS is called CAP-OS [14], because it controls the access to the single internal configuration access port (ICAP) of the Xilinx FPGAs. Newer FPGAs from Xilinx have two of these ports, but they can only be accessed sequentially. Therefore, the access to these ports has to be managed. CAP-OS receives the task graph and the corresponding partial bitstreams, generated at design-time with the software toolchain. Its major tasks are runtime scheduling of the tasks, resource allocation and configuration management. It is implemented on top of a real-time OS from Xilinx, called Xilkernel, to reuse the available thread scheduling and hardware drivers.

IV. RESULTS

All these characteristics were demonstrated by example on FPGA-based rapid prototyping platforms in order to evaluate the theoretical approach with real-world application scenarios. In [3] performance and area requirements for a pure software and a pure hardware implementation of an image processing application have been presented. In [10] the linear speedup achieved using one, two and four processors together with the Virtual-IO component, which splits the input image into an appropriate number of tiles, has been presented. In [13] the results for a RAMPSoC using the Star-Wheels NoC and an image processing application are presented. In [14] the results obtained using the CAP-OS and five processors, which are connected over FSLs, are presented. In [15] the results achieved by using the semi-automatic toolchain to partition a complex image processing algorithm are shown.

V. CONCLUSIONS AND FUTURE WORK

RAMPSoC provides a holistic approach for FPGA-based runtime reconfigurable Hardware-Software Co-design through a consistent abstraction of the different levels of software and hardware. RAMPSoC differs from existing MPSoC approaches in its capability to handle flexibility and adaptation mechanisms at design-time and at runtime. The increased flexibility comes with an extended design space, whose complexity is hidden from the user by a specific toolchain. This will allow RAMPSoC to gain a broader acceptance in several development domains than previous solutions, where developers were forced to have a deep understanding of the hardware layers. The advantages of this Meet-in-the-Middle approach will be evaluated with a real-world and real-time application from the image processing domain. The novel hardware architecture combined with the corresponding software toolchain and the CAP-OS of RAMPSoC is a promising solution for the upcoming frontiers in multi- and many-core architectures.

The novel Meet-in-the-Middle approach of RAMPSoC opens up new degrees of freedom for efficient application mapping onto future MPSoC systems at design-time as well as at runtime.

REFERENCES
[1] Intel Core 2 Quad Processor Product Brief, Available at http://www.intel.com
[2] W. Wolf: The Future of Multiprocessor Systems-on-Chips; In Proc. of Design Automation Conference (DAC 2004), pp. 681-685, June 2004.
[3] D. Göhringer, M. Hübner, V. Schatz, J. Becker: Runtime Adaptive Multi-Processor System-on-Chip: RAMPSoC; In Proc. of IPDPS 2008, April 2008.
[4] P. Lysaght, B. Blodget, J. Mason, J. Young, B. Bridgford: Invited Paper: Enhanced Architectures, Design Methodologies and CAD Tools for Dynamic Reconfiguration of Xilinx FPGAs; In Proc. of FPL 2006, pp. 1-6, Aug. 2006.
[5] D. Göhringer, T. Perschke, M. Hübner, J. Becker: A Taxonomy of Reconfigurable Single-/Multi-Processor Systems-on-Chip; International Journal of Reconfigurable Computing, vol. 2009, Article ID 395018, Hindawi, 2009.
[6] D. Burke, J. Wawrzynek, K. Asanovic, A. Krasnov, A. Schultz, G. Gibeling, P.-Y. Droz: RAMP Blue: Implementation of a Manycore 1008 Processor System; In Proc. of RSSI 2008, July 2008.
[7] Dynamic System Reconfiguration in Heterogeneous Platforms: The MORPHEUS Approach; Springer, 2009.
[8] C. Bobda, T. Haller, F. Mühlbauer, D. Rech, S. Jung: Design of adaptive multiprocessor on chip systems; In Proc. of SBCCI 07, pp. 177-183, Sept. 2007.
[9] C. Claus, W. Stechele, A. Herkersdorf: Autovision - a run-time reconfigurable MPSoC architecture for future driver assistance systems; Information Technology Journal, vol. 49, no. 3, pp. 181-187, 2007.
[10] D. Göhringer, M. Hübner, T. Perschke, J. Becker: New Dimensions for Multiprocessor Architectures: On Demand Heterogeneity, Infrastructure and Performance through Reconfigurability: The RAMPSoC Approach; In Proc. of FPL 2008, pp. 495-498, Sept. 2008.
[11] L. Braun, D. Göhringer, T. Perschke, V. Schatz, M. Hübner, J. Becker: Adaptive Real Time Image Processing exploiting 2 Dimensional Reconfigurable Architecture; Journal of Real-Time Image Processing, vol. 4, no. 2, pp. 109-125, Springer, 2009.
[12] M. Hübner, L. Braun, D. Göhringer, J. Becker: Run-Time Reconfigurable Adaptive Multilayer Network-on-Chip for FPGA-based Systems; In Proc. of IPDPS 2008, April 2008.
[13] D. Göhringer, B. Liu, M. Hübner, J. Becker: Star-Wheels Network-on-Chip Featuring a Self-Adaptive Mixed Topology and a Synergy of a Circuit- and a Packet-Switching Communication Protocol; In Proc. of FPL 2009, pp. 320-325, Sept. 2009.
[14] D. Göhringer, M. Hübner, E. Nguepi Zeutebouo, J. Becker: CAP-OS: Operating System for Runtime Scheduling, Task Mapping and Resource Management on Reconfigurable Multiprocessor Architectures; In Proc. of IPDPS 2010, April 2010, in press.
[15] D. Göhringer, M. Hübner, M. Benz, J. Becker: A Semi-Automatic Toolchain for Reconfigurable Multiprocessor Systems-on-Chip: Architecture Development and Application Partitioning; In Proc. of FPGA 2010, Feb. 2010.
[16] D. Göhringer, J. Luhmann, J. Becker: GenerateRCS: A High-Level Design Tool for Generating Reconfigurable Computing Systems; In Proc. of VLSI-SoC 2009, Oct. 2009.
