I. INTRODUCTION
Multi-Processor Systems-on-Chip (MPSoCs) are a promising way to meet the real-time requirements of high-performance computing applications such as image processing. Several different MPSoC architectures are available today from industry and academic institutions. They range from homogeneous multi-core architectures [1], which are mainly used for general-purpose applications, to heterogeneous MPSoCs [2], which are mostly optimized for a specific application. While a broad variety of hardware architectures exists, one of the main challenges in R&D is programmability: achieving a well-balanced, performance-optimal workload distribution, and thereby an optimal speed-up, by efficiently mapping the given application onto the hardware architecture. While it may be feasible for some applications to find an optimal allocation at design-time, an optimal task allocation at runtime is highly difficult. It is nevertheless worth investigating, because runtime adaptive hardware makes it possible to tailor the processing platform to the current requirements of the application and can therefore reduce power consumption and optimize performance. Current hardware lacks this capability because its architecture, e.g. the processors, the memory and the communication infrastructure, is fixed and cannot be adapted to runtime requirements. To solve this issue, a novel runtime adaptive MPSoC called RAMPSoC [3] has been developed. The hardware
Authorized licensed use limited to: PONDICHERRY ENGG COLLEGE. Downloaded on June 14,2010 at 07:09:25 UTC from IEEE Xplore. Restrictions apply.
flexibility. The shortcoming of static multiprocessor systems, such as [1], the IBM CELL Broadband Engine, or Nvidia General Purpose Graphics Processing Units, is that users have to adapt and modify their applications to map them onto these architectures. As the number and type of available processors, their communication infrastructure, e.g. bus-based or Network-on-Chip (NoC), and the local and global memory bandwidth are fixed, the application can only be mapped suboptimally onto these architectures. This leads to an unbalanced workload, where some of the processors remain idle while others have to compute most of the tasks. To achieve a better tradeoff between performance and power consumption, several research groups are investigating the use of reconfigurable hardware in single- and multiprocessor systems. Burke et al. [6] are working on the RAMP Blue system, which consists of multiple FPGA boards with multiple homogeneous processors. This system is used to investigate different application mapping strategies for future many-core systems. Due to the power consumption of these multiple FPGA boards, it cannot be used for embedded high-performance systems. Also, runtime adaptation of the system is not supported so far. There also exist approaches for runtime adaptive MPSoCs, such as the MORPHEUS approach [7], consisting of an ARM processor and three heterogeneous and reconfigurable accelerators. Bobda et al. [8] and Claus et al. [9] have both presented FPGA-based MPSoCs with reconfigurable accelerators. The shortcoming of these three approaches is that they only support the runtime adaptation of the accelerators; neither the communication infrastructures nor the processors are reconfigurable. To the best of our knowledge, no approach as holistic as RAMPSoC exists. Other approaches focus either on static single-/multi-processor systems or on systems that support only the runtime adaptation of the hardware accelerators.
RAMPSoC provides a higher degree of freedom, because not only the accelerators but also the communication infrastructure and whole processors can be adapted, added or removed on demand at runtime as well as at design-time. This way, requirements such as performance and power consumption can be met more efficiently.
III. THE RAMPSOC APPROACH
To abstract the complexity of the underlying hardware from the application programmer, four abstraction layers have been introduced in [10]: 1. MPSoC-, 2. Communication-, 3. Processor- and 4. Physical-level. As shown in Fig. 1, the hardware architecture and the software toolchain conform to these abstraction layers. The MPSoC-level is the level with the highest degree of abstraction from the physical realization. This level represents the whole RAMPSoC hardware system architecture and is used by the application programmer. The communication-level represents the various communication infrastructures, which are not fixed to a single realization option; e.g. different topologies and protocols are supported. The different types of processors and the hardware accelerators belong to the processor-level. The physical implementation of the MPSoC on the FPGA is handled in the physical-level. The communication-, the processor- and the physical-level are hidden from the programmer and are handled by the software development toolchain. At the MPSoC-level the programmer implements the application in C or C++. The software development toolchain transforms this into a task graph for analysis and partitioning into code fragments for different processors. This results in an adjacency matrix specifying the needed communication infrastructure at the communication-level and leads to the physical instantiation. At the processor-level, the C-code fragments are profiled and then a Hardware/Software Co-design together with a partitioning in time and space is done. The processors used are all state-of-the-art processors, which means that their design tools (e.g. compiler, linker, etc.) exist.
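The step from a partitioned task graph to an adjacency matrix can be illustrated with a minimal sketch. It assumes that each task-graph edge carries a data volume and that every task has already been assigned to a processor; the task names, the volumes and the helper `adjacency_matrix` are hypothetical and are not taken from the RAMPSoC toolchain.

```python
def adjacency_matrix(edges, mapping, num_pes):
    """Accumulate the data volume exchanged between each pair of processors.

    edges:   {(source_task, destination_task): data_volume}   (assumed format)
    mapping: {task: processor_index}                           (assumed format)
    """
    matrix = [[0] * num_pes for _ in range(num_pes)]
    for (src, dst), volume in edges.items():
        a, b = mapping[src], mapping[dst]
        if a != b:  # tasks on the same processor need no link
            matrix[a][b] += volume
    return matrix


# Hypothetical four-task application mapped onto three processors.
task_edges = {
    ("read", "filter"): 64,
    ("filter", "detect"): 64,
    ("detect", "write"): 8,
    ("read", "detect"): 16,
}
task_to_pe = {"read": 0, "filter": 1, "detect": 1, "write": 2}

links = adjacency_matrix(task_edges, task_to_pe, num_pes=3)
for row in links:
    print(row)
```

A non-zero entry at row i, column j indicates that processor i sends data to processor j, so a communication link between the two is needed; how the toolchain then chooses between bus, point-to-point and NoC realizations is not covered by this sketch.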
[Fig. 1. Hardware system architecture and software toolchain across the abstraction layers. MPSoC-level: MPSoC, task graph. Communication-level: NoC, point-to-point, bus, adjacency matrix. Processor-level: processors (microprocessors and accelerators), data flow graph.]
A. Hardware Architecture
The hardware architecture of RAMPSoC consists of a processor set, communication infrastructures and hardware accelerators. The supported IP is included in an IP library, which is used by the software toolchain. So far, several 32-bit processors (PowerPC, Xilinx MicroBlaze and Leon Sparc) and one 8-bit processor (Xilinx PicoBlaze) are supported. Additionally, the communication infrastructure can be chosen from several different types, such as bus-based (PLB, OPB, XPS, AMBA), point-to-point (Xilinx Fast Simplex Links (FSL)), NoCs (see [11], [12] and [13]) or a combination of these. Several accelerators (Gauss-, Sobel-, SAD-, 2D-Normalized Correlation-, Hotspot-, Coldspot- and Median-Filter, etc.) are also included for speeding up image processing algorithms. The diversity of the library supports the implementation of homogeneous and heterogeneous multiprocessor architectures on the System-on-Chip (SoC). For the demands of image processing applications, a novel NoC called Star-Wheels Network-on-Chip [13] was developed. This network has a heterogeneous topology, as shown in Fig. 2. It uses the so-called wheel topology within the subnets and a star topology between the different subnets. For the communication protocol, a synergy of packet- and circuit-switching is used. Packet-switching is used to establish a communication channel between two communication partners. Moreover, it is used for control purposes, e.g. it regularly checks whether the communication partners still exist or whether their addresses have changed. This way, it supports the exchange, the addition or the deletion of
processors at runtime. Its advantages are high throughput and low latency. It was also shown to be deadlock-free and area-efficient when implemented on FPGAs. Additional benefits are scalability, support for different clock domains and integration into a high-level design tool from Xilinx.
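To make the mixed topology more concrete, the following sketch builds a small graph with a wheel inside each subnet and a star between the subnets, and counts switch-to-switch hops. The wiring is an assumption made for illustration only: each subnet is modeled as a superswitch hub connected to a ring of subswitches, and a single rootswitch connects all superswitches. The actual Star-Wheels structure is defined in [13] and may differ; the function names are hypothetical.

```python
from collections import deque


def build_star_wheels(num_subnets, subswitches_per_subnet):
    """Return an undirected graph {switch: set(neighbours)} (assumed wiring)."""
    links = {}

    def connect(a, b):
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)

    for s in range(num_subnets):
        hub = ("super", s)
        connect("root", hub)                 # star between the subnets
        for i in range(subswitches_per_subnet):
            rim = ("sub", s, i)
            connect(hub, rim)                # spoke of the wheel
            connect(rim, ("sub", s, (i + 1) % subswitches_per_subnet))  # rim
    return links


def hops(links, src, dst):
    """Breadth-first search for the minimum number of links between switches."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # unreachable


net = build_star_wheels(num_subnets=3, subswitches_per_subnet=4)
print(hops(net, ("sub", 0, 0), ("sub", 0, 1)))  # neighbours on the same wheel
print(hops(net, ("sub", 0, 0), ("sub", 1, 0)))  # different subnets, via the root
```

Under these assumptions, traffic inside a subnet stays on the wheel, while traffic between subnets crosses the rootswitch, which is the reason such a sketch can help when estimating which tasks should share a subnet.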
[Fig. 2. Star-Wheels Network-on-Chip topology. Legend: subswitch, superswitch, rootswitch; PE: Processing Element (PE1 to PE4).]
Fig. 3 shows an example of such a RAMPSoC system at one point in time. In this example the processing elements (PEs) are connected over a switch-based NoC. A PE in RAMPSoC is either a processor, with or without accelerators, or a Finite State Machine (FSM) combined with a pure hardware function. The figure further shows a special network component called Virtual-IO. It was designed to receive the incoming data from a video camera and to divide the images into several tiles. In parallel, it collects the computed results from the processors and sends them out to a screen.
[Fig. 3. Example RAMPSoC system on an FPGA at one point in time, with switches, Virtual-I/O, external memory and user applications.]
One of the processors is used to communicate with the user. It receives the applications represented as task graphs and partial bitstreams, which have been generated using the software toolchain of RAMPSoC described in detail in subsection B. This processor runs a special-purpose operating system (OS) called Configuration Access Port-OS (CAP-OS) [14], which is described in more detail in subsection C. It is responsible for the runtime management of the hardware resources and for the scheduling and allocation of the tasks of the task graph.
B. Software Toolchain
Together with the RAMPSoC architecture, a semi-automatic software toolchain [15] was designed. Its tool suite combines commercial and custom tools. At the current state of development some manual steps are still required, but these will be reduced in future versions of the toolchain. The toolchain consists of three phases, as shown in Fig. 4.
[Fig. 4. The three phases of the toolchain. Phase 1 (profiling, tracing, communication analysis, SW/SW partitioning, iteration) takes the C/C++ program and yields the system architecture and a suggested partitioning of the application. Phase 2 yields the identified hotspots for each processor. Phase 3 (compiler, C-to-FPGA compiler) yields the partitioned application: an FPGA bitstream for the complete MPSoC (number of processors, communication infrastructure, hardware accelerators), including software executables for each processor. The tool-suite legend distinguishes commercial tools, custom tools and manual steps; the tools shown include Tracing, SW/SW Partitioning, HClustering, HW/SW Partitioning and Profile_Analyser.]
Phase 1 receives the sequential C/C++ application as an input. The individual functions of the application are then profiled and the communication overhead between them is analyzed. Tracing with a set of common input data is used to extract the call graph of the application. These steps yield the parameters required for the clustering decision of the hierarchical clustering algorithm, which performs the Software/Software partitioning. This way the system architecture and a set of application modules, one for each processor, are defined. This phase therefore takes place at the MPSoC- and the communication-level. The processor-level is handled in Phase 2. Here, each application module is profiled line by line. From this, one or several compute-intensive loops or blocks are extracted and suggested for implementation as a hardware accelerator. Finally, the physical-level is managed in Phase 3. Here the executables for the application modules are built. The hardware accelerators proposed in Phase 2 can be generated using C-to-Gates tools, like ImpulseC, if they are not within the IP library. Finally, the Xilinx tools are used to generate the full and partial bitstreams. To ease the design of the partial bitstreams, an additional tool called GenerateRCS has been developed, which is described in detail in [16].
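The Software/Software partitioning of Phase 1 can be sketched as a simple agglomerative clustering over the traced call graph. This is only an illustration under stated assumptions: the merge criterion used here (always merge the two clusters that exchange the most data) is a common choice, not necessarily the one implemented in the toolchain of [15], and the function names and traffic figures are hypothetical.

```python
def cluster_functions(comm, functions, num_processors):
    """Merge functions into `num_processors` clusters, one per processor.

    comm: {(caller, callee): traced_data_volume}   (assumed format)
    """
    clusters = [{f} for f in functions]  # start with one cluster per function

    def traffic(a, b):
        # Total data exchanged between two clusters, in both directions.
        return sum(comm.get((x, y), 0) + comm.get((y, x), 0)
                   for x in a for y in b)

    while len(clusters) > num_processors:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                t = traffic(clusters[i], clusters[j])
                if best is None or t > best[0]:
                    best = (t, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]  # merge the most heavily communicating pair
        del clusters[j]
    return clusters


# Hypothetical image processing pipeline with traced communication volumes.
functions = ["load", "gauss", "sobel", "threshold", "store"]
comm = {
    ("load", "gauss"): 100,
    ("gauss", "sobel"): 10,
    ("sobel", "threshold"): 100,
    ("threshold", "store"): 5,
}

modules = cluster_functions(comm, functions, num_processors=3)
for module in modules:
    print(sorted(module))
```

Merging by communication volume keeps heavily interacting functions on the same processor and so reduces traffic on the communication infrastructure. A practical partitioner would also weigh the profiled execution time of each function to balance the load, which this sketch omits.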
C. Special Purpose Runtime Operating System
To manage the hardware resources at runtime and to schedule and allocate the tasks to the individual PEs, a special-purpose OS was integrated on one of the processors, as shown in Fig. 3. This OS is called CAP-OS [14], because it controls the access to the single internal configuration access port (ICAP) of the Xilinx FPGAs. Newer FPGAs from Xilinx have two of these ports, but they can only be accessed sequentially. Therefore, access to these ports has to be managed. CAP-OS receives the task graph and the corresponding partial bitstreams, generated at design-time with the software toolchain. Its major tasks are runtime scheduling of the tasks, resource allocation and configuration management. It is implemented on top of Xilkernel, a real-time OS from Xilinx, to reuse the available thread scheduling and hardware drivers.
IV. RESULTS
These characteristics were demonstrated on FPGA-based rapid prototyping platforms in order to evaluate the theoretical approach with real-world application scenarios. In [3], the performance and area requirements for a pure software and a pure hardware implementation of an image processing application have been presented. In [10], the linear speedup achieved using one, two and four processors together with the Virtual-IO component, which splits the input image into an appropriate number of tiles, has been presented. In [13], the results for a RAMPSoC using the Star-Wheels NoC and an image processing application are presented. In [14], the results obtained using CAP-OS and five processors, which are connected over FSLs, are presented. In [15], the results achieved by using the semi-automatic toolchain to partition a complex image processing algorithm are shown.
V. CONCLUSIONS AND FUTURE WORK
RAMPSoC provides a holistic approach for FPGA-based runtime reconfigurable Hardware-Software Co-design through a consistent abstraction of the different levels of software and hardware.
RAMPSoC differs from existing MPSoC approaches in its capability to handle flexibility and adaptation mechanisms at design-time and at runtime. The increased flexibility comes with an extended design space, whose complexity, in particular that of the lower levels, is hidden from the user by a dedicated toolchain. This will allow RAMPSoC to gain broader acceptance in several development domains than previous solutions, which forced developers to have a deep understanding of the hardware layers. The advantages of this Meet-in-the-Middle approach will be evaluated with a real-world, real-time application from the image processing domain. The novel hardware architecture, combined with the corresponding software toolchain and the CAP-OS of RAMPSoC, is a promising solution for the upcoming frontiers in multi- and many-core architectures.
The novel Meet-in-the-Middle approach of RAMPSoC opens up new degrees of freedom for efficient application mapping onto future MPSoC systems at design-time as well as at runtime.
REFERENCES
[1] Intel Core 2 Quad Processor Product Brief, Available at http://www.intel.com
[2] W. Wolf: The Future of Multiprocessor Systems-on-Chips; In Proc. Design Automation Conference (DAC 2004), pp. 681-685, June 2004.
[3] D. Göhringer, M. Hübner, V. Schatz, J. Becker: Runtime Adaptive Multi-Processor System-on-Chip: RAMPSoC; In Proc. of IPDPS 2008, April 2008.
[4] P. Lysaght, B. Blodget, J. Mason, J. Young, B. Bridgford: Invited Paper: Enhanced Architectures, Design Methodologies and CAD Tools for Dynamic Reconfiguration of Xilinx FPGAs; In Proc. of FPL 2006, pp. 1-6, Aug. 2006.
[5] D. Göhringer, T. Perschke, M. Hübner, J. Becker: A Taxonomy of Reconfigurable Single-/Multi-Processor Systems-on-Chip; International Journal of Reconfigurable Computing, vol. 2009, Article ID 395018, Hindawi, 2009.
[6] D. Burke, J. Wawrzynek, K. Asanovic, A. Krasnov, A. Schultz, G. Gibeling, P.-Y. Droz: RAMP Blue: Implementation of a Manycore 1008 Processor System; In Proc. of RSSI 2008, July 2008.
[7] Dynamic System Reconfiguration in Heterogeneous Platforms: The MORPHEUS Approach; Springer, 2009.
[8] C. Bobda, T. Haller, F. Mühlbauer, D. Rech, S. Jung: Design of adaptive multiprocessor on chip systems; In Proc. of SBCCI '07, pp. 177-183, Sept. 2007.
[9] C. Claus, W. Stechele, A. Herkersdorf: Autovision - a run-time reconfigurable MPSoC architecture for future driver assistance systems; Information Technology Journal, vol. 49, no. 3, pp. 181-187, 2007.
[10] D. Göhringer, M. Hübner, T. Perschke, J. Becker: New Dimensions for Multiprocessor Architectures: On Demand Heterogeneity, Infrastructure and Performance through Reconfigurability: The RAMPSoC Approach; In Proc. of FPL 2008, pp. 495-498, Sept. 2008.
[11] L. Braun, D. Göhringer, T. Perschke, V. Schatz, M. Hübner, J. Becker: Adaptive Real Time Image Processing exploiting 2 Dimensional Reconfigurable Architecture; Journal of Real-Time Image Processing, vol. 4, no. 2, pp. 109-125, Springer, 2009.
[12] M. Hübner, L. Braun, D. Göhringer, J. Becker: Run-Time Reconfigurable Adaptive Multilayer Network-on-Chip for FPGA-based Systems; In Proc. of IPDPS 2008, April 2008.
[13] D. Göhringer, B. Liu, M. Hübner, J. Becker: Star-Wheels Network-on-Chip Featuring a Self-Adaptive Mixed Topology and a Synergy of a Circuit- and a Packet-Switching Communication Protocol; In Proc. of FPL 2009, pp. 320-325, Sept. 2009.
[14] D. Göhringer, M. Hübner, E. Nguepi Zeutebouo, J. Becker: CAP-OS: Operating System for Runtime Scheduling, Task Mapping and Resource Management on Reconfigurable Multiprocessor Architectures; In Proc. of IPDPS 2010, April 2010, in press.
[15] D. Göhringer, M. Hübner, M. Benz, J. Becker: A Semi-Automatic Toolchain for Reconfigurable Multiprocessor Systems-on-Chip: Architecture Development and Application Partitioning; In Proc. of FPGA 2010, Feb. 2010.
[16] D. Göhringer, J. Luhmann, J. Becker: GenerateRCS: A High-Level Design Tool for Generating Reconfigurable Computing Systems; In Proc. of VLSI-SoC 2009, Oct. 2009.