Abstract
Soft vector processors in embedded FPGA platforms such as the VectorBlox MXP engine can match the performance and exceed the energy-efficiency of commercial off-the-shelf embedded ARMv7 SoCs with SIMD or GPU accelerators for OpenCV applications such as Saliency detection. We are also able to beat spatial hardware designs built from high-level synthesis while requiring significantly lower programming effort. These improvements are possible through careful scheduling of DMA operations to the vector engine, extensive use of line-buffering to enhance data reuse on the FPGA and limited use of scalar fallback for non-vectorizable code. The driving principle is to keep data and computation on the FPGA for as long as possible to exploit parallelism and data locality and to lower the energy requirements of communication. Using our approach, we outperform all platforms in our architecture comparison while needing less energy. At 640×480 image resolution, our implementation of the MXP soft vector processor on the Xilinx Zedboard exceeds the performance of the Jetson TK1-GPU by 1.5× while needing 1.6× less energy, the Beaglebone Black by 4.7× at 2.3× less energy, the Raspberry Pi by 9× at 4× less energy, and the Intel Galileo by 28× at 16× less energy. Our vector implementation also outperforms the Vivado HLS-generated OpenCV library implementation by 1.5×.

Fig. 1: High-Level Block Diagram of the Zynq SoC with the VectorBlox MXP [9] Soft Vector Processor (ARMv7 CPU with 512 KB caches, MXP SIMD vector engine with 64 KB scratchpad, 32b AXI links, 512 MB off-chip DRAM)
I. INTRODUCTION

The availability of cheap embedded hardware and cheap cameras is enabling a range of innovative applications in embedded vision. Novel implementation domains such as remote monitoring with drones, industrial vision, and home automation along with medical image processing are a burgeoning market for these platforms where cost, power (energy) and performance are simultaneously critical. Commercial off-the-shelf embedded platforms are often based on ARM SoCs but now ship with high-performance NEON SIMD engines and CUDA-programmable GPU accelerators that can, in principle, handle the challenging demands of vision processing. Computations in computer vision are characterized by data-parallel operations on image pixels that naturally fit the SIMD organization of the NEON engines as well as the SIMT style of GPUs. However, in practice, the limited DRAM bandwidths, fixed ISA-based CPU organizations, limited cache capacities and inefficiencies in accelerators prevent these systems from achieving their raw potential. When latency and power considerations are strict, FPGA-based heterogeneous SoCs based on the Zynq are an attractive alternative. However, despite performance and power advantages, FPGA programming is beyond the capability of most embedded developers. To make these systems truly viable to the broader embedded developer community, we need to lower the development cost of spatial hardware designs.

Recent attention to high-level synthesis (HLS) from Xilinx (Vivado HLS, SDAccel) and Altera (OpenCL) suggests a way for FPGA-based designs to bridge the productivity gap with competing hardware platforms. The Vivado HLS flow even supports the OpenCV library [2], [12] (an open-source computer vision library) for simpler high-level specification of image processing pipelines. However, this veneer of simplicity masks the underlying system-level design challenges such as design partitioning, hardware reuse, and data movement. The developer must manually select the buffering strategies to optimize data movement subject to FPGA on-chip memory constraints and consider limited FPGA logic capacity to carefully partition the design for maximum performance. The long FPGA simulation and compilation time slows down the classic edit-compile-debug design cycle, further discouraging prospective developers. Additionally, when newer, more capable FPGAs become available, the system-level design process needs to be repeated all over again. While this is still preferred to
                    Zedboard        Jetson TK1        Intel Galileo   Raspberry Pi   Beaglebone Black
Technology          28nm            28nm              32nm            40nm           45nm
SoC                 Xilinx          NVIDIA            Intel           Broadcom       TI
                    Zynq 7020       Tegra K1          Quark X1000     BCM2835        AM3359
Processor           ARMv7           ARMv7             i586            ARMv6          ARMv7
Accelerator         NEON, FPGA      NEON, CUDA cores  Nil             Nil            NEON
Clock Freq.         667 MHz CPU     2.3 GHz CPU       400 MHz         700 MHz        1 GHz
                    250 MHz FPGA    852 MHz GPU
On-chip Memory      32 KB L1        32 KB L1          16 KB L1        16 KB L1       32 KB L1
                    512 KB L2       2048 KB L2        -               128 KB L2      256 KB L2
                    560 KB FPGA     64 KB L1 + CSM    -               -              -
Off-chip Memory     512 MB          2048 MB           256 MB          512 MB         512 MB
                    32b DDR3-1066   64b DDR3-933      32b DDR3-800    32b DDR2-400   16b DDR3-606
Steady-state Power  5.1 W           3.6 W             3.6 W           2.1 W          2 W

TABLE I: Datasheet specifications of various COTS embedded SoC platforms
vector intrinsics. The scratchpad memory is banked for fast concurrent access to the various lanes and is fully managed by software through the vector DMA API. For optimum performance on image data, line buffers can be copied to the scratchpad using efficient image row-sized DMA transfers over the AXI ports. The scratchpad can be additionally tuned by the programmer to support double-buffered DMA transfers to hide memory loading costs. Within the context of chained OpenCV evaluation, it is further possible to carefully pipeline multiple function calls while keeping data resident in the scratchpad. Additionally, as most OpenCV operations are on 8b data items (few are 16b), the type specified in the instruction is decoded by the soft processor to automatically switch from the 16-lane 32b design to a 32-lane 16b design or a 64-lane 8b design for higher-throughput pixel operations.

D. Embedded Platforms

Ultimately we are interested in choosing the most power-efficient platform for implementing embedded vision computations. To support this quest, we compare performance and power measurements across a set of low-power COTS embedded systems such as the Jetson TK1, Beaglebone Black, and the Raspberry Pi boards. In particular, we are interested in comparing the performance and power efficiency of SIMD and GPU acceleration wherever they are available. We tabulate the raw operating specifications and micro-benchmarking results of these platforms in Table I. From these raw specifications, the faster 4-core ARMv7 CPU on the Jetson TK1 along with its 192 CUDA cores and faster DRAM interface makes it a formidable entry into this architecture comparison. While

III. OPENCV IMPLEMENTATION ON FPGAS

In this section, we outline the computational requirements of the Saliency stack, identify opportunities for parallelism and describe our HLS and MXP programming methodologies.

A. Compute Requirements

Embedded vision computations typically operate on a stream of image frames delivered to the processor from the camera sensor. Saliency computations are notoriously challenging due to the sheer complexity of the underlying computations. A first-cut implementation of Saliency detection running on the ARMv7 CPU without NEON optimizations on a 640×480 VGA frame requires 5.3s to process a single image (300M 8b integer operations) while consuming 4.6 W of power on the Xilinx Zedboard. A NEON-optimized implementation of the same code shows a substantial performance improvement over the vanilla CPU version to deliver a compute time of 0.49s (a 10× saving) at 5.7 W. This is not unexpected as the bulk of the calculations are arithmetic operations in the 2D filter, resulting in high SIMD potential. In Table II, we list the arithmetic and memory complexity of the various image processing tasks required to assemble a complete Saliency stack. While FILTER2D has large arithmetic intensity (several multiply-accumulate operations per memory access), NORMALIZE and ABSDIFF are memory bound, while SUBSAMPLE and INTERPOLATE have irregular, staggered memory accesses. In the final column of Table II, we show the relative distribution of runtime for the various OpenCV functions running on the ARMv7 CPU with NEON acceleration.

TABLE II: Arithmetic operations (+, ×, /), memory operations (Rd, Wr), and time (ms) for each OpenCV function
void cpu_filter2D(uint8_t *in, uint8_t *out)
{
  // loop over row/col
  for (int row = 0; row < M; row++) {
    for (int col = 0; col < N; col++) {
      int sop = 0;
      // loop over 3x3 filter
      for (int krow = 0; krow < 3; krow++) {
        for (int kcol = 0; kcol < 3; kcol++) {
          sop += kernel[krow][kcol] *
                 in[(row+krow)*N + (col+kcol)];
        }
      }
      out[row*N + col] = sop/scale;
    }
  }
}

Fig. 3: C implementation of 2D filter

void vbx_filter2D(uint8_t *in, uint8_t *out)
{
  // Allocate scratchpad memory (DMA setup not shown)
  uint8_t *top = vbx_sp_malloc(N);
  uint8_t *mid = vbx_sp_malloc(N);
  uint8_t *bot = vbx_sp_malloc(N);
  // Also allocate 9 buffers for 3x3 temp results
  // (top_11, top_12, top_13,
  //  top_21, top_22, top_23,
  //  top_31, top_32, top_33)
  // plus a row_out buffer for the scaled result

  for (int row = 1; row < M; row++) // row loop
  {
    // load new row
    vbx_dma_to_vector(bot, in+N*(row+2), N);

    // multiply rows in unrolled fashion
    vbx(SVBHU, VSHL, top_11, k_11, top);
    vbx(SVBHU, VSHL, top_12, k_12, top);
    vbx(SVBHU, VSHL, top_13, k_13, top);
    // repeat for mid and bot rows

    // add rows in unrolled fashion
    vbx(VVHU, VADD, top_sum, top_11, top_12+1);
    vbx(VVHU, VADD, top_sum, top_sum, top_13+2);
    // repeat for other rows

    // scale result and copy row back
    vbx_dma_to_host(out+(N*row), row_out, N);

    // update pointers
    uint8_t *temp = top; top = mid; mid = bot; bot = temp;
  }
  vbx_sync();
}

Fig. 4: MXP implementation of 2D filter (significantly simplified)

B. Optimizing Soft Vector Performance

We first show a scalar representation of a 2D filter code snippet in Figure 3 with doubly-nested for loops to process each pixel and an additional doubly-nested for loop within to process each kernel coefficient. The vectorized MXP version of this code using the vector APIs is shown in Figure 4. You can observe that the outer loop over rows is still retained and controlled from the scalar CPU core, while the inner loop over columns has been vectorized. Additionally, we also unroll the kernel loops into separate vector API calls. This separation allows a straightforward decoupling of memory and compute operations, and encourages overlapping of compute and DMA transfers. Note that it is possible to use an alternative form of vectorization where the internal 2D kernel operation is vectorized, but we observed faster performance when using the row-level parallelization strategy shown in Figure 4.
Based on this example, we can generalize the MXP programming strategy as consisting of three key components which we illustrate in Figure 5:

• The scalar ARMv7 CPU allocates the input image buffer in the memory for transfer to the MXP.
• The CPU then sends DMA API commands to the MXP DMA engine to load the scratchpad with appropriate data.
• Finally, the vector instructions are sent to the MXP decoder based on the vector API calls used in the CPU code.

Using this flow, a straightforward functionally-correct implementation of saliency can be easily developed. However, there are opportunities for optimization that can improve performance by as much as 40-50%. The choice of buffering strategy is crucial to enable high-performance composition of multiple OpenCV functions in a sequence. This fact has long been recognized when designing high-performance streaming spatial hardware and is not unique to vision applications. However, choosing which mix of buffering strategies to use and to what extent is still tricky. While we perform the obvious double-buffering of inputs and outputs to overlap vector computation with DMA memory transfers to/from the DRAM, the largest wins come from line buffering [13].

Line Buffering: From the perspective of performance and power constraints, keeping data in on-chip memories for as long as possible is important for performance as well as power. However, when processing large image frames on limited hardware, this may not always be possible. This leads to two strategies: (1) limited use of line buffering in on-chip BRAM wherever possible, and (2) relegation to frame buffering in off-chip DRAM wherever necessary, for composing OpenCV compute pipelines in hardware. We show an example OpenCV function sequence f→g in Figure 6 such that MXP compute time is dwarfed by DMA transfer time. In this situation, keeping data on-chip is great for performance as it permits f and g to be scheduled back-to-back. Line-buffering enables overlapped evaluation of downstream OpenCV functions that can directly consume data from the scratchpad, thereby avoiding a round-trip to the DRAM. Even if OpenCV compute times are longer than DMA times, avoiding DRAM access is good for lowering IO power.

As an example, consider the PYRAMID FILTER calculation which contains an alternating series of FILTER2D and SUBSAMPLE stages (up to 5 levels deep in our implementation). When implemented in spatial hardware, these can be implemented in dataflow streaming style where both functions operate concurrently. For soft vector processors, this
opportunity allows us to chain the OpenCV functions at the granularity of rows (or lines) instead of entire frames by scheduling consecutive row operations from the dependent OpenCV functions together. While this form of parallelism is natural for image processing pipelines, we must analyze function dependencies and memory access patterns across function boundaries to enable this global optimization. The careful reader may observe that this cannot be applied throughout the compute pipeline: the NORMALIZE operator breaks the link by requiring a complete frame buffer accumulation before applying a suitable scaling factor to all pixels. Additionally, the 64KB on-chip scratchpad capacity further limits the number of rows that can be buffered before we must spill out into the DRAM. Line buffering is a critical optimization that helps MXP performance approach that of fully spatial hardware designs with streaming on-chip buffers.

Fig. 5: MXP Flow showing different stages of execution

Fig. 6: Comparing Line-Buffering (bottom) and Frame-Buffering (top). Frame-buffering runs all rows of function F (Rd A0-A2, Wr B0-B2) before any row of G (Rd B0-B2, Wr C0-C2); line-buffering interleaves them row-by-row (F0, G0, F1, G1, F2, G2) so G consumes F's rows without a DRAM round-trip for the intermediate buffer B.

Fig. 7: HLS Operation

Beyond buffering, there are various computation-related optimizations necessary for efficiency:

• Variable-Resolution Pipelining: Saliency is characterized by variations in image resolutions through the execution flow due to SUBSAMPLE and INTERPOLATE operations. This means there is often a mismatch between input and output data rates to the vector blocks, resulting in a potential for wasted cycles. A naïve approach would offload the data rearrangement functions back to the CPU, but that would result in needless round-trip DMA transfers of data. Instead, we use type-conversion vector operations to repack 8b data into 16b boundaries. While this is admittedly a slight waste of parallel lanes, (1) it avoids DMA overheads by rearranging data in-place, and (2) it keeps the hardware permutation fabric simple by avoiding the need for an expensive interconnect between the vector lanes that can reduce operating frequency.

• Dead-pixel Removal: If we analyze certain OpenCV function sequences, we realize that in some instances pixel results are unused by the downstream operators. These pixels are invisible to the rest of the compute stages and can be considered dead pixels. We can safely remove the computation associated with these pixels without affecting correctness. An example of this is the FILTER2D and SUBSAMPLE sequence, where odd rows and columns can be removed.

C. Optimizing High-Level Synthesis Performance

Traditional high-level synthesis tools compile high-level C/C++ code blocks, typically organized as for loops, into pipelined RTL datapaths with suitable streaming interfaces. While it is possible to directly compile the C code shown in Figure 3 with synthesis directives and compiler hints, Xilinx provides an alternative method that directly uses an optimized OpenCV library. In Figure 8, we show the 2D filter being described using the Vivado HLS OpenCV library approach. The HLS compiler translates this description into pipelined OpenCV RTL datapaths from their internal pre-synthesized library. When integrating with the Zynq SoC, VDMA engines are needed for connecting the AXI streaming interfaces of the OpenCV datapaths to the DRAM where the frame buffers are allocated. Such a streaming design, where multiple OpenCV datapaths are chained together with adequate internal buffering, is the key source of the performance benefits of the FPGA fabric.

For spatial pipelines generated by high-level synthesis, a key consideration is whether the design will fit the intended FPGA platform. In our case, preliminary experiments for a 640×480 design suggest that, for the Zedboard, we may be able to barely fit an Intensity Map calculation on the FPGA fabric. A fully spatial implementation of the complete design is estimated to require 11 Zedboards' worth of logic. For our application scenario, we are forced to perform logic reuse within the allocated resource constraints. HLS-based directives for resource sharing can help generate smaller footprints for

void hls_filter2D(AXI_STREAM& in,
                  AXI_STREAM& out)
{
  GRAY_IMAGE in_img(ROWS, COLS);
  GRAY_IMAGE out_img(ROWS, COLS);
  // Convert in image stream to Mat
  hls::AXIvideo2Mat(in, in_img);
  hls::GaussianBlur<3,3>(in_img, out_img);
  hls::Mat2AXIvideo(out_img, out);
}

Fig. 8: HLS implementation of 2D filter
Platform    OS                                gcc/g++   Total Power (W)   Active Power (W)
Zedboard    Xillinux 1.3 (MXP is bare-metal)  4.6.3     5.7 (NEON)        1.3
...

TABLE III: Total and active power usage of the platforms

Fig. 10: FPS/Watt comparison of the Jetson and Zedboard platforms
Fig. 11: Impact of Image Resolution on total compute time (Zedboard vs. Jetson TK1; runtime in ms for the Zedboard NEON, Jetson NEON and Jetson GPU implementations)

Fig. 12: Quantifying the impact of MXP Optimizations (relative time of Filter2D, Subsample, Interpolate, Center Surround, Pyramid Filter and Saliency Map under Unoptimized, Compress+Stretch, Double Buffering and Line Buffering variants)
less energy efficient, consuming 1.8 J of energy. The rest of the platforms are neither fast nor energy efficient compared to the Zedboard MXP and Jetson NEON/GPU implementations. We also show a breakdown of total and dynamic power usage of the different platforms in Table III. We plot the FPS/Watt figures for the Jetson and Zedboard platforms in Figure 10 and observe that the MXP (FPGA) has better energy efficiency compared to all other platforms at 0.8 FPS/W.

We further quantify performance scaling trends at various image resolutions in Figure 11 for the high-performing Zedboard and Jetson TK1 platforms. At resolutions below 640×480, the 2.3 GHz Jetson ARMv7 CPU easily outperforms all other platforms by a wide margin. And at the largest 1920×1080 resolution, the Jetson 192-CUDA-core GPU beats the MXP implementation by 40%. Across all resolutions, the MXP is either the second-best or the best running implementation (at 640×480 and 1080×720).

OpenCV Function    LUTs          FFs           DSPs         BRAMs (18kB)   Time (ms) @ 640×480
(% of FPGA in parentheses)
FILTER2D           812 (1.5%)    533 (0.5%)    4 (1.8%)     3 (1%)         3.18
RESIZE             6K (11.5%)    6.8K (6.5%)   16 (7.2%)    2 (0.7%)       3.33-6.42
NORMALIZE          2.2K (4.3%)   1.9K (1.8%)   4 (1.8%)     3 (1%)         3.12
ADDWEIGHTED        3.9K (7.5%)   2.6K (2.5%)   28 (12.7%)   0 (0%)         4.4
ABSDIFF            361 (0.7%)    302 (0.3%)    0 (0%)       0 (0%)         4.5
PYRAMID FILTER     26.3K (50%)   28.6K (27%)   68 (31%)     11 (4%)        8.9
CENTER SURROUND    5.9K (11%)    6.9K (6.5%)   16 (7.2%)    2 (0.7%)       1.7-6.43
INTENSITY MAP      72.6K (136%)  70.2K (66%)   232 (105%)   21 (7.5%)      36.8
COLOR MAP          100K (189%)   98.8K (93%)   300 (136%)   32 (11.5%)     87.5
ORIENTATION MAP    406K (763%)   331K (311%)   1.1K (525%)  104 (37%)      140
VI. DISCUSSION

A. Related Work

FPGA accelerators for saliency detection have previously been demonstrated in [7] (Berkeley Calinx board) and [1] (ML605 board). In these designs, the complete compute stack fits on the FPGAs through manual time-multiplexing of stencil hardware with intermediate results stored on off-chip DRAM when needed. The ML605 implementation runs at 4 Mpixels/s for a modified implementation of Saliency while the Calinx version reports no results. In [3], the authors make a case for FPGAs in mobile vision processing applications using Haar detection as a case study. Oddly, they choose to compare a Google Nexus One with an FPGA implementation on a large Stratix IV GX EP4SGX530 chip (DE4-530 board) instead of using a lower-power Cyclone series. With the large chip, an array of SIMD units was used to implement the entire cascade of Haar classifiers, with each SIMD unit configured as a streaming hardware block. They were able to classify 320×240 images in 20 ms on the large FPGA, which is 60× faster than the mobile SoC. An embedded implementation of vision processing tasks on the Zynq platform is presented in [4], where multiple pre-compiled Acadia Vision IP cores are mapped to the FPGA fabric and interface with the ARMv7 host via VDMA blocks. For certain vision tasks like contrast normalization, image stabilization and moving target indication, they report processing rates of 15 fps at 640×480 resolution on the ZC702 and ZC706 boards. IPPro [10] is a soft processor core optimized for image processing on the Zedboard platform and has been demonstrated to deliver 2.3 fps for a traffic sign detection algorithm on images of 600×400 resolution when using a 32-core design. Only color filtering and morphological operations ran on the soft cores, with the rest running on the ARMv7 CPU. Our MXP implementation targets a far more complex vision task than the ones reported in [4], [10] on the Zedboard while running at 5 fps for 640×480 images. An alternative to the Vivado HLS OpenCV library is presented in [8], where a few OpenCV functions such as the Gaussian Filter and Sobel Filter are implemented as optimizable nested loops. While they deliver slightly faster performance compared to Vivado OpenCV, their key contribution is that they enable design space exploration by exposing optimization hooks (HLS directives), unlike the black-box approach taken by the Vivado OpenCV library. Compared to MXP implementations, these still need a full place-and-route flow.

B. Notes on Soft Vector Portability

A key advantage of the MXP is ease of design portability across platforms from the perspective of the embedded developer. We effortlessly set up the DE2-115 board with a 16-lane, 100 MHz, 64 KB-scratchpad design that ran as fast as the Zedboard NEON implementation. Similarly, we also recompiled the exact same saliency vector code to target a 32-lane, 185 MHz, 128 KB-scratchpad design on the DE4-230 to run about 3-4× faster. Despite all the advantages, there are several areas for improvement to the soft vector processor design for pixel processing. To efficiently support variations in resolution and data movement within the image frame, the vector lanes should support a configurable routing pattern. Another key constraint that limits the filtering operations is the absence of fused multiply-accumulate support (3 reads, 1 write).

VII. CONCLUSIONS

Soft vector processors such as the MXP deliver a competitive energy-efficient solution for embedded vision applications such as Saliency detection. For moderate image resolutions such as 640×480 and 1280×1024, the Zedboard-MXP implementation is the fastest and the most energy-efficient solution compared to the Jetson TK1, Beaglebone Black, Raspberry Pi and the Intel Galileo platforms. We are able to deliver these improvements by careful scheduling of DMA operations and exploiting line-buffering techniques to enhance data reuse. In capacity-constrained systems for embedded vision applications, we expect soft vector processors to offer a high-performance, low-energy solution for demanding applications.

REFERENCES

[1] S. Bae, Y. C. P. Cho, S. Park, K. M. Irick, Y. Jin, and V. Narayanan. An FPGA implementation of information theoretic visual-saliency system and its optimization. In Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on, pages 41-48. IEEE, 2011.
[2] G. Bradski. The OpenCV Library. Dr. Dobb's Journal, 2000.
[3] B. Brousseau and J. Rose. An energy-efficient, fast FPGA hardware architecture for OpenCV-compatible object detection. In Field-Programmable Technology (FPT), 2012 International Conference on, pages 166-173, 2012.
[4] E. Gudis, P. Lu, D. Berends, K. Kaighn, G. van der Wal, G. Buchanan, S. Chai, and M. Piacentino. An embedded vision services framework for heterogeneous accelerators. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on, pages 598-603, 2013.
[5] L. Itti and C. Koch. A model of saliency-based visual attention for rapid scene analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(11), Jan. 2002.
[6] S. J. Jie and N. Kapre. Comparing soft and hard vector processing in FPGA-based embedded systems. In 24th International Conference on Field Programmable Logic and Applications (FPL 2014), Munich, Germany, 2-4 September 2014, pages 1-7, 2014.
[7] N. Kapre, D. Walther, C. Koch, and A. DeHon. Saliency on a chip: a digital approach with an FPGA. The Neuromorphic Engineer, pages 1-2, Nov. 2004.
[8] M. Schmid, N. Apelt, F. Hannig, and J. Teich. An image processing library for C-based high-level synthesis. In Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, pages 1-4, 2014.
[9] A. Severance and G. G. Lemieux. Embedded supercomputing in FPGAs with the VectorBlox MXP Matrix Processor. In Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2013 International Conference on, pages 1-10. IEEE, 2013.
[10] F. M. Siddiqui, M. Russell, B. Bardak, R. Woods, and K. Rafferty. IPPro: FPGA based image processing processor. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pages 1-6, 2014.
[11] M. Weinhardt and W. Luk. Pipeline vectorization for reconfigurable systems. In Field-Programmable Custom Computing Machines (FCCM '99), 7th Annual IEEE Symposium on, pages 52-62, 1999.
[12] Xilinx, Inc. Xilinx XAPP1167: Accelerating OpenCV Applications with Zynq-7000 All Programmable SoC using Vivado HLS Video Libraries. Technical report, Feb. 2009.
[13] H. Yu and M. Leeser. Automatic sliding window operation optimization for FPGA-based computing boards. In Field-Programmable Custom Computing Machines (FCCM '06), 14th Annual IEEE Symposium on, pages 76-88, 2006.