Multimedia Core
Dominic A. Antonelli
University of California,
Berkeley
Computer Science Division
Berkeley, CA, 94720-1776
dantonel at eecs.berkeley.edu
University of California,
Berkeley
Computer Science Division
Berkeley, CA, 94720-1776
smith at eecs.berkeley.edu
NXP Semiconductors
1140 Ringwood Court
San Jose, CA
jan-willem.van_de_waerdt at nxp.com
ABSTRACT
General Terms
Measurement, Performance
1.
INTRODUCTION
Peak power and total energy consumption are key factors in the design of embedded microprocessors.
High peak power increases packaging costs, and high energy consumption reduces battery life. A wide
variety of techniques targeting many different aspects of processor design have been proposed to
keep these factors under control. Embedded system designers face a tough task in choosing which
techniques to apply and must rely heavily on prior, published evaluations of the effectiveness of
these techniques. Unfortunately, such evaluations often make assumptions about the software and
hardware that may hold in a research study but not in a production environment.
For evaluations based on multimedia workloads, many researchers use benchmark code from
MediaBench [4] or other multimedia benchmark suites. However, this software is at most
compiler-optimized and is seldom optimized for the target processor(s), so it behaves differently
from production (hand-optimized) software. Also, techniques are often compared against baselines
that are not realistic. Researchers often ignore architectural features such as guarded (predicated)
execution that can greatly reduce the number of branches. Many cache techniques are compared against
a baseline of an N-way set-associative cache in which all N tag and data ways are accessed in
parallel. Because of its high power overhead, this baseline configuration is rarely used in modern
embedded processors such as the TriMedia TM3270 processor from NXP Semiconductors, so comparing
against it can inflate measured power savings. As we shall see, a simple optimization can often
eliminate a large percentage of the power consumption in a particular portion of a processor, which
makes further optimizations to the same portion insignificant.
To show that these issues are significant, we start with a real design, the TM3270, and use
commercial synthesis and simulation tools to obtain a detailed breakdown of where power is being
consumed. This yields accurate power consumption data, at the cost of large amounts of simulation
time. We run both reference (compiler-optimized) and hand-optimized versions of several
applications, both with the same standard level of compiler optimization turned on, and we compare
the results. The applications we use are an ADPCM decoder, an MPEG-2 encoder, and an MP3 decoder.
From this data, we find that there is a significant difference between the power profiles of
reference and optimized code, mainly due to increased functional unit utilization and differing
instruction mixes. Finally, we evaluate several simple power optimizations that we found during
this analysis, such as skipping tag accesses on sequential cache accesses.
This paper summarizes a much more comprehensive study available in [2]; the reader is referred to
that study for a more extensive and complete discussion of the issues presented here and of
related issues.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CF'09, May 18-20, 2009, Ischia, Italy.
Copyright 2009 ACM 978-1-60558-413-3/09/05 ...$5.00.
2.
EFFECTS OF OPTIMIZATION
In this section, we look at some of the effects of hand-optimization. First, we examine the direct
power and energy increases caused by hand-optimization; then we look at changes in the amount of
loop processing, address generation, data transformation, and memory access overhead. We find that
hand-optimization can reduce these overheads, thus reducing total energy consumption as well as
improving performance.
2.1
                        Power                       Energy
                adpcm   mpeg2   mp3         adpcm   mpeg2   mp3
Instr. Fetch    1.11    1.31    1.29        0.363   0.080   0.038
Load/Store      2.22    1.85    3.01        0.730   0.113   0.089
Func. Units     1.48    1.66    1.69        0.486   0.102   0.050
Reg. File       1.55    1.71    1.76        0.510   0.105   0.052
Decode          1.32    1.28    1.10        0.435   0.079   0.033
Overall         1.47    1.51    1.71        0.484   0.093   0.051
Table 1: This table shows the relative power consumption and energy consumption of the
hand-optimized version of each application compared to the reference version when running at
400 MHz on the design synthesized for 400 MHz.
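Because energy is average power multiplied by run time, the power and energy ratios in Table 1 together imply how much shorter the hand-optimized run is. The sketch below works this out from the Overall row, assuming both ratios are averages over each version's complete run. The name `relative_runtime` is ours and does not come from the study.

```python
# Overall row of Table 1: (relative power, relative energy), optimized / reference.
TABLE1_OVERALL = {
    "adpcm": (1.47, 0.484),
    "mpeg2": (1.51, 0.093),
    "mp3": (1.71, 0.051),
}


def relative_runtime(rel_power: float, rel_energy: float) -> float:
    """Energy = average power x time, so t_opt / t_ref = (E_opt / E_ref) / (P_opt / P_ref)."""
    return rel_energy / rel_power


for app, (rel_power, rel_energy) in TABLE1_OVERALL.items():
    t = relative_runtime(rel_power, rel_energy)
    print(f"{app}: runtime ratio {t:.3f}, implied speedup {1 / t:.1f}x")
```

Under these assumptions the implied speedups are roughly 3x for adpcm, 16x for mpeg2, and more than 30x for mp3, which is why total energy falls even though average power rises in every category.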
Application:        ADPCM            MPEG-2           MP3
Version:          Ref.   Opt.     Ref.   Opt.     Ref.   Opt.
ALU (5)           1.94   3.11     0.84   1.66     0.88   1.04
DSPALU (3)        0.11   0.35     0.01   0.53     0.00   0.02
DSPMUL (2)        0.00   0.00     0.06   0.43     0.04   0.00
Branch (2)        0.06   0.04     0.06   0.06     0.06   0.05
Load (1)          0.19   0.43     0.14   0.21     0.12   0.22
Super Load (1)    0.00   0.00     0.02   0.04     0.05   0.49
Store (2)         0.06   0.17     0.04   0.20     0.04   0.18
FPALU (2)         0.00   0.00     0.07   0.00     0.07   0.67
FPMUL (2)         0.00   0.00     0.05   0.00     0.08   0.60
Total             2.36   4.11     1.33   3.13     1.39   3.28
NOP               2.64   0.89     3.67   1.87     3.61   1.72
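The caption of the table above is not reproduced here, so the following reading is an assumption: the entries look like average operations per VLIW instruction, because the Total and NOP entries add up to 5.00 for every application and version, which would match five issue slots. Under that assumption, the fraction of issue slots doing useful work can be sketched as follows. `ISSUE_SLOTS` and `utilization` are illustrative names.

```python
ISSUE_SLOTS = 5  # assumption: inferred from Total + NOP = 5.00 in every column

# (Total, NOP) entries of the table, keyed by (application, version).
OPS = {
    ("ADPCM", "Ref."): (2.36, 2.64),
    ("ADPCM", "Opt."): (4.11, 0.89),
    ("MPEG-2", "Ref."): (1.33, 3.67),
    ("MPEG-2", "Opt."): (3.13, 1.87),
    ("MP3", "Ref."): (1.39, 3.61),
    ("MP3", "Opt."): (3.28, 1.72),
}


def utilization(total_ops: float, slots: int = ISSUE_SLOTS) -> float:
    """Fraction of issue slots filled with a non-NOP operation."""
    return total_ops / slots


for (app, version), (total, nop) in OPS.items():
    # Check the assumption that useful operations and NOPs fill all slots.
    assert abs(total + nop - ISSUE_SLOTS) < 0.01
    print(f"{app:7s} {version:5s} slot utilization {utilization(total):.1%}")
```

Read this way, hand-optimization raises slot utilization in all three applications, which is consistent with the higher functional unit power reported for the optimized code.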
2.2
Overhead Operations
Also, the reduction in memory accesses hints at another effect: hand-optimization can eliminate
repeated loads and stores to the same location. An implementation that performs the DCT,
quantization, etc. in separate functions has to load and store the entire 8x8 block of pixels in
each function, but once the functions are inlined, all of those extra loads and stores disappear.
Once the data has to be read only once and written only once, a mechanism for loading data directly
into registers and storing it directly from registers without polluting the cache could be useful
for increasing effective cache capacity and reducing cache energy.
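The effect of inlining on memory traffic can be illustrated with a small counting model. This is a sketch under stated assumptions: `CountingMemory`, `level_shift`, and `halve` are hypothetical stand-ins (the real stages would be the DCT, quantization, and so on), and a Python list held in a local variable plays the role of the register file.

```python
class CountingMemory:
    """Flat memory that counts every element load and store."""

    def __init__(self, data):
        self.data = list(data)
        self.loads = 0
        self.stores = 0

    def load_block(self, n=64):
        self.loads += n
        return self.data[:n]

    def store_block(self, values):
        self.stores += len(values)
        self.data[: len(values)] = values


# Placeholder stages standing in for the real per-block processing steps.
def level_shift(block):
    return [x - 128 for x in block]


def halve(block):
    return [x // 2 for x in block]


STAGES = [level_shift, halve]


def run_separate(mem):
    """Each stage is its own function: load the 8x8 block, process, store it back."""
    for stage in STAGES:
        mem.store_block(stage(mem.load_block()))


def run_inlined(mem):
    """Stages inlined: load once, keep the block in locals, store once."""
    block = mem.load_block()
    for stage in STAGES:
        block = stage(block)
    mem.store_block(block)


pixels = list(range(64))
separate, inlined = CountingMemory(pixels), CountingMemory(pixels)
run_separate(separate)
run_inlined(inlined)
print("separate:", separate.loads, "loads,", separate.stores, "stores")
print("inlined: ", inlined.loads, "loads,", inlined.stores, "stores")
```

Both versions produce the same block, but the separate version moves the 64 pixels through memory once per stage, while the inlined version touches each pixel once on the way in and once on the way out.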
3.
3.1
                            ------------------------ Overhead ------------------------
Application         True    Loop-Related   Addr Gen   Data Transform   Mem Access   Total    % True
(1) P3 MMX          256     216            122        192              384          914      21.9%
(1) TM3270 Base     256     730            346        192              384          1460     13.4%
(1) TM3270 Opt A    144     26             10         0                80           116      55.4%
(1) TM3270 Opt B    144     1              0          0                74           75       65.8%
(2) TM3270 Opt C    288     1              0          64               74           139      67.5%
(3) TM3270 Pipe     1058    7              230        272              182          691      60.5%
Table 3: Amount of true computation and various types of overhead operations for different
implementations of (1) a 1-D 8x8 DCT, (2) a 2-D 8x8 DCT, and (3) an MPEG-2 texture pipeline.
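The "% True" column of Table 3 appears to be the share of true computation among all operations, that is, True / (True + total overhead). This formula is inferred from the numbers rather than stated in the text. The sketch below recomputes it from the four overhead categories. For the TM3270 Base row those categories sum to 1652, and it is that sum which reproduces the 13.4% in the table.

```python
# (implementation, true ops, (loop-related, addr gen, data transform, mem access), printed % true)
ROWS = [
    ("P3 MMX", 256, (216, 122, 192, 384), 21.9),
    ("TM3270 Base", 256, (730, 346, 192, 384), 13.4),
    ("TM3270 Opt A", 144, (26, 10, 0, 80), 55.4),
    ("TM3270 Opt B", 144, (1, 0, 0, 74), 65.8),
    ("TM3270 Opt C", 288, (1, 0, 64, 74), 67.5),
    ("TM3270 Pipe", 1058, (7, 230, 272, 182), 60.5),
]


def percent_true(true_ops, overheads):
    """Share of operations that are true computation, in percent."""
    return 100.0 * true_ops / (true_ops + sum(overheads))


for name, true_ops, overheads, printed in ROWS:
    computed = percent_true(true_ops, overheads)
    print(f"{name:13s} overhead={sum(overheads):5d}  % true={computed:5.1f}  (table: {printed})")
```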
both of the two words. Also, for performance reasons, the
instruction scheduler does not always put sequential data
accesses back-to-back.
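The scheduling remark matters if, as we assume here, a tag lookup can be skipped only when an access falls in the same cache line as the access immediately before it. The following sketch counts tag lookups for a trace of byte addresses under that simple rule. The 64-byte line size, the `count_tag_lookups` helper, and both traces are illustrative assumptions, not parameters or measurements from the study.

```python
LINE_BYTES = 64  # illustrative line size, not taken from the paper


def count_tag_lookups(addresses, line_bytes=LINE_BYTES):
    """Count tag lookups when a lookup is skipped for any access that falls in
    the same cache line as the immediately preceding access."""
    lookups = 0
    prev_line = None
    for addr in addresses:
        line = addr // line_bytes
        if line != prev_line:
            lookups += 1
        prev_line = line
    return lookups


# Two streams of sequential 4-byte accesses in different memory regions.
stream_a = [0x1000 + 4 * i for i in range(32)]
stream_b = [0x8000 + 4 * i for i in range(32)]

# Same 64 accesses, scheduled in two different orders.
back_to_back = stream_a + stream_b
interleaved = [addr for pair in zip(stream_a, stream_b) for addr in pair]

print("back-to-back:", count_tag_lookups(back_to_back), "tag lookups")
print("interleaved: ", count_tag_lookups(interleaved), "tag lookups")
```

In this model the back-to-back schedule needs one lookup per cache line touched, while the interleaved schedule needs one per access, even though both orders touch exactly the same addresses.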
3.2
Functional Units
      ADPCM              MPEG-2              MP3
  Ref.     Opt.      Ref.     Opt.      Ref.     Opt.
  6.18     10.2      6.64     15.8      6.34     9.13
  1.21     1.21      2.85     11.4      2.30     1.04
  80%      88%       57%      28%       64%      89%
  246.6    363.5     185.2    280.1     185.8    318.0
  240.3    353.4     180.9    275.2     180.9    308.8
  2.5%     2.8%      2.3%     1.8%      2.6%     2.9%
Table 4: This table shows the power savings obtained by giving the first stage of the DSPMUL unit
its own flip-flops and clock-gating them when not in use.
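The row labels of Table 4 are not available here, so the reading below is an inference from the numbers, not something stated in the text. Among the six Ref./Opt. value pairs given for each application (the six rows of the table), the third matches the relative drop from the first to the second, and the sixth matches the relative drop from the fourth to the fifth. That pattern is consistent with a before/after pair for the gated unit followed by a before/after pair for the whole design. The helper name `relative_saving` is ours.

```python
# One tuple per column of Table 4, values in table order (rows 1 to 6).
COLUMNS = {
    ("ADPCM", "Ref."): (6.18, 1.21, 80, 246.6, 240.3, 2.5),
    ("ADPCM", "Opt."): (10.2, 1.21, 88, 363.5, 353.4, 2.8),
    ("MPEG-2", "Ref."): (6.64, 2.85, 57, 185.2, 180.9, 2.3),
    ("MPEG-2", "Opt."): (15.8, 11.4, 28, 280.1, 275.2, 1.8),
    ("MP3", "Ref."): (6.34, 2.30, 64, 185.8, 180.9, 2.6),
    ("MP3", "Opt."): (9.13, 1.04, 89, 318.0, 308.8, 2.9),
}


def relative_saving(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


for (app, version), (r1, r2, r3, r4, r5, r6) in COLUMNS.items():
    unit = relative_saving(r1, r2)   # compare with row 3
    total = relative_saving(r4, r5)  # compare with row 6
    print(f"{app:7s} {version:5s} rows 1-2: {unit:4.1f}% (table {r3}%)  "
          f"rows 4-5: {total:3.1f}% (table {r6}%)")
```

If this reading is right, a saving of well over half in the gated unit still amounts to only 2 to 3% of the larger figure, because the first-row values are small next to the fourth-row values.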
4.
CONCLUSIONS
5.
REFERENCES