
Power Consumption and Reduction in a Real, Commercial Multimedia Core

Dominic A. Antonelli
University of California, Berkeley
Computer Science Division
Berkeley, CA 94720-1776
dantonel at eecs.berkeley.edu

Alan Jay Smith
University of California, Berkeley
Computer Science Division
Berkeley, CA 94720-1776
smith at eecs.berkeley.edu

Jan-Willem van de Waerdt
NXP Semiconductors
1140 Ringwood Court
San Jose, CA
jan-willem.van_de_waerdt at nxp.com

ABSTRACT

Peak power and total energy consumption are key factors in the design of embedded microprocessors. Many techniques have been shown to provide great reductions in peak power and/or energy consumption. Unfortunately, several unrealistic assumptions are often made in research studies, especially in regard to multimedia processors. This paper focuses on power reduction in real, commercial processors, and on how that differs from more abstract research studies. We study the power consumption of the TriMedia TM3270, an embedded, synthesized microprocessor used in several commercial products, on both reference benchmark code and hand-optimized code, using commercial synthesis and simulation tools. We find that increased functional unit utilization and memory access density cause significant differences in power consumption between compiler-optimized and carefully hand-optimized code. We also apply some simple techniques for power savings with no performance degradation, though the focus of the paper is the evaluation of such techniques, not the techniques themselves.

Categories and Subject Descriptors


C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors) - Single-instruction-stream, multiple-data-stream processors (SIMD)

General Terms
Measurement, Performance

1. INTRODUCTION

Peak power and total energy consumption are key factors in the design of embedded microprocessors. High peak power increases packaging costs, and high energy consumption reduces battery life. A wide variety of techniques targeting many different aspects of processor design have been proposed to keep these factors under control. Embedded system designers face a tough task in choosing which techniques to apply, and must rely heavily on prior and published evaluations of the effectiveness of these techniques. Unfortunately, this type of evaluation often makes assumptions about the software and hardware that may not hold in a production environment, as opposed to a research study.

For evaluations based on multimedia workloads, many researchers use benchmark code from MediaBench [4] or other multimedia benchmark suites. However, this software is at most compiler-optimized, and seldom optimized for the target processor(s), resulting in different behavior than with production (hand-optimized) software. Also, techniques are often compared against baselines that are not realistic. Researchers often ignore architectural features, such as guarded (predicated) execution, that can greatly reduce the number of branches. Many cache techniques are compared against a baseline N-way set-associative cache in which all N tag and data ways are accessed in parallel. Because of its high power overhead, this baseline configuration is rarely used in modern embedded processors such as the TriMedia TM3270 processor from NXP Semiconductors, so comparing against it can inflate power savings measurements. As we shall see, a simple optimization can often eliminate a large percentage of the power consumption in a particular portion of a processor, making further optimizations to the same portion insignificant.
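To make the baseline point concrete, the sketch below applies a first-order lookup-energy model to an N-way set-associative cache. The per-way energies are illustrative placeholders chosen only to show relative magnitudes; they are not TM3270 measurements.

    # First-order lookup-energy model for an N-way set-associative cache.
    # E_TAG and E_DATA are illustrative per-way read energies, not real data.
    E_TAG, E_DATA, N = 1.0, 4.0, 4

    parallel = N * E_TAG + N * E_DATA  # research baseline: read all ways at once
    phased   = N * E_TAG + 1 * E_DATA  # common embedded design: data read after tag hit

    ideal_way_predicted = E_TAG + E_DATA  # perfect way prediction, for scale

    for name, baseline in [("parallel", parallel), ("phased", phased)]:
        saving = 1 - ideal_way_predicted / baseline
        print(f"savings vs {name} baseline: {saving:.0%}")
    # ~75% vs the parallel baseline, but only ~38% vs the phased one: the
    # same technique looks twice as good against the unrealistic baseline.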
To show that these issues are significant, we start with a real design - the TM3270 - and use commercial synthesis and simulation tools to obtain a detailed breakdown of where power is being consumed. This yields accurate power consumption data, at the cost of large amounts of simulation time. We run both reference (compiler-optimized) and hand-optimized versions of several applications, both with the same, standard level of compiler optimizations turned on, and we compare the results. The applications we use are an ADPCM decoder, an MPEG-2 encoder, and an MP3 decoder. From this data, we find that there is a significant difference between the power profiles of reference and optimized code, mainly due to increased functional unit utilization and differing instruction mixes. Finally, we evaluate several simple power optimizations that we found during this analysis, such as skipping tag accesses on sequential cache accesses.

This paper summarizes a much more comprehensive study, available in [2], and the reader is referred to that report for a more extensive and complete discussion of the issues presented here and of related issues.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CF'09, May 18-20, 2009, Ischia, Italy.
Copyright 2009 ACM 978-1-60558-413-3/09/05 ...$5.00.


2. EFFECTS OF OPTIMIZATION

In this section, we look at some of the effects of hand-optimization. First, we examine the direct power and energy increases caused by hand-optimization; then we look at changes in the amount of loop processing, address generation, data transformation, and memory access overheads. We find that hand-optimization can reduce these overheads, thus reducing total energy consumption as well as improving performance.

2.1 Power and Energy

Table 1 shows the relative increase in power and energy caused by hand-optimization. Although there is an increase in power in all cases, the largest increase is in the load/store unit (LSU), while the smallest is in the instruction fetch unit (IFU) and decode logic. The large increase in LSU power makes sense, since the optimizations used generally reduced the amount of computation needed per load and store, thus increasing the density of loads and stores. As we can see from Table 2, the total number of load and store operations per VLIW more than doubled in all three applications when hand-optimized.

              Power                    Energy
              ADPCM  MPEG-2  MP3      ADPCM  MPEG-2  MP3
Instr. Fetch   1.11   1.31   1.29     0.363  0.080   0.038
Load/Store     2.22   1.85   3.01     0.730  0.113   0.089
Func. Units    1.48   1.66   1.69     0.486  0.102   0.050
Reg. File      1.55   1.71   1.76     0.510  0.105   0.052
Decode         1.32   1.28   1.10     0.435  0.079   0.033
Overall        1.47   1.51   1.71     0.484  0.093   0.051

Table 1: Relative power consumption and energy consumption of the hand-optimized version of each application, compared to the reference version, when running at 400MHz on the design synthesized for 400MHz.
                ADPCM          MPEG-2         MP3
                Ref.   Opt.    Ref.   Opt.    Ref.   Opt.
ALU (5)         1.94   3.11    0.84   1.66    0.88   1.04
DSPALU (3)      0.11   0.35    0.01   0.53    0.00   0.02
DSPMUL (2)      0.00   0.00    0.06   0.43    0.04   0.00
Branch (2)      0.06   0.04    0.06   0.06    0.06   0.05
Load (1)        0.19   0.43    0.14   0.21    0.12   0.22
Super Load (1)  0.00   0.00    0.02   0.04    0.05   0.49
Store (2)       0.06   0.17    0.04   0.20    0.04   0.18
FPALU (2)       0.00   0.00    0.07   0.00    0.07   0.67
FPMUL (2)       0.00   0.00    0.05   0.00    0.08   0.60
Total           2.36   4.11    1.33   3.13    1.39   3.28
NOP             2.64   0.89    3.67   1.87    3.61   1.72

Table 2: Number of each type of operation per VLIW. The values in parentheses show the number of available functional units of each type. Note that super load operations take two issue slots, so we count each one as two operations.
At first glance, the small increase in IFU power is baffling. All else being equal, we would expect the instruction buffer to end up full more and more as the operation density decreases. However, since the hand-optimized code does a lot more loop unrolling and function inlining, there are far fewer branches. Branches tend to cause power to be wasted in the instruction cache because enough instruction data must be fetched for 5 maximum-size VLIWs in the branch delay slots, but those VLIWs may be (and often are) considerably smaller than the maximum of 28 bytes per VLIW. For highly optimized code, it is much less worthwhile to try to reduce the power wasted by this discarded instruction data.

The increase in power in the functional units is clearly due to the increase in ALU, DSPALU, and DSPMUL (and FPALU and FPMUL in the MP3 application) operations per cycle. The MPEG-2 and MP3 applications have a slightly bigger increase in such operations and show a larger increase in power in the functional units. In the register file, we see a very similar effect: more operations per VLIW leads to an increase in power, and the increase is slightly larger in the MPEG-2 and MP3 applications due to the larger increase in operations per VLIW.

The power increase in the decode logic is smaller than that in the register file, even though they should both be affected similarly by the increase in operations per VLIW. The increase is smaller in the decode logic because a significant piece of it has to be on for every VLIW, no matter how many operations there are.

Other important effects of hand-optimization can be inferred from Table 2. First, the larger number of operations per VLIW means that delaying operations to reduce peak power will have more impact on performance. Unfortunately, this means that techniques such as those in [7] will be less effective on hand-optimized code. Second, even when each VLIW is fairly full, individual functional units are still unused a large fraction of the time. Because of this, clock gating of individual functional units is vital to keeping the dynamic power low.

Note that although the power increases by about 50-70% in each application, the total energy consumed decreases, because the speedup is larger than the power increase. From Table 1, we can see that the total energy required over the entire course of each application is reduced (by 95% in the MP3 application). Also note that some or all of the speedup gained by hand-optimization can be traded for greatly decreased power consumption by reducing the CPU frequency and voltage.

For example, assume that our optimized MP3 decoder is fast enough to meet real-time requirements when run at 200MHz, but we have a TM3270 that will run at 350MHz. Then we can reduce the operating frequency and voltage. We tried this on a real TM3270 processor and found that we could reduce the voltage from 1.226V to 1.028V, which yields a 55% reduction in power consumption and a 24% reduction in energy per cycle for the optimized MP3 decoder. Compared with running the reference MP3 decoder at 350MHz, this yields a 12% reduction in power consumption. The optimized MP3 decoder running at 200MHz is still 20x faster than the reference MP3 decoder running at 350MHz; therefore, we obtain both a performance increase and a power reduction, which also yields a large decrease in energy consumption (by a factor of 23.3). Likewise, any further performance improvements that do not greatly increase the power consumption can be traded for power and energy reduction.
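As a first-order sanity check on these numbers, dynamic power scales roughly as P ~ C * V^2 * f and energy per cycle as E ~ C * V^2. The sketch below uses this textbook approximation, which ignores leakage and other second-order effects; that is why it only roughly reproduces the measured 55% and 24% figures.

    # First-order CMOS scaling check: P_dyn ~ C * V^2 * f, E_cycle ~ C * V^2.
    # Operating points from the text; the constant C cancels in the ratios.
    f_hi, v_hi = 350e6, 1.226   # original frequency (Hz) and voltage (V)
    f_lo, v_lo = 200e6, 1.028   # scaled-down operating point

    power_ratio  = (f_lo / f_hi) * (v_lo / v_hi) ** 2
    energy_ratio = (v_lo / v_hi) ** 2

    print(f"estimated power reduction:        {1 - power_ratio:.0%}")   # ~60%; measured: 55%
    print(f"estimated energy/cycle reduction: {1 - energy_ratio:.0%}")  # ~30%; measured: 24%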

2.2 Overhead Operations

Performance optimization also has a big impact on the amount and types of overhead operations. In [6], D. Talla et al. analyze how much loop processing, address generation, data transformation, and memory access overhead exists in compiler-optimized Pentium III MMX (P3 MMX) software.


Here we look at a similar analysis for highly hand-optimized TM3270 code, namely the DCT and texture pipeline of the MPEG-2 encoder application. Table 3 shows the amount of true computation and overhead operations for the P3 MMX DCT code, as well as for several different versions of the TM3270 code (also DCT, except for TM3270 Pipe, which is the entire MPEG-2 texture pipeline).

                        ------------------- Overhead -------------------
Application       True  Loop-Related  Addr Gen  Data Transform  Mem Access  Total  % True
P3 MMX (1)         256       216         122          192           384       914   21.9%
TM3270 Base (1)    256       730         346          192           384      1460   13.4%
TM3270 Opt A (1)   144        26          10            0            80       116   55.4%
TM3270 Opt B (1)   144         1           0            0            74        75   65.8%
TM3270 Opt C (2)   288         1           0           64            74       139   67.5%
TM3270 Pipe (3)   1058         7         230          272           182       691   60.5%

Table 3: Amount of true computation and various types of overhead operations for different implementations of (1) a 1-D 8x8 DCT, (2) a 2-D 8x8 DCT, and (3) an MPEG-2 texture pipeline.
The TM3270 Baseline implementation is similar to the P3 MMX implementation in that it uses compiler optimizations and SIMD instructions, though it has more loop and address generation overhead. The number of memory access instructions is basically the same: although the P3 MMX can read in a double-word with one load, rather than the single-word loads available on the TM3270, it loads both the data values and the multiplication coefficients inside the inner loop, while our implementation reads the coefficients only once per outer loop. Our compiler automatically applied this simple optimization.
The TM3270 Opt A implementation uses a variant of the Loeffler algorithm [5] for computing the DCT, which requires fewer multiplies and adds than the brute-force method used in the P3 MMX and TM3270 Baseline implementations. In addition, using this method implicitly unrolls the inner loop and partially unrolls the outer loop, since the algorithm is designed to do an entire 8x1 DCT in a single outer-loop iteration and we use SIMD instructions to do an entire 8x2 DCT per loop iteration. Because of this unrolling, this implementation has much less loop overhead. This version also has less address generation overhead, since all data is read out of a small 8x8 buffer, and less memory access overhead, because each input value is only read once.
The TM3270 Opt B implementation takes this one step further and fully unrolls the outer loop, which cuts the loop-related overhead down to a single return instruction. The TM3270 Opt C implementation completes the 2-D DCT by doing both the horizontal and vertical 1-D DCTs. This doubles the number of true computation operations and adds 64 data organization operations, due to the transposing needed before and after the vertical 1-D DCT. However, this still yields an overall reduction in the overhead, because no additional loads or stores are necessary: all data values are kept in registers.
The final implementation, TM3270 Pipe, shows what happens when the entire texture pipeline of an MPEG-2 encoder is optimized. This code includes difference calculation, DCT, quantization, run-length encoding, zig-zag reorganization, inverse quantization, IDCT, and motion reconstruction. This is much more than a small DCT kernel, so its statistics are more relevant to real applications. As we can see, this highly optimized code contains only 40% overhead, much less than in the P3 MMX implementation. Further, most of the address generation and some of the data organization overhead are actually necessary parts of the zig-zag and run-length encoding (though they do not utilize the SIMD units in the processor) and are irregular enough that they would not map well to the PL instruction proposed in [6].
This reduction in overhead operations has a few important implications for low-power processor design. First, additional hardware support for reducing such overhead is less useful. Instead, one should look at the portions of the code that are difficult to hand-optimize like this, and decide whether hardware support might give significant gains there.

Also, the reduction in memory accesses hints at another effect, illustrated in the sketch below: hand-optimization can eliminate repeated loads/stores from/to the same location. An implementation that does the DCT, quantization, etc. in separate functions would have to load and store the entire 8x8 block of pixels in each function, but once they are in-lined, all those extra loads and stores disappear. Once the data only has to be read once and written once, a mechanism for loading/storing data directly to/from registers without polluting the cache could be useful for increasing effective cache capacity and reducing cache energy.
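The following schematic sketch shows the in-lining effect. The kernels are trivial stand-ins (nothing here is taken from the actual encoder source); the point is the memory traffic, not the arithmetic.

    # Placeholder kernels so the sketch runs; the real ones do a DCT and
    # quantization, respectively.
    def dct_8x8(block):
        return [2 * x for x in block]

    def quantize(block):
        return [x // 3 for x in block]

    # Staged version: the 8x8 block makes a round trip through memory
    # (64 stores plus 64 loads) between every pipeline stage.
    def staged(block, scratch):
        scratch[:] = dct_8x8(block)   # store the intermediate block to memory
        return quantize(scratch)      # ...then load it right back

    # Fused (in-lined) version: intermediates never leave registers,
    # so all of those extra loads and stores disappear.
    def fused(block):
        return quantize(dct_8x8(block))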

3. POWER REDUCTION TECHNIQUES

The breakdown of power among the various components of the design points to several possible methods for reducing the power consumption. We look at a few such methods, and evaluate their effectiveness at reducing power as well as their performance and area implications.

3.1 Skipping Duplicate Tag SRAM Reads

Instruction cache lines are 128 bytes long, but only 32 bytes of instruction data are read each cycle. In the absence of branches, all four 32-byte blocks of each cache line will be read sequentially. In this case, the tag SRAM is accessed four times in a row, even though the tag and index aren't changing. This observation leads to a very simple optimization that reduces the tag SRAM power by nearly 75%: disabling the tag SRAM whenever the current PC and next PC belong to the same cache line. This is very similar to the idea of the quadword buffer used in [3] and the cache line buffer in [1], except that we do not need to read an entire cache line in one cycle, and no prediction is necessary, since the TM3270 uses jump delay slots instead of branch prediction. Synthesizing the design with this new technique did not change the total area or maximum operating frequency significantly.
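The enable condition is just a comparison of the line-address bits of the current and next fetch PC. Below is a minimal behavioral sketch in Python; the field widths match the 128-byte lines and 32-byte fetches described above, but this is our own illustration, not the actual RTL.

    LINE_BYTES  = 128   # instruction cache line size
    OFFSET_BITS = 7     # log2(128) byte-offset bits within a line
    FETCH_BYTES = 32    # instruction data read per cycle

    def tag_sram_enable(pc: int, next_pc: int) -> bool:
        """Read the tag SRAM only when fetch crosses into a new cache line.

        While the current and next PC share a line, last cycle's tag compare
        is still valid, so the tag SRAM can be clock-gated. With 32-byte
        fetches from 128-byte lines, 3 of every 4 sequential accesses are
        skipped, which is where the 'nearly 75%' tag power saving comes from.
        """
        return (pc >> OFFSET_BITS) != (next_pc >> OFFSET_BITS)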
Using this technique reduces the power consumption in the IFU tag SRAM by 65-72%, depending on the application. However, the resulting overall processor power savings are about 6-7% in the reference applications, but only about 4-5% in the optimized applications, because the IFU tag SRAM consumes a smaller fraction of the total processor power in the optimized applications. Using the reference applications overestimates the power savings of this technique by an average of 30%. In addition, any further optimizations to the tag SRAM power will be nearly insignificant compared to the power consumed by the other SRAMs.
Thus, using techniques like way-prediction and way-halting just to save the tag power will not be very useful. These techniques make more sense when the tag and data are accessed in parallel. Even if they are accessed in parallel, we can predict that the cache way will be the same way that was accessed in the previous cycle whenever the current and next PC belong to the same cache line, without any extra lookup tables.
This same technique could be used in the data cache to reduce the power of tag accesses; however, the data cache tag accesses aren't as large a percentage of the total power, and sequential access isn't as common in the data cache. In addition, when two addresses are statically known to be sequential, the compiler or programmer can combine two load operations into one super load operation, which already reduces the tag power because the tag is only read once for both words. Also, for performance reasons, the instruction scheduler does not always put sequential data accesses back-to-back.


3.2 Functional Units

Looking into the design, we notice that the first stages of some of the functional units share the same source and opcode flip-flops. There are only two sets of flip-flops: one for the floating point multiply and divide units, and one for everything else. This means that when an ALU operation is executed, the first stages of the DSPMUL unit and FPALU both see new source values, causing useless switching. This may have been done as an area or timing optimization, but it increases the power consumption. Table 4 shows the power savings that result if we give the first stage of the DSPMUL unit its own flip-flops. As we can see, in the applications that contain no integer multiplies, the power consumption in the DSPMUL is almost completely eliminated. A similar optimization could be applied to the FPALUs, though the gains would be much more modest, since they consume much less power to begin with. Similarly, some power could be saved in the integer ALUs using this technique; even though they are the most utilized functional units, they are still on but unused some of the time.
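The mechanism can be modeled as simple operand isolation. In the toy activity model below (our own illustration, not the synthesized netlist), shared input flip-flops broadcast every new operand to all attached first stages, while dedicated, clock-gated flip-flops toggle only when their own unit is issued.

    # Toy switching-activity model: count input-register updates per unit.
    def input_updates(trace, shared_flops):
        updates = {"ALU": 0, "DSPMUL": 0, "FPALU": 0}
        for unit in trace:                  # one issued operation per cycle
            if shared_flops:
                for u in updates:           # every first stage sees new sources
                    updates[u] += 1
            else:
                updates[unit] += 1          # only the issued unit's flops clock
        return updates

    trace = ["ALU"] * 90 + ["DSPMUL"] * 10  # integer-heavy mix, as in ADPCM
    print(input_updates(trace, shared_flops=True))   # DSPMUL inputs: 100 updates
    print(input_updates(trace, shared_flops=False))  # DSPMUL inputs: 10 updates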
             ADPCM           MPEG-2          MP3
             Ref.    Opt.    Ref.    Opt.    Ref.    Opt.
Old DSPMUL    6.18   10.2     6.64   15.8     6.34    9.13
New DSPMUL    1.21    1.21    2.85   11.4     2.30    1.04
Savings        80%     88%     57%     28%     64%     89%
Old Total    246.6   363.5   185.2   280.1   185.8   318.0
New Total    240.3   353.4   180.9   275.2   180.9   308.8
Savings       2.5%    2.8%    2.3%    1.8%    2.6%    2.9%

Table 4: Power savings obtained by giving the first stage of the DSPMUL unit its own flip-flops and clock-gating them when not in use.

4. CONCLUSIONS

We have taken a detailed look at the power consumption of a real, commercially available embedded multimedia processor. To obtain accurate and detailed power consumption data, we used commercial synthesis, simulation, and power analysis tools. We compared the power consumption on reference benchmark code with the consumption on hand-optimized code.

In addition, we found that a few simple changes could have a significant impact on the power consumption and help us to focus on more critical components. For example, disabling the instruction tag access on sequential instructions is extremely simple, but makes the instruction tag power consumption almost negligible, so we can focus further efforts elsewhere. Also, giving each functional unit its own flip-flops for holding source values and intermediate results can eliminate useless switching, which is particularly important for power-hungry functional units such as multipliers.

Finally, we looked at the difference in processor power consumption when running reference (compiler-optimized) and hand-optimized software. We found that hand-optimization was enormously effective: the total energy consumption dropped by a large fraction, swamping most hardware and architecture improvements. We found that hand-optimization had a moderate impact on how effective various power savings techniques were, and a big impact on the amount of loop processing, address generation, and other overhead. Hand-optimized versions of more modern software, such as a VC-1 decoder or temporal video up-converter, may show an even larger impact as applications become more memory-bound.

The reader is also referred to [2] for further results and analysis; that report presents a much more extensive and complete discussion of the issues presented here and of related issues.

5. REFERENCES

[1] Kashif Ali, Mokhtar Aboelaze, and Suprakash Datta. Predictive Line Buffer: A Fast, Energy Efficient Cache Architecture. In Proc. IEEE SoutheastCon, pages 291-295, 2006.
[2] Dominic Aldo Antonelli, Alan Jay Smith, and Jan-Willem van de Waerdt. Power Consumption in a Real, Commercial Multimedia Core. Technical Report UCB/EECS-2008-24, UC Berkeley, 2008. http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-24.pdf
[3] James Cho, Alan J. Smith, and Howard Sachs. The Memory Architecture and the Cache and Memory Management Unit for the Fairchild CLIPPER Processor. Technical report, Berkeley, CA, USA, 1986.
[4] Chunho Lee, Miodrag Potkonjak, and William H. Mangione-Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. In Proc. 30th Ann. ACM/IEEE Int. Symp. Microarchitecture, pages 330-335, 1997.
[5] Mihai Sima, Sorin Cotofana, Jos T. J. van Eijndhoven, Stamatis Vassiliadis, and Kees Vissers. An 8x8 IDCT Implementation on an FPGA-Augmented TriMedia. In Proc. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM), pages 160-169, 2001.
[6] Deependra Talla, Lizy K. John, and Doug Burger. Hardware Support to Reduce Overhead in Fine-Grain Media Codes. Technical report, Austin, TX, USA, 2001.
[7] W. Zhang, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, D. Duarte, and Y-F. Tsai. Exploiting VLIW Schedule Slacks for Dynamic and Leakage Energy Reduction. In Proc. 34th Ann. ACM/IEEE Int. Symp. Microarchitecture, pages 102-113, 2001.


