
Energy Awareness for Embedded Systems
OPTIMIZING EMBEDDED SOFTWARE FOR POWER
Introduction

• Review of Power Consumption
• Understanding Power for Embedded Systems
• Software and Hardware Optimizations
Review of Power Consumption

Four factors determine a device's power consumption:

1. Application 2. Technology
3. Voltage 4. Frequency

Static vs. dynamic power consumption is the distinction that will be covered very heavily here.
Review of Power Consumption

Power consumption consists of two types of power: dynamic and static (also known as leakage) consumption, so total device power is calculated as:

Ptotal = PDynamic + PStatic

Types of Power:
• Maximum
• Average
• Worst-Case
• Typical

As we have just discussed, clock transitions account for a large portion of the dynamic consumption, but what is this "dynamic consumption"? Basically, in software we have control over dynamic consumption, but we do not have control over static consumption.

Static power consumption is the power that a device consumes independent of any activity or task the core is running, because even in a steady state there is a low "leakage" current (transistor tunneling current, reverse diode leakage, etc.) from the device's Vin to ground. The only factors that affect the leakage consumption are supply voltage, temperature, and process.
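Dynamic consumption, in contrast, comes from charging and discharging node capacitance on signal transitions. As background (this is the standard first-order CMOS model, not stated on the slide itself), the two terms can be written as:

```latex
P_{\text{total}} = P_{\text{dynamic}} + P_{\text{static}}, \qquad
P_{\text{dynamic}} \approx \alpha \, C \, V_{dd}^{2} \, f
```

where α is the switching activity, C the switched capacitance, V_dd the supply voltage, and f the clock frequency. This is why the voltage and frequency factors listed at the start of this review matter quadratically and linearly, respectively.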
Minimizing Power Consumption

1. Hardware Techniques
2. Data Flow Optimization
3. Algorithmic Optimization
Hardware Techniques (1)

Low Power Modes

• Power gating
• Clock gating
• Voltage Control
• Frequency Control
Hardware Techniques (2)

Considerations

Available Block Functionality
• Memory states and validity must be considered
• Certain peripherals will not be available

Overhead
• Mode transitions must not break real-time constraints
Data Flow Optimization – Memory Access (1)
Principle of locality
Data Flow Optimization – Memory Access (2)
Interleaving
Data Flow Optimization – Memory Access (3)
Burst Access
Data Flow Optimization – Memory Access (4)
Avoidance

Code Optimization – Code Size

• Algorithms
• Avoidance
• Packing instructions
• Compression
• Constants
• Zeroing
• Functions
Data Flow Optimization – Memory Access (5)

Compiler cache optimizations

In order to assist with the above, compilers may be used to optimize cache power consumption by reorganizing memory or memory accesses for us. Two main techniques available are array merging and loop interchanging, explained below.

Array Merging

Array merging organizes memory so that arrays accessed simultaneously will be at different offsets (different "sets") from the start of a way. Consider the following two array declarations:

int array1[ array_size ];
int array2[ array_size ];

The compiler can merge these two arrays as shown below:

struct merged_arrays
{
    int array1;
    int array2;
} new_array[ array_size ];

Loop Interchanging

In order to re-order the way that high-level memory is read into cache, reading in smaller chunks to reduce the chance of thrashing, loop interchanging can be used. Consider the code below:

for (i = 0; i < 100; i = i + 1)
    for (j = 0; j < 200; j = j + 1)
        for (k = 0; k < 10000; k = k + 1)
            z[ k ][ j ] = 10 * z[ k ][ j ];

By interchanging the second and third nested loops, the compiler produces the following code, decreasing the likelihood of unnecessary thrashing during the inner loop:

for (i = 0; i < 100; i = i + 1)
    for (k = 0; k < 10000; k = k + 1)
        for (j = 0; j < 200; j = j + 1)
            z[ k ][ j ] = 10 * z[ k ][ j ];

Peripheral/communication utilization

When considering reading and writing of data we have to consider more than just memory access: we also need to pull data into and out of the device.
Data Flow Optimization – Peripherals

• Coprocessors
  Ø DMA
• Bus Configuration
• Core Communication
  Ø Polling
  Ø Time-Based Processing
  Ø Interrupt Processing
Algorithmic Optimization (1)

Instruction packing

Instruction packing was included in the data path optimization section above, but may also be listed as an algorithmic optimization as it involves not only how memory is accessed, but also how code is organized.

Loop unrolling revisited

We briefly discussed altering loops in code in order to optimize cache utilization before. As we discussed earlier, another method for optimizing both performance and power in embedded processors is loop unrolling. This method effectively partially unravels a loop, as shown in the code snippets below:

Regular loop:

for (i = 0; i < 100; i = i + 1)
    for (k = 0; k < 10000; k = k + 1)
        a[i] = 10 * b[k];

Loop unrolled by 4x:

for (i = 0; i < 100; i = i + 4)
    for (k = 0; k < 10000; k = k + 4)
    {
        a[i] = 10 * b[k];
        a[i + 1] = 10 * b[k + 1];
        a[i + 2] = 10 * b[k + 2];
        a[i + 3] = 10 * b[k + 3];
    }
Note that unrolling works against the code-size minimization efforts we discussed in the data path section, which would lead to extra memory accesses and the possibility of increased cache miss penalties.

Algorithmic Optimization (2)

Software pipelining

Another technique common to both embedded processor performance optimization and embedded processor power optimization is software pipelining. Software pipelining is a technique where the programmer splits up a set of interdependent instructions that would normally have to be performed one at a time so that the DSP core can begin processing multiple instructions in each cycle. Rather than explaining in words, the easiest way to follow this technique is to see an example.

Say we have the following code segment:

Regular loop:

for (i = 0; i < 100; i = i + 1)
{
    a[i] = 10 * b[i];
    b[i] = 10 * c[i];
    c[i] = 10 * d[i];
}

Right now, although we have three instructions occurring per loop, the compiler will see that the first instruction depends on the second instruction, and thus could not be pipelined with the second, nor can the second instruction be pipelined with the third due to interdependence: a[i] cannot be set to b[i] as b[i] is simultaneously being set to c[i], and so on. So right now the DSP processor has to execute the above loop 100 times with each iteration performing three individual instructions per cycle (not very efficient), for a total of 300 cycles.

Now we see how to parallelize the loop and pipeline it. We have some "set-up", also known as loading the pipeline, before the main loop, and two partial iterations after it to drain the pipeline:

//pipeline loading ! first stage
a[0] = 10 * b[0];
//pipeline loading ! second stage
b[0] = 10 * c[0];
a[1] = 10 * b[1];
//pipelined loop
for (i = 0; i < 100 - 2; i = i + 1)
{
    c[i] = 10 * d[i];
    b[i + 1] = 10 * c[i + 1];
    a[i + 2] = 10 * b[i + 2];
}
//after this, we still have 2 more partial loop iterations
c[98] = 10 * d[98];
b[99] = 10 * c[99];
//final partial iteration
c[99] = 10 * d[99];

By pipelining the loop, we enabled the compiler to issue the three now-independent statements of each iteration in parallel, cutting the loop from 300 cycles to roughly 100.
Algorithmic Optimization (3)

Eliminating Recursion

fn!(0) = 1                  for n == 0
fn!(n) = n * fn!(n - 1)     for n > 0

If this recursive factorial function is called with n = 100, there would be ~100 function calls entailing 100 branches to subroutines (which are change-of-flow routines which affect the program counter and software stack). Each change of flow instruction takes longer to execute because not only is the core pipeline disrupted during execution, but every branch adds at least a return address to the call stack. Additionally, if multiple variables are being passed, these also must be pushed onto the stack.

This means that this recursive subroutine requires on the order of 100 individual writes to physical memory (with the related stalls, as writes/reads to memory will not be pipelined) and on the order of 100 pipeline stalls due to changes of flow.

We can optimize this by moving to a simple loop:

int res = 1;
for (int i = 1; i <= n; i = i + 1)
{
    res *= i;
}

This function requires no actual writes to the stack and no function calls/jumps. As this function only involves a short loop body, it qualifies as a "short loop" on certain devices, whereby the loop is handled in hardware. Thanks to this feature, there are no change-of-flow penalties, and this effectively acts like a completely unrolled loop (without the code-size cost). Compared to the recursive routine, using the loop eliminates the stack traffic and pipeline stalls described above.
Algorithmic Optimization (4)

Reducing Accuracy

Low Power Code Sequences


Algorithmic Optimization (5)

OptAlg
§ Tool that automates the optimization of power-intensive algorithmic constructs using symbolic algebra with energy profiling
Algorithmic Optimization (6)

OptAlg Flow

Architecture Level Optimization (1)

Architecture Level Optimization (2)

Clustered Length-Adaptive Word Processor (CLAW)
§ Allows dynamic modification of the issue width
References

1) "Length Adaptive Processors: A Solution for the Energy/Performance Dilemma in Embedded Systems", Iyer and Conte, School of Computer Science, College of Computing, Georgia Institute of Technology, Atlanta, GA

2) "Low Power Embedded Software Optimization using Symbolic Algebra", Peymandoust, Simunic, and De Micheli, Stanford University
Questions?
