4.2 Simple Static Power Management (S-SPM)

Figure 1. An example and its static schedule (a. an application; b. the static schedule for the application)

If there is some slack in the system, the system can appropriately slow down the processor to save energy. Assuming that every task in an application executes for exactly ci units of time, the optimal static speed for a uniprocessor system to get minimal energy consumption can be obtained by proportionally distributing the static slack to each task according to its ci [12]. Following the same idea, a simple static power management (S-SPM) scheme for multiprocessor systems was proposed, which distributes the global static slack over the length of a schedule [8]. Applying S-SPM to the task graph in Figure 1a, every task will run at (2/3)fmax, and the resulting static schedule is shown in Figure 2b.
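Under the power model used later in the paper (power proportional to Cef * f^3, with an energy-free idle state), uniformly stretching a schedule of length S to fill a deadline D scales energy by (S/D)^2: speed drops to (S/D)*fmax, so power drops cubically while busy time grows by D/S. A minimal sketch (the function name and normalization are ours, not the paper's):

```python
def s_spm_energy_ratio(schedule_len, deadline):
    """Energy of a uniformly stretched schedule, relative to running at fmax.

    Every task slows to (schedule_len/deadline) * fmax; power scales with
    the cube of speed and busy time grows by deadline/schedule_len, so the
    energy ratio is (schedule_len/deadline)^2.
    """
    f = schedule_len / deadline   # normalized static speed
    return f ** 2

# Figure 1 example as described in the text: schedule length 4, deadline 6,
# so every task runs at (2/3) fmax and S-SPM consumes (2/3)^2 = 4/9 of E.
print(s_spm_energy_ratio(4, 6))  # ~0.444, i.e. 4/9
```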
Note that, for multiprocessor systems, S-SPM is not optimal in terms of energy consumption because of the different degrees of parallelism in a schedule. For the example in Figure 1, S-SPM consumes (4/9)E, where E is the energy consumed when no power management (NPM) is applied (assuming that a processor consumes no power when it is in the idle state). Another static power management is shown in Figure 2c: it allocates 2 units of time to tasks A and C, and 4 units of time to task B. Its energy consumption is (1/4)E, which is less than (4/9)E. In fact, S-SPM consumes even more energy than G-SPM, which consumes (29/72)E. The reason is that S-SPM wastes an additional unit of slack by uniformly stretching the whole schedule. For a given static mapping and schedule, we now explore the allocation of the global static slack (if any) so as to minimize energy consumption.

4.3 Static Power Management with Parallelism (P-SPM)

From the above discussion, we observe that S-SPM is not optimal for parallel and distributed systems when parallelism varies in an application. The intuition is that more energy savings can be obtained by giving more slack to sections with higher parallelism, thereby reducing the idle periods in the system. We propose a static power management for parallelism (P-SPM) scheme, which takes the degree of parallelism into consideration when allocating global static slack to different sections of a schedule.

4.3.1 P-SPM for 2-processor systems

For applications running on a 2-processor system, the degree of parallelism (DP) in a static schedule will range from 0 to 2 (when communication cost is significant, part of the schedule may be used only for communication, with zero parallelism; see time 2-3 in Figure 3).

Figure 3. Parallelism in the schedule

The static schedule will first be partitioned according to parallelism. For the example in Figure 1, the first time unit has parallelism of 2, the second and fourth time units have parallelism of 1, and the third time unit has parallelism of 0. We define Tij as the length of the j-th section of a schedule with parallelism of i, and define the total length with parallelism of i in a schedule as Ti = sum_j Tij. The static schedule for the example will be partitioned as in Figure 3. Here, we have T0 = 1, T1 = 2 and T2 = 1.

In general, suppose that an application runs on a 2-processor system with global static slack of L0 and that the partitioned static schedule has specific T0, T1 and T2. Assuming that the amount of static slack allocated to Ti is denoted by li (i = 0, 1, 2), the total energy consumption E in the worst case after allocating the global static slack L0 will be:

  E = sum_i Ei                                                 (1)
    = sum_i (Cef * i * fi^3 * (Ti + li))                       (2)
    = sum_i (Cef * i * ((Ti / (Ti + li)) * fmax)^3 * (Ti + li))  (3)
    = Cef * fmax^3 * sum_i (i * Ti^3 / (Ti + li)^2)            (4)

where Cef is the effective switch capacitance and fi is the speed for sections with parallelism i. For simplicity, the idle state is assumed to consume no energy. To optimally allocate L0, we need to minimize E subject to:

  l0 + l1 + l2 <= L0
  0 <= li <= L0;  i = 0, 1, 2

Substituting l0 = L0 - l1 - l2 into Equation (4), and setting dE/dl1 = dE/dl2 = 0, we get the following optimal solutions:

  l0 = 0;
  l1 = T1 * (L0 - (2^(1/3) - 1) * T2) / (T1 + 2^(1/3) * T2);
  l2 = T2 * (2^(1/3) * L0 + (2^(1/3) - 1) * T1) / (T1 + 2^(1/3) * T2);

where 0 <= l1, l2 <= L0.

From the solutions, if L0 <= (2^(1/3) - 1) * T2 then l1 = 0 (since there is a constraint that li >= 0) and l2 = L0; that is, all the global static slack will be allocated to the sections of the schedule with parallelism 2. For the example in Figure 1, we get l0 = 0, l1 = 1.0676 and l2 = 0.9324, and the minimal energy consumption is 0.3464E. Compared with the energy consumption of S-SPM, (4/9)E = 0.4444E, an additional 22% of energy is saved by P-SPM. Note that P-SPM still consumes more energy than the static schedule in Figure 2c, which consumes only 0.25E. The reason is that the schedule in Figure 2c claims the gap in the middle of the schedule while P-SPM does not.
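The closed-form solution above can be checked numerically. The following sketch (helper names are ours) implements the optimal 2-processor allocation and the energy ratio from Equation (4), reproducing the numbers quoted for the Figure 1 example (T1 = 2, T2 = 1, L0 = 2):

```python
def p_spm_2proc(T1, T2, L0):
    """Closed-form optimal slack allocation for a 2-processor schedule.

    T1, T2: total schedule lengths with parallelism 1 and 2; L0: global
    static slack. Returns (l0, l1, l2). Sections with parallelism 0
    consume no energy, so l0 = 0; when L0 <= (2^(1/3) - 1) * T2 the
    constraint l1 >= 0 binds and all slack goes to l2.
    """
    c = 2 ** (1 / 3)
    l1 = T1 * (L0 - (c - 1) * T2) / (T1 + c * T2)
    if l1 <= 0:                       # l1 >= 0 binds
        return 0.0, 0.0, L0
    l2 = T2 * (c * L0 + (c - 1) * T1) / (T1 + c * T2)
    return 0.0, l1, l2

def energy_ratio(T1, T2, l1, l2):
    """E / E_NPM, with E = Cef*fmax^3 * sum(i * Ti^3/(Ti+li)^2)  (Eq. 4)."""
    e = 1 * T1 ** 3 / (T1 + l1) ** 2 + 2 * T2 ** 3 / (T2 + l2) ** 2
    return e / (1 * T1 + 2 * T2)      # NPM: all busy time at fmax

l0, l1, l2 = p_spm_2proc(T1=2, T2=1, L0=2)
print(round(l1, 4), round(l2, 4))            # 1.0676 0.9324
print(round(energy_ratio(2, 1, l1, l2), 4))  # 0.3464
```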
Algorithm 1 P-SPM Algorithm
  Partition the static schedule and generate Ti = sum_j Tij;
  Divide L0 into L0/ΔL intervals, each of length ΔL;
  while there is still slack ΔL do
    Allocate ΔL to the Ti that maximizes
      ΔEi = Cef * i * fi^3 * Ti * ΔL * (2*Ti + ΔL) / (Ti + ΔL)^2;
    Update fi = fi * Ti/(Ti + ΔL) and Ti = Ti + ΔL;
  end while
  For all i, allocate li to the sections Tij with lij = li * Tij/Ti;
  For every task τi, gather its allotted time ti and determine its speed fi = ci/ti;

4.3.2 P-SPM for N-processor systems

The above idea can be easily extended to N-processor systems. Assuming that there are N processors in a system, the degree of parallelism (DP) in a static schedule will range from 0 to N. Suppose that a schedule section with DP = i has total length Ti (which may consist of several sub-sections Tij, j = 1, ..., ui, where ui is the total number of sub-sections with parallelism i) and that the global static slack in the system is L0. The amount of slack allocated to Ti is li. The total energy consumption E after allocating L0 is the same as shown in Equation (1), here with i = 0, ..., N. The problem of finding an optimal, energy-minimizing allocation of L0 to the Ti is to find l0, ..., lN so as to

  minimize E
  subject to:
    li >= 0
    sum_i li <= L0

where i = 0, ..., N. The constraints put limitations on how the global static slack may be allocated.

Solving the above problem is similar to solving the constrained optimization problem presented in [3]. The P-SPM scheme in Algorithm 1 approximates the solution, where fi is the speed of section Ti and initially fi = fmax. First, the algorithm partitions the static schedule into sections according to parallelism and the Ti are generated. The slack L0 is divided into ΔL segments, and there will be L0/ΔL such segments. Then, the algorithm allocates one ΔL to some Ti in each iteration of the while-loop. In each iteration, ΔL is allocated to the schedule sections with DP = i such that the energy reduction ΔEi is maximized:

  ΔEi = Ei - Ei'
      = Cef * i * (fi^3 * Ti - ((Ti / (Ti + ΔL)) * fi)^3 * (Ti + ΔL))
      = Cef * i * fi^3 * Ti * ΔL * (2*Ti + ΔL) / (Ti + ΔL)^2

where Ei and Ei' are the energy consumptions for the sections with parallelism i before and after getting ΔL, respectively. In general(1), the smaller ΔL is, the more accurate the solution is; but the more allocation steps there are, the more time consuming the algorithm is. After allocating each ΔL segment, li will be re-distributed to the sections Tij. Finally, each task gathers all the slack allocated to it and a single static speed for the task is computed.

Due to the synchronization of tasks and the parallelism of an application, gaps may exist in the middle of a static schedule. After distributing the global static slack, gaps in the middle of the schedule can be further explored. Finding an optimal usage of such gaps seems to be a non-trivial problem. One simple scheme is to stretch tasks adjacent to a gap when such stretching does not affect the application timing constraints.

From the above discussion, we can notice that even P-SPM is not optimal. Since dynamic power management is needed to save more energy by taking advantage of tasks' actual run-time behavior (which varies significantly [7]), we explore dynamic power management schemes next.

5 Dynamic Power Management

Dynamic slack is generated when tasks of the application execute for less than their worst case execution time. Dynamic power management is applied in addition to static power management and is used to reclaim dynamic slack. We use two techniques to reclaim dynamic slack. The first is greedy: all available slack on one processor is given to the next expected task running on that processor. If the expected task is not ready when the previous task finishes execution, the processor enters the idle state if no preemption is allowed. The second technique, gap filling, is used when preemption is allowed. Instead of putting a processor into the idle state when there is some slack and the next expected task is not ready, the gap filling technique fetches the first future ready task in the local queue. The gap is added to the allotted time of this future task to allow it to execute at a reduced speed. The execution of the out-of-order task will be preempted by the next expected task when it receives all its data and is ready.

The dynamic power management algorithm is illustrated in Algorithm 2. After the schedule is generated and static power management is applied, each processor executes tasks from its local queue until the queue is empty. We use the function Sleep() to put a processor to sleep and assume it will wake up when the head of the queue is ready or a new frame begins. If the executed task is an out-of-order task fetched by Gap_filling(), it will be preempted when the head task of the queue is ready.

(1) In Section 6.2, we discuss this issue in detail.
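The while-loop of Algorithm 1 can be sketched as below. This is an illustrative Python rendering (names are ours), not the authors' implementation; with a fine granularity ΔL it approaches the closed-form 2-processor solution of Section 4.3.1, since the greedy step equalizes the marginal energy gains across sections:

```python
def p_spm_greedy(T, L0, dL, fmax=1.0, Cef=1.0):
    """Iterative P-SPM slack allocation (a sketch of Algorithm 1).

    T[i] is the total schedule length with parallelism i. Each step gives
    one slice of size dL to the section whose energy reduction
    dE_i = Cef * i * f_i^3 * T_i * dL * (2*T_i + dL) / (T_i + dL)^2
    is largest, then updates that section's speed and length.
    """
    n = len(T)
    f = [fmax] * n                    # current speed of every section
    Ti = list(T)                      # current (stretched) section lengths
    li = [0.0] * n                    # slack allocated so far
    slack = L0
    while slack >= dL:
        gains = [Cef * i * f[i] ** 3 * Ti[i] * dL * (2 * Ti[i] + dL)
                 / (Ti[i] + dL) ** 2 for i in range(n)]
        best = max(range(n), key=lambda i: gains[i])
        f[best] *= Ti[best] / (Ti[best] + dL)   # slow that section down
        Ti[best] += dL
        li[best] += dL
        slack -= dL
    return li

# Figure 1 example (T0=1, T1=2, T2=1, L0=2) with a fine granularity:
li = p_spm_greedy([1, 2, 1], L0=2, dL=0.001)
print([round(x, 2) for x in li])  # close to the closed form [0, 1.07, 0.93]
```

Sections with parallelism 0 never receive slack here because their gain is zero, matching the closed-form result l0 = 0.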
Algorithm 2 DPM - Main Algorithm
  while local queue is not empty do
    if (next expected task τk is ready) then
      Reclaim dynamic slack (if any) and execute τk with reduced speed;
    else if (τk is not ready) AND (a future task τj is ready in the local queue) then
      Gap_filling(τj);
    else
      Sleep();
    end if
  end while

However, although we have not evaluated it, it is possible to apply a technique similar to gap filling in non-preemptive systems.

6 Evaluation and Analysis

6.1 Simulation Methodology

The results are more or less the same, so we only show the results for a 50-node graph. The ci and the communication times are randomly generated, and the values of αi are drawn from a discretized normal distribution with average α and standard deviation 0.48(1 - α) if α > 0.5 and 0.48α if α <= 0.5 (the 0.48 value comes from discretizing the values of the normal distribution).

We compare the energy consumed by all the schemes with that consumed by no power management (NPM). To summarize, we consider the following schemes:

  G-SPM: greedy static power management;
  S-SPM: simple static power management;
  P-SPM: static power management for parallelism;
  DPM-G: G-SPM + dynamic power management;
  DPM-S: S-SPM + dynamic power management;
  DPM-P: P-SPM + dynamic power management.

The algorithms DPM-G, DPM-S and DPM-P are executed first using only the dynamic greedy technique and then using the gap filling technique on top of it. We use the terms DPM GAPFILL ON and DPM GAPFILL OFF to distinguish between the two techniques in the graphs.

6.2 Sensitivity Analysis

Figure 4. Sensitivity analysis for a randomly generated 50-node graph (a. energy vs. ΔL for 4 processors; b. energy vs. ΔL for 8 processors)

Sensitivity analysis is done offline to find the optimal value of the ΔL unit for P-SPM. Intuitively, a smaller value of ΔL will lead to better results due to the finer granularity of slack distribution. However, small ΔL values may significantly increase the cost of the algorithm. We plot the energy consumption values obtained for varying ΔL on 4 and 8 processors in Figure 4a and Figure 4b. The graphs indicate, however, that as we decrease the granularity ΔL, the energy consumption does not decrease strictly. The fact that energy does not decrease uniformly with reducing ΔL can be explained by taking the structure of the task graph and the nature of the algorithm into account.

6.3 Performance Comparison

We start by comparing the energy consumed by our new scheme P-SPM with S-SPM, NPM and SHIFT. The effect of slack in the middle of the schedule decreases as laxity increases. Finally, the P-SPM scheme saves 5% more than S-SPM and 40% more than G-SPM when using 4 processors. When there are 8 processors, these values are 10% and 50%, respectively.

Figure 6. The normalized energy vs. α for DPM (laxity = 1.25, for both GAPFILL ON and GAPFILL OFF)

Figure 7. The normalized energy vs. α for DPM (laxity = 1.75, for both GAPFILL ON and GAPFILL OFF)

When much dynamic slack is generated, DPM-S does not use the whole available slack at once, saving slack for future tasks. However, when slack is dynamically generated, this turns out to be a very conservative approach that benefits DPM-G. Also, by increasing the number of processors the relative gain increases, but at the same time the overall energy consumption also increases due to the increase in the total idle time.

7 Conclusion

In this paper, we propose two novel techniques for power management in distributed systems. First, static power management for parallelism (P-SPM) allocates global static slack, defined as the difference between the length of the static schedule and the deadline, to different sections of the schedule according to their degree of parallelism. Second, the gap-filling technique enhances the greedy algorithm by allowing out-of-order execution when preemption is considered; that is, if there is some slack and the next expected task is not ready, the processor will run the future ready tasks mapped to it.

We compared our schemes with some previously proposed schemes. The simulation results show that P-SPM can save 10-20% more energy than simple static power management (S-SPM) for parallel systems, which distributes global static slack proportionally over the length of the schedule, and 10% more than the static scheme proposed in [15], while the gap-filling technique can save 5% more energy when applied on top of the greedy technique.

In this work, we assume continuous speed changes. The schemes can be easily adapted for processors with discrete speed levels, as shown in [12, 28].
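As a sketch of that last adaptation (the function name and level set are hypothetical), a task whose ideal speed falls between two available levels can split its allotted time between the two adjacent levels so that it still finishes exactly on time; [12] shows that using the two adjacent levels in this way is energy-optimal for discrete speeds:

```python
def two_level_split(c, t, levels):
    """Emulate an ideal speed c/t on a processor with discrete speed levels.

    c: cycles to execute (normalized to fmax = 1); t: allotted time;
    levels: available normalized speeds. Runs the task at the two levels
    adjacent to the ideal speed so that exactly c cycles finish in time t.
    Returns (f_low, t_low, f_high, t_high).
    """
    ideal = c / t
    levels = sorted(levels)
    if ideal <= levels[0]:            # even the lowest level finishes early
        return levels[0], c / levels[0], levels[0], 0.0
    f_low = max(f for f in levels if f <= ideal)
    if f_low == ideal:                # ideal speed is itself a level
        return f_low, t, f_low, 0.0
    f_high = min(f for f in levels if f > ideal)
    # time at f_high so that f_low*(t - t_high) + f_high*t_high = c
    t_high = (c - f_low * t) / (f_high - f_low)
    return f_low, t - t_high, f_high, t_high

# Ideal speed 0.65*fmax with levels at 0.5 and 0.75 of fmax:
f_lo, t_lo, f_hi, t_hi = two_level_split(c=0.65, t=1.0,
                                         levels=[0.25, 0.5, 0.75, 1.0])
print(f_lo, round(t_lo, 2), f_hi, round(t_hi, 2))  # 0.5 0.4 0.75 0.6
```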
References

[1] N. AbouGhazaleh, D. Mosse, B. Childers, and R. Melhem. Toward the placement of power management points in real time applications. In Proc. of Workshop on Compilers and Operating Systems for Low Power, Barcelona, Spain, 2001.
[2] H. Aydin, R. Melhem, D. Mosse, and P. Mejia-Alvarez. Dynamic and aggressive scheduling techniques for power-aware real-time systems. In Proc. of The 22nd IEEE Real-Time Systems Symposium, London, UK, Dec. 2001.
[3] H. Aydin, R. Melhem, D. Mosse, and P. Mejia-Alvarez. Optimal reward-based scheduling for periodic real-time systems. IEEE Trans. on Computers, 50(2):111-130, 2001.
[4] T. D. Burd and R. W. Brodersen. Energy efficient CMOS microprocessor design. In Proc. of The HICSS Conference, pages 288-297, Maui, Hawaii, Jan. 1995.
[5] A. Chandrakasan, V. Gutnik, and T. Xanthopoulos. Data driven signal processing: An approach for energy efficient computing. In Proc. Int'l Symposium on Low-Power Electronic Devices, Monterey, CA, 1996.
[6] A. Chandrakasan, S. Sheng, and R. Brodersen. Low-power CMOS digital design. IEEE Journal of Solid-State Circuits, 27(4):473-484, 1992.
[7] R. Ernst and W. Ye. Embedded program timing analysis based on path clustering and architecture classification. In Proc. of The International Conference on Computer-Aided Design, pages 598-604, San Jose, CA, Nov. 1997.
[8] F. Gruian. System-level design methods for low-energy architectures containing variable voltage processors. In Proc. of The Workshop on Power-Aware Computing Systems (PACS), Cambridge, MA, Nov. 2000.
[9] F. Gruian and K. Kuchcinski. LEneS: Task scheduling for low-energy systems using variable supply voltage processors. In Proc. of Asia and South Pacific Design Automation Conference (ASP-DAC), Yokohama, Japan, Jan. 2001.
[10] http://developer.intel.com/design/intelxscale/.
[11] http://www.transmeta.com.
[12] T. Ishihara and H. Yasuura. Voltage scheduling problem for dynamically variable voltage processors. In Proc. of The 1998 International Symposium on Low Power Electronics and Design, pages 197-202, Monterey, CA, Aug. 1998.
[13] P. Kumar and M. Srivastava. Predictive strategies for low-power RTOS scheduling. In Proc. of the 2000 IEEE International Conference on Computer Design: VLSI in Computers and Processors, Austin, TX, Sep. 2000.
[14] F. Liberato, S. Lauzac, R. Melhem, and D. Mosse. Fault-tolerant real-time global scheduling on multiprocessors. In Proc. of The 10th IEEE Euromicro Workshop in Real-Time Systems, York, UK, Jun. 1999.
[15] J. Luo and N. K. Jha. Power-conscious joint scheduling of periodic task graphs and aperiodic tasks in distributed real-time embedded systems. In Proc. of International Conference on Computer Aided Design (ICCAD), San Jose, CA, Nov. 2000.
[16] J. Luo and N. K. Jha. Static and dynamic variable voltage scheduling algorithms for real-time heterogeneous distributed embedded systems. In Proc. of 15th International Conference on VLSI Design, Bangalore, India, Jan. 2002.
[17] R. Melhem, N. AbouGhazaleh, H. Aydin, and D. Mosse. Power Management Points in Power-Aware Real-Time Systems, chapter 7, pages 127-152. Power Aware Computing. Plenum/Kluwer Publishers, 2002.
[18] D. Mosse, H. Aydin, B. Childers, and R. Melhem. Compiler-assisted dynamic power-aware scheduling for real-time applications. In Proc. of Workshop on Compiler and OS for Low Power, Philadelphia, PA, Oct. 2000.
[19] P. Pillai and K. G. Shin. Real-time dynamic voltage scaling for low-power embedded operating systems. In Proc. of 18th ACM Symposium on Operating Systems Principles (SOSP'01), Banff, Canada, Oct. 2001.
[20] R. Rajkumar, K. Juvva, A. Molano, and S. Oikawa. Resource kernels: A resource-centric approach to real-time systems. In Proc. of the SPIE/ACM Conference on Multimedia Computing and Networking, Jan. 1998.
[21] S. Selvakumar and C. Siva Ram Murthy. Scheduling precedence constrained task graphs with non-negligible intertask communication onto multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 5(3):328-336, 1994.
[22] D. Shin, J. Kim, and S. Lee. Intra-task voltage scheduling for low-energy hard real-time applications. IEEE Design & Test of Computers, 18(2):20-30, 2001.
[23] M. Weiser, B. Welch, A. Demers, and S. Shenker. Scheduling for reduced CPU energy. In Proc. of The First USENIX Symposium on Operating Systems Design and Implementation, pages 13-23, Monterey, CA, Nov. 1994.
[24] P. Yang, C. Wong, P. Marchal, F. Catthoor, D. Desmet, D. Kerkest, and R. Lauwereins. Energy-aware runtime scheduling for embedded-multiprocessor SOCs. IEEE Design & Test of Computers, 18(5):46-58, 2001.
[25] F. Yao, A. Demers, and S. Shenker. A scheduling model for reduced CPU energy. In Proc. of The 36th Annual Symposium on Foundations of Computer Science, pages 374-382, Milwaukee, WI, Oct. 1995.
[26] Y. Zhang, X. Hu, and D. Z. Chen. Task scheduling and voltage selection for energy minimization. In Proc. of The 39th Design Automation Conference, pages 183-188, New Orleans, LA, Jun. 2002.
[27] D. Zhu, N. AbouGhazaleh, D. Mosse, and R. Melhem. Power aware scheduling for AND/OR graphs in multi-processor real-time systems. In Proc. of The Int'l Conference on Parallel Processing, pages 593-601, Vancouver, Canada, Aug. 2002.
[28] D. Zhu, R. Melhem, and B. Childers. Scheduling with dynamic voltage/speed adjustment using slack reclamation in multi-processor real-time systems. In Proc. of The 22nd IEEE Real-Time Systems Symposium, pages 84-94, London, UK, Dec. 2001.