
doi:10.1145/1941487.1941507

The Future of Microprocessors

Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.

By Shekhar Borkar and Andrew A. Chien

Microprocessors—single-chip computers—are the building blocks of the information world. Their performance has grown 1,000-fold over the past 20 years, driven by transistor speed and energy scaling, as well as by microarchitecture advances that exploited the transistor density gains from Moore's Law.

In the next two decades, diminishing transistor-speed scaling and practical energy limits create new challenges for continued performance scaling. As a result, the frequency of operations will increase slowly, with energy the key limiter of performance, forcing designs to use large-scale parallelism, heterogeneous cores, and accelerators to achieve performance and energy efficiency. Software-hardware partnership to achieve efficient data orchestration is increasingly critical in the drive toward energy-proportional computing.

Our aim here is to reflect and project the macro trends shaping the future of microprocessors and sketch in broad strokes where processor design is going. We enumerate key research challenges and suggest promising research directions. Since dramatic changes are coming, we also seek to inspire the research community to invent new ideas and solutions that address how to sustain computing's exponential improvement.

Key insights
- Moore's Law continues but demands radical changes in architecture and software.
- Architectures will go beyond homogeneous parallelism, embrace heterogeneity, and exploit the bounty of transistors to incorporate application-customized hardware.
- Software must increase parallelism and exploit heterogeneous and application-customized hardware to deliver performance growth.

Microprocessors (see Figure 1) were invented in 1971,28 but it's difficult today to believe any of the early inventors could have conceived their extraordinary evolution in structure and use over the past 40 years. Microprocessors today not only involve complex microarchitectures and multiple execution engines (cores) but have grown to include all sorts of additional functions, including floating-point units, caches, memory controllers, and media-processing engines. However, the defining characteristics of a microprocessor remain—a single semiconductor chip embodying the primary computation (data transformation) engine in a computing system.

Because our own greatest access and insight involves Intel designs and data, our graphs and estimates draw heavily on them. In some cases, they may not be representative of the entire industry but certainly represent a large fraction. Such a forthright view, solidly grounded, best supports our goals for this article.

Figure 1. Evolution of Intel microprocessors 1971–2009: Intel 4004, 1971 (1 core, no cache, 23K transistors); Intel 8088, 1978 (1 core, no cache, 29K transistors); Intel Nehalem-EX, 2009 (8 cores, 24MB cache, 2.3B transistors).

20 Years of Exponential Performance Gains

For the past 20 years, rapid growth in microprocessor performance has been enabled by three key technology drivers—transistor-speed scaling, core microarchitecture techniques, and cache memories—discussed in turn in the following sections.

Transistor-speed scaling. The MOS transistor has been the workhorse for decades, scaling in performance by nearly five orders of magnitude and providing the foundation for today's unprecedented compute performance. The basic recipe for technology scaling was laid down by Robert N. Dennard of IBM17 in the early 1970s and followed for the past three decades. The scaling recipe calls for reducing transistor dimensions by 30% every generation (two years) and keeping electric fields constant everywhere in the transistor to maintain reliability. This might sound simple but is increasingly difficult to continue, for reasons discussed later. Classical transistor scaling provided three major benefits that made possible rapid growth in compute performance.

First, as transistor dimensions are scaled by 30% (to 0.7x), their area shrinks 50%, doubling transistor density every technology generation—the fundamental reason behind Moore's Law. Second, as the transistor is scaled, its performance increases by about 40% (0.7x delay reduction, or 1.4x frequency increase), providing higher system performance. Third, to keep the electric field constant, supply voltage is reduced by 30%, reducing energy by 65%, or power (at 1.4x frequency) by 50% (active power = CV²f). Putting it all together, in every technology generation transistor integration doubles, circuits are 40% faster, and system power consumption (with twice as many transistors) stays the same. This serendipitous scaling (almost too good to be true) enabled a three-orders-of-magnitude increase in microprocessor performance over the past 20 years. Chip architects exploited transistor density to create complex architectures and transistor speed to increase frequency, achieving it all within a reasonable power and energy envelope.
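To make the classical scaling arithmetic concrete, the short sketch below compounds the per-generation factors quoted above (0.7x dimensions, 2x density, 1.4x frequency, 0.7x voltage, active power = CV²f). It is a minimal illustration of the recipe, not a process model; the five-generation horizon and the unit starting values are assumptions chosen only for the example.

```python
# Minimal sketch of classical (Dennard) scaling, compounding the
# per-generation factors quoted in the text. Not a process model.

def scale_generations(n_generations):
    density = 1.0      # relative transistor density
    frequency = 1.0    # relative clock frequency
    voltage = 1.0      # relative supply voltage
    cap_per_tr = 1.0   # relative capacitance per transistor
    for gen in range(1, n_generations + 1):
        density *= 2.0       # 0.7x linear dimensions -> 2x density
        frequency *= 1.4     # ~40% faster circuits
        voltage *= 0.7       # constant-field scaling
        cap_per_tr *= 0.7    # smaller transistor -> lower capacitance
        # Active power = C * V^2 * f, summed over all transistors on the chip.
        chip_power = density * cap_per_tr * voltage**2 * frequency
        print(f"gen {gen}: density {density:5.1f}x, "
              f"frequency {frequency:4.2f}x, chip power {chip_power:4.2f}x")

scale_generations(5)  # five generations, roughly a decade (illustrative)
```

Each generation the chip-level power factor works out to 2 x 0.7 x 0.7² x 1.4 ≈ 0.96, which is the "power stays the same even with twice as many transistors" property the text describes.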
Figure 2. Architecture advances and energy efficiency: die area, integer performance, floating-point performance, and integer performance/watt (all normalized to the same process technology) for the transitions 386 to 486 (on-die cache, pipelined), 486 to Pentium (superscalar), Pentium to P6 (out-of-order, speculative), P6 to Pentium 4 (deep pipeline), and Pentium 4 to Core (back to non-deep pipeline).

Core microarchitecture techniques. Advanced microarchitectures have deployed the abundance of transistor-integration capacity, employing a dizzying array of techniques, including pipelining, branch prediction, out-of-order execution, and speculation, to deliver ever-increasing performance. Figure 2 outlines advances in microarchitecture, showing increases in die area, performance, and energy efficiency (performance/watt), all normalized in the same process technology. It uses characteristics of Intel microprocessors (such as 386, 486, Pentium, Pentium Pro, and Pentium 4), with performance measured by the SpecInt benchmark (92, 95, and 2000 representing the current benchmark for the era) at each data point. It compares each microarchitecture advance with a design without the advance (such as introducing an on-die cache by comparing 486 to 386 in 1μ technology, and the superscalar microarchitecture of Pentium in 0.7μ technology with 486).

This data shows that on-die caches and pipeline architectures used transistors well, providing a significant performance boost without compromising energy efficiency. In this era, superscalar and out-of-order architectures provided sizable performance benefits at a cost in energy efficiency. Of these architectures, deep-pipelined design seems to have delivered the lowest performance increase for the same area and power increase as out-of-order and speculative design, incurring the greatest cost in energy efficiency. The term "deep pipelined architecture" describes a deeper pipeline, as well as other circuit and microarchitectural techniques (such as trace cache and self-resetting domino logic) employed to achieve even higher frequency. Evident from the data is that reverting to a non-deep pipeline reclaimed energy efficiency by dropping these expensive and inefficient techniques.

When transistor performance increases frequency of operation, the performance of a well-tuned system generally increases, with frequency subject to the performance limits of other parts of the system. Historically, microarchitecture techniques exploiting the growth in available transistors have delivered performance increases empirically described by Pollack's Rule,32 whereby performance increases (when not limited by other parts of the system) as the square root of the number of transistors or area of a processor (see Figure 3). According to Pollack's Rule, each new technology generation doubles the number of transistors on a chip, enabling a new microarchitecture that delivers a 40% performance increase. The faster transistors provide an additional 40% performance (increased frequency), almost doubling overall performance within the same power envelope (per scaling theory). In practice, however, implementing a new microarchitecture every generation is difficult, so microarchitecture gains are typically less. In recent microprocessors, the increasing drive for energy efficiency has caused designers to forego many of these microarchitecture techniques.

Figure 3. Increased performance vs. area in the same process technology follows Pollack's Rule: integer performance grows roughly as sqrt(area) (slope 0.5 on a log-log plot) across the 386-to-486, 486-to-Pentium, Pentium-to-P6, P6-to-Pentium 4, and Pentium 4-to-Core transitions.

As Pollack's Rule broadly captures area, power, and performance trade-offs from several generations of microarchitecture, we use it as a rule of thumb to estimate single-thread performance in various scenarios throughout this article.
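As a concrete illustration of how Pollack's Rule is used as an estimator in the rest of this article, the sketch below derives relative single-thread performance from a relative transistor (area) budget; the 150M-vs-25M comparison echoes the Table 3 numbers discussed later, and everything else is just the square-root rule.

```python
import math

def pollack_performance(area_ratio):
    """Pollack's Rule: single-thread performance grows roughly as the
    square root of a core's area (or transistor) budget."""
    return math.sqrt(area_ratio)

# Doubling the transistors spent on one core (one generation of density)
# buys only ~1.4x microarchitecture performance; per classical scaling,
# faster transistors contribute another ~1.4x, almost doubling performance.
per_gen = pollack_performance(2.0) * 1.4
print(f"ideal per-generation single-thread gain: ~{per_gen:.2f}x")

# A future 150M-transistor core vs. a 25M-transistor core of today:
print(f"150M vs. 25M transistor core: ~{pollack_performance(150 / 25):.1f}x")
```

The second figure (sqrt(6) ≈ 2.4) is the approximately 2.5x microarchitecture-only improvement quoted later for a single 150-million-transistor core, far short of a 30x goal.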
Cache memory architecture. Dynamic memory technology (DRAM) has also advanced dramatically with Moore's Law over the past 40 years but with different characteristics. For example, memory density has doubled nearly every two years, while performance has improved more slowly (see Figure 4a). This slower improvement in cycle time has produced a memory bottleneck that could reduce a system's overall performance. Figure 4b outlines the increasing speed disparity, growing from 10s to 100s of processor clock cycles per memory access. It has lately flattened out due to the flattening of processor clock frequency. Unaddressed, the memory-latency gap would have eliminated and could still eliminate most of the benefits of processor improvement.

The reason for slow improvement of DRAM speed is practical, not technological. It's a misconception that DRAM technology based on capacitor storage is inherently slower; rather, the memory organization is optimized for density and lower cost, making it slower. The DRAM market has demanded large capacity at minimum cost over speed, depending on small and fast caches on the microprocessor die to emulate high-performance memory by providing the necessary bandwidth and low latency based on data locality. The emergence of sophisticated, yet effective, memory hierarchies allowed DRAM to emphasize density and cost over speed. At first, processors used a single level of cache, but, as processor speed increased, two to three levels of cache hierarchies were introduced to span the growing speed gap between processor and memory.33,37

Figure 4. DRAM density and performance, 1980–2010: (a) relative DRAM density and DRAM speed; (b) the growing gap in CPU clocks per DRAM latency.

In these hierarchies, the lowest-level caches were small but fast enough to match the processor's needs in terms of high bandwidth and low latency; higher levels of the cache hierarchy were then optimized for size and speed.

Figure 5 outlines the evolution of on-die caches over the past two decades, plotting cache capacity (a) and percentage of die area (b) for Intel microprocessors. At first, cache sizes increased slowly, with decreasing die area devoted to cache, and most of the available transistor budget was devoted to core microarchitecture advances. During this period, processors were probably cache-starved. As energy became a concern, increasing cache size for performance has proven more energy efficient than additional core-microarchitecture techniques requiring energy-intensive logic. For this reason, more and more transistor budget and die area are allocated to caches.

Figure 5. Evolution of on-die caches, 1μ to 65nm: (a) on-die cache capacity (KB); (b) on-die cache as a percentage of total die area.

The transistor-scaling-and-microarchitecture-improvement cycle has been sustained for more than two decades, delivering 1,000-fold performance improvement. How long will it continue? To better understand and predict future performance, we decouple the performance gain due to transistor speed and microarchitecture by comparing the same microarchitecture on different process technologies and new microarchitectures with the previous ones, then compounding the performance gain.

Figure 6 divides the cumulative 1,000-fold Intel microprocessor performance increase over the past two decades into performance delivered by transistor speed (frequency) and performance due to microarchitecture. Almost two orders of magnitude of this performance increase is due to transistor speed alone, now leveling off due to the numerous challenges described in the following sections.

Figure 6. Performance increase separated into transistor speed and microarchitecture performance, 1.5μ to 65nm: (a) integer performance vs. transistor performance; (b) floating-point performance vs. transistor performance (relative).

The Next 20 Years

Microprocessor technology has delivered three-orders-of-magnitude performance improvement over the past two decades, so continuing this trajectory would require at least a 30x performance increase by 2020. Microprocessor-performance scaling faces new challenges (see Table 1) precluding use of the energy-inefficient microarchitecture innovations developed over the past two decades. Further, chip architects must face these challenges with an ongoing industry expectation of a 30x performance increase in the next decade and a 1,000x increase by 2030 (see Table 2).

Table 1. New technology scaling challenges.
- Decreased transistor scaling benefits: despite continuing miniaturization, little performance improvement and little reduction in switching energy (decreasing performance benefits of scaling) [ITRS].
- Flat total energy budget: package power and mobile/embedded computing drive energy-efficiency requirements.

Table 2. Ongoing technology scaling.
- Increasing transistor density (in area and volume) and count: through continued feature scaling, process innovations, and packaging innovations.
- Need for increasing locality and reduced bandwidth per operation: as performance of the microprocessor increases and the data sets for applications continue to grow.

Figure 7. Unconstrained evolution of a microprocessor results in excessive power consumption (projection for a 100mm² die, with power rising into the hundreds of watts).

Sidebar: Death of 90/10 Optimization, Rise of 10×10 Optimization

Traditional wisdom suggests investing maximum transistors in the 90% case, with the goal of using precious transistors to increase single-thread performance that can be applied broadly. In the new scaling regime typified by slow transistor performance and energy improvement, it often makes no sense to add transistors to a single core, as energy efficiency suffers. Using additional transistors to build more cores produces a limited benefit—increased performance for applications with thread parallelism. In this world, 90/10 optimization no longer applies. Instead, optimizing with an accelerator for a 10% case, then another for a different 10% case, then another 10% case can often produce a system with better overall energy efficiency and performance. We call this "10×10 optimization,"14 as the goal is to attack performance as a set of 10% optimization opportunities—a different way of thinking about transistor cost, operating the chip with 10% of the transistors active—90% inactive, but a different 10% at each point in time.

Historically, transistors on a chip were expensive due to the associated design effort, validation and testing, and ultimately manufacturing cost. But 20 generations of Moore's Law and advances in design and validation have shifted the balance. Building systems where the 10% of the transistors that can operate within the energy budget are configured optimally (an accelerator well suited to the application) may well be the right solution. The choice of 10 cases is illustrative, and a 5×5, 7×7, 10×10, or 12×12 architecture might be appropriate for a particular design.

As the transistor scales, supply voltage scales down, and the threshold voltage of the transistor (the voltage at which the transistor starts conducting) also scales down. But the transistor is not a perfect switch; it leaks a small amount of current when turned off, and this leakage increases exponentially with reduction in the threshold voltage. In addition, the exponentially increasing transistor-integration capacity exacerbates the effect; as a result, a substantial portion of power consumption is due to leakage. To keep leakage under control, the threshold voltage cannot be lowered further and, indeed, must increase, reducing transistor performance.10

As transistors have reached atomic dimensions, lithography and variability pose further scaling challenges, affecting supply-voltage scaling.11 With limited supply-voltage scaling, energy and power reduction is limited, adversely affecting further integration of transistors. Therefore, transistor-integration capacity will continue to grow with scaling, though with limited performance and power benefit. The challenge for chip architects is to use this integration capacity to continue to improve performance.

Package power/total energy consumption limits the number of logic transistors. If chip architects simply add more cores as transistor-integration capacity becomes available and operate the chips at the highest frequency the transistors and designs can achieve, then the power consumption of the chips would be prohibitive (see Figure 7). Chip architects must limit frequency and number of cores to keep power within reasonable bounds, but doing so severely limits improvement in microprocessor performance.

Consider the transistor-integration capacity affordable in a given power envelope for a reasonable die size. For regular desktop applications the power envelope is around 65 watts, and the die size is around 100mm². Figure 8 outlines a simple analysis for the 45nm process technology node; the x-axis is the number of logic transistors integrated on the die, and the two y-axes are the amount of cache that would fit and the power the die would consume. As the number of logic transistors on the die increases (x-axis), the size of the cache decreases, and power dissipation increases. This analysis assumes average activity factors for logic and cache observed in today's microprocessors. If the die integrates no logic at all, then the entire die could be populated with about 16MB of cache and consume less than 10 watts of power, since caches consume less power than logic (Case A). On the other hand, if it integrates no cache at all, then it could integrate 75 million transistors for logic, consuming almost 90 watts of power (Case B). For 65 watts, the die could integrate 50 million transistors for logic and about 6MB of cache (Case C).

Figure 8. Transistor integration capacity at a fixed power envelope (2008, 45nm, 100mm² die): Case A, no logic and 16MB of cache at about 8 watts; Case B, 75 million logic transistors and no cache at almost 90 watts; Case C, 50 million logic transistors and 6MB of cache at 65 watts.
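A minimal sketch of this fixed-power-envelope trade-off follows. The per-transistor and per-megabyte power coefficients and the linear cache/logic area exchange are assumptions fitted to reproduce the Case A/B/C numbers above, not process data, so treat it as an illustration of the reasoning rather than of 45nm silicon.

```python
# Rough model of the Figure 8 trade-off on a fixed 100mm^2, 45nm-class die.
# Coefficients are assumptions fitted to the article's Case A/B/C numbers.

MAX_LOGIC_MT = 75.0    # Case B: a logic-only die holds ~75M logic transistors
MAX_CACHE_MB = 16.0    # Case A: a cache-only die holds ~16MB of cache
WATTS_PER_MT = 1.2     # assumed watts per million logic transistors
WATTS_PER_MB = 0.5     # assumed watts per MB of cache (cache runs far cooler)

def die_configuration(logic_mt):
    """Cache that still fits, and total power, for a given logic budget."""
    cache_mb = MAX_CACHE_MB * (1.0 - logic_mt / MAX_LOGIC_MT)
    power_w = WATTS_PER_MT * logic_mt + WATTS_PER_MB * cache_mb
    return cache_mb, power_w

for logic_mt in (0, 25, 50, 75):
    cache_mb, power_w = die_configuration(logic_mt)
    print(f"{logic_mt:2d}M logic transistors -> "
          f"{cache_mb:4.1f}MB cache, ~{power_w:3.0f} W")
```

With these assumed coefficients the model lands near the article's design points: roughly 8 watts for the all-cache die and about 63 watts for 50 million logic transistors plus cache, close to the 65-watt Case C.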

This design point matches the dual-core microprocessor on 45nm technology (Core2 Duo), integrating two cores of 25 million transistors each and 6MB of cache in a die area of about 100mm².

If this analysis is performed for future technologies, assuming (our best estimates) a modest frequency increase of 15% per generation, 5% reduction in supply voltage, and 25% reduction of capacitance, then the results will be as they appear in Table 3. Note that over the next 10 years we expect increased total transistor count, following Moore's Law, but logic transistors increase by only 3x and cache transistors increase more than 10x. Applying Pollack's Rule, a single processor core with 150 million transistors will provide only about 2.5x microarchitecture performance improvement over today's 25-million-transistor core, well shy of our 30x goal, while 80MB of cache is probably more than enough for the cores (see Table 3).

The reality of a finite (essentially fixed) energy budget for a microprocessor must produce a qualitative shift in how chip architects think about architecture and implementation. First, energy efficiency is a key metric for these designs. Second, energy-proportional computing must be the ultimate goal for both hardware architecture and software-application design. While this ambition is noted in macro-scale computing in large-scale data centers,5 the idea of micro-scale energy-proportional computing in microprocessors is even more challenging. For microprocessors operating within a finite energy budget, energy efficiency corresponds directly to higher performance, so the quest for extreme energy efficiency is the ultimate driver for performance.

In the following sections, we outline key challenges and sketch potential approaches. In many cases, the challenges are well known and have been the subject of significant research over many years. In all cases, they remain critical but daunting for the future of microprocessor performance:

Organizing the logic: Multiple cores and customization. The historic measure of microprocessor capability is the single-thread performance of a traditional core. Many researchers have observed that single-thread performance has already leveled off, with only modest increases expected in the coming decades. Multiple cores and customization will be the major drivers for future microprocessor performance (total chip performance). Multiple cores can increase computational throughput (such as the 1x–4x increase that could result from four cores), and customization can reduce execution latency.

Figure 9. Three scenarios for integrating 150 million logic transistors into cores: (a) large-core homogeneous, six 25MT cores, total throughput 6; (b) small-core homogeneous, thirty 5MT cores, each with Pollack's-Rule throughput (5/25)^0.5 = 0.45, total throughput 13; (c) a large-core/small-core hybrid, total throughput 11.

Figure 10. A system-on-a-chip from Texas Instruments (the 3515/3525/3530 family): ARM Cortex A8 CPU; C64x+ DSP and video accelerators; 2D/3D graphics; camera interface and image pipe; display subsystem with LCD controller, video encoder, and 10-bit DACs; an L3/L4 interconnect; and peripherals for connectivity (USB 2.0 high-speed OTG and host controllers), serial interfaces (McBSP, I2C, UART, McSPI, HDQ/1-wire), system timers, and program/data storage (SDRC, GPMC, MMC/SD/SDIO).

Table 3. Extrapolated transistor integration capacity in a fixed power envelope.
Year: Logic Transistors (Millions) / Cache (MB)
2008: 50 / 6
2014: 100 / 25
2018: 150 / 80

Clearly, both techniques—multiple cores and customization—can improve energy efficiency, the new fundamental limiter to capability.

Choices in multiple cores. Multiple cores increase computational throughput by exploiting Moore's Law to replicate cores. If the software has no parallelism, there is no performance benefit. However, if there is parallelism, the computation can be spread across multiple cores, increasing overall computational performance (and reducing latency). Extensive research on how to organize such systems dates to the 1970s.29,39

Industry has widely adopted a multicore approach, sparking many questions about the number of cores, the size/power of each core, and how they coordinate.6,36 But if we employ 25-million-transistor cores (circa 2008), the 150-million-logic-transistor budget expected in 2018 gives 6x potential throughput improvement (2x from frequency and 3x from increased logic transistors), well short of our 30x goal. To go further, chip architects must consider more radical options of smaller cores in greater numbers, along with innovative ways to coordinate them.

Looking to achieve this vision, consider three potential approaches to deploying the feasible 150 million logic transistors, as in Table 3. In Figure 9, option (a) is six large cores (good single-thread performance, total potential throughput of six); option (b) is 30 smaller cores (lower single-thread performance, total potential throughput of 13); and option (c) is a hybrid approach (good single-thread performance for low parallelism, total potential throughput of 11). Many more variations are possible on this spectrum of core size and number of cores, and the related choices in a multicore processor with a uniform instruction set but heterogeneous implementation are an important part of increasing performance within the transistor budget and energy envelope.
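The sketch below reproduces the throughput arithmetic behind these three options using Pollack's Rule, normalizing a 25M-transistor core to throughput 1.0 as the article does. The exact mix assumed for the hybrid option (two large plus 20 small cores) is our own choice, made so the 150MT budget and the quoted total of about 11 both work out; Figure 9's hybrid may be drawn differently.

```python
import math

LARGE_CORE_MT = 25.0   # large core: 25M transistors, throughput defined as 1.0
SMALL_CORE_MT = 5.0    # small core: 5M transistors

def core_throughput(transistors_mt):
    """Relative throughput of one core by Pollack's Rule (sqrt of area ratio)."""
    return math.sqrt(transistors_mt / LARGE_CORE_MT)

def chip(n_large, n_small):
    budget = n_large * LARGE_CORE_MT + n_small * SMALL_CORE_MT
    throughput = (n_large * core_throughput(LARGE_CORE_MT) +
                  n_small * core_throughput(SMALL_CORE_MT))
    return budget, throughput

scenarios = [("(a) six large cores", 6, 0),
             ("(b) thirty small cores", 0, 30),
             ("(c) hybrid, assumed 2 large + 20 small", 2, 20)]
for label, n_large, n_small in scenarios:
    budget, throughput = chip(n_large, n_small)
    print(f"{label}: {budget:.0f}MT budget, total throughput ~{throughput:.0f}")
```

Each small core contributes only sqrt(5/25) ≈ 0.45 of a large core's throughput, which is why 30 of them total about 13 rather than 30.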
Choices in hardware customization. Customization includes fixed-function accelerators (such as media codecs, cryptography engines, and compositing engines), programmable accelerators, and even dynamically customizable logic (such as FPGAs and other dynamic structures). In general, customization increases computational performance by exploiting hardwired or customized computation units, customized wiring/interconnect for data movement, and reduced instruction-sequence overheads, at some cost in generality. In addition, the level of parallelism in hardware can be customized to match the precise needs of the computation; computation benefits from hardware customization only when it matches the specialized hardware structures. In some cases, units hardwired to a particular data representation or computational algorithm can achieve 50x–500x greater energy efficiency than a general-purpose register organization. Two studies21,22 of a media encoder and a TCP offload engine illustrate the large energy-efficiency improvement that is possible.

Due to battery capacity and heat-dissipation limits, for many years energy has been the fundamental limiter for computational capability in smartphone systems-on-a-chip (SoCs). As outlined in Figure 10, such an SoC might include as many as 10 to 20 accelerators to achieve a superior balance of energy efficiency and performance. This example could also include graphics, media, image, and cryptography accelerators, as well as support for radio and digital signal processing. As one might imagine, one of these blocks could be a dynamically programmable element (such as an FPGA or a software-programmable processor).

Another customization approach constrains the types of parallelism that can be executed efficiently, enabling simpler core, coordination, and memory structures; for example, many CPUs increase energy efficiency by restricting memory-access structure and control flexibility in single-instruction, multiple-data or vector (SIMD) structures,1,2 while GPUs encourage programs to express structured sets of threads that can be aligned and executed efficiently.26,30 This alignment reduces parallel coordination and memory-access costs, enabling use of large numbers of cores and high peak performance when applications can be formulated with a compatible parallel structure. Several microprocessor manufacturers have announced future mainstream products that integrate CPUs and GPUs.

Table 4. Logic organization challenges, trends, directions.
- Integration and memory model. Near-term: I/O-based interaction, shared memory spaces, explicit coherence management. Long-term: intelligent, automatic data movement among heterogeneous cores, managed by software-hardware partnership.
- Software transparency. Near-term: explicit partition and mapping, virtualization, application management. Long-term: hardware-based state adaptation and software-hardware partnership for management.
- Lower-power cores. Near-term: heterogeneous cores, vector extensions, and GPU-like techniques to reduce instruction- and data-movement cost. Long-term: deeper, explicit storage hierarchy within the core; integrated computation in registers.
- Energy management. Near-term: hardware dynamic voltage scaling and intelligent adaptive management, software core selection and scheduling. Long-term: predictive core scheduling and selection to optimize energy efficiency and minimize data movement.
- Accelerator variety. Near-term: increasing variety, library-based encapsulation (such as DX and OpenGL) for specific domains. Long-term: converged accelerators in a few application categories and increasing open programmability for the accelerators.

Customization for greater energy or computational efficiency is a long-standing technique, but broad adoption has been slowed by continued improvement in microprocessor single-thread performance.

Developers of software applications had little incentive to customize for accelerators that might be available on only a fraction of the machines in the field and for which the performance advantage might soon be overtaken by advances in the traditional microprocessor. With slowing improvement in single-thread performance, this landscape has changed significantly, and for many applications, accelerators may be the only path toward increased performance or energy efficiency (see Table 4). But such software customization is difficult, especially for large programs (see the sidebar "Death of 90/10 Optimization, Rise of 10×10 Optimization").

Orchestrating data movement: Memory hierarchies and interconnects. In future microprocessors, the energy expended for data movement will have a critical effect on achievable performance. Every nano-joule of energy used to move data up and down the memory hierarchy, as well as to synchronize across and move data between processors, takes away from the limited budget, reducing the energy available for the actual computation. In this context, efficient memory hierarchies are critical, as the energy to retrieve data from a local register or cache is far less than the energy to go to DRAM or to secondary storage. In addition, data must be moved between processing units efficiently, and task placement and scheduling must be optimized against an interconnection network with high locality. Here, we examine the energy and power associated with data movement on the processor die.

Today's processor performance is on the order of 100Giga-op/sec, and a 30x increase over the next 10 years would increase this performance to 3Tera-op/sec. At minimum, this boost requires 9Tera-operands, or 64b x 9Tera-operands (576Tera-bits), to be moved each second from registers or memory to arithmetic logic, consuming energy.

Figure 11(a) outlines typical wire delay and energy consumed in moving one bit of data on the die. If the operands move on average 1mm (10% of die size), then at the rate of 0.1pJ/bit, the 576Tera-bits/sec of movement consumes almost 58 watts, with hardly any energy budget left for computation. If most operands are kept local to the execution units (such as in register files) and the data movement is far less than 1mm, on, say, the order of only 0.1mm, then the power consumption is only around 6 watts, allowing ample energy budget for the computation.

Figure 11. On-die interconnect delay and energy (45nm): (a) wire delay (ps) and wire energy (pJ) vs. on-die interconnect length (mm); (b) on-die network energy per bit (pJ/bit), measured for past designs and extrapolated from 0.5μ down to 8nm.
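The wattage figures above are just bits moved per second multiplied by energy per bit; the sketch below redoes that arithmetic with the article's numbers. The three-operands-per-operation count matches the 9Tera-operands figure in the text, and the 0.01pJ/bit value for 0.1mm movement is our assumption that wire energy scales roughly with length.

```python
# Back-of-the-envelope data-movement power at a 3 Tera-op/sec target.

OPS_PER_SEC = 3e12        # 3 Tera-op/sec
OPERANDS_PER_OP = 3       # consistent with the text's 9 Tera-operands/sec
BITS_PER_OPERAND = 64

bits_per_sec = OPS_PER_SEC * OPERANDS_PER_OP * BITS_PER_OPERAND  # ~576 Tb/s

def movement_watts(picojoules_per_bit):
    """Power consumed moving bits_per_sec at the given energy per bit."""
    return bits_per_sec * picojoules_per_bit * 1e-12

print(f"operand traffic: {bits_per_sec / 1e12:.0f} Tera-bits/sec")
print(f"~1mm   average movement, 0.1  pJ/bit: {movement_watts(0.1):4.1f} W")
print(f"~0.1mm average movement, 0.01 pJ/bit (assumed): {movement_watts(0.01):4.1f} W")
```

The two results, roughly 58 watts versus 6 watts, are the article's point: locality, not raw compute, determines whether the energy budget survives.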
Figure 12. Hybrid switching for network-on-a-chip: busses connect small cores into clusters; clusters are then connected either by a second-level bus (a hierarchy of busses) or by a second-level router-based network (a hierarchy of networks).

Table 5. Data movement challenges, trends, directions.
- Parallelism. Near-term: increased parallelism. Long-term: heterogeneous parallelism and customization, hardware/runtime placement, migration, and adaptation for locality and load balance.
- Data movement/locality. Near-term: more complex, more exposed hierarchies; new abstractions for control over movement and "snooping." Long-term: new memory abstractions and mechanisms for efficient vertical data-locality management with low programming effort and energy.
- Resilience. Near-term: more aggressive energy reduction, compensated by recovery for resilience. Long-term: radical new memory technologies (new physics) and resilience techniques.
- Energy-proportional communication. Near-term: fine-grain power management in packet fabrics. Long-term: exploitation of wide data, slow clock, and circuit-based techniques.
- Reduced energy. Near-term: low-energy address translation. Long-term: efficient multi-level naming and memory-hierarchy management.

Cores in a many-core system are typically connected through a network-on-a-chip to move data around the cores.40 Here, we examine the effect of such a network on power consumption.

Figure 11(b) shows the energy consumed in moving a bit across a hop in such a network, measured in historic networks and extrapolated into the future from previous assumptions. If only 10% of the operands move over the network, traversing 10 hops on average, then at the rate of 0.06pJ/bit the network power would be 35 watts, more than half the power budget of the processor.

As the energy cost of computation is reduced by voltage scaling (described later), emphasizing compute throughput, the cost of data movement starts to dominate. Therefore, data movement must be restricted by keeping data local as much as possible. This restriction also means the size of local storage (such as a register file) must increase substantially. This increase is contrary to conventional thinking of register files being small and thus fast. With voltage scaling the frequency of operation is lower anyway, so it makes sense to increase the size of the local storage at the expense of speed.

Another radical departure from conventional thinking is the role of the interconnect network on the chip. Recent parallel-machine designs have been dominated by packet switching,6,8,24,40 so multicore networks adopted this energy-intensive approach. In the future, data movement over these networks must be limited to conserve energy; more important, due to the large size of local storage, data-bandwidth demand on the network will be reduced. In light of these findings, on-die-network architectures need revolutionary approaches (such as hybrid packet/circuit switching4). Many older parallel machines used irregular and circuit-switched networks31,41; Figure 12 describes a return to hybrid switched networks for on-chip interconnects. Small cores in close proximity could be interconnected into clusters with traditional busses that are energy efficient for data movement over short distances. The clusters could be connected through wide (high-bandwidth), low-swing (low-energy) busses or through packet- or circuit-switched networks, depending on distance. Hence the network-on-a-chip could be hierarchical and heterogeneous, a radical departure from the traditional parallel-machine approach (see Table 5).

The role of the microprocessor architect must expand beyond the processor core, into the whole platform on a chip, optimizing the cores as well as the network and other subsystems.

Pushing the envelope: Extreme circuits, variability, resilience. Our analysis showed that in the power-constrained scenario, only 150 million logic transistors for processor cores and 80MB of cache will be affordable, due to energy, by 2018. Note that 80MB of cache is not necessary for this system, and a large portion of the cache-transistor budget can be utilized to integrate even more cores if it can be done with the power-consumption density of a cache, which is 10x less than logic. This approach can be achieved through aggressive scaling of supply voltage.25

Figure 13. Improving energy efficiency through voltage scaling (65nm CMOS, 50°C): maximum frequency, total power, energy efficiency (GOP/Watt), and active leakage power vs. supply voltage, with the subthreshold region below about 320mV.

Table 6. Circuits challenges, trends, directions.
- Power, energy efficiency. Near-term: continuous dynamic voltage and frequency scaling, power gating, reactive power management. Long-term: discrete dynamic voltage and frequency scaling, near-threshold operation, proactive fine-grain power and energy management.
- Variation. Near-term: speed binning of parts, corrections with body bias or supply-voltage changes, tighter process control. Long-term: dynamic reconfiguration of many cores by speed.
- Gradual, temporal, intermittent, and permanent faults. Near-term: guard-bands, yield loss, core sparing, design for manufacturability. Long-term: resilience with hardware/software co-design, dynamic in-field detection, diagnosis, reconfiguration and repair, adaptability, and self-awareness.

Figure 14. A heterogeneous many-core system with variation: a few large cores for single-thread performance plus many small cores running at f, f/2, or f/4 for throughput performance, made energy efficient with fine-grain power management.

Figure 13 outlines the effectiveness of supply-voltage scaling when the chip is designed for it. As the supply voltage is reduced, frequency also reduces, but energy efficiency increases. When the supply voltage is reduced all the way to the transistor's threshold, energy efficiency increases by an order of magnitude. Employing this technique on large cores would dramatically reduce single-thread performance and is hence not recommended. However, smaller cores used for throughput would certainly benefit from it.
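To give a feel for why near-threshold operation pays off, here is a deliberately crude model in which frequency falls roughly with the voltage overdrive and switching energy with the square of supply voltage, so operations per joule rise steeply as voltage approaches threshold. The 320mV threshold echoes the marking on Figure 13's plots, but the proportionalities, the 1.1V nominal point, and the neglect of leakage are all simplifying assumptions, so the numbers only show the trend, not the measured 65nm data.

```python
# Crude near-threshold scaling model (trend only; leakage is ignored).
# Assumptions: frequency ~ (V - Vt), switching energy per op ~ V^2.

V_THRESHOLD = 0.32   # volts; Figure 13 marks ~320mV
V_NOMINAL = 1.1      # assumed nominal supply voltage

def relative_frequency(v):
    return max(v - V_THRESHOLD, 0.0) / (V_NOMINAL - V_THRESHOLD)

def relative_efficiency(v):
    """Operations per joule relative to nominal supply voltage."""
    return (V_NOMINAL / v) ** 2

for v in (1.1, 0.9, 0.7, 0.5, 0.4, 0.35):
    print(f"V = {v:4.2f}V  frequency ~{relative_frequency(v):4.2f}x  "
          f"energy efficiency ~{relative_efficiency(v):4.1f}x")
```

Near 350mV the modeled efficiency gain approaches an order of magnitude while frequency collapses, which is why the text recommends this regime for small throughput cores rather than for large single-thread cores.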

Moreover, the transistor budget from the unused cache could be used to integrate even more cores with the power density of the cache. Aggressive voltage scaling provides an avenue for utilizing the unused transistor-integration capacity for logic to deliver higher performance.

Aggressive supply-voltage scaling comes with its own challenges (such as variations). As supply voltage is reduced toward a transistor's threshold voltage, the effect of variability is even worse, because the speed of a circuit is proportional to the voltage overdrive (supply voltage minus threshold voltage). Moreover, as supply voltage approaches the threshold, any small change in threshold voltage affects the speed of the circuit. Therefore, variation in the threshold voltage manifests itself as variation in the speed of the core; the slowest circuit in the core determines the frequency of operation of the core, and a large core is more susceptible to lower frequency of operation due to variations. On the other hand, a large number of small cores has a better distribution of fast and slow small cores and can better even out the effect of variations. We next discuss an example system that is variation-tolerant, energy-efficient, energy-proportional, and fine-grain power managed.

A hypothetical heterogeneous processor (see Figure 14) consists of a small number of large cores for single-thread performance and many small cores for throughput performance. The supply voltage and frequency of any given core are individually controlled such that the total power consumption is within the power envelope. Many small cores operate at lower voltages and frequency for improved energy efficiency, while some small cores operate near threshold voltage at the lowest frequency but at higher energy efficiency, and some cores may be turned off completely. Clock frequencies need not be continuous; steps (in powers of two) keep the system synchronous and simple without compromising performance while also addressing variation tolerance. The scheduler dynamically monitors the workload, configures the system with the proper mix of cores, and schedules the workload on the right cores for energy-proportional computing. Combined heterogeneity, aggressive supply-voltage scaling, and fine-grain power (energy) management enable utilization of a larger fraction of transistor-integration capacity, moving closer to the goal of a 30x increase in compute performance (see Table 6).
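The kind of scheduling policy such a chip implies can be sketched in a few lines. The code below is purely illustrative: the core counts, per-step power numbers, power budget, and the greedy step-down policy are all assumptions we introduce for the example, not a description of any real scheduler.

```python
# Illustrative energy-proportional scheduler for a hypothetical heterogeneous
# processor: a few large cores for serial work, many small cores for parallel
# work, with discrete frequency steps f, f/2, f/4. All numbers are assumed.

POWER_BUDGET_W = 65.0
LARGE_CORE_W = 20.0                               # large core at full frequency f
SMALL_CORE_W = {1.0: 2.0, 0.5: 0.8, 0.25: 0.3}    # watts at f, f/2, f/4

def schedule(serial_tasks, parallel_tasks, n_large=2, n_small=24):
    """Greedy policy: serial tasks on large cores at f; parallel tasks on
    small cores, stepped down (or switched off) until power fits the budget."""
    large_used = min(serial_tasks, n_large)
    small_used = min(parallel_tasks, n_small)
    for step in (1.0, 0.5, 0.25):                 # try the highest frequency first
        power = large_used * LARGE_CORE_W + small_used * SMALL_CORE_W[step]
        if power <= POWER_BUDGET_W:
            return {"large_cores": large_used, "small_cores": small_used,
                    "small_step": step, "power_w": power}
    # Still over budget at f/4: turn small cores off until the chip fits.
    while small_used > 0:
        power = large_used * LARGE_CORE_W + small_used * SMALL_CORE_W[0.25]
        if power <= POWER_BUDGET_W:
            break
        small_used -= 1
    return {"large_cores": large_used, "small_cores": small_used,
            "small_step": 0.25,
            "power_w": large_used * LARGE_CORE_W + small_used * SMALL_CORE_W[0.25]}

print(schedule(serial_tasks=1, parallel_tasks=24))
print(schedule(serial_tasks=2, parallel_tasks=24))
```

A real system would also weigh per-core speed variation and near-threshold operating points, but even this toy policy shows the shape of energy-proportional operation: keep the serial cores fast, and spend whatever budget remains on as many slow, efficient cores as fit.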
Software challenges renewed: Programmability versus efficiency. The end of scaling of single-thread performance already means major software challenges; for example, the shift to symmetric parallelism has created perhaps the greatest software challenge in the history of computing,12,15 and we expect future pressure on energy efficiency will lead to extensive use of heterogeneous cores and accelerators, further exacerbating the software challenge. Fortunately, the past decade has seen increasing adoption of high-level "productivity" languages20,34,35 built on advanced interpretive and compiler technologies, as well as increasing use of dynamic translation techniques. We expect these trends to continue; higher-level programming, extensive customization through libraries, and sophisticated automated performance-search techniques (such as autotuning) will be even more important.

Extreme studies27,38 suggest that aggressive high-performance and extreme-energy-efficient systems may go further, eschewing the overhead of programmability features that software engineers have come to take for granted; for example, these future systems may drop hardware support for a single flat address space (which normally wastes energy on address manipulation/computing), a single memory hierarchy (coherence and monitoring energy overhead), and a steady rate of execution (adapting instead to the available energy budget). These systems will place more of these components under software control, depending on increasingly sophisticated software tools to manage the hardware boundaries and irregularities with greater energy efficiency. In extreme cases, high-performance computing and embedded applications may even manage these complexities explicitly. Most architectural features and techniques we've discussed here shift more responsibility for distribution of the computation and data across the compute and storage elements of microprocessors to software.13,18 Shifting responsibility increases the potential achievable energy efficiency, but realizing it depends on significant advances in applications, compilers and runtimes, and operating systems to understand and even predict the application and workload behavior.7,16,19 However, these advances require radical research breakthroughs and major changes in software practice (see Table 7).

Table 7. Software challenges, trends, directions.
- 1,000-fold software parallelism. Near-term: data-parallel languages and "mapping" of operators, library- and tool-based approaches. Long-term: new high-level languages, compositional and deterministic frameworks.
- Energy-efficient data movement and locality. Near-term: manual control and profiling, maturing to automated techniques (auto-tuning, optimization). Long-term: new algorithms, languages, program analysis, runtime, and hardware techniques.
- Energy management. Near-term: automatic fine-grain hardware management. Long-term: self-aware runtime and application-level techniques that exploit architecture features for visibility and control.
- Resilience. Near-term: algorithmic, application-software approaches, adaptive checking and recovery. Long-term: new hardware-software partnerships that minimize checking and recomputation energy.

Conclusion

The past 20 years were truly the great old days for Moore's Law scaling and microprocessor performance; dramatic improvements in transistor density, speed, and energy, combined with microarchitecture and memory-hierarchy techniques, delivered 1,000-fold microprocessor performance improvement.


The next 20 years—the pretty good new days, as progress continues—will be more difficult, with Moore's Law scaling producing continuing improvement in transistor density but comparatively little improvement in transistor speed and energy. As a result, the frequency of operation will increase slowly. Energy will be the key limiter of performance, forcing processor designs to use large-scale parallelism with heterogeneous cores, or a few large cores and a large number of small cores operating at low frequency and low voltage, near threshold. Aggressive use of customized accelerators will yield the highest performance and greatest energy efficiency on many applications. Efficient data orchestration will increasingly be critical, evolving to more efficient memory hierarchies and new types of interconnect tailored for locality and depending on sophisticated software to place computation and data so as to minimize data movement. The objective is ultimately the purest form of energy-proportional computing at the lowest possible levels of energy. Heterogeneity in compute and communication hardware will be essential for optimizing performance, for energy-proportional computing, and for coping with variability. Finally, programming systems will have to comprehend these restrictions and provide tools and environments to harvest the performance.

While no one can reliably predict the end of Si CMOS scaling, for this future scaling regime many electrical engineers have begun exploring new types of switches and materials (such as compound semiconductors, carbon nanotubes, and graphene) with different performance and scaling characteristics from Si CMOS, posing new types of design and manufacturing challenges. However, all such technologies are in their infancy, probably not ready in the next decade to replace silicon, and they will pose the same challenges with continued scaling. Quantum electronics (such as quantum dots) are even farther out and, when realized, will present major challenges of their own, with yet newer models of computation, architecture, manufacturing, variability, and resilience.

Because the future winners are far from clear today, it is way too early to predict whether some form of scaling (perhaps energy) will continue or there will be no scaling at all. The pretty good old days of scaling that processor design faces today are helping prepare us for these new challenges. Moreover, the challenges processor design will face in the next decade will be dwarfed by the challenges posed by these alternative technologies, rendering today's challenges a warm-up exercise for what lies ahead.

Acknowledgments

This work was inspired by the Exascale study working groups chartered in 2007 and 2008 by Bill Harrod of DARPA. We thank him and the members of and presenters to the working groups for valuable, insightful discussions over the past few years. We also thank our colleagues at Intel who have improved our understanding of these issues through many thoughtful discussions. Thanks, too, to the anonymous reviewers whose extensive feedback greatly improved the article.

References
1. Advanced Vector Extensions. Intel; http://en.wikipedia.org/wiki/Advanced_Vector_Extensions
2. AltiVec. Apple, IBM, Freescale; http://en.wikipedia.org/wiki/AltiVec
3. Amdahl, G. Validity of the single-processor approach to achieving large-scale computing capability. AFIPS Joint Computer Conference (Apr. 1967), 483–485.
4. Anders, M. et al. A 4.1Tb/s bisection-bandwidth 560Gb/s/W streaming circuit-switched 8x8 mesh network-on-chip in 45nm CMOS. International Solid State Circuits Conference (Feb. 2010).
5. Barroso, L.A. and Hölzle, U. The case for energy-proportional computing. IEEE Computer 40, 12 (Dec. 2007).
6. Bell, S. et al. TILE64 processor: A 64-core SoC with mesh interconnect. IEEE International Solid-State Circuits Conference (2008).
7. Bienia, C. et al. The PARSEC benchmark suite: Characterization and architectural implications. The 17th International Symposium on Parallel Architectures and Compilation Techniques (2008).
8. Blumrich, M. et al. Design and Analysis of the Blue Gene/L Torus Interconnection Network. IBM Research Report, 2003.
9. Borkar, S. Designing reliable systems from unreliable components: The challenges of transistor variability and degradation. IEEE Micro 25, 6 (Nov.–Dec. 2005).
10. Borkar, S. Design challenges of technology scaling. IEEE Micro 19, 4 (July–Aug. 1999).
11. Borkar, S. et al. Parameter variations and impact on circuits and microarchitecture. The 40th Annual Design Automation Conference (2003).
12. Catanzaro, B. et al. Ubiquitous parallel computing from Berkeley, Illinois, and Stanford. IEEE Micro 30, 2 (2010).
13. Cray, Inc. Chapel Language Specification. Seattle, WA, 2010; http://chapel.cray.com/spec/spec-0.795.pdf
14. Chien, A. 10x10: A general-purpose architectural approach to heterogeneity and energy efficiency. The Third Workshop on Emerging Parallel Architectures at the International Conference on Computational Science (June 2011).
15. Chien, A. Pervasive parallel computing: An historic opportunity for innovation in programming and architecture. ACM Principles and Practice of Parallel Programming (2007).
16. Cooper, B. et al. Benchmarking cloud serving systems with YCSB. ACM Symposium on Cloud Computing (June 2010).
17. Dennard, R. et al. Design of ion-implanted MOSFETs with very small physical dimensions. IEEE Journal of Solid State Circuits SC-9, 5 (Oct. 1974), 256–268.
18. Fatahalian, K. et al. Sequoia: Programming the memory hierarchy. ACM/IEEE Conference on Supercomputing (Nov. 2006).
19. Flinn, J. et al. Managing battery lifetime with energy-aware adaptation. ACM Transactions on Computer Systems 22, 2 (May 2004).
20. Gosling, J. et al. The Java Language Specification, Third Edition. Addison-Wesley, 2005.
21. Hameed, R. et al. Understanding sources of inefficiency in general-purpose chips. International Symposium on Computer Architecture (2010).
22. Hoskote, Y. et al. A TCP offload accelerator for 10Gb/s Ethernet in 90-nm CMOS. IEEE Journal of Solid-State Circuits 38, 11 (Nov. 2003).
23. International Technology Roadmap for Semiconductors, 2009; http://www.itrs.net/Links/2009ITRS/Home2009.htm
24. Karamcheti, V. et al. Comparison of architectural support for messaging in the TMC CM-5 and Cray T3D. International Symposium on Computer Architecture (1995).
25. Kaul, H. et al. A 320mV 56μW 411GOPS/Watt ultra-low-voltage motion-estimation accelerator in 65nm CMOS. IEEE Journal of Solid-State Circuits 44, 1 (Jan. 2009).
26. The Khronos Group. OpenCL, the Open Standard for Heterogeneous Parallel Programming, Feb. 2009; http://www.khronos.org/opencl/
27. Kogge, P. et al. Exascale Computing Study: Technology Challenges in Achieving an Exascale System; http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/exascale_final_report_100208.pdf
28. Mazor, S. The history of the microcomputer: Invention and evolution. Proceedings of the IEEE 83, 12 (Dec. 1995).
29. Noguchi, K., Ohnishi, I., and Morita, H. Design considerations for a heterogeneous tightly coupled multiprocessor system. AFIPS National Computer Conference (1975).
30. Nvidia Corp. CUDA Programming Guide Version 2.0, June 2008; http://www.nvidia.com/object/cuda_home_new.html
31. Pfister, G. et al. The research parallel processor prototype (RP3): Introduction and architecture. International Conference on Parallel Processing (Aug. 1985).
32. Pollack, F. Pollack's Rule of Thumb for Microprocessor Performance and Area; http://en.wikipedia.org/wiki/Pollack's_Rule
33. Przybylski, S.A. et al. Characteristics of performance-optimal multi-level cache hierarchies. International Symposium on Computer Architecture (June 1989).
34. Richter, J. The CLR Via C#, Second Edition, 1997.
35. Ruby Documentation Project. Programming Ruby: The Pragmatic Programmer's Guide; http://www.ruby-doc.org/docs/ProgrammingRuby/
36. Seiler, L. et al. Larrabee: Many-core x86 architecture for visual computing. ACM Transactions on Graphics 27, 3 (Aug. 2008).
37. Strecker, W. Transient behavior of cache memories. ACM Transactions on Computer Systems 1, 4 (Nov. 1983).
38. Sarkar, V. et al. Exascale Software Study: Software Challenges in Extreme-Scale Systems; http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/ECSS%20report%20101909.pdf
39. Tartar, J. Multiprocessor hardware: An architectural overview. ACM Annual Conference (1980).
40. Weingold, E. et al. Baring it all to software: Raw machines. IEEE Computer 30, 9 (Sept. 1997).
41. Wulf, W. and Bell, C.G. C.mmp: A multi-miniprocessor. AFIPS Joint Computer Conferences (Dec. 1972).

Shekhar Borkar (Shekhar.Y.Borkar@intel.com) is an Intel Fellow and director of exascale technology at Intel Corporation, Hillsboro, OR.

Andrew A. Chien (Andrew.Chien@alum.mit.edu) is former vice president of research at Intel Corporation and currently adjunct professor in the Computer Science and Engineering Department at the University of California, San Diego.

© 2011 ACM 0001-0782/11/05 $10.00