PCI Express
Adam H. Wilen
Justin P. Schade
Ron Thornburg
Contents

Acknowledgements

Chapter 1  Introduction
    A Quick Overview
    Why Develop a New I/O Technology?
    Who Should Read This Book
        System Integrators
        Hardware Developers
        Software Developers
        Product Marketing Engineers
        Application Engineers
    The Organization of This Book
        Beyond PCI
        The Technology
        Adopting PCI Express

Chapter 2
    PCI Challenges
        Bandwidth Limitations
        Host Pin Limitations
        Inability to Support Real-Time (Isochronous) Data Transfers
        Inability to Address Future I/O Requirements
    PCI Moving Forward

Chapter 3  Goals and Requirements

Chapter 4

Chapter 5

Chapter 6

Chapter 7

Chapter 8

Chapter 9
    Flow Control

Glossary

Index
Chapter 1

Introduction

Every great advance in science has issued from a new audacity of imagination.
—John Dewey
A key similarity between PCI Express and PCI/PCI-X is that many software and configuration space models are preserved among the three technologies.
Aside from the opportunity of introducing a brand new general input/output (I/O) architecture, there are several motivations for writing this book. One of the primary motivations is to give the reader an easy-to-follow, introductory technological overview of PCI Express. This book is not a replacement for reading the PCI Express Specification. In the authors' opinion, this book makes the PCI Express Specification easier to comprehend by giving it a context with extra background and insights into many areas of the technology. The second motivation is to prepare the industry for a transition to PCI Express architecture by discussing system-level impact, application-specific transitions, and the general timeline for consumer market introduction.
A Quick Overview
PCI Express is a high-performance interconnect that gives more for less, meaning more bandwidth with fewer pins. PCI Express is designed to leverage the strengths of yesterday's general I/O architectures while addressing immediate and future I/O architectural and mechanical issues with current technologies, such as bandwidth constraints, protocol limitations, and pin count. More technically speaking, PCI Express is a high-speed, low-voltage, differential serial pathway for two devices to communicate with each other. PCI Express uses a protocol that allows devices to communicate simultaneously by implementing dual unidirectional paths between two devices, as shown in Figure 1.1.
Figure 1.1  Dual unidirectional paths between Device A and Device B
These properties in and of themselves are not as exciting without giving some merit to the technology, or in other words, without showing
the numbers. For PCI Express, high speed equates to a bit rate of
2,500,000,000 bits per second (2.5 gigabits per second). Low voltage
Figure 1.2  Scaled link widths: x1 and x2 connections between Device A and Device B
bus frequencies of 250 megahertz and higher are plagued with electrical
challenges and a limited set of solutions. Advancing the bus frequency
beyond 500 megahertz will require massive efforts and yield less than
friendly results, if those results are useable at all. There is no question
that something beyond PCI is required going forward. This is the opportunity to go beyond the stopgap approach of trying to squeeze more life
out of PCI by simply bumping up the frequency. This is the chance to
make changes that will carry general I/O architecture comfortably into
the next decade.
PCI is based on a protocol that is nearly a decade old. As usage models change, the protocol must adapt to deal with new models. The aggressive multimedia nature of today's applications, such as streaming audio and video, requires the ability to guarantee certain amounts of bandwidth. The PCI protocol does not have the ability to deal appropriately with these types of deterministic transactions. There is a need to define an architecture that is equipped to deal with these multimedia usage models.
System Integrators
System integrators will benefit from the system level information that is
discussed in many sections of this book. Of particular interest is the impact to the current infrastructure, cost structure, flexibility, applications,
and technology timelines. System integrators can take advantage of this
information for developing strategic short-range and long-range goals for
incorporating PCI Express into current and future designs.
Silicon Designers
Silicon designers can use the information in this book to assist in interpreting the PCI Express Specification. The technology sections of this book are written to address and answer many of the "why" questions that the specification does not clearly articulate. This book can be used to bring silicon designers up to speed quickly on key PCI Express concepts before and while reading through the PCI Express Specification.
Software Engineers
Software engineers can use the information in this book to understand
what must be done by BIOS code and drivers to take advantage of PCI
Express features. Software engineers should focus on which features can
take advantage of the existing PCI configuration model and which features cannot. The information in the technology section of the book is
helpful in outlining the general flow of software routines in setting up
the advanced features.
Application Engineers
Applications engineers will find the entire book useful to support and
drive their customer base through the transition to this new technology.
As with silicon designers, application engineers can use this book to provide insight and additional understanding to many key areas of the PCI
Express Specification.
Beyond PCI
The first section of this book sets the stage for understanding the motivations, goals and applications of PCI Express. As a baseline, a brief history
of PCI is explored in Chapter 2, as are the successes and challenges PCI
has encountered during its lifetime. The successes of PCI are discussed as
a foundation for PCI Express, while the challenges PCI faces are disclosed
to reveal areas that need to be addressed by a next generation technology.
Chapter 3 includes an investigation of the goals and requirements of
PCI Express. This section explores the metrics and criteria for PCI Express adoption, with a focus on preserving key commonalities such as infrastructure, manufacturing, multi-segment support and cost. In addition
to this, many new capabilities are discussed.
The first section ends with a discussion of next-generation applications, looking closely at the applications for which PCI Express offers significant benefits beyond existing I/O architectures. This discussion takes into account the various segments such as desktop, mobile, server, and communications. New and revolutionary usage models are also discussed as a natural solution to evolving system and I/O requirements.
The Technology
The second section of this book is the general hardware and software architecture of PCI Express. This section examines what it means to have a
layered architecture, how those layers interact with each other, with
software, and with the outside world. This section also introduces and
explains the advantages of the PCI Express transaction flow control
mechanisms, closing with a look at PCI Express power management.
Chapter 5 introduces PCI Express as an advanced layered architecture. It includes an introduction to the three key PCI Express architectural layers and their interaction with each other, with software, and with
the outside world. A top-down description follows where the uppermost
layer (the Transaction Layer), which interacts directly with software, is
discussed first, followed by the intermediate (Data Link Layer) and final
layer (Physical Layer). Chapters 6, 7, and 8 examine each of the three architectural layers in detail. Following the discussion of the individual PCI
Express layers is a discussion in Chapter 9 on the various transaction flow
control mechanisms within PCI Express. This section describes the ordering requirements for the various PCI Express transaction types. The
bulk of this section, however, focuses on the newer flow control policies
that PCI Express utilizes such as virtual channels, traffic classes and flow
control credits.
Chapter 10 presents insights into PCI Express software architecture.
This section focuses on identifying the PCI Express features available in a
legacy software environment. It also includes a discussion of software
configuration stacks and also broader (device driver model, auxiliary)
software stacks for control of advanced features.
Chapter 11 concludes this section with a discussion on PCI Express
power management. This chapter discusses the existing PCI power management model as a base for PCI Express power management. This base
is used to discuss PCI Express system-level, device-level and bus/link
power management states.
early and late adopters will be examined. This allows companies to assess
what type of adopter they are and what opportunities, challenges, and
tools are available to them.
Chapter 14 closes the book with a case study of several different PCI
Express-based products. The development phase of each product is examined along with a discussion of some of the challenges of implementing PCI Express technology.
Chapter 2
was preparing to market. PCI was viewed as the vehicle that would fully
exploit the processing capabilities of the new Pentium line of processors.
Table 2.1

I/O Bus                                      MHz    Bus Width (bits)    Bandwidth (MB/s)
Industry Standard Architecture (ISA)         8.3    16                  8.3
Extended ISA (EISA)                          8.3    32                  33
Peripheral Component Interconnect (PCI)      33     32                  132
PCI Successes
As fast as technology has evolved and advanced over the last ten years, it
is amazing how long PCI has remained a viable piece of the computing
platform. The original architects of PCI had no idea that this architecture
would still be integral to the computing platform ten years later. PCI has
survived and thrived as long as it has because of the successes it has enjoyed. The most noted success of PCI is the wide industry and segment
acceptance achieved through the promotion and evolution of the technology. This is followed by general compatibility as defined by the PCI
specification. Combine the above with processor architecture independence, full-bus mastering, Plug and Play operation, and high performance
low cost implementation, and you have a recipe for success.
Industry Acceptance
Few technologies have influenced general PC architecture as has PCI.
The way in which this influence can be gauged is by analyzing segment
acceptance and technology lifespan. PCI has forged its way into the three
computing segments (desktop, server, and mobile), as well as communications, and has become the I/O standard for the last ten years. The primary force that has made this possible is the PCI-SIG. The PCI-SIG placed
ownership of PCI in the hands of its member companies. These member
companies banded together to drive standardization of I/O into the market through the promotion of PCI. A list of current PCI-SIG members can
be found on the PCI-SIG web site at http://www.pcisig.com.
PCI is sufficiently flexible in two key ways, which is why member companies banded together under the PCI-SIG. The first is that PCI is processor-agnostic (in both its frequency and its voltage). This allows PCI
to function in the server market, mobile market, and desktop market
with little to no change. Each of these markets supports multiple processors that operate at different voltages and frequencies. This allows members to standardize their I/O across multiple product groups and
generations. The net effect to the vendor is lower system cost through
the use of common elements that can be secured at lower pricing
through higher volume contracts. For example, a system integrator can
use the same PCI-based networking card in all of their product lines for three to four generations. Along the same line of thought, multiple segments can use the same I/O product, which invokes the economic concept of reduced pricing through economies of scale.
The second way that PCI is flexible is in its ability to support multiple
form factors. The PCI-SIG members defined connectors, add-in cards, and
I/O brackets to standardize the I/O back panel and form factors for the
server and desktop market. The standardization of add-in cards, I/O
brackets and form factors in particular has had a massive impact on the cost structure of PCI, not only from a system integrator's perspective but from a consumer perspective as well. This standardization made the distribution of PCI-based add-in cards and form-factor-based computer chassis
possible through the consumer channel. For a product to be successful in
the computer consumer market it must be standardized in order to sustain sufficient volumes to meet general consumer price targets.
Defined Specifications
PCI add-in cards and discrete silicon are available from hundreds of different vendors fulfilling just about every conceivable I/O application.
Consumers can choose from over thirty brands of PCI add-in modem
cards alone ranging from several dollars to several hundred dollars in
cost. These PCI add-in solutions can function in systems that feature host
silicon from multiple vendors like Intel and others.
Figure 2.1  Pre-PCI system architecture: CPU, memory controller, graphics controller, and expansion bus controller on an ISA/EISA expansion bus
Figure 2.2  PCI-based system architecture: a bridge/memory controller connects the CPU to the PCI bus, which hosts the graphics controller, IDE, PCI cards, and an expansion bus controller (ISA, EISA)
of waiting for the host bridge to service the device. The net effect to the
system is a reduction of overall latency in servicing I/O transactions.
architecture, system vendors could get away with building fewer motherboard
variations. A single variation could support different features depending
on the add-in device socketed in the PCI slot. Many of the details around
high performance and low cost are addressed in much more detail in
Chapter 3.
PCI Challenges
Equal to the successes that PCI has enjoyed are the challenges that PCI
now faces. These challenges pave the way for defining a new architecture. The challenges that PCI faces are in essence areas where PCI has
become inadequate. The key inadequacies are bandwidth limitations, host pin limitations, the inability to support real-time (isochronous) data transfers, and the inability to address future I/O requirements.
Bandwidth Limitations
PCI transfers data at a frequency of 33 megahertz across either a 32-bit or 64-bit bus. This results in a theoretical bandwidth of 132 megabytes per second (MB/s) for a 32-bit bus and a theoretical bandwidth of 264 megabytes per second for a 64-bit bus. In 1995 the PCI Specification added support for 66 megahertz PCI, which is backward-compatible with 33 megahertz PCI in a 32-bit or 64-bit bus configuration. However, the server market has been the only market to make use of 66 megahertz PCI and 64-bit PCI, as shown in Table 2.2. This is probably because 64-bit PCI requires so much space on the platform due to the connector size and signal routing space. The server market is much less sensitive to physical space constraints than the desktop and mobile markets.
Table 2.2

Bus Width (bits)    Bus Frequency (MHz)    Bandwidth (MB/s)    Market
32                  33                     132                 Desktop/Mobile
32                  66                     264                 Server
64                  33                     264                 Server
64                  66                     528                 Server
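The theoretical figures in Table 2.2 follow directly from bus width multiplied by clock frequency. A small sketch of the arithmetic (an illustrative helper, not part of the specification; the nominal 33 and 66 megahertz clocks are used, which the published figures round):

```python
# Theoretical PCI bandwidth: bus width (bits) x clock (MHz) / 8 bits per byte.

def pci_bandwidth_mb_s(bus_width_bits: int, clock_mhz: int) -> int:
    """Theoretical bandwidth in megabytes per second."""
    return bus_width_bits * clock_mhz // 8

# The four configurations from Table 2.2.
for width, mhz in [(32, 33), (32, 66), (64, 33), (64, 66)]:
    print(f"{width}-bit @ {mhz} MHz -> {pci_bandwidth_mb_s(width, mhz)} MB/s")
```

Running this reproduces the 132, 264, 264, and 528 megabyte-per-second figures.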
momentum in the server market for PCI. The desktop and mobile markets continue to use only 33 megahertz PCI in the 32-bit bus flavor. In
light of this, when PCI is mentioned in this book it is in reference to 33
megahertz PCI, which is used exclusively in the mobile and desktop systems that account for over 85 percent of the total computer market.
The actual bandwidth of the PCI bus is much less than the theoretical bandwidth (approximately 90 megabytes per second) due to protocol overhead and general bus topology issues, such as shared bandwidth, that are discussed in more detail in Chapter 3. Since PCI is a shared bus, the available bandwidth decreases as the number of users increases. When PCI was introduced, 90 megabytes per second was more than adequate for the I/O usage models and applications that had been defined. Today's I/O usage models and applications have grown to require far more bandwidth than can be supplied by PCI (take Gigabit Ethernet, for example, which requires 125 megabytes per second). While PCI has been improved over the years (the current PCI Specification is version 3.0), the bandwidth of PCI has only been increased once. Comparatively, processor frequencies have increased dramatically. Ten years ago 66 megahertz was a pretty fast processor speed, but today's processor speeds are two orders of magnitude larger, already passing the 3000 megahertz (or 3 gigahertz) mark, as shown in Figure 2.3. PCI bandwidth hardly dents the I/O processing capability of today's processors.
Figure 2.3  Processor frequencies in megahertz, 1993–2002
can neither bear the cost nor the system space to add additional host devices to support multiple PCI-X segments. This is one of the primary reasons that PCI-X is not used in desktop or mobile systems. Desktop and
mobile systems are constrained to use PCI to maintain upgradeability
through available connectors.
Figure 2.4  Topside view of an I/O host silicon device, with interfaces including LAN, USB, IDE, PCI-X, and others
Mini-PCI and PCI-X. Both of these technologies are based on PCI and use
a subset of the same signal protocol, electrical definitions, and configuration definitions as PCI. Mini-PCI defines an alternate implementation for
small form factor PCI cards. PCI-X was designed with the goal of increasing the overall clock speed of the bus and improving bus efficiencies
while maintaining backward compatibility with conventional PCI devices. PCI-X is used exclusively in server-based systems that require extra
bandwidth and can tolerate PCI-X bus width, connector lengths, and card
lengths.
The PCI-SIG will continue to work on enhancements to the existing base of standards (like 533 megahertz PCI-X). However, the future of PCI is PCI Express. PCI Express is not the next stretch of PCI architecture, but rather an architectural leap that keeps the core of PCI's software infrastructure to minimize the delays to adoption that were experienced with PCI. PCI Express completely replaces the hardware infrastructure with a radically new, forward-looking architecture. The goal of this leap is to hit the technology sweet spot like PCI did nearly a decade ago.
Chapter 3

Goals and Requirements

This chapter explores key goals and requirements for smooth migration of systems and designs to PCI Express architecture. A basic assumption is that PCI Express must be stable and scalable for the next ten years. This chapter discusses the need for PCI Express scalability and pin and link usage efficiency. Another primary requirement is multiple segment support, focusing on the three key computing segments: desktop PCs, servers, and mobile PCs. Following the discussion on multiple segment support, the chapter explores system-level cost parity with PCI as a key requirement for technology migration. Afterwards, I/O simplification goals are explored with an emphasis on consolidation of general I/O. Finally, the chapter investigates backward compatibility as it relates to current software environments and form factors.
the center of the digital world. Over a period of a few years the concept
of I/O has changed dramatically. The usage models for general I/O have
grown to include not only streaming audio and video, but the entire array
of digital devices such as PDAs, MP3 players, cameras, and more. At its first introduction, PCI Express will have sufficient bandwidth to support the available applications. However, future applications and evolving current applications will require PCI Express to be scalable, as seen in Figure 3.1.
Figure 3.1  Evolution of the PC: the productivity personal computer (mid-80's), the multimedia personal computer (late 90's), and the extended personal computer (2000+)
Figure 3.2  Bandwidth in GB/s by link width (x1, x2, x4, x8, x12, x16, x32) for Generation 1 PCI Express (2.5 GHz), possible Generation 2 PCI Express (5 GHz), and possible Generation 3 PCI Express (10 GHz). The next-generation frequencies shown are based upon current speculation.
Pin Efficiency
First consider conventional PCI, which uses a 32-bit wide (4-byte) bidirectional bus. In addition to the 32 data pins, there are 52 sideband, power, and ground pins, for a total of 84 pins.
Equation 3.1    33 megahertz × 4 bytes = 132 megabytes per second

Equation 3.2    132 megabytes per second ÷ 84 pins ≈ 1.6 megabytes per second per pin
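The conventional PCI per-pin figure can be sketched numerically. The 32 data pins and 52 other pins come from the text above; the division itself is an illustrative calculation, not from the specification:

```python
# Per-pin bandwidth of conventional 32-bit PCI.
PCI_DATA_PINS = 32
PCI_OTHER_PINS = 52          # sideband, power, and ground pins
PCI_BANDWIDTH_MB_S = 132     # 33 MHz x 4 bytes (theoretical)

total_pins = PCI_DATA_PINS + PCI_OTHER_PINS   # 84 pins in total
per_pin = PCI_BANDWIDTH_MB_S / total_pins     # bandwidth amortized per pin
print(f"{per_pin:.2f} MB/s per pin")
```

The result, roughly 1.6 megabytes per second per pin, is the baseline against which PCI Express pin efficiency is compared below.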
Figure 3.3  A shared bidirectional bus between Device A and Device B (conventional PCI) compared with dual unidirectional links between Device A and Device B (PCI Express)
The calculation of PCI Express bandwidth per pin is a little bit different from the comparison calculations above. For starters, PCI Express does not share a single link. Instead it has two unidirectional links that operate independently of one another. In addition to this, PCI Express encodes every byte of data it sends out on the link with an additional two bits (referred to as 8-bit/10-bit encoding, which is discussed in detail in Chapter 8). Taking these details into consideration, the following calculations can be made.
2.5 gigabits per second × 1 byte / 10 bits = 250 megabytes per second (TX)

2.5 gigabits per second × 1 byte / 10 bits = 250 megabytes per second (RX)
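The same per-direction calculation, written out as a small illustrative snippet (the constants are the 2.5 gigabit line rate and the 10 encoded bits per byte from the text):

```python
# Effective per-direction PCI Express bandwidth after 8-bit/10-bit encoding.
LINE_RATE_BITS_S = 2_500_000_000   # 2.5 gigabits per second per direction
BITS_PER_BYTE_ENCODED = 10         # 8b/10b: 10 wire bits carry 1 data byte

bytes_per_second = LINE_RATE_BITS_S / BITS_PER_BYTE_ENCODED
print(f"{bytes_per_second / 1_000_000:.0f} MB/s each for TX and RX")
```

Because the two unidirectional links operate independently, a x1 link delivers this 250 megabytes per second in each direction simultaneously.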
Note: In PCI Express, 10 bits are equated to a single byte due to 8-bit/10-bit encoding, which reduces the efficiency of the bus. In a x1 PCI Express link, 4 pins are used for signaling. The remaining 2 pins are used for power and ground. A link that has multiple transmit and receive pairs is actually more pin-efficient with regard to power and ground balls. As link width increases, the bandwidth per pin increases to 100 megabytes per second.
Equation 3.3
Link Efficiency
To foster high link efficiencies, PCI Express uses a split transaction protocol that benefits from the adoption of advanced flow control mechanisms. This allows PCI Express to maximize the bandwidth capability of
the architecture by minimizing the possibility of bottleneck contentions
and link inactivity. This type of efficiency is extremely critical when dealing with a serial architecture such as PCI Express.
Within any architecture there are devices that exhibit latencies that
do not allow a transaction to complete immediately (considered a latent
transaction). This requires mechanisms to be defined to handle these
types of transactions. For example, two devices, A and B, which are interconnected, are defined to have multiple functions. Devices A and B
could represent a real world chip-to-chip connection where Device A is a
Host Bridge and Device B is a PCI-to-PCI bridge to other bus segments
supporting multiple devices. Similarly Devices A and B could represent a
Host/PCI Bridge communicating to a slot-based device that has multiple
functions, as shown in Figure 3.4.
If Device A desires to receive some data from Function 1 on Device
B, it is likely that Device B will require time to obtain the requested information since Function 1 exists outside of Device B. The delay from the
time the request is comprehended in Device B until it can be serviced to
Device A is considered the latency period. If no further transactions are
allowed until the outstanding transaction is finished, the system efficiency is reduced severely. Take the case where Device A also requires
information from Function 2 of Device B. Without an intelligent mechanism for maximizing the efficiency of the system, Device A would have
to wait for Device B to complete the first request before beginning the
next transaction, which will most likely have a latency associated with it,
as shown in Figure 3.4.
Function 1
Device B
Device A
Function 2
Figure 3.4
Conventional PCI defined the concept of a delayed transaction between a master and a target device to address this kind of issue. The concept of master and target is derived from the fact that multiple devices
share the same bus, which is bidirectional in the case of conventional
PCI, as shown earlier in Figure 3.3. This requires devices on the bus to
arbitrate for control of the bus, or in other words, compete to become
the master. Consequently, the arbitration algorithm for bus control must
be fair for all devices. This technique of competing for control of the bus
limits the ability of conventional PCI to support devices that require
guaranteed bandwidth such as the isochronous data transfers that are
discussed in Chapter 3.
To begin a conventional PCI transaction, the master, Device A in this
case, must arbitrate for control of the bus. In this simple example Device
A is arbitrating for control of the bus with only one other device, Device
B. (Conventional PCI systems can have multiple devices competing to
become the bus master at any given time.) Once Device A wins control
of the bus it generates a transaction on the bus in which Device B is the
target. Device B decodes the transaction and determines that Device A is
requesting information stored in Function 1. Since Device B does not
have the requested information readily available, it terminates the transaction with a retry response, which buys extra time to retrieve the data
and complete the original request. Terminating the transaction also allows the bus to become available for other devices to arbitrate for control and complete other transactions. The delayed transaction protocol
requires Device A to again arbitrate for control of the bus, win control,
and send the original request to Device B. This process takes place until
Device B can service the request immediately thereby completing the
transaction.
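The delayed-transaction sequence just described, arbitrate, issue, receive a retry, and re-issue until the target can complete, can be modeled with a toy sketch. The device roles mirror the example above; the class names and the fetch delay are illustrative assumptions, not anything defined by the PCI specification:

```python
# A toy model of the conventional PCI delayed-transaction sequence.

class Target:
    """Device B: needs a few bus cycles before the data is ready."""
    def __init__(self, fetch_cycles: int):
        self.remaining = fetch_cycles
        self.data = "Function 1 data"

    def request(self):
        if self.remaining > 0:
            self.remaining -= 1      # keep fetching in the background
            return ("RETRY", None)   # terminate with retry; the bus is freed
        return ("COMPLETE", self.data)

def master_read(target: Target):
    """Device A: re-arbitrate and retry until the target completes."""
    attempts = 0
    while True:
        attempts += 1                # each loop = arbitrate, win, re-issue
        status, data = target.request()
        if status == "COMPLETE":
            return data, attempts

data, attempts = master_read(Target(fetch_cycles=3))
print(data, "after", attempts, "attempts")
```

Every retry frees the shared bus for other masters, which is the point of the protocol; the cost is that Device A must repeatedly win arbitration and re-send the same request.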
lane highway in both directions. This four-lane highway has a carpool lane that allows carpoolers an easier path to travel during rush-hour traffic congestion. There are also fast lanes for swifter-moving traffic and slow lanes for big trucks and other slow traffic. Drivers can use different lanes in either direction to get to a particular destination. Each driver occupies a lane based upon the type of driver he or she is. Carpoolers take the carpool lane, while fast drivers and slow drivers occupy the fast and slow lanes respectively. The four-lane highway example represents what is referred to in PCI Express as a virtual channel. The link, or connection, formed between two PCI Express devices can support multiple virtual channels regardless of the actual link width.
Virtual channels are exactly what one might suspect: they are virtual wires between two devices. The finite physical link bandwidth is divided up amongst the supported virtual channels as appropriate. Each virtual channel has its own set of queues and buffers, control logic, and a credit-based mechanism to track how full or empty those buffers are on each side of the link. Thinking back to the four-lane highway example, in the real world those four lanes can become congested and blocked as well. The advancement of cars on the four-lane highway is in direct proportion to the amount of road available in front of each vehicle. Some lanes may have more space for traffic to move than others. Likewise, if the receive queues and buffers for a virtual channel on one side of the link or the other are full, then no further transactions can be sent until they are freed up by completing outstanding transactions. Additionally, on the transmit side, if the transmit queues and buffers become full, no further transactions are accepted until they are freed up by completing outstanding transactions. Bottlenecked transactions on one virtual channel do not cause bottlenecks on another virtual channel, since each virtual channel has its own set of queues and buffers.
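The per-channel independence described above can be sketched as a minimal credit-tracking model. The credit counts and packet names are illustrative assumptions; real PCI Express tracks separate credit pools per packet type, which this sketch collapses into one number:

```python
# A minimal sketch of per-virtual-channel credit-based flow control.

class VirtualChannel:
    """Each VC has its own receive buffering, tracked by credits."""
    def __init__(self, credits: int):
        self.credits = credits       # free receive-buffer slots on the far side
        self.in_flight = []

    def send(self, packet) -> bool:
        if self.credits == 0:
            return False             # this VC is blocked; other VCs are not
        self.credits -= 1
        self.in_flight.append(packet)
        return True

    def complete_one(self):
        """The receiver drains a packet and returns a credit."""
        self.in_flight.pop(0)
        self.credits += 1

vc0, vc1 = VirtualChannel(credits=2), VirtualChannel(credits=2)
assert vc0.send("A") and vc0.send("B")
assert not vc0.send("C")     # VC0 has exhausted its credits...
assert vc1.send("X")         # ...but VC1 still flows independently
vc0.complete_one()           # a completed transaction returns a credit
assert vc0.send("C")
```

A full VC0 stalls only its own traffic, exactly as a congested highway lane leaves the other lanes moving.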
System traffic is broken down into classes that are based on device class and negotiation with the operating system. In the traffic example above, the traffic classes would consist of carpoolers, fast drivers, and slow drivers. PCI Express supports up to eight different traffic classes and hence eight different virtual channels. Each traffic class may be mapped to a unique virtual channel; however, this is not a requirement. Unlike drivers on a four-lane highway, who may continually change lanes, once a device is assigned a traffic class it cannot change to another traffic class.
Figure 3.5 illustrates how PCI Express links can support multiple virtual channels. Each virtual channel can support one or multiple traffic classes; however, a single traffic class may not be mapped to multiple virtual channels. Again, recall that virtual channels are in fact virtual. You cannot infer simply because a PCI Express link is defined as a x2 link that there are two virtual channels. A x1 PCI Express link can have as many as eight virtual channels, and a x32 link can have as few as one virtual channel. Additional details of PCI Express flow control are examined in Chapter 9.
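The mapping rules just stated, every traffic class maps to exactly one virtual channel, while one virtual channel may carry several traffic classes, can be captured in a short sketch. The example mapping is illustrative, of the kind shown in Figure 3.5:

```python
# Traffic class (TC0-TC7) to virtual channel mapping, per the rules above.

def validate_tc_vc_map(mapping: dict) -> None:
    """mapping: traffic class -> virtual channel."""
    # All eight TCs must be mapped, and a dict by construction gives each
    # TC exactly one VC. Many-to-one (several TCs sharing a VC) is allowed.
    assert set(mapping) == set(range(8)), "all eight TCs must be mapped"

mapping = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3}
validate_tc_vc_map(mapping)
vcs_in_use = sorted(set(mapping.values()))
print("VCs in use:", vcs_in_use)
```

Here eight traffic classes share only four virtual channels, which is legal; the reverse, one traffic class split across two virtual channels, cannot even be expressed in this representation.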
Figure 3.5  Traffic class to virtual channel mapping. Device A (Root Complex) connects over links to Functions 1 and 2; on each link, traffic classes map to virtual channels, for example TC[0:1] to VC0, TC[2:4] to VC1, TC[5:6] to VC2, and TC7 to VC3.
In addition to PCI Express split transactions and enhanced flow control, there are other differences from PCI and PCI-X to consider that optimize link efficiency. The first main difference is that PCI Express is always a point-to-point solution. The second main difference is the link topology, which is composed of two unidirectional links that operate independently of one another. This allows transactions to flow to and from a device simultaneously. Since the master/target relationship is done away with in PCI Express, transactions are no longer identified by who mastered the transaction. Instead, PCI Express identifies transactions by transaction identifiers, which contain information as to which transaction is being serviced.
PCI Express achieves link efficiency by using a split transaction protocol and adopting a dual unidirectional link topology to allow simultaneous traffic in both directions. In addition to this, PCI Express defines multiple traffic classes and virtual channels to eliminate single-transaction bottlenecks by allowing a variety of different transaction types to become multiplexed and flow across the link. The combination of these techniques gives PCI Express a technical advantage over both conventional PCI and PCI-X. More importantly, this type of link efficiency will be necessary to support the device usage models of the immediate future.
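Matching split-transaction completions by transaction identifier, rather than by bus mastership, can be sketched with a toy completion table. The field names are illustrative, not the specification's exact header layout:

```python
# Toy model: requests tracked and completed by transaction identifier.
import itertools

class Requester:
    def __init__(self):
        self._tags = itertools.count()
        self.outstanding = {}

    def issue_read(self, target: str) -> int:
        tag = next(self._tags)              # the transaction identifier
        self.outstanding[tag] = target      # nothing blocks while we wait
        return tag

    def on_completion(self, tag: int, data: bytes) -> str:
        target = self.outstanding.pop(tag)  # match purely by identifier
        return f"{target} -> {data!r}"

req = Requester()
t0 = req.issue_read("Function 1")
t1 = req.issue_read("Function 2")           # issued before t0 completes
print(req.on_completion(t1, b"\x02"))       # completions may arrive
print(req.on_completion(t0, b"\x01"))       # out of order
```

Because each completion carries its identifier, multiple reads can be outstanding at once and serviced in any order, which is what keeps the link busy during latent transactions.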
Multi-Segment Support
PCI Express must support multiple market segments. An architectural
change this large requires integration into multiple segments to gain the
appropriate momentum for mass acceptance. In addition to this, it has
become apparent that the various market segments are becoming more
unified as time goes on. Mobile and desktop segments have been merging
for years. Many corporate stable systems have now shifted to become
desktop/mobile hybrids that also require manageability features found
primarily in the server market. Leveraging the mass adoption and economy of scale in the computing sectors, PCI has been adopted as a control
mechanism in the communications sector. To this end PCI Express has
been defined to support primary feature requirements of all four segments.
defined as being known to the system before they happen. PCI Express transmit and receive buffers have been designed to withstand sustained shorts to ground of the actual data lines. Additionally, receive buffers remain in high impedance whenever power is not present, to protect the device from circuit damage.
Disc (DVD) technology took many years to become widely adopted. One of the primary reasons it took so long was the fact that DVD players cost far more to manufacture than video cassette players. As a result they were sold for nearly four to five times the cost of a good-quality video cassette player. DVDs have much higher quality than video tapes. However, if the cost is prohibitive for the market in question, adoption will be incredibly slow or will fail altogether. As a consequence, the PCI Express architects formed a set of requirements that use current fabrication technologies in four key areas: printed circuit board fabrication technology, connector manufacturing technology, four-layer routability, and silicon design process technology.
ever, given the fact that PCI Express connectors are much smaller than
conventional PCI connectors, there are some material savings realized
from a manufacturing standpoint that may balance out the cost.
Figure 3.6 [Four-layer board stackup: signal, power, ground, and signal layers separated by glass laminate dielectric.] Four-layer stackups are used primarily in the desktop computer market, which makes up approximately 70 percent of overall computer sales.
Figure 3.7
process or smaller for their devices. Voltage constraints also exist: PCI Express was designed to operate at I/O voltage levels compatible with the 0.25-micron process and future processes. In short, PCI Express has some design flexibility in that it can be built on multiple silicon processes. PCI Express devices manufactured on different processes can still be connected to one another, since PCI Express is voltage-independent through AC coupling on the signal lines. For additional considerations and insights into manufacturing choices see Chapter 14.
I/O Simplification
PCI Express seeks to simplify general I/O by consolidating the I/O strategy for the desktop, mobile, and server segments. I/O consolidation lends validity to the architecture by removing the technological and architectural constraints that have generally separated the segments. PCI Express is defined as an open specification and allows all hardware vendors to adopt the technology without the burden of paying royalties. PCI Express is not currently defined to replace all I/O that exists across the multiple segments. However, it is expected that as time passes, the I/O that was not originally consolidated will become so.
Figure 3.8 [Consolidated PCI-based architecture: a bridge/memory controller with graphics, and a PCI bus hosting audio, LAN, ATA, and expansion slots behind a PCI-to-PCI bridge.]
Within a few years, however, the architectural picture was quite different. In a relatively short period of time the demand for bandwidth increased beyond what conventional PCI could deliver. Since conventional PCI was not designed to be scalable, chipset manufacturers had to explore other options, resulting in the integration of high-bandwidth I/O elements such as graphics and ATA into the memory and I/O controllers respectively. While the integration of high-bandwidth I/O gave some bandwidth relief to the expansion-slot portion of the platform, it made the PCI-based chip-to-chip interconnect bandwidth problems even worse. The natural result of feature integration into the I/O controller was the simultaneous development of proprietary high-bandwidth chip-to-chip solutions. The end result was segmentation of a once consolidated I/O, as illustrated in Figure 3.9.
Figure 3.9 [Segmented architecture: a memory controller core with AGP graphics, connected over a proprietary chip-to-chip link to an I/O controller core integrating SATA, USB, LAN, and AC'97 audio, with IDE/ATA and a PCI-to-PCI bridge feeding the PCI expansion slots. In many systems audio and LAN still exist as expansion cards on the PCI bus.]
PCI Express gives chipset designers the ability to reconsolidate graphics and the chip-to-chip interconnect. Features that still remain on conventional PCI will be transitioned to PCI Express over a period of time, as shown in Figure 3.10. Most importantly, the scalability infrastructure of PCI Express provides a framework for growth based upon scalable clock frequencies and logical signal lines. Conventional PCI has remained an important part of the computing platform for the last ten years despite its lack of scalability. PCI Express is expected to be a key part of the platform for the next decade and beyond.
Figure 3.10 [Reconsolidated architecture: a PCI Express root complex with PCI Express graphics; an I/O controller core integrating SATA, USB, AC'97 audio, and IDE/ATA; a PCI Express switch feeding links for LAN, mobile docking, and PCI Express expansion slots; and a PCI-to-PCI bridge for legacy PCI expansion slots.] Conventional PCI may coexist with PCI Express as an I/O controller feature during the transition phase. The system-level architecture will be PCI Express-based.
Backward Compatibility
Over the last ten years PCI has developed an extensive infrastructure that
ranges from operating system support to chassis form-factor solutions.
This infrastructure is the result of many years of coordinated efforts between hardware vendors and software vendors. The infrastructure established through the adoption of PCI has been the springboard of success
for the personal computing platform. The most significant requirement
for smooth migration of PCI Express architecture is the level of backward
compatibility it has with the existing infrastructure.
the architecture quickly, whereas devices that do not require the benefits of PCI Express (56K PCI modem cards, for example) can make the change slowly.
The coexistence of PCI and PCI Express needs to be clarified. The core architecture of systems should change to PCI Express as rapidly as possible. The coexistence model is defined as a PCI Express core architecture with supporting PCI Express-to-PCI bridges.
Chapter
PCI Express
Applications
This chapter looks more closely at the applications where PCI Express offers significant benefits beyond existing interconnect technologies. PCI Express is a unique technology in that it provides immediate benefits across multiple market segments, from desktop PCs, mobile PCs, and enterprise servers to communications switches and routers. This chapter starts with a brief overview of the key benefits of PCI Express and then covers the applications where PCI Express is a natural solution due to evolving requirements. Finally, this chapter reviews some of the applications where PCI Express provides a new and revolutionary usage model.
Figure 4.1 [Benefits of PCI Express: superior performance; enabling evolving applications; reduced returns; improved service and support; smooth software changes; virtual channels; QoS and isochrony; improved multimedia; "glitchless" audio and video; ease of use.]
High Performance
A key metric of performance is bandwidth, or the amount of data that can be transferred in a given time. True usable bandwidth, typically measured in millions of bytes or megabytes per second (MB/s), is a factor of total theoretical peak bandwidth multiplied by efficiency. For example, recall that PCI is a 32-bit bus running at 33 megahertz, which yields 132 megabytes per second. Although the final PCI specification evolved to a 64-bit bus running at 66 megahertz for a total of 533 megabytes per second, approximately 85 percent of the computing industry continues to use the 33-megahertz version. However, the PCI bus cannot actually transfer data at these rates due to the overhead required for commands as well as the inability to perform reads and writes at the same time. Determining the actual data transfer capability requires an understanding of the bus efficiency. The bus efficiency is determined by several factors such as protocol and design limitations and is beyond the scope of this book.
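The efficiency arithmetic above can be made concrete. The peak figures follow directly from bus width and clock; the 70 percent efficiency used below is purely an assumed value for illustration, since real PCI efficiency varies with workload:

```python
def peak_bandwidth_mb_s(bus_width_bits, clock_mhz):
    """Theoretical peak: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * clock_mhz

pci_32_33 = peak_bandwidth_mb_s(32, 33)   # 132.0 MB/s
pci_64_66 = peak_bandwidth_mb_s(64, 66)   # 528.0 MB/s (quoted as 533
                                          # when the 66.66 MHz clock is used)

# Usable bandwidth = peak x efficiency; 0.70 is an assumed value.
assumed_efficiency = 0.70
usable = pci_32_33 * assumed_efficiency   # roughly 92 MB/s
```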
I/O Simplification
A look inside several computing platforms today illustrates that there is an overabundance of I/O technologies. Today's platforms have PCI-X for servers, CardBus (the PCMCIA expansion slot) on mobile PCs, and PCI for desktop PCs. In addition, several I/O technologies have evolved for application-specific usage models, such as IDE and SCSI for disk drives, USB or IEEE 1394 for PC peripherals, AGP for graphics cards, and proprietary chip-to-chip interconnects such as Intel's Hub Link. Although many of these technologies will continue to coexist moving forward, PCI Express provides a unique interface technology serving multiple market segments. For example, a PC chipset designer may implement an x16 PCI Express configuration for graphics, an x1 configuration for general-purpose I/O, and an x4 configuration as a high-speed chip-to-chip interconnect. Notice that the platform of the future consolidates the design and development effort onto a single PCI Express core, away from three separate and distinct I/O technologies (AGP, PCI, and Hub Link respectively). Refer to Figure 12.2 and Figure 4.13 for examples of what a future desktop PC and server platform could look like.
Layered Architecture
PCI Express establishes a unique divergence from historical PCI evolutions through a layered architecture, improving serviceability and scalability.
Figure 4.2 [The PCI Express layered architecture: Software, Transaction, Data Link, Physical, and Mechanical layers.]
from 2.5 gigabits per second to probably more than 5.0 gigabits per second, only the Physical Layer needs to evolve. The remaining layers can
continue to operate flawlessly, reducing development cost and time for
each incremental evolution of PCI Express.
Figure 4.3 [Isochronous example using PCI Express virtual channels (VCs).]
Ease of Use
PCI Express will revolutionize the way users install upgrades and repair failures. PCI Express natively supports hot plug and hot swap. Hot swap is the ability to swap I/O cards without software interaction, whereas hot plug may require operating system interaction. PCI Express as a hardware specification defines the capability to support both hot swap and hot plug, but hot plug support will depend on the operating system. In the future, systems will not need to be powered down to replace faulty equipment or install upgrades. In conjunction with the PCMCIA and PCI SIG industry groups, which are defining standard plug-in modules for mobile PCs, desktop PCs, and servers, PCI Express enables systems that are easier to configure and use.
For example, compare the events following the failure of a PCI
Ethernet controller in the office today with what the future could look
like. If a PCI card fails today, a technician is dispatched to the location of
the PC. The PC must be powered down, opened up, and the card must
be physically removed. Opening the PC chassis can be cumbersome, as
screws need to be removed, cables disconnected and pushed out of the
way, and the card unseated from the motherboard. Once the faulty card
is replaced with an identical unit, the system is then reassembled, reconnected, and powered back on. Hopefully all goes well and the PC is up
and running after a short two-hour delay. In the future, PCI Express
modules could be plugged into the external slot on the PC without powering down, disassembling, and disconnecting the PC. Refer to Figure 4.4
for a picture of the modules. In the same scenario, the technician arrives
with a new module, swaps the bad module for the good one, and the user is off and running in less than ten minutes. In addition to serviceability, PCI Express provides the interconnect capability to perform upgrades without powering down the system.
Although the modules are still under definition and expected to be finalized in 2003, the proposals currently being discussed highlight the benefits of easy-to-use modules. See Figure 14.4. The PC on the left is based on today's capabilities. In order to install an upgrade, the user must open the box and navigate through the cables and connectors. The PC on the right has the ability to accept installations either in the front or back of the system. PC add-in cards are expected to continue to be supported within the box for standard OEM configurations, but PCI Express provides a revolutionary and easier method of installing upgrades versus internal slots. The small module also enables OEMs to provide expansion capability on extremely small form-factor PCs where the PCI Express connectors consume valuable space.
Evolutionary Applications
The following sections within this chapter review both evolutionary and revolutionary applications for PCI Express. To some extent or another, all the following applications are expected to leverage one or more of the five main benefits: high performance, I/O simplification, layered architecture, next-generation multimedia, and ease of use.
PCI Express provides a new architecture for the next decade. Over
the years, the processor and memory system have scaled in frequency
and bandwidth, continually improving the overall system performance.
Additionally, platform I/O requirements continue to demand increasing
bandwidth, bringing about the creation of several high-speed busses in
addition to the general purpose PCI bus within the typical mobile or
desktop PC, as shown in Figure 4.5.
Over the next few years, the I/O bandwidth requirements will likely
continue to outpace the existing PCI capabilities. This section reviews
applications benefiting immediately from the bandwidth, scalability, and
reliability improvements highlighting where PCI Express is a natural solution due to evolving requirements. The actual applications covered include PC graphics, gigabit Ethernet, high-speed chip interconnects, and
general purpose I/O.
Figure 4.5 Typical PC Architecture [CPU on the CPU bus to a memory controller with AGP graphics and main memory; a chip-to-chip link to an I/O host controller with IDE ATA drives, USB, AC'97 audio, and PCI expansion slots.]
Graphics Evolution
Initial graphics devices were based on the ISA (Industry Standard Architecture) system bus in the early 1980s with a text-only display. The ISA bus provided a 16-bit bus operating at 8.33 megahertz for a total theoretical bandwidth of approximately 16 megabytes per second. As the CPU and main memory continued to improve in performance, the graphics interconnect also scaled to match system performance and improve the overall end-user experience. The early 1990s saw the introduction of the PCI architecture, providing a 32-bit bus operating at 33 megahertz for a total bandwidth of approximately 132 megabytes per second, as well as the evolution to two-dimensional rendering of objects, improving the user's visual experience. Although the PCI interface added support for a faster 64-bit interface on a 66-megahertz clock, the graphics interface evolved to AGP implementations. From the mid-1990s to the early years of the following decade, the AGP interface evolved from the 1x mode eventually to the 8x mode. Today the AGP 8x mode operates on a 32-bit bus with a 66-megahertz clock that is sampled eight times in a given clock period for a total bandwidth of approximately 2,100 megabytes per second. Additional enhancements such as three-dimensional rendering also evolved to improve the overall experience as well as drive the demand for continued bandwidth improvements. See Figure 4.6 for the bandwidth evolution of the graphics interconnect. The graphics interconnect has continued to double in bandwidth to take full advantage of increased computing capability between main memory, the CPU, and the graphics
Figure 4.6 [Graphics interconnect bandwidth evolution, in MB/s: ISA, 16 (1985); PCI, 133 (1993); AGP, 266 (1997); AGP2x, 533 (1998); AGP4x, 1,066 (1999); AGP8x, 2,133 (2002).]
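The generation-by-generation numbers follow from bus width, clock, and data transfers per clock period; using the nominal 33.33 and 66.66 megahertz clocks reproduces the commonly quoted values:

```python
def bandwidth_mb_s(width_bits, clock_mhz, transfers_per_clock=1):
    """Bus width times clock times data transfers per clock period."""
    return width_bits / 8 * clock_mhz * transfers_per_clock

generations = {
    "ISA":   bandwidth_mb_s(16, 8.33),       # ~16 MB/s
    "PCI":   bandwidth_mb_s(32, 33.33),      # ~133 MB/s
    "AGP":   bandwidth_mb_s(32, 66.66),      # ~266 MB/s
    "AGP2x": bandwidth_mb_s(32, 66.66, 2),   # ~533 MB/s
    "AGP4x": bandwidth_mb_s(32, 66.66, 4),   # ~1,066 MB/s
    "AGP8x": bandwidth_mb_s(32, 66.66, 8),   # ~2,133 MB/s
}
```

Each AGP step doubles only the transfers per clock; the width and clock stay fixed, which is why the curve doubles from generation to generation.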
Ethernet Evolution
Ethernet has continually demonstrated resilience and flexibility in evolving to meet increasing networking demands. Ethernet first hit the market in the early 1980s. As the PC market grew, so did the requirement for computers and users to share data. By 1990, a 10-megabit-per-second networking technology across standard UTP (unshielded twisted pair) wiring was approved as the IEEE 10BASE-T standard, and the next year Ethernet sales nearly doubled (Riley, Switched, Fast, and Gigabit Ethernet, p. 15). Networking requirements quickly demanded the evolution to Fast Ethernet and the IEEE 100BASE-T standard capable of 100 megabits per second, published in 1994. Fast Ethernet enjoyed rapid adoption as network interface card suppliers offered both Fast Ethernet (the 100-megabit-per-second standard) and Ethernet (the 10-megabit-per-second standard) capability on the same card, providing backward compatibility; such cards are commonly referred to as 10/100Mbps-capable. In fact, almost all Fast Ethernet network interface cards (NICs) are 10/100Mbps-capable and represented a commanding 70 percent of the market only four years after introduction, as shown in Figure 4.7.
Figure 4.7 [1999 NIC shipments, in thousands, with market share: 45,594 (70.77%); 13,703 (21.27%); 3,474 (5.39%); 1,522 (2.36%); 135 (0.21%); total 64,428.]
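The market-share percentages in the figure follow directly from the shipment counts, which can be checked:

```python
shipments = [45_594, 13_703, 3_474, 1_522, 135]   # thousands of NICs, 1999
total = sum(shipments)                            # 64,428
shares = [round(100 * s / total, 2) for s in shipments]
# The 70.77-percent entry is the 10/100Mbps Fast Ethernet
# category discussed in the text.
```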
Gigabit Ethernet, and specifically the IEEE 1000BASE-T standard capable of 1,000 megabits per second across UTP cabling, is the next natural networking technology evolution, driven by unprecedented Internet growth and increasing data transfer requirements within the corporate environment. The 1000BASE-T standard was published in 1998, and the first major physical layer transceiver (PHY) interoperability testing followed in 1999.
Gigabit Ethernet will most likely experience the same adoption success as Fast Ethernet due to its inherent backward compatibility, low cost structure, and increasing requirements for faster connection rates. Like Fast Ethernet, Gigabit Ethernet has the ability to reuse the existing cable within the building as well as the ability to auto-negotiate to a common speed. Auto-negotiation allows each link partner to advertise its highest possible connection rate. Most network connections support the lower speed capabilities. For example, a Gigabit Ethernet (10/100/1000 megabits per second) switch connection in the wiring closet will be able to operate with a slower Fast Ethernet (10/100 megabits per second) desktop connection at the lowest common denominator of 100 megabits per second. Backward compatibility expedites adoption by removing the requirement for new cable installations or the need to replace all of the existing network equipment.
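Auto-negotiation amounts to each link partner advertising its supported rates and both ends settling on the fastest rate they share. A toy model of the switch-to-desktop example (rates in megabits per second; the function is an illustration, not the IEEE mechanism):

```python
def negotiate(rates_a, rates_b):
    """Return the highest rate advertised by both link partners."""
    common = set(rates_a) & set(rates_b)
    return max(common) if common else None

gigabit_switch = [10, 100, 1000]   # 10/100/1000 Mb/s closet switch port
fast_desktop = [10, 100]           # 10/100 Mb/s desktop NIC
link_speed = negotiate(gigabit_switch, fast_desktop)   # 100
```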
The remaining factor driving Gigabit Ethernet success is the evolving
need for additional bandwidth. According to Gartner Group research,
PCs connected to the Internet are expected to double from 311 million
in 2000 to more than 600 million in 2004, as shown in Figure 4.8.
Figure 4.8 [PCs connected to the Internet, in millions, by year: 164, 241, 311, 391, 473, 557, 638.]
Immediate Benefits
Gigabit Ethernet requires a high-performance interface and is well suited to PCI Express. PCI Express provides an immediate boost in performance due to the dedicated bandwidth per link, a direct increase in the usable bandwidth, and the ability to perform concurrent cycles. PCI Express provides a dedicated link with 100 percent of the bandwidth on each port independent of the system configuration, unlike today's shared PCI bus. An immediate performance gain is realized with PCI Express due to the increase in available bandwidth. PCI provides a total bandwidth of 132 megabytes per second, whereas PCI Express operates on a 2.5-gigabit-per-second encoded link providing 250 megabytes per second. PCI Express additionally supports concurrent data transmission for a maximum concurrent data transfer of 500 megabytes per second. Devices are able to transmit 250 megabytes per second of data during a write operation while simultaneously receiving 250 megabytes per second of read data, thanks to the separate differential pairs. PCI, on the other hand, is only capable of performing one read or write operation at any given time. PCI Express is the obvious next connection to deliver Gigabit Ethernet speeds.
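These numbers follow from the encoding overhead, as a quick check shows:

```python
signaling_gbit_s = 2.5        # per lane, per direction
bits_per_data_byte = 10       # 8-bit/10-bit encoding

per_direction_mb_s = signaling_gbit_s * 1000 / bits_per_data_byte  # 250.0
concurrent_mb_s = 2 * per_direction_mb_s                           # 500.0
```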
Figure 4.9 [Gigabit Ethernet deployment: end users connect through a building switch and network switch, with an inter-campus link, firewall, and server connecting to the Internet.]
Desktop and mobile PCs will implement a x1 PCI Express configuration as the Gigabit Ethernet connection into the network. The local switch and building switch could use PCI Express in either the data path or the control plane (refer to the Communications Applications and Advanced Switching section at the end of this chapter). The server connection supplies the data into the network and will also be a PCI Express-based Gigabit Ethernet connection.
Interconnect History
In 1998, Intel announced the 400-megahertz Pentium II processor, designed to operate along with the Intel 440BX chipset. The 440BX used the standard PCI bus to connect the memory and processor controller to the PIIX4E I/O controller, as shown in Figure 4.10. Prior to the evolution of
bandwidth-intensive I/O devices, the PCI bus provided sufficient bandwidth and performance at 132 megabytes per second. Adding up the ma-
Figure 4.10 PC Architecture 1998 [400MHz CPU on the CPU bus to the 440BX: AGP2x graphics (533 MB/s), PC100 SDRAM main memory (800 MB/s), and PCI (133 MB/s) to the PIIX4E controller with DMA 66 IDE (66 MB/s, 4 drives), USB 1.1 (2 ports), and ISA; PCI and ISA expansion slots.]
Figure 4.11 PC Architecture 2000 [1,200MHz CPU to the 815E: AGP4x graphics (1,066 MB/s), PC133 SDRAM main memory (1,066 MB/s), and Hub Link to the ICH2 with ATA 100 IDE (100 MB/s, 4 drives), USB 1.1 (4 ports), and PCI (133 MB/s) expansion slots.]
fective method to add additional ports and connectivity. Several applications adopted the USB 1.1 standard due to its availability on virtually all PC platforms and its ease of use for video cameras, connections to PDAs, printer connections, and broadband modem connections. As more applications came on board, the requirement to increase performance materialized. Prior to the release of Hi-Speed USB 2.0, USB bandwidth was significantly smaller than that of other I/O counterparts such as LAN adapters and hard drives. In 2000, the Hi-Speed Universal Serial Bus (USB) 2.0 specification was published to wide acceptance, and in 2002 Intel announced the 845GL chipset supporting Hi-Speed USB for a total bandwidth of 60 megabytes per second, as shown in Figure 4.12. Additional platform evolutions increasing the total I/O bandwidth included increasing the number of hard drives. Four years after the introduction of the 440BX AGP chipset, I/O demand had grown from 132 megabytes per second to more than 1 gigabyte per second.
Figure 4.12 PC Architecture 2002 [3,000MHz CPU to the 845G: AGP4x graphics (1,066 MB/s), PC266 DDR main memory (2,100 MB/s), and Hub Link to the ICH4 with ATA 100 IDE (100 MB/s, 4 drives), USB 2.0 (6 ports), and PCI (133 MB/s) expansion slots.]
Immediate Benefits
The immediate benefits of PCI Express as a high-speed chip interconnect are the bandwidth improvements, scalability, and isochrony. I/O evolution will likely continue, and several technological changes are underway. The Serial ATA specification was published in 2001, paving the road for new disk transfer rates. The first generation of Serial ATA interconnects will be able to support up to 150 megabytes per second, a 50 percent increase over the existing ATA100 connections. Serial ATA has a plan for reaching 600 megabytes per second by the third generation. In addition, Gigabit Ethernet adoption will eventually drive the requirement for additional bandwidth between the host controller and the I/O controller.
History has demonstrated that bandwidth and performance continue to rise as systems increase in computing power. The inherent scalability in both the number of PCI Express lanes and the ability to scale the frequency of an individual PCI Express link provides a robust plan for the next decade. For example, if a high-speed interconnect within a system initially requires 1 gigabyte per second of data, the system manufacturer could implement an x4 PCI Express link. With changes to the Physical Layer only, the system would be able to scale up to 4 gigabytes per second through a modification of the signaling rate, leaving the software and existing connectors intact. Due to these benefits, high-speed chip-to-chip interconnects will likely implement PCI Express or proprietary solutions leveraging the PCI Express technology by 2004.
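The scaling example can be written out directly; the 10 gigabit-per-second rate below is a hypothetical future signaling rate used only to illustrate the point:

```python
def link_bandwidth_gb_s(lanes, gbit_s_per_lane, bits_per_byte=10):
    """Per-direction data bandwidth in gigabytes per second,
    accounting for 8-bit/10-bit encoding overhead."""
    return lanes * gbit_s_per_lane / bits_per_byte

today = link_bandwidth_gb_s(4, 2.5)     # 1.0 GB/s: the x4 example
# Hypothetical future signaling rate; only the Physical Layer changes.
future = link_bandwidth_gb_s(4, 10.0)   # 4.0 GB/s on the same x4 link
```

The lane count, software, and connectors are unchanged between the two calls; only the per-lane rate moves, which is the Physical-Layer-only scaling path described above.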
The other immediate benefit evolves from new capabilities within PCI Express. Specifically, isochrony provides revolutionary capabilities for multimedia. Isochrony is discussed later in this chapter and in Chapter 9.
PCI Express provides an immediate increase in bandwidth while reducing latency (the time it takes to receive data after a request) by reducing the number of bridges between PCI, PCI-X, and proprietary interfaces. PCI Express, unlike current implementations, removes burdensome bridging devices due to its low pin count and dedicated point-to-point link. Bridging solutions and devices evolved due to the short-bus limitation of PCI-X. The short bus segment, or the inability to route PCI-X over large distances because of the parallel bus, requires devices to be close to the bus controller chip. In large PCs and servers with multiple slots, the proximity required to support PCI-X is not feasible; the need for I/O fan-out led to bridging devices and proprietary interfaces. The low pin count and dedicated point-to-point link enable greater I/O connectivity and fan-out in a given silicon and package technology while removing the cost and complexity of additional bridging solutions. See Figure 4.13 for a comparison of a server architecture today versus the future.
Figure 4.13 [Today's server versus the future server: today, processors on a system bus fan out through multiple bridge devices and proprietary I/O interconnects to SCSI, video, and LAN; in the future, PCI Express provides the fan-out from the processor complex to video, LAN, and storage without the bridge devices.]
and consider new interconnects. By 2004, proprietary and open architectures are expected to be built around the PCI Express technology.
Revolutionary Applications
Where the previous section covered the applications that will take advantage of the benefits within PCI Express, enabling a natural evolution, this section reviews the revolutionary applications that PCI Express enables: specifically, isochrony, a unique mechanism to guarantee glitchless media; future modules that improve ease of use; and communications applications with advanced switching.
check their e-mail and their favorite stock price. When they come back to review the MPEG, they notice several dropped frames and an excessive number of glitches in the movie clip. What happened?
This scenario is actually more common than most people think. Essentially, some data types require dedicated bandwidth and a mechanism to guarantee delivery of time-critical data in a deterministic manner. The video stream data needs to be updated on a regular basis between the camera and the application to ensure the time dependencies are met and to prevent the loss of video frames. PCI Express-based isochrony solves this problem by providing an interconnect that can deliver time-sensitive data in a predetermined and deterministic way. In addition to providing the interconnect solution, PCI Express provides a standardized software register set and programming interface, easing the burden on software developers.
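One way to picture a predetermined, deterministic delivery guarantee is admission control: a stream is accepted only if its bandwidth still fits within a reserved budget, so accepted streams never glitch. This is an illustrative model, not the actual PCI Express mechanism, and the budget and stream numbers are invented:

```python
class IsochronousLink:
    """Toy admission control: bandwidth is reserved up front, so an
    accepted stream is always delivered on schedule."""

    def __init__(self, budget_mb_s):
        self.budget = budget_mb_s
        self.reserved = 0

    def admit(self, stream_mb_s):
        if self.reserved + stream_mb_s > self.budget:
            return False          # refuse now rather than glitch later
        self.reserved += stream_mb_s
        return True

link = IsochronousLink(budget_mb_s=200)   # assumed isochronous share
camera_admitted = link.admit(25)          # True: the camera stream fits
```

Refusing a stream up front is what makes the guarantee possible: every admitted stream's bandwidth is already accounted for, so a background file copy cannot steal the camera's time slots.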
[Server module concepts: a full module at approximately 40 watts and a base module at approximately 20 watts.]
cal to enable small and thin notebooks. To accommodate I/O connections such as the RJ45 for LAN, an extended module is under consideration in which any I/O connector larger than 5 millimeters would reside outside the desktop or mobile cavity. Figure 4.15 shows client PCI Express modules for desktops and laptops.
Figure 4.15 Mobile and Desktop Client Modules Under Development [Single wide: 33.7 mm wide, 60 mm long, 5 mm thick. Double wide: 68 mm wide, 60 mm long, 5 mm thick. Extended variants of both are also shown.]
Some of the initial applications to embrace the module are flash media, microdrives, wireless LAN, and broadband modems (cable and DSL routers) because of ease of use. For example, consider the user who orders DSL from the local service provider. Currently, a technician is dispatched to install the unit and must disassemble the PC to install a network connection to an external router via a PCI slot add-in card. The module creates a compelling business scenario for service providers. The service provider could simply ship the module to the user with instructions to connect the phone line to the module and then insert the module into the slot. The user does not need to disassemble or reboot the system. When the card is installed, the operating system detects the presence of the new device and loads the necessary drivers. Because of the native hot-plug and hot-swap capability within PCI Express, desktop and notebook systems will no longer be burdened with the CardBus controller and the additional system costs associated with the current PC Card.
Chapter
PCI Express
Architecture
Overview
This chapter introduces the PCI Express architecture, starting off with a system-level view. This addresses the basics of a point-to-point architecture, the various types of devices, and the methods for information flow through those devices. Next, the chapter drops down one level to further investigate the transaction types, mainly the types of information that can be exchanged and the methods for doing so. Lastly, the chapter drops down one level further to see how a PCI Express device actually goes about building those transactions. PCI Express uses three transaction build layers: the Transaction Layer, the Data Link Layer, and the Physical Layer. These architectural build layers are touched upon in this chapter, with more details in Chapters 6 through 8.
PCI are no longer directly applicable to PCI Express. For example, devices no longer need to arbitrate for the right to be the bus driver prior to sending out a transaction. A PCI Express device is always the driver for its transmission pair(s) and is always the target for its receiver pair(s). Since only one device ever resides at the other end of a PCI Express link, only one device can drive each signal and only one device receives that signal. In Figure 5.1, Device B always drives data out its differential transmission pair (traces 1 and 2) and always receives data on its differential receiver pair (traces 3 and 4). Device A follows the same rules, but its transmitter and receiver pairs are mirrored to Device B: traces 3 and 4 connect to Device A's transmitter pair (TX), while traces 1 and 2 connect to its receiver pair (RX). This is a very important difference from parallel busses such as PCI; the transmit pair of one device must be the receiver pair for the other device. They must be point-to-point, one device to a second device. TX of one is RX of the other, and vice versa.
Figure 5.1 [A x1 link: Device B's transmit pair (T+/T-) drives traces 1 and 2 into Device A's receive pair (R+/R-), while Device A's transmit pair drives traces 3 and 4 into Device B's receive pair.]
Figure 5.2 [A x4 link: four lanes connecting the ports of two devices.]
Please note that the signaling scheme for PCI Express is tremendously simple. Each lane is just a unidirectional transmit pair and receive pair. There are no separate address and data signals, no control signals like the FRAME#, IRDY#, or PME# signals used in PCI, not even a sideband clock sent along with the data. Because of this modularity, the architecture can more easily scale into the future, provide additional bandwidth, and simplify the adoption of new usage models. However, it also requires the adoption of techniques vastly different from traditional PCI.
Embedded Clocking
PCI Express utilizes 8-bit/10-bit encoding to embed the clock within the data stream being transmitted. At initialization, the two devices determine the fastest signaling rate supported by both. The current specification identifies only a single signaling rate, 2.5 gigabits per second (per lane, per direction), so that negotiation is pretty simple. Since the transfer rate is determined ahead of time, the only other function of a clock would be for sampling purposes at the receiver. That is where 8-bit/10-bit encoding with an embedded clock comes into play. By transmitting each byte of data as 10 encoded bits, the encoding increases the number of transitions associated with each transmission character, simplifying the sampling procedures on the receiver side. Chapter 8, Physical Layer Architecture, contains more information on this topic.
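The value of the extra transitions can be illustrated with run lengths. A raw byte of zeros gives the receiver eight bit-times with nothing to lock onto, whereas 8-bit/10-bit code groups never run more than five identical bits; the K28.5 code group below is a standard example, and the helper is just a sketch:

```python
def longest_run(bits):
    """Length of the longest run of identical bits in a string."""
    best = run = 1
    for prev, cur in zip(bits, bits[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

raw_zero_byte = "00000000"       # no transitions for the receiver to track
k28_5_minus = "0011111010"       # K28.5 code group, negative disparity

longest_run(raw_zero_byte)       # 8: a recovered clock could drift badly
longest_run(k28_5_minus)         # 5: 8-bit/10-bit never exceeds five
```

Bounding the run length guarantees the receiver sees a transition often enough to keep its sampling point aligned, which is exactly the embedded-clock function described above.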
Multiple Lanes
You might now be asking yourself, "If the transfer rate is fixed ahead of time at 2.5 gigabits per second per lane per direction, how can this interface scale to meet the needs of high-bandwidth interfaces?" After all, 2.5 gigabits per second per direction is only 250 megabytes per second of actual data that can flow each way (recall that with 8-bit/10-bit encoding, each byte of data is transferred as 10 bits, so you need to divide 2.5 gigabits per second by 10 to get the theoretical data transfer rate). A data transfer rate of 250 megabytes per second per direction might be better than traditional PCI, but it certainly is not in the same league as higher bandwidth interfaces like AGP (AGP4x runs at 1 gigabyte per second and AGP8x runs at 2 gigabytes per second total bandwidth). When you add to this the fact that a parallel bus, like PCI or AGP, is substantially more efficient than a serial interface like PCI Express, the bandwidth of this new interface seems to be at a disadvantage to some existing platform technologies. Well, that is where PCI Express scalability comes into play.
Much like lanes can be added to a highway to increase the total traffic throughput, multiple lanes can be used within a PCI Express link to increase the available bandwidth. In order to make its capabilities clear, a link is named for the number of lanes it has. For example, the link shown in Figure 5.2 is called a x4 (read as: by four) link since it consists of four lanes. A link with only a single lane, as in Figure 5.1, is called a x1 link. As previously noted, the maximum bandwidth of a x1 link is 250 megabytes per second in each direction. Because PCI Express is dual unidirectional, this offers a maximum theoretical bandwidth of 500 megabytes per second between the two devices (250 megabytes per second in both directions). The x4 link shown in Figure 5.2 has a maximum bandwidth of 4 × 250 megabytes per second = 1 gigabyte per second in each direction. Going up to a x16 link provides 16 × 250 megabytes per second = 4 gigabytes per second in each direction. This means that PCI Express can scale to match, and exceed, the bandwidth of existing high-bandwidth interfaces.
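The bandwidth arithmetic above can be captured in a few lines. The sketch below (the constant and function names are ours, written for this discussion, not part of any PCI Express software interface) derives the per-direction and dual-unidirectional figures for x1, x4, and x16 links from the 2.5 gigabit per second signaling rate and the 10-bits-per-byte cost of 8-bit/10-bit encoding.

```python
# Theoretical PCI Express bandwidth from the numbers in the text:
# 2.5 gigabits per second per lane per direction, with 10 bits on the
# wire for every byte of data due to 8-bit/10-bit encoding.

SIGNALING_RATE_GBPS = 2.5       # per lane, per direction
ENCODED_BITS_PER_BYTE = 10      # 8-bit/10-bit encoding

def lane_bandwidth_mbps():
    """Data bandwidth of one lane in one direction, in megabytes per second."""
    return SIGNALING_RATE_GBPS * 1000 / ENCODED_BITS_PER_BYTE

def link_bandwidth_mbps(lanes, both_directions=False):
    """Theoretical bandwidth of a xN link in megabytes per second."""
    bandwidth = lanes * lane_bandwidth_mbps()
    return 2 * bandwidth if both_directions else bandwidth

print(link_bandwidth_mbps(1))        # x1, each direction: 250.0
print(link_bandwidth_mbps(1, True))  # x1, dual unidirectional: 500.0
print(link_bandwidth_mbps(4))        # x4, each direction: 1000.0
print(link_bandwidth_mbps(16))       # x16, each direction: 4000.0
```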
Device Types
The PCI Express specification identifies several types of PCI Express elements: a root complex, a PCI Express-PCI bridge, an endpoint and a
switch. These device elements emulate the PCI configuration model, but
apply it more closely to the variety of potential point-to-point PCI Express
topologies. Figure 5.3 demonstrates how these elements play together
within a PCI Express world.
Figure 5.3 An example PCI Express topology. A root complex connects the CPU and memory to PCI Express links leading to a PCI Express endpoint, a PCI Express to PCI bridge (with a PCI/PCI-X bus behind it), and a switch that fans out to legacy endpoints and PCI Express endpoints.
The root complex is the head or root of the connection of the I/O
system to the CPU and memory. For example, in today's PC chipset system architecture, the (G)MCH (Graphics & Memory Controller Hub) or a combination of the (G)MCH and ICH (I/O
Controller Hub) could be considered the root complex. Each interface off of the root complex defines a separate hierarchy domain. Supporting transactions across hierarchies is not a required
capability of a root complex.
Switches are used to fan out a PCI Express hierarchy. From a PCI
configuration standpoint, they are considered a collection of virtual PCI-to-PCI bridges whose sole purpose is to act as the traffic
director between multiple links. They are responsible for properly forwarding transactions to the appropriate link. Unlike a root
complex, they must always manage peer-to-peer transactions between two downstream devices (downstream meaning the side
further away from the root complex).
A PCI Express to PCI bridge has one PCI Express port and one or
more PCI/PCI-X bus interfaces. This type of element allows
PCI Express to coexist on a platform with existing PCI technologies. This device must fully support all PCI and/or PCI-X transactions on its PCI interface(s). It must also follow a variety of rules
(discussed in later chapters) for properly transforming those PCI
transactions into PCI Express transactions.
Even though PCI Express links are point-to-point, this does not always
mean that one of the devices on the link is the requester and the other
the completer. For example, say that the root complex in Figure 5.3
wants to communicate with a PCI Express endpoint that is downstream
of the switch. The root complex is the requester and the endpoint is the
completer. Even though the switch receives the transaction from the root
complex, it is not considered a completer of that transaction. Even
though the endpoint receives the transaction from the switch, it does not
consider the switch to be the requester of that transaction. The requester
identifies itself within the request packet it sends out, and this informs
the completer (and/or switch) where it should return the completion
packets (if needed).
Transaction Types
The PCI Express architecture defines four transaction types: memory,
I/O, configuration and message. This is similar to the traditional PCI
transactions, with the notable difference being the addition of a message
transaction type.
Memory Transactions
Transactions targeting the memory space transfer data to or from a memory-mapped location. There are several types of memory transactions:
Memory Read Request, Memory Read Completion, and Memory Write
Request. Memory transactions use one of two different address formats,
either 32-bit addressing (short address) or 64-bit addressing (long address).
I/O Transactions
Transactions targeting the I/O space transfer data to or from an I/O-mapped location. PCI Express supports this address space for compatibility with existing devices that utilize this space. There are several types of
I/O transactions: I/O Read Request, I/O Read Completion, I/O Write Request, and I/O Write Completion. I/O transactions use only 32-bit addressing (short address format).
Configuration Transactions
Transactions targeting the configuration space are used for device configuration and setup. These transactions access the configuration registers of PCI Express devices. Compared to traditional PCI, PCI Express
allows for many more configuration registers. For each function of each
device, PCI Express defines a configuration space of 4096 bytes, sixteen times the 256 bytes defined by PCI. There are several types of configuration transactions: Configuration Read Request, Configuration Read Completion, Configuration
Write Request, and Configuration Write Completion.
Message Transactions
PCI Express adds a new transaction type to communicate a variety of
miscellaneous messages between PCI Express devices. Referred to simply
as messages, these transactions are used for things like interrupt signaling, error signaling or power management. This address space is a new
addition for PCI Express and is necessary since these functions are no
longer available via sideband signals such as PME#, IERR#, and so on.
Build Layers
The specification defines three abstract layers that build a PCI Express
transaction, as shown in Figure 5.4. The first layer, logically enough, is referred to as the Transaction Layer. The main responsibility of this layer
is to begin the process of turning requests or completion data from the
device core into a PCI Express transaction. The Data Link Layer is the
second architectural build layer. The main responsibility of this layer is to
ensure that the transactions going back and forth across the link are received properly. The third architectural build layer is called the Physical
Layer. This layer is responsible for the actual transmitting and receiving
of the transaction across the PCI Express link.
Figure 5.4 The three architectural build layers. The Transaction Layer, Data Link Layer, and Physical Layer each have transmit (Tx) and receive (Rx) functions between the device core and the PCI Express link.
Since each PCI Express link is dual unidirectional, each of these architectural layers has transmit as well as receive functions associated
with it. Outgoing PCI Express transactions may proceed from the transmit side of the Transaction Layer to the transmit side of the Data Link
Layer to the transmit side of the Physical Layer. Incoming transactions
may proceed from the receive side of the Physical Layer to the receive
side of the Data Link Layer and then on to the receive side of the Transaction Layer.
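The transmit flow just described can be pictured as a three-stage pipeline. The following sketch is purely illustrative; the dictionaries and function names are stand-ins invented for this discussion, not structures defined by the specification.

```python
# Illustrative only: dictionaries and function names are stand-ins for
# this discussion, not structures defined by the PCI Express specification.

def transaction_layer_tx(core_request):
    # Turn a core request into a TLP: a header plus an optional data payload.
    return {"header": core_request["type"], "data": core_request.get("data")}

def data_link_layer_tx(tlp, sequence_number):
    # Wrap the TLP with a sequence number and an LCRC placeholder.
    return {"seq": sequence_number, "tlp": tlp, "lcrc": "<lcrc>"}

def physical_layer_tx(packet):
    # Add framing around the packet before serializing it onto the link.
    return ["STP", packet, "END"]

# A memory read request flowing down the transmit side of the three layers.
on_the_wire = physical_layer_tx(
    data_link_layer_tx(transaction_layer_tx({"type": "MemRd"}), 7))
```

The receive side of the other device unwraps the packet in the reverse order, one layer at a time.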
Packet Formation
In a traditional parallel interface like AGP, sideband signals (such as
C/BE[3:0]#, SBA[7:0] and so on) transmit the information for command
type, address location, length, and so on. As discussed previously, no
such sideband signals exist in PCI Express. Therefore, the packets that
are being sent back and forth must incorporate this sort of information.
The three architectural build layers accomplish this by building up
the packets into a full scale PCI Express transaction. This buildup is
shown in Figure 5.5.
Figure 5.5 Packet buildup at each layer. The Transaction Layer produces the header, data, and optional ECRC; the Data Link Layer adds a sequence number and an LCRC; the Physical Layer adds framing around the packet.
Note
The PCI Express specification defines three types of CRCs: an ECRC, an LCRC,
and a CRC. Each of these CRC types provides a method for PCI Express devices to verify the contents of received packets. The transmitting device performs
a calculation on the bit values of the outgoing packet and appends the result of
that calculation to the packet. The appropriate receiver then performs the same
calculation on the incoming packet and compares the result to the attached
value. If one or more bit errors occurred during the transmission of that packet,
the two calculated values do not match and the receiver knows that the packet
is unreliable.
The differences between the three CRC types deal with the sizes (32 bits long
versus 16 bits long), and the PCI Express layer that is responsible for generating and checking the values. Additional details on ECRCs are contained in
Chapter 6, Transaction Layer Architecture and additional details on LCRCs
and CRCs are contained in Chapter 7, Data Link Layer Architecture.
Figure 5.6 A simple system example. The CPU and memory attach to the GMCH (Device A), which connects over a PCI Express link to the ICH (Device B); the ICH provides an LPC bus.
The following example details how PCI Express may be used to help
boot up a standard computer. Once the system has powered up, the CPU
sends out a memory read request for the first BIOS instruction. This request comes to Device A across the processor's system bus. Device A's
core decodes this transaction and realizes that the requested address is
not its responsibility and this transaction needs to be forwarded out to
Device B. This is where PCI Express comes into play.
Device A's core passes this memory read request to its PCI Express
block. This block is then responsible for turning the request into a legitimate PCI Express request transaction and sending it out across the
PCI Express link. On the other side of the link, Device B's PCI Express
block is responsible for receiving and decoding the request transaction,
verifying its integrity, and passing it along to the Device B core.
Now the Device B core has just received a memory read request, so it
sends that request out its LPC (low pin count) bus to read that address
location from the system's flash/BIOS device. Once the Device B core receives the requested data back, it passes the data along to its PCI Express
block.
Device B's PCI Express block is then responsible for turning this data
into a legitimate PCI Express completion transaction and sending it back
up the PCI Express link. On the other side of the link, Device A's PCI Express block is responsible for receiving and decoding the transaction,
verifying its integrity, and passing it along to the Device A core. The Device A core now has the appropriate information and forwards it along to
the CPU. The computer is now ready to start executing instructions.
With this big picture in mind, the following sections start to examine
how each of the PCI Express architectural layers contributes to accomplishing this task.
Transaction Layer
As mentioned earlier, the Transaction Layer is the uppermost PCI Express
architectural layer and starts the process of turning request or data packets from the device core into PCI Express transactions. This layer receives a request (such as "read from BIOS location FFF0h") or a completion
packet ("here is the result of that read") from the device core. It is then
responsible for turning that request/data into a Transaction Layer Packet
(TLP). A TLP is simply a packet that is sent from the Transaction Layer of
one device to the Transaction Layer of the other device. The TLP uses a
header to identify the type of transaction that it is (for example, I/O versus memory, read versus write, request versus completion, and so on).
Please note that the Transaction Layer has direct interaction only
with its device core and its Data Link Layer, as shown in Figure 5.7. It relies on its device core to provide valid requests and completion data, and
on its Data Link to get that information to and from the Transaction Layer
on the other side of the link.
Figure 5.7 The Transaction Layer interacts directly only with its device core and its Data Link Layer; it communicates with the Transaction Layer on the other side of the link only through the layers below.
How might this layer behave in the previous Big Picture startup example? The Device A core issues a memory read request with associated length and address to its PCI Express block. The Transaction Layer's transmit functions turn that information into a TLP by building a memory read request header. Once the TLP is created, it is passed along to the transmit side of the Data Link Layer. Some time later, Device A's Transaction Layer receives the completion packet for that request from the receive side of its Data Link Layer. The Transaction Layer's receive side then decodes the header associated with that packet and passes the data along to its device core.
The Transaction Layer also has several other functions, such as flow
control and power management. Chapter 6, Transaction Layer Architecture contains additional details on the Transaction Layer and TLPs.
Chapter 9, Flow Control contains additional details on the flow control
mechanisms for those TLPs, and Chapter 11, Power Management contains additional details on the various power management functions.
Data Link Layer
The second architectural layer is the Data Link Layer. The main responsibility of this layer is to ensure that the information transferred
across the link is wholesome. It is responsible for making sure that each
packet makes it across the link, and makes it across intact.
Figure 5.8 The Data Link Layer sits between the Transaction Layer and the Physical Layer, with transmit and receive functions for each.
This layer receives TLPs from the transmit side of the Transaction
Layer and continues the process of building that into a PCI Express transaction. It does this by adding a sequence number to the front of the
packet and an LCRC error checker to the end. The sequence number
serves the purpose of making sure that each packet makes it across the
link. For example, if the last sequence number that Device A successfully
received was #6, it expects the next packet to have a sequence number
of 7. If it instead sees #8, it knows that packet #7 got lost somewhere and
notifies Device B of the error. The LCRC serves to make sure that each
packet makes it across intact. As mentioned previously, if the LCRC does
not check out at the receiver side, the device knows that there was a bit
error sometime during the transmission of this packet. This scenario also
generates an error condition. Once the transmit side of the Data Link
Layer applies the sequence number and LCRC to the TLP, it submits the resulting packet
to the Physical Layer.
The receiver side of the Data Link Layer accepts incoming packets
from the Physical Layer and checks the sequence number and LCRC to
make sure the packet is correct. If it is correct, it then passes it up to the
receiver side of the Transaction Layer. If an error occurs (either wrong
sequence number or bad data), it does not pass the packet on to the
Transaction Layer until the issue has been resolved. In this way, the Data
Link Layer acts a lot like the security guard of the link. It makes sure that
only the packets that are supposed to be there are allowed through.
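The security-guard role described above amounts to two checks on every incoming packet. The sketch below models only the decision logic: zlib.crc32 stands in for the LCRC calculation (the real LCRC is also a 32-bit CRC, but its exact computation differs in detail), and the "ack"/"nak" strings stand in for the acknowledgement DLLPs discussed in Chapter 7.

```python
import zlib

# A model of the Data Link Layer's receive-side decision logic only.
# zlib.crc32 is a stand-in for the LCRC; "ack"/"nak" stand in for DLLPs.

def dll_receive(packet, expected_seq):
    """Return ("ack", tlp) if the packet passes both checks, else ("nak", None)."""
    if packet["seq"] != expected_seq:
        return "nak", None                 # a packet was lost along the way
    if zlib.crc32(packet["tlp"]) != packet["lcrc"]:
        return "nak", None                 # bit error during transmission
    return "ack", packet["tlp"]            # safe to pass up to the Transaction Layer

tlp = b"memory-read-completion"
good = {"seq": 7, "tlp": tlp, "lcrc": zlib.crc32(tlp)}
bad = {"seq": 7, "tlp": tlp, "lcrc": zlib.crc32(tlp) ^ 1}  # corrupted check value
```

Only a packet with the expected sequence number and a matching LCRC is handed up; anything else triggers the error handling described next.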
The Data Link Layer is also responsible for several link management
functions. To do this, it generates and consumes Data Link Layer Packets
(DLLPs). Unlike TLPs, these packets are created at the Data Link Layer.
These packets are used for link management functions such as error notification, power management, and so on.
How might this layer behave in the previous Big Picture startup example? The Transaction Layer in Device A creates a memory read request
TLP and passes it along to the Data Link Layer. This layer adds the appropriate sequence number and generates an LCRC to append to the end of
the packet. Once these two functions are performed, the Data Link Layer
passes this new, larger packet along to the Physical Layer. Some time
later, Device A's Data Link Layer receives the completion packet for that
request from the receive side of its Physical Layer. The Data Link Layer
then checks the sequence number and LCRC to make sure the received
read completion packet is correct.
What happens if the received packet at Device A was incorrect (assume the LCRC did not check out)? The Data Link Layer in Device A then
creates a DLLP that states that there was an error and that Device B
should resend the packet. Device A's Data Link Layer passes that DLLP on
to its Physical Layer, which sends it over to Device B. The Data Link
Layer in Device B receives that DLLP from its Physical Layer and decodes
the packet. It sees that there was an error on the read completion packet
and resubmits that packet to its Physical Layer. Please note that the Data
Link Layer of Device B does this on its own; it does not send it on to its
Transaction Layer. The Transaction Layer of Device B is not responsible
for the retry attempt.
Eventually, Device A receives that resent packet and it proceeds from
the receive side of the Physical Layer to the receive side of the Data Link
Layer. If the sequence number and LCRC check out this time around, it
then passes that packet along to the Transaction Layer. The Transaction
Layer in Device A has no idea that a retry was needed for this packet; it is
totally dependent on its Data Link Layer to make sure the packet is correct.
Additional details on this layer's functions, sequence numbers, LCRCs
and DLLPs are explained in Chapter 7, Data Link Layer Architecture.
Physical Layer
Finally, the lowest PCI Express architectural layer is the Physical Layer.
This layer is responsible for actually sending and receiving all the data to
be sent across the PCI Express link. The Physical Layer interacts with its
Data Link Layer and the physical PCI Express link (wires, cables, optical
fiber, and so on), as shown in Figure 5.9. This layer contains all the circuitry for the interface operation: input and output buffers, parallel-to-serial and serial-to-parallel converters, and so on.
Figure 5.9 The Physical Layer, with its transmit and receive functions, connects the Data Link Layer to the PCI Express link itself.
How might this layer behave in the previous Big Picture startup example? Once power up occurs, the Physical Layers on both Device A and
Device B are responsible for initializing the link to get it up and running
and ready for transactions. This initialization process includes determining how many lanes should be used for the link. To make this example
simple, both devices support a x1 link. Sometime after the link is properly initialized, that memory read request starts to work its way through
Device A. Eventually it makes its way down to Device A's Physical Layer,
complete with a sequence number, memory read request header, and
LCRC. The Physical Layer takes that packet of data and transforms it into
a serial data stream after it applies 8-bit/10-bit encoding and data scrambling to each character. The Physical Layer knows the link consists of a
single lane running at 2.5 gigahertz, so it sends that data stream out its
one transmit pair at that speed. In doing this, it needs to meet certain
electrical and timing rules that are discussed in Chapter 8, Physical Layer
Architecture. The Physical Layer on Device B sees this data stream appear on its differential receiver input buffers and samples it accordingly.
It then decodes the stream, builds it back into a data packet and passes it
along to its Data Link Layer.
Please note that the Physical Layers of both devices completely insulate the rest of the layers and devices from the physical details for the
transmission of the data. How that data is transmitted across the link is
completely a function of the Physical Layer. In a traditional computer system, the two devices would be located on the same FR4 motherboard
planar and connected via copper traces. There is nothing in the PCI Express specification, however, that would require this sort of implementation. If designed properly, the two devices could implement their PCI
Express buffers as optical circuits that are connected via a 6-foot-long optical fiber cable. The rest of the layers would not know the difference.
This provides PCI Express an enormous amount of flexibility in the ways
it can be implemented. As speed or transmission media changes from system to system, those modifications can be localized to one architectural
layer.
Additional details on this layer's functions, 8-bit/10-bit encoding, electrical requirements and timing requirements are explained in Chapter 8,
Physical Layer Architecture.
Chapter 6

Transaction Layer Architecture
This chapter goes into the details of the uppermost architectural layer:
the Transaction Layer. This layer creates and consumes the request
and completion packets that are the backbone of data transfer across PCI
Express. The chapter discusses the specifics for Transaction Layer Packet
(TLP) generation, how the header is used to identify the transaction, and
how the Transaction Layer handles incoming TLPs. Though TLP flow
control is a function of the Transaction Layer, that topic is discussed in
Chapter 9, Flow Control and is not discussed in this chapter.
On the transmit side, the Transaction Layer receives request data (such as "read
from BIOS location FFF0h") or completion data ("here is the result of that
read") from the device core, and then turns that information into an outgoing PCI Express transaction. On the receive side, the Transaction Layer
also accepts incoming PCI Express transactions from its Data Link Layer
(refer to Figure 6.1). This layer assumes all incoming information is correct, because it relies on its Data Link Layer to ensure that all incoming
information is error-free and properly ordered.
Figure 6.1 The Transaction Layer interacts directly with its device core and its Data Link Layer, and exchanges TLPs with the Transaction Layer on the other side of the link.
The Transaction Layer uses TLPs to communicate request and completion data with other PCI Express devices. TLPs may address several
address spaces and have a variety of purposes, for example: read versus
write, request versus completion, and so on. Each TLP has a header associated with it to identify the type of transaction. The Transaction Layer of
the originating device generates the TLP and the Transaction Layer of the
destination device consumes the TLP. The Transaction Layer also has
several other responsibilities, such as managing TLP flow control (discussed in Chapter 9, Flow Control) and controlling some aspects of
power management.
TLPs are the packets used to transfer request and completion information between PCI Express devices. A TLP consists of a header, an optional data payload, and an optional TLP digest. The Transaction Layer generates outgoing TLPs based
on the information it receives from its device core. The Transaction Layer
then passes the TLP on to its Data Link Layer for further processing. The
Transaction Layer also accepts incoming TLPs from its Data Link Layer.
The Transaction Layer decodes the header and digest (optional) information, and then passes along the appropriate information and data payload
(again optional) to its device core. A generic TLP is shown in Figure 6.2.
Figure 6.2 A generic TLP: a header, an optional data payload (Data Byte 0 through Data Byte N-1), and an optional digest.
The TLP always begins with a header. The header is DWord aligned (always a multiple of four bytes) but varies in length based on the type of
transaction. Depending on the type of packet, TLPs may contain a data
payload. If present, the data payload is also DWord-aligned for both the
first and last DWord of data. DWord Byte Enable fields within the header
indicate whether "garbage" bytes are appended to either the beginning
or ending of the payload to achieve this DWord alignment. Finally, the
TLP may include a digest at the end of the packet.
Like the data payload, the digest is optional and is not always used. If
used, the digest field contains an ECRC (end-to-end CRC) that ensures the
contents of the TLP are properly conveyed from the source of the transaction to its ultimate destination. The Data Link Layer ensures that the
TLP makes it across a given link properly, but does not necessarily guarantee that the TLP makes it to its destination intact. For example, if the
TLP is routed through an intermediate device (such as a switch), it is possible that during the handling of the TLP, the switch introduces an error
within the TLP. An ECRC may be appended to the TLP to ensure that this
sort of error does not go undetected.
TLP Headers
All TLPs consist of a header that contains the basic identifying information for the transaction. The TLP header may be either 3 or 4 DWords in
length, depending on the type of transaction. This section covers the details of the TLP header fields, beginning with the first DWord (bytes 0
through 3) common to all TLP headers. The format for this DWord is shown in
Figure 6.3 The first DWord common to all TLP headers, containing the Fmt, Type, TC, TD, EP, Attr, and Length fields (R indicates reserved bits).
TLP fields marked with an R indicate a reserved bit or field. Reserved bits
are filled with 0s during TLP formation, and are ignored by receivers.
The format (Fmt) field indicates the format of the TLP itself. Table 6.1
shows the associated values for that field.
Table 6.1  TLP Format Field Values

Fmt [1:0]   TLP Format
00b         3 DWord header, no data payload
01b         4 DWord header, no data payload
10b         3 DWord header, with data payload
11b         4 DWord header, with data payload
As can be seen in Table 6.1, the format field indicates the length of the
TLP header, but does not directly identify the type of transaction. This is
determined by the combination of the Format and Type fields, as shown
in Table 6.2.
Table 6.2  Fmt and Type Field Encodings

Fmt [1:0]   Type [4:0]     TLP Type
Encoding    Encoding       Description
00 or 01    0 0000         MRd (memory read request)
00 or 01    0 0001         MRdLk (locked memory read request)
10 or 11    0 0000         MWr (memory write request)
00          0 0010         IORd (I/O read request)
10          0 0010         IOWr (I/O write request)
00          0 0100         CfgRd0 (type 0 configuration read)
10          0 0100         CfgWr0 (type 0 configuration write)
00          0 0101         CfgRd1 (type 1 configuration read)
10          0 0101         CfgWr1 (type 1 configuration write)
01          1 0r2 r1 r0    Msg (message request)
11          1 0r2 r1 r0    MsgD (message request with data)
01          1 1n2 n1 n0    MsgAS (advanced switching message)
11          1 1c2 c1 c0    MsgASD (advanced switching message with data)
00          0 1010         Cpl (completion without data)
10          0 1010         CplD (completion with data)
00          0 1011         CplLk (completion for locked read, without data)
10          0 1011         CplDLk (completion for locked read, with data)
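Because the Fmt and Type fields jointly identify the transaction, a decoder is naturally a table lookup. The sketch below covers the fixed memory, I/O, configuration, and completion encodings from Table 6.2; message encodings embed routing bits in the Type field, so they fall to a catch-all here. The dictionary and function names are our own construction for illustration.

```python
# (Fmt, Type) pairs from Table 6.2, expressed as a lookup table.
# Message encodings carry routing bits inside Type and are omitted.
TLP_TYPES = {
    (0b00, 0b00000): "MRd",     # memory read, 3 DW header
    (0b01, 0b00000): "MRd",     # memory read, 4 DW header
    (0b00, 0b00001): "MRdLk",   # locked memory read, 3 DW header
    (0b01, 0b00001): "MRdLk",   # locked memory read, 4 DW header
    (0b10, 0b00000): "MWr",     # memory write, 3 DW header
    (0b11, 0b00000): "MWr",     # memory write, 4 DW header
    (0b00, 0b00010): "IORd",
    (0b10, 0b00010): "IOWr",
    (0b00, 0b00100): "CfgRd0",
    (0b10, 0b00100): "CfgWr0",
    (0b00, 0b00101): "CfgRd1",
    (0b10, 0b00101): "CfgWr1",
    (0b00, 0b01010): "Cpl",
    (0b10, 0b01010): "CplD",
    (0b00, 0b01011): "CplLk",
    (0b10, 0b01011): "CplDLk",
}

def decode_tlp_type(fmt, tlp_type):
    """Name the TLP identified by the Fmt and Type header fields."""
    return TLP_TYPES.get((fmt, tlp_type), "unknown/message")
```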
Figure 6.4 Memory request header with 64-bit addressing: Address[63:32] in bytes 8 through 11 and Address[31:2] in bytes 12 through 15.
Figure 6.5 Memory request header with 32-bit addressing: the common first DWord, followed by the Requester ID, Tag, Last DW BE, and 1st DW BE fields in the second DWord, and Address[31:2] in bytes 8 through 11.
The address mapping for TLP headers is outlined in Table 6.3. All TLP
headers, not just memory requests, use this address scheme. Please note
that address bits [31:2] are not in the same location for 64-bit address
formats as they are for 32-bit addressing formats. If addressing a location
below 4 gigabytes, requesters must use the 32-bit address format.
Table 6.3  Address Field Mapping

Address Bits   32-Bit Addressing   64-Bit Addressing
63:56          Not Applicable      Header byte 8
55:48          Not Applicable      Header byte 9
47:40          Not Applicable      Header byte 10
39:32          Not Applicable      Header byte 11
31:24          Header byte 8       Header byte 12
23:16          Header byte 9       Header byte 13
15:8           Header byte 10      Header byte 14
7:2            Header byte 11      Header byte 15
The Requester ID field (bytes 4 and 5 in Figure 6.5) contains the bus, device and function number of the requester. This is a 16-bit value that is
unique for every PCI Express function within a hierarchy. Bus and device
numbers within a root complex may be assigned in an implementation
specific manner, but all other PCI Express devices (or functions within a
multi-function device) must comprehend the bus and device number
they are assigned during configuration. PCI Express devices (other than
the root complex) cannot make assumptions about their bus or device
number. Each device receives a configuration write that identifies its assigned bus and device number. Since this information is necessary to
generate any request TLP, a device cannot initiate a request until it receives that initial configuration write containing its assigned bus and device numbers. This model is consistent with the existing PCI model for
system initialization and configuration. Figure 6.6 shows the requester ID
format.
Figure 6.6 Requester ID format: Bus Number (8 bits), Device Number (5 bits), and Function Number (3 bits).
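The field layout in Figure 6.6 makes the 16-bit requester ID straightforward to assemble and take apart with shifts and masks: bus number in bits 15:8, device number in bits 7:3, function number in bits 2:0. The helper names below are ours, for illustration only.

```python
# Requester ID layout per Figure 6.6: bus in bits 15:8, device in
# bits 7:3, function in bits 2:0.

def pack_requester_id(bus, device, function):
    """Build the 16-bit requester ID from its three components."""
    assert 0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8
    return (bus << 8) | (device << 3) | function

def unpack_requester_id(requester_id):
    """Split a requester ID back into (bus, device, function)."""
    return (requester_id >> 8, (requester_id >> 3) & 0x1F, requester_id & 0x7)

rid = pack_requester_id(bus=3, device=4, function=1)   # 0x0321
```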
The Tag field (byte 6 in Figure 6.5) is an 8-bit field that helps to uniquely
identify outstanding requests. The requester generates a unique tag value
for each of its outstanding requests that requires a completion. Requests
that do not require a completion do not have a tag assigned to them (the
tag field is undefined and may have any value). If a completion is required, the requester ID and tag value are copied into the completion
header. This allows the system to route that completion packet back up
to the original requester. The returned tag value identifies which request
the completion packet is responding to. These two values form a global
identification (referred to as a transaction ID) that uniquely identifies every outstanding request.
Table 6.4  Byte Enable Fields

Byte Enable      Header Location    Corresponding Data Byte
1st DW BE[0]     Bit 0 of byte 7    Byte 0
1st DW BE[1]     Bit 1 of byte 7    Byte 1
1st DW BE[2]     Bit 2 of byte 7    Byte 2
1st DW BE[3]     Bit 3 of byte 7    Byte 3
Last DW BE[0]    Bit 4 of byte 7    Byte N-4
Last DW BE[1]    Bit 5 of byte 7    Byte N-3
Last DW BE[2]    Bit 6 of byte 7    Byte N-2
Last DW BE[3]    Bit 7 of byte 7    Byte N-1
If the request indicates a length greater than a single DWord, neither the
First DW BE field nor the Last DW BE field can be 0000b. Both must specify at least a single valid byte within their respective DWord. For example, if a device wanted to write six bytes to memory, it needs to send a
data payload of two DWords, but only six of the accompanying eight
bytes of data would be legitimately intended for that write. In order to
make sure the completer knows which bytes are to be written, the requester could indicate a First DW BE field of 1111b and a Last DW BE
field of 0011b. This indicates that the four bytes of the first DWord and
the first two bytes of the second (and last) DWord are the six bytes intended to be written. The completer knows that the final two bytes of
the accompanying data payload are not to be written.
If the request indicates a data length of a single DWord, the Last DW
BE field must equal 0000b. If the request is for a single DWord, the First
DW BE field can also be 0000b. If a write request of a single DWord is
accompanied by a First DW BE field of 0000b, that request should have
no effect at the completer and is not considered a malformed (improperly built) packet. If a read request of a single DWord is accompanied by
a First DW BE field of 0000b, the corresponding completion for that request should include (and indicate) a data payload one DWord in length. The
contents of that data payload are unspecified, however, and may be any
value. A memory read request of one DWord with no bytes enabled is referred to as a zero length read. These reads may be used by devices as a
type of flush request, allowing a device to ensure that previously issued
posted writes have been completed.
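The byte enable rules above can be condensed into a small helper for the common case of a contiguous, DWord-aligned transfer. These functions are a sketch written for this discussion (the general case of an unaligned start address would also shift the First DW BE bits); note that a length of zero bytes naturally yields the zero length read encoding.

```python
def dw_length(nbytes):
    """Payload Length in DWords; a zero length read still indicates one DWord."""
    return max(1, (nbytes + 3) // 4)

def byte_enables(nbytes):
    """First/Last DW BE fields for a contiguous, DWord-aligned transfer.

    Sketch only: an unaligned start address would also shift the bits
    of the First DW BE field.
    """
    if nbytes <= 4:
        # Single-DWord requests must carry Last DW BE = 0000b; a request
        # for zero bytes gives First DW BE = 0000b (a "zero length read").
        return ((1 << nbytes) - 1) & 0xF, 0b0000
    valid_in_last = nbytes % 4 or 4    # valid bytes in the final DWord
    return 0b1111, (1 << valid_in_last) - 1

# The six-byte write from the text: two DWords of payload, all four bytes
# of the first DWord and the first two bytes of the last DWord enabled.
first, last = byte_enables(6)          # first == 0b1111, last == 0b0011
```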
Figure 6.7 I/O request header: the common first DWord with TC = 000 and Length = 1, the Requester ID, Tag, and byte enable fields, and Address[31:2] in bytes 8 through 11.
Figure 6.8 Configuration request header: the common first DWord with Length = 1, the Requester ID, Tag, and byte enable fields, and the target Bus Number, Device Number, Function Number, Extended Register Number, and Register Number in bytes 8 through 11.
Message Headers
Recall that since PCI Express has no sideband signals (such as INTA#,
PME#, and so on), all special events must be transmitted as packets
(called messages) across the PCI Express link. There are two different
types of messages, those classified as baseline messages, and those
needed for advanced switching. Baseline messages are used for INTx interrupt signaling, power management, error signaling, locked transaction
support, slot power limit support, hot plug signaling, and for other vendor defined messaging. Advanced switching messages are used for data
packet messages or signal packet messages.
Baseline Messages
All baseline messages have the common DWord shown in Figure 6.3 as
the first DWord of the header. The second DWord for all baseline messages uses the transaction ID (requester ID + tag) in the same location as
memory, I/O and configuration requests. It then adds a Message Code
field to specify the type of message. Figure 6.9 shows the format for a
baseline message header.
Figure 6.9 Baseline Message Header
Most messages use the Msg encoding for the Type field. Exceptions
to this include the Slot Power Limit message, which uses the MsgD format, and vendor defined messages, which may use either the Msg or
MsgD encoding. Recall from Table 6.2 that the Msg encoding is 01b in the Fmt field and 1 0 r2 r1 r0 in the Type field, where r[2:0] indicates message routing.
The MsgD encoding is similar, but with 11b in the Fmt field indicating
that a data payload is attached. In addition to the address-based routing
used by memory and I/O requests, and the ID-based routing employed by
configuration requests, messages may use several other routing schemes.
The r[2:0] sub-field indicates the type of routing scheme that a particular
message employs. Table 6.5 outlines the various routing options.
Table 6.5    Message Routing
r[2:0]     Description
000        Routed to Root Complex
001        Routed by address
010        Routed by ID
011        Broadcast from Root Complex
100        Local - terminate at receiver
101        Gathered and routed to Root Complex
110-111    Reserved
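Building the message Type field can be sketched as follows. This is an illustrative Python fragment (the function name is hypothetical), and the routing descriptions for encodings the table leaves implicit are assumptions consistent with the routing behavior described later in this chapter.

```python
# Message routing sub-field r[2:0] of the Type field, per Table 6.5.
ROUTING = {
    0b000: "routed to root complex",
    0b001: "routed by address",
    0b010: "routed by ID",
    0b011: "broadcast from root complex",
    0b100: "local - terminate at receiver",
    0b101: "gathered and routed to root complex",
}

def msg_type_field(r):
    """Build the 5-bit message Type field: 1 0 r2 r1 r0."""
    if r not in ROUTING:
        raise ValueError("reserved routing encoding")
    return 0b10000 | r
```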
Interrupt Messages
Table 6.6    INTx Messages
Code[7:0]    Name/Description
0010 0000    Assert INTA
0010 0001    Assert INTB
0010 0010    Assert INTC
0010 0011    Assert INTD
0010 0100    De-assert INTA
0010 0101    De-assert INTB
0010 0110    De-assert INTC
0010 0111    De-assert INTD
INTx messages must use the default traffic class, TC0 (this is different from MSI interrupts, which are not restricted to the default traffic class).
Power Management Messages. These messages are used to support power management operations. There are four distinct power management messages, as shown in Table 6.7.
Table 6.7    Power Management Messages
Code[7:0]    Routing r[2:0]    Name/Description
0001 0100    100               PM_Active_State_Nak
0001 1000    000               PM_PME
0001 1001    011               PME_Turn_Off
0001 1011    101               PME_TO_Ack
Table 6.8    Error Messages
Code[7:0]    Name/Description
0011 0000    ERR_COR (correctable error)
0011 0001    ERR_NONFATAL (uncorrectable, nonfatal error)
0011 0011    ERR_FATAL (uncorrectable, fatal error)
Correctable errors are error conditions where the PCI Express protocol (and specifically hardware) can recover without any loss of information. An example of this type of error is an LCRC error that is detected by
the Data Link Layer and corrected through normal retry means. An uncorrectable error is one that impacts the functionality of the interface and
may be classified as either fatal or nonfatal. A fatal error is uncorrectable
and renders that particular link unreliable. A reset of the link may be required to return to normal, reliable operation. Platform handling of fatal
Locked Transaction Messages
Table 6.9
Code[7:0]    Routing r[2:0]    Name/Description
0000 0000    011               Unlock
The Unlock message does not include a data payload and treats the
Length field as reserved. Unlock messages must use the default traffic
class, TC0. As evidenced by the r[2:0] value, the root complex initiates
and broadcasts this message.
Slot Power Limit Messages. PCI Express provides a mechanism for a
system to control the maximum amount of power provided to a PCI Express slot or module. The message identified here is used to provide a
mechanism for the upstream device (for example, root complex) to modify the power limits of its downstream devices. A card or module must
not consume more power than it was allocated by the Set Slot Power
Limit message. The format for this message is shown in Table 6.10.
Table 6.10    Set Slot Power Limit Message
Code[7:0]    Routing r[2:0]    Name/Description
0101 0000    100               Set_Slot_Power_Limit
The Set Slot Power Limit message contains a one DWord data payload
with the relevant power information. This data payload is a copy of the
slot capabilities register of the upstream device and is written into the
device capabilities register of the downstream device. Slot Power messages must use the default traffic class, TC0. As evidenced by the r[2:0]
value, this message is only intended to be sent from an upstream device
(root complex or switch) to its link mate.
Hot Plug Messages. The PCI Express architecture is defined to natively
support both hot plug and hot removal of devices. There are seven distinct Hot Plug messages. As shown in Table 6.11, these messages simulate the various states of the power indicator, attention button, and
attention indicator.
Table 6.11    Hot Plug Messages
Code[7:0]    Name/Description
0100 0101    Power Indicator On
0100 0111    Power Indicator Blink
0100 0100    Power Indicator Off
0100 1000    Attention Button Pressed
0100 0001    Attention Indicator On
0100 0011    Attention Indicator Blink
0100 0000    Attention Indicator Off
Hot plug messages do not contain a data payload and treat the Length
field as reserved. Hot plug messages must use the default traffic class,
TC0. Additional details on PCI Express hot plug support are found in
Chapter 12.
Completion Packet/Header
Some, but not all, of the requests outlined so far in this chapter may require a completion packet. Completion packets always contain a completion header and, depending on the type of completion, may contain a
number of DWords of data as well. Since completion packets are really
only differentiated based on the completion header, this section focuses
on that header format.
Completion headers are three DWords in length and have the common DWord shown in Figure 6.3 as the first DWord of the header. The
second DWord for completion headers makes use of some unique fields: a
Completer ID, Completion Status, Byte Count Modified (BCM) and Byte
Count. The third and final DWord contains the requester ID and tag values, along with a Lower Address field. Figure 6.10 shows the format for a
completion header.
Figure 6.10 Completion Header
Completion packets are routed by ID, and more specifically, the requester ID that was supplied with the original request. The Completer ID
field (bytes 4 and 5) is a 16-bit value that is unique for every PCI Express
function within the hierarchy. It essentially follows the exact same format as the requester ID, except that it contains the component information for the completer instead of the requester. This format is shown in
Figure 6.11.
Figure 6.11 Completer ID (Bus Number [8 bits], Device Number [5 bits], Function Number [3 bits])
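Since the completer ID follows the same bus/device/function layout as the requester ID, packing and unpacking it is straightforward. A minimal sketch (the helper names are illustrative, not from the specification):

```python
def pack_id(bus, device, function):
    """Pack an 8-bit bus, 5-bit device, and 3-bit function number
    into a 16-bit requester/completer ID."""
    assert 0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8
    return (bus << 8) | (device << 3) | function

def unpack_id(id16):
    """Split a 16-bit ID back into (bus, device, function)."""
    return (id16 >> 8) & 0xFF, (id16 >> 3) & 0x1F, id16 & 0x07
```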
The Completion Status field (bits [7:5] of byte 6) indicates if the request
has been completed successfully. There are four defined completion
status responses, as shown in Table 6.12. The TLP Handling section later
in this chapter contains the details for when each of these completion
options is used.
Table 6.12    Completion Status
Completion Status[2:0] Value    Status
000b          Successful Completion (SC)
001b          Unsupported Request (UR)
010b          Configuration Request Retry Status (CRS)
100b          Completer Abort (CA)
All others    Reserved
field (though they are in other header fields). A value of 00 0000 0001b
in this location indicates a data payload that is one DWord long. A value
of 00 0000 0010b indicates a two DWord value, and so on up to a maximum of 1024 DWords. The data payload for a TLP must not exceed the
maximum allowable payload size, as defined in the device's control register (and more specifically, the Max_Payload_Size field of that register).
TLPs that use a data payload must have the value in the Length field
match the actual amount of data contained in the payload. Receivers
must check to verify this rule and, if violated, consider that TLP to be
malformed and report the appropriate error. Additionally, requests must
not specify an address and length combination that crosses a 4 kilobyte
boundary.
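These Length-field rules can be captured in a short check. This is a sketch with an illustrative function name; the note that the all-zeroes Length encoding means 1024 DWords follows the specification's encoding scheme.

```python
def check_request(address, length_dw, max_payload_dw):
    """Apply the Length rules described above to a memory request.

    length_dw is the decoded DWord count: a Length field of
    00 0000 0001b means 1, and the all-zeroes encoding means 1024.
    """
    if not 1 <= length_dw <= 1024:
        return "malformed"
    if length_dw > max_payload_dw:        # exceeds Max_Payload_Size
        return "malformed"
    first, last = address, address + 4 * length_dw - 1
    if first // 4096 != last // 4096:     # crosses a 4 kilobyte boundary
        return "malformed"
    return "ok"
```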
When a data payload is included in a TLP, the first byte of data corresponds to the lowest byte address (that is to say, closest to zero) and subsequent bytes of data are in increasing byte address sequence. For
example, a 16 byte write to location 100h would place the data in the
payload as shown in Figure 6.12.
             +0                  +1                  +2                  +3
Header
Data Byte 0          Data Byte 1          Data Byte 2          Data Byte 3
Address 100h         Address 101h         Address 102h         Address 103h
Data Byte 4          Data Byte 5          Data Byte 6          Data Byte 7
Address 104h         Address 105h         Address 106h         Address 107h
Data Byte 8          Data Byte 9          Data Byte 10         Data Byte 11
Address 108h         Address 109h         Address 10Ah         Address 10Bh
Data Byte 12         Data Byte 13         Data Byte 14         Data Byte 15
Address 10Ch         Address 10Dh         Address 10Eh         Address 10Fh
Figure 6.12
TLP Digest
The Data Link Layer provides the basic data reliability mechanism within
PCI Express via the use of a 32-bit LCRC. This LCRC code can detect errors in TLPs on a link-by-link basis and allows for a retransmit mechanism
for error recovery. This LCRC, however, is based upon the TLP that the Data Link Layer is provided by its Transaction Layer. If an error is induced within the
TLP prior to being provided to the Data Link Layer (for example, by a
switch processing the TLP), the resultant LCRC has no ability to detect
that the TLP itself was in error.
To ensure end-to-end data integrity, the TLP may contain a digest that
has an end-to-end CRC. This optional field protects the contents of the
TLP through the entire system, and can be used in systems that require
high data reliability. The Transaction Layer of the source component
generates the 32-bit ECRC. The ECRC calculation begins with bit 0 of
byte 0 and proceeds from bit 0 to bit 7 of each subsequent byte in the
TLP. It incorporates the entire TLP header and, if present, the data payload. The exact details for the ECRC algorithm are contained in the PCI
Express Base Specification, Rev 1.0. Once calculated, that ECRC value is
placed in the digest field at the end of the TLP (refer to Figure 6.2). If the
ECRC is present and support is enabled, the destination device applies
the same ECRC calculation and compares the value to what is received in
the TLP digest.
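The generate-and-compare flow can be illustrated with a generic CRC-32 standing in for the specification's ECRC algorithm. The real ECRC has its own polynomial, bit ordering, and special handling of certain variant bits, so treat this as a sketch of the flow, not the algorithm; the function names are hypothetical.

```python
import zlib

def attach_digest(tlp_bytes):
    """Append a 32-bit end-to-end check value (generic CRC-32 used here
    as a stand-in for the ECRC algorithm) to a header+payload image."""
    crc = zlib.crc32(tlp_bytes) & 0xFFFFFFFF
    return tlp_bytes + crc.to_bytes(4, "little")

def check_digest(tlp_with_digest):
    """Recompute the check value at the destination and compare it
    to the digest carried at the end of the TLP."""
    body, digest = tlp_with_digest[:-4], tlp_with_digest[-4:]
    return zlib.crc32(body) & 0xFFFFFFFF == int.from_bytes(digest, "little")
```

Any corruption of the protected bytes between source and destination makes the recomputed value disagree with the carried digest.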
The TD bit (bit 7 of byte 2 in the header) indicates whether a TLP digest is provided at the end of the TLP. A value of 1b in this location indicates that a TLP digest is attached, while a value of 0b indicates that no
TLP digest is present. Now what happens if, during the handling of the
TLP, a switch induces an error on the TD bit? It could accidentally switch
it from a 1 to a 0, which would negate the use of the ECRC and could
lead to other undetected errors. The PCI Express specification does not
really have a way to avoid this potential issue, other than to highlight that
it is the utmost of importance that switches maintain the integrity of the
TD bit.
The capability to generate and check ECRCs is reported to software
(by an Advanced Error Capabilities and Control register), which also controls whether the capability is enabled. If a device is enabled to generate
and/or check ECRCs, it must do so for all TLPs.
TLP Handling
This section details how the Transaction Layer handles incoming
TLPs, once they have been verified by the Data Link Layer. A TLP that
makes it through the Data Link Layer has been verified to have traversed
the link properly, but that does not necessarily mean that the TLP is correct. A TLP may make it across the link intact, but may have been improperly formed by its originator. As such, the receiver side of the Transaction Layer performs some checks on the TLP to make sure it has followed the rules described in this chapter. If the incoming TLP does not
check out properly, it is considered a malformed packet, is discarded
(without updating receiver flow control information) and generates an
error condition. If the TLP is legitimate, the Transaction Layer updates its
flow control tracking and continues to process the packet. This is seen in
the flowchart in Figure 6.13.
Figure 6.13 TLP Handling Flowchart
Request Handling
If the TLP is a request packet, the Transaction Layer first checks to make
sure that the request type is supported. If it is not supported, it generates
a non-fatal error and notifies the root complex. If that unsupported request requires a completion, the Transaction Layer generates a completion with a completion status of Unsupported Request (UR).
Figure 6.14 Request Handling Flowchart
The shaded Process Request box indicates that there are optional implementation methods that may be employed by a PCI Express component. For example, if a component wanted to restrict the supported
characteristics of requests (for performance optimizations), it is permitted to issue a Completer Abort if it receives a request that violates its restricted model.
Another implementation-specific option may arise with configuration
requests. Some devices may require a lengthy self-initialization sequence
before they are able to properly handle configuration requests. Rather
than force all configuration requests to wait for the maximum allowable
Completion Handling
If a device receives a completion that does not correspond to any outstanding request, that completion is referred to as an unexpected completion. Receipt of an unexpected completion causes the completion to
be discarded and results in an error condition (nonfatal). The receipt of
unsuccessful completion packets generates an error condition that is dependent on the completion status. The details for how successful completions are handled and impact flow control logic are contained in
Chapter 9, Flow Control.
Chapter
Data Link Layer
Architecture
An error the breadth of a single hair can lead one a thousand miles astray.
Chinese Proverb
This chapter describes the details of the middle architectural layer, the
Data Link Layer. The Data Link Layer's main responsibility is error
detection and correction. The chapter discusses the sequence number
and LCRC (Link CRC), and how they are added to the Transaction Layer
Packet (TLP) to ensure data integrity. It then describes the functions specific to the Data Link Layer, particularly the creation and consumption of
Data Link Layer Packets (DLLPs).
tion. The Data Link Layer adds a sequence number to the front of the
packet and an LCRC error checker to the tail. Once the transmit side of
the Data Link Layer has applied these to the TLP, the Data Link Layer
forwards it on to the Physical Layer. Like the Transaction Layer, the Data
Link Layer has unique duties for both outgoing packets and incoming
packets. For incoming TLPs, the Data Link Layer accepts the packets
from the Physical Layer and checks the sequence number and LCRC to
make sure the packet is correct. If it is correct, the Data Link Layer removes the sequence number and LCRC, then passes the packet up to the
receiver side of the Transaction Layer. If an error is detected (either
wrong sequence number or LCRC does not match), the Data Link Layer
does not pass the bad packet on to the Transaction Layer. Instead, the
Data Link Layer communicates with its link mate to try and resolve the issue through a retry attempt. The Data Link Layer only passes a TLP
through to the Transaction Layer if the packet's sequence number and
LCRC values check out. It is important to note this because this gatekeeping allows the Transaction Layer to assume that everything it receives from the link is correct. As seen in Figure 7.1, the Data Link Layer
forwards outgoing transactions from the Transaction Layer to the Physical Layer, and incoming transactions from the Physical Layer to the
Transaction Layer.
Figure 7.1
Tx
Rx
Tx
Rx
Tx
Rx
Transaction Layer
Data Link Layer
Physical Layer
Figure 7.2 TLP with Sequence Number prepended and LCRC appended
Sequence Number
The Data Link Layer assigns a 12-bit sequence number to each TLP as it is
passed from the transmit side of its Transaction Layer. The Data Link
Layer applies the sequence number, along with a 4-bit reserved field to
the front of the TLP. Refer to Figure 7.3 for the sequence number format.
To accomplish this, the transmit side of this layer needs to implement
two simple counters, one indicating what the next transmit sequence
number should be, and one indicating the most recently acknowledged
sequence number. When a sequence number is applied to an outgoing
TLP, the Data Link Layer refers to its next sequence counter for the appropriate value. Once that sequence number is applied, the Data Link
Layer increments its next sequence counter by one.
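The two transmit-side counters can be sketched as follows; the class and attribute names are illustrative, not from the specification.

```python
class TxSequencer:
    """The two transmit-side counters described above: the next transmit
    sequence number and the most recently acknowledged sequence number.
    Sequence numbers are 12 bits wide, so arithmetic wraps modulo 4096."""
    def __init__(self):
        self.next_seq = 0        # value applied to the next outgoing TLP
        self.acked_seq = None    # no TLP acknowledged yet

    def assign(self):
        """Apply the next sequence number, then increment by one."""
        seq = self.next_seq
        self.next_seq = (self.next_seq + 1) % 4096
        return seq
```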
Figure 7.3 Sequence Number (4 reserved bits and a 12-bit sequence number preceding the TLP header)
On the receiver side, the Data Link Layer checks the sequence number (and LCRC). If they check out properly, the TLP is passed on to the
Transaction Layer. If the sequence number (or LCRC) is incorrect, the
Data Link Layer requests a retry. To accomplish this, the receive side of
this layer needs to implement a counter for the next receiver sequence
number, which indicates the next expected sequence number. If the received sequence number matches that counter (and the LCRC checks),
the Data Link Layer then removes the sequence number, associated reserved bits, and the LCRC. Once the layer removes that data, it forwards
the incoming TLP on to the receive side of the Transaction Layer. When
this occurs, the Data Link Layer increments its next receiver sequence
counter.
If the sequence number does not match the value stored in the receiver's next sequence counter, that Data Link Layer discards that TLP.
The Data Link Layer checks to see if the TLP is a duplicate. If it is, it
schedules an acknowledgement (Ack) DLLP to be sent out for that
packet. If the TLP is not a duplicate, it schedules a negative acknowledgement (Nak) DLLP to report a missing TLP. The Retries section of
this chapter explains this procedure in more detail.
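The receive-side decision can be summarized in one function. This is a sketch: the names are hypothetical, and classifying a mismatched sequence number as a duplicate via a half-range wrap test is an implementation assumption, not a rule quoted from the text.

```python
def receiver_check(expected_seq, received_seq, lcrc_ok):
    """Decide what the receive-side Data Link Layer does with a TLP:
    forward a good TLP, Ack a duplicate, or Nak a bad or missing TLP."""
    if not lcrc_ok:
        return "discard, schedule Nak"
    if received_seq == expected_seq:
        return "forward to Transaction Layer, increment next receiver counter"
    # 12-bit wrap-around distance from the received number back to expected
    behind = (expected_seq - received_seq) % 4096
    if 0 < behind <= 2048:                 # assumed duplicate window
        return "discard, schedule Ack (duplicate)"
    return "discard, schedule Nak (missing TLP)"
```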
The Data Link Layer does not differentiate among types of TLP when
assigning the sequence number. Transactions destined to I/O space do
not have a different set of sequence numbers than memory transactions.
Sequence numbers are not dependent on the completer of the transaction. The Data Link Layer of the transmitting device is the sole determinant of the sequence number assigned to a TLP.
The sequence number is completely link-dependent. If a TLP passes
through a PCI Express device (such as a switch), it has different sequence
numbers associated with it on a link-to-link basis. The TLP header contains all the global identifying information. The sequence numbers only
have meaning for a single transmitter and receiver. For example, if a PCI
Express switch receives a request TLP from its upstream link, it processes
that packet through its upstream receiver logic. That packet has a sequence number associated with it that the upstream Data Link Layer verifies. Once verified and acknowledged on the upstream side, that sequence number no longer means anything. After the request TLP is
passed through the Transaction Layer of the upstream port, it is sent
along to the appropriate downstream port. There, the TX side of the
downstream Data Link Layer appends its own sequence number as the
request TLP is sent out the downstream port. The downstream receiver
verifies and acknowledges this sequence number. If the TLP requires a
completion packet, the sequence number for the completion TLP is also completely independent. The sequence number for the completion TLP on the downstream link has no relationship to the request TLP's sequence number or the upstream link's sequence number (once it is forwarded). Refer to Figure 7.4 for additional clarification: sequence
numbers A, B, C and D are completely independent of one another.
Figure 7.4 In this example, a request is made from the root complex to the PCI Express endpoint through a switch. The upstream request carries sequence number A, the downstream request sequence number B, the downstream completion sequence number C, and the upstream completion sequence number D.
LCRC
The Data Link Layer protects the contents of the TLP by using a 32-bit
LCRC value. The Data Link Layer calculates the LCRC value based on the
TLP received from the Transaction Layer and the sequence number it has
just applied. The LCRC calculation utilizes each bit in the packet, including
the reserved bits (such as bits 7:4 of byte 0). The exact details for the LCRC
algorithm are contained in the PCI Express Base Specification, Rev 1.0.
On the receiver side, the first step that the Data Link Layer takes is to
check the LCRC value. It does this by applying the same LCRC algorithm
to the received TLP (not including the attached 32-bit LCRC). If a single
or multiple-bit error occurs during transmission, the calculated LCRC
value should not match the received LCRC value. If the calculated value
equals the received value, the Data Link Layer then proceeds to check
the sequence number. If the calculated LCRC value does not equal the
received value, the TLP is discarded and a Nak DLLP is scheduled for
transmission.
Like sequence numbers, the LCRC protects the contents of a TLP on
a link-by-link basis. If a TLP travels across several links (for example,
passes through a switch on its way to the root complex), an LCRC value
is generated and checked for each link. In this way, it is different than the
ECRC value that may be generated for a TLP. The ECRC serves to protect
the TLP contents from one end of the PCI Express topology to the other
end (refer to Chapter 6), while the LCRC only ensures TLP reliability for a
given link. The 32-bit LCRC value for TLPs is also differentiated from the
16-bit CRC value that is used for DLLP packets.
Retries
The transmitter cannot assume that a transaction has been properly received until it gets a proper acknowledgement back from the receiver. If
the receiver sends back a Nak (for something like a bad sequence number or LCRC), or fails to send back an Ack in an appropriate amount of
time, the transmitter needs to retry all unacknowledged TLPs. To accomplish this, the transmitter implements a Data Link Layer retry buffer.
All copies of transmitted TLPs must be stored in the Data Link Layer
retry buffer. Once the transmitter receives an appropriate acknowledgement back, it purges the appropriate TLPs from its retry buffer. It also
updates its acknowledged sequence number counter.
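A minimal sketch of the retry buffer bookkeeping, with illustrative names (the specification defines the behavior, not this interface):

```python
class RetryBuffer:
    """Copies of transmitted TLPs are held, oldest first, until an
    acknowledgement purges them."""
    def __init__(self):
        self.pending = {}        # seq -> TLP copy; dict preserves order

    def store(self, seq, tlp):
        self.pending[seq] = tlp

    def ack(self, acked_seq):
        # An Ack for sequence N also acknowledges every older pending TLP.
        for seq in list(self.pending):
            del self.pending[seq]
            if seq == acked_seq:
                break

    def retry_order(self):
        # A retry retransmits unacknowledged TLPs oldest-first.
        return list(self.pending)
```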
Note
A quick note on retry terminology: the PCI Express specification often flips back
and forth between the terms retry and replay. For example, the buffer that is
used during retry attempts is called a retry buffer, but the timeout counter associated with that buffer is called a replay timer. To avoid as much confusion as
possible, this chapter sticks to the term retry as much as possible and only uses
replay when referring to a function that uses that term expressly within the
specification.
TLPs may be retried for two reasons. First, a TLP is retried if the receiver
sends back a Nak DLLP indicating some sort of transmission error. The
second reason for a retry deals with a replay timer, which helps ensure
that forward progress is being made. The transmitter side of the Data
Link Layer needs to implement a replay timer that counts the time since
the last Ack or Nak DLLP was received. This timer runs anytime there is
an outstanding TLP and is reset every time an Ack or Nak DLLP is received. When no TLPs are outstanding, the timer should reset and hold
so that it does not unnecessarily cause a time-out. The replay timer limit
depends upon the link width and maximum payload size. The larger the
maximum payload size and the narrower the link width, the longer the
replay timer can run before timing out (since each packet requires more
time to transmit). If the replay timer times out, the Data Link Layer reports an error condition.
If either of these events occurs (a Nak reception or a replay timer expiration), the transmitter's Data Link Layer begins a retry. The
Data Link Layer increments a replay number counter. This is a 2-bit
counter that keeps track of the number of times the retry buffer has been
retransmitted. If the replay counter rolls over from 11b to 00b (that is,
this is the fourth retry attempt) the Data Link Layer indicates an error
condition that requires the Physical Layer to retrain the link (refer to
Chapter 8, Physical Layer Architecture for details on retraining). The
Data Link Layer resets its replay counter every time it successfully receives an acknowledgement, so the retrain procedure only occurs if a retry attempt continuously fails. In other words, four unsuccessful attempts
at a single retry create this error. Four unsuccessful retry attempts across
numerous packets with numerous intermediate acknowledgements do
not.
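The rollover behavior can be sketched as a small counter class; the names are illustrative, and the return strings merely label the actions described above.

```python
class ReplayMonitor:
    """The 2-bit replay number counter described above: four consecutive
    failed retries of the same unacknowledged data force a link retrain."""
    def __init__(self):
        self.replay_num = 0

    def retry(self):
        self.replay_num = (self.replay_num + 1) % 4
        if self.replay_num == 0:           # rolled over from 11b to 00b
            return "signal Physical Layer to retrain link"
        return "retransmit retry buffer"

    def acknowledged(self):
        self.replay_num = 0                # reset on a successful Ack
```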
If the replay counter does not roll over, then the Data Link Layer proceeds with a normal retry attempt. It blocks acceptance of any new outgoing TLPs from its Transaction Layer and completes the transmission of
any TLPs currently in transmission. The Data Link Layer then retransmits
all unacknowledged TLPs. It begins with the oldest unacknowledged TLP
and retransmits in the same order as the original transmission. Once all
unacknowledged TLPs have been retransmitted, the Data Link Layer resumes normal operation and once again accepts outgoing TLPs from its
Transaction Layer.
126
During the retry attempt, the Data Link Layer still needs to accept incoming TLPs and DLLPs. If the layer receives an Ack or Nak DLLP during
the retry attempt it must be properly processed. If this occurs, the
transmitter may fully complete the retry attempt or may skip the retransmission of any newly acknowledged TLPs. However, once the Data
Link Layer starts to retransmit a TLP it must complete the transmission of
that TLP. For example, imagine the transmitter has sequence numbers #5 through #8 sitting unacknowledged in its retry buffer and initiates a retry attempt.
The transmitter starts to retransmit all four TLPs, beginning with sequence number #5. If, during the retransmission of TLP #5, the transmitter receives an Ack associated with sequence number #7, it must
complete the retransmission of TLP #5. Depending on the implementation, the transmitter either continues with the retransmission of TLPs #6,
#7, and #8, or skips the newly acknowledged TLPs (that is, up through
#7) and continues retransmitting the remaining unacknowledged TLPs
(in this example, #8).
If the transmitter receives multiple Acks during a retry, it can collapse them into only the most recent. If in the previous example the
transmitter had seen separate individual Acks for #5, #6, and then #7, it
could discard the individual Acks for #5 and #6 and only process the Ack
for #7. Acknowledging #7 implies that all previous outstanding sequence
numbers (#5 and #6) are also acknowledged. Likewise, if, during retry,
the transmitter receives a Nak followed by an Ack with a later sequence
number, the Ack supersedes the Nak and that Nak is ignored.
DLLP Type Encodings
Encoding[7:0]    DLLP Type
0000 0000        Ack
0001 0000        Nak
0010 0000        PM_Enter_L1
0010 0001        PM_Enter_L23
0010 0011        PM_Active_State_Request_L1
0010 0100        PM_Request_Ack
0011 0000        Vendor Specific
0100 0v2v1v0     InitFC1-P
0101 0v2v1v0     InitFC1-NP
0110 0v2v1v0     InitFC1-Cpl
1100 0v2v1v0     InitFC2-P
1101 0v2v1v0     InitFC2-NP
1110 0v2v1v0     InitFC2-Cpl
1000 0v2v1v0     UpdateFC-P
1001 0v2v1v0     UpdateFC-NP
1010 0v2v1v0     UpdateFC-Cpl
All others       Reserved
Figure 7.5 Ack/Nak DLLP format (Ack/Nak type, Ack/Nak_Seq_Num, 16-bit CRC)
Figure 7.6 Flow control DLLP format (P/NP/Cpl type, VC ID, HdrFC and DataFC credit fields, 16-bit CRC)
Figure 7.7 Power management DLLP format (type 0010 0xxx, 16-bit CRC)
Processing a DLLP
The Physical Layer passes the received DLLP up to its Data Link Layer. If
the Physical Layer indicates a receiver error, it (and not the Data Link
Layer) reports the error condition. In this situation, the Data Link Layer
discards that DLLP. If the Physical Layer does not indicate a receiver error, the Data Link Layer calculates the CRC for the incoming DLLP. The
Data Link Layer then checks to see if the calculated value matches the
CRC attached to that DLLP. If the CRCs check out, the DLLP is processed.
In the event that the CRCs do not match, the DLLP is discarded and an
error is reported. This flow can be seen in Figure 7.8. Please note that
neither device expects to retry a DLLP. As such, DLLPs are not placed
into the retry buffer.
Figure 7.8 DLLP Processing Flowchart
Figure 7.9 Data Link Control and Management State Machine (Reset, DL_Inactive, DL_Init, DL_Active)
The DL_Inactive state is the initial state following a reset event. Upon
entry into this state, all Data Link Layer state information resets to default
values. Additionally, the Data Link Layer purges any entries in the retry
buffer. While in this state, the Data Link Layer reports DL_Down to the
Transaction Layer. This causes the Transaction Layer to discard any outstanding transactions and cease any attempts to transmit TLPs. This is just as well, because while in this state the Data Link Layer does not accept any TLPs from either the Transaction or the Physical Layer. The Data Link
Layer also does not generate or accept any DLLPs while in the Inactive
state. The state machine proceeds to the Init state if two conditions are
met: the Transaction Layer indicates the link is not disabled by software,
and the Physical Layer reports that the link is up (Physical LinkUp = 1).
The DL_Init state takes care of flow control initialization for the default virtual channel. While in this state, the Data Link Layer initializes the
default virtual channel according to the methods outlined in Chapter 9.
The DL status output changes during this state. It reports out DL_Down
while in FC_Init1 and switches over to DL_Up when it gets to FC_Init2.
The state machine proceeds to the Active state if FC initialization completes successfully and the Physical Layer continues to report that the
Physical Link is up. If the Physical Layer does not continue to indicate the
link is up (Physical LinkUp = 0), the state machine will return to the
DL_Inactive state.
The DL_Active state is the normal operating state. The Data Link
Layer accepts and processes incoming and outgoing TLPs, and generates
and accepts DLLPs as described in this chapter. While in this state, the
Data Link Layer reports DL_Up. If the Physical Layer does not continue to
indicate the link is up (Physical LinkUp = 0), the state machine returns to
the DL_Inactive state.
Chapter
Physical Layer
Architecture
have been reached; however, to satisfy curiosity it can be noted that optical wires are a likely solution.
The Physical Layer contains all the necessary digital and analog circuits required to configure and maintain the link. Additionally, the Physical Layer could contain a phase locked loop (PLL) to provide the
necessary clocking for the internal state machines. Given the understanding that PCI Express supports data rates greater than 2.5 gigabits per second, the data rate detect mechanisms have been predefined to minimize
the changes to support future generations of PCI Express. Additionally,
the Physical Layer of PCI Express is organized to provide isolation of the
circuits and logic that need to be modified and/or tuned in order to support next generation speeds. As illustrated in Figure 8.1, the architectural
forethought of layering and isolating the Physical Layer eases the transition for upgrading the technology by allowing maximum reuse of the
upper layers.
Figure 8.1 PCI Express layering: Software, Transaction Layer, Data Link Layer, Physical Layer, and Mechanical (connectors, wire).
There are two key sub-blocks that make up the Physical Layer architecture: a logical sub-block and an electrical sub-block. Both sub-blocks
have dedicated transmit and receive paths that allow dual unidirectional
communication (also referred to as dual simplex) between two PCI Express devices. These sub-blocks ensure that data gets to and from its destination quickly and in good order, as shown in Figure 8.2.
Figure 8.2 The logical and electrical sub-blocks of two linked devices, each with differential transmit (TX+/TX-) and receive (RX+/RX-) pairs joined through AC coupling capacitors (CAP).
Logical Sub-Block
The logical sub-block is the key decision maker for the Physical Layer. As
mentioned above, the logical sub-block has separate transmit and receive
paths, referred to hereafter as the transmit unit and receive unit. Both
units are capable of operating independently of one another.
The primary function of the transmit unit is to prepare data link
packets received from the Data Link Layer for transmission. This process
involves three primary stages: data scrambling, 8-bit/10-bit encoding, and
packet framing. The receive unit functions similarly to the transmit unit,
but in reverse. The receive unit takes the deserialized physical packet
taken off the wire by the electrical sub-block, removes the framing, decodes it, and finally descrambles it. Figure 8.3 describes each of these stages along with the benefit each provides.
Figure 8.3 Transmit and receive unit stages: the transmit unit scrambles each packet, applies 8-bit/10-bit encoding, and frames it; the receive unit removes the framing, performs 8-bit/10-bit decoding, and de-scrambles the packet.
Data Scrambling
PCI Express employs a technique called data scrambling to reduce the
possibility of electrical resonances on the link. Electrical resonances can
cause unwanted effects such as data corruption and in some cases circuit
damage, due to electrical overstresses caused by large concentrations of
voltage. Since electrical resonances are somewhat difficult to predict, the
simplest solution is usually to prevent conditions that can cause electrical
resonances. Most electrical resonance conditions are caused by repeated data patterns at the system's preferred frequency. The preferred frequency of a system depends on many factors, which are beyond the scope of this book; it is worth noting, however, that very few systems share the same preferred frequency. To avoid repeated data patterns the PCI Express specification defines a scrambling/descrambling algorithm that is implemented using a linear feedback shift register (LFSR). PCI Express accomplishes scrambling or descrambling by performing a serial XOR operation on the data with the seed output of an LFSR that is synchronized between PCI Express devices. Scrambling is enabled by default; however, it can be disabled for diagnostic purposes.
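A minimal sketch of this LFSR-based scrambling follows. The polynomial x^16 + x^5 + x^4 + x^3 + 1 and the all-ones seed match the PCI Express scrambler, but bit ordering and the special handling of control characters are simplified here for illustration.

```python
# Serial LFSR scrambler sketch: XOR each data bit with the LFSR output,
# advancing the LFSR one shift per bit.
def scramble(data, seed=0xFFFF):
    lfsr = seed
    out = []
    for byte in data:
        scrambled = 0
        for i in range(8):
            feedback = (lfsr >> 15) & 1            # serial LFSR output bit
            bit = (byte >> i) & 1
            scrambled |= (bit ^ feedback) << i     # XOR data with LFSR stream
            # Galois-style advance for x^16 + x^5 + x^4 + x^3 + 1
            lfsr = ((lfsr << 1) & 0xFFFF) | feedback
            if feedback:
                lfsr ^= 0x0038                     # taps at x^3, x^4, x^5
        out.append(scrambled)
    return bytes(out)
```

Because the LFSR stream depends only on the seed, descrambling is the very same XOR operation performed with a synchronized LFSR, which is why the two devices only need to agree on the seed.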
8-Bit/10-Bit Encoding
The primary purpose of 8-bit/10-bit encoding is to embed a clock signal
into the data stream. By embedding a clock into the data, this encoding
scheme renders external clock signals unnecessary. An investigation of
parallel multi-drop bus technologies, like conventional PCI, has shown
that as clock frequencies increase, the length-matching requirements become increasingly stringent. The dependency of a group of signals on a single clock source severely reduces setup and hold margins in a particular data transaction. Take for example two data lines named Data Line 1 and Data Line 2 that are both referenced to a high-speed clock signal called Data Clock. At the transmitting source both Data Line 1 and Data Line 2 have signals placed on the bus at the same instant in reference to Data Clock. However, due to a slight mismatch in interconnect length between Data Line 1, Data Line 2, and Data Clock, they all reach the receiving device at slightly different times. Since the receiving device samples the data based on the reception of the Data Clock, the overall margin may be reduced significantly, or the wrong data may be clocked into the receiving device if the mismatch is bad enough. As bus frequencies increase, the amount of allowable mismatch decreases or essentially
becomes zero. For an illustration of this concept see Figure 8.4.
Figure 8.4 Setup and hold margins as seen at the transmitter and at the receiver: Data Line 1 and Data Line 2 leave the transmitter aligned to Data Clock, but interconnect mismatch skews them at the receiver, shrinking the setup and hold windows.
Because PCI Express embeds a clock into the data, setup and hold
times are not compromised due to length mismatch between individual
PCI Express lanes within a link.
The concept of 8-bit/10-bit encoding is not something new that is unique to PCI Express. This data encoding concept was actually patented by IBM and used in Fibre Channel to increase data transfer lengths and rates. Since then it has also been adopted and used in Serial ATA and Gigabit Ethernet because of the benefits that can be had through its adoption.
Table 8.1 Special Character Codes

Special Character   Special Character                Special Character   Bits
Code                Name                             Value               (HGFEDCBA)
K28.0               Skip                             1C                  000 11100
K28.1               Fast Training Sequence           3C                  001 11100
K28.2               Start DLLP                       5C                  010 11100
K28.3               Idle                             7C                  011 11100
K28.4               Reserved                         9C                  100 11100
K28.5               Comma                            BC                  101 11100
K28.6               Reserved                         DC                  110 11100
K28.7               Reserved                         FC                  111 11100
K23.7               Pad                              F7                  111 10111
K27.7               Start Transaction Layer Packet   FB                  111 11011
K29.7               End                              FD                  111 11101
K30.7               End Bad                          FE                  111 11110
The process of 8-bit/10-bit encoding adds twenty-five percent more overhead to the system through the addition of two extra bits. However,
many side benefits make this additional overhead tolerable. Each of these
benefits, as they relate to PCI Express, is described briefly here.
Benefit 1: Embedded Clocking. The process of 8-bit/10-bit encoding embeds a clock signal into the data stream by guaranteeing multiple bit-level transitions within a particular character. Bit-level clock synchronization is achieved at the receive side with every bit-level transition. From this perspective it is desirable to have as many bit transitions as possible to ensure the best possible synchronization between devices. To illustrate this concept, consider the data byte value 0x00. Without 8-bit/10-bit encoding, eight bits could conceivably be sent without giving the receiving device a chance to synchronize, which could result in data corruption if the receiving device sampled a bit at the wrong time.
By definition, 8-bit/10-bit encoding allows at most five bits of the
same polarity to be transmitted before a bit-level transition occurs. Recall
that a byte value represented by bits HGFEDCBA is broken into two separate bit streams: a 3-bit stream HGF and a 5-bit stream EDCBA. Each of these bit streams has a control variable appended to it to form a 4-bit stream and a 6-bit stream, respectively. Concatenating the 4-bit stream and the 6-bit stream forms a 10-bit symbol. As a result, bits HGF become JHGF, where J is a control variable, and bits EDCBA become IEDCBA, where I is a control variable. Figure 8.5 shows the 8-bit/10-bit encoded byte value 0x00.
Figure 8.5 The 8-bit/10-bit encoded representation of byte value 00h.
Figure 8.6 Routing comparison. The left side shows a cut-out of a routing example in which the traces are snaked to length-match them to the clock in order to guarantee data is sampled with the clock. The right side shows a PCI Express routing solution, which does not require length matching to a clock signal, thereby freeing up board space and simplifying the routing.
Benefit 2: Error Detection. A secondary benefit of 8-bit/10-bit encoding is that it provides a mechanism for error detection through the concept of running disparity. Running disparity keeps the difference between the number of transmitted 1s and 0s as close to zero as possible. This allows the receiving device to judge the health of each transmitted character by registering the effect the received character had on disparity.
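A sketch of such a check follows, under two simplifying assumptions stated here rather than drawn from the specification: each symbol is a 10-bit integer whose 1s/0s imbalance must be 0 or ±2, and the running disparity must sit at -1 or +1 after every symbol.

```python
# Running disparity checker sketch for a stream of 10-bit symbols.
def check_disparity(symbols, rd=-1):
    """Return the final running disparity, or raise on a disparity error."""
    for sym in symbols:
        ones = bin(sym & 0x3FF).count("1")
        delta = ones - (10 - ones)          # 1s minus 0s in this symbol
        if delta not in (0, 2, -2) or (rd + delta) not in (-1, 1):
            raise ValueError(f"disparity error at symbol {sym:#05x}")
        rd += delta
    return rd
```

A neutral symbol (five 1s) leaves the running disparity unchanged, a +2 symbol flips it from -1 to +1, and two +2 symbols in a row are flagged as an error, which is how a receiver can spot a corrupted character.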
Benefit 3: DC Balance. DC balancing is accomplished through running disparity. It is called out separately here to discuss the benefits of maintaining the balance of 1s and 0s from an electrical perspective rather than as an error-checking mechanism. Maintaining a proportionate number of 1s and 0s allows an individual data line to have an average DC voltage of approximately half of the logical threshold. This reduces the possibility of inter-symbol interference, which is the inability to switch from one logic level to the next because of system capacitive charging. Inter-symbol interference is discussed in more detail in the electrical sub-block section.
Packet Framing
In order to let the receiving device know where one packet starts and ends, identifying 10-bit special symbols are prepended and appended to a previously 8-bit/10-bit encoded data packet. The particular special symbols added to the data packet depend upon where the packet originated. In the case where the packet originated from the Transaction Layer, the special symbol Start TLP (encoding K27.7) is added to the front of the data packet. In the case where the packet originated from the Data Link Layer, the special symbol Start DLLP (encoding K28.2) is added to the beginning of the data packet. To end either a TLP or DLLP the special symbol END (encoding K29.7) is appended, as shown in Figure 8.7.
Figure 8.7 Packet framing. A Transaction Layer originated packet is framed as K27.7, the scrambled/encoded packet, then K29.7; a Data Link Layer originated packet is framed as K28.2, the scrambled/encoded packet, then K29.7.
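The framing step reduces to a very small operation, sketched here with symbol code names standing in for their 10-bit encodings; the function and constant names are this book's illustration, not specification identifiers.

```python
# Framing sketch: choose the start symbol by packet origin, append END.
STP = "K27.7"   # Start TLP
SDP = "K28.2"   # Start DLLP
END = "K29.7"   # End of packet

def frame(symbols, origin):
    """Frame a packet from the Transaction ('TLP') or Data Link ('DLLP') Layer."""
    start = STP if origin == "TLP" else SDP
    return [start] + list(symbols) + [END]
```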
Electrical Sub-Block
As the logical sub-block of the Physical Layer fulfills the role of the key decision maker, the electrical sub-block functions as the delivery mechanism for the physical architecture. The electrical sub-block contains
transmit and receive buffers that transform the data into electrical signals
that can be transmitted across the link. The electrical sub-block may also
contain the PLL circuitry, which provides internal clocks for the device.
The following paragraphs describe exactly how the signaling of PCI Express works and why, and what a PLL (Phase Locked Loop) actually does.
The concepts of AC coupling and de-emphasis are also discussed briefly.
Serial/Parallel Conversion
The transmit buffer in the electrical sub-block takes the encoded/packetized data from the logical sub-block and converts it into serial format. Once the data has been serialized it is then routed to an
associated lane for transmission across the link. On the receive side the
receivers deserialize the data and feed it back to the logical sub-block for
further processing.
Clock Extraction
In addition to the parallel-to-serial conversion described above, the receive buffer in the electrical sub-block is responsible for recovering the
link clock that has been embedded in the data. With every incoming bit
transition, the receive side PLL circuits are resynchronized to maintain bit
and symbol (10 bits) lock.
Lane-to-Lane De-Skew
The receive buffer in the electrical sub-block de-skews data from the
various lanes of the link prior to assembling the serial data into a parallel
data packet. This is necessary to compensate for the allowable 20 nanoseconds of lane-to-lane skew. Depending on the flight time characteristics
of a given transmission medium this could correlate to nearly 7 inches of
variance from lane to lane. The actual amount of skew the receive buffer
must compensate for is discovered during the training process for the
link.
Differential Signaling
PCI Express signaling differs considerably from the signaling technology
used in conventional PCI. Conventional PCI uses a parallel multi-drop
bus, which sends a signal across the wire at a given amplitude referenced
to the system ground. In order for that signal to be received properly, it
must reach its destination at a given time in reference to some external
clock line. In addition to this the signal must arrive at the destination
with a given amplitude in order to register at the receiver. For relatively
slow signals this type of signaling has worked quite well. However, as
signals are transmitted at very high frequencies over distances of 12
inches or more, the low pass filter effects of the common four-layer FR4
PC platform cause the electrical signals to become highly attenuated. In
many cases the attenuation is so great that a parallel multi-drop bus receiver
cannot detect the signal as valid. Electrically there are two options to
overcome this signal attenuation. One option is to shorten the length of
the transmission path in order to reduce signal attenuation. In some cases
this is possible. However, in most cases it makes design extremely difficult, if not impossible. The other option is to use a different type of signaling technique that can help overcome the effects of attenuation.
PCI Express transmit and receive buffers are designed to convert the
logical data symbols into a differential signal. Differential signaling, as its
name might give away, is based on a relative difference between two different signals referred to as a differential pair. A differential pair is usually
signified by a positively notated signal and a negatively notated signal.
Logical bits are represented by the relative swing of the differential pair.
To illustrate how logical bits are represented electrically on a differential
pair, take the following example, as illustrated in Figure 8.8. A differential pair has a given voltage swing around 1 volt, which means the positively notated signal swings to +1 volt when representing a logical 1 and to -1 volt when representing a logical 0. The negatively notated signal likewise swings to -1 volt when representing a logical 1 and to +1 volt when representing a logical 0. The peak-to-peak difference between the differential pair is 2 volts in either case; the logical bit is determined by the direction in which the signals swing.
Figure 8.8 Signaling comparison: parallel multi-drop signaling is referenced to a 0V system ground, while differential signaling carries data on a D+/D- pair referenced to 0V or some common-mode DC voltage.
AC Coupling
PCI Express uses AC coupling on the transmit side of the differential pair
to eliminate the DC Common Mode element. By removing the DC Common Mode element, the buffer design process for PCI Express becomes
much simpler. Each PCI Express device can have a unique DC Common
Mode voltage element, which is used during the detection process. The
link AC coupling removes the common mode element from view of the
receiving device. The range of AC capacitance that is permissible by the
PCI Express specification is 75 to 200 nanofarads.
De-Emphasis
PCI Express utilizes a concept referred to as de-emphasis to reduce the
effects of inter-symbol interference. In order to best explain how deemphasis works it is important to understand what inter-symbol interference is. As frequencies increase, bit times decrease. As bit times decrease
the capacitive effects of the platform become much more apparent. Inter-symbol interference comes into play when bits change rapidly on a bus after being held constant for some time prior. Consider a differential bus that transmits five logical 1s in a row; this is the maximum number of same-bit transmissions allowable under 8-bit/10-bit encoding. Suppose that following the five logical 1s was a logical 0 followed by another logical 1. The transmission of the first five logical 1s charges the system capacitance formed by the layering process of the PCB stackup (a plate capacitor). When the system follows the five logical 1s with a logical 0 and then another logical 1, the system cannot discharge quickly enough to register the logical 0 before the next logical 1. The effect is inter-symbol interference, as shown in Figure 8.9.
Figure 8.9 Inter-Symbol Interference
In order to minimize the effect of inter-symbol interference, subsequent bits of the same polarity that are output in succession are de-emphasized. For PCI Express this translates into a 3-decibel reduction in power for each subsequent same-polarity bit, as shown in Figure 8.10. This does not mean that each and every bit continues to be reduced in power; the reduction applies only to the first subsequent bit, and further bits continue at the same de-emphasized strength. It may not be immediately apparent just how this technique helps to reduce the effects of inter-symbol interference, but this will soon become clear.
Figure 8.10 De-emphasis example, as seen at the transmit side.
Instead of thinking of each subsequent same-polarity bit as being reduced by 3 decibels, think of the first bit as being pre-emphasized. As discussed above, the difficulty with inter-symbol interference is the inability to overcome the capacitive discharging of the system quickly enough to reach the electrical thresholds that register a logical event. By pre-emphasizing the first bit transition, or as the PCI Express specification defines it, by de-emphasizing subsequent same-polarity bit transmissions, the first bit transition is given extra drive strength.
Through this mechanism, bit transitions occurring after multiple same
polarity transmissions are given enough strength to overcompensate for
the capacitive effects of the system. See Table 8.2 for the DC characteristics of de-emphasis.
Table 8.2 De-Emphasis DC Characteristics

Differential output voltage (full swing)        800 mV minimum
Differential output voltage (de-emphasized)     505 mV minimum
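As a quick arithmetic check on Table 8.2 (assuming its two entries are the full-swing and de-emphasized minimum differential voltages), the ratio between the minimums works out to roughly 4 decibels, a little deeper than the nominal 3-decibel figure quoted above, since minimum values need not sit exactly at the nominal ratio:

```python
# Ratio of the de-emphasized minimum to the full-swing minimum, in dB.
import math

full_swing_mv = 800
deemphasized_mv = 505
ratio_db = 20 * math.log10(deemphasized_mv / full_swing_mv)
# ratio_db is approximately -4.0 dB
```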
Figure 8.11 Link training state machine: Detect (initial state), Polling, and Configuration.
Electrical Idle
Before describing the link configuration states it seems appropriate to define electrical idle, since it is referred to throughout the remainder of this chapter. Upon initial power-up the device enters the electrical idle state, a steady-state condition in which the transmit and receive voltages are held constant. The PCI Express specification defines constant as meaning that the differential pair lines have no more than 20 millivolts of difference between the pair after factoring out any DC common-mode element. The minimum time that a transmitter must remain in electrical idle is 20 nanoseconds; however, the transmitter must attempt to detect a receiving device within 100 milliseconds. Electrical idle is primarily used in power-saving modes and for common-mode voltage initialization.
Detect State
The first Physical Layer state that the PCI Express link enters into is the
detect state upon power-up. The detect state is also entered into upon a
link reset condition, a surprise removal of a device, or an exit from the
link disabled state. The detect state determines whether or not there is a
device connected on the other side of the link. The detection process
takes place in the progression through three sub-states called quiet, active, and charge.
Quiet Sub-State. During the quiet sub-state, four primary tasks are completed. The first task, completed by the electrical sub-block, is that the
transmitter in the downstream port (upstream device) begins driving its DC
Common Mode voltage while remaining in high impedance. The relationship between an upstream and a downstream port is shown in Figure 8.12.
Figure 8.12 Downstream Port (Upstream Component) and Upstream Port (Downstream Component). The PCI Express specification defines the upstream and downstream port relationship as follows: All ports on a root complex are downstream ports. The downstream device on a link is the device farther from the root complex. The port on a switch that is closest topologically to the root complex is the upstream port. The port on an endpoint device or bridge component is an upstream port. The upstream component on a link is the component closer to the root complex.
The downstream device next selects the data rate, which is always 2.5
gigahertz during link training even when PCI Express speeds go beyond
the immediate generation. Finally, the downstream device clears the
status of the linkup indicator to inform the system that a link connection
is not currently established. A register in the data link layer monitors the
linkup status. The system only remains in the quiet sub-state for 12 milliseconds before attempting to proceed to the next sub-state.
Active Sub-State. Primary detection is completed during the active substate. Detection is done by analyzing the effect that the upstream port
(downstream device) receiver loading has on the operating DC Common
Mode voltage output from the transmitter. If there is no upstream port
connected, the rate of change of the applied DC Common Mode voltage
is much faster than if a terminated upstream port receiver were sitting out on the link. The detection process is done on a per-lane basis. The downstream device holds the transmitter in high impedance to disable any lanes on the downstream port that are not connected. During the detection process the downstream port transmitter is always in high impedance, even when driving the operating DC common mode voltage for detection. If an upstream device is detected, the next state is the polling state. If no upstream device can be detected, the sub-state machine returns to the quiet state and waits 12 milliseconds before checking again for an upstream device.
Charge Sub-State. The final sub-state of the detect state is the charge
state. During this state the electrical sub-block of the downstream port
continues to drive the DC Common Mode voltage while remaining in a
high impedance electrical idle state. A timer is also set to count off 12
milliseconds. As soon as the DC Common Mode voltage is stable and within specification, or the 12-millisecond timer has timed out, the state machine transitions to the polling state. Figure 8.13 illustrates the detect
sub-state machine.
Figure 8.13 Detect sub-state machine: entry is into Detect.Quiet; a 12 ms timeout moves the machine to Detect.Active; if a receiver is detected it proceeds to Detect.Charge (12 ms charge) and exits to Polling; if no receiver is detected it returns to Detect.Quiet.
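The detect sub-state progression lends itself to a compact sketch. The timers are reduced to boolean inputs here, and the input names (`receiver_detected`, `timer_expired`, `charge_stable`) are illustrative rather than specification terms.

```python
# Detect sub-state machine sketch, following the progression above.
def detect_next(state, receiver_detected=False, timer_expired=False,
                charge_stable=False):
    """Return the next detect sub-state, or 'Polling' on exit."""
    if state == "Detect.Quiet":
        # Remain quiet until the 12 ms timer expires.
        return "Detect.Active" if timer_expired else state
    if state == "Detect.Active":
        # Primary detection: proceed on detect, otherwise go back and wait.
        return "Detect.Charge" if receiver_detected else "Detect.Quiet"
    if state == "Detect.Charge":
        # Exit once the DC common mode voltage is stable or 12 ms elapse.
        return "Polling" if (charge_stable or timer_expired) else state
    raise ValueError(f"unknown state: {state}")
```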
Polling State
The polling state is the first state where training instructions called training ordered sets are sent out on all the individual PCI Express lanes. PCI
Express currently defines two training ordered sets called TS1 and TS2.
There are not many differences between the two sets except for the indicator used to distinguish which training ordered set it actually is. TS1
ordered sets are used during the configuration process. Once all of the
lanes of the link are trained, TS2 ordered sets are used to mark a successful training. During the polling state TS1 ordered sets are used to establish bit and symbol lock, to determine whether a single or multiple links
should be formed, and to select the data rate for the link.
Training ordered sets are nothing more than a group of sixteen 8-bit/10-bit encoded special and data symbols. Training ordered sets are never scrambled. These training instructions are used to establish the link data rate,
establish clock synchronization down to the bit level, and check lane polarity. Table 8.3 shows the training ordered set that is sent out during the
polling state.
Table 8.3 Training Ordered Set (TS1)

Symbol   Encoded Values                    Description
0        K28.5                             Comma
1        0-255 (D0.0-D31.7) or K23.7       Link number (PAD if unassigned)
2        0-31 or K23.7                     Lane number (PAD if unassigned)
3        0-255                             Number of fast training sequences
4        D2.0                              Data rate identifier
5        Bit 0 = 0, 1; Bit 1 = 0, 1;       Training control
         Bit 2 = 0, 1; Bit 3 = 0, 1;
         Bits 4:7 = 0
6-15     D10.2                             TS1 Identifier

Note: The TS2 Training Ordered Set is exactly the same as the TS1 Training Ordered Set with one exception. In the place of symbols 6-15 is the TS2 encoded value D5.2.
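A sketch of building such an ordered set follows, mirroring the symbol layout of Table 8.3. Symbols are represented by their code names, with PAD (K23.7) standing in for an unassigned link or lane number; the function name and defaults are this book's illustration.

```python
# Build a 16-symbol TS1/TS2 training ordered set per Table 8.3.
def training_set(identifier="TS1", link="K23.7", lane="K23.7",
                 n_fts=0, control=0):
    ident = "D10.2" if identifier == "TS1" else "D5.2"
    return (["K28.5",            # symbol 0: comma
             link,               # symbol 1: link number or PAD
             lane,               # symbol 2: lane number or PAD
             n_fts,              # symbol 3: number of fast training sequences
             "D2.0",             # symbol 4: data rate identifier (2.5 Gbit/s)
             control]            # symbol 5: training control bits
            + [ident] * 10)      # symbols 6-15: TS identifier
```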
Similar in concept to the detect state, the polling state has five defined sub-states that are used in the link training process. The polling sub-states are referred to as quiet, active, configuration, speed, and compliance. A short description of the transitions in and out of these sub-states follows.
Quiet Sub-State. The first polling sub-state that is entered into upon the
completion of the detect state is the quiet sub-state. Upon entry into this
sub-state the 12-millisecond countdown timer is set. During the quiet
sub-state each downstream port receiver is looking for a training ordered
set or its complement. The polling state responds to either a TS1 or TS2
ordered training set by progressing to the next sub-state. As mentioned
above, the receiver also responds to the complement of either training ordered set.
Figure 8.14 Logical Inversion. One PCI Express device transmits 0101... on its differential pair; with D+ and D- swapped on the wire, the receiving device sees 1010... and logically inverts it. The ability to perform a logical inversion on incoming signals due to polarity inversion of the differential pair gives the designer extra freedom in design in cases where it would otherwise be necessary to bow-tie the signals.
Figure 8.15 Polling sub-state machine (Polling.Quiet, Polling.Active, Polling.Compliance, Polling.Speed, Polling.Configuration). Transitions are driven by received TSx ordered sets and a 2 ms timeout; if no TSx set is received the machine enters Polling.Compliance, and successful training exits to Configuration.
Configuration State
The configuration state establishes link width and lane ordering. Prior to
this state, bit and symbol lock should have been established, link data
rate determined, and polarity corrections made on incoming data if necessary. Within the configuration state there are two sub-states, rcvrcfg
and idle.
Rcvrcfg Sub-State. During the rcvrcfg sub-state, link width and lane ordering are established. For links wider than a x1 configuration the receiver must compensate for the 20 nanoseconds of lane-to-lane skew that the PCI Express Specification Revision 1.0 allows between the lanes that form the link. TS1 training ordered sets transmitted to the upstream port contain link numbers assigned by the downstream port, as shown in Figure 8.16. If the downstream port is capable of forming two individual links, it sends out two separate link numbers, N and N+1. The upstream port responds by sending the desired link number in the appropriate symbol of the training ordered set, as shown in Table 8.3; that is, the upstream port establishes a link number by sending a TS1 training ordered set to the downstream port with the preferred link number in place of the special character K23.7 (PAD). If the link number is not established within 2 milliseconds, the state machine returns to the polling state. Until the link number is established or the 2-millisecond timer times out, the downstream port continues to broadcast its link numbering preference.
Figure 8.16 Link Training. TS1 training ordered sets are exchanged on Lane 0 and Lane 1; each set carries Link Number = N, Lane Number = K23.7 (PAD), Data Rate = D2.0, and the TS identifier in symbols 6-15.
Figure 8.17 Configuration sub-state machine (Config.RcvCfg, Config.Idle): Config.RcvCfg proceeds to Config.Idle once the link is configured; after 8 idle symbols the state machine exits, while a millisecond-scale timeout or a link error forces an exit as well.
Surprise Insertion/Removal
PCI Express physical architecture is designed with ease of use in mind.
To support this concept PCI Express has the built-in ability to handle surprise insertion and removal of PCI Express devices. All transmitters and receivers must support surprise hot insertion/removal without damage to the device. The transmitter and receiver must also be capable of withstanding a sustained short circuit to ground on the differential inputs/outputs D+ and D-.
A PCI Express device can assume the form of an add-in card, module,
or a soldered-down device on a PC platform. In the case of an add-in card
or cartridge, PCI Express allows a user to insert or remove a device (an
upstream port) while the system is powered. This does not mean that
there is nothing more required to support surprise insertion. The key objective is to identify that a mechanism exists to check for the presence of
a device on a link.
Surprise Insertion
A broken link that is missing an upstream port causes the downstream
device to remain in the detect state. Every 12 milliseconds the downstream port checks the link to see whether or not any upstream ports
have been connected. As soon as a user inserts a device into the system it
is detected and the link training process as previously described begins.
Surprise Removal
If a PCI Express device is removed from the system during normal operation, the downstream port receivers detect an electrical idle condition (a
loss of activity). Because the electrical idle condition was not preceded
by the electrical idle ordered set, the link changes to the detect state.
while the receive path to the downstream port could remain in the fully
functional L0 state. Because the link will likely transition into and out of
this state often, the latencies associated with coming in and out of this
state must be relatively small (a maximum of several microseconds). During this state the transmitter continues to drive the DC common mode voltage, and all on-chip device clocks (PLL clocks and so on) continue to run.
Figure 8.18 Link power-state transitions among L0 (normal operation), L0s, L1, and Recovery.
At 2.5 gigabits per second each bit time is 1 / (2.5 x 10^9) = 0.4 ns, so 20 bit times correspond to 0.4 ns x 20 = 8 ns.
To exit the L0s state the transmitter must begin sending out Fast
Training Sequences to the receiver. A Fast Training Sequence is an ordered set composed of one K28.5 (COM) special character and three
K28.1 special characters. The fast training sequences are used to resynchronize the bit and symbol times of the link in question. The exit latency from this state depends upon the amount of time it takes the
receiving device to acquire bit and symbol synchronization. If the receiver is unable to obtain bit and symbol lock from the Fast Training Sequence the link must enter a recovery state where the link can be
reconfigured if necessary.
Chapter

Flow Control

This chapter goes into the details of the various flow control mechanisms within PCI Express. It begins with a description of the ordering
requirements for the various transaction types. The rest of the chapter
then deals with some of the newer flow control policies that PCI Express
uses: virtual channels, traffic classes, as well as flow control credits. Following that, the chapter briefly describes how these flow control
mechanisms are used to support isochronous data streams.
Transaction Ordering
The PCI Express specification defines several ordering rules to govern
which types of transactions are allowed to pass or be passed. Passing occurs when a newer transaction bypasses a previously issued transaction and the device executes the newer transaction first. The ordering rules apply uniformly to all transaction types (memory, I/O, configuration, and messages) but only within a given traffic class. There are no ordering rules between transactions with different traffic classes. It follows that there are no ordering rules between different virtual channels, since
Ordering Rules

Each row is a subsequent transaction; each column is a previously issued transaction. Yes = the subsequent transaction must be allowed to pass; No = it must not pass; Y/N = it may pass or be blocked. Where a) and b) entries appear, the governing condition is described in the text below.

                                      Posted Request         Non-Posted Request                     Completion
Subsequent transaction                Memory Write or        Read          I/O or Config     Read              I/O or Config
                                      Message Request (1)    Request (2)   Write Request (3) Completion (4)    Write Completion (5)
Memory Write or
Message Request (A)                   a) No  b) Y/N          Yes           Yes               a) Y/N  b) Yes    a) Y/N  b) Yes
Read Request (B)                      No                     Y/N           Y/N               Y/N               Y/N
I/O or Config Write Request (C)       No                     Y/N           Y/N               Y/N               Y/N
Read Completion (D)                   a) No  b) Y/N          Yes           Yes               a) Y/N  b) No     Y/N
I/O or Config Write Completion (E)    Y/N                    Yes           Yes               Y/N               Y/N
If the relaxed ordering bit (bit 5 of byte 2 in the TLP header) contains a value of 0, then the second transaction is not permitted to bypass the previously submitted request (A1a). If that bit is set to 1, then the subsequent transaction is permitted to bypass the previous transaction
(A1b). A memory write or message request must be allowed to pass read
requests (A2) as well as I/O or configuration write requests (A3) in order
to avoid deadlock. The ordering rules between memory write or message
requests and completion packets depend on the type of PCI Express device. Endpoints, switches, and root complexes may allow memory write
or message requests to pass or be blocked by completions (A4a and A5a).
PCI Express to PCI or PCI-X bridges, on the other hand, must allow
memory write or message requests to pass completions in order to avoid
deadlock (A4b and A5b). This scenario only occurs for traffic flowing
from the upstream (PCI Express) side of the bridge to the downstream
(PCI or PCI-X) side of the bridge.
A subsequent non-posted request (any read request or an I/O or configuration write request) interacts with previous transactions in the following way. As seen in cells B1 and C1, these requests are not allowed to
pass previously issued memory write or message requests. Non-posted
requests may pass or be blocked by all other transaction types (B2, B3,
B4, B5, C2, C3, C4, C5).
A subsequent read completion interacts with previous transactions as
follows. As seen in cell D1, there are two potential ordering rules when
determining if a read completion can pass a previously issued memory
write or message request. If the relaxed ordering bit (bit 5 of byte 2 in
the TLP header) contains a value of 0, then the read completion is not
permitted to bypass the previously submitted request (D1a). If that bit is
set to a 1, then the read completion may bypass the previously enqueued
transaction (D1b). A read completion must be allowed to pass read requests (D2) as well as I/O or configuration write requests (D3) in order
to avoid deadlock. Read completions from different read requests are
treated in a similar fashion to I/O or configuration write completions. In
either case (D4a or D5), the subsequent read completion may pass or be
blocked by the previous completion transaction. Recall, however, that a
single completion may be split up amongst several completion packets.
In this scenario, a subsequent read completion packet is not allowed to
pass a previously enqueued read completion packet for that same request/completion (D4b). This is done in order to ensure that read completions return in the proper order.
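These passing rules lend themselves to a table lookup. The sketch below is our own illustration; the names, flags, and return strings are invented for readability, and the specification's table remains the authoritative source for each cell:

```python
# Hypothetical helper (ours, not from the book): within one traffic class,
# may a later transaction pass an earlier one?
# Returns "yes" (must be allowed), "no" (must not pass), or "y/n"
# (implementation may choose either).

POSTED = "posted"            # memory write or message request
READ_REQ = "read_request"
IO_CFG_WR_REQ = "io_cfg_write_request"
READ_CPL = "read_completion"
IO_CFG_WR_CPL = "io_cfg_write_completion"

def may_pass(later, earlier, relaxed=False, same_completion=False,
             pci_bridge=False):
    if earlier == POSTED:                       # column 1
        if later == POSTED:                     # cell A1
            return "y/n" if relaxed else "no"
        if later in (READ_REQ, IO_CFG_WR_REQ):  # cells B1, C1
            return "no"
        if later == READ_CPL:                   # cell D1
            return "y/n" if relaxed else "no"
        return "y/n"                            # cell E1
    if later == POSTED:                         # must pass to avoid deadlock
        if earlier in (READ_REQ, IO_CFG_WR_REQ):
            return "yes"                        # cells A2, A3
        return "yes" if pci_bridge else "y/n"   # cells A4, A5
    if later == READ_CPL:
        if earlier in (READ_REQ, IO_CFG_WR_REQ):
            return "yes"                        # cells D2, D3
        if earlier == READ_CPL and same_completion:
            return "no"                         # cell D4b: one split completion stays in order
        return "y/n"                            # cells D4a, D5
    if later == IO_CFG_WR_CPL and earlier in (READ_REQ, IO_CFG_WR_REQ):
        return "yes"                            # cells E2, E3
    return "y/n"                                # cells B2-B5, C2-C5, E4, E5
```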
A subsequent I/O or configuration write completion interacts with
previous transactions as follows. As seen in cell E1, these completions may pass or be blocked by previously issued memory write or message requests.
PCI-X flow control works. Additionally, once a car gains access to the
road, it needs to determine how fast it can go. If there is a lot of traffic already on the road, the driver may need to throttle his or her advancement to keep from colliding with other cars. PCI
accomplishes this through signals such as IRDY# and TRDY#.
Now consider that the road is changed into a highway with four lanes
in both directions. This highway has a carpool lane that allows carpoolers
an easier path to travel during rush hour traffic congestion. There are also
fast lanes for swifter moving traffic and slow lanes for big trucks and
other slow moving traffic. Drivers can use different lanes in either direction to get to a particular destination. Each driver occupies a lane based
upon the type of driver he or she is. Carpoolers take the carpool lane
while fast drivers and slow drivers occupy the fast and slow lanes respectively. This highway example represents the PCI Express flow control
model. Providing additional lanes of traffic increases the total number of
cars or bandwidth that can be supported. Additionally, dividing up that
bandwidth based on traffic class (carpoolers versus slow trucks) allows
certain packets to be prioritized over others during high traffic times.
PCI Express does not have the same sideband signals (IRDY#,
TRDY#, RBF#, WBF#, and so on) that PCI or AGP have in order to implement this sort of flow control model. Instead, PCI Express uses a flow
control credit model. Data Link Layer Packets (DLLPs) are exchanged between link mates indicating how much free space is available for various
types of traffic. This information is exchanged at initialization, and then
updated throughout the active time of the link. The exchange of this information allows the transmitter to know how much traffic it can allow
on to the link, and when the transmitter needs to throttle that traffic to
avoid an overflow condition at the receiver.
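As a rough illustration of this credit model, consider the following sketch. It is our own simplification with invented names; real hardware tracks credits separately per credit type and per virtual channel, using modulo arithmetic on cumulative counters:

```python
# Illustrative transmitter-side credit gate (ours, not spec code): the
# transmitter tracks what its link mate has advertised and refuses to
# send a packet the receiver has no room for.

class CreditGate:
    def __init__(self):
        self.credit_limit = 0      # cumulative credits advertised by the receiver
        self.credits_consumed = 0  # cumulative credits used by our transmissions

    def on_init_or_update_fc(self, advertised_limit):
        # InitFC/UpdateFC DLLPs carry the receiver's cumulative limit.
        self.credit_limit = advertised_limit

    def try_send(self, cost):
        # Gate transmission: send only if the receiver has room.
        if self.credits_consumed + cost <= self.credit_limit:
            self.credits_consumed += cost
            return True
        return False               # throttle until an UpdateFC raises the limit

gate = CreditGate()
gate.on_init_or_update_fc(4)       # receiver advertises 4 credits
assert gate.try_send(3) is True
assert gate.try_send(2) is False   # would overflow the receiver
gate.on_init_or_update_fc(8)       # UpdateFC: buffer space was freed
assert gate.try_send(2) is True
```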
System traffic is broken down into a variety of traffic classes (TCs). In
the traffic example above, the traffic classes would consist of carpoolers,
fast drivers, and slow drivers. PCI Express supports up to eight different
traffic classes. Each traffic class can be assigned to a separate virtual channel (VC), which means that there can be at most eight virtual channels.
Support for traffic classes and virtual channels beyond the defaults (TC0
and VC0) is optional. Each supported traffic class is assigned to a supported virtual channel for flow control purposes. TC0 is always associated with VC0, but beyond that, traffic class to virtual channel mapping is
flexible and device-dependent. Although each traffic class may be
mapped to a unique virtual channel, this is not a requirement. Multiple
traffic classes can share a single virtual channel, but multiple virtual
channels cannot share a single traffic class. A traffic class may only be assigned to a single virtual channel.
[Figure 9.1: Traffic class to virtual channel mapping examples at a root complex. One function maps TC[0:1] to VC0 and TC7 to VC3 across its link; another maps TC[0:1] to VC0, TC[2:4] to VC1, TC[5:6] to VC2, and TC7 to VC3.]
[Figure 9.2: Default initialization on a x1 link between Device A and Device B. Both ports map TC0 to VC0 (VC ID 0) by default; additional virtual channel resources remain unconfigured (VC = ?).]
[Figure 9.3: Example mapping beyond the default on a x1 link: TC0-6 map to VC0 and TC7 maps to VC1; queues identified with a VC ID of x are unassigned.]
[Table: example traffic class to virtual channel associations for ports supporting only VC0 and for ports supporting VC0 and VC1.]
Again, these are example associations and not the only possible traffic
class to virtual channel associations in these configurations.
There are several additional traffic class/virtual channel configuration
details to make note of. As seen in Figure 9.2, all ports support VC0 and
map TC0 to that virtual channel by default. This allows traffic to flow
across the link without (or prior to) any VC-specific hardware or software
configuration. Secondly, implementations may adjust their buffering per
virtual channel based on implementation-specific policies. For example,
in Figure 9.3, the queues or buffers in Device A that are identified with a
VC ID of x may be reassigned to provide additional buffering for VC0 or
VC1, or they may be left unassigned and unused.
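The mapping rules above can be checked mechanically. The helper below is an illustration of our own, not anything defined by the specification:

```python
# Illustrative check (ours): every mapped traffic class goes to exactly
# one virtual channel, TC0 must map to VC0, and several traffic classes
# may share one virtual channel.

def valid_tc_vc_map(mapping):
    """mapping: dict of traffic class number -> virtual channel number."""
    if mapping.get(0) != 0:      # TC0 is always associated with VC0
        return False
    # A dict already guarantees each TC maps to a single VC; sharing a
    # VC among multiple TCs is allowed, so no further check is needed.
    return all(0 <= tc <= 7 and 0 <= vc <= 7 for tc, vc in mapping.items())

# The Figure 9.1 style mapping: TC0-TC1 -> VC0, TC2-TC4 -> VC1,
# TC5-TC6 -> VC2, TC7 -> VC3.
example = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3}
assert valid_tc_vc_map(example)
assert not valid_tc_vc_map({0: 1})   # TC0 may not move off VC0
```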
Flow Control
PCI Express enacts flow control (FC) mechanisms to prevent receiver
buffer overflow and to enable compliance with the ordering rules outlined previously. Flow control is done on a per-link basis, managing the
traffic between a device and its link mate. Flow control mechanisms do
not manage traffic on an end-to-end basis, as shown in Figure 9.4.
[Figure 9.4: Flow Control. Flow control operates link by link: between the root complex and the switch, and between the switch and the PCI Express endpoint, not end to end.]
If the root complex issues a request packet destined for the PCI Express endpoint, it transmits that packet across the outgoing portion of its
link to the switch. The switch then sends that packet across its downstream port to the endpoint. The flow control mechanisms that PCI Express implements, however, are link-specific. The flow control block in
the root complex only deals with managing the traffic between the root
complex and the switch. The downstream portion of the switch and the
endpoint then manage the flow control for that packet between the
switch and the endpoint. There are no flow control mechanisms in the
root complex that track that packet all the way down to the endpoint.
Link mates share flow control details to ensure that no device transmits a packet that its link mate is unable to accept. Each device indicates
how many flow control credits it has available for use. If the next packet
allocated for transmission exceeds the available credits at the receiver,
that packet cannot be transmitted. Within a given link, each virtual channel maintains its own flow control credit pool.
As mentioned in Chapter 7, DLLPs carry flow control details between
link mates. These DLLPs may initialize or update the various flow control
credit pools used by a link. Though the flow control packets are DLLPs
and not TLPs, the actual flow control procedures are a function of the
Transaction Layer in cooperation with the Data Link Layer. The Transaction Layer performs flow control accounting for received TLPs and gates
outgoing TLPs if, as mentioned previously, they exceed the credits available. The flow control mechanisms are independent of the data integrity
mechanisms of the Data Link Layer (that is to say that the flow control
logic does not know if the Data Link Layer was forced to retry a given
TLP).
TLP                                     Credits Consumed
Memory, I/O, or Config Read Request     1 NPH
Memory Write Request                    1 PH + n PD
Message Request (without data)          1 PH
Message Request (with data)             1 PH + n PD
Memory Read Completion                  1 CplH + n CplD
I/O or Config Read Completion           1 CplH + 1 CplD
I/O or Config Write Completion          1 CplH
The n units used for the data credits come from rounding up the data length by 16 bytes. For example, a memory read completion with a data length of 10 DWords (40 bytes) uses 1 CplH unit and 3 (40/16 = 2.5, which rounds up to 3) CplD units. Please note that there are no credits, and hence no flow control processes, for DLLPs. The receiver must therefore process these packets at the rate that they arrive.
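The 16-byte rounding rule can be captured in a couple of lines (our own helper, not spec code):

```python
import math

def data_credits(length_bytes):
    # One data credit covers 16 bytes; partial units round up.
    return math.ceil(length_bytes / 16)

def read_completion_credits(data_dwords):
    # A memory read completion consumes 1 CplH plus n CplD units.
    return {"CplH": 1, "CplD": data_credits(data_dwords * 4)}

# The example above: 10 DWords = 40 bytes -> 40/16 = 2.5, rounds up to 3.
assert read_completion_credits(10) == {"CplH": 1, "CplD": 3}
```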
Each virtual channel has independent flow control, and thus maintains independent flow control pools (buffers) for PH, PD, NPH, NPD,
CplH, and CplD credits. Each device autonomously initializes the flow
control for its default virtual channel (VC0). As discussed in Chapter 7,
this is done during the DL_Init portion of the Data Link Layer state machine. The initialization procedures for other virtual channels' flow control are quite similar to that of VC0, except that VC0 undergoes
initialization by default (and before the link is considered active) while
other virtual channels undergo initialization after the link is active. Once
enabled by software, multiple virtual channels may progress through the
various stages of initialization simultaneously. They need not initialize in
numeric VC ID order (that is to say, VC1 initializes before VC2 initializes
before VC3, and so on) nor does one channel's initialization need to
complete before another can begin (aside from VC0, which must be initialized before the link is considered active). Additionally, since VC0 is
active prior to the initialization of any other virtual channels, there may
already be TLP traffic flowing across that virtual channel. Such traffic has
no direct impact on the initialization procedures for other virtual channels.
Tying this all together, this section has shown that arbitration for
bandwidth of a given link is dependent on several factors. Since multiple
virtual channels may be implemented across a single link, those virtual
channels must arbitrate for the right to transmit. VC-to-VC arbitration can
take several forms, but it is important that regardless of the policy, no virtual channel is locked out or starved for bandwidth. Once a virtual channel is set to transmit, it must arbitrate amongst its supported traffic
classes. If a virtual channel has only one traffic class assigned to it, that
arbitration is quite simple. If a virtual channel has numerous traffic
classes, there needs to be some arbitration policy to determine which
traffic class has priority. Like VC-to-VC arbitration, TC-to-TC arbitration
(within a virtual channel) can take several forms, with a similar priority of
ensuring that no traffic class is locked out or starved for bandwidth. Finally, once a specified traffic class is ready to transmit, the transaction
ordering rules from the beginning of the chapter are used to determine
which transaction should be transmitted. Transaction ordering only details the rules for traffic within a set traffic class, which is why this is the
last step in the arbitration process.
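The three-stage arbitration just described might be sketched as follows. This is purely illustrative: round-robin among virtual channels and lowest-numbered-TC-first within a channel are arbitrary choices of ours, since the specification permits several policies:

```python
# Illustrative two-stage arbitration (ours): pick a VC, then a TC within
# that VC; the ordering rules would then pick the actual transaction.

from itertools import cycle

def arbitrate(vcs, queues, rounds):
    """vcs: list of VC ids; queues: dict (vc, tc) -> list of packets.
    Returns packets in transmit order for `rounds` transmit slots."""
    vc_rr = cycle(vcs)
    out = []
    for _ in range(rounds):
        vc = next(vc_rr)                     # stage 1: VC-to-VC arbitration
        tcs = sorted(tc for (v, tc) in queues if v == vc and queues[(v, tc)])
        if not tcs:
            continue                         # nothing pending on this VC
        tc = tcs[0]                          # stage 2: TC-to-TC within the VC
        out.append(queues[(vc, tc)].pop(0))  # stage 3: ordering rules apply here
    return out

queues = {(0, 0): ["a", "b"], (1, 7): ["iso1"]}
assert arbitrate([0, 1], queues, 3) == ["a", "iso1", "b"]
```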
[Figure 9.5: FC_Init1 flowchart. Upon receiving an InitFC1 or InitFC2 DLLP for VCx, the device records the indicated FC unit value and sets the appointed flag; on a timer roll-over it retransmits its InitFC1 sequence; once all values are recorded it proceeds to FC_Init2.]
While in FC_Init1, the device must transmit InitFC1-P, then InitFC1-NP, and then InitFC1-Cpl; this sequence must progress in this order and must not be interrupted. While in this state for VC0, no other traffic is possible. As such,
this pattern should be retransmitted in this order continuously until exit
into FC_Init2. For other virtual channels, this pattern is not repeated continuously. Since other traffic may wish to use the link during nonzero virtual channel initialization, this pattern does not need to be repeated
continuously, but it does need to be repeated (uninterrupted) at least
every 17 microseconds.
While in this state, the FC logic also needs to process incoming
InitFC1 (and InitFC2) DLLPs. Upon receipt of an InitFC DLLP, the device
records the appropriate flow control unit value. Each InitFC packet contains a value for both the header units and data payload units. Once the
device has recorded values for all types (P, NP and Cpl, both header and
data) of credits for a given virtual channel, it sets a flag (FI1) to indicate
that the virtual channel has successfully completed FC_Init1. At this
point, InitFC1 packets are no longer transmitted, and the device proceeds to the FC_Init2 stage. Figure 9.6 shows the flowchart for flow control initialization state FC_Init2.
[Figure 9.6: FC_Init2 flowchart. Receipt of an InitFC2 or UpdateFC DLLP, or a TLP, for VCx sets the flag; on a timer roll-over the InitFC2 sequence is retransmitted; once the flag is set, flow control initialization ends.]
For all virtual channels, the entrance to FC_Init2 occurs after successful completion of the FC_Init1 stage. While in FC_Init2, the Transaction
Layer no longer needs to block transmission of TLPs that use that virtual
channel. While in this state, the device must first transmit InitFC2-P, then
InitFC2-NP, and then InitFC2-Cpl. This sequence must progress in this
order and must not be interrupted. While in this state for VC0, this pattern should be retransmitted in this order continuously until successful
completion of FC_Init2. For other virtual channels, this pattern is not repeated continuously, but is repeated uninterrupted at least every 17 microseconds until FC_Init2 is completed.
While in this state, it is also necessary to process incoming InitFC2
DLLPs. The values contained in the DLLP can be ignored, and InitFC1
packets are ignored entirely. Receiving any InitFC2 DLLP for a given virtual channel should set a flag (FI2) that terminates FC_Init2 and the flow
181
control initialization process. Please note that the FI2 flag is dependent
on receipt of a single Init_FC2 DLLP, and not all three (P, NP, and Cpl).
Additionally, the FI2 flag may also be set upon receipt of any TLP or
UpdateFC DLLP that uses that virtual channel.
What exactly is the purpose of this FC_Init2 state? It seems as if it is
just retransmitting the same flow control DLLPs, but with a single bit
flipped to indicate that it is in a new state. The purpose for this state is to
ensure that both devices on a link can successfully complete the flow
control initialization process. Without it, it could be possible for one device to make it through flow control while its link mate had not. For example, say that there is no FC_Init2 state and a device can proceed
directly from FC_Init1 to normal operation mode. While in FC_Init1, Device A transmits its three InitFC1 DLLPs to Device B and vice versa. Device B successfully receives all three DLLPs and proceeds on to normal
operation. Unfortunately, one of Device B's flow control DLLPs gets lost
on its way to Device A. Since Device A has not received all three types of
flow control DLLPs, it stays in FC_Init1 and continues to transmit flow
control DLLPs. Device B is no longer transmitting flow control packets,
so Device A never gets out of FC_Init1 and all traffic from Device A to B
is blocked.
The FC_Init2 state ensures that both devices can
successfully complete the flow control initialization process. In the above
example, Device B would transfer into the FC_Init2 state and could begin
to transmit TLPs and other DLLPs. However, Device B still needs to periodically transmit FC2 DLLPs for all three flow control types (P, NP, Cpl).
If Device A does not see the three original FC1 DLLPs, it can still eventually complete FC_Init1 since it periodically receives FC2 packets that
contain the needed flow control configuration information.
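A heavily simplified model of one device's initialization logic is sketched below. The state and flag names follow the text, but the class and method names are ours, and timers, framing, and CRC handling are omitted entirely:

```python
# Simplified sketch (ours) of one device's VC flow control initialization.

class FCInit:
    def __init__(self):
        self.state = "FC_Init1"
        self.got = set()   # credit types recorded from received InitFC DLLPs

    def to_transmit(self):
        # In either init state, the device cycles through P, NP, Cpl in order.
        prefix = "InitFC1" if self.state == "FC_Init1" else "InitFC2"
        return [f"{prefix}-{t}" for t in ("P", "NP", "Cpl")]

    def receive(self, dllp):
        kind, _, credit_type = dllp.partition("-")
        if self.state == "FC_Init1" and kind in ("InitFC1", "InitFC2"):
            self.got.add(credit_type)            # record the FC unit value
            if self.got == {"P", "NP", "Cpl"}:   # flag FI1
                self.state = "FC_Init2"
        elif self.state == "FC_Init2" and kind in ("InitFC2", "UpdateFC", "TLP"):
            self.state = "Active"                # flag FI2: one packet suffices

dev_a = FCInit()
for dllp in ("InitFC1-P", "InitFC1-NP", "InitFC1-Cpl"):
    dev_a.receive(dllp)
assert dev_a.state == "FC_Init2"
dev_a.receive("InitFC2-P")   # any one InitFC2 (or UpdateFC/TLP) ends init
assert dev_a.state == "Active"
```

Note how the model captures the lost-DLLP scenario from the text: a device still in FC_Init1 accepts InitFC2 DLLPs as credit advertisements, so it can complete initialization even if it missed the original InitFC1 packets.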
Tying this all together, what would a real flow control initialization
look like? Figure 9.7 illustrates the first step in an example flow control
initialization.
[Figure 9.7: Devices A and B each begin VC0 initialization, transmitting Init_FC1 DLLPs (P, NP, Cpl) framed by SDP symbols; Device B starts slightly earlier than Device A.]
Devices A and B exit out of reset and begin the default initialization
of VC0. In this example, Device B happens to begin the initialization first,
so it begins to transmit Init_FC1 packets before Device A does. It starts
with an SDP symbol (Start DLLP Packet; refer to Chapter 8 for additional
details on framing) and then begins to transmit the DLLP itself. The first
packet that Device B must transmit is Init_FC1 for type P. It does so, differentiating it as an FC1 initialization packet for credit type P in the first
four bits of the DLLP (refer to Chapter 7 for more details on the format of
these DLLPs). Device B then indicates that this packet pertains to VC0 by
placing a 0 in the DLLP's VC ID field. The next portions of the packet
identify that Device B can support 01h (one decimal) posted header request units and 040h (64 decimal) posted request data units. At 16 bytes
per unit, this equates to a maximum of 1024 bytes of data payload. Following this information is the CRC that is associated with this DLLP. The
Init_FC1-P DLLP then completes with the END framing symbol. Device B
continues on with the transmission of the Init_FC1-NP and Init_FC1-Cpl
DLLPs.
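The fields of that Init_FC1-P DLLP can be summarized as follows. The field names here are ours; the exact bit layout is given in Chapter 7 and in the specification:

```python
# Field-level sketch (ours) of the Init_FC1-P DLLP from the example.

init_fc1_p = {
    "type": "InitFC1-P",   # encoded in the leading bits of the DLLP
    "vc_id": 0,            # this advertisement applies to VC0
    "hdr_fc": 0x01,        # 1 posted header credit
    "data_fc": 0x040,      # 64 posted data credits
}

# At 16 bytes per data credit, 040h units cover 64 * 16 = 1024 bytes of
# posted data payload, as the text computes.
max_posted_payload = init_fc1_p["data_fc"] * 16
assert max_posted_payload == 1024
```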
Device A also begins transmitting Init_FC1 DLLPs, but does so just a
little later than Device B. At the point in time of this example, Device A
has just completed the transmission of the second initialization packet,
Init_FC1-NP DLLP, whereas Device B is already well into the transmission
[Figure 9.8: Device A, having received all three Init_FC1 DLLPs, transmits Init_FC2 packets, while Device B must complete a second set of Init_FC1s before it can progress to FC_Init2.]
Device B has sent out all three Init_FC1 packets (P, NP, and Cpl), but
has not yet received all three Init_FC1 packets from Device A. This
means that Device B cannot yet exit from the FC_Init1 state and must
therefore retransmit all three Init_FC1 packets. Device A, on the other
hand, has already received all three Init_FC1 packets by the time it completes transmitting its own Init_FC1 packets. This means that Device A
can exit from the FC_Init1 state after only one pass and proceed on to
FC_Init2.
In Figure 9.8, Device A has begun to send out Init_FC2 packets. It
begins, as required, with the P type. Only this time, it is identified as an
Init_FC2 packet and not an Init_FC1. It proceeds to send out the Init_FC2
packet for NP and has started to send out the Init_FC2 packet for Cpl at
the time of this example.
Device B, on the other hand, has had to continue through the second
transmission of Init_FC1 packets. Once it completes the set of three, it
can transition to FC_Init2 and begin to transmit Init_FC2 packets. In this
example, Device B has just started to send out an Init_FC2 packet for
type P.
Now that each device has entered into the FC_Init2 stage, what do
things look like as they exit the flow control initialization and enter into
normal link operation? Figure 9.9 illustrates the completion of the flow
control initialization process.
[Figure 9.9: Completion of flow control initialization. Device A, already finished, transmits a TLP on VC0 while Device B sends its remaining Init_FC2 DLLPs.]
Device A has sent out all three Init_FC2 packets (P, NP, and Cpl), but
has not yet received all three Init_FC2 packets from Device B. As discussed previously, however, exit from FC_Init2 is not dependent on receiving all three Init_FC2 packets. Receipt of any Init_FC2 DLLP allows
that device to exit from the FC_Init2 state (as long as it does not interrupt
the transmission of a trio of its Init_FC2 DLLPs). As such, Device A does
not need to retransmit its Init_FC2 packets and has completed the flow
control initialization. If it had not successfully received and processed the
first Init_FC2 DLLP from Device B by the time the transmission of its own trio of Init_FC2 DLLPs completed, it would need to continue transmitting Init_FC2 DLLPs until one arrived.
[Table: minimum initial flow control credit advertisements for each credit type (PH, PD, NPH, NPD, CplH, and CplD); the required minimum values are given in the PCI Express specification.]
Because credits are tracked per virtual channel, when credits are exhausted for a given transaction, transactions that use other traffic classes/virtual channels are not impacted.
Example
It should be noted that return of flow control credits does not necessarily
mean that the TLP has reached its destination or has been completed. It
simply means that the buffer or queue space allocated to that TLP at the
receiver has been cleared. In Figure 9.4, the upstream port of the switch
may send an UpdateFC that indicates it has freed up the buffer space
from a given TLP that is destined for the endpoint. The root complex
should not infer that this has any meaning other than that the TLP has
been cleared from the upstream receive buffers of the switch. That TLP
may be progressing through the core logic of the switch, may be in the
outgoing queue on the downstream port, or may be already received
down at the endpoint.
Isochronous Support
Servicing isochronous traffic requires a system to not only provide guaranteed data bandwidth, but also specified service latency. PCI Express is
designed to meet the needs of isochronous traffic while assuring that
other traffic is not starved for support. Isochronous support may be realized through the use of the standard flow control mechanisms described
above: traffic class labeling and the virtual channel data transfer protocol.
[Figure 9.10: Isochronous Pathways. Endpoint to root complex isochronous traffic and endpoint to endpoint isochronous traffic flow through a switch connecting PCI Express endpoints A, B, and C.]
appropriate timeframe. Conversely, if there are too many isochronous transfers within a given time period, other traffic may be starved and/or
isochronous flow control credits may not be returned in an appropriate
timeframe. The isochronous contract is set up based upon the desired
packet sizes, latencies, and period of isochronous traffic. When allocating
bandwidth for isochronous traffic, only a portion of the total available
bandwidth should be used, because sufficient bandwidth needs to remain
available for other traffic. Additional details on the isochronous contract
variables and their impact on virtual channel and traffic class configurations are contained in the PCI Express specification.
Generally speaking, isochronous transactions follow the same rules
that have been discussed in this chapter. Software just uses the mechanisms described to ensure that the isochronous traffic receives the
bandwidth and latencies needed. Since isochronous traffic should be differentiated by traffic class and virtual channel, there are no direct ordering relationships between isochronous and other types of transactions.
Devices that may see an isochronous traffic flow should implement
proper buffer sizes to ensure that normal, uniform isochronous traffic
does not get backed up and require throttling. If the isochronous traffic
flow is bursty (lots of isochronous traffic at once), throttling may occur,
so long as the acceptance of the traffic is uniform (and according to the
isochronous contract). Chapter 10 gives additional explanations of the
software setup for isochronous traffic.
Chapter 10
PCI Express Software Overview

Those parts of the system that you can hit with a hammer
are called hardware; those program instructions that you
can only curse at are called software.
Anonymous
this model should ease the adoption of PCI Express, because it removes
the dependency on operating system support for PCI Express in order to
have baseline functionality. The second PCI Express configuration model
is referred to as the enhanced mechanism. The enhanced mechanism increases the size of available configuration space and provides some optimizations for access to that space.
the PCI-PCI bridges on that bus need to pay attention to the configuration transaction. If the target PCI bus for the configuration is a bridge's
subordinate but not secondary bus, the bridge claims the transaction
from its primary bus and forwards it along to its secondary bus (still as a
Type 1). If the target PCI bus for the configuration is a bridge's secondary
bus, the bridge claims the transaction from its primary bus and forwards
it along to its secondary bus, but only after modifying it to a Type 0 configuration transaction. This indicates that the devices on that bus need to
determine whether they should claim that transaction. Please refer to the
PCI Local Bus Specification Revision 2.3 for additional details on PCI configuration.
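The forwarding rule can be sketched as follows. This is an illustrative model of our own, assuming only that each bridge knows its secondary and subordinate bus numbers:

```python
# Illustrative sketch (ours) of how a PCI-PCI bridge handles a Type 1
# configuration transaction arriving on its primary bus.

def forward_config(bridge_secondary, bridge_subordinate, target_bus):
    if target_bus == bridge_secondary:
        # Convert to Type 0: devices on the secondary bus must decode it.
        return "claim, forward as Type 0"
    if bridge_secondary < target_bus <= bridge_subordinate:
        # Target lies somewhere below the secondary bus: forward unchanged.
        return "claim, forward as Type 1"
    return "ignore"

# A bridge with secondary bus 2 and subordinate bus 4:
assert forward_config(2, 4, 2) == "claim, forward as Type 0"
assert forward_config(2, 4, 3) == "claim, forward as Type 1"
assert forward_config(2, 4, 7) == "ignore"
```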
For PCI Express, each link within the system originates from a PCI-PCI bridge and is mapped as the secondary side of that bridge. Figure
10.1 shows an example of how this configuration mechanism applies to a
PCI Express switch. In this example, the upstream PCI Express link that
feeds the primary side of the upstream bridge originates from the secondary side of a PCI bridge (either from the root complex or another
switch). A PCI Express endpoint is represented as a single logical device
with one or more functions.
[Figure 10.1: A PCI Express switch modeled as PCI-PCI bridges. The upstream PCI Express link feeds the primary side of an upstream PCI-PCI bridge; its secondary side connects to downstream PCI-PCI bridges whose secondary sides feed PCI Express links to endpoints.]
Configuration Mechanisms
PCI 2.3 allowed for 256 bytes of configuration space for each device
function within the system. PCI Express extends the allowable configuration space to 4096 bytes per device function, but does so in a way that
maintains compatibility with existing PCI enumeration and configuration
software. This is accomplished by dividing the PCI Express configuration
space into two regions, the PCI 2.3-compatible region and the extended
region. The PCI 2.3-compatible region is made up of the first 256 bytes of
a device's configuration space. This area can be accessed via the traditional configuration mechanism (as defined in the PCI 2.3 specification)
or the new PCI Express enhanced mechanism. The extended region of
configuration space consists of the configuration space between 256 and
4096 bytes. This area can be accessed only through the enhanced PCI
Express mechanism, and not via the traditional PCI 2.3 access mechanism. This is shown in Figure 10.2. The extension of the configuration
space is useful for complex devices that require large numbers of registers to control and monitor the device (for example, a Memory Controller
Hub). With only 256 bytes of configuration space offered by PCI, these
devices may need to be implemented as multiple devices or as multifunction devices, just to have enough configuration space.
[Figure 10.2: PCI Express configuration space layout. The first 256 bytes form the PCI 2.3-compatible region, available on legacy operating systems through legacy access mechanisms; offsets above that, up to FFFh, form the extended configuration space for PCI Express parameters and capabilities, not available on legacy operating systems.]
register that is being addressed. The memory data contains the contents for
the configuration register being accessed. The mapping from memory address A[27:0] to PCI Express configuration space is shown in Table 10.1.
Table 10.1  Memory Address to PCI Express Configuration Space Mapping

Memory Address    Configuration Space Field
A[27:20]          Bus[7:0]
A[19:15]          Device[4:0]
A[14:12]          Function[2:0]
A[11:8]           Extended Register[3:0]
A[7:0]            Register[7:0]
Again, both the enhanced PCI Express and the PCI 2.3-compatible access
mechanisms use this request format. PCI 2.3-compatible configuration
requests must fill the Extended Register Address field with all 0s.
The PCI Express host bridge is required to translate the memory-mapped PCI Express configuration accesses from the host processor to
legitimate PCI Express configuration transactions. Refer to Chapter 6 for
additional details on how configuration transactions are communicated
through PCI Express.
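Extracting Table 10.1's fields amounts to a few shifts and masks. The helper and its field names below are ours:

```python
# Decode the A[27:0] portion of a memory-mapped configuration address
# into the Table 10.1 fields (helper is ours, not from the book).

def decode_cfg_address(addr):
    return {
        "bus": (addr >> 20) & 0xFF,          # A[27:20]
        "device": (addr >> 15) & 0x1F,       # A[19:15]
        "function": (addr >> 12) & 0x07,     # A[14:12]
        "ext_register": (addr >> 8) & 0x0F,  # A[11:8]; all 0s for PCI 2.3 requests
        "register": addr & 0xFF,             # A[7:0]
    }

fields = decode_cfg_address((3 << 20) | (2 << 15) | (1 << 12) | 0x40)
assert fields == {"bus": 3, "device": 2, "function": 1,
                  "ext_register": 0, "register": 0x40}
```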
Error Reporting
This section explains the error signaling and logging requirements for
PCI Express. PCI Express defines two error reporting mechanisms. The
first is referred to as baseline and defines the minimum error reporting
capabilities required by all PCI Express devices. The second is referred to
as advanced error reporting and allows for more robust error reporting.
Advanced error reporting requires specific capability structures within
the configuration space. This is touched upon briefly in this section, but
not to the same level of detail as in the PCI Express specification.
In order to maintain compatibility with existing software that is not
aware of PCI Express, PCI Express errors are mapped to existing PCI reporting mechanisms. Naturally, this legacy software would not have access to the advanced error reporting capabilities offered by PCI Express.
197
Error Classification
There are two types of PCI Express errors: uncorrectable errors and correctable errors. Uncorrectable errors are further classified as either fatal
or nonfatal. Specifying these error types provides the platform with a
method for dealing with the error in a suitable fashion. For instance, if a
correctable error such as a bad TLP (due to an LCRC error) is reported,
the platform may want to respond with some monitoring software to determine the frequency of the TLP errors. If the errors become frequent
enough, the software may initiate a link-specific reset (such as retraining
the link). Conversely, if a fatal error is detected, the platform may want to
initiate a system-wide reset. These responses are merely shown as examples. It is up to platform designers to map appropriate platform responses to error conditions.
Correctable errors are identified as errors where the PCI Express protocol can recover without any loss of information. Hardware corrects
these errors (for example, through a Data Link Layer initiated retry attempt for a bad LCRC on a TLP). As mentioned previously, logging the
frequency of these types of errors may be useful for understanding the
overall health of a link.
Uncorrectable errors are identified as errors that impact the functionality of the interface. Fatal errors are uncorrectable errors that render a
given link unreliable. Handling of fatal errors is platform-specific, and
may require a link reset to return to a reliable condition. Nonfatal errors
are uncorrectable errors that render a given transaction unreliable, but do
not otherwise impact the reliability of the link. Differentiating between
fatal and nonfatal errors allows system software greater flexibility when
dealing with uncorrectable errors. For example, if an error is deemed to
be nonfatal, system software can react in a manner that does not upset
(or reset) the link and other transactions already in progress. Table 10.2
shows the various PCI Express errors.
Table 10.2  Error Types

Layer               Error Name               Default Severity           Action
Physical Layer      Receive Error            Correctable                Receiver: send ERR_COR to root complex
                    Training Error           Uncorrectable (Fatal)      If checking, send ERR_FATAL/ERR_NONFATAL to root complex (unless masked)
Data Link Layer     Bad TLP                  Correctable                Receiver: send ERR_COR to root complex
                    Replay Timeout           Correctable                Transmitter: send ERR_COR to root complex
                    Replay Num Rollover      Correctable                Transmitter: send ERR_COR to root complex
                    Data Link Layer
                    Protocol Error           Uncorrectable (Fatal)      If checking, send ERR_FATAL/ERR_NONFATAL to root complex
Transaction Layer   Poisoned TLP Received    Uncorrectable (Nonfatal)   Receiver: send ERR_NONFATAL to root complex
                    ECRC Check Failed        Uncorrectable (Nonfatal)   Receiver: send ERR_NONFATAL to root complex; log the header of the TLP that encountered the ECRC error
                    Unsupported Request      Uncorrectable (Nonfatal)   Request receiver: send ERR_NONFATAL to root complex
                    Completion Timeout       Uncorrectable (Nonfatal)   Requester: send ERR_NONFATAL/ERR_FATAL to root complex
                    Completer Abort          Uncorrectable (Nonfatal)   Completer: send ERR_NONFATAL to root complex; log the header of the completion that encountered the error
                    Unexpected Completion    Uncorrectable (Nonfatal)   Receiver: send ERR_NONFATAL to root complex; log the header of the completion that encountered the error (this error is a result of misrouting)
                    Receiver Overflow        Uncorrectable (Fatal)      Receiver: send ERR_FATAL to root complex
                    Flow Control
                    Protocol Error           Uncorrectable (Fatal)      Receiver: send ERR_FATAL to root complex
                    Malformed TLP            Uncorrectable (Fatal)      Receiver: send ERR_FATAL to root complex
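The classification above can be sketched as a severity lookup plus a platform response. The response strings below are invented examples, in keeping with the text's caveat that platform designers choose the actual responses:

```python
# Illustrative default severity table and response dispatch (ours; the
# responses are hypothetical examples, not mandated behavior).

DEFAULT_SEVERITY = {
    "receive_error": "correctable",
    "bad_tlp": "correctable",
    "replay_timeout": "correctable",
    "training_error": "fatal",
    "receiver_overflow": "fatal",
    "malformed_tlp": "fatal",
    "poisoned_tlp": "nonfatal",
    "completion_timeout": "nonfatal",
    "unexpected_completion": "nonfatal",
}

def platform_response(error):
    severity = DEFAULT_SEVERITY[error]
    if severity == "correctable":
        return "log frequency; hardware already recovered"
    if severity == "nonfatal":
        return "handle transaction; link stays up"
    return "reset link (or system) to restore reliability"

assert platform_response("bad_tlp") == "log frequency; hardware already recovered"
assert platform_response("receiver_overflow").startswith("reset")
```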
Error Signaling
The PCI Express device that detects an error is responsible for the appropriate signaling of that error. PCI Express provides two mechanisms
for devices to alert the system or the initiating device that an error has
occurred. The first mechanism is through the Completion Status field in
the completion header. As discussed in Chapter 6, the completion packet
indicates if the request has been completed successfully. Signaling an error in this manner allows the requester to associate that error with a specific request.
The second method for error signaling is through in-band error messages. These messages are sent to the root complex in order to advertise that an error of a particular severity has occurred. These messages
are routed up to the root complex, and indicate the severity of the error
(correctable versus fatal versus nonfatal) as well as the ID of the initiator
of the error message. If multiple error messages of the same type are detected, the corresponding error messages may be merged into a single error message. Error messages of differing severity (or from differing
initiators) may not be merged together. Refer to Chapter 6 for additional
details on the format and details of error messages. Once the root complex receives the error message, it is responsible for translating the error
into the appropriate system event.
Baseline error handling does not allow for severity programming, but advanced error reporting allows a device to identify each uncorrectable error as either fatal or nonfatal. This is accomplished via the Uncorrectable Error Severity register that is implemented if a device supports advanced error reporting.
Error messages may be blocked through the use of error masking.
When an error is masked, the status bit for that type of error is still affected by an error detection, but no message is sent out to the root complex. Devices with advanced error reporting capabilities can
independently mask or transmit different error conditions.
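The masking rule just described (the status bit still latches, but the message is suppressed) can be sketched in a few lines of C; the register model and function name here are hypothetical, not part of any real device interface.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical device model: one status and one mask register,
 * one bit per uncorrectable error type. */
struct err_regs {
    uint32_t status;  /* Uncorrectable Error Status */
    uint32_t mask;    /* Uncorrectable Error Mask   */
};

/* Per the masking rule above: detection always latches the status
 * bit; a message goes to the root complex only if the error is
 * unmasked. Returns true if an error message should be sent. */
bool record_error(struct err_regs *r, uint32_t err_bit)
{
    r->status |= err_bit;            /* status is affected regardless */
    return (r->mask & err_bit) == 0; /* masked errors send no message */
}
```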
Figure 10.3    Advanced Error Reporting Capability Structure (register layout at byte offsets 00h through 34h; the registers at offsets 2Ch through 34h, including the Correctable and Uncorrectable Error Source ID registers, apply to root ports only)
The PCI Express Enhanced Capability header is detailed in the specification and uses a Capability ID of 0001h to indicate it as an advanced
error reporting structure. The error status registers (Uncorrectable Error
Status register and Correctable Error Status register) indicate if a particular type of error has occurred. The mask registers (Uncorrectable Error
Mask register and Correctable Error Mask register) control whether that
particular type of error is reported. The Uncorrectable Error Severity register identifies if an uncorrectable error is reported as fatal or nonfatal. All
of these registers call out the error types shown in Table 10.2 (that is to
say, all of the uncorrectable errors shown in that table have a bit associated with them in each of the uncorrectable error registers).
The Advanced Error Capabilities and Control register takes care of
some of the housekeeping associated with advanced error reporting. It
contains a pointer to the first error reported (since the error status registers may have more than one error logged at a time). It also details the
ECRC capabilities of the device. The Header Log register captures the
header for the TLP that encounters an error. Table 10.2 identifies the errors that make use of this register.
Root complexes that support advanced error reporting must implement several additional registers. Among them are the Root Error Command and Root Error Status registers, which allow the root complex to differentiate the system response to a given error severity. Finally, if supporting advanced error reporting, the root complex also implements registers to log the Requester ID if either a correctable or uncorrectable error is received.
Error Logging
Figure 10.4 shows the sequence for signaling and logging a PCI Express
error. The boxes shaded in gray are only for advanced error handling and
not used in baseline error handling.
Figure 10.4    Error Signaling and Logging Sequence (flowchart: on error detection, the appropriate fatal/nonfatal or correctable bit is set in the Device Status register; uncorrectable errors have their severity adjusted through the Uncorrectable Error Severity register; errors masked in the Uncorrectable or Correctable Error Mask register are not signaled; otherwise, if error reporting is enabled, the device sends an ERR_COR, ERR_NONFATAL, or ERR_FATAL message)
Devices that do not support the advanced error handling ignore the boxes shaded in gray and only log the Device Status register bits as shown in the white boxes. Some errors are also reported using the PCI-compatible configuration registers, using the parity error and system error status bits (refer to the PCI Express specification for full details on this topic).
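The decision sequence of Figure 10.4 can be condensed into a single function. This C sketch assumes a device with advanced error reporting; the enum and parameter names are illustrative.

```c
#include <stdbool.h>

enum action { SEND_NONE, SEND_ERR_COR, SEND_ERR_NONFATAL, SEND_ERR_FATAL };

/* A sketch of the Figure 10.4 decision flow. The caller has already
 * latched the Device Status bit; this routine only decides which
 * message, if any, goes to the root complex. */
enum action signal_error(bool correctable, bool masked,
                         bool reporting_enabled, bool severity_fatal)
{
    if (masked || !reporting_enabled)
        return SEND_NONE;                 /* status logged, no message */
    if (correctable)
        return SEND_ERR_COR;
    return severity_fatal ? SEND_ERR_FATAL : SEND_ERR_NONFATAL;
}
```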
Operating systems that are not aware of PCI Express do not have mechanisms to configure and enable this feature. Active state power management can be implemented on legacy operating systems through updated BIOS or drivers.
The specifics of PCI Express power management are covered in detail in Chapter 11 and therefore are not included here.
The primary elements of the standard Hot Plug usage model include: indicators, manually operated retention latches (MRLs), the MRL sensor, the electromechanical interlock, the attention button, and slot numbering.
PCI Express adopts the standard usage model for several reasons. One of
the primary reasons, unrelated to software, is the ability to preserve the existing dominant Hot Plug usage model that many customers have become
used to. Another reason is the ability to reuse code bits and flow processes
already defined for legacy Hot Plug implementations with PCI-X.
Message                       Issued By
Attention_Indicator_On        Switch/Root Port
Attention_Indicator_Off       Switch/Root Port
Attention_Indicator_Blink     Switch/Root Port
Power_Indicator_On            Switch/Root Port
Power_Indicator_Off           Switch/Root Port
Power_Indicator_Blink         Switch/Root Port
Attention_Button_Pressed      Add-in Slot Device
Once software has enabled and configured the PCI Express device for Hot Plug functionality, if supported, system interrupts and power management events are generated based upon Hot Plug activity (attention buttons being pressed, power faults, manual retention latches opening/closing). When PCI Express Hot Plug events generate interrupts, the
system Hot Plug mechanism services those interrupts. The Hot Plug
mechanism is dependent upon the operating system. Legacy operating
systems will likely use an ACPI implementation with vendor-specific filter drivers. A contrast between a PCI Express-aware and a legacy ACPI-capable operating system Hot Plug service model is provided in Figure 10.5. For additional information on ACPI, refer to the Advanced Configuration and Power Interface Specification, Revision 2.0b.
Figure 10.5    Generic Hot Plug Service Model for PCI Express Hot Plug (flowchart: under a PCI Express-aware operating system, firmware control of the Hot Plug registers is disabled and the operating system services the Hot Plug interrupt directly; under a legacy operating system, firmware control of the Hot Plug registers is enabled, the BIOS redirects the Hot Plug interrupt to a General Purpose Event, and the ACPI driver executes)
byte limit for PCI 2.3-compatible configuration space). The PCI Express
enhanced configuration mechanism is required to access this feature
space. This means that isochronous services such as traffic class/virtual
channel mapping for priority servicing cannot be supported by legacy
operating systems. The following discussion assumes that software supports the PCI Express enhanced configuration mechanism.
Figure 10.6    PCI Express Virtual Channel Extended Capability Structure (register layout; the per-channel register sets repeat at offsets 14h + (n * 0Ch) and 18h + (n * 0Ch), the VC Arbitration Table resides at VAT_offset * 04h, and each Port Arbitration Table resides at PAT_offset(n) * 04h)
The PCI Express Enhanced Capability Header is detailed in the specification and uses a Capability ID of 0002h to indicate it as a virtual channel structure. The Port Virtual Channel Capabilities registers (Port VC
Capability 1 register and Port VC Capability 2 register) indicate the number of other virtual channels (VC[7:1]) that a device supports in addition
to the default virtual channel VC0. These registers also contain arbitration
support information and the offset location of the actual Virtual Channel
Arbitration Table. The capability and status registers (Port VC Status register and VC Resource Capability [n:0] registers) report the Virtual Channel Arbitration Table coherency status, the types of port arbitration
supported by the available virtual channels (also referred to as resources)
and the offset location of the Port Arbitration Table for each virtual
channel. The Port VC Control register allows software to select, configure and load an available virtual channel arbitration scheme. The VC Resource Control [n:0] registers are used to enable and configure each
available virtual channel as well as map which particular traffic classes
use that virtual channel. The VC Resource Status [n:0] registers are used
to report the coherency status of the Port Arbitration Table associated
with each individual virtual channel as well as report whether or not a
virtual channel is currently in the process of negotiating for port access.
would get the highest priority. This arbitration mechanism is the default
arbitration mechanism for PCI Express virtual channel arbitration. The
use of this arbitration scheme does require some amount of software
regulation in order to prevent possible starvation of low priority devices.
Round Robin arbitration is a common technique that allows equal access
opportunities to all virtual channel traffic. This method does not guarantee that all virtual channels are given equal bandwidth usage, only the
opportunity to use some of the available bandwidth. Weighted Round
Robin arbitration is a cross between Strict Priority and Round Robin. This
mechanism provides fairness during times of traffic contention by allowing lower priority devices at least one arbitration win per arbitration
loop. The latency of a particular virtual channel is bounded by a minimum and maximum amount. This is where the term weighted comes
in. Weights can be fixed through hardware, or preferably, programmable
by software. If configurable by software, the ability is reported through
the PCI Express Virtual Channel Extended Capability Structure outlined
in Figure 10.6.
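A phase-table realization of Weighted Round Robin, in the spirit of the arbitration tables described below, might look like the following C sketch. All names are illustrative, and a hardware arbiter would of course implement this in logic rather than software: a virtual channel's weight is simply the number of phase slots it owns per loop.

```c
#include <stddef.h>

/* Weighted Round Robin as a fixed phase table: each slot names a
 * virtual channel, e.g. {0,1,0,2} gives VC0 twice the slots of
 * VC1 or VC2 per arbitration loop. */
struct wrr {
    const int *phases;
    size_t nphases;
    size_t cursor;       /* next phase slot to consider */
};

/* Return the VC that wins this arbitration slot, skipping VCs with
 * no pending traffic; returns -1 if nothing is pending. */
int wrr_next(struct wrr *a, const int pending[])
{
    for (size_t i = 0; i < a->nphases; i++) {
        size_t slot = (a->cursor + i) % a->nphases;
        int vc = a->phases[slot];
        if (pending[vc]) {
            a->cursor = (slot + 1) % a->nphases;
            return vc;
        }
    }
    return -1;
}
```

Because every VC with traffic is guaranteed at least one win per loop, latency is bounded, which is exactly the fairness property described above.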
The Virtual Channel Arbitration Table and the Port Arbitration Table
It can be a bit confusing, when mentioning the different arbitration tables, to understand what each table is used for. It seems useful at this
point to make some sort of distinction between the two. The Virtual
Channel Arbitration Table contains the arbitration mechanism for prioritizing virtual channels competing for a single port. The Port Arbitration
Table contains the arbitration mechanism for prioritizing traffic that is
mapped onto the same virtual channel, but originates from another receiving (also called ingress) port. Port Arbitration Tables are only found
in switches and root complexes. Figure 10.7 illustrates the arbitration
structure that is configured through the PCI Express virtual channel
structure.
Figure 10.7    Virtual Channel and Port Arbitration Structure (traffic arriving on the receiving (ingress) ports 0, 1, and 3, carried on VC0 through VC2, is first subject to port arbitration within each virtual channel, and the virtual channels then compete through virtual channel arbitration for the transmitting (egress) port)
The supported arbitration schemes are reported in the Port VC Capability 2 register (offset 08h), bits [7:0], as shown in Figure 10.6. Table 10.5 is an excerpt from this register as documented in the PCI Express Specification, Revision 1.0. Software can select an available arbitration option by setting bits in the Port VC Control register, as shown in Figure 10.6.
Table 10.5

Bit Location    Description
0               Hardware-fixed arbitration scheme (for example, Round Robin)
1               Weighted Round Robin arbitration with 32 phases
2               Weighted Round Robin arbitration with 64 phases
3               Weighted Round Robin arbitration with 128 phases
4-7             Reserved
Figure 10.8    VC Arbitration Table Format (the table is an array of phase entries, Phase[0], Phase[1], and so on, packed into registers at offsets 00h through 3Ch; software programs each phase slot with a virtual channel identifier, producing an arbitration pattern such as VC0, VC1, VC0, VC2, VC1, VC0, VC1, VC0 repeated through the array)
Chapter 11
Power Management
This chapter provides an overview of the power management capabilities and protocol associated with PCI Express. The chapter first discusses the existing PCI power management model as a base for PCI Express power management, and then expands on this base to define the new power management capabilities of PCI Express, such as Link State Power Management and Active State Power Management. The chapter also discusses the impact of PCI Express power management on current software models as well as the general flow that new software must take to enable the new power management capabilities.
it became apparent that a system-independent power management standard was needed to address PCI-based add-in cards. In June of 1997 the PCI-SIG released the PCI Bus Power Management Interface Specification to standardize PCI-based system power management.
Though not specifically required to do so, each PCI and PCI Express
device is capable of hosting multiple functions. The PCI Bus Power
Management Specification Revision 1.1 (PCI-PM), which PCI Express is
compatible with, defines four function-based power management states,
D0 through D3. Functions within a single device can occupy any of the supported function states as dictated by the system's power management policy. In fact, since each function represents an independent entity to the operating system, each function can occupy a unique function-based power management state within a single device, independent of the other functions within the device (with the exception of Power Off), as shown in Figure 11.1. PCI Express power management supports and defines exactly the same function states as PCI power management.
Table 11.1    Function-Based Power Management States

State     Function Context    Exit Latency to D0    Required/Optional
D0        Preserved           n/a                   Required
D1        Preserved                                 Optional
D2        Preserved           200 us                Optional
D3HOT     Not maintained      10 ms                 Required
D3COLD    Not maintained      n/a                   Required

D3HOT is the software-accessible D3 state; D3COLD corresponds to the function having its power removed.
Figure 11.1    Independent Function States within a Single Device (Device X contains Functions 1 through 4; each function supports at least D0 and D3 and can occupy its supported states independently of the other functions)
Operation           Required/Optional
Get Capabilities    Required
Set Power State     Required
Get Power Status    Required
Wakeup              Optional
Each PCI and PCI Express device function maintains 256 bytes of configuration space. PCI Express extends the configuration space to 4096 bytes per device function. However, only the first 256 bytes of the configuration space are PCI 2.3-compatible. Additionally, the first 256 bytes of the extended PCI Express configuration space are all that is visible to current (Microsoft Windows XP) and legacy operating systems. Refer to Chapter 10 for additional information.
The base power management operations are supported through specific registers and offsets within the 256 bytes of PCI 2.3-compatible configuration space. The two key registers of concern are the Status register
and the Capabilities Pointer register. The Status register, located at offset
0x04, has a single bit (bit 4) that reports whether the device supports
PCI-PM extended power management capabilities. Extended power management capabilities, not to be confused with the extended configuration
space of PCI Express, can be described as the function states D1, D2 and
D3HOT. The function states D0 and D3COLD are naturally supported by every
function since they correspond to the function being fully on or powered
off, respectively. For example, an unplugged hair dryer supports D3COLD
because it does not have any power.
During device enumeration the operating system uses the Get Capabilities operation to poll the status register of each function to determine whether any extended power management capabilities are supported. If bit 4 (the capabilities bit) of the Status register is set, the operating system knows that the function supports extended power management capabilities. As a result, the operating system reads the offset value contained in the Capabilities Pointer register at offset 0x34, which points to the location of the registers that inform the operating system of the specific extended power management capabilities, as shown in Figure 11.2.
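The flow just described amounts to a short walk of the capability linked list. The following C sketch operates on a 256-byte copy of configuration space; the helper name and the cfg[] image are illustrative, and real code would use the operating system's configuration-space accessors.

```c
#include <stdint.h>

#define STATUS_CAP_LIST  (1u << 4)  /* capabilities bit in the Status register */
#define CAP_PTR_OFFSET   0x34
#define CAP_ID_PM        0x01       /* power management capability ID */

/* Find the offset of the power management capability registers in a
 * 256-byte configuration space image, or 0 if not present. The Status
 * register occupies the upper 16 bits of the dword at offset 0x04
 * (bytes 0x06 and 0x07). */
uint8_t find_pm_cap(const uint8_t cfg[256])
{
    uint16_t status = (uint16_t)(cfg[0x06] | (cfg[0x07] << 8));
    if (!(status & STATUS_CAP_LIST))
        return 0;                           /* no extended capabilities */

    /* Each capability begins with an ID byte followed by a pointer
     * to the next capability (0 terminates the list). */
    uint8_t ptr = (uint8_t)(cfg[CAP_PTR_OFFSET] & 0xFCu);
    while (ptr) {
        if (cfg[ptr] == CAP_ID_PM)
            return ptr;
        ptr = (uint8_t)(cfg[ptr + 1] & 0xFCu);
    }
    return 0;
}
```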
Figure 11.2    PCI 2.3-Compatible Configuration Space Header (the Status register shares the dword at offset 04h with the Command register, and the Cap_Ptr at offset 34h points to the extended power management capability registers)
Figure 11.3    Power Management Capability Registers (the Capability ID and the PMCSR_BSE Bridge Support Extensions reside at offset 0, and the Power Management Control/Status Register (PMCSR) resides at offset 4)
Figure 11.4    Example Bus Hierarchy (Bridges A, B, and C with secondary buses Bus 0, Bus 1, and Bus 2, connecting Devices 1 through 4)
As mentioned before, the PCI Express Link states replace the Bus
states defined by the PCI-PM Specification. The newly defined Link states
are not a radical departure from the Bus states of PCI-PM. These Link
states have been defined to support the new advanced power management concepts and clock architecture of PCI Express. For the most part,
the general functionality of the PCI Express Link states parallels the Bus
states of PCI with the exception of a couple new states added by PCI Express. PCI Express defines the following Link states: L0, L0s, L1, L2/L3
Ready, L2, and L3. As with the PCI-PM Bus states, in the L0 state the link is fully on; in the L3 state the link is fully off; and L0s, L1, L2/L3 Ready, and L2 are in-between sleep states, as shown in Figure 11.5 and
Table 11.3. Of the above PCI Express Link states, the only state that is not
PCI-PM-compatible is the L0s state, which is part of the advanced power
management features of PCI Express.
Figure 11.5    PCI Express Link States (state diagram: the training states Detect, Polling, Configuration, and Recovery connect to L0 (full on), which can transition to the sleep states L0s, L1, and L2/L3 Ready, and from there to L2 and L3)
Table 11.3    Link State Support

Link State       PCI-PM Compatible    Active State PM
L0               Required             Required
L0s              Not Applicable       Required
L1               Required             Optional
L2/L3 Ready      Required             Not Applicable
L2               Optional             Not Applicable
L3               Required             Not Applicable
Recovery         Required             Required
Detect           Required             Not Applicable
Polling          Required             Required
Configuration    Required             Required
Figure 11.6    PCI Express Clock Architecture (a clock source supplies a 100 MHz differential clock to the PCI Express component; the component's PLL (phase-locked loop) multiplies the 100 MHz clock to the 2.5 GHz clock used by the PCI Express circuits)
PCI Express Link States L1, L2, and PCI-PM Bus State B2

PCI Express defines two Link states that are similar to the optional PCI-PM Bus state B2. At a high level, the similarity between these PCI Express Link states and the B2 PCI-PM Bus state is an idle bus and the absence of a clock. In terms of PCI Express, the Link states L1 and L2 correspond to a high-latency low-power state and a high-latency deep-sleep state, respectively.
The link does not enter the Recovery state if main power has been removed from the system. In the case that main power has been removed
the link must be entirely retrained.
State            Function
Detect           Detect the presence of a receiver at the far end of the link
Polling          Establish bit and symbol lock, data rate, and lane polarity
Configuration    Negotiate link width and assign lane numbers
Table 11.4    Comparison of PCI-PM Bus States and PCI Express Link States

PCI Bus State                       Corresponding PCI Express Link State(s)
B0: Bus is fully on (Vcc)           L0: Link is fully on (Vcc); component reference clock running; component PLLs running.
B1 (Vcc)                            L0s: No TLP or DLLP communication over the link (Vcc); component reference clock running; component PLLs running.*
B2 (Vcc)                            L1: No TLP or DLLP communication over the link (Vcc); component reference clock running; component PLLs shut off.
                                    L2/L3 Ready: No TLP or DLLP communication over the link (Vcc).
                                    L2: No TLP or DLLP communication over the link (Vaux); component reference clock input shut off; component PLLs shut off.
B3: Off (system unpowered)          L3: Off.

* The comparison of PCI-PM Bus state B1 to Link state L0s is a functionality comparison only; Link state L0s is not a PCI-PM-compatible state.
Active State Power Management does require new software to enable the capability. In this case the definition of new software does not mean a new operating system, but rather new BIOS code that works in conjunction with a legacy operating system. Active State Power Management does not fall under the basic PCI-PM software-compatible features.
You access the PCI Express Active State Power Management capabilities through the PCI Express Capability Structure that exists within PCI 2.3-compatible configuration space, as shown in Figure 11.7. You manage Active State Power Management at each PCI Express port through registers in the PCI Express Capability Structure, as shown in Figure 11.8.
Figure 11.7    Configuration Space Layout (offsets 00h through 3Fh hold the PCI 2.3-compatible header; the PCI Express Capability Structure resides within the remainder of the PCI 2.3-compatible space up to FFh; offsets 100h through FFFh hold the PCI Express extended configuration space, which is not available on legacy operating systems)
The first step that software must take is to determine whether the link
supports Active State Power Management, and if so, to what level. This is
accomplished by querying the Link Capabilities register (offset 0x0Ch,
bits 11:10) of the PCI Express Capability Structure, as shown in Figure
11.8. If the link does not support Active State Power Management then
the rest of the steps are not necessary. The available options for Active
State Power Management are no support, L0s support only, or both L0s
and L1 support.
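Extracting that two-bit field is straightforward. In this C sketch the enum and function names are illustrative, though the field position (offset 0x0C, bits 11:10) follows the text above.

```c
#include <stdint.h>

/* Encodings for the Active State Power Management support field of
 * the Link Capabilities register: no support, L0s only, or both
 * L0s and L1. */
enum aspm_support { ASPM_NONE = 0, ASPM_L0S = 1, ASPM_L0S_L1 = 3 };

/* Extract bits 11:10 from a Link Capabilities register value. */
enum aspm_support aspm_from_link_caps(uint32_t link_caps)
{
    return (enum aspm_support)((link_caps >> 10) & 0x3);
}
```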
Figure 11.8    PCI Express Capability Structure (the Capability ID resides at offset 00h; Device Capabilities at offset 04h; Device Status and Device Control at offset 08h; Link Capabilities at offset 0Ch; Link Status and Link Control at offset 10h; Slot Capabilities at offset 14h; Slot Status and Slot Control at offset 18h; RsvdP and Root Control at offset 1Ch; Root Status at offset 20h)
Software must then query the Link Status register (offset 0x12, bit 12) of the PCI Express Capability Structure to determine whether the device utilizes the reference clock provided to the slot by the system or a clock provided on the add-in card itself. The results of this query are used to update the Link Control register, located at offset 0x10, bit 6, of the PCI Express Capability Structure. Bit 6 corresponds to the common clock configuration and, when set, causes the appropriate L0s and L1 exit latencies to be reported in the Link Capabilities register (offset 0x0C, bits 14:12 for L0s and bits 17:15 for L1). It is no surprise that exit latencies vary depending upon whether the two PCI Express devices on a link utilize the same reference clock or different reference clocks.
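The common clock step can be sketched as follows in C. The register model is a plain in-memory stand-in for the actual configuration-space accesses, and all names other than the bit positions quoted above are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

struct pcie_cap {
    uint32_t link_caps;   /* offset 0x0C */
    uint16_t link_ctrl;   /* offset 0x10 */
    uint16_t link_status; /* offset 0x12 */
};

#define LSTAT_SLOT_CLOCK  (1u << 12) /* device uses the slot's reference clock */
#define LCTRL_COMMON_CLK  (1u << 6)  /* common clock configuration */

/* If this device and its link partner both use the slot reference
 * clock, set the common clock configuration bit so that the
 * (shorter) common-clock exit latencies are reported. */
void set_common_clock(struct pcie_cap *dev, bool partner_uses_slot_clock)
{
    bool common = partner_uses_slot_clock &&
                  (dev->link_status & LSTAT_SLOT_CLOCK);
    if (common)
        dev->link_ctrl |= LCTRL_COMMON_CLK;
    else
        dev->link_ctrl &= (uint16_t)~LCTRL_COMMON_CLK;
}

/* L0s exit latency encoding, Link Capabilities bits 14:12. */
uint32_t l0s_exit_latency_code(uint32_t link_caps)
{
    return (link_caps >> 12) & 0x7;
}
```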
Value    L0s Exit Latency (offset 0x0C, bits 14:12)    L1 Exit Latency (offset 0x0C, bits 17:15)
000b     Less than 64 ns                               Less than 1 us
001b     64 ns to less than 128 ns                     1 us to less than 2 us
010b     128 ns to less than 256 ns                    2 us to less than 4 us
011b     256 ns to less than 512 ns                    4 us to less than 8 us
100b     512 ns to less than 1 us                      8 us to less than 16 us
101b     1 us to less than 2 us                        16 us to less than 32 us
110b     2 us to 4 us                                  32 us to 64 us
111b     Reserved                                      More than 64 us
A Quick Example
Consider the following example. On a certain link there is a root complex and two endpoint devices. Software determines that L0s Active State
Power Management is supported on the link. After polling the individual
devices for L0s exit latency information it finds that the root complex has
an exit latency of 256 nanoseconds and both endpoint devices have exit
latencies of 64 nanoseconds. Further investigation reveals that the endpoints can tolerate up to 512 nanoseconds of exit latency before risking,
for example, the possibility of internal FIFO overruns. Based upon this information software enables Active State Power Management on the link.
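The enabling decision in this example reduces to comparing the worst-case exit latency on the link against each endpoint's tolerance, as in this C sketch (the helper name is illustrative):

```c
#include <stdbool.h>

/* Decide whether L0s may be enabled: the endpoint's exit-latency
 * tolerance must cover the worst exit latency it can observe on
 * the link. All values are in nanoseconds. */
bool l0s_ok(unsigned rc_exit_ns, unsigned ep_exit_ns, unsigned ep_tolerance_ns)
{
    unsigned worst = rc_exit_ns > ep_exit_ns ? rc_exit_ns : ep_exit_ns;
    return worst <= ep_tolerance_ns;
}
```

With the numbers above, the worst latency is the root complex's 256 nanoseconds, which is within the 512-nanosecond tolerance, so L0s may be enabled.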
Packet                        Type       Use
PM_PME                        TLP MSG    Software Directed PM
PME_Turn_Off                  TLP MSG    Software Directed PM
PME_TO_Ack                    TLP MSG    Software Directed PM
PM_Active_State_Nak           TLP MSG    Active State PM
PM_Enter_L1                   DLLP       Software Directed PM
PM_Enter_L23                  DLLP       Software Directed PM
PM_Active_State_Request_L1    DLLP       Active State PM
PM_Request_Ack                DLLP       Software Directed/Active State PM
The PM_PME Transaction Layer Packet (TLP) message is software-compatible with the PME mechanism defined by the PCI Bus Power Management Interface Specification, Revision 1.1. PCI Express devices use this power management event to request a change in their power management state. In most cases the PM_PME event is used to revive the system from a previously entered lower power state (L1). If a link is in the L2 state, you must use an out-of-band mechanism to revive the system since the link is no longer communicating. The out-of-band signal takes the form of an optional WAKE# pin defined in the PCI Express Card Electromechanical Specification.
Root complexes and switches send the PME_Turn_Off TLP message to their downstream devices. This message informs each downstream device to discontinue the generation of subsequent PM_PME messages and prepare for the removal of main power and the reference clocks (the message informs them to enter the L2/L3 Ready state). PCI Express devices are required to not only accept the PME_Turn_Off message, but also to acknowledge that the message was received. Downstream devices reply with the PME_TO_Ack TLP message to acknowledge the PME_Turn_Off TLP message from a root complex or a switch.
Downstream devices send the PM_Enter_L23 Data Link Layer Packet (DLLP) to inform the root complex or switch that a downstream device has made all preparations for the removal of main power and clocks and is prepared to enter the L2 or L3 state. As soon as the root complex or switch receives the PM_Enter_L23 Data Link Layer Packet, it responds back to the downstream component with the PM_Request_Ack DLLP, which acknowledges the device's preparation for the L2 or L3 state, whichever the case may be. After all downstream devices have reported their preparation for entry into L2 or L3, the main power and clocks for the system can be removed. A root complex or switch that sends the PME_Turn_Off TLP message to its downstream PCI Express devices must not initiate entry into the L2 or L3 state until each downstream device sends the PM_Enter_L23 DLLP.
Downstream devices also send the PM_Enter_L1 DLLP to inform the root complex or switch that a downstream device has made all preparations for turning off the internal phase-locked loop circuit and is prepared to enter the L1 state. Downstream devices send this packet in response to software programming the device to enter a lower power state. As soon as the root complex or switch receives the PM_Enter_L1 Data Link Layer Packet, it responds back to the downstream component with the PM_Request_Ack DLLP, which acknowledges the device's preparation for the L1 state. At that point the link is fully transitioned to the L1 Link state.
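The turn-off handshake described above can be modeled as a small state machine. The message and state names follow the text; the structures themselves are illustrative.

```c
#include <stdbool.h>

enum dev_pm_state { DEV_ON, DEV_PME_ACKED, DEV_L23_READY };

struct downstream { enum dev_pm_state state; };

/* Each downstream device acknowledges PME_Turn_Off with PME_TO_Ack,
 * then signals readiness with the PM_Enter_L23 DLLP. */
void recv_pme_turn_off(struct downstream *d) { d->state = DEV_PME_ACKED; }
void send_pm_enter_l23(struct downstream *d) { d->state = DEV_L23_READY; }

/* The root complex or switch may initiate entry into L2/L3 (and main
 * power may be removed) only once every downstream device has sent
 * PM_Enter_L23. */
bool may_remove_power(const struct downstream devs[], int n)
{
    for (int i = 0; i < n; i++)
        if (devs[i].state != DEV_L23_READY)
            return false;
    return true;
}
```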
Chapter 12
PCI Express Implementation
This chapter touches on some of the basics of PCI Express implementation. It begins with some examples of chipset partitioning, explaining how PCI Express could be used in desktop, mobile, or server environments. The rest of the chapter identifies some of the ways that PCI Express lives within, or can expand, today's computer systems. This focuses on example connectors and add-in cards, revolutionary form factors, and system-level implementation details such as routing constraints.
Chipset Partitioning
PCI Express provides a great amount of flexibility in the ways that it can
be used within a system. Rather than try to explain all the various ways
that this architecture could be used, this section focuses on how the
chipset may implement a PCI Express topology. Generically speaking, the
chipset is the way that the CPU talks to the rest of the components
within a system. It connects the CPU with memory, graphics, I/O components, and storage. As discussed in Chapter 5, a common chipset division is to have a (G)MCH and an ICH. The GMCH (Graphics & Memory Controller Hub) connects the CPU to system memory, graphics (optionally), and to the ICH. The ICH (I/O Controller Hub) then branches out to
communicate with generic I/O devices, storage, and so on.
How exactly could a chipset like this make use of PCI Express? First,
recall the generic PCI Express topology discussed in Chapter 5 and
shown in Figure 12.1.
Figure 12.1    Generic PCI Express Topology (the root complex connects the CPU and memory to PCI Express links leading to a PCI Express endpoint, a PCI Express-to-PCI bridge for PCI/PCI-X, and a switch that fans out to legacy and PCI Express endpoints)
These examples are just that, examples, and actual PCI Express designs may or may not be implemented as such.
Desktop Partitioning
Desktop chipsets generally follow the (G)MCH and ICH divisions discussed above. An example PCI Express topology in the desktop space is
shown in Figure 12.2.
Figure 12.2    Example Desktop Partitioning (the (G)MCH connects the CPU and memory, with separate PCI Express domains for graphics, for GbE, and for the ICH; the ICH in turn provides USB 2.0, SATA hard disk, PCI add-in slots, SIO, PCI Express add-in slots, and a motherboard-down PCI Express device)
In the above hypothetical example, the GMCH acts as the root complex, interacting with the CPU and system memory, and fanning out to three separate hierarchy domains. One goes to the graphics device or connector, the second domain goes to the GbE (Gigabit Ethernet) LAN device or connector, and the third domain goes to the ICH (domain identifying numbers are arbitrary). The connection to the ICH may occur via a
direct connection on a motherboard or through several connectors or
cables if the GMCH and ICH reside on separate boards or modules (more
on this later in the chapter). Recall that this is a theoretical example only,
and actual PCI Express products may or may not follow the topology
breakdowns described here.
In this example, the chipset designers may have identified graphics
and Gigabit Ethernet as high priority devices. By providing them with
separate PCI Express domains off of the root complex, it may facilitate
flow control load balancing throughout the system. Thanks to the traffic
classes and virtual channels defined by the specification, it would be possible to place all these devices on a single domain and prioritize traffic via
those specified means. However, if both graphics and Gigabit Ethernet
require large amounts of bandwidth, they may compete with each other
and other applications for the available flow control credits and physical
link transmission time. Separating these devices onto separate domains
may facilitate bandwidth tuning on all domains.
Naturally, the downside to this possibility is that the GMCH/root
complex is required to be slightly larger and more complex. Supporting
multiple domains requires the GMCH to implement some arbitration
mechanisms to efficiently handle traffic flow between all three PCI Express domains, the CPU and main memory interfaces. Additionally, the
GMCH needs to physically support PCI Express logic, queues, TX and RX
buffers, and package pins for all three domains. For these reasons, it may
be just as likely that the Gigabit Ethernet connection is located off of the
ICH instead of the GMCH.
Since graphics tends to be a bandwidth-intensive application, the GMCH may implement a x16 port for this connection. This allows for a maximum of 16 x 250 megabytes per second = 4 gigabytes per second in each direction. The graphics device may make use of this port via a direct connection down on the motherboard or, more likely, through the use of a x16 PCI Express connector (more on PCI Express connectors later in this chapter). Through this connector, a graphics path is provided that is very similar to today's AGP (Accelerated Graphics Port) environment, but provides additional bandwidth and architectural capabilities.
Gigabit Ethernet bandwidth requirements are much less than those for graphics, so the GMCH may only implement a x1 port for this connection. This allows for a maximum of 1 x 250 megabytes per second = 250 megabytes per second in each direction. The Gigabit Ethernet device may make use of this port via a x1 connector or may be placed down on the motherboard and tied to the GMCH directly.
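Both figures follow from the 250 megabytes per second of per-lane, per-direction bandwidth of a first-generation link (2.5 gigabits per second, 8b/10b encoded); a trivial helper makes the scaling explicit (the function name is illustrative).

```c
/* Peak per-direction bandwidth of a first-generation PCI Express
 * link: 2.5 Gbit/s per lane, 8b/10b encoded, giving 250 Mbyte/s of
 * symbol bandwidth per lane per direction. */
unsigned link_mbytes_per_sec(unsigned lanes)
{
    return lanes * 250;
}
```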
The bandwidth requirements for the PCI Express connection between the GMCH and ICH depend mostly on the bandwidth requirements for the devices attached to the ICH. For this example, assume that
the bandwidth needs of the ICH can be met via a x4 PCI Express connection.
Mobile Partitioning
Mobile chipsets also tend to follow the (G)MCH and ICH divisions discussed above. An example PCI Express topology in the mobile space is
shown in Figure 12.3. Again, this is a hypothetical example only, and actual PCI Express products may or may not follow the topology breakdowns described here.
Figure 12.3 An example PCI Express topology for a mobile platform: the (G)MCH connects the CPU, memory, graphics, and the ICH; the ICH fans out to USB 2.0, PCI, SATA (HDD), PCMCIA, SIO, a motherboard-down PCI Express device, add-in connectors, and a PCI Express docking connection.
This example looks remarkably similar to the desktop model just discussed. The GMCH still acts as the root complex, interacting with the
CPU and system memory, and fanning out to three separate hierarchy
domains. One still goes to the graphics device or connector and another
still goes to the ICH. The only noticeable difference between Figure 12.2
and Figure 12.3 is that the mobile platform has identified one of the
GMCH/root complex's domains for docking, whereas the desktop model
had identified it for Gigabit Ethernet. If the GMCH does not supply a x1 port
for Gigabit Ethernet (desktop) or docking (mobile), that functionality
would likely be located on the ICH.
Just like on the desktop model, mobile graphics tends to be a bandwidth-intensive application, so the GMCH may implement a x16 port for this connection (with a maximum of 4 gigabytes per second in each direction). The graphics device may make use of this port via a mobile-specific x16 connector or, more likely, through a direct connection if it is
placed on the motherboard. Docking bandwidth requirements are much
less than those for graphics, so the GMCH may only implement a x1 port
for this connection (with a maximum of 250 megabytes per second in
each direction). PCI Express allows for a variety of docking options due
to its hot-plug and low-power capabilities.
As on the desktop model, the bandwidth requirements for the PCI
Express connection between the GMCH and ICH depend mostly on the
bandwidth requirements for the devices attached to the ICH. For this example, assume that the bandwidth needs of the ICH can still be met via a
x4 PCI Express connection (with a maximum of 1 gigabyte per second in
each direction). In order to prioritize and differentiate between the various types of traffic flowing between the GMCH and ICH, this interface
likely includes support for multiple traffic classes and virtual channels.
The ICH in this example is also almost identical to that in the desktop
model. It continues to act as a switch that fans out the third PCI Express
domain. The three (downstream) PCI Express ports shown on the ICH
are likely x1 ports. These provide high speed (250 megabytes per second
maximum each way) connections to generic I/O functions. In the example shown in Figure 12.3, one of those generic I/O functions is located
on the motherboard, while the other two ports are accessed via x1 connectors. The x1 connectors used for a mobile system are obviously not
going to be the same as those used in a desktop system. There will likely be specifications that define mobile-specific add-in cards, similar to mini-PCI (from the PCI-SIG) or the PC Card (from PCMCIA) in existing systems.
The PCI Express specification provides a great amount of flexibility in the types of connectors and daughter cards that it can support.
For example, the power and voltage requirements in a mobile system and
for a mobile x1 PCI Express connector likely need to meet very different standards than those used in a desktop environment. Since PCI Express is AC coupled, a wide range of options exists for the common
mode voltages required by a PCI Express device.
This example demonstrates another of the benefits of PCI Express
functionality across multiple segment types. The GMCH and ICH used in
the desktop model could, in fact, be directly reused for this mobile
model. Even though the x1 port off the GMCH is intended as a Gigabit
Ethernet port for desktops, it could just as easily be a x1 docking port for
mobile systems. Since PCI Express accounts for cross-segment features
such as hot-plugging and reduced power capabilities, it can span a wide
variety of platforms.
Server Partitioning
Server chipsets generally follow the MCH and ICH divisions discussed
above, with the difference being that the MCH generally has more I/O
functionality than a desktop or mobile MCH. An example PCI Express topology in the server space is shown in Figure 12.4.
Figure 12.4 An example PCI Express topology for a server platform: the chipset (MCH) connects to two CPUs and memory, with PCI Express ports to a dual Gigabit Ethernet (GbE) device, add-in connectors, an InfiniBand (IBA) device on a switched fabric, a PCI-X bridge, and an I/O processor (RAID), alongside SATA, LPC, and USB2 functions.
In the above example, the MCH acts as the root complex, interacting
with the CPU and system memory, and fanning out to multiple hierarchy
domains. In this example, the MCH has implemented three x8 interfaces, but can support each as two separate x4 ports. The MCH is running one of the interfaces as a x8 port and the other two as x4 ports, providing a total of five PCI Express ports (four x4 ports and a x8
port). The full x8 port is connected to an Infiniband device. The second
x8 port splits into two x4 ports, with one x4 port connected to a PCI-X
bridge and the other x4 port connected to an I/O Processor (RAID: Redundant Array of Independent Disks controller). The third x8 port is also
split into two x4 ports, with one x4 port going to a dual Gigabit Ethernet
part and the other x4 port going to a connector.
In this example, the dual Gigabit Ethernet, RAID controller, PCI-X
bridge, and generic add-in connector are each provided a x4 port (with a
maximum of 1 gigabyte per second in each direction). If a function, such
as the PCI-X bridge, requires more bandwidth, this platform is flexible
enough to accommodate that need. The system designer could provide
that function with a full x8 connection (with a maximum of 2 gigabytes
per second in each direction) if they were willing to sacrifice one of the
other x4 ports (that is, a generic add-in x4 port). The example shown
here has prioritized the Infiniband device by providing it with a full x8
port, rather than providing an additional x4 port.
This example further demonstrates the great flexibility that PCI Express offers. The chipset designers have simply provided three x8 PCI
Express interfaces, but have allowed a wide variety of implementation
options. Depending on the platform's needs, those x8 interfaces could be
configured as identified here or in a much different manner. If a system
does not need to provide PCI Express or PCI-X connectors, this same
chip could be used to provide three full x8 interfaces to RAID, Gigabit
Ethernet, and Infiniband. Nor do the chip designers need to identify
ahead of time if the port is used on the motherboard, through a single
connector on the main board, or through a riser connector in addition to
the card connector. PCI Express inherently allows for all of those options. In the above example, any one of the identified functions could be
located directly down on the main board, through a connector on the
main board, up on a riser, or through a connector located on a riser.
One important item to note at this point is that PCI Express does not require that larger interfaces be divisible into multiple smaller ports. The chipset designers in this example could have simply
implemented three x8 ports and not supported the bifurcation into multiple x4 ports. Each PCI Express port must be able to downshift and
run as a x1 port, but that does not mean that a x8 port needs to run as 8
separate x1 ports. Implementing multiple port options as discussed here
is an option left to the chip designers.
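The bifurcation choices described for this hypothetical MCH can be modeled in a few lines. This sketch is illustrative only; the function and the configuration flags are invented for the example and do not correspond to any real chipset interface:

```python
# Each of the three x8 interfaces can run either as one x8 port or as
# two separate x4 ports, mirroring the server example above.
def bifurcate(split_flags):
    """split_flags: one bool per x8 interface; True = run as two x4 ports."""
    ports = []
    for split in split_flags:
        ports.extend([4, 4] if split else [8])
    return ports

ports = bifurcate([False, True, True])
print(ports)       # [8, 4, 4, 4, 4] -> five ports: one x8 and four x4
print(sum(ports))  # 24 lanes total, matching three x8 interfaces
```

Running all three interfaces unsplit would instead yield three full x8 ports, as in the RAID/Gigabit Ethernet/InfiniBand variation described above.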
Form Factors
PCI Express can be used in a variety of form factors and can leverage existing infrastructure. Motherboards, connectors, and cards can be designed to incorporate existing form factors such as ATX/ATX in the
desktop space, or rack mount chassis in the server space. This is shown
in Figure 12.5.
Figure 12.5 An example ATX motherboard layout showing a CNR connector, a shared connector, and a PCI connector alongside the PCI Express connectors.
In the example shown in Figure 12.5, the ATX motherboard has incorporated five connectors using a total of four expansion slots in the chassis. This design includes two PCI slots, one of which shares an
expansion slot with the CNR (Communication and Networking Riser)
connector. In addition to these three connectors, there is also a x1 PCI
Express connector along with a x16 PCI Express connector. The PCI Express connectors are offset (from the back edge of the chassis) by a different amount than CNR, PCI or AGP connectors. Additionally, PCI
Express connectors and cards are keyed differently than other standards.
Neither of these modifications inhibits PCI Express from properly meeting ATX/ATX expansion slot specifications. Rather, these modifications
are needed to prevent improper insertion of non-PCI Express cards into
PCI Express connectors, and vice versa.
Similarly, PCI Express can meet existing form factor requirements in
both the mobile and server space. The electrical specifications for the in-
Modular Designs
Because of its flexibility, PCI Express is not necessarily confined to existing form factors. It can be used to help expand new concepts in form
factors, and help in evolutionary and revolutionary system designs. For
example, PCI Express can facilitate the use of modular or split-system designs. The system core can be separated from peripherals and add-in
cards, and be connected through a PCI Express link. For the desktop
chipset shown in Figure 12.2, there is no reason that the GMCH and ICH
need to be located on the same motherboard. A system designer could
decide to separate the ICH into a separate module, then connect that
module back to the GMCH's module via a PCI Express connection. Naturally, PCI Express electrical and timing requirements would still need to
be met, and the connectors and/or cables needed for such a design
would need extensive simulation and validation. Example modular designs are shown in Figure 12.6.
Figure 12.6 Example modular designs.
Connectors
In order to fit into existing form factors and chassis infrastructure, PCI
Express connectors need to be designed to meet the needs of today's system environment. Since PCI Express is highly scalable, however, it also
needs to have connectors flexible enough to meet the variety of functions that PCI Express can be used for. As such, PCI Express does not inherently require a single connector. Rather, connector standards are
likely to emerge that define connectors and cards for a variety of different
needs. There is already work being done on generic add-in cards for
desktop, mini-PCI and PC Card replacements for communications and
mobile, and modules for server systems. Generic add-in cards are likely to
use the connector family shown in Figure 12.7.
Figure 12.7 The generic add-in connector family: x1, x4, x8, and x16 connectors.
These connectors are simple through-hole designs that fit within the
existing ATX/ATX form factor. The scalable design allows for connectors from x1 up to x16. The cards associated with these connectors use
the existing PCI I/O bracket and follow PCI card form factor requirements for height (standard versus low profile) and length (half versus
full). The connectors are designed in a modular manner such that each
successively larger connector acts like the superset connector for its
smaller brethren. For example, the x8 connector has all the same connections (in the same places) as the x4 connector, but then adds the four
additional lanes to the end of the connector. This unique design allows
PCI Express connectors to support multiple card sizes. For example, a
x8 connector can support x1, x4 as well as x8 cards. This flexibility is
shown in Table 12.1.
Table 12.1 Card and Slot Interoperability

Card     x1 Slot     x4 Slot     x8 Slot     x16 Slot
x1       Yes         Yes         Yes         Yes
x4       No          Yes         Yes         Yes
x8       No          No          Yes         Yes
x16      No          No          No          Yes
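Table 12.1 reduces to a simple rule: a card fits any connector of equal or greater width. A minimal sketch of that rule (the function name is invented for this illustration):

```python
WIDTHS = (1, 4, 8, 16)

def card_fits(card_width, slot_width):
    """A PCI Express connector accepts any card of equal or smaller width."""
    return card_width <= slot_width

# Reproduce Table 12.1: rows are cards, columns are slots (x1 x4 x8 x16).
for card in WIDTHS:
    row = ["Yes" if card_fits(card, slot) else "No" for slot in WIDTHS]
    print(f"x{card:<3}", row)
```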
A system that implements a x16 connector can support all four card
sizes, but this does not necessarily mean that the interface will run at all
four port widths. If the motherboard uses a x16 connector, the chip attached to that connector is likely to support a port width of x16 (since it
does not make much sense to use a connector larger than the port attached to it). Following the specification, that port also needs to be able
to downshift and run as a x1 port. Whether that port can also run as a x4
port and/or a x8 port is dependent on the implementation details of that
chip.
The ability to support multiple connector sizes, as well as card sizes
within each connector, poses some interesting problems for shock and
vibration. In addition to the connector and card/module standards that
are to emerge for PCI Express, there is a need for new retention mechanisms for those cards and connectors. The retention mechanisms currently in use (for example, with AGP) are not necessarily well suited for
the shock and vibration issues that face PCI Express.
As mentioned in previous chapters, these connectors are very similar,
in terms of materials and manufacturing methods, to those used for conventional PCI. By using the same contact style and through-hole design,
the manufacturing costs are less than they would be for a completely
new connector design. Additionally, the same processes for securing
connectors to the printed circuit board can be reused.
Since PCI Express connectors vary in length (in relationship to the
maximum supported link width), connector and card costs are also likely
to vary. For generic add-in support, akin to the multiple PCI connectors
found in existing desktop systems, system designers are likely to use a x1
connector (providing a maximum of 250 megabytes per second in each
direction, or 500 megabytes per second of total bandwidth). Not only
does this provide increased bandwidth capabilities (PCI provides a theoretical maximum of 133 megabytes per second in total bandwidth), but it
uses a smaller connector as well. The smaller x1 PCI Express connector
should help motherboard designs by freeing up additional real estate for
component placement and routing. Since PCI Express requires a smaller
connector than PCI, there are also some potential material savings from a
manufacturing standpoint. Figure 12.8 shows the comparative size of a
x8 PCI Express connector.
Figure 12.8 The comparative size of a x8 PCI Express connector.
Presence Detection
The PCI Express connectors shown here provide support for presence
detection. Specific presence detection pins, located throughout the connector, allow the motherboard to determine if and when a card is inserted or removed. This allows the motherboard to react properly to
these types of events. For example, a motherboard may gate power delivery to the connector until it is sure that the card is fully plugged in. Alternatively, the presence detect functionality may be used to log an error
event if a card is unexpectedly removed.
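The motherboard policies described above can be sketched as simple event handling. This is purely a software analogy: real presence detection is implemented by dedicated connector pins and platform hardware/firmware, and the class and method names here are invented for the illustration:

```python
# Hypothetical sketch of the policies in the text: gate power delivery
# until a card is fully seated, and log unexpected removals.
class Slot:
    def __init__(self):
        self.card_present = False
        self.powered = False

    def on_presence_change(self, present):
        if present and not self.card_present:
            self.card_present = True
            self.powered = True      # card fully plugged in: apply power
        elif not present and self.card_present:
            self.card_present = False
            self.powered = False
            print("error: card unexpectedly removed")  # log the event

slot = Slot()
slot.on_presence_change(True)
print(slot.powered)   # True
slot.on_presence_change(False)
```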
Routing Implications
Stackup
Advances in computer-based electronics, especially cutting edge advances, often require the advancement of printed circuit board (PCB)
manufacturing capabilities. This is usually needed to accommodate new
requirements for electrical characteristics and tolerances.
The printed circuit board industry uses a variety of glass laminates to
manufacture PCBs for various industries. Each laminate exhibits different
electrical characteristics and properties. The most common glass laminate used in the computer industry is FR4. This glass laminate is preferred because it has good electrical characteristics and can be used in a
wide variety of manufacturing processes. Processes that use FR4 have
relatively uniform control of trace impedance, which allows the material
to be used in systems that support high speed signaling. PCI Express does
not require system designers to use specialized glass laminates for
printed circuit boards. PCI Express can be implemented on FR4-based
PCBs.
The majority of desktop motherboard and add-in card designs are
based on a four-layer stackup to save money on system fabrication costs.
A traditional four-layer stackup consists of a signal layer, a power layer, a
ground layer, and another signal layer (see Figure 12.9). There is significant cost associated with adding additional signal layers (in multiples of
two to maintain symmetry), which is usually highly undesirable from a
desktop standpoint. Due to increased routing and component density,
mobile and server systems typically require stackups with additional layers. In these designs, there are often signal layers in the interior portion
of the board to alleviate much of the congestion on the outer signal layers. Signal routing on internal layers (referred to as stripline) has different electrical characteristics than routing on external layers (referred to as micro-strip). PCI Express electrical requirements are specified to accommodate either type of routing.
Figure 12.9 A traditional four-layer stackup: signal, power, ground, and signal layers separated by glass laminate dielectric. Four-layer stackups are used primarily in the desktop computer market, which equates to approximately 70 percent of overall computer sales.
Routing Requirements
As discussed in Chapter 8, PCI Express uses differential signaling. This
requires that motherboards and cards use differential routing techniques.
Routing should target 100 ohms differential impedance. The PCB stackup
(micro-strip versus stripline, dielectric thickness, and so on) impacts
what trace thickness and spacing meet that target. For micro-strip routing on a typical desktop stackup, 5-mil-wide traces with 7-mil spacing to a differential partner and 20-mil spacing to other signals (5-7-20) meet the 100-ohm differential target.
From a length-matching perspective, PCI Express offers some nice
advances over parallel busses such as conventional PCI. In many instances designers have to weave or snake traces across the platform in
order to meet the length-matching requirement between the clock and
data signals for a parallel bus. This is needed to ensure that all the data
and clocks arrive at the receiver at the same time. The length-matching
requirements of parallel busses, especially as bus speeds increase, come
at a high cost to system designers. The snaking required to meet those
requirements leads to extra design time as well as platform real estate, as
shown on the left side of Figure 12.10. Since each PCI Express lane uses
8-bit/10-bit encoding with an embedded clock (refer to Chapter 8), the
lanes' length-matching requirements are greatly relaxed. A PCI Express
link can be routed without much consideration for length matching the
individual lanes within the link. This is shown on the right side of Figure
12.10.
Figure 12.10 The left side shows a parallel bus routing example where the traces are snaked to length-match
them to the clock in order to guarantee data and clock arrive simultaneously.
Right side shows a PCI Express routing solution. Note that the freedom from
length matching frees up board space and simplifies the routing.
Polarity Inversion
PCI Express offers several other interesting items to facilitate the routing.
One example of this is the support for polarity inversion. PCI Express
devices can invert a signal after it has been received if its polarity has
been reversed. This occurs if the TX+ pin of one device is connected to
the RX- pin of its link-mate. As discussed in Chapter 8, polarity inversion
is determined during link initialization.
Polarity inversion may occur due to a routing error, or it may be deliberate to facilitate routing. For example, as shown in Figure 12.11, the
natural alignment between these two devices has the D+ of one device
aligned with the D- of its link-mate (naturally one D+ would be a TX
while the other would be an RX). In this scenario, the system designer
may want to purposely use polarity inversion to simplify the routing. Trying to force the D+ of one device to connect to the D+ of the other
would force a crisscross of the signals. That crisscross would require an
extra layer change and would force the routing to be non-differential for
a time. Polarity inversion helps to simplify the routing.
Figure 12.11 Logical inversion: the D+ of one PCI Express device aligns with the D- of its link-mate, so the received bit stream (0101...) is logically inverted to recover the transmitted stream (1010...).
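The receiver-side fix-up can be sketched as simple bit inversion. This is a conceptual model only; in real hardware, polarity inversion is detected and applied in the Physical Layer during link training (see Chapter 8), and the function below is invented for this illustration:

```python
def receive(bits, polarity_inverted):
    """Conceptual receiver: if D+ and D- arrived swapped, every bit is
    complemented on the wire, so the receiver simply inverts the stream."""
    if polarity_inverted:
        return [1 - b for b in bits]
    return list(bits)

sent = [1, 0, 1, 0]
wire = [1 - b for b in sent]                  # swapped D+/D- complements each bit
print(receive(wire, polarity_inverted=True))  # [1, 0, 1, 0] -- recovered
```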
Lane Reversal
Lane reversal is another technique that PCI Express offers to facilitate the
routing. Lane reversal allows a port to essentially reverse the ordering of
its lanes. For instance, if a port is a x2 port, lane 0 may be at the top of
the device with lane 1 at the bottom, as shown in Figure 12.12. If a device supports lane reversal, it can reverse its lane ordering and have lane
1 act like lane 0 and lane 0 act like lane 1.
Figure 12.12 A x2 port with lane reversal: Device B reverses its lane ordering (lane 1 acting as lane 0, and lane 0 as lane 1) to connect to Device A.
Why would this be useful? As with polarity inversion, the natural alignment between devices may be such that lane 0 of Device A does not line up with lane 0 of Device B. Rather than force the connection
of lane 0 to lane 0, forcing a complete crisscross of the interface (referred
to as a bowtie), lane reversal allows for an easier and more natural routing. This is shown in Figure 12.13.
Figure 12.13 Lane reversal allows Device A and Device B to connect without a complete crisscross (bowtie) of the interface.
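Lane reversal amounts to remapping logical lane numbers onto physical lanes. A sketch of the x2 example above (conceptual only; the function is invented, and real lane reversal is negotiated in the Physical Layer during training):

```python
def assign_lanes(width, reversed_order):
    """Map logical lane number -> physical lane number for a port.
    With lane reversal, logical lane 0 uses the physically last lane."""
    physical = list(range(width))
    if reversed_order:
        physical.reverse()
    return {logical: phys for logical, phys in enumerate(physical)}

print(assign_lanes(2, reversed_order=False))  # {0: 0, 1: 1}
print(assign_lanes(2, reversed_order=True))   # {0: 1, 1: 0}
```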
AC Coupling
PCI Express signals are AC coupled to eliminate the DC Common Mode
element. By removing the DC Common Mode element, the buffer design
process for PCI Express becomes much simpler. Each PCI Express device
can also have a unique DC Common Mode voltage element, eliminating
the need to have all PCI Express devices and buffers share a common
voltage.
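A series coupling capacitor acts, together with the receiver termination, as a high-pass filter: DC is blocked while the high-speed data stream passes. A back-of-the-envelope sketch using the standard RC corner formula f = 1/(2πRC); the 100-nanofarad and 50-ohm values below are illustrative assumptions, not values from the PCI Express specification:

```python
import math

def highpass_corner_hz(c_farads, r_ohms):
    """Corner frequency of the RC high-pass formed by a series AC
    coupling capacitor into the receiver termination: f = 1/(2*pi*R*C)."""
    return 1.0 / (2 * math.pi * r_ohms * c_farads)

f = highpass_corner_hz(100e-9, 50)  # assumed: 100 nF into 50 ohms
print(f"{f / 1e3:.1f} kHz")         # ~31.8 kHz corner: DC is removed while
                                    # the gigabit-rate data passes untouched
```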
This impacts the system in several ways. First, it requires AC coupling
capacitors on all PCI Express traces to remove the common mode voltage
element. As can be seen near the connector in Figure 12.10, each PCI
Express signal has a discrete series AC capacitor on it (capacitor packs
Chapter 13

PCI Express Timetable
This chapter looks more closely at the timeline for products based on
PCI Express to enter the market. Several factors play a role in the introduction of applications. This chapter looks at the factors that affect
adoption and discusses the profiles, benefits, challenges, and tools for
early adopters as well as late adopters of the technology.
Anticipated Schedule
The applications that can take advantage of the benefits of PCI Express
will be the first to enter the market. Graphics, Gigabit Ethernet, IEEE
1394, and high-speed chip interconnects are a few examples of the types
of applications to adopt PCI Express.
Sufficient Resources
A few companies have sufficient resources to develop the necessary intellectual property and building blocks for a PCI Express interface internally. Larger companies can afford to absorb the development costs and
time. For example, Intel is a large company that can afford to develop
PCI Express building blocks to be used across multiple divisions and
markets. Intel plans to offer a wide range of products and support across
multiple market segments that use the PCI Express architecture. Through
Market Dynamics
Market dynamics also play a major role in the adoption of PCI Express
applications. Compare the differences between the graphics suppliers
and analog modem suppliers. In the case of the graphics market, the end
user and customers continually drive for greater performance. Due to the
market demand, graphics suppliers such as ATI and nVidia capture additional value through higher average selling prices by providing the latest
technology and performance over the existing technology. If a graphics
supplier can show demonstrable performance gains with PCI Express
over AGP8x, that supplier will likely capture more value and more revenue for their latest product as the older technology continues to experience price erosion. The immediate realization in revenue plays a major
role in recouping the development costs. The analog modem market
Figure 13.1 The product life cycle: sales and profit over time through the birth, growth, maturity, and decline stages.
When new products enter the market, it takes some time before sales
and profits ramp. As products near the growth stage, sales result in profits as the volumes start to become the largest contributing factor. After
achieving a peak volume, products eventually enter the decline stage
where the profits and sales decline. The challenge a business faces is determining where the industry is on the ideal curve.
Industry Enabling
Industry enabling and collaboration between companies and within SIGs
(Special Interest Groups) such as the PCI-SIG will have a significant impact on the adoption of new technologies. As with many new technology
introductions, the term bleeding edge is more commonly used than
leading edge to describe the environment. The problem initial implementations face is the lack of a standard test environment or multiple devices
in the market to test against for compliance. This manifests itself in two ways: the misinterpretation of the specification, resulting in fundamental
errors within the design, and the failure to optimize to vendor-specific
implementations.
The specification's goal is to adequately define a standard so that devices can interoperate, yet allow enough range for specific vendors to optimize based on application requirements. In spite of every effort to make
the specification clear, implementers are often forced to make assumptions and tradeoffs during the implementation. The risk of developing
devices that do not operate grows as multiple vendors come to different
conclusions and create incompatibilities that cause errors within the design. Another risk initial implementations face is fundamental errors in interpreting the specification. Over time, the compliance tests and
verification programs within the SIG identify the common pitfalls and
typically publish frequently asked questions (FAQs) or clarifications to
the specification. In this way, the SIG reduces the risk of mistakes in later designs.
Another risk is that vendors may optimize within the range of the
specification. If a supplier provides a portion of the solution (for instance, a graphics device) but does not make the same assumptions as the
host, the overall system will not necessarily be optimized. In this example, although the implementation may not be a specification violation,
the two parts may not operate very well with one another. One example
left open in the specification is maximum packet size. The PCI Express
specification allows a maximum packet size anywhere from 128 bytes to 4096 bytes, supporting a wide variety of applications. If the
host decided to support 128 bytes, but the attach point such as a Gigabit
Ethernet controller or graphics device optimized for 4096 bytes, the performance of the overall system could be significantly less than optimal.
Some companies may opt for a wait-and-see strategy rather than continually modify their design to achieve compatibility as new devices enter the
market.
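A mismatch like the 128-byte versus 4096-byte example above is typically resolved by both sides falling back to the smaller supported maximum. The helper below is an illustrative sketch of that idea, not the configuration mechanism defined by the specification, and the function name is invented:

```python
VALID_SIZES = (128, 256, 512, 1024, 2048, 4096)  # bytes

def negotiated_payload(host_max, device_max):
    """Illustrative policy: traffic can only use a packet size both
    sides support, so it falls back to the smaller maximum."""
    if host_max not in VALID_SIZES or device_max not in VALID_SIZES:
        raise ValueError("not a valid maximum packet size")
    return min(host_max, device_max)

print(negotiated_payload(128, 4096))  # 128: the device's 4096-byte
                                      # optimization buys nothing here
```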
The PCI-SIG and technology leaders such as Intel play a key role in overcoming this obstacle through compliance workshops, co-validation, and developer forums. Intel, as an example, launched the Intel Developer Network for PCI Express (www.intel.com/technology/pciexpress/index.htm) as a way for multiple companies to work together to create compatible implementations and a common understanding of PCI Express.
The Intel Developer Network for PCI Express is a website hosted by Intel
which provides tools, white papers, and a means to contact Intel's experts.
Companies claiming to lead the technological revolution cannot afford to be caught without a flagship PCI Express product.
Other business models are established on the role of enabling the industry. Intellectual property providers offer designs, ranging from core logic to PCI Express Physical Layer transceiver (PHY) blocks, to companies that would not otherwise choose to develop a PCI Express core internally. In this business model, the intellectual property provider develops the necessary cores and sells them to other companies in exchange for upfront payments or royalties. As PCI Express proliferates and competition increases, the value of an intellectual property offering declines with time.
Intellectual property providers will therefore rush to be first to market.
Another example of an industry focused on solving early adoption problems is the supplier base for tools such as oscilloscopes, logic analyzers, and bus analyzers. Both Agilent and Tektronix, for example, have announced tools and capabilities to help the industry validate PCI Express
implementations. The Tektronix AWG710 Arbitrary Waveform Generator, shown in Figure 13.2, generates PCI Express waveforms to enable
component suppliers to test, validate, and debug initial PCI Express implementations.
Figure 13.2 The Tektronix AWG710 Arbitrary Waveform Generator.
In addition to assumptions, initial specifications often do not adequately cover the real-world trade-offs required to implement a design. A
hypothetical example would be that a designer might spend several iterations to achieve the required timing budget in the specification. If there
is enough lobbying to the SIG body that authored the specification, the
timing specification could potentially be modified to match real world
implementations. For the early adopter, this equates to additional resources, time, money, and effort to achieve a stringent specification.
Later adopters can take advantage of the practical knowledge gained
from the experience of earlier attempts and any potential modifications
or clarifications made to the specification. In the case of errata, or identified errors in the specification, the early adopter must navigate through
the issues and modify the design as the errata are published. This is an
unavoidable challenge for the early adopter and has occurred on specifications in the past from multiple standards bodies.
The potential lack of tools and industry infrastructure is another challenge early adopters must overcome. In the early stages, PCI Express is
not expected to have a published standard test suite for a manufacturer
to use to test their product. For example, initial PCI Express validation efforts with waveform generators and oscilloscopes will be a manual process as the tool vendors ramp their capabilities. Over time, tool vendors
will expand their product offerings to provide standard automated test
suites for compliance testing.
large volumes. For example, if a component supplier elects to wait instead of developing an early PCI Express device, that component supplier
may be faced with a competitor who is further along in development.
The competitor may have gained practical knowledge from the early
stages of development towards cost optimization. As the market enters
heavy competition, the supplier who took advantage of this practical
knowledge would potentially be able to gain market share and profit over
the supplier late to production who is still working through bugs and defects. The learning curve also applies to market understanding in addition
to design optimization. Another scenario would be that the early adopter
has a better understanding of how to make the best trade-offs. For example, the market may place a premium on performance over power consumption. Through initial implementations and direct customer
feedback, the early adopter would be armed with the information to optimize for performance over power. Direct customer engagement and
experience can make the difference in several markets. Through bringing
initial products to market, companies develop a good understanding of
the market and can ensure follow-on product success.
Technology transitions are a unique opportunity because the barrier to entry for a market is lower than normal, which benefits emerging companies. A company may elect to enter a market as an early
adopter. With a technology or time to market advantage, a company may
be able to win key designs if the competitors are slow to adopt. In a stable market without the technology disruption, entrenched players with a
strong customer base make it difficult to enter a particular market segment.
testing workshops and plugfests. Figure 13.3 indicates the types of programs the PCI-SIG provides.
Figure 13.3 Types of programs the PCI-SIG provides.
The intellectual property provider market also supplies key tools for developing products for emerging technologies. Companies exploring new product plans perform a cost/benefit analysis of making versus buying the PCI Express block. The benefit of purchasing the PCI Express Physical Layer transceiver (PCI Express PHY), for example, is that it lets the component manufacturer expedite product development on the core of the device rather than spending time and resources on the PCI Express interface. Intellectual property providers typically offer tools for testing, debugging, and designing with the core block. In this example, the intellectual property providers are a tool early adopters can use.
providers, tools, and industry deploy PCI Express in volume, there will be an initial cost impact relative to PCI. The late adopter must overcome the advances the early adopters have made by being further down the learning curve; usually these companies have a fundamental difference in their operating model that supports a lower cost structure.
Applications that show little difference in migrating from PCI to PCI Express will transition later, or potentially never. For example, 10/100 Ethernet Network Interface Cards (NICs) are abundant on PCI today. Migrating from PCI (133 megabytes per second) to PCI Express (250 megabytes per second per direction) for a maximum line rate of 100 megabits per second (12.5 megabytes per second) would give few performance gains, because the performance available on PCI today is sufficient. The LAN market is quickly adopting Gigabit Ethernet as the desired networking standard, and it is unlikely that 10/100 NIC cards will migrate to PCI Express rapidly, if ever.
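The arithmetic above can be reduced to a quick utilization check. This is a minimal sketch (the helper function is hypothetical, not from any specification); the figures are the peak rates quoted in the text:

```python
# Rough link-utilization check for a 10/100 NIC on PCI vs. PCI Express x1.
# Numbers are theoretical peaks; real sustained throughput is lower.

def utilization(device_mb_s: float, bus_mb_s: float) -> float:
    """Fraction of the bus's peak bandwidth the device can consume."""
    return device_mb_s / bus_mb_s

nic_mb_s = 100 / 8        # 100 Mb/s Fast Ethernet = 12.5 MB/s
pci_mb_s = 133            # 32-bit/33 MHz PCI peak
pcie_x1_mb_s = 250        # PCI Express x1, per direction

print(f"PCI:     {utilization(nic_mb_s, pci_mb_s):.1%}")      # about 9.4%
print(f"PCIe x1: {utilization(nic_mb_s, pcie_x1_mb_s):.1%}")  # 5.0%
```

Either way the NIC uses less than a tenth of the bus, which is why the migration buys so little for this application.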
Figure 13.4 Volume ramp scenarios (y-axis: dollars; x-axis: time from introduction through 12 months, 2 years, and 4 years)
Late adopters must decide when to enter the market. As most new architectures take 12 to 18 months to develop, companies must predict the ramp accurately or risk missing the opportunity. For example, in scenario 1 of Figure 13.4, products should enter the market within 12 months of introduction, prior to the peak volume opportunity. This allows the intellectual property providers and the industry to work through some initial iterations while not burdening the late adopter with additional costs. The challenge is that if the company believes the ramp will follow scenario 2, it will develop product plans and resource plans later than required. The end result is that the company enters the market later than anticipated and misses the peak profit opportunity. As mentioned previously, late adopters also need to time the market correctly to overcome the learning-curve and brand-awareness benefits the early adopter has established.
14
Chapter
PCI Express
Product Definition
and Planning
This chapter looks more closely at the aspects behind planning and
defining PCI Express-based products from two different perspectives.
The first example represents the challenges and decisions a component
manufacturer must make in developing a PCI Express based device.
There are some unique aspects PCI Express presents to silicon device
manufacturers. The second example takes a look at the challenges and
decisions a motherboard manufacturer must make in using PCI Express
devices.
Market Assessment
As covered in Chapter 4, graphics is a unique application that has continuously evolved with faster and faster interfaces. Figure 14.1 shows the
bandwidth evolution as discussed in Chapter 4 (Note the bandwidth is
shown for one direction in Figure 14.1). PC graphics is an application
where suppliers can achieve higher prices on initial implementations to
recoup the development costs associated with new development. Rather than revisit the content of Chapter 4, the underlying assumption here is that there is a compelling reason to migrate to a higher bandwidth interface, since that has proven to be the natural evolution since the early 1990s.
Figure 14.1 Graphics interface bandwidth evolution, 1992-2004 (PCI, AGP, AGP4x, AGP8x; y-axis in MB/s, 0 to 5000)
pothetical scenarios and discusses the relevant topics in buy versus make
for:
Using an ASIC Manufacturing and Design Flow
Figure 14.2 Buy-versus-make options, with PROs and CONs per flow:
- ASIC Flow (Vendor A)
- Foundry (Vendor B). PRO: wide availability of IP providers
- Internal Fabrication (Vendor C). PROs: IP in house; IP used across multiple products; manufacturing as a core competency
still critical pieces of the design. Key factors to consider in this scenario are costs, which are typically charged on a per-unit basis plus an up-front payment, and the long-term strategic impact. Typically, ASIC flows do not offer multiple sources of intellectual property from other suppliers. ASIC flows become difficult for multiple product SKUs and multiple product generations due to their generally limited flexibility. ASIC flows usually encompass wafer and final component testing; the graphics vendor receives a finished product that is ready to ship.
The other category of graphics silicon suppliers, those who do not operate fabrication facilities, use a fabless semiconductor business model, or foundry flow. Here the tradeoff is still buy versus make for the necessary building blocks, but the details differ significantly. Vendor B, for example, operates a fabless business model and partners with foundries such as TSMC and UMC. In this business model, the graphics vendor pays for the mask set for the specific foundry's fab. The mask is used in the production facility to create wafers that contain multiple chips. At the end of the process flow, the vendor receives untested raw wafers. The foundry typically does not provide intellectual property, but it does provide the core libraries necessary to design the end product. Unlike the ASIC flow, there is a wide availability of intellectual property building blocks from multiple intellectual property suppliers.
The decision to make in this scenario is whether to buy intellectual property targeted at a specific foundry from the intellectual property suppliers, or to develop the necessary building blocks internally. This decision boils down to time to market, cost, and strategic relevance. For example, if the company can expedite product development by several months at the expense of a million dollars in fees to the intellectual property provider, this may be a worthwhile tradeoff to gain market segment share and several months of profits. Alternatively, if the vendor determines the PCI Express core is critical to its future and wants to own the rights to the intellectual property outright, it may opt to develop the cores internally. Graphics vendors typically partner with a single foundry to attempt to gain the largest bargaining position on wafer costs. Unlike the ASIC flow, the foundry model still requires the vendor to determine packaging and testing options for the end device.
Finally, Vendor C is a vertically integrated company with an established manufacturing process capability (such as Intel or SiS). The buy-versus-make decision is significantly altered if the vendor has internal fabrication facilities. In this scenario, the process manufacturing capability (the ability to build chips) is likely to be protected. The vendor
Figure 14.3 AGP4x versus AGP8x strobing: AD[31:0] data (Da10 through Da19) referenced to the 66 MHz AGPCLK, clocked by AD_STB/AD_STB# in AGP4x and by AD_STBF/AD_STBS in AGP8x; AGP8x introduced an increase in strobe sampling rate
from the previous parallel data transfer to the serial PCI Express technology.
Figure 14.4 Parallel AGP8x clocking on the 66 MHz AGPCLK versus serial PCI Express signaling at 2.5 GHz on a x1 lane
Along with the new high-speed challenges come device testing tradeoffs. Graphics suppliers want to ensure that the product that leaves their factory is of a high standard of quality so that OEMs and end users will not experience field failures. The objective of product test programs is to ensure that products leaving the factory meet a high enough standard of quality to reduce return and failure costs. Quality is typically measured in Defects per Million (DPM). To achieve these standards, the industry has embraced two methodologies: at-speed testing and structural testing. The objective of structural testing is to catch manufacturing defects. If the device has a damaged transistor, the structural test should identify the failure by detecting that the transistor did not turn on. Structural tests typically implement a scan chain, where a pattern of 0s and 1s is inserted serially into the device and the tester captures a chain of output values that is compared with the expected outcome. In the
damaged transistor example where the transistor failed to turn on, the
tester detects an error in the output.
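The scan-chain idea can be sketched in a few lines. This is a toy model (the function names and the stuck-at fault are invented for illustration, not taken from any vendor's test program): shift a stimulus in, let the logic respond, and compare the captured chain against the expected "golden" response.

```python
# Toy scan-chain model: a stimulus vector drives the device's logic and
# the captured outputs are compared bit-for-bit against a golden response.

def logic_under_test(bits, stuck_at_fault=None):
    """Healthy logic inverts each bit; a stuck-at-0 fault pins one output low."""
    out = [b ^ 1 for b in bits]
    if stuck_at_fault is not None:
        out[stuck_at_fault] = 0   # models a damaged transistor that never turns on
    return out

def scan_test(stimulus, expected, stuck_at_fault=None):
    captured = logic_under_test(stimulus, stuck_at_fault)
    return captured == expected   # PASS (True) or FAIL (False)

stimulus = [0, 1, 1, 0]
golden   = [1, 0, 0, 1]           # expected response of healthy logic

print(scan_test(stimulus, golden))                     # healthy device: True
print(scan_test(stimulus, golden, stuck_at_fault=0))   # defective device: False
```

The tester's job is exactly this comparison, performed at far larger scale and driven by the device's actual netlist.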
The objective of at-speed testing is to ensure that the device operates as expected under real-world conditions. In this scenario, the transistor may have turned on as expected in the structural test, but at-speed testing ensures that it turned on in the right amount of time. The departure from the 66 megahertz clocking domain to 2.5 gigahertz clocking will present substantial initial challenges for at-speed testing. Although these speeds are not beyond the capabilities of some testers currently on the market, the cost of outfitting an entire test floor with at-speed testers would be prohibitive. Vendors must balance cost and risk in developing sufficient test coverage plans.
Figure 14.5 Packet assembly by layer: the Transaction Layer produces Header, Data, and optional ECRC; the Data Link Layer adds the Sequence Number and LCRC; the Physical Layer adds the framing symbols
Figure 14.6 Efficiency Comparison

Max_Payload_Size (bytes)   128     256     512     1024    2048    4096
Data packet density        83.12%  90.78%  95.17%  97.52%  98.75%  99.37%
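The densities in Figure 14.6 follow from simple packet accounting: efficiency is payload divided by payload plus per-packet overhead. The 26-byte overhead constant below is an assumption that reproduces the figure's values; the exact split across framing, sequence number, header, ECRC, and LCRC depends on the header length and whether ECRC is present.

```python
# Data packet density = payload / (payload + per-packet overhead).
# OVERHEAD_BYTES = 26 is chosen to match the Figure 14.6 values;
# the precise breakdown varies with header size and ECRC usage.

OVERHEAD_BYTES = 26

def packet_density(payload_bytes: int) -> float:
    return payload_bytes / (payload_bytes + OVERHEAD_BYTES)

for size in (128, 256, 512, 1024, 2048, 4096):
    print(f"{size:5d} bytes -> {packet_density(size):.2%}")
# 128 bytes -> 83.12% ... 4096 bytes -> 99.37%, as in the figure
```

Doubling the payload halves the relative overhead, which is why the curve flattens out near 100 percent.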
Implementing an architecture optimized for a 4096-byte payload size would improve bus efficiency for large data transfers because it provides close to 100 percent data content versus overhead on a per-packet basis. For a graphics device that needs to pull large amounts of display data from main memory, the implementation that uses the largest data payload whenever possible is the most efficient at the PCI Express interconnect level. So choose 4096 bytes and be done, right?
However, there are several other factors to take into consideration. The complete system architecture needs to be comprehended. Figure 14.7 shows the main graphics subsystem of today's PC. It comprises the processor, memory controller, main system memory, the graphics controller, and the local graphics memory.
Figure 14.7 PC Graphics Subsystem: the CPU (instruction core and cache) connects over the CPU bus to the memory controller, which attaches to 64-bit DDR main memory and, over a chip-to-chip PCI Express link, to the graphics controller with its display output and 128-bit DDR graphics memory
Figure 14.8 Price versus volume
the connectors for both PCI and PCI Express clearly indicates that in the long run, PCI Express will succeed in being the lower-cost solution. The PCI connector consumes 120 pins and is roughly 84 millimeters long. The PCI Express x1 connector is much smaller at 25 millimeters long with 36 pins. Refer to Figure 14.9 for a comparison of the various connectors. Although cost parity will likely be achieved over time, the initially higher cost of the necessary components (silicon, connectors, and so on) may delay adoption in the most price-sensitive markets.
Figure 14.9 PCI Express Connector Comparisons

Connector lengths, mm (inches):
  PCI Express x1:    25.00 (0.984)
  PCI Express x4:    39.00 (1.535)
  PCI Express x8:    56.00 (2.205)
  PCI Express x16:   89.00 (3.504)
  AGP:               73.87 (2.908)
  PCI:               84.84 (3.400)
  PCI-X:            128.02 (5.040)
signal deformation that are trivial for most PCI implementations. Unlike previous technologies, vendors can no longer ignore the effects of vias, capacitive parasitics, and connectors.
Figure: A PCI Express signal route passing through the device package, PCB traces, series capacitor, via, and connector, across a board stackup of signal, prepreg, power, core, ground, prepreg, and signal layers
Conclusion
Hopefully, this book has helped show that PCI Express is an exciting new architecture that will help move both the computing and communications industries forward through the next ten years. The technology is flexible enough to span computing platforms from servers to laptops to desktop PCs, and to serve as the interconnect for Gigabit Ethernet, graphics, and numerous other generic I/O devices.
This flexibility is afforded by PCI Express's layered architecture. The three architectural layers offer increased error detection and handling capabilities, flexible traffic prioritization and flow control policies, and the modularity to scale into the future. Additionally, PCI Express provides revolutionary capabilities for streaming media, hot plugging, and advanced power management.
PCI Express does this while maintaining compatibility with much of
the existing hardware and software infrastructure to enable a smooth
transition. At the same time, it offers exciting new opportunities to develop new form factors, cards and modules, and entirely new usage
models.
For all these reasons, PCI Express truly offers an inflection point that will help computer and communication platforms evolve over the next decade. As a result, those who understand and embrace this technology now have the opportunity to help steer the direction of those industries through their evolution.
Appendix
PCI Express
Compliance &
Interoperability
Figure: PCI-Express compliance process flow. The architecture specs define the design criteria; checklists define the design validation criteria; test specs explain the assertions, test methods, and pass/fail criteria; the test tool measures the pass/fail criteria; and the test process reconciles products with the spec, yielding a PASS or FAIL result.
The checklist documents are organized by class of device: Root Complex (RC), Endpoint, Switches, Bridges, and so on. Inside each of these documents there are applicable checklist items defined at each layer (physical, link, etc.) and for functional aspects (configuration, hot-plug, power management, etc.). The test specifications break the checklist items into assertions as needed and specify the topology requirements and the algorithms to test those assertions. Test Descriptions are the basis for guiding the process of either performing manual testing or developing automated test procedures for verifying compliance and interoperability of devices and functions.
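Purely as an illustration of the structure described above (the item names, assertions, and topologies here are invented, not taken from the PCI-SIG C&I documents), the checklist-to-assertion mapping can be thought of as data that drives the test process:

```python
# Illustrative only: checklist items broken into testable assertions,
# each paired with the topology it requires. All names are invented.

checklist = {
    "ENDPOINT-LINK-001": {
        "layer": "link",
        "assertions": [
            ("replays TLP after missing ACK", "x1 link to root port"),
            ("honors flow-control credits",   "x1 link to root port"),
        ],
    },
    "ENDPOINT-PM-002": {
        "layer": "functional",
        "assertions": [
            ("enters L1 on request", "platform with PM-capable BIOS"),
        ],
    },
}

def assertions_for_layer(layer):
    """Collect every assertion defined for one layer of the checklist."""
    return [a for item in checklist.values() if item["layer"] == layer
            for a, _topology in item["assertions"]]

print(assertions_for_layer("link"))   # the two link-layer assertions
```

A test tool walks exactly this kind of mapping, executing each assertion on the required topology and recording pass/fail per assertion.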
Compliance Testing
It is expected that all major test equipment vendors will provide
compliance test tools applicable at various layers. In addition the PCI-SIG
is likely to identify a set of equipment for the compliance testing process.
A PCI-Express product vendor should check the websites listed below to
see what test equipment is relevant to their product and obtain them to
test on their premises before going to a plugfest. The following figures
show example topologies which may be used for testing at platform
(BIOS, RC) and add-in device (endpoints, switches, bridges) levels. The
actual test topologies employed in compliance testing may be different
from what is shown here but are expected to be functionally equivalent.
The Functional Compliance Test (FCT) card, or an entity providing a similar function, is intended to test at the link layer and above (including BIOS). A separate electrical tester card is intended to verify electrical compliance at multiple link widths. In all cases it is expected that the tests executed with such cards will clearly map the test results to one or a set of assertions. In addition, test equipment such as protocol analyzers and oscilloscopes is expected to support automatic checking of the results as much as possible, reducing human intervention and any associated errors.
Platform Components
This example topology is suited to test the BIOS's ability to configure PCI-Express and PCI devices properly and to program resources for supporting power management and hot-plug, as well as a Root Complex's ability to handle messages, legacy interrupts, error conditions on the root ports' links, and so on. For electrical testing, the electrical tester card is inserted into an appropriate slot (matching the width the slot supports), and the root port's transmitter and receiver characteristics (such as jitter and voltage levels) are measured via oscilloscopes and analysis software.
Figure: Example platform compliance test topology, with the FCT tester card and a protocol analyzer attached to the platform under test and driven by the compliance tests
Figure: Example Add-in Card Compliance Test Topology, with the add-in card (DUT), the FCT tester card, and a protocol analyzer on an Intel Architecture platform, driven by the compliance tests
Interoperability Testing
Compliance is a prerequisite to interoperability testing. Once compliance is established, it is necessary to verify that the device or function in question works with other devices and functions (not just PCI-Express-based ones) in a system. A typical way to test for this is to introduce the device into a well-known operating environment and run applications that measure the electrical characteristics of the link to which it is connected, the power consumption in the various power management states it supports, and functional characteristics such as interrupts. A xN device will be tested in its natural xN mode, exposing any issues with multi-lane functioning such as skews and lane reversals. The test is repeated using the devices and functions in as many platforms as applicable.
Plugfests
The PCI-SIG periodically arranges plugfests (multi-day events) where multiple PCI-Express product vendors bring their products and participate in a structured testing environment. While it is expected that vendors will have tested their products to a good extent on their own premises, early plugfest events provide a great venue to test against other implementations that otherwise may not be accessible. If necessary, there are usually opportunities to test and debug informally outside the established process. For these reasons it is recommended that vendors plan on sending a mix of developers and test engineers to these events. The bottom line is that vendors should take advantage of these events to refine their products and gain a time-to-market advantage.
Useful References
These links provide a starting point for finding the latest information on the C&I Test Specifications and events, architecture specifications, tools, and so on.
1. www.agilent.com
2. www.catc.com
3. http://developer.intel.com/technology/pciexpress/devnet/
4. http://www.pcisig.com/home
5. http://www.tektronix.com/