MindShare books are critical in the understanding of complex technical topics, such as
PCI Express 3.0 architecture. Many of our customers and industry partners depend on
these books for the success of their projects.
Joe Mendolia - Vice President, LeCroy

MindShare Technology Series
PCI Express Technology
Comprehensive Guide to Generations 1.x, 2.x and 3.0
Mike Jackson, Ravi Budruk
MindShare, Inc.

MindShare PCI Express Training at www.mindshare.com

LIVE COURSES:
Comprehensive PCI Express
Fundamentals of PCI Express
Intro to PCI Express

eLEARNING COURSES:
Comprehensive PCI Express
Fundamentals of PCI Express Technology
Intro to PCI Express Series

For training, visit mindshare.com

PCI Express 3.0 is the latest generation of the popular peripheral interface found in
virtually every PC, server, and industrial computer. Its high bandwidth, low latency, and
cost-to-performance ratio make it a natural choice for many peripheral devices today. Each
new generation of PCI Express adds more features, capabilities and bandwidth, which
maintains its popularity as a device interconnect.

MindShare's books take the hard work out of deciphering the specs, and this one follows
that tradition. MindShare's PCI Express Technology book provides a thorough description of
the interface with numerous practical examples that illustrate the concepts. Written in a
tutorial style, this book is ideal for anyone new to PCI Express. At the same time, its
thorough coverage of the details makes it an essential resource for seasoned veterans.

Essential topics covered include:
PCI Express Origins
Configuration Space and Access Methods
Enumeration Process
Packet Types and Fields
Transaction Ordering
Traffic Classes, Virtual Channels and Arbitration (QoS)
Flow Control
ACK/NAK Protocol
Logical PHY (8b/10b, 128b/130b, Scrambling)
Electrical PHY
Link Training and Initialization
Interrupt Delivery (Legacy, MSI, MSI-X)
Error Detection and Reporting
Power Management (for both software and hardware)
2.0 and 2.1 Features (such as 5.0GT/s, TLP Hints, and Multi-Casting)
3.0 Features (such as 8.0GT/s, and a new encoding scheme)
Considerations for High Speed Signaling (such as Equalization)

Mike Jackson is a Senior Staff Engineer with MindShare and has trained thousands of
engineers around the world on the workings of PCI Express. Mike has developed materials and
taught courses on such topics as PC Architecture, PCI, PCI-X, and SAS. Mike brings several
years of design experience to MindShare, including both systems integration work and
development of several ASIC designs.

MindShare is a world-renowned training and publishing company that sets a high standard
of excellence in training and enables high-tech companies to adopt, implement, and roll out
cutting-edge technologies quickly and confidently. We bring life to knowledge through a wide
variety of flexible learning methods and delivery options. MindShare now goes beyond the
classroom to deliver engaging interactive eLearning, both in a virtual classroom and an
online module format. Visit www.mindshare.com to learn more about our enthusiastic and
experienced instructors, courses, eLearning, books and other training delivery options.

Contact MindShare at training@mindshare.com or 1-800-633-1440 for training on PCI Express
or any of our many other topics.

$89.99 USA
MindShare Press
PCIe 3.0.book Page i Sunday, September 2, 2012 11:25 AM

PCI Express
Technology
Comprehensive Guide to Generations 1.x, 2.x, 3.0

MINDSHARE, INC.

Mike Jackson
Ravi Budruk
Technical Edit by Joe Winkles and Don Anderson

MindShare Live Training and Self-Paced Training

Intel Architecture
    Intel Ivy Bridge Processor
    Intel 64 (x86) Architecture
    Intel QuickPath Interconnect (QPI)
    Computer Architecture

Virtualization Technology
    PC Virtualization
    IO Virtualization

AMD Architecture
    AMD Opteron Processor (Bulldozer)
    AMD64 Architecture

IO Buses
    PCI Express 3.0
    USB 3.0/2.0
    xHCI for USB

Firmware Technology
    UEFI Architecture
    BIOS Essentials

Storage Technology
    SAS Architecture
    Serial ATA Architecture
    NVMe Architecture

ARM Architecture
    ARM Architecture

Memory Technology
    Modern DRAM Architecture

Graphics Architecture
    Graphics Hardware Architecture

High Speed Design
    High Speed Design
    EMI/EMC

Programming
    X86 Architecture Programming
    X86 Assembly Language Basics
    OpenCL Programming

Surface-Mount Technology (SMT)
    SMT Manufacturing
    SMT Testing

Are your company's technical training needs being addressed in the most effective manner?

MindShare has over 25 years of experience in conducting technical training on cutting-edge
technologies. We understand the challenges companies have when searching for quality,
effective training which reduces the students' time away from work and provides
cost-effective alternatives. MindShare offers many flexible solutions to meet those needs.
Our courses are taught by highly skilled, enthusiastic, knowledgeable and experienced
instructors. We bring life to knowledge through a wide variety of learning methods and
delivery options.
MindShare offers numerous courses in a self-paced training format (eLearning). We've taken
our 25+ years of experience in the technical training industry and made that knowledge
available to you at the click of a mouse.

training@mindshare.com | 1-800-633-1440 | www.mindshare.com


MindShare Arbor
The Ultimate Tool to View, Edit and Verify Configuration Settings of a Computer

Decode Data from Live Systems
Apply Standard and Custom Rule Checks
Directly Edit Config, Memory and IO Space
Everything Driven from Open Format XML

Feature List
Scan config space for all PCI-visible functions in system
Run standard and custom rule checks to find errors and non-optimal settings
Write to any config space location, memory address or IO address
View standard and non-standard structures in a decoded format
Import raw scan data from other tools (e.g. lspci) to view in Arbor's decoded format
Decode info included for standard PCI, PCI-X and PCI Express structures
Decode info included for some x86-based structures and device-specific registers
Create decode files for structures in config space, memory address space and IO space
Save system scans for viewing later or on other systems
All decode files and saved system scans are XML-based and open-format

COMING SOON
Decoded view of x86 structures (MSRs, ACPI, Paging, Virtualization, etc.)

mindshare.com | 800.633.1440 | training@mindshare.com

The Ultimate Tool to View, Edit and Verify Configuration Settings of a Computer

MindShare Arbor is a computer system debug, validation, analysis and learning tool
that allows the user to read and write any memory, IO or configuration space address.
The data from these address spaces can be viewed in a clean and informative style as
well as checked for configuration errors and non-optimal settings.

View Reference Info


MindShare Arbor is an excellent reference tool for quickly looking up standard PCI, PCI-X and PCIe
structures. All the register and field definitions are up to date with the PCI Express 3.0
specification. Reference info for x86, ACPI and USB will be coming soon as well.

Decoding Standard and Custom Structures from a Live System


MindShare Arbor can perform a scan of the system it is running on to record the config space from
all PCI-visible functions and show it in a clean and intuitive decoded format. In addition to scanning
PCI config space, MindShare Arbor can also be directed to read any memory address space and IO
address space and display the collected data in the same decoded fashion.

Run Rule Checks of Standard and Custom Structures


In addition to capturing and displaying headers and capability structures from PCI config space, Arbor
can also check the settings of each field for errors (e.g. values that violate the spec) and non-optimal
values (e.g. a PCIe link trained to something less than its max capability). MindShare Arbor has scores of
these checks built in, and they can be run on any system scan (live or saved). Any errors or warnings are
flagged and displayed for easy evaluation and debugging.
MindShare Arbor allows users to create their own rule checks to be applied to system scans. These
rule checks can be for any structure, or set of structures, in PCI config space, memory space or IO space.
The rule checks are written in JavaScript. (Python support coming soon.)
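
To make the idea of a rule check concrete, here is a minimal sketch of the "link trained below its max capability" example mentioned above. It is written in Python purely for illustration and does not use or imitate Arbor's own rule-check interface, which is not shown here. The function name check_link, the Finding class and the sample register values are all hypothetical. The only facts taken from the PCI Express specification are the field positions: link speed in bits 3:0 and link width in bits 9:4 of both the Link Capabilities and Link Status registers.

```python
from dataclasses import dataclass
from typing import List

# Field positions shared by the PCIe Link Capabilities and Link Status registers.
SPEED_MASK = 0xF    # bits 3:0 hold the link speed code
WIDTH_SHIFT = 4     # the link width field starts at bit 4
WIDTH_MASK = 0x3F   # the link width field is 6 bits wide (bits 9:4)

# Speed codes used through PCIe 3.0.
SPEED_NAMES = {1: "2.5 GT/s", 2: "5.0 GT/s", 3: "8.0 GT/s"}


@dataclass
class Finding:
    """One result of a rule check (hypothetical structure for this sketch)."""
    severity: str
    message: str


def speed_name(code: int) -> str:
    """Return a readable name for a speed code, or the raw code if unknown."""
    return SPEED_NAMES.get(code, "speed code {}".format(code))


def check_link(link_cap: int, link_status: int) -> List[Finding]:
    """Warn when a link trained below its advertised maximum speed or width."""
    max_speed = link_cap & SPEED_MASK
    max_width = (link_cap >> WIDTH_SHIFT) & WIDTH_MASK
    cur_speed = link_status & SPEED_MASK
    cur_width = (link_status >> WIDTH_SHIFT) & WIDTH_MASK

    findings: List[Finding] = []
    if cur_speed < max_speed:
        findings.append(Finding(
            "warning",
            "link speed {} is below the maximum {}".format(
                speed_name(cur_speed), speed_name(max_speed))))
    if cur_width < max_width:
        findings.append(Finding(
            "warning",
            "link width x{} is below the maximum x{}".format(cur_width, max_width)))
    return findings
```

For example, calling check_link(0x83, 0x41) describes a port that advertises x8 at 8.0 GT/s but trained to x4 at 2.5 GT/s, so both checks report a warning. These are warnings rather than errors because the negotiated values also depend on what the device at the other end of the link supports.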

Write Capability
MindShare Arbor provides a very simple interface to directly edit a register in PCI config space, memory
address space or IO address space. This can be done in the decoded view, so you can see the
meaning of each bit, or by simply writing a hex value to the target location.

Saving System Scans (XML)


After a system scan has been performed, MindShare Arbor allows saving of that system's scanned
data (PCI config space, memory space and IO space) all in a single file to be looked at later or sent to
a colleague. The scanned data in these Arbor system scan files (.ARBSYS files) are XML-based and
can be looked at with any text editor or web browser. Even scans performed with other tools can be
easily converted to the Arbor XML format and evaluated with MindShare Arbor.

Many of the designations used by manufacturers and sellers to distinguish their products
are claimed as trademarks. Where those designators appear in this book, and MindShare was
aware of the trademark claim, the designations have been printed in initial capital letters
or all capital letters.

The authors and publishers have taken care in preparation of this book, but make no
expressed or implied warranty of any kind and assume no responsibility for errors or
omissions. No liability is assumed for incidental or consequential damages in connection
with or arising out of the use of the information or programs contained herein.

Library of Congress Cataloging-in-Publication Data

Jackson, Mike and Budruk, Ravi
PCI Express Technology / MindShare, Inc., Mike Jackson, Ravi Budruk ....[et al.]

Includes index
ISBN: 9780983646525 (alk. paper)
1. Computer Architecture. 2. Microcomputers-buses.
I. Jackson, Mike II. MindShare, Inc. III. Title

Library of Congress Number: 2011921066
ISBN: 9780983646525
Copyright 2012 by MindShare, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted, in any form or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the publisher.
Printed in the United States of America.

Editors: Joe Winkles and Don Anderson
Project Manager: Maryanne Daves
Cover Design: Greenhouse Creative and MindShare, Inc.

Set in 10 point Palatino Linotype by MindShare, Inc.
Text printed on recycled and acid-free paper

First Edition, First Printing, September, 2012

This book is dedicated to my sons, Jeremy and Bryan - I love you guys deeply. Creating a
book takes a long time and a team effort, but it's finally done and now you hold the
results in your hand. It's a picture of the way life is sometimes: investing over a long
time with your team before you see the result. You were a gift to us when you were born
and we've invested in you for many years, along with a number of people who have helped
us. Now you've become fine young men in your own right and it's been a joy to become your
friend as grown men. What will you invest in that will become the big achievements in your
lives? I can hardly wait to find out.

Acknowledgments
Thanks to those who made significant contributions to this book:

Maryanne Daves for being book project manager and getting the book to press in a timely
manner.

Don Anderson for excellent work editing numerous chapters and doing a complete rewrite of
Chapter 8 on Transaction Ordering.

Joe Winkles for his superb job of technical editing and doing a complete rewrite of
Chapter 4 on Address Space and Transaction Routing.

Jay Trodden for his contribution in developing Chapter 4 on Address Space and Transaction
Routing.

Special thanks to LeCroy Corporation, Inc. for supplying:
Appendix A: Debugging PCI Express Traffic using LeCroy Tools

Special thanks to PLX Technology for contributing two appendices:
Appendix B: Markets & Applications for PCI Express
Appendix C: Implementing Intelligent Adapters and Multi-Host Systems With PCI Express
Technology

Thanks also to the PCI-SIG for giving permission to use some of the mechanical drawings
from the specification.
Revision Updates:
1.0 - Initial eBook release
1.01 - Fixed Revision ID field in Figures 1-12, 1-13, 4-2, 4-4, 4-5, 4-6, 4-8, 4-9, 4-10, 4-17, 4-20, 4-21

Contents

About This Book


The MindShare Technology Series ........................................................................................ 1
Cautionary Note ......................................................................................................................... 2
Intended Audience .................................................................................................................... 2
Prerequisite Knowledge ........................................................................................................... 2
Book Topics and Organization................................................................................................ 3
Documentation Conventions ................................................................................................... 3
PCI Express ....................................................................................................................... 3
Hexadecimal Notation ........................................................................................................ 4
Binary Notation .................................................................................................................... 4
Decimal Notation ................................................................................................................. 4
Bits, Bytes and Transfers Notation .................................................................................... 4
Bit Fields ................................................................................................................................ 4
Active Signal States.............................................................................................................. 5
Visit Our Web Site ..................................................................................................................... 5
We Want Your Feedback........................................................................................................... 5

Part One: The Big Picture

Chapter 1: Background
Introduction................................................................................................................................. 9
PCI and PCI-X ........................................................................................................................... 10
PCI Basics .................................................................................................................................. 11
Basics of a PCI-Based System ........................................................................................... 11
PCI Bus Initiator and Target............................................................................................. 12
Typical PCI Bus Cycle ....................................................................................................... 13
Reflected-Wave Signaling................................................................................................. 16
PCI Bus Architecture Perspective ......................................................................................... 18
PCI Transaction Models.................................................................................................... 18
Programmed I/O ........................................................................................................ 18
Direct Memory Access (DMA).................................................................................. 19
Peer-to-Peer ................................................................................................................. 20
PCI Bus Arbitration ........................................................................................................... 20
PCI Inefficiencies................................................................................................................ 21
PCI Retry Protocol ...................................................................................................... 21
PCI Disconnect Protocol ............................................................................................ 22
PCI Interrupt Handling..................................................................................................... 23
PCI Error Handling............................................................................................................ 24
PCI Address Space Map.................................................................................................... 25
PCI Configuration Cycle Generation .............................................................................. 26
PCI Function Configuration Register Space .................................................................. 27
Higher-bandwidth PCI ..................................................................................................... 29
Limitations of 66 MHz PCI bus ................................................................................ 30
Signal Timing Problems with the Parallel PCI Bus Model beyond 66 MHz...... 31
Introducing PCI-X .................................................................................................................... 31
PCI-X System Example...................................................................................................... 31
PCI-X Transactions ............................................................................................................ 32
PCI-X Features.................................................................................................................... 33
Split-Transaction Model............................................................................................. 33
Message Signaled Interrupts..................................................................................... 34
Transaction Attributes ............................................................................................... 35
No Snoop (NS): .................................................................................................... 35
Relaxed Ordering (RO): ...................................................................................... 35
Higher Bandwidth PCI-X.................................................................................................. 36
Problems with the Common Clock Approach of PCI and PCI-X 1.0 Parallel Bus Model ........ 36
PCI-X 2.0 Source-Synchronous Model..................................................................... 37

Chapter 2: PCIe Architecture Overview


Introduction to PCI Express ................................................................................................... 39
Software Backward Compatibility .................................................................................. 41
Serial Transport .................................................................................................................. 41
The Need for Speed .................................................................................................... 41
Overcoming Problems ........................................................................................ 41
Bandwidth ............................................................................................................ 42
PCIe Bandwidth Calculation..................................................................................... 43
Differential Signals ..................................................................................................... 44
No Common Clock ..................................................................................................... 45
Packet-based Protocol ................................................................................................ 46
Links and Lanes ................................................................................................................. 46
Scalable Performance ................................................................................................. 46
Flexible Topology Options ........................................................................................ 47
Some Definitions ................................................................................................................ 47
Root Complex.............................................................................................................. 48
Switches and Bridges ................................................................................................. 48
Native PCIe Endpoints and Legacy PCIe Endpoints ............................................ 49
Software Compatibility Characteristics................................................................... 49
System Examples ........................................................................................................ 52
Introduction to Device Layers .............................................................................................. 54
Device Core / Software Layer ........................................................................................ 59
Transaction Layer............................................................................................................... 59
TLP (Transaction Layer Packet) Basics.................................................................... 60
TLP Packet Assembly.......................................................................................... 62
TLP Packet Disassembly..................................................................................... 64
Non-Posted Transactions........................................................................................... 65
Ordinary Reads.................................................................................................... 65
Locked Reads ....................................................................................................... 66
IO and Configuration Writes ............................................................................. 68
Posted Writes............................................................................................................... 69
Memory Writes .................................................................................................... 69
Message Writes .................................................................................................... 70
Transaction Ordering ................................................................................................. 71
Data Link Layer.................................................................................................................. 72
DLLPs (Data Link Layer Packets) ............................................................................ 73
DLLP Assembly ................................................................................................... 73
DLLP Disassembly .............................................................................................. 73
Ack/Nak Protocol ...................................................................................................... 74
Flow Control................................................................................................................ 76
Power Management.................................................................................................... 76
Physical Layer..................................................................................................................... 76
General ......................................................................................................................... 76
Physical Layer - Logical ............................................................................................. 77
Link Training and Initialization ............................................................................... 78
Physical Layer - Electrical.......................................................................................... 78
Ordered Sets ................................................................................................................ 79
Protocol Review Example ....................................................................................................... 81
Memory Read Request............................................................................................... 81
Completion with Data................................................................................................ 83

Chapter 3: Configuration Overview


Definition of Bus, Device and Function .............................................................................. 85
PCIe Buses........................................................................................................................... 86
PCIe Devices ....................................................................................................................... 86
PCIe Functions.................................................................................................................... 86
Configuration Address Space ................................................................................................ 88
PCI-Compatible Space....................................................................................................... 88
Extended Configuration Space ........................................................................................ 89
Host-to-PCI Bridge Configuration Registers...................................................................... 90
General................................................................................................................................. 90
Only the Root Sends Configuration Requests ............................................................... 91
Generating Configuration Transactions.............................................................................. 91
Legacy PCI Mechanism..................................................................................................... 91
Configuration Address Port...................................................................................... 92
Bus Compare and Data Port Usage.......................................................................... 93
Single Host System ..................................................................................................... 94
Multi-Host System...................................................................................................... 96
Enhanced Configuration Access Mechanism ................................................................ 96
General ......................................................................................................................... 96
Some Rules................................................................................................................... 98
Configuration Requests .......................................................................................................... 99
Type 0 Configuration Request ......................................................................................... 99
Type 1 Configuration Request ....................................................................................... 100
Example PCI-Compatible Configuration Access ............................................................. 102
Example Enhanced Configuration Access......................................................................... 103
Enumeration - Discovering the Topology ......................................................................... 104
Discovering the Presence or Absence of a Function ................................................... 105
Device not Present .................................................................................................... 105
Device not Ready ...................................................................................................... 106
Determining if a Function is an Endpoint or Bridge .................................................. 108
Single Root Enumeration Example..................................................................................... 109
Multi-Root Enumeration Example...................................................................................... 114
General............................................................................................................................... 114
Multi-Root Enumeration Process................................................................................... 114
Hot-Plug Considerations ...................................................................................................... 116
MindShare Arbor: Debug/Validation/Analysis and Learning Software Tool........... 117
General............................................................................................................................... 117
MindShare Arbor Feature List ....................................................................................... 119

Chapter 4: Address Space & Transaction Routing


I Need An Address................................................................................................................. 121
Configuration Space ........................................................................................................ 122
Memory and IO Address Spaces ................................................................................... 122
General ....................................................................................................................... 122
Prefetchable vs. Non-prefetchable Memory Space .............................................. 123
Base Address Registers (BARs) ........................................................................................... 126
General............................................................................................................................... 126
BAR Example 1: 32-bit Memory Address Space Request .......................................... 128
BAR Example 2: 64-bit Memory Address Space Request .......................................... 130
BAR Example 3: IO Address Space Request ................................................................ 133
All BARs Must Be Evaluated Sequentially................................................................... 135
Resizable BARs................................................................................................................. 135
Base and Limit Registers ...................................................................................................... 136
General............................................................................................................................... 136
Prefetchable Range (P-MMIO) ....................................................................................... 137
Non-Prefetchable Range (NP-MMIO)........................................................................... 139
IO Range............................................................................................................................ 141

Unused Base and Limit Registers .................................................................................. 144


Sanity Check: Registers Used For Address Routing ....................................................... 144
TLP Routing Basics ................................................................................................................ 145
Receivers Check For Three Types of Traffic ................................................................ 147
Routing Elements............................................................................................................. 147
Three Methods of TLP Routing...................................................................................... 147
General ....................................................................................................................... 147
Purpose of Implicit Routing and Messages .......................................................... 148
Why Messages?.................................................................................................. 148
How Implicit Routing Helps............................................................................ 148
Split Transaction Protocol............................................................................................... 149
Posted versus Non-Posted.............................................................................................. 150
Header Fields Define Packet Format and Type........................................................... 151
General ....................................................................................................................... 151
Header Format/Type Field Encodings ................................................................. 153
TLP Header Overview .................................................................................................... 154
Applying Routing Mechanisms .......................................................................................... 155
ID Routing......................................................................................................................... 155
Bus Number, Device Number, Function Number Limits................................... 155
Key TLP Header Fields in ID Routing ................................................................... 155
Endpoints: One Check.............................................................................................. 156
Switches (Bridges): Two Checks Per Port ............................................................. 157
Address Routing .............................................................................................................. 158
Key TLP Header Fields in Address Routing ........................................................ 159
TLPs with 32-Bit Address................................................................................. 159
TLPs with 64-Bit Address................................................................................. 159
Endpoint Address Checking ................................................................................... 160
Switch Routing .......................................................................................................... 161
Downstream Traveling TLPs (Received on Primary Interface).................. 162
Upstream Traveling TLPs (Received on Secondary Interface) ................... 163
Multicast Capabilities............................................................................................... 163
Implicit Routing ............................................................................................................... 163
Only for Messages .................................................................................................... 163
Key TLP Header Fields in Implicit Routing ......................................................... 164
Message Type Field Summary................................................................................ 164
Endpoint Handling................................................................................................... 165
Switch Handling ....................................................................................................... 165
DLLPs and Ordered Sets Are Not Routed......................................................................... 166

Part Two: Transaction Layer

Chapter 5: TLP Elements


Introduction to Packet-Based Protocol............................................................................... 169
General............................................................................................................................... 169
Motivation for a Packet-Based Protocol ....................................................................... 171
1. Packet Formats Are Well Defined ...................................................................... 171
2. Framing Symbols Define Packet Boundaries.................................................... 171
3. CRC Protects Entire Packet ................................................................................. 172
Transaction Layer Packet (TLP) Details............................................................................. 172
TLP Assembly And Disassembly .................................................................................. 172
TLP Structure.................................................................................................................... 174
Generic TLP Header Format .......................................................................................... 175
General ....................................................................................................................... 175
Generic Header Field Summary ............................................................................. 175
Generic Header Field Details ......................................................................................... 178
Header Type/Format Field Encodings ................................................................. 179
Digest / ECRC Field................................................................................................. 180
ECRC Generation and Checking ..................................................................... 180
Who Checks ECRC? .......................................................................................... 180
Using Byte Enables ................................................................................................... 181
General ................................................................................................................ 181
Byte Enable Rules .............................................................................................. 181
Byte Enable Example......................................................................................... 182
Transaction Descriptor Fields ................................................................................. 182
Transaction ID.................................................................................................... 183
Traffic Class ........................................................................................................ 183
Transaction Attributes ...................................................................................... 183
Additional Rules For TLPs With Data Payloads.................................................. 183
Specific TLP Formats: Request & Completion TLPs................................................... 184
IO Requests ................................................................................................................ 184
IO Request Header Format .............................................................................. 185
IO Request Header Fields................................................................................. 186
Memory Requests ..................................................................................................... 188
Memory Request Header Fields ...................................................................... 188
Memory Request Notes .................................................................................... 192
Configuration Requests ........................................................................................... 192
Definitions Of Configuration Request Header Fields .................................. 193
Configuration Request Notes .......................................................................... 196

Completions............................................................................................................... 196
Definitions Of Completion Header Fields ..................................................... 197
Summary of Completion Status Codes .......................................................... 200
Calculating The Lower Address Field............................................................ 200
Using The Byte Count Modified Bit................................................................ 201
Data Returned For Read Requests: ................................................................. 201
Receiver Completion Handling Rules: ........................................................... 202
Message Requests ..................................................................................................... 203
Message Request Header Fields...................................................................... 204
Message Notes: .................................................................................................. 206
INTx Interrupt Messages.................................................................................. 206
Power Management Messages ........................................................................ 208
Error Messages................................................................................................... 209
Locked Transaction Support............................................................................ 209
Set Slot Power Limit Message.......................................................................... 210
Vendor-Defined Message 0 and 1 ................................................................... 210
Ignored Messages .............................................................................................. 211
Latency Tolerance Reporting Message........................................................... 212
Optimized Buffer Flush and Fill Messages.................................................... 213

Chapter 6: Flow Control


Flow Control Concept ........................................................................................................... 215
Flow Control Buffers and Credits....................................................................................... 217
VC Flow Control Buffer Organization.......................................................................... 218
Flow Control Credits ....................................................................................................... 219
Initial Flow Control Advertisement ................................................................................... 219
Minimum and Maximum Flow Control Advertisement ........................................... 219
Infinite Credits.................................................................................................................. 221
Special Use for Infinite Credit Advertisements........................................................... 221
Flow Control Initialization................................................................................................... 222
General............................................................................................................................... 222
The FC Initialization Sequence....................................................................................... 223
FC_Init1 Details ................................................................................................................ 224
FC_Init2 Details ................................................................................................................ 225
Rate of FC_INIT1 and FC_INIT2 Transmission .......................................................... 226
Violations of the Flow Control Initialization Protocol ............................................... 227
Introduction to the Flow Control Mechanism.................................................................. 227
General............................................................................................................................... 227
The Flow Control Elements ............................................................................................ 227
Transmitter Elements ............................................................................................... 228
Receiver Elements..................................................................................................... 229

Flow Control Example........................................................................................................... 230


Stage 1: Flow Control Following Initialization........................................................ 230
Stage 2: Flow Control Buffer Fills Up........................................................................ 233
Stage 3: Counters Roll Over........................................................................................ 234
Stage 4: FC Buffer Overflow Error Check ................................................................ 235
Flow Control Updates ........................................................................................................... 237
FC_Update DLLP Format and Content ........................................................................ 238
Flow Control Update Frequency ................................................................................... 239
Immediate Notification of Credits Allocated ....................................................... 239
Maximum Latency Between Update Flow Control DLLPs................................ 240
Calculating Update Frequency Based on Payload Size and Link Width ......... 240
Error Detection Timer: A Pseudo Requirement ...................................................... 243

Chapter 7: Quality of Service


Motivation ............................................................................................................................... 245
Basic Elements ........................................................................................................................ 246
Traffic Class (TC).............................................................................................................. 247
Virtual Channels (VCs) ................................................................................................... 247
Assigning TCs to each VC: TC/VC Mapping .................................................. 248
Determining the Number of VCs to be Used ....................................................... 249
Assigning VC Numbers (IDs) ................................................................................. 251
VC Arbitration ........................................................................................................................ 252
General............................................................................................................................... 252
Strict Priority VC Arbitration ......................................................................................... 253
Group Arbitration ............................................................................................................ 255
Hardware Fixed Arbitration Scheme..................................................................... 257
Weighted Round Robin Arbitration Scheme........................................................ 257
Setting up the Virtual Channel Arbitration Table ............................................... 258
Port Arbitration ...................................................................................................................... 261
General............................................................................................................................... 261
Port Arbitration Mechanisms......................................................................................... 264
Hardware-Fixed Arbitration ................................................................................... 265
Weighted Round Robin Arbitration ...................................................................... 265
Time-Based, Weighted Round Robin Arbitration (TBWRR).............................. 266
Loading the Port Arbitration Tables ............................................................................. 267
Switch Arbitration Example ........................................................................................... 269
Arbitration in Multi-Function Endpoints ......................................................................... 270
Isochronous Support ............................................................................................................. 272
Timing is Everything ....................................................................................................... 273
How Timing is Defined............................................................................................ 274
How Timing is Enforced.......................................................................................... 275

Software Support ............................................................................................................. 275


Device Drivers........................................................................................................... 276
Isochronous Broker................................................................................................... 276
Bringing it all together .................................................................................................... 276
Endpoints ................................................................................................................... 276
Switches...................................................................................................................... 278
Arbitration Issues .............................................................................................. 278
Timing Issues ..................................................................................................... 278
Bandwidth Allocation Problems ..................................................................... 280
Latency Issues .................................................................................................... 281
Root Complex............................................................................................................ 281
Problem: Snooping ............................................................................................ 281
Snooping Solutions............................................................................................ 282
Power Management.................................................................................................. 282
Error Handling .......................................................................................................... 282

Chapter 8: Transaction Ordering


Introduction............................................................................................................................. 285
Definitions............................................................................................................................... 286
Simplified Ordering Rules................................................................................................... 287
Ordering Rules and Traffic Classes (TCs) .................................................................... 287
Ordering Rules Based On Packet Type......................................................................... 288
The Simplified Ordering Rules Table ........................................................................... 288
Producer/Consumer Model .................................................................................................. 290
Producer/Consumer Sequence: No Errors .............................................................. 291
Producer/Consumer Sequence: Errors..................................................................... 295
Relaxed Ordering ................................................................................................................... 296
RO Effects on Memory Writes and Messages.............................................................. 297
RO Effects on Memory Read Transactions................................................................... 298
Weak Ordering ....................................................................................................................... 299
Transaction Ordering and Flow Control ...................................................................... 299
Transaction Stalls ............................................................................................................. 300
VC Buffers Offer an Advantage..................................................................................... 301
ID Based Ordering (IDO) ..................................................................................................... 301
The Solution ...................................................................................................................... 301
When to use IDO.............................................................................................................. 302
Software Control .............................................................................................................. 303
Deadlock Avoidance.............................................................................................................. 303

Part Three: Data Link Layer

Chapter 9: DLLP Elements


General ..................................................................................................................................... 307
DLLPs Are Local Traffic ....................................................................................................... 308
Receiver handling of DLLPs ................................................................................................ 309
Sending DLLPs ....................................................................................................................... 309
General............................................................................................................................... 309
DLLP Packet Size is Fixed at 8 Bytes............................................................................. 310
DLLP Packet Types ................................................................................................................ 311
Ack/Nak DLLP Format .................................................................................................. 312
Power Management DLLP Format ............................................................................... 313
Flow Control DLLP Format............................................................................................ 314
Vendor-Specific DLLP Format ....................................................................................... 316

Chapter 10: Ack/Nak Protocol


Goal: Reliable TLP Transport .............................................................................................. 317
Elements of the Ack/Nak Protocol...................................................................................... 320
Transmitter Elements ...................................................................................................... 320
NEXT_TRANSMIT_SEQ Counter.......................................................................... 321
LCRC Generator........................................................................................................ 321
Replay Buffer............................................................................................................. 321
REPLAY_TIMER Count........................................................................................... 323
REPLAY_NUM Count ............................................................................................. 323
ACKD_SEQ Register ................................................................................................ 323
DLLP CRC Check ..................................................................................................... 324
Receiver Elements ............................................................................................................ 324
LCRC Error Check .................................................................................................... 325
NEXT_RCV_SEQ Counter....................................................................................... 326
Sequence Number Check......................................................................................... 326
NAK_SCHEDULED Flag ........................................................................................ 327
AckNak_LATENCY_TIMER................................................................................... 328
Ack/Nak Generator ................................................................................................. 328
Ack/Nak Protocol Details ..................................................................................................... 329
Transmitter Protocol Details .......................................................................................... 329
Sequence Number..................................................................................................... 329
32-Bit LCRC ............................................................................................................... 329
Replay (Retry) Buffer................................................................................................ 330
General ................................................................................................................ 330
Replay Buffer Sizing.......................................................................................... 330

Transmitter's Response to an Ack DLLP .............................................................. 331


Ack/Nak Examples .................................................................................................. 331
Example 1............................................................................................................ 331
Example 2............................................................................................................ 332
Transmitter's Response to a Nak............................................................................ 333
TLP Replay................................................................................................................. 333
Efficient TLP Replay................................................................................................. 334
Example of a Nak...................................................................................................... 334
Repeated Replay of TLPs......................................................................................... 335
General ................................................................................................................ 335
Replay Number Rollover.................................................................................. 336
Replay Timer ............................................................................................................. 336
REPLAY_TIMER Equation............................................................................... 337
REPLAY_TIMER Summary Table .................................................................. 338
Transmitter DLLP Handling ................................................................................... 340
Receiver Protocol Details ................................................................................................ 340
Physical Layer ........................................................................................................... 340
TLP LCRC Check ...................................................................................................... 341
Next Received TLP's Sequence Number............................................................... 341
Duplicate TLP..................................................................................................... 342
Out of Sequence TLP......................................................................................... 342
Receiver Schedules An Ack DLLP ......................................................................... 342
Receiver Schedules a Nak........................................................................................ 343
AckNak_LATENCY_TIMER.................................................................................. 343
AckNak_LATENCY_TIMER Equation .......................................................... 344
AckNak_LATENCY_TIMER Summary Table .............................................. 345
More Examples ....................................................................................................................... 345
Lost TLPs........................................................................................................................... 345
Bad Ack ............................................................................................................................. 347
Bad Nak ............................................................................................................................. 348
Error Situations Handled by Ack/Nak............................................................................... 349
Recommended Priority To Schedule Packets................................................................... 350
Timing Differences for Newer Spec Versions ................................................................. 350
Ack Transmission Latency (AckNak Latency) ............................................................ 351
2.5 GT/s Operation................................................................................................... 351
5.0 GT/s Operation................................................................................................... 352
8.0 GT/s Operation................................................................................................... 352
Replay Timer .................................................................................................................... 353
2.5 GT/s Operation................................................................................................... 353
5.0 GT/s Operation................................................................................................... 354
8.0 GT/s Operation................................................................................................... 354

xvii
PCIe 3.0.book Page xviii Sunday, September 2, 2012 11:25 AM

Contents

Switch Cut-Through Mode .................................................................................................. 354
Background....................................................................................................................... 355
A Latency Improvement Option.................................................................................... 355
Cut-Through Operation .................................................................................................. 356
Example of Cut-Through Operation ............................................................................. 356

Part Four: Physical Layer

Chapter 11: Physical Layer - Logical (Gen1 and Gen2)


Physical Layer Overview ...................................................................................................... 362
Observation....................................................................................................................... 364
Transmit Logic Overview ............................................................................................... 364
Receive Logic Overview ................................................................................................. 366
Transmit Logic Details (Gen1 and Gen2 Only) ............................................................... 368
Tx Buffer ............................................................................................................................ 368
Mux and Control Logic ................................................................................................... 368
Byte Striping (for Wide Links) ....................................................................................... 371
Packet Format Rules ........................................................................................................ 373
General Rules ............................................................................................................ 373
Example: x1 Format.................................................................................................. 374
x4 Format Rules......................................................................................................... 374
Example x4 Format................................................................................................... 375
Large Link-Width Packet Format Rules ................................................................ 376
x8 Packet Format Example ...................................................................................... 376
Scrambler........................................................................................................................... 377
Scrambler Algorithm................................................................................................ 378
Some Scrambler Implementation Rules:................................................................. 379
Disabling Scrambling ............................................................................................... 379
8b/10b Encoding.............................................................................................................. 380
General ....................................................................................................................... 380
Motivation.................................................................................................................. 380
Properties of 10-bit Symbols ................................................................................... 381
Character Notation ................................................................................................... 382
Disparity..................................................................................................................... 383
Definition ............................................................................................................ 383
CRD (Current Running Disparity).................................................................. 383
Encoding Procedure ................................................................................................. 383
Example Transmission ............................................................................................. 385
Control Characters.................................................................................................... 386
Ordered sets............................................................................................................... 388
General ................................................................................................................ 388
TS1 and TS2 Ordered Set (TS1OS/TS2OS) .................................................... 388
Electrical Idle Ordered Set (EIOS)................................................................... 388
FTS Ordered Set (FTSOS) ................................................................................. 388
SKP Ordered Set (SOS) ..................................................................................... 389
Electrical Idle Exit Ordered Set (EIEOS) ........................................................ 389
Serializer............................................................................................................................ 389
Differential Driver............................................................................................................ 389
Transmit Clock (Tx Clock).............................................................................................. 390
Miscellaneous Transmit Topics...................................................................................... 390
Logical Idle ................................................................................................................ 390
Tx Signal Skew .......................................................................................................... 390
Clock Compensation ................................................................................................ 391
Background ........................................................................................................ 391
SKIP ordered set Insertion Rules..................................................................... 391
Receive Logic Details (Gen1 and Gen2 Only) .................................................................. 392
Differential Receiver ........................................................................................................ 393
Rx Clock Recovery ........................................................................................................... 394
General ....................................................................................................................... 394
Achieving Bit Lock ................................................................................................... 395
Losing Bit Lock.......................................................................................................... 395
Regaining Bit Lock.................................................................................................... 395
Deserializer ....................................................................................................................... 395
General ....................................................................................................................... 395
Achieving Symbol Lock ........................................................................................... 396
Receiver Clock Compensation Logic ............................................................................ 396
Background................................................................................................................ 396
Elastic Buffer's Role .................................................................................................. 397
Lane-to-Lane Skew .......................................................................................................... 398
Flight Time Will Vary Between Lanes ................................................................... 398
Ordered sets Help De-Skewing .............................................................................. 398
Receiver Lane-to-Lane De-Skew Capability ......................................................... 398
De-Skew Opportunities ........................................................................................... 399
8b/10b Decoder................................................................................................................ 400
General ....................................................................................................................... 400
Disparity Calculator ................................................................................................. 400
Code Violation and Disparity Error Detection..................................................... 400
General ................................................................................................................ 400
Code Violations.................................................................................................. 400
Disparity Errors ................................................................................................. 400
Descrambler ...................................................................................................................... 402
Some Descrambler Implementation Rules:........................................................... 402
Disabling Descrambling........................................................................................... 402
Byte Un-Striping............................................................................................................... 402
Filter and Packet Alignment Check............................................................................... 403
Receive Buffer (Rx Buffer) .............................................................................................. 403
Physical Layer Error Handling ............................................................................................ 404
General............................................................................................................................... 404
Response of Data Link Layer to Receiver Error .......................................................... 404
Active State Power Management ........................................................................................ 405
Link Training and Initialization ......................................................................................... 405

Chapter 12: Physical Layer - Logical (Gen3)


Introduction to Gen3 ............................................................................................................. 407
New Encoding Model...................................................................................................... 409
Sophisticated Signal Equalization ................................................................................. 410
Encoding for 8.0 GT/s ............................................................................................................ 410
Lane-Level Encoding....................................................................................................... 410
Block Alignment............................................................................................................... 411
Ordered Set Blocks........................................................................................................... 412
Data Stream and Data Blocks ......................................................................................... 413
Data Block Frame Construction ..................................................................................... 414
Framing Tokens ........................................................................................................ 415
Packets ........................................................................................................................ 415
Transmitter Framing Requirements....................................................................... 417
Receiver Framing Requirements ............................................................................ 419
Recovery from Framing Errors ............................................................................... 420
Gen3 Physical Layer Transmit Logic.................................................................................. 421
Multiplexer........................................................................................................................ 421
Byte Striping ..................................................................................................................... 423
Byte Striping x8 Example......................................................................................... 424
Nullified Packet x8 Example ................................................................................... 425
Ordered Set Example - SOS..................................................................................... 426
Transmitter SOS Rules ............................................................................................. 429
Receiver SOS Rules................................................................................................... 430
Scrambling ........................................................................................................................ 430
Number of LFSRs...................................................................................................... 430
First Option: Multiple LFSRs ........................................................................... 431
Second Option: Single LFSR ............................................................................ 432
Scrambling Rules ...................................................................................................... 433
Serializer............................................................................................................................ 434
Mux for Sync Header Bits ............................................................................................... 435
Gen3 Physical Layer Receive Logic .................................................................................... 435
Differential Receiver ........................................................................................................ 435
CDR (Clock and Data Recovery) Logic......................................................................... 437
Rx Clock Recovery.................................................................................................... 437
Deserializer ................................................................................................................ 438
Achieving Block Alignment .................................................................................... 438
Unaligned Phase ................................................................................................ 439
Aligned Phase .................................................................................................... 439
Locked Phase...................................................................................................... 439
Special Case: Loopback..................................................................................... 439
Block Type Detection................................................................................................ 439
Receiver Clock Compensation Logic ............................................................................ 440
Background................................................................................................................ 440
Elastic Buffer's Role .................................................................................................. 440
Lane-to-Lane Skew .......................................................................................................... 442
Flight Time Variance Between Lanes..................................................................... 442
De-skew Opportunities............................................................................................ 442
Receiver Lane-to-Lane De-skew Capability.......................................................... 443
Descrambler ...................................................................................................................... 444
General ....................................................................................................................... 444
Disabling Descrambling........................................................................................... 444
Byte Un-Striping............................................................................................................... 445
Packet Filtering................................................................................................................. 446
Receive Buffer (Rx Buffer) .............................................................................................. 446
Notes Regarding Loopback with 128b/130b ..................................................................... 446

Chapter 13: Physical Layer - Electrical


Backward Compatibility....................................................................................................... 448
Component Interfaces ........................................................................................................... 449
Physical Layer Electrical Overview .................................................................................... 449
High Speed Signaling ........................................................................................................... 451
Clock Requirements .............................................................................................................. 452
General............................................................................................................................... 452
SSC (Spread Spectrum Clocking) .................................................................................. 453
Refclk Overview............................................................................................................... 455
2.5 GT/s...................................................................................................................... 455
5.0 GT/s...................................................................................................................... 455
Common Refclk ................................................................................................. 456
Data Clocked Rx Architecture ......................................................................... 456
Separate Refclks ................................................................................................. 457
8.0 GT/s...................................................................................................................... 457
Transmitter (Tx) Specs .......................................................................................................... 458
Measuring Tx Signals ...................................................................................................... 458
Tx Impedance Requirements.......................................................................................... 459
ESD and Short Circuit Requirements............................................................................ 459
Receiver Detection ........................................................................................................... 460
General ....................................................................................................................... 460
Detecting Receiver Presence.................................................................................... 460
Transmitter Voltages ....................................................................................................... 462
DC Common Mode Voltage.................................................................................... 462
Full-Swing Differential Voltage.............................................................................. 462
Differential Notation ................................................................................................ 463
Reduced-Swing Differential Voltage ..................................................................... 464
Equalized Voltage..................................................................................................... 464
Voltage Margining.................................................................................................... 465
Receiver (Rx) Specs ................................................................................................................ 466
Receiver Impedance......................................................................................................... 466
Receiver DC Common Mode Voltage........................................................................... 466
Transmission Loss............................................................................................................ 468
AC Coupling..................................................................................................................... 468
Signal Compensation ............................................................................................................ 468
De-emphasis Associated with Gen1 and Gen2 PCIe .................................................. 468
The Problem............................................................................................................... 468
How Does De-Emphasis Help? .............................................................................. 469
Solution for 2.5 GT/s................................................................................................ 470
Solution for 5.0 GT/s................................................................................................ 472
Solution for 8.0 GT/s - Transmitter Equalization ....................................................... 474
Three-Tap Tx Equalizer Required .......................................................................... 475
Pre-shoot, De-emphasis, and Boost........................................................................ 476
Presets and Ratios ..................................................................................................... 478
Equalizer Coefficients .............................................................................................. 479
Coefficient Example .......................................................................................... 480
EIEOS Pattern..................................................................................................... 483
Reduced Swing .................................................................................................. 483
Beacon Signaling .............................................................................................................. 483
General ....................................................................................................................... 483
Properties of the Beacon Signal .............................................................................. 484
Eye Diagram ............................................................................................................................ 485
Jitter, Noise, and Signal Attenuation ............................................................................ 485
The Eye Test...................................................................................................................... 485
Normal Eye Diagram....................................................................................................... 486
Effects of Jitter................................................................................................................... 487
Transmitter Driver Characteristics ..................................................................................... 489
Receiver Characteristics ........................................................................................................ 492
Stressed-Eye Testing........................................................................................................ 492
2.5 and 5.0 GT/s........................................................................................................ 492
8.0 GT/s...................................................................................................................... 492
Receiver (Rx) Equalization ............................................................................................. 493
Continuous-Time Linear Equalization (CTLE) .................................................... 493
Decision Feedback Equalization (DFE) ................................................................. 495
Receiver Characteristics ........................................................................................................ 497
Link Power Management States.......................................................................................... 500

Chapter 14: Link Initialization & Training


Overview.................................................................................................................................. 506
Ordered Sets in Link Training ............................................................................................ 509
General............................................................................................................................... 509
TS1 and TS2 Ordered Sets............................................................................................... 510
Link Training and Status State Machine (LTSSM) ......................................................... 518
General............................................................................................................................... 518
Overview of LTSSM States ............................................................................................. 519
Introductions, Examples and State/Substates............................................................. 521
Detect State.............................................................................................................................. 522
Introduction ...................................................................................................................... 522
Detailed Detect Substate ................................................................................................. 523
Detect.Quiet ............................................................................................................... 523
Detect.Active ............................................................................................................. 524
Polling State ............................................................................................................................ 525
Introduction ...................................................................................................................... 525
Detailed Polling Substates .............................................................................................. 526
Polling.Active ............................................................................................................ 526
Polling.Configuration............................................................................................... 527
Polling.Compliance .................................................................................................. 529
Compliance Pattern for 8b/10b ....................................................................... 529
Compliance Pattern for 128b/130b ................................................................. 530
Modified Compliance Pattern for 8b/10b...................................................... 532
Modified Compliance Pattern for 128b/130b................................................ 533
Compliance Pattern ........................................................................................... 537
Modified Compliance Pattern ......................................................................... 537
Configuration State................................................................................................................ 539
Configuration State - General...................................................................................... 540
Designing Devices with Links that can be Merged .................................................... 541
Configuration State - Training Examples .................................................................. 542
Introduction ............................................................................................................... 542
Link Configuration Example 1................................................................................ 542
Link Number Negotiation................................................................................ 542
Lane Number Negotiation ............................................................................... 543
Confirming Link and Lane Numbers ............................................................. 544
Link Configuration Example 2................................................................................ 545
Link Number Negotiation................................................................................ 546
Lane Number Negotiation ............................................................................... 547
Confirming Link and Lane Numbers ............................................................. 548
Link Configuration Example 3: Failed Lane ......................................................... 549
Link Number Negotiation................................................................................ 549
Lane Number Negotiation ............................................................................... 550
Confirming Link and Lane Numbers ............................................................. 551
Detailed Configuration Substates.................................................................................. 552
Configuration.Linkwidth.Start ............................................................................... 553
Downstream Lanes............................................................................................ 553
Crosslinks............................................................................................................ 554
Upconfiguring the Link Width ........................................................................ 554
Upstream Lanes ................................................................................................. 556
Crosslinks............................................................................................................ 556
Configuration.Linkwidth.Accept ........................................................................... 558
Configuration.Lanenum.Wait................................................................................. 559
Configuration.Lanenum.Accept ............................................................................. 560
Configuration.Complete .......................................................................................... 562
Configuration.Idle .................................................................................................... 566
L0 State ..................................................................................................................................... 568
Speed Change ................................................................................................................... 568
Link Width Change ......................................................................................................... 570
Link Partner Initiated ...................................................................................................... 570
Recovery State......................................................................................................................... 571
Reasons for Entering Recovery State ............................................................................ 572
Initiating the Recovery Process...................................................................................... 572
Detailed Recovery Substates .......................................................................................... 573
Speed Change Example................................................................................................... 576
Link Equalization Overview .......................................................................................... 577
Phase 0 ........................................................................................................................ 578
Phase 1 ........................................................................................................................ 581
Phase 2 ........................................................................................................................ 583
Phase 3 ........................................................................................................................ 586
Equalization Notes ................................................................................................... 586
Detailed Equalization Substates .................................................................................... 587
Recovery.Equalization ............................................................................................. 587
Phase 1 Downstream......................................................................................... 589
Phase 2 Downstream......................................................................................... 589
Phase 3 Downstream......................................................................................... 591
Phase 0 Upstream .............................................................................................. 592
Phase 1 Upstream .............................................................................................. 593

Phase 2 Upstream .............................................................................................. 593
Phase 3 Upstream .............................................................................................. 594
Recovery.Speed ......................................................................................................... 595
Recovery.RcvrCfg ..................................................................................................... 598
Recovery.Idle............................................................................................................. 601
L0s State............................................................................................................................. 603
L0s Transmitter State Machine ............................................................................... 603
Tx_L0s.Entry....................................................................................................... 604
Tx_L0s.Idle.......................................................................................................... 604
Tx_L0s.FTS.......................................................................................................... 604
L0s Receiver State Machine ..................................................................................... 605
Rx_L0s.Entry ...................................................................................................... 606
Rx_L0s.Idle ......................................................................................................... 606
Rx_L0s.FTS ......................................................................................................... 606
L1 State .............................................................................................................................. 607
L1.Entry ...................................................................................................................... 608
L1.Idle ......................................................................................................................... 609
L2 State .............................................................................................................................. 609
L2.Idle ......................................................................................................................... 611
L2.TransmitWake...................................................................................................... 612
Hot Reset State.................................................................................................................. 612
Disable State...................................................................................................................... 613
Loopback State ................................................................................................................. 613
Loopback.Entry ......................................................................................................... 614
Loopback.Active ....................................................................................................... 617
Loopback.Exit............................................................................................................ 618
Dynamic Bandwidth Changes............................................................................................. 618
Dynamic Link Speed Changes ....................................................................................... 619
Upstream Port Initiates Speed Change......................................................................... 622
Speed Change Example................................................................................................... 622
Software Control of Speed Changes.............................................................................. 627
Dynamic Link Width Changes....................................................................................... 629
Link Width Change Example ......................................................................................... 630
Related Configuration Registers......................................................................................... 638
Link Capabilities Register............................................................................................... 638
Max Link Speed [3:0]................................................................................................ 639
Maximum Link Width[9:4]...................................................................................... 640
Link Capabilities 2 Register............................................................................................ 640
Link Status Register ......................................................................................................... 641
Current Link Speed[3:0]........................................... 641
Negotiated Link Width[9:4] .................................................................................... 641
Undefined[10]............................................................................................................ 642

Link Training[11] ...................................................................................................... 642
Link Control Register ...................................................................................................... 642
Link Disable............................................................................................................... 643
Retrain Link ............................................................................................................... 643
Extended Synch......................................................................................................... 643

Part Five: Additional System Topics

Chapter 15: Error Detection and Handling


Background ............................................................................................................................. 648
PCIe Error Definitions .......................................................................................................... 650
PCIe Error Reporting ............................................................................................................. 650
Baseline Error Reporting................................................................................................. 650
Advanced Error Reporting (AER) ................................................................................. 651
Error Classes............................................................................................................................ 651
Correctable Errors ............................................................................................................ 651
Uncorrectable Errors........................................................................................................ 652
Non-fatal Uncorrectable Errors .............................................................................. 652
Fatal Uncorrectable Errors....................................................................................... 652
PCIe Error Checking Mechanisms...................................................................................... 652
CRC .................................................................................................................................... 653
Error Checks by Layer..................................................................................................... 655
Physical Layer Errors ............................................................................................... 655
Data Link Layer Errors ............................................................................................ 655
Transaction Layer Errors ......................................................................................... 656
Error Pollution ........................................................................................................................ 656
Sources of PCI Express Errors.............................................................................................. 657
ECRC Generation and Checking ................................................................................... 657
TLP Digest.................................................................................................................. 659
Variant Bits Not Included in ECRC Mechanism .................................................. 659
Data Poisoning ................................................................................................................. 660
Split Transaction Errors .................................................................................................. 662
Unsupported Request (UR) Status ......................................................................... 663
Completer Abort (CA) Status.................................................................................. 664
Unexpected Completion .......................................................................................... 664
Completion Timeout ................................................................................................ 665
Link Flow Control Related Errors ................................................................................. 666
Malformed TLP ................................................................................................................ 666
Internal Errors .................................................................................................................. 667
The Problem............................................................................................................... 667
The Solution............................................................................................................... 668

How Errors are Reported ...................................................................................................... 668
Introduction ...................................................................................................................... 668
Error Messages ................................................................................................................. 668
Advisory Non-Fatal Errors...................................................................................... 670
Advisory Non-Fatal Cases....................................................................................... 671
Baseline Error Detection and Handling............................................................................. 674
PCI-Compatible Error Reporting Mechanisms ........................................................... 674
General ....................................................................................................................... 674
Legacy Command and Status Registers ................................................................ 675
Baseline Error Handling.................................................................................................. 677
Enabling/Disabling Error Reporting..................................................................... 678
Device Control Register.................................................................................... 680
Device Status Register....................................................................................... 681
Root's Response to Error Message ......................................................................... 682
Link Errors ................................................................................................................. 683
Advanced Error Reporting (AER) ....................................................................................... 685
Advanced Error Capability and Control ...................................................................... 686
Handling Sticky Bits ........................................................................................................ 688
Advanced Correctable Error Handling ........................................................................ 688
Advanced Correctable Error Status ....................................................................... 689
Advanced Correctable Error Masking................................................................... 690
Advanced Uncorrectable Error Handling .................................................................... 691
Advanced Uncorrectable Error Status ................................................................... 691
Selecting Uncorrectable Error Severity.................................................................. 693
Uncorrectable Error Masking.................................................................................. 694
Header Logging................................................................................................................ 695
Root Complex Error Tracking and Reporting ............................................................. 696
Root Complex Error Status Registers .................................................................... 696
Advanced Source ID Register ................................................................................. 697
Root Error Command Register ............................................................................... 698
Summary of Error Logging and Reporting ....................................................................... 698
Example Flow of Software Error Investigation ................................................................ 699

Chapter 16: Power Management


Introduction............................................................................................................................. 704
Power Management Primer.................................................................................................. 705
Basics of PCI PM .............................................................................................................. 705
ACPI Spec Defines Overall PM...................................................................................... 707
System PM States ...................................................................................................... 708
Device PM States....................................................................................................... 709
Definition of Device Context................................................................................... 709
General ................................................................................................................ 709

PME Context ...................................................................................................... 710
Device-Class-Specific PM Specs ............................................................................. 710
Default Device Class Spec ................................................................................ 710
Device Class-Specific PM Specs ...................................................................... 711
Power Management Policy Owner ........................................................................ 711
PCI Express Power Management vs. ACPI.................................................................. 711
PCI Express Bus Driver Accesses PM Registers................................................... 711
ACPI Driver Controls Non-Standard Embedded Devices ................................. 712
Function Power Management .............................................................................................. 713
The PM Capability Register Set ..................................................................................... 713
Device PM States.............................................................................................................. 713
D0 State - Full On ..................................................................................................... 714
Mandatory. ......................................................................................................... 714
D0 Uninitialized................................................................................................. 714
D0 Active ............................................................................................................ 714
Dynamic Power Allocation (DPA) ......................................................................... 714
D1 State - Light Sleep............................................................................................... 716
D2 State - Deep Sleep............................................................................................... 717
D3 - Full Off .............................................................................................................. 719
D3Hot State......................................................................................................... 719
D3Cold State....................................................................................................... 721
Function PM State Transitions................................................................................ 722
Detailed Description of PCI-PM Registers ................................................................... 724
PM Capabilities (PMC) Register............................................................................. 724
PM Control and Status Register (PMCSR) ............................................................ 727
Data Register ............................................................................................................. 731
Determining Presence of the Data Register ................................................... 731
Operation of the Data Register ........................................................................ 731
Multi-Function Devices .................................................................................... 732
Virtual PCI-to-PCI Bridge Power Data........................................................... 732
Introduction to Link Power Management......................................................................... 733
Active State Power Management (ASPM)......................................................................... 735
Electrical Idle .................................................................................................................... 736
Transmitter Entry to Electrical Idle ........................................................................ 736
Gen1/Gen2 Mode Encoding............................................................................ 737
Gen3 Mode Encoding........................................................................................ 737
Transmitter Exit from Electrical Idle...................................................................... 738
Gen1 Mode ......................................................................................................... 738
Gen2 Mode ......................................................................................................... 738
Gen3 Mode ......................................................................................................... 739
Receiver Entry to Electrical Idle.............................................................................. 740
Detecting Electrical Idle Voltage ..................................................................... 740

Inferring Electrical Idle ..................................................................................... 741
Receiver Exit from Electrical Idle ........................................................................... 742
L0s State............................................................................................................................. 744
Entry into L0s ............................................................................................................ 745
Entry into L0s ..................................................................................................... 745
Flow Control Credits Must be Delivered....................................................... 746
Transmitter Initiates Entry to L0s ................................................................... 746
Exit from L0s State .................................................................................................... 746
Transmitter Initiates L0s Exit........................................................................... 746
Actions Taken by Switches that Receive L0s Exit......................................... 746
L1 ASPM State .................................................................................................................. 747
Downstream Component Decides to Enter L1 ASPM ........................................ 748
Negotiation Required to Enter L1 ASPM .............................................................. 748
Scenario 1: Both Ports Ready to Enter L1 ASPM State ........................................ 748
Downstream Component Requests L1 State ................................................. 748
Upstream Component Response to L1 ASPM Request ............................... 749
Upstream Component Acknowledges Request to Enter L1........................ 749
Downstream Component Sees Acknowledgement...................................... 749
Upstream Component Receives Electrical Idle ............................................. 749
Scenario 2: Upstream Component Transmits TLP Just Prior to Receiving L1 Request............................................................................................. 750
TLP Must Be Accepted by Downstream Component .................................. 751
Upstream Component Receives Request to Enter L1................................... 751
Scenario 3: Downstream Component Receives TLP During Negotiation........ 751
Scenario 4: Upstream Component Receives TLP During Negotiation ............. 751
Scenario 5: Upstream Component Rejects L1 Request........................................ 752
Exit from L1 ASPM State ......................................................................................... 753
L1 ASPM Exit Signaling.................................................................................... 753
Switch Receives L1 Exit from Downstream Component............................. 753
Switch Receives L1 Exit from Upstream Component .................................. 754
ASPM Exit Latency .......................................................................................................... 756
Reporting a Valid ASPM Exit Latency .................................................................. 756
L0s Exit Latency Update................................................................................... 756
L1 Exit Latency Update .................................................................................... 757
Calculating Latency from Endpoint to Root Complex........................................ 758
Software Initiated Link Power Management ................................................................... 760
D1/D2/D3Hot and the L1 State .................................................................................... 760
Entering the L1 State ................................................................................................ 760
Exiting the L1 State ................................................................................................... 762
Upstream Component Initiates ....................................................................... 762
Downstream Component Initiates L1 to L0 Transition ............................... 763
The L1 Exit Protocol .......................................................................................... 763

L2/L3 Ready - Removing Power from the Link....................................................... 763


L2/L3 Ready Handshake Sequence....................................................................... 764
Exiting the L2/L3 Ready State - Clock and Power Removed.......................... 767
The L2 State................................................................................................................ 767
The L3 State................................................................................................................ 767
Link Wake Protocol and PME Generation ........................................................................ 768
The PME Message ............................................................................................................ 769
The PME Sequence........................................................................................................... 770
PME Message Back Pressure Deadlock Avoidance .................................................... 770
Background................................................................................................................ 770
The Problem............................................................................................................... 771
The Solution............................................................................................................... 771
The PME Context ............................................................................................................. 771
Waking Non-Communicating Links............................................................................. 772
Beacon......................................................................................................................... 772
WAKE#....................................................................................................................... 773
Auxiliary Power ............................................................................................................... 775
Improving PM Efficiency ..................................................................................................... 776
Background....................................................................................................................... 776
OBFF (Optimized Buffer Flush and Fill) ...................................................................... 776
The Problem............................................................................................................... 776
The Solution............................................................................................................... 778
Using the WAKE# Pin....................................................................................... 779
Using the OBFF Message.................................................................................. 780
LTR (Latency Tolerance Reporting) .............................................................................. 784
LTR Registers............................................................................................................. 784
LTR Messages............................................................................................................ 786
Guidelines Regarding LTR Use .............................................................................. 786
LTR Example ............................................................................................................. 789

Chapter 17: Interrupt Support


Interrupt Support Background ............................................................................................ 794
General............................................................................................................................... 794
Two Methods of Interrupt Delivery.............................................................................. 794
The Legacy Model .................................................................................................................. 796
General............................................................................................................................... 796
Changes to Support Multiple Processors ..................................................................... 798
Legacy PCI Interrupt Delivery....................................................................................... 800
Device INTx# Pins .................................................................................................... 800
Determining INTx# Pin Support ............................................................................ 801
Interrupt Routing...................................................................................................... 802
Associating the INTx# Line to an IRQ Number ................................................... 802
INTx# Signaling ........................................................................................................ 803
Interrupt Disable................................................................................................ 803
Interrupt Status .................................................................................................. 804
Virtual INTx Signaling .................................................................................................... 805
General ....................................................................................................................... 805
Virtual INTx Wire Delivery..................................................................................... 806
INTx Message Format .............................................................................................. 807
Mapping and Collapsing INTx Messages .................................................................... 808
INTx Mapping........................................................................................................... 808
INTx Collapsing ........................................................................................................ 810
INTx Delivery Rules................................................................................................. 812
The MSI Model....................................................................................................................... 812
The MSI Capability Structure......................................................................................... 812
Capability ID ............................................................................................................. 814
Next Capability Pointer ........................................................................................... 814
Message Control Register ........................................................................................ 814
Message Address Register....................................................................................... 816
Message Data Register ............................................................................................. 817
Mask Bits Register and Pending Bits Register...................................................... 817
Basics of MSI Configuration ........................................................................................... 817
Basics of Generating an MSI Interrupt Request .......................................................... 820
Multiple Messages ........................................................................................................... 820
The MSI-X Model................................................................................................................... 821
General............................................................................................................................... 821
MSI-X Capability Structure ............................................................................................ 822
MSI-X Table....................................................................................................................... 824
Pending Bit Array ............................................................................................................ 825
Memory Synchronization When Interrupt Handler Entered ........................................ 826
The Problem...................................................................................................................... 826
One Solution ..................................................................................................................... 827
An MSI Solution ............................................................................................................... 827
Traffic Classes Must Match ............................................................................................ 828
Interrupt Latency.................................................................................................................... 829
MSI May Result In Errors..................................................................................................... 829
Some MSI Rules and Recommendations .......................................................................... 830
Special Consideration for Base System Peripherals ....................................................... 830
Example Legacy System.................................................................................................. 831

Chapter 18: System Reset


Two Categories of System Reset ......................................................................................... 833
Conventional Reset................................................................................................................ 834
Fundamental Reset .......................................................................................................... 834
PERST# Fundamental Reset Generation ............................................................... 835
Autonomous Reset Generation............................................................................... 835
Link Wakeup from L2 Low Power State ............................................................... 836
Hot Reset (In-band Reset) ............................................................................................... 837
Response to Receiving Hot Reset ........................................................................... 837
Switches Generate Hot Reset on Downstream Ports........................................... 838
Bridges Forward Hot Reset to the Secondary Bus ............................................... 838
Software Generation of Hot Reset .......................................................................... 838
Software Can Disable the Link ............................................................................... 840
Function Level Reset (FLR) .................................................................................................. 842
Time Allowed ................................................................................................................... 844
Behavior During FLR ...................................................................................................... 845
Reset Exit.................................................................................................................................. 846

Chapter 19: Hot Plug and Power Budgeting


Background ............................................................................................................................. 848
Hot Plug in the PCI Express Environment ........................................................................ 848
Surprise Removal Notification....................................................................................... 849
Differences between PCI and PCIe Hot Plug............................................................... 849
Elements Required to Support Hot Plug ........................................................................... 852
Software Elements ........................................................................................................... 852
Hardware Elements ......................................................................................................... 853
Card Removal and Insertion Procedures........................................................................... 855
On and Off States ............................................................................................................. 855
Turning Slot Off ........................................................................................................ 855
Turning Slot On......................................................................................................... 855
Card Removal Procedure................................................................................................ 856
Card Insertion Procedure................................................................................................ 857
Standardized Usage Model .................................................................................................. 858
Background....................................................................................................................... 858
Standard User Interface .................................................................................................. 859
Attention Indicator ................................................................................................... 859
Power Indicator......................................................................................................... 860
Manually Operated Retention Latch and Sensor ................................................. 861
Electromechanical Interlock (optional).................................................................. 862
Software User Interface............................................................................................ 862
Attention Button ....................................................................................................... 862
Slot Numbering Identification ................................................................................ 862
Standard Hot Plug Controller Signaling Interface.......................................................... 863
The Hot-Plug Controller Programming Interface............................................................ 864
Slot Capabilities................................................................................................................ 865
Slot Power Limit Control ................................................................................................ 867
Slot Control ....................................................................................................................... 868
Slot Status and Events Management............................................................................. 870
Add-in Card Capabilities................................................................................................ 872
Quiescing Card and Driver .................................................................................................. 873
General............................................................................................................................... 873
Pausing a Driver (Optional) ........................................................................................... 874
Quiescing a Driver That Controls Multiple Devices ........................................... 874
Quiescing a Failed Card........................................................................................... 874
The Primitives......................................................................................................................... 874
Introduction to Power Budgeting ....................................................................................... 876
The Power Budgeting Elements .......................................................................................... 877
System Firmware ............................................................................................................. 877
The Power Budget Manager........................................................................................... 878
Expansion Ports................................................................................................................ 878
Add-in Devices................................................................................................................. 879
Slot Power Limit Control...................................................................................................... 881
Expansion Port Delivers Slot Power Limit................................................................... 881
Expansion Device Limits Power Consumption........................................................... 883
The Power Budget Capabilities Register Set.................................................................... 883

Chapter 20: Updates for Spec Revision 2.1


Changes for PCIe Spec Rev 2.1 ............................................................................................ 887
System Redundancy Improvement: Multi-casting.......................................................... 888
Multicast Capability Registers ....................................................................................... 889
Multicast Capability ................................................................................................. 889
Multicast Control ...................................................................................................... 890
Multicast Base Address............................................................................................ 891
MC Receive ................................................................................................................ 892
MC Block All ............................................................................................................. 892
MC Block Untranslated............................................................................................ 892
Multicast Example ........................................................................................................... 893
MC Overlay BAR ............................................................................................................. 894
Overlay Example.............................................................................................................. 895
Routing Multicast TLPs................................................................................................... 896
Congestion Avoidance .................................................................................................... 897
Performance Improvements................................................................................................. 897
AtomicOps ........................................................................................................................ 897
TPH (TLP Processing Hints)........................................................................................... 899
TPH Examples........................................................................................................... 900
Device Write to Host Read ............................................................................... 900
Host Write to Device Read ............................................................................... 902
Device to Device ................................................................................................ 903
TPH Header Bits ....................................................................................................... 904
Steering Tags ............................................................................................................. 906
TLP Prefixes............................................................................................................... 908
IDO (ID-based Ordering)................................................................................................ 909
ARI (Alternative Routing-ID Interpretation) ............................................................... 909
Power Management Improvements ................................................................................... 910
DPA (Dynamic Power Allocation) ................................................................................. 910
LTR (Latency Tolerance Reporting) .............................................................................. 910
OBFF (Optimized Buffer Flush and Fill) ...................................................................... 910
ASPM Options.................................................................................................................. 910
Configuration Improvements .............................................................................................. 911
Internal Error Reporting ................................................................................................. 911
Resizable BARs................................................................................................................. 911
Capability Register ................................................................................................... 912
Control Register ........................................................................................................ 912
Simplified Ordering Table .............................................................................................. 914

Appendices

Appendix A: Debugging PCIe Traffic with LeCroy Tools


Overview.................................................................................................................................. 917
Pre-silicon Debugging .......................................................................................................... 918
RTL Simulation Perspective ........................................................................................... 918
PCI Express RTL Bus Monitor ....................................................................................... 918
RTL vector export to PETracer Application................................................................. 918
Post-Silicon Debug ................................................................................................................ 919
Oscilloscope ...................................................................................................................... 919
Protocol Analyzer ............................................................................................................ 920
Logic Analyzer ................................................................................................................. 921
Using a Protocol Analyzer Probing Option ...................................................................... 921
Viewing Traffic Using the PETracer Application............................................................ 924
CATC Trace Viewer......................................................................................................... 924
LTSSM Graphs.................................................................................................................. 927
Flow Control Credit Tracking ........................................................................................ 928
Bit Tracer ........................................................................................................................... 929
Analysis overview ........................................................................................................... 931
Traffic generation................................................................................................................... 931
Pre-Silicon ......................................................................................................................... 931
Post-Silicon........................................................................................................................ 931
Exerciser Card ........................................................................................................... 931
PTC card..................................................................................................................... 932
Conclusion............................................................................................................................... 933

Appendix B: Markets & Applications for PCI Express


Introduction............................................................................................................................. 935
PCI Express IO Virtualization Solutions........................................................................... 937
Multi-Root (MR) PCIe Switch Solution ............................................................................ 938
PCIe Beyond Chip-to-Chip Interconnect .......................................................................... 939
SSD/Storage IO Expansion Boxes....................................................................................... 940
PCIe in SSD Modules for Servers....................................................................................... 940
Conclusion............................................................................................................................... 942

Appendix C: Implementing Intelligent Adapters and Multi-Host Systems With PCI Express Technology


Introduction............................................................................................................................. 943
Usage Models.......................................................................................................................... 944
Intelligent Adapters......................................................................................................... 944
Host Failover .................................................................................................................... 944
Multiprocessor Systems ................................................................................................. 945
The History of Multi-Processor Implementations Using PCI ........................................ 945
Implementing Multi-host/Intelligent Adapters in PCI Express Base Systems.......... 947
Example: Implementing Intelligent Adapters in a PCI Express Base System ....... 950
Example: Implementing Host Failover in a PCI Express System ............................ 952
Example: Implementing Dual Host in a PCI Express Base System.......................... 955
Summary .................................................................................................................................. 957
Address Translation .............................................................................................................. 958
Direct Address Translation............................................................................................. 959
Lookup Table Based Address Translation ................................................................... 959
Downstream BAR Limit Registers................................................................................. 960
Forwarding 64bit Address Memory Transactions ...................................................... 961

Appendix D: Locked Transactions


Introduction............................................................................................................................. 963
Background ............................................................................................................................. 963
The PCI Express Lock Protocol............................................................................................ 964
Lock Messages: The Virtual Lock Signal .................................................................. 964
The Lock Protocol Sequence: an Example ................................................................ 965
The Memory Read Lock Operation........................................................................ 965
Read Data Modified and Written to Target and Lock Completes..................... 967
Notification of an Unsuccessful Lock ........................................................................... 970
Summary of Locking Rules.................................................................................................. 970
Rules Related To the Initiation and Propagation of Locked Transactions .............. 970
Rules Related to Switches ............................................................................................... 971
Rules Related To PCI Express/PCI Bridges................................................................. 972
Rules Related To the Root Complex.............................................................................. 972
Rules Related To Legacy Endpoints.............................................................................. 972
Rules Related To PCI Express Endpoints..................................................................... 972

Glossary........................................................................................973

Figures

1-1 Legacy PCI Bus-Based Platform ............................................................................... 12
1-2 PCI Bus Arbitration .................................................................................................... 13
1-3 Simple PCI Bus Transfer ............................................................................................ 15
1-4 PCI Reflected-Wave Signaling .................................................................................. 17
1-5 33 MHz PCI System, Including a PCI-to-PCI Bridge ............................................ 18
1-6 PCI Transaction Models............................................................................................. 19
1-7 PCI Transaction Retry Mechanism........................................................................... 21
1-8 PCI Transaction Disconnect Mechanism................................................................. 23
1-9 PCI Error Handling .................................................................................................... 24
1-10 Address Space Mapping ............................................................................................ 26
1-11 Configuration Address Register............................................................................... 27
1-12 PCI Configuration Header Type 1 (Bridge) ............................................................ 28
1-13 PCI Configuration Header Type 0 (not a Bridge) .................................................. 29
1-14 66 MHz PCI Bus Based Platform .............................................................................. 30
1-15 66 MHz/133 MHz PCI-X Bus Based Platform ....................................................... 32
1-16 Example PCI-X Burst Memory Read Bus Cycle ..................................................... 33
1-17 PCI-X Split Transaction Protocol .............................................................................. 34
1-18 Inherent Problems in a Parallel Design ................................................................... 36
1-19 Source-Synchronous Clocking Model ..................................................................... 38
2-1 Dual-Simplex Link...................................................................................................... 40
2-2 One Lane ...................................................................................................................... 40
2-3 Parallel Bus Limitations ............................................................................................. 42
2-4 Differential Signaling ................................................................................................. 44
2-5 Simple PLL Block Diagram ....................................................................................... 45
2-6 Example PCIe Topology ............................................................................................ 47
2-7 Configuration Headers .............................................................................................. 50
2-8 Topology Example...................................................................................................... 51
2-9 Example Results of System Enumeration ............................................................... 52
2-10 Low-Cost PCIe System............................................................................................... 53
2-11 Server PCIe System..................................................................................................... 54
2-12 PCI Express Device Layers........................................................................................ 56
2-13 Switch Port Layers ...................................................................................................... 57
2-14 Detailed Block Diagram of PCI Express Device's Layers ..................................... 58
2-15 TLP Origin and Destination ...................................................................................... 62
2-16 TLP Assembly ............................................................................................................. 63
2-17 TLP Disassembly......................................................................................................... 64
2-18 Non-Posted Read Example........................................................................................ 65
2-19 Non-Posted Locked Read Transaction Protocol..................................................... 67
2-20 Non-Posted Write Transaction Protocol.................................................................. 68
2-21 Posted Memory Write Transaction Protocol........................................................... 69
2-22 QoS Example .............................................................................................................. 71
2-23 Flow Control Basics .................................................................................................... 72
2-24 DLLP Origin and Destination ................................................................................... 73
2-25 Data Link Layer Replay Mechanism........................................................................ 74
2-26 TLP and DLLP Structure at the Data Link Layer................................................... 75
2-27 Non-Posted Transaction with Ack/Nak Protocol ................................................. 76
2-28 TLP and DLLP Structure at the Physical Layer...................................................... 77
2-29 Physical Layer Electrical ............................................................................................ 79
2-30 Ordered Sets Origin and Destination ...................................................................... 80
2-31 Ordered-Set Structure ................................................................................................ 80
2-32 Memory Read Request Phase.................................................................................... 81
2-33 Completion with Data Phase .................................................................................... 83
3-1 Example System .......................................................................................................... 87
3-2 PCI Compatible Configuration Register Space ...................................................... 89
3-3 4KB Configuration Space per PCI Express Function............................................. 90
3-4 Configuration Address Port at 0CF8h ..................................................................... 92
3-5 Single-Root System ..................................................................................................... 95
3-6 Multi-Root System ...................................................................................................... 97
3-7 Type 0 Configuration Read and Write Request Headers ................................... 100
3-8 Type 1 Configuration Read and Write Request Headers ................................... 101
3-9 Example Configuration Read Access..................................................................... 104
3-10 Topology View At Startup ...................................................................................... 105
3-11 Root Control Register in PCIe Capability Block................................................... 108
3-12 Header Type Register............................................................................................... 108
3-13 Single-Root System ................................................................................................... 113
3-14 Multi-Root System .................................................................................................... 116
3-15 Partial Screenshot of MindShare Arbor................................................................. 118
4-1 Generic Memory And IO Address Maps .............................................................. 125
4-2 BARs in Configuration Space.................................................................................. 127
4-3 PCI Express Devices And Type 0 And Type 1 Header Use ............................... 128
4-4 32-Bit Non-Prefetchable Memory BAR Set Up..................................................... 130
4-5 64-Bit Prefetchable Memory BAR Set Up.............................................................. 132
4-6 IO BAR Set Up........................................................................................................... 134
4-7 Example Topology for Setting Up Base and Limit Values ................................. 137
4-8 Example Prefetchable Memory Base/Limit Register Values ............................. 138
4-9 Example Non-Prefetchable Memory Base/Limit Register Values.................... 140
4-10 Example IO Base/Limit Register Values............................................................... 142
4-11 Final Example Address Routing Setup.................................................................. 145
4-12 Multi-Port PCIe Devices Have Routing Responsibilities.................................... 146
4-13 PCI Express Transaction Request And Completion TLPs .................................. 149
4-14 Transaction Layer Packet Generic 3DW And 4DW Headers ............................. 152
4-15 3DW TLP Header - ID Routing Fields ................................................................... 156
4-16 4DW TLP Header - ID Routing Fields ................................................................... 156
4-17 Switch Checks Routing Of An Inbound TLP Using ID Routing ....................... 158
4-18 3DW TLP Header - Address Routing Fields......................................................... 159
4-19 4DW TLP Header - Address Routing Fields......................................................... 160
4-20 Endpoint Checks Incoming TLP Address............................................................. 161
4-21 Switch Checks Routing Of An Inbound TLP Using Address ............................ 162
4-22 4DW Message TLP Header - Implicit Routing Fields ......................................... 164
5-1 TLP And DLLP Packets ........................................................................................... 170
5-2 PCIe TLP Assembly/Disassembly ......................................................................... 173
5-3 Generic TLP Header Fields ..................................................................................... 175
5-4 Using First DW and Last DW Byte Enable Fields................................................ 182
5-5 Transaction Descriptor Fields ................................................................................. 183
5-6 System IO Map.......................................................................................................... 185
5-7 3DW IO Request Header Format............................................................................ 185
5-8 3DW And 4DW Memory Request Header Formats ............................................ 188
5-9 3DW Configuration Request And Header Format .............................................. 193
5-10 3DW Completion Header Format ......................................................................... 197
5-11 4DW Message Request Header Format................................................................. 203
5-12 Vendor-Defined Message Header .......................................................................... 211
5-13 LTR Message Header ............................................................................................... 212
5-14 OBFF Message Header............................................................................................. 213
6-1 Location of Flow Control Logic .............................................................................. 217
6-2 Flow Control Buffer Organization ......................................................................... 218
6-3 Physical Layer Reports That It's Ready................................................................. 222
6-4 The Data Link Control & Management State Machine ....................................... 223
6-5 INIT1 Flow Control DLLP Format and Contents ................................................ 224
6-6 Devices Send InitFC1 in the DL_Init State ............................................................ 225
6-7 FC Values Registered - Send InitFC2s, Report DL_Up ....................................... 226
6-8 Flow Control Elements ............................................................................................ 228
6-9 Types and Format of Flow Control DLLPs........................................................... 229
6-10 Flow Control Elements Following Initialization .................................................. 231
6-11 Flow Control Elements After First TLP Sent ........................................................ 232
6-12 Flow Control Elements with Flow Control Buffer Filled.................................... 234
6-13 Flow Control Rollover Problem.............................................................................. 235
6-14 Buffer Overflow Error Check .................................................................................. 236
6-15 Flow Control Update Example ............................................................................... 238
6-16 Update Flow Control Packet Format and Contents ............................................ 239
7-1 Virtual Channel Capability Registers .................................................................... 246
7-2 Traffic Class Field in TLP Header .......................................................................... 247
7-3 TC to VC Mapping Example ................................................................................... 249
7-4 Multiple VCs Supported by a Device .................................................................... 250
7-5 Extended VCs Supported Field .............................................................................. 251
7-6 VC Arbitration Example .......................................................................................... 253
7-7 Strict Priority Arbitration......................................................................................... 254
7-8 Low-Priority Extended VCs .................................................................................... 255
7-9 VC Arbitration Capabilities..................................................................................... 256
7-10 VC Arbitration Priorities ......................................................................................... 257
7-11 WRR VC Arbitration Table...................................................................................... 258
7-12 VC Arbitration Table Offset and Load VC Arbitration Table Fields ................ 259
7-13 Loading the VC Arbitration Table Entries ............................................................ 260
7-14 Port Arbitration Concept ......................................................................................... 262
7-15 Port Arbitration Tables for Each VC ...................................................................... 263
7-16 Port Arbitration Buffering ....................................................................................... 264
7-17 Software Selects Port Arbitration Scheme............................................................. 265
7-18 Maximum Time Slots Register................................................................................ 267
7-19 Format of Port Arbitration Tables .......................................................................... 268
7-20 Arbitration Examples in a Switch........................................................................... 270
7-21 Simple Multi-Function Arbitration ........................................................................ 271
7-22 QoS Support in Multi-Function Arbitration ......................................................... 272
7-23 Example Application of Isochronous Transaction............................................... 274
7-24 Example Isochronous System ................................................................................. 277
7-25 Injection of Isochronous Packets ............................................................................ 279
7-26 Over-Subscribing the Bandwidth........................................................................... 280
7-27 Bandwidth Congestion ............................................................................................ 281
8-1 Example Producer/Consumer Topology.............................................................. 291
8-2 Producer/Consumer Sequence Example - Part 1.............................................. 293
8-3 Producer/Consumer Sequence Example - Part 2.............................................. 294
8-4 Producer/Consumer Sequence with Error ........................................................... 296
8-5 Relaxed Ordering Bit in a 32-bit Header ............................................................... 297
8-6 Strongly Ordered Example Results in Temporary Stall...................................... 300
8-7 Different Sources are Unlikely to Have Dependencies ....................................... 302
8-8 IDO Attribute in 64-bit Header............................................................................... 303
9-1 Data Link Layer Sends A DLLP.............................................................................. 308
9-2 Generic Data Link Layer Packet Format ............................................................... 310
9-3 Ack Or Nak DLLP Format....................................................................................... 312
9-4 Power Management DLLP Format ........................................................................ 314
9-5 Flow Control DLLP Format..................................................................................... 315
9-6 Vendor-Specific DLLP Format................................................................................ 316
10-1 Data Link Layer......................................................................................................... 318
10-2 Overview of the Ack/Nak Protocol....................................................................... 319
10-3 Elements of the Ack/Nak Protocol ........................................................................ 320
10-4 Transmitter Elements Associated with the Ack/Nak Protocol ......................... 322
10-5 Receiver Elements Associated with the Ack/Nak Protocol ............................... 325
10-6 Examples of Sequence Number Ranges ................................................................ 327
10-7 Ack Or Nak DLLP Format....................................................................................... 328
10-8 Example 1 - Example of Ack ................................................................................... 332
10-9 Example 2 - Ack with Sequence Number Rollover ............................................. 333
10-10 Example of a Nak...................................................................................................... 335
10-11 Gen1 Unadjusted REPLAY_TIMER Values.......................................................... 339
10-12 Ack/Nak Receiver Elements................................................................................... 341
10-13 Handling Lost TLPs.................................................................................................. 346
10-14 Handling Bad Ack .................................................................................................... 347
10-15 Handling Bad Nak.................................................................................................... 349
10-16 Switch Cut-Through Mode Showing Error Handling......................................... 357
11-1 PCIe Port Layers ....................................................................................................... 362
11-2 Logical and Electrical Sub-Blocks of the Physical Layer..................................... 363
11-3 Physical Layer Transmit Details ............................................................................. 365
11-4 Physical Layer Receive Logic Details..................................................................... 367
11-5 Physical Layer Transmit Logic Details (Gen1 and Gen2 Only) ......................... 369
11-6 Transmit Logic Multiplexer..................................................................................... 370
11-7 TLP and DLLP Packet Framing with Start and End Control Characters ......... 371
11-8 x1 Byte Striping ......................................................................................................... 372
11-9 x4 Byte Striping ......................................................................................................... 372
11-10 x8 Byte Striping with DWord Parallel Data.......................................................... 373
11-11 x1 Packet Format....................................................................................................... 374
11-12 x4 Packet Format....................................................................................................... 375
11-13 x8 Packet Format....................................................................................................... 377
11-14 Scrambler ................................................................................................................... 378
11-15 Example of 8-bit Character 00h Encoding............................................................. 381
11-16 8b/10b Nomenclature .............................................................................................. 382
11-17 8-bit to 10-bit (8b/10b) Encoder.............................................................................. 384
11-18 Example 8b/10b Encodings .................................................................................... 385
11-19 Example 8b/10b Transmission ............................................................................... 386
11-20 SKIP Ordered Set ...................................................................................................... 392
11-21 Physical Layer Receive Logic Details (Gen1 and Gen2 Only)............................ 393
11-22 Receiver Logic's Front End Per Lane ..................................................................... 394
11-23 Receiver's Link De-Skew Logic .............................................................................. 399
11-24 8b/10b Decoder per Lane ........................................................................................ 401
11-25 Example of Delayed Disparity Error Detection ................................................... 401
11-26 Example of x8 Byte Un-Striping ............................................................................. 403
12-1 8b/10b Lane Encoding............................................................................................. 409
12-2 128b/130b Block Encoding...................................................................................... 410
12-3 Sync Header Data Block Example .......................................................................... 411
12-4 Gen3 Mode EIEOS Symbol Pattern ........................................................................ 411
12-5 Gen3 x1 Ordered Set Block Example ..................................................................... 412
12-6 Gen3 FTS Ordered Set Example ............................................................................. 413
12-7 Gen3 x1 Frame Construction Example .................................................................. 414
12-8 Gen3 Frame Token Examples ................................................................................. 417
12-9 AER Correctable Error Register.............................................................................. 421
12-10 Gen3 Physical Layer Transmitter Details.............................................................. 422
12-11 Gen3 Byte Striping x4............................................................................................... 424
12-12 Gen3 x8 Example: TLP Straddles Block Boundary .............................................. 425
12-13 Gen3 x8 Nullified Packet ......................................................................................... 426
12-14 Gen3 x1 Ordered Set Construction ........................................................................ 427
12-15 Gen3 x8 Skip Ordered Set (SOS) Example ............................................................ 428
12-16 Gen3 Per-Lane LFSR Scrambling Logic................................................................. 431
12-17 Gen3 Single-LFSR Scrambler .................................................................................. 433
12-18 Gen3 Physical Layer Receiver Details.................................................................... 436
12-19 Gen3 CDR Logic........................................................................................................ 437
12-20 EIEOS Symbol Pattern.............................................................................................. 438
12-21 Gen3 Elastic Buffer Logic......................................................................................... 441
12-22 Receiver Link De-Skew Logic ................................................................................. 444
12-23 Physical Layer Receive Logic Details..................................................................... 445
13-1 Electrical Sub-Block of the Physical Layer ............................................................ 450
13-2 Differential Transmitter/Receiver.......................................................................... 451
13-3 Differential Common-Mode Noise Rejection ....................................................... 452
13-4 SSC Motivation.......................................................................................................... 454
13-5 Signal Rate Less Than Half the Clock Rate ........................................................... 454
13-6 SSC Modulation Example........................................................................................ 455
13-7 Shared Refclk Architecture...................................................................................... 456
13-8 Data Clocked Rx Architecture ................................................................................ 457
13-9 Separate Refclk Architecture................................................................................... 457
13-10 Test Circuit Measurement Channels...................................................................... 458
13-11 Receiver Detection Mechanism............................................................................... 461
13-12 Differential Signaling ............................................................................................... 463
13-13 Differential Peak-to-Peak (VDIFFp-p) and Peak (VDIFFp) Voltages................ 464
13-14 Transmit Margin Field in Link Control 2 Register............................................... 465
13-15 Receiver DC Common-Mode Voltage Adjustment ............................................. 467
13-16 Transmission with De-emphasis ............................................................................ 469
13-17 Benefit of De-emphasis at the Receiver ................................................................. 471
13-18 Benefit of De-emphasis at Receiver Shown With Differential Signals.............. 472
13-19 De-emphasis Options for 5.0 GT/s ........................................................................ 473
13-20 Reduced-Swing Option for 5.0 GT/s with No De-emphasis ............................. 474
13-21 3-Tap Tx Equalizer.................................................................................................... 475
13-22 Tx 3-Tap Equalizer Shaping of an Output Pulse.................................................. 476
13-23 8.0 GT/s Tx Voltage Levels ..................................................................................... 477
13-24 Tx 3-Tap Equalizer Output...................................................................................... 482
13-25 Example Beacon Signal ............................................................................................ 484
13-26 Transmitter Eye Diagram ........................................................................................ 486
13-27 Rx Normal Eye (No De-emphasis) ......................................................................... 488
13-28 Rx Bad Eye (No De-emphasis)................................................................................ 488
13-29 Rx Discrete-Time Linear Equalizer (DLE)............................................................. 494
13-30 Rx Continuous-Time Linear Equalizer (CTLE) .................................................... 494
13-31 Effect of Rx Continuous-Time Linear Equalizer (CTLE) on Received Signal .. 495
13-32 Rx 1-Tap DFE............................................................................................................. 495
13-33 Rx 2-Tap DFE............................................................................................................. 497
13-34 2.5 GT/s Receiver Eye Diagram ............................................................................. 499
13-35 L0 Full-On Link State ............................................................................................... 500
13-36 L0s Low Power Link State ....................................................................................... 501
13-37 L1 Low Power Link State......................................................................................... 502
13-38 L2 Low Power Link State......................................................................................... 503
13-39 L3 Link Off State ....................................................................................................... 504
14-1 Link Training and Status State Machine Location ............................................... 506
14-2 Lane Reversal Example (Support Optional) ......................................................... 508
14-3 Polarity Inversion Example (Support Required).................................................. 509
14-4 TS1 and TS2 Ordered Sets When In Gen1 or Gen2 Mode .................................. 510
14-5 TS1 and TS2 Ordered Set Block When In Gen3 Mode of Operation................. 511
14-6 Link Training and Status State Machine (LTSSM)............................................... 519
14-7 States Involved in Initial Link Training at 2.5 Gb/s ............................................ 522
14-8 Detect State Machine ................................................................................................ 523
14-9 Polling State Machine............................................................................................... 525
14-10 Polling State Machine with Legacy Speed Change.............................................. 528
14-11 Link Control 2 Register ............................................................................................ 536
14-12 Link Control 2 Register's "Enter Compliance" Bit .............................................. 539
14-13 Link and Lane Number Encoding in TS1/TS2..................................................... 540
14-14 Combining Lanes to Form Wider Links (Link Merging) .................................... 541
14-15 Example 1 - Steps 1 and 2 ........................................................................................ 543
14-16 Example 1 - Steps 3 and 4 ........................................................................................ 544
14-17 Example 1 - Steps 5 and 6 ........................................................................................ 545
14-18 Example 2 - Step 1..................................................................................................... 546
14-19 Example 2 - Step 2..................................................................................................... 547
14-20 Example 2 - Steps 3, 4 and 5 .................................................................................... 548
14-21 Example 3 - Steps 1 and 2 ........................................................................................ 550
14-22 Example 3 - Steps 3 and 4 ........................................................................................ 551
14-23 Example 3 - Steps 5 and 6 ........................................................................................ 552
14-24 Configuration State Machine .................................................................................. 553
14-25 Link Control Register ............................................................................................... 569
14-26 Link Control 2 Register ............................................................................................ 569
14-27 Recovery State Machine........................................................................................... 573
14-28 EC Field in TS1s and TS2s for 8.0 GT/s................................................................. 578
14-29 Equalization Control Registers ............................................................................... 579
14-30 Equalization Process: Starting Point ...................................................................... 581
14-31 Equalization Process: Initiating Phase 2 ................................................................ 583
14-32 Equalization Coefficients Exchanged .................................................................... 584
14-33 3-Tap Transmitter Equalization.............................................................................. 585
14-34 Equalization Process: Adjustments During Phase 2............................................ 585
14-35 Equalization Process: Adjustments During Phase 3............................................ 586
14-36 Link Status 2 Register............................................................................................... 588
14-37 Link Control 3 Register ............................................................................................ 588
14-38 TS1s - Rejecting Coefficient Values ........................................................................ 590
14-39 Link Status Register.................................................................................................. 597
14-40 L0s Tx State Machine................................................................................................ 603
14-41 L0s Receiver State Machine ..................................................................................... 605
14-42 L1 State Machine ....................................................................................................... 608
14-43 L2 State Machine ....................................................................................................... 611
14-44 Loopback State Machine .......................................................................................... 614
14-45 LTSSM Overview...................................................................................................... 620
14-46 TS1 Contents.............................................................................................................. 621
14-47 TS2 Contents.............................................................................................................. 621
14-48 Recovery Sub-States ................................................................................................. 622
14-49 Speed Change - Initiated.......................................................................................... 623
14-50 Speed Change - Part 2 .............................................................................................. 624
14-51 Speed Change - Part 3 .............................................................................................. 625
14-52 Bandwidth Change Status Bits ............................................................................... 625
14-53 Bandwidth Notification Capability........................................................................ 626
14-54 Bandwidth Change Notification Bits ..................................................................... 626
14-55 Speed Change Finish ................................................................................................ 627
14-56 Link Control 2 Register ............................................................................................ 628
14-57 Link Control Register ............................................................................................... 629
14-58 TS2 Contents.............................................................................................................. 630
14-59 Link Width Change Example.................................................................................. 631
14-60 Link Width Change LTSSM Sequence................................................................... 631
14-61 Simplified Configuration Substates ....................................................................... 632
14-62 Link Width Change - Start....................................................................................... 633
14-63 Link Width Change - Recovery.Idle....................................................................... 634
14-64 Marking Active Lanes .............................................................................................. 635
14-65 Response to Lane Number Changes ...................................................................... 636
14-66 Link Width Change - Finish .................................................................................... 637
14-67 Link Control Register ............................................................................................... 638
14-68 Link Capabilities Register........................................................................................ 639
14-69 Link Capabilities 2 Register..................................................................................... 640
14-70 Link Status Register.................................................................................................. 642
14-71 Link Control Register ............................................................................................... 644
15-1 PCI Error Handling .................................................................................................. 649
15-2 Scope of PCI Express Error Checking and Reporting ......................................... 653
15-3 ECRC Usage Example .............................................................................................. 654
15-4 Location of Error-Related Configuration Registers ............................................. 658
15-5 TLP Digest Bit in a Completion Header ................................................................ 659
15-6 The Error/Poisoned Bit in a Completion Header................................................ 660
15-7 Completion Status Field within the Completion Header ................................... 662
15-8 Device Control Register 2 ........................................................................................ 665
15-9 Error Message Format.............................................................................................. 669
15-10 Device Capabilities Register.................................................................................... 670
15-11 Role-Based Error Reporting Example.................................................................... 672
15-12 Advanced Source ID Register ................................................................................. 672
15-13 Command Register in Configuration Header ...................................................... 675
15-14 Status Register in Configuration Header .............................................................. 676
15-15 PCI Express Capability Structure ........................................................................... 678
15-16 Device Control Register Fields Related to Error Handling ................................ 681
15-17 Device Status Register Bit Fields Related to Error Handling ............................. 682
15-18 Root Control Register............................................................................................... 683
15-19 Link Control Register - Force Link Retraining ..................................................... 684
15-20 Link Training Status in the Link Status Register.................................................. 685
15-21 Advanced Error Capability Structure.................................................................... 686
15-22 The Advanced Error Capability and Control Register........................................ 687
15-23 Advanced Correctable Error Status Register........................................................ 689
15-24 Advanced Correctable Error Mask Register ......................................................... 690
15-25 Advanced Uncorrectable Error Status Register ................................................... 691
15-26 Advanced Uncorrectable Error Severity Register................................................ 694
15-27 Advanced Uncorrectable Error Mask Register..................................................... 694
15-28 Root Error Status Register ....................................................................................... 697
15-29 Advanced Source ID Register ................................................................................. 698
15-30 Advanced Root Error Command Register ............................................................ 698
15-31 Flow Chart of Error Handling Within a Function ............................................... 699
15-32 Error Investigation Example System ..................................................................... 701
16-1 Relationship of OS, Device Drivers, Bus Driver, PCI Express Registers, and ACPI ........ 712
16-2 PCI Power Management Capability Register Set................................................. 713
16-3 Dynamic Power Allocation Registers .................................................................... 715
16-4 DPA Capability Register.......................................................................................... 716
16-5 DPA Status Register ................................................................................................. 716
16-6 PCIe Function D-State Transitions ......................................................................... 722
16-7 PCI Function's PM Registers................................................................................... 724
16-8 PM Registers .............................................................................................................. 732
16-9 Gen1/Gen2 Mode EIOS Pattern ............................................................................. 737
16-10 Gen3 Mode EIOS Pattern......................................................................................... 737
16-11 Gen1/Gen2 Mode EIEOS Symbol Pattern ............................................................ 739
16-12 128b/130b EIEOS Block ........................................................................................... 740
16-13 ASPM Link State Transitions .................................................................................. 742
16-14 ASPM Support .......................................................................................................... 743
16-15 Active State PM Control Field ................................................................................ 744
16-16 Only Upstream Ports Initiate L1 ASPM ................................................................ 747
16-17 Negotiation Sequence Required to Enter L1 Active State PM ........................... 750
16-18 Negotiation Sequence Resulting in Rejection to Enter L1 ASPM State ............ 752
16-19 Switch Behavior When Downstream Component Signals L1 Exit.................... 754
16-20 Switch Behavior When Upstream Component Signals L1 Exit ......................... 755
16-21 Config. Registers for ASPM Exit Latency Management and Reporting........... 757
16-22 Example of Total L1 Latency................................................................................... 759
16-23 Devices Transition to L1 When Software Changes their Power Level from D0 ........ 760
16-24 Procedure Used to Transition a Link from the L0 to L1 State............................ 762
16-25 Link States Transitions Associated with Preparing Devices for Removal of the Reference Clock and Power ........ 764
16-26 Negotiation for Entering L2/L3 Ready State........................................................ 766
16-27 State Transitions from L2/L3 Ready When Power is Removed........................ 767
16-28 PME Message Format............................................................................................... 769
16-29 WAKE# Signal Implementations............................................................................ 774
16-30 Auxiliary Current Enable for Devices Not Supporting PMEs ........................... 775
16-31 Poor System Idle Time ............................................................................................. 777
16-32 Improved System Idle Time .................................................................................... 777
16-33 OBFF Signaling Example ......................................................................................... 778
16-34 WAKE# Pin OBFF Signaling ................................................................................... 779
16-35 OBFF Message Contents .......................................................................................... 781
16-36 OBFF Support Indication......................................................................................... 782
16-37 OBFF Enable Register............................................................................................... 783
16-38 LTR Capability Status .............................................................................................. 785
16-39 LTR Enable................................................................................................................. 785
16-40 LTR Message Format................................................................................................ 788
16-41 LTR Example ............................................................................................................. 789
16-42 LTR - Change but no Update .................................................................................. 790
16-43 LTR - Change with Update ..................................................................................... 791
16-44 LTR - Link Down Case............................................................................................. 791
17-1 PCI Interrupt Delivery ............................................................................................. 795
17-2 Interrupt Delivery Options in PCIe System.......................................................... 796
17-3 Legacy Interrupt Example ....................................................................................... 797
17-4 APIC Model for Interrupt Delivery ....................................................................... 799
17-5 Interrupt Registers in PCI Configuration Header................................................ 801
17-6 INTx Signal Routing is Platform Specific.............................................................. 803
17-7 Configuration Command Register - Interrupt Disable Field........................... 804
17-8 Configuration Status Register - Interrupt Status Field ..................................... 805
17-9 Example of INTx Messages to Virtualize INTA#-INTD# Signal Transitions ........ 806
17-10 INTx Message Format and Type ............................................................................ 807
17-11 Example of INTx Mapping...................................................................................... 810
17-12 Switch Uses Bridge Mapping of INTx Messages ................................................. 811
17-13 MSI Capability Structure Variations...................................................................... 813
17-14 Message Control Register ........................................................................................ 814
17-15 Device MSI Configuration Process......................................................................... 819
17-16 Format of Memory Write Transaction for Native-Device MSI Delivery.......... 821
17-17 MSI-X Capability Structure ..................................................................................... 822
17-18 Location of MSI-X Table .......................................................................................... 824
17-19 MSI-X Table Entries.................................................................................................. 825
17-20 Pending Bit Array ..................................................................................................... 826
17-21 Memory Synchronization Problem ........................................................................ 827
17-22 MSI Delivery.............................................................................................................. 829
17-23 PCI Express System with PCI-Based IO Controller Hub.................................... 831
18-1 PERST# Generation .................................................................................................. 836
18-2 TS1 Ordered-Set Showing the Hot Reset Bit......................................................... 837
18-3 Switch Generates Hot Reset on One Downstream Port ...................................... 838
18-4 Switch Generates Hot Reset on All Downstream Ports ...................................... 839
18-5 Secondary Bus Reset Register to Generate Hot Reset ......................................... 840
18-6 Link Control Register ............................................................................................... 841
18-7 TS1 Ordered-Set Showing Disable Link Bit .......................................................... 842
18-8 Function-Level Reset Capability............................................................................. 843
18-9 Function-Level Reset Initiate Bit............................................................................. 843
19-1 PCI Hot Plug Elements ............................................................................................ 850
19-2 PCI Express Hot-Plug Elements ............................................................................. 851
19-3 Hot Plug Control Functions within a Switch........................................................ 864
19-4 PCIe Capability Registers Used for Hot-Plug ...................................................... 865
19-5 Slot Capabilities Register ......................................................................................... 866
19-6 Slot Control Register ................................................................................................ 868
19-7 Slot Status Register ................................................................................................... 870
19-8 Device Capabilities Register.................................................................................... 873
19-9 Power Budget Registers ........................................................................................... 878
19-10 Elements Involved in Power Budget ..................................................................... 880
19-11 Slot Power Limit Sequence...................................................................................... 882
19-12 Power Budget Capability Registers ....................................................................... 884
19-13 Power Budget Data Field Format and Definition ................................................ 885
20-1 Multicast System Example ...................................................................................... 888
20-2 Multicast Capability Registers ................................................................................ 889
20-3 Multicast Capability Register.................................................................................. 890
20-4 Multicast Control Register....................................................................................... 890
20-5 Multicast Base Address Register ............................................................................ 891
20-6 Position of Multicast Group Number .................................................................... 892
20-7 Multicast Address Example .................................................................................... 894
20-8 Multicast Overlay BAR ............................................................................................ 895
20-9 Overlay Example ...................................................................................................... 896
20-10 Device Capabilities 2 Register................................................................................. 899
20-11 TPH Example............................................................................................................. 901
20-12 TPH Example with System Cache.......................................................................... 902
20-13 TPH Usage for TLPs to Endpoint ........................................................................... 903
20-14 TPH Usage Between Endpoints.............................................................................. 904
20-15 TPH Header Bits ....................................................................................................... 905
20-16 TPH Requester Capability Structure...................................................................... 906
20-17 TPH Capability and Control Registers .................................................................. 907
20-18 TPH Capability ST Table ......................................................................................... 908
20-19 TPH Prefix Indication............................................................................................... 909
20-20 Resizable BAR Registers .......................................................................................... 912
20-21 Resizable BAR Capability Register ........................................................................ 912
20-22 Resizable BAR Control Register ............................................................................. 913
20-23 BARs in a Type0 Configuration Header................................................................ 914
1 LeCroy Oscilloscope with ProtoSync Software Option ...................................... 920
2 LeCroy PCI Express Slot Interposer x16................................................................ 922
3 LeCroy XMC, AMC, and Mini Card Interposers ................................................. 923
4 LeCroy PCI Express Gen3 Mid-Bus Probe............................................................ 923
5 LeCroy PCI Express Gen2 Flying Lead Probe ...................................................... 924
6 TLP Packet with ECRC Error .................................................................................. 925
7 Link Level Groups TLP Packets with their Link Layer Response................. 925
8 Split Level Groups Completions with Associated Non-Posted Request...... 926
9 Compact View Collapses Related Packets for Easy Viewing of Link Training ........ 927
10 LTSSM Graph Shows Link State Transitions Across the Trace ......................... 928
11 Flow Control Credit Tracking................................................................................. 929
12 BitTracer View of Gen2 Traffic ............................................................................... 930
13 LeCroy Gen3 PETrainer Exerciser Card ................................................................ 932
14 LeCroy Gen2 Protocol Test Card (PTC) ................................................................ 933
1 MR-IOV Switch Usage ............................................................................................ 938
2 MR-IOV Switch Internal Architecture .................................................................. 939
3 PCIe in a Data Center for HPC Applications....................................................... 940
4 PCIe Switch Application in an SSD Add-In Card............................................... 941
5 Server Motherboard Use PCIe Switches............................................................... 941
6 Server Failover in 1 + N Failover Scheme ............................................................ 942
1 Enumeration Using Transparent Bridges.............................................................. 947
2 Direct Address Translation ..................................................................................... 949
3 Look Up Table Translation Creates Multiple Windows ..................................... 950
4 Intelligent Adapters in PCI and PCI Express Systems ........................................ 951
5 Host Failover in PCI and PCI Express Systems.................................................... 953
6 Dual Host in a PCI and PCI Express System ........................................................ 955
7 Dual-Star Fabric ........................................................................................................ 957
8 Direct Address Translation ..................................................................................... 959
9 Lookup Table Based Translation ............................................................................ 960
10 Use of Limit Register ................................................................................................ 961
1 Lock Sequence Begins with Memory Read Lock Request .................................. 967
2 Lock Completes with Memory Write Followed by Unlock Message ............... 969

Tables

1 PC Architecture Book Series ....................................................................................... 1
1-1 Comparison of Bus Frequency, Bandwidth and Number of Slots ..................... 11
2-1 PCIe Aggregate Gen1, Gen2 and Gen3 Bandwidth for Various Link Widths... 43
2-2 PCI Express Request Types ....................................................................................... 59
2-3 PCI Express TLP Types.............................................................................................. 61
3-1 Enhanced Configuration Mechanism Memory-Mapped Address Range.......... 98
4-1 Results of Reading the BAR after Writing All 1s To It ........................................ 129
4-2 Results Of Reading the BAR Pair after Writing All 1s To Both ......................... 132
4-3 Results Of Reading the IO BAR after Writing All 1s To It.................................. 134
4-4 Example Prefetchable Memory Base/Limit Register Meanings........................ 139
4-5 Example Non-Prefetchable Memory Base/Limit Register Meanings............... 141
4-6 Example IO Base/Limit Register Meanings ......................................................... 143
4-7 PCI Express TLP Types And Routing Methods ................................................... 147
4-8 Posted and Non-Posted Transactions .................................................................... 150
4-9 TLP Header Format and Type Field Encodings................................................... 153
4-10 Message Request Header Type Field Usage......................................................... 165
5-1 TLP Header Type Field Defines Transaction Variant ......................................... 174
5-2 Generic Header Field Summary ............................................................................. 176
5-3 TLP Header Type and Format Field Encodings................................................... 179
5-4 IO Request Header Fields........................................................................................ 186
5-5 4DW Memory Request Header Fields ................................................................... 189
5-6 Configuration Request Header Fields ................................................................... 194
5-7 Completion Header Fields ...................................................................................... 197
5-8 Message Request Header Fields ............................................................................. 204
5-9 INTx Interrupt Signaling Message Coding........................................................... 207
5-10 Power Management Message Coding ................................................................... 208
5-11 Error Message Coding ............................................................................................. 209
5-12 Unlock Message Coding .......................................................................................... 209
5-13 Slot Power Limit Message Coding ......................................................................... 210
5-14 Vendor-Defined Message Coding .......................................................................... 211
5-15 Hot Plug Message Coding....................................................................................... 212
5-16 LTR Message Coding ............................................................................................... 213
5-17 LTR Message Coding ............................................................................................... 213
6-1 Required Minimum Flow Control Advertisements ............................................ 219
6-2 Maximum Flow Control Advertisements ............................................................. 220
6-3 Gen1 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)......... 241
6-4 Gen2 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)......... 241
6-5 Gen3 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)......... 242
8-1 Simplified Ordering Rules Table ............................................................................ 289
8-2 Transactions That Can Be Reordered Due to Relaxed Ordering ....................... 299
9-1 DLLP Types ............................................................................................................... 311
9-2 Ack/Nak DLLP Fields ............................................................................................. 313
9-3 Power Management DLLP Fields........................................................................... 314
9-4 Flow Control DLLP Fields....................................................................................... 315
10-1 Ack or Nak DLLP Fields.......................................................................................... 329
10-2 Gen1 Unadjusted Ack Transmission Latency ...................................................... 345
10-3 Gen1 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)......... 351
10-4 Gen2 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)......... 352
10-5 Gen3 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)......... 352
10-6 Gen1 Unadjusted REPLAY_TIMER Values in Symbol Times ........................... 353
10-7 Gen2 Unadjusted REPLAY_TIMER Values in Symbol Times ........................... 354
10-8 Gen3 Unadjusted REPLAY_TIMER Values.......................................................... 354
11-1 Control Character Encoding and Definition......................................................... 386
11-2 Allowable Transmitter Signal Skew....................................................................... 391
11-3 Allowable Receiver Signal Skew ............................................................................ 399
12-1 PCI Express Aggregate Bandwidth for Various Link Widths ........................... 408
12-2 Gen3 16-bit Skip Ordered Set Encoding................................................................ 428
12-3 Gen3 Scrambler Seed Values................................................................................... 432
12-4 Gen3 Tap Equations for Single-LFSR Scrambler.................................................. 433
12-5 Signal Skew Parameters........................................................................................... 443
13-1 Tx Preset Encodings with Coefficients and Voltage Ratios................................ 478
13-2 Tx Coefficient Table.................................................................................................. 480
13-3 Transmitter Specs...................................................................................................... 489
13-4 Parameters Specific to 8.0 GT/s.............................................................................. 491
13-5 Common Receiver Characteristics ......................................................................... 498
14-1 Summary of TS1 Ordered Set Contents................................................................. 514
14-2 Summary of TS2 Ordered Set Contents................................................................. 516
14-3 Symbol Sequence 8b/10b Compliance Pattern .................................................... 529
14-4 Second Block of 128b/130b Compliance Pattern ................................................. 530
14-5 Third Block of 128b/130b Compliance Pattern .................................................... 531
14-6 Symbol Sequence of 8b/10b Modified Compliance Pattern .............................. 532
14-7 Sequence of Compliance Tx Settings ..................................................................... 535
14-8 Tx Preset Encodings ................................................................................................. 579
14-9 Rx Preset Hint Encodings ........................................................................................ 580
14-10 Conditions for Inferring Electrical Idle.................................................................. 596
15-1 Completion Code and Description ........................................................................ 663
15-2 Error Message Codes and Description .................................................................. 669
15-3 Error-Related Fields in Command Register.......................................................... 675
15-4 Error-Related Fields in Status Register.................................................................. 677
15-5 Default Classification of Errors............................................................................... 679
15-6 Errors That Can Use Header Log Registers .......................................................... 695
16-1 Major Software/Hardware Elements Involved In PC PM ................................. 706
16-2 System PM States as Defined by the OnNow Design Initiative ........................ 708
16-3 OnNow Definition of Device-Level PM States..................................................... 709
16-4 Default Device Class PM States .............................................................................. 710
16-5 D0 Power Management Policies ............................................................................. 714
16-6 D1 Power Management Policies ............................................................................. 717
16-7 D2 Power Management Policies ............................................................................. 719
16-8 D3hot Power Management Policies ....................................................................... 721
16-9 D3cold Power Management Policies ..................................................................... 722
16-10 Description of Function State Transitions ............................................................. 723
16-11 Function State Transition Delays ........................................................................... 724
16-12 The PMC Register Bit Assignments ....................................................................... 725
16-13 PM Control/Status Register (PMCSR) Bit Assignments .................................... 728
16-14 Data Register Interpretation.................................................................................... 733
16-15 Relationship Between Device and Link Power States ......................................... 734
16-16 Link Power State Characteristics ............................................................................ 735
16-17 Electrical Idle Inference Conditions ....................................................................... 741
16-18 Active State Power Management Control Field Definition................................ 743
17-1 INTx Message Mapping Across Virtual PCI-to-PCI Bridges ............................. 809
17-2 Format and Usage of Message Control Register .................................................. 814
17-3 Format and Usage of MSI-X Message Control Register...................................... 823
19-1 Introduction to Major Hot-Plug Software Elements............................................ 852
19-2 Major Hot-Plug Hardware Elements ..................................................................... 853
19-3 Behavior and Meaning of the Slot Attention Indicator ....................................... 860
19-4 Behavior and Meaning of the Power Indicator .................................................... 861
19-5 Slot Capability Register Fields and Descriptions................................................. 866
19-6 Slot Control Register Fields and Descriptions...................................................... 869
19-7 Slot Status Register Fields and Descriptions......................................................... 871
19-8 The Primitives ........................................................................................................... 875
19-9 Maximum Power Consumption for System Board Expansion Slots ................ 881
20-1 PH Encoding Table................................................................................................... 905
20-2 ST Table Location Encoding.................................................................................... 907


The MindShare Technology Series


The MindShare Technology series includes the books listed in Table 1.

Table 1: PC Architecture Book Series

Category                    Title                                           Edition       ISBN

Processor Architectures     x86 Instruction Set Architecture                1st           9780977087853
                            The Unabridged Pentium 4                        1st           032124656X
                            Pentium Pro and Pentium II System Architecture  2nd           0201309734
                            Pentium Processor System Architecture           2nd           0201409925
                            Protected Mode Software Architecture            1st           020155447X
                            80486 System Architecture                       3rd           0201409941
                            PowerPC 601 System Architecture                 1st           0201409909

Interconnect Architectures  PCI Express Technology 1.x, 2.x, 3.0            1st           9780977087860
                            Universal Serial Bus System Architecture 3.0    1st           9780983646518
                            HyperTransport 3.1 Interconnect Technology      1st           9780977087822
                            PCI Express System Architecture                 2nd           0321156307
                            Universal Serial Bus System Architecture 2.0    2nd           0201461374
                            HyperTransport System Architecture              1st           0321168453
                            PCI-X System Architecture                       1st           0201726823
                            PCI System Architecture                         4th           0201309742
                            Firewire System Architecture: IEEE 1394a        2nd           0201485354
                            EISA System Architecture                        Out of print  020140995X
                            ISA System Architecture                         3rd           0201409968

Network Architecture        InfiniBand Network Architecture                 1st           0321117654

Other Architectures         PCMCIA System Architecture: 16-Bit PC Cards     2nd           0201409917
                            CardBus System Architecture                     1st           0201409976
                            Plug and Play System Architecture               1st           0201410133
                            AGP System Architecture                         1st           0201379643

Storage Technologies        SAS Storage Architecture                        1st           9780977087808
                            SATA Storage Technology                         1st           9780977087815

Cautionary Note
Please keep in mind that MindShare's books often describe rapidly changing
technologies, and that's true for PCI Express as well. This book is a snapshot
of the state of the technology at the time the book was completed. We make
every effort to produce the books on a timely basis, but the next revision of
the spec doesn't always arrive in time to be included in a book. This PCI
Express book comprehends revision 3.0 of the PCI Express Base Specification
released and trademarked by the PCI-SIG (PCI Special Interest Group).

Intended Audience
The intended audience for this book is hardware and software design,
verification, and other support personnel. The tutorial approach taken may
also make it useful to technical people who aren't directly involved.

Prerequisite Knowledge
To get the full benefit of this material, it's recommended that the reader have
a reasonable background in PC architecture, including knowledge of an I/O bus
and its related protocol. Because PCI Express maintains several levels of
compatibility with the original PCI design, critical background information
regarding PCI has been incorporated into this book. However, the reader may
well find it beneficial to read the MindShare book PCI System Architecture.


Book Topics and Organization


Topics covered in this book and chapter flow are as follows:

Part 1: Big Picture. Provides an architectural perspective of the PCI Express
technology by comparing and contrasting it with PCI and PCI-X buses. It also
introduces features of the PCI Express architecture. An overview of
configuration space concepts plus methods of packet routing are described.

Part 2: Transaction Layer. Includes high-level packet (TLP) format and field
definitions, along with Transaction Layer functions and responsibilities such
as Quality of Service, Flow Control and Transaction Ordering.

Part 3: Data Link Layer. Includes description of ACK/NAK error detection and
correction mechanism of the Data Link Layer. DLLP format is also described.

Part 4: Physical Layer. Describes Lane management functions, as well as link
training and initialization, reset, electrical signaling, and logical Physical
Layer responsibilities associated with Gen1, Gen2 and Gen3 PCI Express.

Part 5: Additional System Topics. Discusses additional system topics of PCI
Express, including error detection and handling, power management, interrupt
handling, Hot Plug and Power Budgeting details. Additional changes made in
the PCI Express 2.1 spec not described in earlier chapters are covered here.

Part 6: Appendices.
    Debugging PCI Express Traffic using LeCroy Tools
    Markets & Applications of PCI Express Architecture
    Implementing Intelligent Adapters and Multi-Host Systems with PCI
        Express Technology
    Legacy Support for Locking
    Glossary

Documentation Conventions
This section defines the typographical conventions used throughout this book.

PCI Express
PCI Express is a trademark of the PCI-SIG, commonly abbreviated as "PCIe".


Hexadecimal Notation
All hex numbers are followed by a lowercase "h". For example:
89F2BD02h
0111h

Binary Notation
All binary numbers are followed by a lowercase "b". For example:
1000100111110010b
01b

Decimal Notation
Numbers without any suffix are decimal. When required for clarity, decimal
numbers are followed by a lowercase "d". Examples:
9
15
512d
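
The three suffix conventions above can be applied mechanically. The following is a minimal Python sketch, not part of the original text, that converts a literal written in this book's notation into an integer. The function name `parse_number` is hypothetical and chosen only for illustration.

```python
def parse_number(text: str) -> int:
    """Convert a literal in the book's notation to an int.

    Assumed rules, taken from the conventions above:
      trailing 'h' -> hexadecimal, trailing 'b' -> binary,
      trailing 'd' or no suffix -> decimal.
    """
    text = text.strip()
    if not text:
        raise ValueError("empty literal")
    suffix = text[-1]
    if suffix == "h":
        return int(text[:-1], 16)
    if suffix == "b":
        return int(text[:-1], 2)
    if suffix == "d":
        return int(text[:-1], 10)
    return int(text, 10)


# The examples used in the text.
for literal in ("89F2BD02h", "0111h", "1000100111110010b", "01b", "9", "512d"):
    print(literal, "=", parse_number(literal))
```

Because the suffix is always lowercase and the hex digits in the examples are uppercase, a hex value that happens to end in the digits B or D is still recognized by its trailing "h".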

Bits, Bytes and Transfers Notation


This book represents bits with a lowercase "b" and bytes with an uppercase
"B". For example:

Megabits/second = Mb/s

Megabytes/second = MB/s

Megatransfers/second = MT/s

Bit Fields
Groups of bits are represented with the high-order bits first followed by the
low-order bits and enclosed by brackets. For example:

[7:0] = bits 0 through 7
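
As an illustration of the [high:low] convention, here is a small Python sketch (an addition for clarity, not from the original text) that extracts a bit field from a value. The helper name `bit_field` is hypothetical.

```python
def bit_field(value: int, high: int, low: int) -> int:
    """Return bits [high:low] of value, high-order bit first as in the text."""
    if high < low or low < 0:
        raise ValueError("expected high >= low >= 0")
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


value = 0x89F2BD02                      # 89F2BD02h in the book's notation
print(hex(bit_field(value, 7, 0)))      # bits [7:0]
print(hex(bit_field(value, 15, 8)))     # bits [15:8]
print(hex(bit_field(value, 31, 28)))    # bits [31:28]
```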


Active Signal States


Signals that are active low are followed by #, as in PERST# and WAKE#. Active
high signals have no suffix, such as POWERGOOD.

Visit Our Web Site


Our web site, www.mindshare.com, lists all of our current courses and the
delivery options available for each course:

    eLearning modules
    Live web-delivered classes
    Live on-site classes

In addition, other items are available on our site:

    Free short courses on selected topics
    Technical papers
    Errata for our books

Our books can be ordered in hard copy or eBook versions.

We Want Your Feedback


MindShare values your comments and suggestions. Contact us at:

www.mindshare.com

Phone: US 1-800-633-1440, International 1-575-373-0336

General information: training@mindshare.com

Corporate Mailing Address:

MindShare, Inc.
481 Highway 105
Suite B, #246
Monument, CO 80132
USA


Part One:

The Big Picture

1 Background
This Chapter
This chapter reviews the PCI (Peripheral Component Interconnect) bus models
that preceded PCI Express (PCIe) as a way of building a foundation for
understanding PCI Express architecture. PCI and PCI-X (PCI-eXtended) are
introduced and their basic features and characteristics are described, followed
by a discussion of the motivation for migrating from those earlier parallel bus
models to the serial bus model used by PCIe.

The Next Chapter


The next chapter provides an introduction to the PCI Express architecture and
is intended to serve as an "executive level" overview, covering all the basics
of the architecture at a high level. It introduces the layered approach to PCIe
port design given in the spec and describes the responsibilities of each layer.

Introduction
Establishing a solid foundation in the technologies on which PCIe is built is a
helpful first step to understanding it, and an overview of those architectures
is presented here. Readers already familiar with PCI may prefer to skip to the
next chapter. This background is only intended as a brief overview. For more
depth and detail on PCI and PCI-X, please refer to MindShare's books: PCI
System Architecture, and PCI-X System Architecture.

As an example of how this background can be helpful, the software used for
PCIe remains much the same as it was for PCI. Maintaining this backward
compatibility encourages migration from the older designs to the new by making
the software changes as simple and inexpensive as possible. As a result, older
PCI software works unchanged in a PCIe system and new software will continue
to use the same models of operation. For this reason and others, understanding
PCI and its models of operation will facilitate an understanding of PCIe.


PCI and PCI-X


The PCI (Peripheral Component Interconnect) bus was developed in the early
1990s to address the shortcomings of the peripheral buses that were used in
PCs (personal computers) at the time. The standard at the time was IBM's AT
(Advanced Technology) bus, referred to by other vendors as the ISA (Industry
Standard Architecture) bus. ISA had been sufficient for the 286 16-bit machines
for which it was designed, but additional bandwidth and improved capabilities,
such as plug-and-play, were needed for the newer 32-bit machines and their
peripherals. Besides that, ISA used big connectors that had a high pin count.
PC vendors recognized the need for a change and several alternate bus designs
were proposed, such as IBM's MCA (Micro-Channel Architecture), the EISA bus
(Extended ISA, proposed as an open standard by IBM competitors), and the
VESA bus (Video Electronics Standards Association, proposed by video card
vendors for video devices). However, all of these designs had drawbacks that
prevented wide acceptance. Eventually, PCI was developed as an open standard
by a consortium of major players in the PC market who formed a group called
the PCI-SIG (PCI Special Interest Group). The performance of the newly
developed bus architecture was much higher than ISA, and it also defined a new
set of registers within each device referred to as configuration space. These
registers allowed software to see the memory and IO resources a device needed
and assign each device addresses that wouldn't conflict with other addresses in
the system. These features (open design, high speed, and software visibility
and control) helped PCI overcome the obstacles that had limited ISA and other
buses, and PCI quickly became the standard peripheral bus in PCs.

A few years later, PCI-X (PCI-eXtended) was developed as a logical extension of
the PCI architecture and improved the performance of the bus quite a bit. We'll
discuss the changes a little later, but a major design goal for PCI-X was
maintaining compatibility with PCI devices, both in hardware and software, to
make migration from PCI as simple as possible. Later, the PCI-X 2.0 revision
added even higher speeds, achieving a raw data rate of up to 4 GB/s. Since
PCI-X maintained hardware backward compatibility with PCI, it remained a
parallel bus and inherited the problems associated with that model. That's
interesting for us because parallel buses eventually reach a practical ceiling
on effective bandwidth and can't readily be made to go faster. Going to a
higher data rate with PCI-X was explored by the PCI-SIG, but the effort was
eventually abandoned. That speed ceiling, along with a high pin count,
motivated the transition away from the parallel bus model to the new serial bus
model.

These earlier bus definitions are listed in Table 1-1 on page 11, which shows
the development over time of higher frequencies and bandwidths. One of the
interesting things to note in this table is the correlation of clock frequency
and the number of add-in card slots on the bus. This was due to PCI's low-power
signaling model, which meant that higher frequencies required shorter traces
and fewer loads on the bus (see "Reflected-Wave Signaling" on page 16). Another
point of interest is that, as the clock frequency increases, the number of
devices permitted on the shared bus decreases. When PCI-X 2.0 was introduced,
its high speed mandated that the bus become a point-to-point interconnect.

Table 1-1: Comparison of Bus Frequency, Bandwidth and Number of Slots

Bus Type          Clock Frequency   Peak Bandwidth            Number of Card
                                    (32-bit - 64-bit bus)     Slots per Bus

PCI               33 MHz            133 - 266 MB/s            4-5

PCI               66 MHz            266 - 533 MB/s            1-2

PCI-X 1.0         66 MHz            266 - 533 MB/s            4

PCI-X 1.0         133 MHz           533 - 1066 MB/s           1-2

PCI-X 2.0 (DDR)   133 MHz           1066 - 2132 MB/s          1 (point-to-point bus)

PCI-X 2.0 (QDR)   133 MHz           2132 - 4262 MB/s          1 (point-to-point bus)
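
The peak bandwidth figures in Table 1-1 follow from the clock rate, the bus width, and the number of transfers per clock. The Python sketch below is an illustration added here rather than part of the original text. It assumes nominal clock rates of 33.33, 66.66 and 133.33 MHz and treats DDR and QDR as 2 and 4 transfers per clock, so its results agree with the table only to within rounding.

```python
def peak_bandwidth_mb_s(clock_mhz: float, bus_width_bits: int,
                        transfers_per_clock: int = 1) -> float:
    """Peak bandwidth in MB/s = clock (MHz) x transfers per clock x bytes per transfer."""
    return clock_mhz * transfers_per_clock * bus_width_bits / 8


# (label, nominal clock in MHz, transfers per clock); the nominal clocks are assumptions.
rows = [
    ("PCI 33 MHz", 33.33, 1),
    ("PCI 66 MHz", 66.66, 1),
    ("PCI-X 1.0 133 MHz", 133.33, 1),
    ("PCI-X 2.0 (DDR)", 133.33, 2),
    ("PCI-X 2.0 (QDR)", 133.33, 4),
]
for label, clock, per_clock in rows:
    narrow = peak_bandwidth_mb_s(clock, 32, per_clock)
    wide = peak_bandwidth_mb_s(clock, 64, per_clock)
    print(f"{label}: {narrow:.0f} - {wide:.0f} MB/s")
```

The computed values for the DDR and QDR rows come out a few MB/s above the table entries, which is a difference in rounding rather than in the underlying relationship.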

PCI Basics

Basics of a PCI-Based System


Figure 1-1 on page 12 shows an older system based on a PCI bus. The system
includes a North Bridge (called "north" because if the diagram is viewed as a
map, it appears geographically north of the central PCI bus) that interfaces
between the processor and the PCI bus. Associated with the North Bridge is the
processor bus, system memory bus, AGP graphics bus, and PCI. Several devices
share the PCI bus and are either connected directly to the bus or plugged into
an add-in card connector. A South Bridge connects PCI to system peripherals,
such as the ISA bus where legacy peripherals were carried forward for a few
years. The South Bridge was typically also the central resource for PCI that
provided system signals like reset, reference clock, and error reporting.


Figure 1-1: Legacy PCI Bus-Based Platform

[Block diagram: Processor, FSB, North Bridge (Intel 440) with Graphics and
SDRAM, 33 MHz PCI bus with Slots, Ethernet, SCSI and Error Logic, South Bridge
with IDE (CD, HDD) and USB, and the ISA bus with Boot ROM, Modem Chip, Audio
Chip and Super I/O (COM1, COM2).]

PCI Bus Initiator and Target


In a PCI hierarchy each device on the bus may contain up to eight functions
that all share the bus interface for that device, numbered 0-7 (a
single-function device is always assigned function number 0). Every function is
capable of acting as a target for transactions on the bus, and most will also
be able to initiate transactions. Such an initiator (called a Bus Master) has a
pair of pins (REQ# and GNT#) dedicated to arbitrating for use of the shared PCI
bus. As shown in Figure 1-2 on page 13, a Request (REQ#) pin indicates that the
master needs to use the bus and is sent to the bus arbiter for evaluation
against all the other requests at that moment. The arbiter is often located in
the bridge that is hierarchically above the bus and receives arbitration
requests from all the devices that can act as initiators (Bus Masters) on that
bus. The arbiter decides which requester should be the next owner of the bus
and asserts the Grant (GNT#) pin for that device. According to the protocol,
whenever the previous transaction finishes and the bus goes idle, whichever
device sees its GNT# asserted at that time is designated as the next Bus Master
and can begin its transaction.


Figure 1-2: PCI Bus Arbitration

[Block diagram of the legacy PCI platform (Processor, North Bridge, 33 MHz PCI
bus, South Bridge, ISA bus), with the Arbiter shown in the North Bridge and a
REQ#/GNT# pair shown for a device on the PCI bus.]

Typical PCI Bus Cycle


Figure 1-3 on page 15 represents a typical PCI bus cycle. PCI is synchronous,
meaning events happen on clock edges, so the clock is shown at the top of the
diagram and its rising edges are marked with dotted lines because those are the
times when signals are driven out or sampled. A brief description of what
happens on the bus is as follows:
1. On clock edge 1, FRAME# (used to indicate when a bus access is in
   progress) and IRDY# (Initiator Ready for data) are both inactive, showing
   that the bus is idle. At the same time, GNT# is active, meaning the bus
   arbiter has selected this device to be the next initiator.
2. On clock edge 2, FRAME# is asserted by the initiator, indicating that a new
   transaction has started. At the same time, it drives the address and
   command for this transaction. All of the other devices on the bus will
   latch this information and begin the process of decoding the address to see
   whether it's a match for them.
3. On clock edge 3, the initiator indicates its readiness for data transfer by
   asserting IRDY#. The round arrow symbol shown on the AD bus indicates
   that the tri-stated bus is undergoing a "turnaround cycle" as ownership of
   the signals changes (needed here because this is a read transaction; the
   initiator drives the address but receives data on the same pins). The
   target's buffer is not turned on using the same clock edge that turns the
   initiator's buffer off because we want to avoid the possibility of both
   buffers trying to drive a signal simultaneously, even for a brief time.
   That contention on the bus could damage the devices so, instead, the
   previous buffer is turned off one clock before the new one is turned on.
   Every shared signal is handled this way before changing direction.
4. On clock edge 4, a device on the bus has recognized the requested address
   and responded by asserting DEVSEL# (device select) to claim this
   transaction and participate in it. At the same time, it asserts TRDY#
   (target ready) to show that it is delivering the first part of the read
   data and drives that data onto the AD bus (this could have been delayed;
   the target is allowed 16 clocks from the assertion of FRAME# until TRDY#).
   Since both IRDY# and TRDY# are active at the same time here, data will be
   transferred on that clock edge, completing the first data phase. The
   initiator knows how many bytes will eventually be transferred, but the
   target does not. The command does not provide a byte count, so the target
   must look at the status of FRAME# whenever a data phase completes to learn
   when the initiator is satisfied with the amount of data transferred. If
   FRAME# is still asserted, this was not the last data phase and the
   transaction will continue with the next contiguous set of bytes, as is the
   case here.
5. On clock edge 5, the target is not prepared to deliver the next set of
   data, so it deasserts TRDY#. This is called inserting a "Wait State" and
   the transaction is delayed for a clock. Both initiator and target are
   allowed to do this, and each can delay the next data transfer by up to 8
   consecutive clocks.
6. On clock edge 6, the second data item is transferred, and since FRAME# is
   still asserted, the target knows that the initiator still wants more data.
7. On clock edge 7, the initiator forces a Wait State. Wait States allow
   devices to pause a transaction to quickly fill or empty a buffer and can be
   helpful because they allow the transaction to resume without having to stop
   and restart. On the other hand, they are often very inefficient because
   they not only stall the current transaction, they also prevent other
   devices from gaining access to the bus while it's stalled.
8. On clock edge 8, the third data set is transferred and now FRAME# has
   been deasserted so the target can tell that this was the last data item.
   Consequently, after this clock, all the control lines are turned off and
   the bus once again goes to the idle state.
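
The eight steps above reduce to two rules: data moves on any clock edge where IRDY# and TRDY# are both asserted, and a data phase is the last one when FRAME# is no longer asserted at that edge. The following Python sketch is an illustration added here, not part of the original text. It encodes the signal states described in steps 1 through 8 and applies those two rules. The names `Edge`, `READ_CYCLE` and `data_phases` are hypothetical, and the TRDY# state on edge 7 is an assumption because the text only says that the initiator inserts the Wait State there.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    clock: int
    frame: bool  # True means FRAME# is asserted (the signal is active low)
    irdy: bool   # True means IRDY# is asserted
    trdy: bool   # True means TRDY# is asserted


# Signal states per clock edge, following steps 1-8 above.
READ_CYCLE = [
    Edge(1, frame=False, irdy=False, trdy=False),  # bus idle
    Edge(2, frame=True, irdy=False, trdy=False),   # address phase
    Edge(3, frame=True, irdy=True, trdy=False),    # turnaround cycle
    Edge(4, frame=True, irdy=True, trdy=True),     # data phase 1
    Edge(5, frame=True, irdy=True, trdy=False),    # target Wait State
    Edge(6, frame=True, irdy=True, trdy=True),     # data phase 2
    Edge(7, frame=True, irdy=False, trdy=True),    # initiator Wait State (TRDY# assumed)
    Edge(8, frame=False, irdy=True, trdy=True),    # data phase 3, the last one
]


def data_phases(edges):
    """Return (clock, is_last) for each edge on which data is transferred."""
    phases = []
    for edge in edges:
        if edge.irdy and edge.trdy:
            phases.append((edge.clock, not edge.frame))
    return phases


print(data_phases(READ_CYCLE))
```

Running the sketch reports transfers on edges 4, 6 and 8, with only the transfer on edge 8 flagged as the last, matching the walk-through.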


In keeping with the low-cost design goal for PCI, several signals have more
than one meaning on the bus to reduce the pin count. The 32 address and data
signals are multiplexed and the C/BE# (Command/Byte Enable) signals share their
four pins for the same reason. Although reducing the pin count is desirable,
it's also the reason that PCI uses "turnaround cycles", which add more delay.
It also precludes the option to pipeline transactions (sending the address for
the next cycle while data for the previous one is delivered). Handshake signals
like FRAME#, DEVSEL#, TRDY#, IRDY#, and STOP# control the timing of events
during the transaction.

Figure 1-3: Simple PCI Bus Transfer

[Timing diagram over clock edges 1-8 showing CLK, FRAME#, AD (address, then
data 1, 2 and 3), C/BE# (bus command, then byte enables), IRDY#, TRDY#,
DEVSEL# and GNT#, with the address phase, data phases 1-3 and the wait states
marked.]


Reflected-Wave Signaling
PCI architecturally supports up to 32 devices on each bus, but the practical
electrical limit is considerably less, on the order of 10 to 12 electrical
loads at the base frequency of 33 MHz. The reason for this is that the bus uses
a technique called "reflected-wave signaling" to reduce the power consumption
on the bus (see Figure 1-4 on page 17). In this model, devices save cost and
power by implementing weak transmit buffers that can only drive the signal to
about half the voltage needed to switch the signal. The incident wave of the
signal propagates down the transmission line until it reaches the end. By
design, there is no termination at the end of the line so the wavefront
encounters an infinite impedance and reflects back. This reflection is additive
in nature and increases the signal to the full voltage level as it makes its
way back to the transmitter. When the signal reaches the originating buffer,
the low output impedance of the driver terminates the signal and prevents
further reflections. The total elapsed time from the buffer asserting a signal
until the receiver detects a valid signal is thus the propagation time down the
wire plus the reflection delay coming back and the setup time. All of that must
be less than the clock period.

As the length of the trace and the number of electrical loads on a bus
increase, the time required for the signal to make this round trip increases. A
33 MHz PCI bus can only meet the signal timing with about 10-12 electrical
loads. An electrical load is one device installed on the system board, but a
populated connector slot actually counts as two loads. Therefore, as indicated
in Table 1-1 on page 11, a 33 MHz PCI bus can only be designed for reliable
operation with a maximum of 4 or 5 add-in card connectors.


Figure 1-4: PCI Reflected-Wave Signaling

[Timing diagram of one PCI CLK cycle, 30 ns at 33 MHz, between points A and B:
Tval 11 ns max, Tprop 10 ns max, Tsu 7 ns min.]
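
The timing constraint and the load-counting rule described above can be checked with simple arithmetic. The Python sketch below is an illustration added here, not part of the original text. It uses the values shown in Figure 1-4 (30 ns cycle, Tval 11 ns, Tprop 10 ns, Tsu 7 ns) and the statement that a populated connector counts as two loads. The function names and the choice of two soldered-down devices in the usage lines are hypothetical.

```python
CLOCK_PERIOD_NS = 30.0  # one PCI clock at 33 MHz, from Figure 1-4
T_VAL_MAX_NS = 11.0     # clock-to-output valid delay
T_PROP_MAX_NS = 10.0    # propagation down the trace plus the reflection back
T_SU_MIN_NS = 7.0       # receiver setup time


def timing_margin_ns(period=CLOCK_PERIOD_NS, t_val=T_VAL_MAX_NS,
                     t_prop=T_PROP_MAX_NS, t_su=T_SU_MIN_NS):
    """Time left in the clock period after Tval + Tprop + Tsu (negative means it does not fit)."""
    return period - (t_val + t_prop + t_su)


def max_slots(load_budget, onboard_devices):
    """Add-in connectors that fit when each populated slot counts as two loads."""
    return max(0, (load_budget - onboard_devices) // 2)


print(timing_margin_ns())            # margin at 33 MHz
print(timing_margin_ns(period=15.0)) # same delays against a 15 ns period
print(max_slots(10, 2), max_slots(12, 2))
```

With the Figure 1-4 values, 28 ns of the 30 ns period is consumed, leaving 2 ns of margin. Holding the same three delays against a 15 ns period gives a negative margin, which is consistent with the text's point that higher frequencies require shorter traces and fewer loads. Assuming two devices on the system board, a budget of 10 to 12 loads leaves room for 4 or 5 connectors.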

To connect more loads in a system, a PCI-to-PCI bridge is needed, as shown in
Figure 1-5. By the time more modern chipsets were available, peripherals had
grown so fast that their competition for access to the shared PCI bus was
limiting their performance. PCI speeds didn't keep up, and it became a system
bottleneck even though it was still very popular for peripherals. The solution
to this problem was to move PCI out of the main path between system peripherals
and memory, replacing the chipset interconnect with a proprietary solution (in
this example, Intel's Hub Link interface).

A PCI Bridge is an extension to the topology. Each Bridge creates a new PCI bus
that is electrically isolated from the bus above it, allowing another 10-12
loads. Some of these devices could also be bridges, allowing a large number of
devices to be connected in a system. The PCI architecture allows up to 256
buses in a single system and each of those buses can have up to 32 devices.
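
The limits stated so far (256 buses, 32 devices per bus, and up to eight functions per device) correspond to 8, 5 and 3 bits respectively, which fit in one 16-bit identifier. The Python sketch below is an illustration added here, not part of the original text. The packing order (bus in the upper 8 bits, then device, then function) is the conventional layout and is an assumption as far as this passage is concerned. The names `pack_bdf` and `unpack_bdf` are hypothetical.

```python
def pack_bdf(bus: int, device: int, function: int) -> int:
    """Pack bus (0-255), device (0-31) and function (0-7) into 16 bits."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    return (bus << 8) | (device << 3) | function


def unpack_bdf(bdf: int):
    """Split a 16-bit identifier back into (bus, device, function)."""
    return (bdf >> 8) & 0xFF, (bdf >> 3) & 0x1F, bdf & 0x7


bdf = pack_bdf(1, 2, 3)
print(hex(bdf), unpack_bdf(bdf))
print(256 * 32 * 8)  # number of addressable functions under these limits
```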


Figure 1-5: 33 MHz PCI System, Including a PCI-to-PCI Bridge

[Block diagram: Processor, FSB, Memory Controller Hub (Intel 8XX GMCH) with
AGP 4x GFX and DDR SDRAM, Hub Link to the IO Controller Hub (ICH4) with IDE
(CD, HDD), USB, LPC (Super IO, COM1, COM2, Boot ROM), AC'97 Link (Modem Codec,
Audio Codec) and Ethernet, a 33 MHz Primary PCI Bus with Slots, and a PCI
Bridge leading to a Secondary PCI Bus with Ethernet.]

PCI Bus Architecture Perspective

PCI Transaction Models


PCI uses three models for data transfer just as previous bus models did:
Programmed I/O (PIO), Peer-to-peer, and DMA. These models are illustrated in
Figure 1-6 on page 19 and described in the following sections.

Programmed I/O
PIO was commonly used in the early days of the PC because designers were
reluctant to add the expense or complexity to their devices of transaction
management logic. The processor could do the job faster than any other device
anyway so, in this model, it handles all the work. For example, if a PCI device
interrupts the CPU to indicate that it needs to put data in memory, the CPU
will end up reading data from the PCI device into an internal register and then
copying that register to memory. Going the other way, if data is to be moved
from memory to the PCI device, software instructs the CPU to read from memory
into its internal register and then write that register to the PCI device.

The process works but is inefficient for two reasons. First, there are two bus
cycles generated by the CPU for every data transfer, and second, the CPU is
busy with data transfer housekeeping rather than more interesting work. In the
early days this was the fastest transfer method and the single-tasking
processor didn't have much else to do. These types of inefficiencies are
typically not acceptable in modern systems, so this method is no longer very
common for data transfers, and instead the DMA method described in the next
section is the preferred approach. However, programmed IO is still a necessary
transaction model in order for software to interact with a device.
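
The cost of PIO described above, two CPU-generated bus cycles per data item, can be made concrete with a toy model. The Python sketch below is an illustration added here, not part of the original text. It treats the device and memory as plain lists and only counts bus cycles. The function name `pio_copy` is hypothetical.

```python
def pio_copy(device_data, memory):
    """Copy data from a device to memory through a CPU register.

    Returns the number of bus cycles the CPU generated.
    """
    bus_cycles = 0
    for word in device_data:
        register = word          # bus cycle 1: CPU reads the device into a register
        bus_cycles += 1
        memory.append(register)  # bus cycle 2: CPU writes the register to memory
        bus_cycles += 1
    return bus_cycles


memory = []
cycles = pio_copy([0x11, 0x22, 0x33], memory)
print(cycles, memory)
```

In this model the cycle count is always twice the number of items moved, and the CPU is occupied for every one of those cycles.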

Figure 1-6: PCI Transaction Models

[Block diagram of the legacy PCI platform (Processor, North Bridge, 33 MHz PCI
bus, South Bridge, ISA bus) with three transfer paths marked: Programmed I/O,
DMA, and Peer to Peer.]

Direct Memory Access (DMA)


A more efficient method of transferring data is called DMA (direct memory
access). In this model another device, called a DMA engine, handles the details
of memory transfers to a peripheral on behalf of the processor, offloading this
tedious task. Once the CPU has programmed the starting address and byte count
into it, the DMA engine handles the bus protocol and address sequencing on its
own. This didn't involve any change to the PCI peripherals and allowed them to
keep their low-cost designs. Later, improved integration allowed peripherals to
integrate this DMA functionality locally, so they didn't need an external DMA
engine. These devices were capable of handling their own bus transfers and were
called Bus Master devices.

Figure 1-3 on page 15 is an example of a Bus Master transaction on PCI. The
North Bridge might decode the address and recognize that it will be the target
for the transaction. In the data phase of the bus cycle, data is transferred
between the Bus Master and the North Bridge acting as the target. The North
Bridge in turn will generate DRAM bus cycles to communicate with system
memory. After the transfer is completed, the PCI peripheral might generate an
interrupt to inform the system. The DMA method of data transfer is more
efficient because the CPU is not involved in the data movement, and a single
bus cycle may be sufficient to move a block of data.

Peer-to-Peer
If a device is capable of acting as a Bus Master, then another interesting
option presents itself. One PCI Bus Master could initiate a transfer to another
PCI device, with the result that the entire transaction remains local to the
PCI bus and doesn't involve any other system resources. Since this transaction
takes place between devices that are considered peers in the system, it's
referred to as a "peer-to-peer" transaction. This has some obvious efficiencies
because the rest of the system remains free to do other work. Nevertheless,
it's rarely used in practice because the initiator and target don't often use
the same format for the data unless both are made by the same vendor.
Consequently, the data usually must first be sent to memory where the CPU can
reformat it before it is then transferred to the target, defeating the goal of
a peer-to-peer transfer.

PCI Bus Arbitration


Consider Figure 1-2 on page 13. Since PCI devices today are almost all capable
of being bus master, they are able to do both DMA and peer-to-peer transfers.
In a shared bus architecture like PCI, they have to take turns on the bus, so a
device that wants to initiate transactions must first request ownership of the
bus from the bus arbiter. The arbiter sees all the current requests and uses an
implementation-specific algorithm to decide which Bus Master gets ownership of
the bus next. The PCI spec doesn't describe this algorithm, but does state that
it must be "fair" and not starve any device for access.


The arbiter can grant bus ownership to the next requesting device while the
previous Bus Master is still executing its transfer, so that no clocks are used
on the bus to sort out the next owner. As a result, the arbitration appears to
happen behind the scenes and is referred to as "hidden" bus arbitration, which
was a design improvement over earlier bus protocols.
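
Since the arbitration algorithm is implementation-specific, any concrete example is only one possibility. The Python sketch below is an illustration added here, not part of the original text. It uses round-robin as one simple policy that satisfies the fairness requirement, because every requesting master is reached before any master is granted twice. It is not presented as the algorithm any particular chipset uses, and the class name `RoundRobinArbiter` is hypothetical.

```python
class RoundRobinArbiter:
    """One possible 'fair' arbiter: grants rotate among requesting masters."""

    def __init__(self, num_masters: int):
        self.num_masters = num_masters
        self.last_granted = num_masters - 1  # so that master 0 is considered first

    def grant(self, requests):
        """requests: set of master indices asserting REQ#.

        Returns the master that receives GNT#, or None if nobody is requesting.
        """
        for offset in range(1, self.num_masters + 1):
            candidate = (self.last_granted + offset) % self.num_masters
            if candidate in requests:
                self.last_granted = candidate
                return candidate
        return None


arbiter = RoundRobinArbiter(3)
print([arbiter.grant({0, 1, 2}) for _ in range(4)])  # all three masters keep requesting
print(arbiter.grant({2}), arbiter.grant(set()))
```

In a hidden-arbitration scheme, a call such as `grant` would be evaluated while the current transfer is still running, so that the next owner is already known when the bus goes idle.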

PCI Inefficiencies
PCI Retry Protocol
When a PCI master initiates a transaction to access a target device and the
target device is not ready, the target signals a transaction retry. This
scenario is shown in Figure 1-7.

Figure 1-7: PCI Transaction Retry Mechanism

[Block diagram of the legacy PCI platform with three steps marked on the PCI
bus: 1. Initiate, 2. Target device not ready, 3. Retry.]

Consider the following example in which the North bridge initiates a memory
read transaction to read data from the Ethernet device. The Ethernet target
claims the bus cycle. However, the Ethernet target does not immediately have
the data to return to the North bridge master. The Ethernet device has two
choices by which to delay the data transfer. The first is to insert wait states
in the data phase. If only a few wait states are needed, then the data is still
transferred efficiently. If however the target device requires more time (more
than 16 clocks from the beginning of the transaction), then the second option
the target has is to signal a retry with a signal called STOP#. A retry tells
the master to end the bus cycle prematurely without transferring data. Doing so
prevents the bus from being held for a long time in wait states, which
compromises the bus efficiency. The Bus Master that is retried by the target
waits a minimum of 2 clocks and must once again arbitrate for use of the bus to
re-initiate the identical bus cycle. During the time that the Bus Master is
retried, the arbiter can grant the bus to other requesting masters so that the
PCI bus is more efficiently utilized. By the time the retried master is granted
the bus and it re-initiates the bus cycle, hopefully the target will claim the
cycle and will be ready to transfer data. The bus cycle goes to completion with
data transfer. Otherwise, if the target is still not ready, it retries the
master's bus cycle again and the process is repeated until the master
successfully transfers data.
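
The master's side of the retry protocol is a loop: issue the cycle, and if the target signals retry, back off for at least 2 clocks, re-arbitrate, and issue the identical cycle again. The Python sketch below is an illustration added here, not part of the original text. It models only that loop. The parameter `retries_before_ready` (how many times the target answers with retry) and the `max_attempts` safety limit are hypothetical and are not part of the PCI protocol.

```python
MIN_IDLE_CLOCKS_AFTER_RETRY = 2  # from the text: the master waits at least 2 clocks


def read_with_retry(retries_before_ready: int, max_attempts: int = 16):
    """Return (attempts, minimum idle clocks spent backing off).

    The target signals retry on the first `retries_before_ready` attempts and
    transfers data on the next one.
    """
    idle_clocks = 0
    for attempt in range(1, max_attempts + 1):
        target_ready = attempt > retries_before_ready
        if target_ready:
            return attempt, idle_clocks          # data transferred on this attempt
        # Target asserted STOP# with no data: end the cycle, back off, re-arbitrate.
        idle_clocks += MIN_IDLE_CLOCKS_AFTER_RETRY
    raise TimeoutError("target never became ready")


print(read_with_retry(0))  # target ready immediately
print(read_with_retry(3))  # target retries the master three times first
```

While the retried master is backing off, the bus is free for other masters, which is the efficiency benefit the text describes.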

PCI Disconnect Protocol


When a PCI master initiates a transaction to access a target device and if the
target device is able to transfer at least one doubleword of data but cannot
complete the entire data transfer, it disconnects the transaction at the point
at which it cannot continue. This scenario is illustrated in Figure 1-8 on
page 23.

Consider the following example in which the North bridge initiates a burst
memory read transaction to read data from the Ethernet device. The Ethernet
target device claims the bus cycle and transfers some data, but then runs out
of data to transfer. The Ethernet device has two choices to delay the data
transfer. The first option is to insert wait states during the current data
phase while waiting for additional data to arrive. If the target needs to
insert only a few wait states, then the data is still transferred efficiently.
If however the target device requires more time (the PCI specification allows a
maximum of 8 clocks in the data phase), then the target device must signal a
disconnect. To do this the target asserts STOP# in the middle of the bus cycle
to tell the master to end the bus cycle prematurely. A disconnect results in
some data transferred, while a retry does not. Disconnect frees the bus from
long periods of wait states. The disconnected master waits a minimum of 2
clocks before once again arbitrating for use of the bus and continuing the bus
cycle at the disconnected address. During the time that the Bus Master is
disconnected, the arbiter may grant the bus to other requesting masters so that
the PCI bus is utilized more efficiently. By the time the disconnected master
is granted the bus and continues the bus cycle, hopefully the target is ready
to continue the data transfer until it is completed. Otherwise, the target once
again retries or disconnects the master's bus cycle and the process is repeated
until the master successfully transfers all its data.
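
The distinguishing feature of a disconnect is that some data has already moved, so the master resumes at the address where the target stopped rather than repeating the whole cycle. The Python sketch below is an illustration added here, not part of the original text. It models a burst read in which the target disconnects after a fixed number of doublewords per bus cycle. The fixed `chunk` size and the use of list indices as doubleword addresses are simplifying assumptions, and the name `burst_read` is hypothetical.

```python
def burst_read(target_data, start, length, chunk):
    """Read `length` doublewords starting at index `start`.

    The target delivers at most `chunk` doublewords per bus cycle and then
    disconnects. Returns (data received, number of bus cycles used).
    """
    if chunk < 1:
        raise ValueError("a disconnect transfers at least one doubleword")
    received = []
    address = start
    bus_cycles = 0
    while len(received) < length:
        bus_cycles += 1                            # arbitrate and (re)start the cycle
        count = min(chunk, length - len(received))
        received.extend(target_data[address:address + count])
        address += count                           # resume at the disconnected address
    return received, bus_cycles


data = list(range(100, 116))
print(burst_read(data, start=4, length=8, chunk=3))
```

With a chunk of three doublewords, an eight-doubleword read completes in three bus cycles (3 + 3 + 2), each one picking up where the previous one was disconnected.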


Figure 1-8: PCI Transaction Disconnect Mechanism

[Block diagram of the legacy PCI platform with three steps marked on the PCI
bus: 1. Initiate, 2. Some data transferred, 3. Disconnect.]

PCI Interrupt Handling


PCI devices use one of four sideband interrupt signals (INTA#, INTB#, INTC#,
or INTD#) to send an interrupt request to the system. When one of the pins is
asserted, the interrupt controller in a single-CPU system responds by asserting
the INTR (interrupt request) pin to the CPU. Later multi-CPU designs needed to
improve on the single-wire input for interrupts and changed to an APIC
(Advanced Programmable Interrupt Controller) model, in which the controller
sends a message to the multiple CPUs instead of asserting the INTR pin to one
of them. Regardless of the delivery model, an interrupted CPU must determine
the source of the interrupt and then service the interrupt. The legacy model
required several bus cycles for this and wasn't very efficient. The APIC model
is better but also leaves room for improvement.


PCI Error Handling


PCI devices can optionally detect and report address and data phase parity
errors during transactions. PCI generates even parity across most of the
signals during a transaction by using the PAR signal. This means that if the
number of set bits during an address or data phase is odd, the master device
will set the PAR signal to make the parity even. The target device receives
the address or data and checks for errors. Parity errors are detectable only
as long as an odd number of signals are affected, causing the received number
of ones to be odd. If a device detects a data phase parity error, it asserts
PERR# (parity error). This is potentially a recoverable error since, for
cases like a memory read, just repeating the transaction may resolve the
problem. PCI does not include any automatic or hardware-based recovery
mechanisms, though, so any attempts to resolve the error would be handled by
software.
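
The even-parity rule described above can be illustrated with a short sketch.
It assumes, beyond what this passage states, that PAR covers the 32
address/data lines AD[31:0] and the four C/BE#[3:0] lines of a 32-bit bus.
The function names are hypothetical and exist only for this example.

```python
def pci_par(ad: int, cbe: int) -> int:
    """Return the PAR value that makes the count of 1s (AD + C/BE# + PAR) even.

    Assumption for this sketch: parity is computed over AD[31:0] and C/BE#[3:0].
    """
    ones = bin(ad & 0xFFFFFFFF).count("1") + bin(cbe & 0xF).count("1")
    return ones & 1  # 1 only when the covered signals hold an odd number of 1s


def parity_ok(ad: int, cbe: int, par: int) -> bool:
    """Receiver-side check: recompute PAR and compare with the received bit."""
    return pci_par(ad, cbe) == (par & 1)


ad, cbe = 0x00000007, 0b0110          # 3 ones + 2 ones = 5 ones (odd)
par = pci_par(ad, cbe)                # so the master drives PAR = 1

print(par)
print(parity_ok(ad, cbe, par))        # intact transfer passes the check
print(parity_ok(ad ^ 0x1, cbe, par))  # one flipped bit is detected
print(parity_ok(ad ^ 0x3, cbe, par))  # two flipped bits slip through
```

The last line shows why only an odd number of corrupted signals is
detectable: flipping two bits leaves the count of ones even, so the check
still passes.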

Figure 1-9: PCI Error Handling

[Figure: PERR# and SERR# from the PCI devices connect to the error logic in
the South Bridge, which drives NMI to the processor.]

However, it's a different matter if a parity error is detected during the
address phase. In this case the address was corrupted and the wrong target
may have recognized the address. There's no way to tell what the corrupted
address became or what devices on the bus did in response to it, so there's
also no simple recovery. As a result, errors of this type result in the
assertion of the SERR# (system error) pin, which typically results in a call
to the system error handler. In older machines, this would often halt the
system as a precaution, resulting in the "blue screen of death."

In older machines, both PERR# and SERR# were connected to the error logic in
the South Bridge. For reasons of simplicity and cost, this typically resulted
in the assertion of an NMI signal (non-maskable interrupt signal) to the CPU,
which would often simply halt the system.

PCI Address Space Map


PCI architecture supports 3 address spaces as shown in Figure 1-10 on page
26: memory, I/O and configuration address space. x86 processors can access
memory and IO space directly. A PCI device maps into the processor's memory
address space and can either support 32 or 64 bit memory addressing. In I/O
address space, PCI supports 32-bit addresses but, since x86 CPUs only used 16
bits for I/O space, many platforms limit the I/O space to 64 KB (16 bits'
worth).

PCI also introduced a third address space called configuration space that the
CPU could only indirectly access. Each function contains internal registers
for configuration space that allow software visibility and control of its
addresses and resources in a standardized way, providing a true plug-and-play
environment in the PC. Each PCI function may have up to 256 Bytes of
configuration address space. Given that PCI supports up to 8
functions/device, 32 devices/bus and up to 256 buses/system, then the total
amount of configuration space associated with a system is 256 Bytes/function
x 8 functions/device x 32 devices/bus x 256 buses/system = 16 MB of
configuration space.

Since an x86 CPU cannot access configuration space directly, it must do so
indirectly by indexing through IO registers (although with PCI Express a new
method to access configuration space was introduced by mapping it into the
memory address space). The legacy model, shown in Figure 1-10 on page 26,
uses an IO Port called Configuration Address Port located at address
CF8h-CFBh and a Configuration Data Port mapped to address CFCh-CFFh. Details
regarding this method and the memory-mapped method of accessing configuration
space are explained in the next section.


Figure 1-10: Address Space Mapping

[Figure: three maps side by side. Memory Map (4 GB / 16 EB): conventional
memory up to 640 KB; legacy video, expansion ROM and boot ROM up to 1 MB;
extended memory up to the DRAM boundary; PCI memory and AGP video above it.
IO Map (64 KB): legacy IO, PCI IO space, Address Port at CF8h-CFBh and Data
Port at CFCh-CFFh. PCI Configuration Space (16 MB): made up of 256-byte
blocks.]

PCI Configuration Cycle Generation


Since IO address space is limited, the legacy model was designed to be very
conservative with addresses. The common way of doing that in IO space was to
have one register for pointing to an internal location, and a second one for
reading or writing the data. In PCI configuration that involves two steps.

Step 1: The CPU generates an IO write to the Address Port at IO address CF8h
in the North Bridge to give the address of the configuration register to be
accessed. This address, shown in Figure 1-11 on page 27, consists primarily
of the three things that locate a PCI function within the topology: which bus
we want to access out of the 256 possible, which device on that bus out of
the 32 possible, and which function within that device out of the 8 possible.
The only other information needed is to identify which of the 64 dwords (256
bytes) in that function's configuration space is to be accessed.


Step 2: The CPU generates either an IO read or IO write to the Data Port at
location CFCh in the North Bridge. Based on that, the North Bridge then
generates a configuration read or configuration write transaction to the PCI
bus specified in the Address Port.

Figure 1-11: Configuration Address Register

The register occupies IO addresses 0CF8h-0CFBh, with bits 7:0 at 0CF8h and
bits 31:24 at 0CFBh.

Bits     Field
31       Enable Configuration Space Mapping (1 = enabled)
30:24    Reserved
23:16    Bus Number
15:11    Device Number
10:8     Function Number
7:2      Doubleword register pointer (64 DW)
1:0      Should always be zeros
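
As a rough illustration of the two-step mechanism and the register layout in
Figure 1-11, the sketch below packs a bus, device, function and dword number
into the 32-bit value that Step 1 writes to the Address Port. The function
names are hypothetical, and the sketch only computes and prints values; it
does not perform real port IO.

```python
CONFIG_ADDRESS_PORT = 0xCF8  # IO address of the Configuration Address Port
CONFIG_DATA_PORT = 0xCFC     # IO address of the Configuration Data Port


def config_address(bus: int, device: int, function: int, dword: int) -> int:
    """Pack the target location into the value written to the Address Port."""
    if not 0 <= bus < 256:
        raise ValueError("bus must be 0..255")
    if not 0 <= device < 32:
        raise ValueError("device must be 0..31")
    if not 0 <= function < 8:
        raise ValueError("function must be 0..7")
    if not 0 <= dword < 64:
        raise ValueError("dword must be 0..63")
    enable = 1 << 31  # bit 31: enable configuration space mapping
    # Bits 1:0 stay zero because the dword number is shifted left by 2.
    return enable | (bus << 16) | (device << 11) | (function << 8) | (dword << 2)


def decode_config_address(value: int) -> dict:
    """Unpack an Address Port value back into its fields."""
    return {
        "enabled": bool((value >> 31) & 1),
        "bus": (value >> 16) & 0xFF,
        "device": (value >> 11) & 0x1F,
        "function": (value >> 8) & 0x7,
        "dword": (value >> 2) & 0x3F,
    }


addr = config_address(bus=1, device=2, function=3, dword=3)
print(f"Step 1: IO write of {addr:#010x} to port {CONFIG_ADDRESS_PORT:#x}")
print(f"Step 2: IO read or write at port {CONFIG_DATA_PORT:#x}")
print(decode_config_address(addr))

# The same field widths give the total configuration space quoted earlier.
total_bytes = 256 * 8 * 32 * 256
print(total_bytes // (1024 * 1024), "MB")
```

On real hardware Step 1 and Step 2 would be privileged IO port accesses
issued by firmware or the operating system; the arithmetic is the same.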

PCI Function Configuration Register Space


Each PCI function contains up to 256 bytes of configuration space. The first
64 bytes of each function's configuration space contains a structure called
the Header, while the remaining 192 Bytes support optional functionality.
System configuration is first performed by Boot ROM firmware. After the OS
loads, it may reconfigure the system and rearrange resource assignments, with
the result that the process of system configuration may be done twice.

There are two basic classes of PCI functions as defined by their header
types. A Type 1 header identifies a function that is a bridge (as shown in
Figure 1-12 on page 28) and creates another bus in the topology, while a Type
0 header indicates a function that is NOT a bridge (as shown in Figure 1-13
on page 29). This header type information is contained in a field by the same
name in dword 3, byte 2, and should be one of the first things software
checks when discovering which functions exist in the system (a process called
enumeration).
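
A minimal sketch of that check follows, assuming software has already read
dword 3 (offset 0Ch) of a function's header as a 32-bit integer. The function
names are hypothetical. Masking off bit 7 of the field reflects the PCI
convention that this bit flags a multi-function device; that detail is not
covered in this passage and is labeled as an assumption in the code.

```python
def header_type_byte(dword3: int) -> int:
    """Extract byte 2 (bits 23:16) of configuration dword 3."""
    return (dword3 >> 16) & 0xFF


def classify(dword3: int) -> str:
    """Classify a function as bridge or non-bridge from its Header Type field."""
    # Assumption: bit 7 is a multi-function flag, so only bits 6:0 give the layout.
    layout = header_type_byte(dword3) & 0x7F
    if layout == 0:
        return "Type 0 (not a bridge)"
    if layout == 1:
        return "Type 1 (bridge)"
    return "other header type"


print(classify(0x00010000))  # Header Type byte = 01h
print(classify(0x00800010))  # Header Type byte = 80h, Cache Line Size = 10h
```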


Figure 1-12: PCI Configuration Header Type 1 (Bridge)

Type 1 Header: configuration registers of a bridge function that sits between
its primary bus and its secondary bus. Fields in each row run from bit 31
down to bit 0.

Offset   Fields
00h      Device ID | Vendor ID
04h      Status | Command
08h      Class Code | Rev ID
0Ch      BIST | Header Type | Latency Timer | Cache Line Size
10h      Base Address 0 (BAR0)
14h      Base Address 1 (BAR1)
18h      Secondary Lat Timer | Subordinate Bus # | Secondary Bus # | Primary Bus #
1Ch      Secondary Status | IO Limit | IO Base
20h      (Non-Prefetchable) Memory Limit | (Non-Prefetchable) Memory Base
24h      Prefetchable Memory Limit | Prefetchable Memory Base
28h      Prefetchable Memory Base Upper 32 Bits
2Ch      Prefetchable Memory Limit Upper 32 Bits
30h      IO Limit Upper 16 Bits | IO Base Upper 16 Bits
34h      Reserved | Capability Pointer
38h      Expansion ROM Base Address
3Ch      Bridge Control | Interrupt Pin | Interrupt Line


Figure 1-13: PCI Configuration Header Type 0 (not a Bridge)

Type 0 Header: configuration registers of a non-bridge device function.
Fields in each row run from bit 31 down to bit 0.

Offset   Fields
00h      Device ID | Vendor ID
04h      Status | Command
08h      Class Code | Rev ID
0Ch      BIST | Header Type | Latency Timer | Cache Line Size
10h      Base Address 0 (BAR0)
14h      Base Address 1 (BAR1)
18h      Base Address 2 (BAR2)
1Ch      Base Address 3 (BAR3)
20h      Base Address 4 (BAR4)
24h      Base Address 5 (BAR5)
28h      CardBus CIS Pointer
2Ch      Subsystem Device ID | Subsystem Vendor ID
30h      Expansion ROM Base Address
34h      Reserved | Capability Pointer
38h      Reserved
3Ch      Max Lat | Min Gnt | Interrupt Pin | Interrupt Line

Details of the configuration register space and the enumeration process are
described later. For now we simply want you to become familiar with the big
picture of how all the parts fit together.

Higher-bandwidth PCI
To support higher bandwidth, the PCI specification was updated to support
both wider (64-bit) and faster (66 MHz) versions, achieving 533 MB/s. Figure
1-14 shows an example of a 66 MHz, 64-bit PCI system.


Figure 1-14: 66 MHz PCI Bus Based Platform

[Figure: a two-processor platform built around the Intel 860 Memory
Controller Hub (MCH) with RDRAM and AGP 4x graphics. P64H bridges attached
through Hub Links provide the 66 MHz PCI slots. The IO Controller Hub (ICH2)
provides the 33 MHz PCI bus, IDE, USB 2.0, LPC (Super IO) and the AC'97 link.]

Limitations of 66 MHz PCI bus


While the throughput of the bus was doubled at this speed relative to the 33
MHz bus, the diagram illustrates one of its major shortcomings: using the
same reflected-wave switching model with only half the timing budget meant
that the loading on the bus had to be greatly reduced. The result was that
only one add-in card could be supported on each bus. Adding more devices
meant adding more PCI bridges and buses, which increased both cost and board
real estate requirements. The 64-bit PCI bus also increased pin count, which
raised system cost and lowered system reliability. In combination, it's easy
to see why these factors limited the popularity of the 64-bit and 66 MHz
versions of the PCI bus.


Signal Timing Problems with the Parallel PCI Bus Model beyond 66 MHz
PCI bus clock frequency cannot be increased beyond 66 MHz given the realistic
loads that exist on a PCI bus and signal flight times. With a 66 MHz clock,
the clock period is 15 ns. Setup time allocated at the receiver is 3 ns. With
the PCI non-registered input signal bus model, reducing signal setup time
below this 3 ns value is not realistic. The remaining 12 ns of the timing
budget is allocated to output delays at the transmitter and signal flight
time. Clocking the PCI bus any faster than 66 MHz means reducing the clock
period, and a transmitted signal would then not arrive at the receiver in
time to be sampled.
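
The budget arithmetic above can be reproduced with a small sketch. The
function name is hypothetical, 66.67 MHz is used as the nominal "66 MHz"
clock, and the 1 ns setup figure for registered inputs is an illustrative
value taken from the "below 1 ns" statement that follows, not a specification
number.

```python
def timing_budget_ns(clock_mhz: float, setup_ns: float) -> tuple:
    """Return (clock period, time left for output delay plus flight time)."""
    period_ns = 1000.0 / clock_mhz
    return period_ns, period_ns - setup_ns


# Conventional PCI at 66 MHz with a 3 ns receiver setup time.
period, remaining = timing_budget_ns(66.67, setup_ns=3.0)
print(f"66 MHz:  period {period:.1f} ns, remaining {remaining:.1f} ns")

# The same 3 ns setup time at 133 MHz leaves far less of the period.
period, remaining = timing_budget_ns(133.33, setup_ns=3.0)
print(f"133 MHz: period {period:.1f} ns, remaining {remaining:.1f} ns")

# Registered inputs (assumed 1 ns setup here) win some of that time back.
period, remaining = timing_budget_ns(133.33, setup_ns=1.0)
print(f"133 MHz, registered inputs: remaining {remaining:.1f} ns")
```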

The PCI-X bus introduced in the next section takes the approach of
registering all input signals with a flip-flop before using them. Doing so
reduced signal setup time to below 1 ns. This savings in setup time allows
the PCI-X bus to run at higher frequencies of 100 MHz or even 133 MHz. In the
next section, we describe PCI-X bus architecture briefly.

Introducing PCI-X
PCI-X is backward compatible with PCI in both hardware and software, but
provides better performance and higher efficiency. It uses the same connector
format, so PCI-X devices can be plugged into PCI slots and vice versa. And it
uses the same configuration model, so device drivers, operating systems, and
applications that run on a PCI system also run on a PCI-X system.

To achieve higher speeds without changing the PCI signaling model, PCI-X
added a few tricks to improve the bus timing. First, PCI-X devices implement
PLL (phase-locked loop) clock generators that provide phase-shifted clocks
internally. That allows the outputs to be driven a little earlier and the
inputs to be sampled a little later, improving the timing on the bus.
Likewise, PCI-X inputs are registered (latched) at the input pin of the
target device, resulting in shorter setup times. The time gained by these
means increased the time available for signal propagation on the bus and
allowed higher clock frequencies.

PCI-X System Example


An example of an Intel 7500 server chipset based system is shown in Figure
1-15 on page 32. The MCH chip includes three additional high-performance Hub
Link 2.0 ports that are connected to three PCI-X Hub 2 bridges (P64H2). Each
bridge supports two PCI-X buses that can run at frequencies up to 133 MHz.
The Hub Link 2.0 can sustain the higher bandwidth requirements for PCI-X
traffic. Note that we have the same loading problem that we did for 66 MHz
PCI, resulting in a large number of buses needed to support more devices and
a relatively expensive solution. The bandwidth is much higher now, though.

Figure 1-15: 66 MHz/133 MHz PCI-X Bus Based Platform

[Figure: a two-processor platform built around the Intel 7500 Memory
Controller Hub (MCH) with DDR SDRAM. Three P64H2 bridges attached through Hub
Link 2 provide 64-bit PCI-X buses running at 66 MHz, 100 MHz or 133 MHz. The
IO Controller Hub (ICH3), attached through Hub Link 1, provides the 33 MHz
PCI bus, IDE, USB, LPC and the AC'97 link.]

PCI-X Transactions
Figure 1-16 on page 33 shows an example of a PCI-X burst memory read
transaction. Note that PCI-X does not allow Wait States after the first data
phase. This is possible because the transfer size is now provided to the
target device in the Attribute phase of the transaction, so the target device
knows exactly what is going to be required of it. In addition, most PCI-X bus
cycles are bursts and data is generally transferred in blocks of 128 Bytes.
These features allow for more efficient bus utilization and device buffer
management.


Figure 1-16: Example PCI-X Burst Memory Read Bus Cycle

[Figure: timing diagram over clocks 1 to 10 showing the Address Phase,
Attribute Phase, Response Phase, Data Phases 1 to 4 and a turnaround cycle
before the bus returns to idle. Signals shown: CLK, FRAME#, AD[31:0]
(Address, ATTR, Data-0 to Data-3), C/BE#[3:0] (Cmd, ATTR), IRDY#, TRDY# and
DEVSEL# (Decode A).]

PCI-X Features
Split-Transaction Model
In a conventional PCI read transaction, the Bus Master initiates a read to a
target device on the bus. As described earlier, if the target is unprepared
to finish the transaction it can either hold the bus with Wait States while
fetching the data, or issue a Retry in the process of a Delayed Transaction.

PCI-X bus uses a Split Transaction to handle these cases, as illustrated in
Figure 1-17 on page 34. To help keep track of what each device is doing, the
device initiating the read is now called the Requester, and the device
fulfilling the read request is called the Completer. If the completer is
unable to service the request immediately, it memorizes the transaction
(address, transaction type, byte count, requester ID) and signals a split
response. This tells the requester to put this transaction aside in a queue,
end the current bus cycle, and release the bus to the idle state. That makes
the bus available for other transactions while the completer is awaiting the
requested data. The requester is free to do whatever it likes while it waits
for the completer, such as initiating other requests, even to the same
completer. Once the completer has gathered the requested data, it then
arbitrates for ownership of the bus and initiates a split completion during
which it returns the requested data. The requester claims the split
completion bus cycle and accepts the data from the completer. The split
completion looks very much like a write transaction to the system. This Split
Transaction Model is possible because the requester not only indicates how
much data it is requesting in the Attribute phase, but also indicates who it
is (its Bus:Device:Function number), which allows the completer to target the
correct device with the completion.

Two bus transactions are needed to complete the entire data transfer, but
between the read request and the split completion the bus is available for
other work. The requester does not need to poll the device with retries to
learn when the data is ready. The completer simply arbitrates for the bus and
drives the requested data back when it is ready. This makes for a much more
efficient transaction model in terms of bus utilization.

These protocol enhancements made to the PCI-X bus architecture described so
far contribute towards an increased transfer efficiency of around 85% for
PCI-X as compared to 50%-60% with the standard PCI protocol.

Figure 1-17: PCI-X Split Transaction Protocol

1. Requester initiates read transaction
2. Completer unable to return data immediately
3. Completer memorizes transaction
4. Completer issues split response
5. Later, Completer initiates split completion bus cycle to return read data

Message Signaled Interrupts


PCI-X devices require MSI (Message Signaled Interrupt) capability, which was
developed as a way to reduce or eliminate the need to share interrupts across
multiple devices as was typically required in the legacy interrupt
architecture.


To generate an interrupt request using MSI, a device initiates a memory write
transaction using a predefined address range that is understood to be an
interrupt which should be delivered to one or more CPUs, and the data is a
unique interrupt vector associated with that device. The CPU, armed with the
interrupt number, is able to immediately jump to the interrupt service
routine for the device and avoids the overhead associated with finding which
device generated the interrupt. In addition, no sideband pins are needed.

Transaction Attributes
Finally, PCI-X also added another phase to the beginning of each transaction
called the Attribute Phase (see Figure 1-16 on page 33). In this time slot
the requester delivers information that can be used to help improve the
efficiency of transactions on the bus, such as the byte count for this
request and who the requester is (Bus:Device:Function number). In addition to
those items, two new bits were added to help characterize this transaction:
the "No Snoop" bit and the "Relaxed Ordering" bit.

No Snoop (NS): Normally, when a transaction moves data into or out of
memory, the CPU's internal caches need to be checked to see if that memory
location has been copied into one or more CPU caches. If so, the cache
contents may need to be written back to memory or invalidated before the
requested transaction is allowed to access memory. Naturally, this snoop
process takes time and adds latency to a request. Sometimes the software is
aware that a requested location will never be found in the CPU caches
(perhaps because the location was defined by the system as uncacheable), so
snooping is unnecessary and that step could be skipped. The No Snoop bit was
added with precisely that case in mind.

Relaxed Ordering (RO): Normally, transactions are required to remain in the
same order that they were issued on the bus while they go through buffers in
bridges. This is referred to as the Strongly Ordered model, and PCI and PCI-X
generally follow that rule with a few exceptions. That's because it helps
resolve dependencies among transactions that are related to each other, such
as writing and then reading the same location. However, not all transactions
actually have dependencies. If they don't, then forcing them to stay in order
can result in loss of performance, and that's what this bit was designed to
alleviate. If the requester knows that a particular transaction is unrelated
to the other transactions that have gone before, it can set this bit to tell
bridges that this transaction is allowed to jump ahead in the queue to give
better performance.


Higher Bandwidth PCI-X


Problems with the Common Clock Approach of PCI and PCI-X 1.0 Parallel Bus Model

An issue that becomes clear when trying to migrate a bus like PCI to higher
speeds is that parallel bus designs have some inherent limitations. Figure
1-18 on page 36 helps illustrate these. These designs use a common or
distributed clock, in which data is driven out on one clock edge and latched
in on the next clock edge so that the total timing budget is the time for one
clock period. Naturally, the higher the frequency, the smaller the clock
period and thus the smaller the timing budget.

Figure 1-18: Inherent Problems in a Parallel Design

[Figure: a transmitter and a receiver, each driven by the common clock and
connected by the transmission media. Labels: flight time; incorrect sampling
due to skew.]

The first issue to note is signal skew. When multiple data bits are sent at
once, they experience slightly different delays and arrive at slightly
different times at the receiver. If that difference is too large, incorrect
signal sampling with the clock may occur at the receiver as shown in the
diagram. A second issue is clock skew between multiple devices. The arrival
time of the common clock at one device is not precisely the same as the
arrival time at the other, which further reduces the timing budget. Finally,
a third issue relates to the time it takes for the signal to propagate from a
transmitter to a receiver, called the flight time. The clock period or timing
budget must be greater than the signal flight time. To ensure this, the board
design is required to implement signal traces that are short enough such that
signal propagation delays are smaller than the clock period. In many board
designs, traces this short are not practical.

To further improve performance in spite of these limitations, a couple of
techniques can be used. First, the existing protocol can be streamlined and
made more efficient. And second, the bus model can be changed to a
source-synchronous clocking model where the bus signal and clock (strobe) are
driven at the same time on signals that experience equal propagation delay.
This is the approach taken by PCI-X 2.0 protocol.

PCI-X 2.0 Source-Synchronous Model


PCI-X 2.0 further increased the bandwidth of PCI-X. As before, the devices
and connectors remained hardware and software backward compatible with PCI
devices and connectors. To achieve the higher speeds, the bus uses a
source-synchronous delivery model to support either Dual Data Rate (DDR) or
Quad Data Rate (QDR).

The term "source synchronous" means that the device transmitting the data
also provides another signal that travels the same basic path as the data. As
illustrated in Figure 1-19 on page 38, that signal in PCI-X 2.0 is called a
strobe and is used by the receiver for latching the incoming data bits. The
transmitter assigns the timing relationship between the data and strobe and
as long as their paths are similar in length and other characteristics that
can affect transmission latency, that relationship will be about the same
when they arrive at the receiver and the receiver can simply use the Strobe
as the signal to latch the data in with. This allows higher speeds because
clock skew with respect to the common clock is removed as a separate budget
item and because the issue of flight time goes away. It no longer matters how
long it takes for the data to travel from point A to point B because the
strobe that latches it in takes about the same time and so their relationship
will be unaffected.

It's important to note again that the very high-speed signal timing
eliminates the possibility of using a shared-bus model and forces a
point-to-point design instead. As a result, increasing the number of devices
means more bridges will be needed to create more buses. A device could be
designed to support this with three interfaces and an internal bridge
structure to allow them all to communicate with each other. Such a device
would have a very high pin count, though, and a higher cost, relegating PCI-X
2.0 to the very high-end market.


Since it was recognized that this would be an expensive solution that would
appeal more to high-end designers, PCI-X 2.0 also supports ECC generation and
checking. ECC is much more robust and sophisticated than parity detection,
allowing automatic correction of single-bit errors on the fly, and robust
detection of multi-bit errors. This improved error handling adds cost, but
high-end platforms need the improved reliability it provides, which makes it
a logical choice.

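Figure 1-19: Source Synchronous Clocking Model

[Figure: a source device drives several data signals and a strobe to the
receiving device, where the strobe clocks the flip-flops (D/Q) that latch
each data bit.]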

Despite the improvements in bandwidth, efficiency and reliability that came
with PCI-X (2.0), the parallel bus model was approaching its end of life and
a new model was needed to address the relentless demand for higher bandwidth
and lower cost. The new model chosen was a serial interface which is a
drastically different bus from a physical perspective, but was still made to
be software backwards compatible. We know this new model as PCI Express.


2 PCIe Architecture Overview
Previous Chapter
The previous chapter provided historical background to establish a foundation
for understanding PCI Express. This included reviewing the basics of PCI and
PCI-X 1.0/2.0. The goal was to provide a context for the overview of PCI
Express that follows.

This Chapter
This chapter provides a thorough introduction to the PCI Express architecture
and is intended to serve as an "executive level" overview, covering all the
basics of the architecture at a high level. It introduces the layered
approach given in the spec and describes the responsibilities of each layer.
The various packet types are introduced along with the protocol used to
communicate them and facilitate reliable transmission.

The Next Chapter


The next chapter provides an introduction to configuration in the PCI Express
environment. This includes the space in which a Function's configuration
registers are implemented, how a Function is discovered, how configuration
transactions are generated and routed, the difference between PCI-compatible
space and PCIe extended space, and how software differentiates between an
Endpoint and a Bridge.

Introduction to PCI Express


PCI Express represents a major shift from the parallel bus model of its
predecessors. As a serial bus, it has more in common with earlier serial
designs like InfiniBand or Fibre Channel, but it remains fully backward
compatible with PCI in software.


As is true of many high-speed serial transports, PCIe uses a bidirectional
connection and is capable of sending and receiving information at the same
time. The model used is referred to as a dual-simplex connection because each
interface has a simplex transmit path and a simplex receive path, as shown in
Figure 2-1 on page 40. Since traffic is allowed in both directions at once,
the communication path between two devices is technically full duplex, but
the spec uses the term dual-simplex because it's a little more descriptive of
the actual communication channels that exist.

Figure 2-1: Dual-Simplex Link

[Figure: PCIe Device A and PCIe Device B connected by a Link (1 to 32 lanes
wide), with packets flowing in both directions.]

The term for this path between the devices is a Link, and is made up of one
or more transmit and receive pairs. One such pair is called a Lane, and the
spec allows a Link to be made up of 1, 2, 4, 8, 12, 16, or 32 Lanes. The
number of lanes is called the Link Width and is represented as x1, x2, x4,
x8, x12, x16, and x32. The trade-off regarding the number of lanes to be used
in a given design is straightforward: more lanes increase the bandwidth of
the Link but add to its cost, space requirement, and power consumption. For
more on this, see "Links and Lanes" on page 46.

Figure 2-2: One Lane

[Figure: one Lane consists of a transmitter-to-receiver pair in each
direction.]


Software Backward Compatibility


One of the most important design goals for PCIe was backward compatibility
with PCI software. Encouraging migration away from a design that is already
installed and working in existing systems requires two things: First, a
compelling improvement that motivates even considering a change and, second,
minimizing the cost, risk, and effort of changing. A common way to help this
second factor in computers is to maintain the viability of software written
for the old model in the new one. To achieve this for PCIe, all the address
spaces used for PCI are carried forward either unchanged or simply extended.
Memory, IO, and Configuration spaces are still visible to software and
programmed in exactly the same way they were before. Consequently, software
written years ago for PCI (BIOS code, device drivers, etc.) will still work
with PCIe devices today. The configuration space has been extended
dramatically to include many new registers to support new functionality, but
the old registers are still there and still accessible in the regular way
(see "Software Compatibility Characteristics" on page 49).

Serial Transport
The Need for Speed
Of course, a serial model must run much faster than a parallel design to
accomplish the same bandwidth because it may only send one bit at a time.
This has not proven difficult, though, and in the past PCIe has worked
reliably at 2.5 GT/s and 5.0 GT/s. The reason these and still higher speeds
(8 GT/s) are attainable is that the serial model overcomes the shortcomings
of the parallel model.

Overcoming Problems. By way of review, there are a handful of problems that
limit the performance of a parallel bus and three are illustrated in Figure
2-3 on page 42. To get started, recall that parallel buses use a common
clock; outputs are clocked out on one clock edge and clocked into the
receiver on the next edge. One issue with this model is the time it takes to
send a signal from transmitter to receiver, called the flight time. The
flight time must be less than the clock period or the model won't work, so
going to smaller clock periods is challenging. To make this possible, traces
must get shorter and loads reduced but eventually this becomes impractical.
Another factor is the difference in the arrival time of the clock at the
sender and receiver, called clock skew. Board layout designers work hard to
minimize this value because it detracts from the timing budget but it can
never be eliminated. A third factor is signal skew, which is the difference
in arrival times for all the signals needed on a given clock. Clearly, the
data can't be latched until all the bits are ready and stable, so we end up
waiting for the slowest one.

Figure 2-3: Parallel Bus Limitations

[Figure: a transmitter and a receiver, each driven by the common clock and
connected by the transmission media. Labels: flight time; incorrect sampling
due to skew.]

How does a serial transport like PCIe get around these problems? First,
flight time becomes a non-issue because the clock that will latch the data
into the receiver is actually built into the data stream and no external
reference clock is necessary. As a result, it doesn't matter how small the
clock period is or how long it takes the signal to arrive at the receiver
because the clock arrives with it at the same time. For the same reason
there's no clock skew, again because the latching clock is recovered from the
data stream. Finally, signal skew is eliminated within a Lane because there's
only one data bit being sent. The signal skew problem returns if a multi-lane
design is used, but the receiver corrects for this automatically and can fix
a generous amount of skew. Although serial designs overcome many of the
problems of parallel models, they have their own set of complications. Still,
as we'll see later, the solutions are manageable and allow for high-speed,
reliable communication.

Bandwidth. The combination of high speed and wide Links that PCIe supports
can result in some impressive bandwidth numbers, as shown in Table 2-1 on
page 43. These numbers are derived from the bit rate and bus characteristics.
One such characteristic is that, like many other serial transports, the first
two generations of PCIe use an encoding process called 8b/10b that generates
a 10-bit output based on an 8-bit input. In spite of the overhead this
introduces, there are several good reasons for doing it as we'll see later.
For now it's enough to know that sending one byte of data requires
transmitting 10 bits. The first generation (Gen1 or PCIe spec version 1.x)
bit rate is 2.5 GT/s and dividing that by 10 means that one lane will be able
to send 0.25 GB/s. Since the Link permits sending and receiving at the same
time, the aggregate bandwidth can be twice that amount, or 0.5 GB/s per Lane.
Doubling the frequency for the second generation (Gen2 or PCIe 2.x) doubled
the bandwidth. The third generation (Gen3 or PCIe 3.0) doubles the bandwidth
yet again, but this time the spec writers chose not to double the frequency.
Instead, for reasons we'll discuss later, they chose to increase the
frequency only to 8 GT/s and remove the 8b/10b encoding in favor of another
encoding mechanism called 128b/130b encoding (for more on this, see the
chapter "Physical Layer - Logical (Gen3)" on page 407). Table 2-1 summarizes
the bandwidth available for all the current possible combinations and shows
the peak throughput the Link could deliver in that configuration.

Table 2-1: PCIe Aggregate Gen1, Gen2 and Gen3 Bandwidth for Various Link Widths

Link Width                x1    x2    x4    x8    x12   x16   x32
Gen1 Bandwidth (GB/s)     0.5   1     2     4     6     8     16
Gen2 Bandwidth (GB/s)     1     2     4     8     12    16    32
Gen3 Bandwidth (GB/s)     2     4     8     16    24    32    64

PCIe Bandwidth Calculation


To calculate the bandwidth numbers included in the table above, see the
calculations outlined below.

Gen1 PCIe Bandwidth = (2.5 Gb/s x 2 directions) / 10 bits per symbol = 0.5 GB/s.
Gen2 PCIe Bandwidth = (5.0 Gb/s x 2 directions) / 10 bits per symbol = 1.0 GB/s.

Note that in the above calculations, we divide by 10 bits per symbol not 8
bits per byte, because both Gen1 and Gen2 protocols require packet bytes to
be encoded using 8b/10b encoding schemes before packet transmission.


Gen3 PCIe Bandwidth = (8.0 Gb/s x 2 directions) / 8 bits per byte = 2.0 GB/s.
Note that at Gen3 speed, we divide by 8 bits per byte, not by 10 bits per symbol,
because at Gen3 speed, packets are NOT 8b/10b encoded; rather, they are 128b/130b
encoded. There is an additional 2-bit overhead for every 128 bits (about 1.5%), but
it is not large enough to account for in the calculation.
These 3 calculated bandwidth numbers are multiplied by the Link width to give
the total Link bandwidth on multi-Lane Links.
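
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation described in the text: the names GENERATIONS and link_bandwidth_gb_s, and the exact_gen3 option that includes the 128b/130b overhead the text chooses to ignore, are assumptions made for the example.

```python
# Line rate in GT/s and (payload bits, transmitted bits) per encoding block.
GENERATIONS = {
    1: {"rate_gt_s": 2.5, "encoding": (8, 10)},
    2: {"rate_gt_s": 5.0, "encoding": (8, 10)},
    3: {"rate_gt_s": 8.0, "encoding": (128, 130)},
}

def link_bandwidth_gb_s(gen, width=1, exact_gen3=False):
    """Aggregate (both directions) peak bandwidth of a Link in GB/s."""
    rate = GENERATIONS[gen]["rate_gt_s"]
    payload_bits, line_bits = GENERATIONS[gen]["encoding"]
    if gen == 3 and not exact_gen3:
        # Simplification used in the text: ignore the 2-bit overhead per block.
        efficiency = 1.0
    else:
        efficiency = payload_bits / line_bits
    data_gbit_s = rate * efficiency * 2      # x2 for simultaneous send and receive
    return data_gbit_s / 8 * width           # 8 bits per byte, times Link width

# Print one row per generation for the Link widths listed in Table 2-1.
for gen in (1, 2, 3):
    row = [link_bandwidth_gb_s(gen, w) for w in (1, 2, 4, 8, 12, 16, 32)]
    print(f"Gen{gen}:", row)
```

Calling link_bandwidth_gb_s(3, 1, exact_gen3=True) shows how small the ignored overhead is: the result is 256/130 GB/s, slightly under the rounded 2.0 GB/s figure.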

Differential Signals
Each Lane uses differential signaling, sending both a positive and negative version
(D+ and D-) of the same signal as shown in Figure 2-4 on page 44. This doubles
the pin count, of course, but that's offset by two clear advantages over
single-ended signaling that are important for high-speed signals: improved
noise immunity and reduced signal voltage.
The differential receiver gets both signals and subtracts the negative voltage
from the positive one to find the difference between them and determine the
value of the bit. Noise immunity is built into the differential design because the
paired signals are on adjacent pins of each device and their traces must also be
routed very near each other to maintain the proper transmission line impedance.
Consequently, anything that affects one signal will also affect the other by
about the same amount and in the same direction. The receiver is looking at the
difference between them and the noise doesn't really change that difference, so
the result is that most noise affecting the signals doesn't affect the receiver's
ability to accurately distinguish the bits.
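
A small numeric sketch makes the noise-cancellation argument concrete. The voltage swing, the common-mode level, and the noise values below are arbitrary illustrative numbers, not values from the spec, and the function names are invented for this example.

```python
def drive(bit, vcm=0.0, swing=0.4):
    """Return (D+, D-) voltages for one bit; vcm and swing are illustrative."""
    half = swing / 2
    return (vcm + half, vcm - half) if bit else (vcm - half, vcm + half)

def receive(d_plus, d_minus):
    """The receiver looks only at the difference between the two signals."""
    return 1 if d_plus - d_minus > 0 else 0

bits = [1, 0, 1, 1, 0, 0, 1, 0]
noise = [0.9, -1.3, 0.2, 2.5, -0.7, 1.1, -2.0, 0.4]   # volts, coupled onto both wires

# Differential reception: the same noise lands on D+ and D- and cancels out.
recovered = []
for bit, n in zip(bits, noise):
    d_plus, d_minus = drive(bit)
    recovered.append(receive(d_plus + n, d_minus + n))

# Single-ended comparison: only D+ is compared against a fixed 0 V threshold.
single_ended = [1 if drive(bit)[0] + n > 0 else 0 for bit, n in zip(bits, noise)]

print(recovered == bits)      # the differential receiver recovers every bit
print(single_ended == bits)   # the single-ended receiver does not
```

Note that the noise in this sketch is several times larger than the signal swing, yet the subtraction still recovers the data because both wires shift together.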

Figure 2-4: Differential Signaling

[Figure: D+ and D- waveforms swinging between V+ and V- around the common-mode
voltage Vcm. The receiver subtracts the D- value from the D+ value to arrive at
the differential voltage.]

No Common Clock
As mentioned earlier, a common clock is not required for a PCIe Link because it
uses a source-synchronous model, meaning the transmitter supplies the clock to
the receiver to use in latching the incoming data. A PCIe Link does not include a
forwarded clock. Instead, the transmitter embeds the clock into the data stream
using 8b/10b encoding. The receiver then recovers the clock from the data
stream and uses it to latch the incoming data. As mysterious as this might
sound, the process by which this is done is actually fairly straightforward. In
the receiver, a PLL circuit (Phase-Locked Loop, see Figure 2-5 on page 45) takes
the incoming bit stream as a reference clock and compares its timing, or phase,
to that of an output clock that it has created with a specified frequency. Based on
the result of that comparison, the output clock's frequency is increased or
decreased until a match is obtained. At that point the PLL is said to be locked,
and the output (recovered) clock frequency precisely matches the clock that was
used to transmit the data. The PLL continually adjusts the recovered clock, so
changes in temperature or voltage that affect the transmitter clock frequency
will always be quickly compensated.

One thing to note regarding clock recovery is that the PLL does need transitions
on the input in order to make its phase comparison. If a long time goes by without
any transitions in the data, the PLL could begin to drift away from the correct
frequency. To prevent that problem, one of the design goals of 8b/10b
encoding is to ensure no more than 5 consecutive ones or zeroes in a bit stream (to
learn more on this, refer to "8b/10b Encoding" on page 380).
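
The run-length property mentioned above is easy to check in software. The helper below is a hypothetical sketch, and the example bit strings are arbitrary patterns chosen for illustration rather than real 8b/10b code words.

```python
def longest_run(bits):
    """Length of the longest run of identical consecutive bits in a string."""
    longest = current = 0
    previous = None
    for bit in bits:
        current = current + 1 if bit == previous else 1
        previous = bit
        longest = max(longest, current)
    return longest

def enough_transitions(bits, limit=5):
    """True if no run of ones or zeroes exceeds the limit stated in the text."""
    return longest_run(bits) <= limit

print(enough_transitions("0101111100"))   # longest run is 5 ones
print(enough_transitions("1000000011"))   # contains 7 zeroes in a row
```

A stream that fails this check would leave the PLL without an edge to compare against for too long, which is exactly the drift problem the encoding is designed to avoid.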

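Figure 2-5: Simple PLL Block Diagram

[Figure: the Reference (incoming bit stream) feeds a Phase Detector, followed by
a Loop Filter and a Voltage-Controlled Oscillator that produces the Recovered
Clock. A Divide by N Counter (to create multiples of the reference frequency)
feeds the output back to the Phase Detector.]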

Once the clock has been recovered it's used to latch the bits of the incoming data
stream into the deserializer. Sometimes students wonder whether this recovered
clock can be used to clock all the logic in the receiver, but it turns out that
the answer is no. One reason is that a receiver can't count on this reference
always being present, because low-power states on the Link involve stopping
data transmission. Consequently, the receiver must also have its own internal
clock that can be locally generated.

Packet-based Protocol
Moving from a parallel to a serial transport greatly reduces the pins needed to
carry data. PCIe, like most other serial-based protocols, also reduces pin count
by eliminating most sideband control signals typically found in parallel buses.
However, if there are no control signals indicating the type of information being
received, how can the receiver interpret the incoming bits? All transactions in
PCIe are sent in defined structures called packets. The receiver finds the packet
boundaries and, knowing the pattern to expect, decodes the packet structure to
determine what it should do.

The details of the packet-based protocol are covered in the chapter called "TLP
Elements" on page 169, but an overview of the various packet types and their
uses can be found in this chapter; see "Data Link Layer" on page 72.

Links and Lanes


As mentioned earlier, a physical connection between two PCIe devices is called
a Link and is made up of one or more Lanes. Each Lane consists of a differential
send and receive signal pair, as shown in Figure 2-2 on page 40. One Lane is
sufficient for all communications between devices and no other signals are
required.

Scalable Performance
However, using more Lanes will increase the performance of a Link, which
depends on its speed and Link width. For example, using multiple Lanes
increases the number of bits that can be sent with each clock and thus improves
the bandwidth. As noted earlier in Table 2-1 on page 43, the number of Lanes
supported by the spec includes powers of 2 up to 32 Lanes. A x12 Link is also
supported, which may have been intended to support the x12 Link width used
by InfiniBand, an earlier serial design. Allowing a variety of Link widths permits
a platform designer to make the appropriate trade-off between cost and
performance, easily scaling up or down based on the number of Lanes.

Flexible Topology Options


A Link must be a point-to-point connection, rather than a shared bus like PCI,
because of the very high speeds it uses. Since a Link can therefore only connect
two interfaces, a means for fanning out the connections is needed for building a
non-trivial system. This is accomplished in PCIe with the use of Switches and
Bridges, which allow flexibility in constructing the system topology, that is, the
set of connections between the elements in the system. Definitions of the elements
in a system and some topology examples are given in the following section.

Some Definitions
A simple PCIe topology example is shown in Figure 2-6 on page 47, and will
help illustrate some definitions at this point.

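Figure 2-6: Example PCIe Topology

[Figure: a CPU sits above the Root Complex, which connects to Memory and,
through separate PCIe Links, to a Switch, a PCIe Endpoint, and a Bridge to PCI
or PCI-X. PCIe Endpoints, a Legacy Endpoint, and PCI/PCI-X devices sit at the
bottom of the tree.]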

Topology Characteristics

At the top of the diagram is a CPU. The point to make here is that the CPU is
considered the top of the PCIe hierarchy. Just like PCI, only simple tree structures
are permitted for PCIe, meaning no loops or other complex topologies are
allowed. That's done to maintain backward compatibility with PCI software,
which used a simple configuration scheme to track the topology and did not
support complex environments.

To maintain that compatibility, software must be able to generate configuration
cycles in the same way as before and the bus topology must appear the same as
it did before. Consequently, all the configuration registers software expects to
find are still there and behave in the same way they always have. We'll come
back to this discussion a little later, after we've had a chance to define some
more terms.

Root Complex
The interface between the CPU and the PCIe buses may contain several components
(processor interface, DRAM interface, etc.) and possibly even several
chips. Collectively, this group is referred to as the Root Complex (RC or Root).
The RC resides at the "root" of the PCI inverted tree topology and acts on behalf
of the CPU to communicate with the rest of the system. The spec does not carefully
define it, though, giving instead a list of required and optional functionality.
In broad terms, the Root Complex can be understood as the interface
between the system CPU and the PCIe topology, with PCIe Ports labeled as
"Root Ports" in configuration space.

Switches and Bridges


Switches provide a fan-out or aggregation capability and allow more devices to
be attached to a single PCIe Port. They act as packet routers and recognize
which path a given packet will need to take based on its address or other routing
information.

Bridges provide an interface to other buses, such as PCI or PCI-X, or even
another PCIe bus. The bridge shown in the "Example PCIe Topology" on
page 47 is sometimes called a "forward bridge" and allows an older PCI or PCI-X
card to be plugged into a new system. The opposite type, or "reverse bridge",
allows a new PCIe card to be plugged into an old PCI system.

Native PCIe Endpoints and Legacy PCIe Endpoints


Endpoints are devices in a PCIe topology that are not Switches or Bridges and
act as initiators and Completers of transactions on the bus. They reside at the
bottom of the branches of the tree topology and only implement a single
Upstream Port (facing toward the Root). By comparison, a Switch may have
several Downstream Ports but can only have one Upstream Port. Devices that
were designed for the operation of an older bus like PCI-X but now have a PCIe
interface designate themselves as "Legacy PCIe Endpoints" in a configuration
register, and this topology includes one. They make use of things that are prohibited
in newer PCIe designs, such as IO space and support for IO transactions
or Locked requests. In contrast, "Native PCIe Endpoints" are PCIe
devices designed from scratch, as opposed to adding a PCIe interface to old PCI
device designs. Native PCIe Endpoints are memory-mapped devices
(MMIO devices).

Software Compatibility Characteristics


One way compatibility with older software is maintained is that the configuration
headers for Endpoints and Bridges, shown in Figure 2-7 on page 50, are
unchanged from PCI. One difference now is that bridges are often aggregated
into Switches and Roots, but legacy software is unaware of that distinction and
will still simply see them as bridges. At this point we just want to get familiar
with the concepts, so we won't get into the details of the registers here. An
introduction to the rather large topic of configuration can be found in "Configuration
Overview" on page 85.

Figure 2-7: Configuration Headers

[Figure: each function has a 256-byte configuration space, made up of the 64-byte
PCI configuration header space followed by the 192-byte function-specific
configuration header space. The two header layouts are listed below by
doubleword (DW), with the highest-order field first.]

Header Type 0 (used by endpoints)

DW   Fields
00   Device ID | Vendor ID
01   Status Register | Command Register
02   Class Code | Revision ID
03   BIST | Header Type | Latency Timer | Cache Line Size
04   Base Address 0
05   Base Address 1
06   Base Address 2
07   Base Address 3
08   Base Address 4
09   Base Address 5
10   CardBus CIS Pointer
11   Subsystem ID | Subsystem Vendor ID
12   Expansion ROM Base Address
13   Reserved | Capabilities Pointer
14   Reserved
15   Max_Lat | Min_Gnt | Interrupt Pin | Interrupt Line

Header Type 1 (used by bridges)

DW   Fields
00   Device ID | Vendor ID
01   Status Register | Command Register
02   Class Code | Revision ID
03   BIST | Header Type | Latency Timer | Cache Line Size
04   Base Address 0
05   Base Address 1
06   Secondary Latency Timer | Subordinate Bus Number | Secondary Bus Number | Primary Bus Number
07   Secondary Status | I/O Limit | I/O Base
08   Memory Limit | Memory Base
09   Prefetchable Memory Limit | Prefetchable Memory Base
10   Prefetchable Base Upper 32 Bits
11   Prefetchable Limit Upper 32 Bits
12   I/O Limit Upper 16 Bits | I/O Base Upper 16 Bits
13   Reserved | Capability Pointer
14   Expansion ROM Base Address
15   Bridge Control | Interrupt Pin | Interrupt Line
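
As a rough illustration of how software tells the two header layouts apart, the sketch below decodes a handful of fields from a 64-byte header image. The byte offsets follow the standard PCI header layout summarized in Figure 2-7; the function name and the sample ID values are hypothetical, and a real driver would read these registers through the platform's configuration access mechanism rather than from a byte array.

```python
import struct

def parse_config_header(cfg):
    """Decode a few fields of the 64-byte PCI-compatible configuration header."""
    vendor_id, device_id = struct.unpack_from("<HH", cfg, 0x00)   # DW 00
    # DW 03, byte offset 0x0E: bit 7 flags a multi-function device,
    # bits 6:0 select the header layout (0 = endpoint, 1 = bridge).
    header_type = cfg[0x0E] & 0x7F
    fields = {"vendor_id": vendor_id, "device_id": device_id,
              "header_type": header_type}
    if header_type == 1:
        # DW 06 of a Type 1 header holds the bridge's bus numbers.
        primary, secondary, subordinate = struct.unpack_from("<BBB", cfg, 0x18)
        fields.update(primary_bus=primary, secondary_bus=secondary,
                      subordinate_bus=subordinate)
    return fields

# Build a made-up bridge header for illustration (IDs are not real devices).
cfg = bytearray(64)
struct.pack_into("<HH", cfg, 0x00, 0x1234, 0x5678)
cfg[0x0E] = 0x01                    # Header Type 1
cfg[0x18:0x1B] = bytes([0, 1, 5])   # primary, secondary, subordinate bus
bridge = parse_config_header(cfg)
print(bridge)
```

Because these offsets are unchanged from PCI, the same decoding works whether the bridge is a discrete PCI-to-PCI bridge or a virtual one inside a Root or Switch.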

To illustrate the way the system appears to software, consider the example
topology shown in Figure 2-8 on page 51. As before, the Root resides at the top
of the hierarchy. The Root can be quite complex internally, but it will usually
implement an internal bus structure and several bridges to fan out the topology
to several ports. That internal bus will appear to configuration software as PCI
bus number zero and the PCIe Ports will appear as PCI-to-PCI bridges. This
internal structure is not likely to be an actual PCI bus, but it will appear that
way to software for this purpose. Since this bus is internal to the Root, its actual
logical design doesn't have to conform to any standard and can be vendor-specific.

Figure 2-8: Topology Example

[Figure: the CPU connects through a Host Bridge to Internal Bus 0 inside the
Root Complex, which also connects to Memory. Three PCI-PCI Bridges on Internal
Bus 0 lead to a PCIe Endpoint, a PCIe Switch, and a Bridge to PCI or PCI-X.
Below them are PCIe Endpoints, a Legacy Endpoint, and PCI/PCI-X.]

In a similar way, the internal organization of a Switch, shown in Figure 2-9 on
page 52, will appear to software as simply a collection of bridges sharing a common
bus. A major advantage of this approach is that it allows transaction routing
to take place in the same way it did for PCI. Enumeration, the process by
which configuration software discovers the system topology and assigns bus
numbers and system resources, works the same way, too. We'll see some examples
of how enumeration works later, but once it's been completed the bus numbers
in the system will have all been assigned in a manner like that shown in
Figure 2-9 on page 52.
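
The bus numbering that results can be reproduced with a simple depth-first walk. The sketch below is a simplified model, not actual enumeration software: the Bridge and Endpoint classes and the topology built here are assumptions modeled loosely on Figure 2-9, and real enumeration also involves probing for devices and temporarily programming registers while the scan is in progress.

```python
class Endpoint:
    def __init__(self, name):
        self.name = name

class Bridge:
    """A PCI-to-PCI bridge as configuration software sees it."""
    def __init__(self, name, downstream=()):
        self.name = name
        self.downstream = list(downstream)   # devices on the secondary bus
        self.primary = self.secondary = self.subordinate = None

def enumerate_bus(devices, bus, next_bus):
    """Depth-first bus numbering; returns the next unused bus number."""
    for dev in devices:
        if isinstance(dev, Bridge):
            dev.primary = bus
            dev.secondary = next_bus
            next_bus = enumerate_bus(dev.downstream, dev.secondary, next_bus + 1)
            dev.subordinate = next_bus - 1   # highest bus number below this bridge
    return next_bus

# A Switch appears as one upstream bridge plus one bridge per Downstream Port.
switch = Bridge("Switch Upstream Port", [
    Bridge("Switch Downstream Port A", [Endpoint("PCIe Endpoint")]),
    Bridge("Switch Downstream Port B", [Endpoint("PCIe Endpoint")]),
    Bridge("Switch Downstream Port C", [Endpoint("Legacy Endpoint")]),
])
pci_bridge = Bridge("Bridge to PCI or PCI-X", [Endpoint("PCI/PCI-X device")])
root_ports = [
    Bridge("Root Port 1", [switch]),
    Bridge("Root Port 2", [Endpoint("PCIe Endpoint")]),
    Bridge("Root Port 3", [pci_bridge]),
]

enumerate_bus(root_ports, bus=0, next_bus=1)   # bus 0 is the Root's internal bus
for port in root_ports:
    print(port.name, port.primary, port.secondary, port.subordinate)
```

Each bridge ends up with a secondary bus number (the bus directly below it) and a subordinate bus number (the highest bus anywhere below it), which is what lets a bridge decide whether a configuration request should be forwarded downstream.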

Figure 2-9: Example Results of System Enumeration

[Figure: the topology of Figure 2-8 after enumeration. The Root Complex internal
bus is bus 0. Bus 1 leads to the PCIe Switch, whose internal bus is Bus 2 and
whose PCI-PCI Bridges lead to Bus 3, Bus 4, and Bus 5 (PCIe Endpoints and a
Legacy Endpoint). Bus 6 leads to a PCIe Endpoint, and Bus 7 leads to the Bridge
to PCI or PCI-X, whose PCI/PCI-X bus is Bus 8. A legend distinguishes
Downstream Ports from Upstream Ports.]

System Examples
Figure 2-10 on page 53 illustrates an example of a PCIe-based system designed
for a low-cost application like a consumer desktop machine. A few PCIe Ports
are implemented, along with a few add-in card slots, but the basic architecture
doesn't differ much from the old-style PCI system.

By contrast, the high-end server system shown in Figure 2-11 on page 54 shows
other networking interfaces built into the system. In the early days of PCIe some
thought was given to making it capable of operating as a network that could
replace those older models. After all, if PCIe is basically a simplified version of
other networking protocols, couldn't it fill all the needs? For a variety of reasons,
this concept never really achieved much momentum and PCIe-based systems
still generally connect to external networks using other transports.

This also gives us an opportunity to revisit the question of what constitutes the
Root Complex. In this example, the block labeled as "Intel Processor" contains a
number of components, as is true of most modern CPU architectures. This one
includes a x16 PCIe Port for access to graphics, and 2 DRAM channels, which
means the memory controller and some routing logic have been integrated into
the CPU package. Collectively, these resources are often called the "Uncore"
logic to distinguish them from the several CPU cores and their associated logic
in the package. Since we previously described the Root as being the interface
between the CPU and the PCIe topology, that means that part of the Root must
be inside the CPU package. As shown by the dashed line in Figure 2-11 on page
54, the Root here consists of part of several packages. This will likely be the case
for many future system designs.

Figure 2-10: Low-Cost PCIe System

[Figure: an Intel Processor with a PCIe graphics (GFX) Port and two DDR3
channels connects over DMI (very similar to PCIe) to a P55 PCH (Ibex Peak). The
PCH provides Serial ATA (HDD), USB 2.0, HiDef Audio, Video, SPI (BIOS),
Gb Ethernet, and PCIe ports leading to add-in slots.]

Figure 2-11: Server PCIe System

[Figure: an Intel Processor whose Uncore provides a PCIe graphics (GFX) Port
and two DDR3 channels connects over QPI to an IOH; a dashed outline marks the
Root Complex. Below the IOH, Switches fan out to Endpoints that include 10 Gb
Ethernet, Gb Ethernet, LAN, Fibre Channel, SAS/SATA RAID, IEEE 1394, add-in
slots, and a PCI Express-to-PCI bridge leading to PCI slots.]

Introduction to Device Layers


PCIe defines a layered architecture as illustrated in Figure 2-12 on page 56. The
layers can be considered as being logically split into two parts that operate independently
because they each have a transmit side for outbound traffic and a
receive side for inbound traffic. The layered approach has some advantages for
hardware designers because, if the logic is partitioned carefully, it can be easier
to migrate to new versions of the spec by changing one layer of an existing
design while leaving the others unaffected. Even so, it's important to note that
the layers simply define interface responsibilities and a design is not required to
be partitioned according to the layers to be compliant with the spec. The goal in
this section is to describe the responsibilities of each layer and the flow of events
involved in accomplishing a data transfer.

The device layers as shown in Figure 2-12 on page 56 consist of:

• Device core and interface to Transaction Layer. The core implements the
  main functionality of the device. If the device is an endpoint, it may consist
  of up to 8 functions, each function implementing its own configuration
  space. If the device is a switch, the switch core consists of packet routing
  logic and an internal bus for accomplishing this goal. If the device is a root,
  the root core implements a virtual PCI bus 0 on which reside all the chipset
  embedded endpoints and virtual bridges.
• Transaction Layer. This layer is responsible for Transaction Layer Packet
  (TLP) creation on the transmit side and TLP decoding on the receive side.
  This layer is also responsible for Quality of Service functionality, Flow Control
  functionality and Transaction Ordering functionality. All four of these
  Transaction Layer functions are described in Part Two of the book.
• Data Link Layer. This layer is responsible for Data Link Layer Packet
  (DLLP) creation on the transmit side and decoding on the receive side. This
  layer is also responsible for Link error detection and correction. This Data
  Link Layer function is referred to as the Ack/Nak protocol. Both of these Data
  Link Layer functions are described in Part Three of the book.
• Physical Layer. This layer is responsible for Ordered-Set packet creation on
  the transmit side and Ordered-Set packet decoding on the receive side. This
  layer processes all three types of packets (TLPs, DLLPs and Ordered-Sets)
  to be transmitted on the Link and processes all types of packets received
  from the Link. Packets are processed on the transmit side by byte striping
  logic, scramblers, 8b/10b encoders (associated with Gen1/Gen2 protocol) or
  128b/130b encoders (associated with Gen3 protocol) and packet serializers.
  The packet is finally differentially clocked out on all Lanes at the trained
  Link speed. On the receive side of the Physical Layer, packet processing consists
  of serially receiving differentially encoded bits, converting them to digital
  format, and then deserializing the incoming bit stream. This is done at a clock
  rate derived from a recovered clock from the CDR (Clock and Data Recovery)
  circuit. The received packets are processed by elastic buffers, 8b/10b
  decoders (associated with Gen1/Gen2 protocol) or 128b/130b decoders
  (associated with Gen3 protocol), de-scramblers and byte un-striping logic.
  Finally, the Link Training and Status State Machine (LTSSM) of the Physical
  Layer is responsible for Link Initialization and Training. All these Physical
  Layer functions are described in Part Four of the book.

Figure 2-12: PCI Express Device Layers

[Figure: PCIe Device A and PCIe Device B each contain a Device Core above a
PCIe Core. The PCIe Core consists of the Hardware/Software Interface, the
Transaction Layer, the Data Link Layer, and the Physical Layer, with receive (RX)
and transmit (TX) sides. The two devices are connected by the Link.]

Every PCIe interface supports the functionality of these layers, including Switch
Ports, as shown in Figure 2-13 on page 57. A question often came up in earlier
classes as to whether a Switch Port needs to implement all the layers, since it's
typically only forwarding packets. The answer is yes, and the reason is that
evaluating the contents of packets to determine their routing requires looking
into the internal details of a packet, and that takes place in the Transaction Layer
logic.

Figure 2-13: Switch Port Layers

[Figure: each Port of a Switch implements the Transaction Layer, Data Link
Layer, and Physical Layer, and connects internally to the Switch Core.]

In principle, each layer communicates with the corresponding layer in the
device on the other end of the Link. The upper two layers do so by organizing a
string of bits into a packet, creating a pattern that is recognizable by the corresponding
layer in the receiver. The packets are forwarded through the other layers
along the way to get to or from the Link. The Physical Layer also
communicates directly with that layer in the other device, but it does so differently.

Before we go deeper, let's first walk through an overview to see how the layers
interact. In broad terms, the contents of an outgoing request or completion
packet from the device are assembled in the Transaction Layer based on information
presented by the device core logic, which we also sometimes call the
Software Layer (although the spec doesn't use that term). That information
would usually include the type of command desired, the address of the target
device, attributes of the request, and so on. The newly created packet is then
stored in a buffer called a Virtual Channel until it's ready for passing to the next
layer. When the packet is passed down to the Data Link Layer, additional information
is added to the packet for error checking at the neighboring receiver,
and a copy is stored locally so we can send it again if a transmission error
occurs. When the packet arrives at the Physical Layer it's encoded and transmitted
differentially using all the available Lanes of the Link.

Figure 2-14: Detailed Block Diagram of PCI Express Device's Layers

[Figure: transmit and receive paths through the layers of one Port.
Software Layer: sends and receives address, transaction type, and data for
Memory, I/O, Configuration Requests, Message Requests or Completions.
Transaction Layer: builds or decodes the Transaction Layer Packet (Header, Data
Payload, ECRC); contains the transmit and receive buffers (VCs), Flow Control,
Virtual Channel Management, VC Arbitration, and Ordering.
Data Link Layer: builds or checks the Link Packet (Sequence, TLP, LCRC) and
DLLPs such as Ack/Nak with their CRC; contains the TLP Retry Buffer, Mux and
De-mux, and TLP Error Check.
Physical Layer: builds or strips the Physical Packet (Start, Link Packet, End);
contains Encode and Decode, Parallel-to-Serial and Serial-to-Parallel, the
Differential Driver and Differential Receiver, and Link Training, and connects
to the Link.]

The receiver decodes the incoming bits in the Physical Layer, checks for errors
that can be seen at this level and, if there are none, forwards the resulting packet
up to the Data Link Layer. Here the packet is checked for different errors and, if
there are no errors, is forwarded up to the Transaction Layer. The packet is buffered,
checked for errors, and disassembled into the original information (command,
attributes, etc.) so the contents can be delivered to the device core of the
receiver. Next, let's explore in greater depth what each of the layers must do to
make this process work, using Figure 2-14 on page 58. We start at the top.

Device Core / Software Layer


This is the core functionality of the device, such as a network interface or hard
drive controller. This isn't defined as a layer in the PCIe spec, but can be thought
of in that way since it resides above the Transaction Layer and will be either the
source or destination of all Requests. It provides the transmit side of the Transaction
Layer with requests that include information like the transaction type,
the address, amount of data to transfer, and so on. It's also the destination for
information forwarded up from the Transaction Layer when incoming packets
have been received.

Transaction Layer
In response to requests from the Software Layer, the Transaction Layer generates
outbound packets. It also examines inbound packets and forwards the
information contained in them up to the Software Layer. It supports the split
transaction protocol for non-posted transactions and associates an inbound
Completion with an outbound non-posted Request that was transmitted earlier.
The transactions handled by this layer use TLPs (Transaction Layer Packets) and
can be grouped into four request categories:
1. Memory
2. IO
3. Configuration
4. Messages
The first three of these were already supported in PCI and PCI-X, but messages
are a new type for PCIe. A Transaction is defined as the combination of a
Request packet that delivers a command to a targeted device, together with
any Completion packets the target sends back in reply. A list of the request
types is given in Table 2-2 on page 59.

Table 2-2: PCI Express Request Types

Request Type                                 Non-Posted or Posted
Memory Read                                  Non-Posted
Memory Write                                 Posted
Memory Read Lock                             Non-Posted
IO Read                                      Non-Posted
IO Write                                     Non-Posted
Configuration Read (Type 0 and Type 1)       Non-Posted
Configuration Write (Type 0 and Type 1)      Non-Posted
Message                                      Posted

The requests also fall into one of two categories as shown in the right column of
the table: non-posted and posted. For non-posted requests, a Requester sends a
packet for which a Completer should generate a response in the form of a Completion
packet. The reader may recognize this as the split transaction protocol
inherited from PCI-X. For example, any read request will be non-posted
because the requested data will need to be returned in a completion. Perhaps
unexpectedly, IO writes and Configuration writes are also non-posted. Even
though they are delivering the data for the command, these requests still expect
to receive a completion from the target to confirm that the write data has in fact
made it to the destination without error.

In contrast, Memory Writes and Messages are posted, meaning the targeted
device does not return a completion TLP to the Requester. Posted transactions
improve performance because the Requester doesn't have to wait for a reply or
incur the overhead of a completion. The trade-off is that they get no feedback
about whether the write has finished or encountered an error. This behavior is
inherited from PCI and is still considered a good thing to do because the likelihood
of a failure is small and the performance gain is significant. Note that,
even though they don't require Completions, Posted Writes do still participate
in the Ack/Nak protocol in the Data Link Layer that ensures reliable packet
delivery. For more on this, see Chapter 10, entitled "Ack/Nak Protocol," on page
317.
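
The posted and non-posted classification lends itself to a simple lookup. The sketch below just restates Table 2-2 as data; the dictionary and function names are invented for illustration.

```python
# Request types and their classification, copied from Table 2-2.
REQUEST_TYPES = {
    "Memory Read": "Non-Posted",
    "Memory Write": "Posted",
    "Memory Read Lock": "Non-Posted",
    "IO Read": "Non-Posted",
    "IO Write": "Non-Posted",
    "Configuration Read": "Non-Posted",
    "Configuration Write": "Non-Posted",
    "Message": "Posted",
}

def expects_completion(request_type):
    """A Requester waits for a Completion only for non-posted requests."""
    return REQUEST_TYPES[request_type] == "Non-Posted"

posted = [name for name in REQUEST_TYPES if not expects_completion(name)]
print(posted)   # the only request types that never receive a Completion
```

Listing the posted types this way highlights how short the list is: every request other than a Memory Write or a Message waits for a Completion.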

TLP (Transaction Layer Packet) Basics


A list of all of the PCIe request and completion packet types is given in Table 2-3
on page 61.

Table 2-3: PCI Express TLP Types

TLP Packet Types                                                     Abbreviated Name
Memory Read Request                                                  MRd
Memory Read Request - Locked access                                  MRdLk
Memory Write Request                                                 MWr
IO Read                                                              IORd
IO Write                                                             IOWr
Configuration Read (Type 0 and Type 1)                               CfgRd0, CfgRd1
Configuration Write (Type 0 and Type 1)                              CfgWr0, CfgWr1
Message Request without Data                                         Msg
Message Request with Data                                            MsgD
Completion without Data                                              Cpl
Completion with Data                                                 CplD
Completion without Data associated with Locked Memory Read Requests  CplLk
Completion with Data associated with Locked Memory Read Requests     CplDLk

TLPs originate at the Transaction Layer of a transmitter and terminate at the
Transaction Layer of a receiver, as shown in Figure 2-15 on page 62. The Data
Link Layer and Physical Layer add parts to the packet as it moves through the
layers of the transmitter, and then verify at the receiver that those parts were
transmitted correctly across the Link.

Figure 2-15: TLP Origin and Destination

[Figure: the same layer stack as Figure 2-12 for PCIe Device A and PCIe Device
B. The TLP is transmitted from the Transaction Layer of one device and received
at the Transaction Layer of the other, passing through the Data Link Layer and
Physical Layer of each device and across the Link.]

TLP Packet Assembly. An illustration of the parts of a finished TLP as it is
sent over the Link is shown in Figure 2-16 on page 63, where it can be seen that
different parts of the packet are added in each of the layers. To make it easier to
recognize how the packet gets constructed, the different parts of the TLP are
color-coded to indicate which layer is responsible for them: red for Transaction
Layer, blue for Data Link Layer, and green for the Physical Layer.

The device core sends the information required to assemble the core section of
the TLP in the Transaction Layer. Every TLP will have a header, although some,
like a read request, won't contain data. An optional End-to-End CRC (ECRC)
field may be calculated and appended to the packet. CRC stands for Cyclic
Redundancy Check (or Code) and is employed by almost all serial architectures
because it's simple to implement and provides very robust
error detection capability. The CRC also detects "burst errors," or strings of
repeated mistaken bits, up to the length of the CRC value (32 bits for PCIe).
Since this type of error is likely to be encountered when sending a long string of
bits, this characteristic is very useful for serial transports. The ECRC field is
passed unchanged through any service points ("service point" usually refers to
a Switch or Root Port that has TLP routing options) between the sender and
receiver of the packet, making it useful for verifying at the destination that there
were no errors anywhere along the way.

For transmission, the core section of the TLP is forwarded to the Data Link
Layer, which is responsible for appending a Sequence Number and another CRC
field called the Link CRC (LCRC). The LCRC is used by the neighboring
receiver to check for errors and report the results of that check back to the transmitter
for every packet sent on that Link. The thoughtful reader may wonder
why the ECRC would be helpful if the mandatory LCRC check already verifies
error-free transmission across the Link. The reason is that there is still a place
where transmission errors aren't checked, and that is within devices that route
packets. A packet arrives and is checked for errors on one port, the routing is
checked, and when it's sent out on another port a new LCRC value is calculated
and added to it. The internal forwarding between ports could encounter an
error that isn't checked as part of the normal PCIe protocol, and that's why
ECRC is helpful.

Finally, the resulting packet is forwarded to the Physical Layer where other
characters are added to the packet to let the receiver know what to expect. For
the first two generations of PCIe, these were control characters added to the
beginning and end of the packet. For the third generation, control characters are
no longer used but other bits are appended to the blocks that give the needed
information about the packets. The packet is then encoded and differentially
transmitted on the Link using all of the available Lanes.

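Figure 2-16: TLP Assembly

[Figure: in bit transmit direction, the finished packet consists of Start,
Sequence Number, Header, Data, ECRC, LCRC, End. The Header, Data, and ECRC are
created by the Transaction Layer, and the information in this core section of
the TLP comes from the Software Layer / Device Core. The Sequence Number and
LCRC are appended by the Data Link Layer. Start and End are appended by the
PHY Layer.]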

TLP Packet Disassembly. When the neighboring receiver sees the incoming
TLP bit stream, it needs to identify and remove the parts that were added to
recover the original information requested by the core logic of the transmitter.
As shown in Figure 2-17 on page 64, the Physical Layer will verify that the
proper Start and End or other characters are present and remove them, forwarding
the remainder of the TLP to the Data Link Layer. This layer first checks
for LCRC and Sequence Number errors. If no errors are found, it removes those
fields from the TLP and forwards it to the Transaction Layer. If the receiver is a
Switch, the packet is evaluated in the Transaction Layer to find the routing
information in the header of the TLP and determine to which port the packet
should be forwarded. Even when it's not the intended destination, a Switch is
allowed to check and report an ECRC error if it finds one. However, it's not
allowed to modify the ECRC, so the targeted device will be able to detect the
ECRC error as well.

The target device can check ECRC errors if it's capable and was enabled. If this
is the target device and there was no error, the ECRC field is removed, leaving
the header and data portion of the packet to be forwarded to the Software Layer.
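
The wrap-and-unwrap sequence described in the last two subsections can be sketched as a round trip. This is a toy model under several stated assumptions: zlib.crc32 stands in for the actual ECRC and LCRC calculations (which follow their own bit-ordering rules in the spec), the START and END byte strings are placeholders for the Gen1/Gen2 framing control characters, the header bytes are arbitrary, and encoding, scrambling, and striping are omitted entirely.

```python
import struct
import zlib

START, END = b"<STP>", b"<END>"   # placeholders for framing control characters

def crc32(data):
    """Stand-in 32-bit CRC (not the exact PCIe ECRC/LCRC algorithm)."""
    return struct.pack(">I", zlib.crc32(data))

# Transmit side: each layer adds its own fields around what it was given.
def transaction_layer_tx(header, payload=b""):
    core = header + payload
    return core + crc32(core)                      # append optional ECRC

def data_link_layer_tx(tlp, seq):
    body = struct.pack(">H", seq & 0x0FFF) + tlp   # 12-bit Sequence Number
    return body + crc32(body)                      # append LCRC

def physical_layer_tx(link_packet):
    return START + link_packet + END               # add framing

# Receive side: each layer checks and strips what its peer added.
def physical_layer_rx(frame):
    if not (frame.startswith(START) and frame.endswith(END)):
        raise ValueError("framing error")
    return frame[len(START):-len(END)]

def data_link_layer_rx(link_packet, expected_seq):
    body, lcrc = link_packet[:-4], link_packet[-4:]
    if crc32(body) != lcrc:
        raise ValueError("LCRC error")
    (seq,) = struct.unpack(">H", body[:2])
    if seq != expected_seq:
        raise ValueError("Sequence Number error")
    return body[2:]

def transaction_layer_rx(tlp):
    core, ecrc = tlp[:-4], tlp[-4:]
    if crc32(core) != ecrc:
        raise ValueError("ECRC error")
    return core

header, payload = b"\x40\x00\x00\x01" + bytes(8), b"\xde\xad\xbe\xef"
frame = physical_layer_tx(data_link_layer_tx(transaction_layer_tx(header, payload), seq=7))
received = transaction_layer_rx(data_link_layer_rx(physical_layer_rx(frame), expected_seq=7))
print(received == header + payload)
```

A Switch in this model would stop after data_link_layer_rx, read the routing information from the header, and then run the transmit functions again on the egress Port with a new Sequence Number and LCRC while leaving the ECRC bytes untouched.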

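Figure 2-17: TLP Disassembly

[Figure: in bit receive direction, the incoming packet consists of Start,
Sequence Number, Header, Data, ECRC, LCRC, End. Start and End are stripped by
the PHY Layer. The Sequence Number and LCRC are stripped by the Data Link
Layer. The ECRC is stripped by the Transaction Layer, and the information in the
core section of the TLP is sent to the Software Layer / Device Core.]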

Non-Posted Transactions

Ordinary Reads. Figure 2-18 on page 65 shows an example of a Memory Read
Request sent from an Endpoint to system memory. A detailed discussion of the
TLP contents can be found in Chapter 5, entitled "TLP Elements," on page 169,
but an important part of any memory read request is the target address. The
address for a memory Request can be 32 or 64 bits, and determines the packet
routing. In this example, the request gets routed through two Switches that
forward it up to the target, which is the Root in this case. When the Root
decodes the request and recognizes that the address in the packet targets
system memory, it fetches the requested data. To return that data to the
Requester, the Transaction Layer of the Root Port creates as many Completions
as are needed to deliver all the requested data to the Requester. The largest
possible data payload for PCIe is 4 KB per packet, but devices are often
designed to use smaller payloads than that, so several completions may be
needed to return a large amount of data.
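
As a rough illustration of why one read can produce several Completions, the sketch below splits a read into payload-sized pieces. The function name and the 256-byte default are assumptions chosen for the example, and the real rules (which also involve address boundaries) are more involved than this simple division.

```python
def split_completions(read_bytes: int, max_payload: int = 256) -> list[int]:
    """Return the payload size of each Completion needed for one read.

    Simplified: the data is cut into max_payload-sized pieces and the last
    piece carries whatever is left over.
    """
    if read_bytes <= 0 or max_payload <= 0:
        raise ValueError("sizes must be positive")
    sizes = []
    remaining = read_bytes
    while remaining > 0:
        chunk = min(remaining, max_payload)
        sizes.append(chunk)
        remaining -= chunk
    return sizes
```

With these assumptions, a 4 KB read served with 256-byte payloads needs 16 Completions, while the same read served with 4 KB payloads needs only one.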

Figure 2-18: Non-Posted Read Example

[Figure: topology with Processor, Root Complex, System Memory, Switches A, B,
and C, and several Endpoints. Step 1: Endpoint (Requester) initiates MRd.
Step 2: Root (Completer) receives MRd. Step 3: Root fetches data, returns CplD.
Step 4: Endpoint receives CplD.]


Those Completion packets also contain routing information to direct them back
to the Requester, and the Requester includes its return address for this
purpose in the original request. This return address is simply the Device ID of
the Requester as it was defined for PCI, which is a combination of three
things: its PCI Bus number in the system, its Device number on that bus, and
its Function number within that device. This Bus, Device, and Function number
information (sometimes abbreviated as BDF) is the routing information that
Completions will use to get back to the original Requester. As was true for
PCI-X, a Requester can have several split transactions in progress at the same
time and must be able to associate incoming completions with the correct
requests. To facilitate that, another value was added to the original request
called a Tag that is unique to each request. The Completer copies this
transaction Tag and uses it in the Completion so the Requester can quickly
identify which Request this Completion is servicing.
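
The sketch below shows one way a Requester might pack its BDF into a 16-bit Requester ID (8-bit Bus, 5-bit Device, 3-bit Function) and use Tags to match Completions to outstanding requests. The class and method names are hypothetical, the 256-entry Tag space is an assumption, and for simplicity each request is retired by a single Completion.

```python
def pack_bdf(bus: int, device: int, function: int) -> int:
    """Pack Bus (8 bits), Device (5 bits), Function (3 bits) into 16 bits."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("BDF field out of range")
    return (bus << 8) | (device << 3) | function


def unpack_bdf(requester_id: int) -> tuple[int, int, int]:
    return (requester_id >> 8) & 0xFF, (requester_id >> 3) & 0x1F, requester_id & 0x7


class Requester:
    TAG_COUNT = 256  # assumed Tag space for this sketch

    def __init__(self, bus: int, device: int, function: int):
        self.requester_id = pack_bdf(bus, device, function)
        self.outstanding = {}  # Tag -> (address, length) of each pending read
        self.next_tag = 0

    def send_read(self, address: int, length: int) -> dict:
        tag = self.next_tag
        if tag in self.outstanding:
            raise RuntimeError("no free Tag; too many reads in flight")
        self.next_tag = (tag + 1) % self.TAG_COUNT
        self.outstanding[tag] = (address, length)
        return {"requester_id": self.requester_id, "tag": tag,
                "address": address, "length": length}

    def receive_completion(self, completion: dict) -> tuple[int, int]:
        """Look up and retire the request that this Completion services."""
        if completion["requester_id"] != self.requester_id:
            raise ValueError("Completion routed to the wrong Requester")
        return self.outstanding.pop(completion["tag"])
```

Because the Completer simply copies the Requester ID and Tag into the Completion, the Requester needs nothing more than a dictionary lookup to find the matching request, even when Completions return out of order.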

Finally, a Completer can also indicate error conditions by setting bits in the
completion status field. That gives the Requester at least a broad idea of what
might have gone wrong. How the Requester handles most of these errors will be
determined by software and is outside the scope of the PCIe spec.

Locked Reads. Locked Memory Reads are intended to support what are called
Atomic Read-Modify-Write operations, a type of uninterruptible transaction that
processors use for tasks like testing and setting a semaphore. While the test
and set is in progress, no other access to the semaphore can take place or a
race condition could develop. To prevent this, processors use a lock indicator
(such as a separate pin on the parallel Front-Side Bus) that prevents other
transactions on the bus until the locked one is finished. What follows here is
just a high-level introduction to the topic. For more information on Locked
transactions, refer to Appendix D, "Locked Transactions," on page 963.

As a bit of history, in the early days of PCI the spec writers anticipated
cases where PCI would actually replace the processor bus. Consequently, support
for things that a processor would need to do on the bus, such as locked
transactions, was included in the PCI spec. However, PCI was only rarely used
this way and, in the end, much of this processor bus support was dropped.
Locked cycles remained, though, to support a few special cases, and PCIe
carries this mechanism forward for legacy support. Perhaps to speed migration
away from its use, new PCIe devices are prohibited from accepting locked
requests; it's only legal for those that self-identify as Legacy Devices. In
the example shown in Figure 2-19 on page 67, a Requester begins the process by
sending a locked request (MRdLk). By definition, such a request is only allowed
to come from the CPU, so in PCIe only a Root Port will ever initiate one of
these.


The locked request is routed through the topology using the target memory
address and eventually reaches the Legacy Endpoint. As the packet makes its way
through each routing device (called a service point) along the way, the Egress
Port for the packet is locked, meaning no other packets will be allowed in that
direction until the path is unlocked.

Figure 2-19: Non-Posted Locked Read Transaction Protocol

[Figure: CPU, Root Complex, Memory, a Switch, a PCIe-to-PCI Bridge, PCIe
Endpoints, and a Legacy Endpoint. MRdLk travels from the Root Complex toward
the Legacy Endpoint, and CplDLk returns to the Root Complex.]

When the Completer receives the packet and decodes its contents, it gathers the
data and creates one or more Locked Completions with data. These Completions
are routed back to the Requester using the Requester ID, and each Egress Port
they pass through is then locked, too.

If the Completer encounters a problem, it returns a locked completion packet
without data (the original read should have resulted in data, so if there isn't
any we know there's been a problem) and the status field will indicate
something about the error. The Requester will understand that to mean that the
lock did not succeed and so the transaction will be cancelled and software will
need to decide what to do next.


IO and Configuration Writes. Figure 2-20 on page 68 illustrates a non-posted IO
write transaction. Like a locked request, an IO cycle can also legally target
only a Legacy Endpoint. The request is routed through the Switches based on the
IO address until it reaches the target Endpoint. When the Completer receives
the request, it accepts the data and returns a single completion packet without
data that confirms reception of the packet. The status field in the completion
would report whether an error had occurred and, if so, the Requester's software
would handle it.

If the completion reports no errors, the Requester knows that the write data
has been successfully delivered and the next step in the sequence of
instructions for that Completer is now permitted. And that really summarizes
the motivation for the non-posted write: unlike a memory write, it's not enough
to know that the data will get to the destination sometime in the future.
Instead, the next step can't logically take place until we know that it has
gotten there. As with locked cycles, non-posted writes can only come from the
processor.

Figure 2-20: Non-Posted Write Transaction Protocol

[Figure: topology with Processor, Root Complex, System Memory, Switches A, B,
and C, several Endpoints, and a Legacy Endpoint. Step 1: Root (Requester)
initiates IOWr. Step 2: Legacy Endpoint (Completer) receives IOWr. Step 3:
Endpoint writes data, returns Cpl. Step 4: Root receives Cpl.]


Posted Writes

Memory Writes. Memory writes are always posted and never receive completions.
Once the request has been sent, the Requester doesn't wait for any feedback
before going on to the next request, and no time or bandwidth is spent
returning a completion. As a result, posted writes are faster and more
efficient than non-posted requests and improve system performance. As shown in
Figure 2-21 on page 69, the packet is routed through the system using its
target memory address to the Completer. Once a Link has successfully sent the
request, that transaction is finished on that Link and it's available for other
packets. Eventually, the Completer accepts the data and the transaction is
truly finished. Of course, one trade-off with this approach is that, since no
Completion packets are sent, there's also no means for reporting errors back to
the Requester. If the Completer encounters an error, it can log it and send a
Message to the Root to inform system software about the error, but the
Requester won't see it.

Figure 2-21: Posted Memory Write Transaction Protocol

[Figure: topology with Processor, Root Complex, DDR SDRAM, Switches A, B, and
C, and several Endpoints. Step 1: Root Complex (Requester) initiates MWr
request. Step 2: Endpoint (Completer) receives MWr.]


Message Writes. Interestingly, unlike the other requests we've looked at so
far, there are several possible routing methods for messages, and a field
within the message indicates which type to use. For example, some messages are
posted write requests that target a specific Completer, others are broadcast
from the Root to all Endpoints, while still others sent from an Endpoint are
automatically routed to the Root. To learn more about the different types of
routing, refer to Chapter 4, entitled "Address Space & Transaction Routing," on
page 121.

Messages are useful in PCIe to help achieve a design goal of lowering the pin
count. They eliminate the need for the sideband signals that PCI used to report
things like interrupts, power management events, and errors because they can
report that information in a packet over the normal data path.

Quality of Service (QoS)


PCIe was designed from its inception to be able to support time-sensitive
transactions for applications like streaming audio or video where data delivery
must be timely in order to be useful. This is referred to as providing Quality
of Service and is accomplished by the addition of a few things. First, each
packet is assigned a priority by software by setting a 3-bit field within it
called Traffic Class (TC). Generally speaking, assigning a higher-numbered TC
to a packet is expected to give it a higher priority in the system. Second,
multiple buffers, called Virtual Channels (VC), are built into the hardware for
each port and a packet is placed into the appropriate buffer based on its TC.
Third, since a port now has multiple buffers with packets available for
transmission at a given time, arbitration logic is needed to select among the
VCs. Finally, Switches must select between competing input ports for access to
the VCs of a given output port. This is called Port Arbitration and can be
hardware assigned or software programmable. All of these hardware pieces must
be in place to allow a system to prioritize packets. If properly programmed and
set up, such a system can even provide guaranteed service for a given path.

To illustrate the concept, consider Figure 2-22 on page 71, in which a video
camera and SCSI device both need to send data to system DRAM. The difference is
that the camera data is time critical; if the transmission path to the target
device is unable to keep up with its bandwidth, frames will get dropped. The
system needs to be able to guarantee a bandwidth that's at least as high as the
camera or the captured video may appear choppy. At the same time, the SCSI data
needs to be delivered without errors, but how long it takes is not as
important. Clearly, then, when both a video data packet and a SCSI packet need
to be sent at the same time, the video traffic should have a higher priority.
QoS refers to the ability of the system to assign different priorities to
packets and route them through the topology with deterministic latencies and
bandwidth. For more detail on QoS, refer to Chapter 7, entitled "Quality of
Service," on page 245.
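
The pieces described above (a TC on each packet, a TC-to-VC mapping, and VC arbitration) can be sketched in a few lines. The `Port` class is hypothetical, and strict priority with the highest-numbered VC winning is assumed purely for illustration; it is only one possible arbitration policy, and Port Arbitration between input ports is not modeled.

```python
from collections import deque


class Port:
    def __init__(self, tc_to_vc: dict[int, int], num_vcs: int):
        self.tc_to_vc = tc_to_vc                         # software-programmed mapping
        self.vcs = [deque() for _ in range(num_vcs)]     # one buffer per VC

    def enqueue(self, packet: str, tc: int) -> None:
        if not 0 <= tc <= 7:                             # TC is a 3-bit field
            raise ValueError("TC must be 0..7")
        self.vcs[self.tc_to_vc[tc]].append(packet)

    def next_packet(self):
        """VC arbitration: assumed strict priority, highest-numbered VC first."""
        for vc in reversed(self.vcs):
            if vc:
                return vc.popleft()
        return None
```

If TC0 through TC3 map to VC0 and TC4 through TC7 map to VC1, a video packet sent with TC7 is transmitted ahead of a SCSI packet sent with TC0, even if the SCSI packet was queued first.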


Figure 2-22: QoS Example

[Figure: Intel Processor (PCIe GFX, Uncore) with System Memory, connected over
QPI to an IOH Root Complex. Below it are Switches and Endpoints, including
10 Gb Ethernet, Gb Ethernet, Fibre Channel, SAS/SATA RAID, IEEE 1394, Add-In
Endpoints, and a PCI Express-to-PCI bridge with PCI slots. Isochronous traffic
and ordinary traffic are marked.]

Transaction Ordering
Within a VC, the packets normally all flow through in the same order in which
they arrived, but there are exceptions to this general rule. PCI Express
protocol inherits the PCI transaction-ordering model, including support for
relaxed-ordering cases added with the PCI-X architecture. These ordering rules
guarantee that packets using the same traffic class will be routed through the
topology in the correct order, preventing potential deadlock or live-lock
conditions. An interesting point to note is that, since ordering rules only
apply within a VC and packets that use different TCs may not get mapped into
the same VC, packets using different TCs are understood by software to have no
ordering relationship. This ordering is maintained in the VCs within the
transaction layer.


Flow Control
A typical protocol used by serial transports is to require that a transmitter
only send a packet to its neighbor if there is sufficient buffer space to
receive it. That cuts down on performance-wasting events on the bus like the
disconnects and retries that PCI allowed and thus removes that class of
problems from the transport. The trade-off is that the receiver must report its
buffer space often enough to avoid unnecessary stalls and that reporting takes
a little bandwidth of its own. In PCIe this reporting is done with DLLPs (Data
Link Layer Packets), as we'll see in the next section. The reason for using
DLLPs rather than TLPs is to avoid a possible deadlock condition that might
occur if TLPs were used, in which a transmitter can't get a buffer size update
because its own receive buffer is full. DLLPs can always be sent and received
regardless of the buffer situation, so that problem is avoided. This flow
control protocol is automatically managed at the hardware level and is
transparent to software.

Figure 2-23: Flow Control Basics

[Figure: a Transmitter sends TLPs into the Receiver's VC Buffer, and the
Receiver returns Flow Control DLLPs reporting the buffer space available.]

As shown in Figure 2-23 on page 72, the Receiver contains the VC Buffers that
hold received TLPs. The Receiver advertises the size of those buffers to the
Transmitter using Flow Control DLLPs. The Transmitter tracks the available
space in the Receiver's VC Buffers and is not allowed to send more packets than
the Receiver can hold. As the Receiver processes the TLPs and removes them from
the buffer, it periodically sends Flow Control Update DLLPs to keep the
Transmitter up to date regarding the available space. To learn more about this,
see Chapter 6, entitled "Flow Control," on page 215.
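
Here is a deliberately simplified sketch of that handshake. The class names are hypothetical and a "credit" here is just one buffer slot per TLP; the actual protocol tracks credits in finer-grained units, which Chapter 6 covers. The model is also synchronous, so no TLP is ever in flight when an update is applied.

```python
class Receiver:
    def __init__(self, buffer_slots: int):
        self.free = buffer_slots
        self.buffer = []

    def advertise(self) -> int:
        """Value carried by a (simplified) Flow Control DLLP."""
        return self.free

    def accept(self, tlp: str) -> None:
        if self.free == 0:
            raise OverflowError("transmitter violated flow control")
        self.buffer.append(tlp)
        self.free -= 1

    def process_one(self) -> str:
        """Consume the oldest TLP, freeing one slot."""
        self.free += 1
        return self.buffer.pop(0)


class Transmitter:
    def __init__(self, credits: int):
        self.credits = credits

    def try_send(self, tlp: str, receiver: Receiver) -> bool:
        if self.credits == 0:
            return False          # stall until an update arrives
        receiver.accept(tlp)
        self.credits -= 1
        return True

    def update(self, credits: int) -> None:
        self.credits = credits    # apply a Flow Control Update
```

With a two-slot Receiver, the third `try_send` returns `False` and the Transmitter stalls; once the Receiver processes a TLP and the update is applied, the stalled TLP can go out.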

Data Link Layer


This logic is responsible for Link management and performs three major
functions: TLP error correction, flow control, and some Link power management.
It accomplishes these by generating DLLPs as shown in Figure 2-24 on page 73.


DLLPs (Data Link Layer Packets)


DLLPs are transferred between Data Link Layers of the two neighboring devices
on a Link. The Transaction Layer is not even aware of these packets, which only
travel between neighboring devices and are not routed anywhere else. They are
small (always just 8 bytes) compared to TLPs, and that's a good thing because
they represent overhead for maintaining Link protocol.

Figure 2-24: DLLP Origin and Destination

[Figure: Device A and Device B, each with a Device Core, Transaction Layer,
Data Link Layer, and Physical Layer (TX and RX). A DLLP (Flow Control, Ack/Nak,
etc.) is created in one Data Link Layer as DLLP Core plus CRC, framed by the
Physical Layer as SDP, DLLP Core, CRC, END, carried across the Link, and
delivered as DLLP Core plus CRC to the other device's Data Link Layer.]

DLLP Assembly. As shown in Figure 2-24 on page 73, a DLLP originates at the
Data Link Layer of the transmitter and is consumed by the Data Link Layer of
the receiver. A 16-bit CRC is added to the DLLP Core to check for errors at the
receiver. The DLLP contents are forwarded to the Physical Layer which appends a
Start and End character to the packet (for the first two generations of PCIe),
and then encodes and differentially transmits it over the Link using all the
available lanes.

DLLP Disassembly. When a DLLP is received by the Physical Layer, the bit stream
is decoded and the Start and End frame characters are removed. The rest of the
packet is forwarded to the Data Link Layer, which checks for CRC errors and
then takes the appropriate action based on the packet. The Data Link Layer is
the destination for the DLLP, so it isn't forwarded up to the Transaction
Layer.


Ack/Nak Protocol
The error correction function, illustrated in Figure 2-25 on page 74, is
provided through a hardware-based automatic retry mechanism. As shown in Figure
2-26 on page 75, an LCRC and Sequence Number are added to each outgoing TLP and
checked at the receiver. The transmitter's Replay Buffer holds a copy of every
TLP that has been sent until receipt at the neighboring device has been
confirmed. That confirmation takes the form of an Ack DLLP (positive
acknowledgement) sent by the Receiver with the Sequence Number of the last good
TLP it has seen. When the Transmitter sees the Ack, it flushes the TLP with
that Sequence Number out of the Replay Buffer, along with all the TLPs that
were sent before the one that was acknowledged.

If the Receiver detects a TLP error, it drops the TLP and returns a Nak to the
Transmitter, which then replays all unacknowledged TLPs in hopes of a better
result the next time. Since detected errors are almost always transient events,
a replay will very often correct the problem. This process is often referred to
as the Ack/Nak protocol.
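
The transmitter side of this mechanism can be sketched as a small class. The name `ReplayBuffer` matches the text, but the methods are hypothetical, and the sketch simplifies the Nak case by replaying everything still held rather than modeling the Sequence Number carried in the Nak itself. The 12-bit Sequence Number wraps at 4096.

```python
from collections import OrderedDict

SEQ_MOD = 4096  # Sequence Numbers are 12 bits wide


class ReplayBuffer:
    def __init__(self):
        self.next_seq = 0
        self.pending = OrderedDict()   # Sequence Number -> TLP, oldest first

    def send(self, tlp: str) -> int:
        """Assign a Sequence Number and keep a copy until it is acknowledged."""
        seq = self.next_seq
        self.pending[seq] = tlp
        self.next_seq = (seq + 1) % SEQ_MOD
        return seq

    def ack(self, seq: int) -> None:
        """Purge the acknowledged TLP and every TLP sent before it."""
        if seq not in self.pending:
            return                     # duplicate or stale Ack: nothing to do
        while self.pending:
            oldest, _ = self.pending.popitem(last=False)
            if oldest == seq:
                break

    def nak(self) -> list:
        """Return all unacknowledged TLPs, in order, for replay."""
        return list(self.pending.items())
```

Note that a single Ack can retire several TLPs at once: after sending TLPs 0, 1, and 2, an Ack for Sequence Number 1 leaves only TLP 2 in the buffer.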

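Figure 2-25: Data Link Layer Replay Mechanism

[Figure: on the transmit (Tx) side of the Data Link Layer, a TLP from the
Transaction Layer gains a Sequence number and LCRC to become a Link Packet, is
copied into the Replay Buffer, and passes through a Mux to the Link along with
Ack/Nak DLLPs. On the receive (Rx) side, a De-mux separates Link Packets from
Ack/Nak DLLPs, and an Error Check is applied before the TLP goes to the
Transaction Layer.]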


Figure 2-26: TLP and DLLP Structure at the Data Link Layer

Transaction Layer Packet (TLP)
  Sequence ID | Header | Data Payload | ECRC | LCRC

DLLP
  DLLP Type, Misc. | CRC

The basic form of a DLLP is also shown in Figure 2-26 on page 75, and consists
of a 4-byte DLLP type field that may include some other information and a
2-byte CRC.

Figure 2-27 on page 76 shows an example of a memory read going across a Switch.
In general, the steps for this case would be as follows:
1. Step 1a: Requester sends a memory read request and saves a copy in its
   Replay Buffer. Switch receives the MRd TLP and checks the LCRC and Sequence
   Number.
   Step 1b: No error is seen, so the Switch returns an Ack DLLP to Requester.
   In response, Requester discards its copy of the TLP from the Replay Buffer.
2. Step 2a: Switch forwards the MRd TLP to the correct Egress Port using memory
   address for its routing and saves a copy in the Egress Port's Replay Buffer.
   The Completer receives the MRd TLP and checks for errors.
   Step 2b: No error is seen, so the Completer returns an Ack DLLP to the
   Switch. Switch Port purges its copy of the MRd TLP from its Replay Buffer.
3. Step 3a: As the final destination of the request, the Completer checks the
   optional ECRC field in MRd TLP. No errors are seen so the request is passed
   to the core logic. Based on the command, the device fetches the requested
   data and returns a Completion with Data TLP (CplD) while saving a copy in
   its Replay Buffer. Switch receives CplD TLP and checks for errors.
   Step 3b: No error is seen, so the Switch returns an Ack DLLP to the
   Completer. Completer discards its copy of the CplD TLP from its Replay
   Buffer.
4. Step 4a: Switch decodes the Requester ID field in CplD TLP and routes the
   packet to the correct Egress Port, saving a copy in the Egress Port's Replay
   Buffer. Requester receives CplD TLP and checks for errors.
   Step 4b: No error is seen, so the Requester returns Ack DLLP to Switch.
   Switch discards its copy of the CplD TLP from its Replay Buffer. Requester
   checks the optional ECRC field and finds no error, so data is passed up to
   the core logic.


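Figure 2-27: Non-Posted Transaction with Ack/Nak Protocol

[Figure: Requester, Switch, and Completer. 1a. Request and 1b. Ack between
Requester and Switch; 2a. Request and 2b. Ack between Switch and Completer;
3a. Completion and 3b. Ack between Completer and Switch; 4a. Completion and
4b. Ack between Switch and Requester.]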

Flow Control
The second major Link Layer function is Flow Control. Following power-up or
Reset, this mechanism is initialized by the Data Link Layer automatically in
hardware and then updated during run-time. An overview of this was already
presented in the section on TLPs so that won't be repeated here. To learn more
about this topic, see Chapter 6, entitled "Flow Control," on page 215.

Power Management
Finally, the Link Layer participates in power management, as well, because
DLLPs are used to communicate the requests and handshakes associated with Link
and system power states. For a detailed discussion on this topic, refer to
Chapter 16, entitled "Power Management," on page 703.

Physical Layer
General
The Physical Layer is the lowest hierarchical layer for PCIe as shown in Figure
2-14 on page 58. Both TLP and DLLP type packets are forwarded down from the
Data Link Layer to the Physical Layer for transmission over the Link and
forwarded up to the Data Link Layer at the Receiver. The spec divides the
Physical Layer discussion into two portions: a logical part and an electrical
part, and we'll preserve that split here as well. The Logical Physical Layer
contains the digital logic associated with preparing the packets for serial
transmission on the Link and reversing that process for inbound packets. The
Electrical Physical Layer is the analog interface of the Physical Layer that
connects to the Link and consists of differential drivers and receivers for
each lane.


Physical Layer - Logical


TLPs and DLLPs from the Data Link Layer are clocked into a buffer in the
Physical Layer, where Start and End characters are added to facilitate
detection of the packet boundaries at the receiver. Since the Start and End
characters appear on both ends of a packet they are also called "framing"
characters. The framing characters are shown appended to a TLP and DLLP in
Figure 2-28 on page 77, which also shows the size of each field.

Figure 2-28: TLP and DLLP Structure at the Physical Layer

Transaction Layer Packet (TLP)
  Start   Sequence   Header    Data Payload   ECRC   LCRC   End
  1B      2B         3-4 DW    0-1024 DW      1DW    1DW    1B

DLLP
  Start   DLLP Type, Misc.   CRC   End
  1B      1DW                2B    1B
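
The field sizes in Figure 2-28 make it easy to estimate how many bytes a TLP occupies on a Gen1 or Gen2 Link before 8b/10b encoding. The helper names below are hypothetical, and the calculation is only a sketch of the per-packet overhead: it ignores DLLP and Ordered Set traffic sharing the Link.

```python
DW = 4  # bytes per dword


def tlp_wire_bytes(header_dw: int, payload_dw: int, ecrc: bool = False) -> int:
    """Bytes for one framed TLP using the field sizes from Figure 2-28."""
    if header_dw not in (3, 4):
        raise ValueError("header is 3 or 4 DW")
    if not 0 <= payload_dw <= 1024:
        raise ValueError("payload is 0 to 1024 DW")
    start_end = 2                       # Start (1B) + End (1B)
    sequence = 2                        # Sequence Number field (2B)
    lcrc = DW                           # LCRC (1 DW)
    optional_ecrc = DW if ecrc else 0   # ECRC (1 DW) when present
    return (start_end + sequence + header_dw * DW
            + payload_dw * DW + optional_ecrc + lcrc)


def payload_efficiency(header_dw: int, payload_dw: int, ecrc: bool = False) -> float:
    """Fraction of the framed packet that is actual data payload."""
    return payload_dw * DW / tlp_wire_bytes(header_dw, payload_dw, ecrc)
```

A read request with a 3 DW header and no payload takes 20 bytes, while a write carrying 64 DW (256 bytes) behind a 3 DW header takes 276 bytes, so roughly 93% of that packet is payload.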

Within this layer, each byte of a packet is split out across all of the lanes
in use for the Link in a process called byte striping. Effectively, each lane
operates as an independent serial path across the Link and their data is all
aggregated back together at the receiver. Each byte is scrambled to reduce
repetitive patterns on the transmission line and reduce EMI (electro-magnetic
interference) seen on the Link. For the first two generations of PCIe (Gen1 and
Gen2 PCIe), the 8-bit characters are encoded into 10-bit symbols using what is
called 8b/10b encoding logic. This encoding adds overhead to the outgoing data
stream, but also adds a number of useful characteristics (for more on this, see
"8b/10b Encoding" on page 380). Gen3 Physical Layer logic, when transmitting at
Gen3 speed, does not encode the packet bytes using 8b/10b encoding. Rather,
another encoding scheme referred to as 128b/130b encoding is employed, and the
scrambled packet bytes are transmitted. The 10b symbols on each Lane (Gen1 and
Gen2) or the packet bytes on each Lane (Gen3) are then serialized and clocked
out differentially on each Lane of the Link at 2.5 GT/s (Gen1), or 5 GT/s
(Gen2) or 8 GT/s (Gen3).
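
To put numbers on the encoding overhead, the sketch below converts each generation's transfer rate into usable bytes per second per lane, per direction. The table and function names are made up for the example, and the result is a raw ceiling: it accounts only for the 8b/10b or 128b/130b encoding, not for packet framing, DLLPs, or Ordered Sets.

```python
# (transfer rate in GT/s, data bits, total bits after encoding)
ENCODING = {
    "Gen1": (2.5, 8, 10),      # 8b/10b
    "Gen2": (5.0, 8, 10),      # 8b/10b
    "Gen3": (8.0, 128, 130),   # 128b/130b
}


def lane_bandwidth_mb_per_s(generation: str) -> float:
    """Usable MB/s per lane per direction after encoding overhead."""
    rate_gt, data_bits, total_bits = ENCODING[generation]
    usable_bits_per_s = rate_gt * 1e9 * data_bits / total_bits
    return usable_bits_per_s / 8 / 1e6


def link_bandwidth_mb_per_s(generation: str, lanes: int) -> float:
    return lanes * lane_bandwidth_mb_per_s(generation)
```

This gives 250 MB/s per lane for Gen1 and 500 MB/s for Gen2, where 20% of the raw bit rate goes to 8b/10b encoding, and about 985 MB/s for Gen3, where 128b/130b costs only about 1.5%. That is why Gen3 nearly doubles Gen2's throughput without doubling the transfer rate.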


Receivers clock in the packet bits at the trained clock speeds as they arrive
on all lanes. If 8b/10b is in use (at Gen1 and Gen2 mode), the serial bit
stream of the packet is converted into 10-bit symbols using a deserializer so
it's ready for 8b/10b decoding. However, before decoding, the symbols pass
through an elastic buffer, a clever device that compensates for the slight
difference in frequency between the internal clocks of two connected devices.
Next, the 10-bit symbol stream is decoded back to the proper 8-bit characters
via an 8b/10b decoder. Gen3 Physical Layer logic, when receiving the serial bit
stream of the packet at Gen3 speed, will convert it into a byte stream using a
deserializer that has established block lock. The byte stream is passed through
an elastic buffer which does clock tolerance compensation. The 8b/10b decoder
stage is skipped given that packets clocked at Gen3 speeds are not 8b/10b
encoded. The 8-bit characters on all lanes are de-scrambled, the bytes from all
the lanes are un-striped back into a single character stream and, finally, the
original data stream from the Transmitter is recovered.
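
Byte striping and un-striping are easy to picture with a short sketch. The function names are hypothetical, and the model assumes simple round-robin distribution of bytes across lanes, with no scrambling, encoding, or lane-to-lane skew.

```python
def stripe(data: bytes, lanes: int) -> list[bytes]:
    """Distribute bytes round-robin: byte 0 to lane 0, byte 1 to lane 1, ..."""
    if lanes < 1:
        raise ValueError("need at least one lane")
    return [data[lane::lanes] for lane in range(lanes)]


def unstripe(lane_data: list[bytes]) -> bytes:
    """Interleave the per-lane byte streams back into the original order."""
    lanes = len(lane_data)
    out = bytearray(sum(len(stream) for stream in lane_data))
    for lane, stream in enumerate(lane_data):
        out[lane::lanes] = stream
    return bytes(out)
```

On a x4 Link, the eight bytes `ABCDEFGH` end up as `AE`, `BF`, `CG`, and `DH` on lanes 0 through 3, and `unstripe` restores the original sequence at the receiver.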

Link Training and Initialization


Another responsibility of the Physical Layer is the initialization and training
process on the Link. In this fully automatic process, several steps are taken
to prepare the Link for normal operation, which involves determining the status
of several optional conditions. For example, the Link width can be from one
lane to 32 lanes, and multiple speeds might be available. The training process
will discover these options and go through a state machine sequence to resolve
the best combination. In that process, several things are checked or
established to ensure proper and optimal operation, such as:

• Link width
• Link data rate
• Lane reversal: Lanes connected in reverse order
• Polarity inversion: Lane polarity connected backward
• Bit lock per Lane: Recovering the transmitter clock
• Symbol lock per Lane: Finding a recognizable position in the bit stream
• Lane-to-Lane de-skew within a multi-Lane Link
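
The outcome of resolving "the best combination" of width and speed can be approximated as picking the largest values both Link partners support. The function below is only a hypothetical illustration of that end result; the real training state machine reaches it through a sequence of states and Ordered Set exchanges that are not modeled here.

```python
def resolve_link(widths_a: set, rates_a: set, widths_b: set, rates_b: set):
    """Return (lanes, GT/s): the widest and fastest settings common to both ports."""
    common_widths = set(widths_a) & set(widths_b)
    common_rates = set(rates_a) & set(rates_b)
    if not common_widths or not common_rates:
        raise ValueError("no common Link configuration")
    return max(common_widths), max(common_rates)
```

For example, a x16 Gen3-capable port connected to a x4 Gen2-capable device would settle on four lanes at 5.0 GT/s.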

Physical Layer - Electrical


The physical sender and receiver on a Link are connected with an AC-coupled
Link as shown in Figure 2-29 on page 79. The term "AC-coupled" simply means
that a capacitor resides physically in the path between the devices and serves
to pass the high-frequency (AC) component of the signal while blocking the
low-frequency (DC) part. Many serial transports use this approach because it
allows the common-mode voltage (the level at which the positive and negative
versions of the signal cross) to be different at the transmitter and receiver,
meaning they're not required to have the same reference voltage. This isn't a
big issue if the two devices are nearby and in the same box, but if they were
in different buildings it would be very difficult for them to have a common
reference voltage that was precisely the same.

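Figure 2-29: Physical Layer Electrical

[Figure: a differential Transmitter drives the + and - lines of the Link
through coupling capacitors (CTX), with transmitter impedance ZTX, receiver
impedance ZRX, and a termination to Vtt (Zvtt). Notes from the figure:]

Transmitter is AC coupled to receiver
DC common-mode impedance is 50 Ohms
Differential impedance is 100 Ohms
Coupling capacitor is between 75-200 nF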

Ordered Sets
The last type of traffic sent between devices uses only the Physical Layers.
Although easily recognized by the receiver, this information is not technically
in the form of a packet because it doesn't have Start and End characters, for
example. Instead, it's organized into what are called Ordered Sets that
originate at the Transmitter's Physical Layer and terminate at the Receiver's
Physical Layer, as shown in Figure 2-30 on page 80. For Gen1 and Gen2 data
rates, an Ordered Set starts with a single COM character followed by three or
more other characters that define the information to be sent. The nomenclature
for the type of characters used in PCIe is discussed in more detail in
"Character Notation" on page 382; for now it's enough to say that the COM
character has characteristics that make it work well for this purpose. Ordered
Sets are always a multiple of 4 bytes in size, and an example is shown in
Figure 2-31 on page 80. In Gen3 mode of operation, the Ordered Set format is
different from the Gen1/Gen2 format described above. The details are covered in
Chapter 14, entitled "Link Initialization & Training," on page 505. Ordered
Sets always terminate at the neighboring device and are not routed through the
PCIe fabric.


Figure 2-30: Ordered Sets Origin and Destination

[Figure: PCIe Device A and PCIe Device B, each with a Device Core, a
Hardware/Software Interface, and a PCIe Core containing the Transaction Layer,
Data Link Layer, and Physical Layer. An Ordered Set is transmitted by one
Physical Layer across the Link and received by the other Physical Layer.]

Ordered Sets are used in the Link Training process, as described in Chapter 14,
entitled "Link Initialization & Training," on page 505. They're also used to
compensate for the slight differences between the internal clocks of the
transmitter and receiver, a process called clock tolerance compensation.
Finally, Ordered Sets are used to indicate entry into or exit from a low-power
state on the Link.

Figure 2-31: Ordered Set Structure

COM Identifier Identifier Identifier


Protocol Review Example


At this point, let's review the overall Link protocol by using an example to
illustrate the steps that take place from the time a Requester initiates a
memory read request until it obtains the requested data from a Completer.

Memory Read Request


For the first part of the discussion, refer to Figure 2-32 on page 81. The
Requester's Device Core or Software Layer sends a request to the Transaction
Layer and includes the following information: 32-bit or 64-bit memory address,
transaction type, amount of data to read calculated in dwords, traffic class,
byte enables, attributes, etc.

Figure 2-32: Memory Read Request Phase

[Figure: layered view of the Requester (Send Memory Read Request) and the
Completer (Receive Memory Read Request). Software layer. Transaction layer: TLP
(Header, ECRC), Flow Control, Virtual Channel Management, Ordering, and
Transmit or Receive Buffers per VC. Data Link layer: Link Packet (Sequence,
TLP, LCRC), Retry Buffer, Ack/Nak DLLP with CRC, and Error Check. Physical
layer: Physical Packet (Start, Link Packet, End), Encode or Decode,
Parallel-to-Serial or Serial-to-Parallel, and Differential Driver or Receiver.
On the Link between the two Ports, the MRd TLP travels from Requester to
Completer and an Ack or Nak returns.]


The Transaction layer uses this information to build an MRd TLP. The details of
the TLP packet format are described later, but for now it's enough to say that
a 3 DW or 4 DW header is created depending on address size (32-bit or 64-bit).
In addition, the Transaction Layer adds the Requester ID (bus#, device#,
function#) to the header so the Completer can use that to return the
completion. The TLP is placed in the appropriate virtual channel buffer to wait
its turn for transmission. Once the TLP has been selected, the Flow Control
logic confirms there is sufficient space available in the neighboring device's
receive buffer (VC), and then the memory read request TLP is sent to the Data
Link Layer.
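
The header-size decision can be sketched as follows. The function and the dictionary it returns are hypothetical stand-ins for a real header, and the sketch assumes the Transaction Layer chooses the 4 DW (64-bit address) format only when the address does not fit in 32 bits.

```python
def build_mrd_header(address: int, length_dw: int, requester_id: int, tag: int) -> dict:
    """Describe (not bit-encode) the header of a memory read request."""
    if not 0 <= address < (1 << 64):
        raise ValueError("address must fit in 64 bits")
    if address % 4 != 0:
        raise ValueError("this sketch expects a dword-aligned address")
    if not 1 <= length_dw <= 1024:
        raise ValueError("length must be 1 to 1024 dwords")
    needs_64_bit = address >= (1 << 32)
    return {
        "type": "MRd",
        "header_dw": 4 if needs_64_bit else 3,
        "address": address,
        "length_dw": length_dw,
        "requester_id": requester_id,   # bus#, device#, function# packed together
        "tag": tag,
    }
```

A read of address 0x1000 therefore gets a 3 DW header, while a read of an address at or above 4 GB, such as 0x1_0000_0000, gets a 4 DW header.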

The Data Link Layer adds a 12-bit Sequence Number and a 32-bit LCRC value to
the packet. A copy of the TLP with Sequence Number and LCRC is stored in the
Replay Buffer and the packet is forwarded to the Physical Layer.

In the Physical Layer the Start and End characters are added to the packet,
which is then byte striped across the available Lanes, scrambled, and 8b/10b
encoded. Finally the bits are serialized on each lane and transmitted
differentially across the Link to the neighbor.

The Completer de-serializes the incoming bit stream back into 10-bit symbols
and passes them through the elastic buffer. The 10-bit symbols are decoded back
to bytes and the bytes from all Lanes are de-scrambled and un-striped. The
Start and End characters are detected and removed. The rest of the TLP is
forwarded up to the Data Link Layer.

The Completer's Data Link Layer checks for LCRC errors in the received TLP and
checks the Sequence Number for missing or out-of-sequence TLPs. If there's no
error, it creates an Ack that contains the same Sequence Number that was used
in the read request. A 16-bit CRC is calculated and appended to the Ack
contents to create a DLLP that is sent back to the Physical Layer which adds
the proper framing symbols and transmits the Ack DLLP to the Requester.

The Requester Physical Layer receives the Ack DLLP, checks and removes the
framing symbols, and forwards it up to the Data Link Layer. If the CRC is
valid, it compares the acknowledged Sequence Number with the Sequence Numbers
of the TLPs stored in the Replay Buffer. The stored memory read request TLP
associated with the Ack received is recognized and that TLP is discarded from
the Replay Buffer. If a Nak DLLP was received by the Requester instead, it
would re-send a copy of the stored memory read request TLP. Since the DLLP only
has meaning to the Data Link Layer, nothing is forwarded to the Transaction
Layer.


In addition to generating the Ack, the Completer's Link Layer also forwards the
TLP up to its Transaction Layer. In the Completer's Transaction Layer, the TLP
is placed in the appropriate VC receive buffer to be processed. An optional
ECRC check can be performed, and if no error is found, the contents of the
header (address, Requester ID, memory read transaction type, amount of data
requested, traffic class, etc.) are forwarded to the Completer's Software
Layer.

Completion with Data


For the second half of this discussion, refer to Figure 2-33 on page 83. To
service the memory read request, the Completer Device Core/Software Layer sends
a completion with data (CplD) request down to its Transaction Layer that
includes the Requester ID and Tag copied from the original memory read request,
transaction type, other parts of the completion header contents and the
requested data.

Figure 2-33: Completion with Data Phase

[Figure: layered view of the Requester (Receive Completion with Data) and the
Completer (Send Completion with Data). Software layer. Transaction layer: TLP
(Header, Data Payload, ECRC), Flow Control, Virtual Channel Management,
Ordering, and Receive or Transmit Buffers per VC. Data Link layer: Link Packet
(Sequence, TLP, LCRC), Retry Buffer, Ack/Nak DLLP with CRC, and Error Check.
Physical layer: Physical Packet (Start, Link Packet, End), Decode or Encode,
Serial-to-Parallel or Parallel-to-Serial, and Differential Receiver or Driver.
On the Link between the two Ports, the CplD TLP travels from Completer to
Requester and an Ack or Nak returns.]


The Transaction layer uses this information to build the CplD TLP, which
always has a 3DW header (it uses ID routing and never needs a 64-bit address).
It also adds its own Completer ID to the header. This packet is also placed
into the appropriate VC transmit buffer and, once selected, the flow control
logic verifies that sufficient space is available at the neighboring device to
receive this packet and, once confirmed, forwards the packet down to the Data
Link Layer.

As before, the Data Link Layer adds a 12-bit Sequence Number and a 32-bit
LCRC to the packet. A copy of the TLP with Sequence Number and LCRC is
stored in the Replay Buffer and the packet is forwarded to the Physical Layer.

As before, the Physical Layer adds a Start and End character to the packet,
byte stripes it across the available lanes, scrambles it, and 8b/10b encodes
it. Finally, the CplD packet is serialized on all lanes and transmitted
differentially across the Link to the neighbor.

The Requester converts the incoming serial bit stream back to 10-bit symbols
and passes them through the elastic buffer. The 10-bit symbols are decoded
back to bytes, descrambled and unstriped. The Start and End characters are
detected and removed and the resultant TLP is sent up to the Data Link Layer.

As before, the Data Link Layer checks for LCRC errors in the received CplD
TLP and checks the Sequence Number for missing or out-of-sequence TLPs. If
there are no errors, it creates an Ack DLLP which contains the same Sequence
Number as the CplD TLP used. A 16-bit CRC is added to the Ack DLLP and it's
sent back to the Physical Layer, which adds the proper framing symbols and
transmits the Ack DLLP to the Completer.

The Completer Physical Layer checks and removes the framing symbols from
the Ack DLLP and sends the remainder up to the Data Link Layer, which checks
the CRC. If there are no errors, it compares the Sequence Number with the
Sequence Numbers for the TLPs stored in the Replay Buffer. The stored CplD
TLP associated with the Ack received is recognized and that TLP is discarded
from the Replay Buffer. If a Nak DLLP was received by the Completer instead,
it would resend a copy of the stored CplD TLP.

In the meantime, the Requester Transaction Layer receives the CplD TLP in the
appropriate virtual channel buffer. Optionally, the Transaction layer can check
for an ECRC error. If there are no errors, it forwards the header contents and
data payload, including the Completion Status, to the Requester Software
Layer, and we're done.


3 Configuration Overview
The Previous Chapter
The previous chapter provides a thorough introduction to the PCI Express
architecture and is intended to serve as an executive-level overview. It
introduces the layered approach to PCIe port design described in the spec. The
various packet types are introduced along with the transaction protocol.

This Chapter
This chapter provides an introduction to configuration in the PCIe
environment. This includes the space in which a Function's configuration
registers are implemented, how a Function is discovered, how configuration
transactions are generated and routed, the difference between PCI-compatible
configuration space and PCIe extended configuration space, and how software
differentiates between an Endpoint and a Bridge.

The Next Chapter


The next chapter describes the purpose and methods of a function requesting
memory or IO address space through Base Address Registers (BARs) and how
software initializes them. The chapter describes how bridge Base/Limit
registers are initialized, thus allowing switches to route TLPs through the
PCIe fabric.

Definition of Bus, Device and Function


Just as in PCI, every PCIe Function is uniquely identified by the Device it
resides within and the Bus to which the Device connects. This unique identifier
is commonly referred to as a "BDF". Configuration software is responsible for
detecting every Bus, Device and Function (BDF) within a given topology. The
following sections discuss the primary BDF characteristics in the context of a
sample PCIe topology. Figure 3-1 on page 87 depicts a PCIe topology that
highlights the Buses, Devices and Functions implemented in a sample system.
Later in this chapter the process of assigning Bus and Device Numbers is
explained.

PCIe Buses
Up to 256 Bus Numbers can be assigned by configuration software. The initial
Bus Number, Bus 0, is typically assigned by hardware to the Root Complex. Bus
0 consists of a Virtual PCI bus with integrated endpoints and Virtual
PCI-to-PCI Bridges (P2P) which are hard-coded with a Device number and Function
number. Each P2P bridge creates a new bus that additional PCIe devices can be
connected to. Each bus must be assigned a unique bus number. Configuration
software begins the process of assigning bus numbers by searching for bridges
starting with Bus 0, Device 0, Function 0. When a bridge is found, software
assigns the new bus a bus number that is unique and larger than the bus number
the bridge lives on. Once the new bus has been assigned a bus number, software
begins looking for bridges on the new bus before continuing scanning for
more bridges on the current bus. This is referred to as a "depth first search"
and is described in detail in "Enumeration - Discovering the Topology" on
page 104.
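
The depth-first numbering described above can be sketched in a few lines of
Python. This is only an illustrative model, not real enumeration software: the
nested-list topology, the function name enumerate_bus and the "pri"/"sec"/"sub"
record keys are all assumptions made for the example, and no hardware registers
are touched. The sample topology is chosen so that the resulting numbers match
the Pri/Sec/Sub values shown in Figure 3-5.

```python
def enumerate_bus(devices, bus_number, bridges):
    """Number the buses below `bus_number` depth first.

    `devices` lists what sits on this bus: the string "EP" for an endpoint,
    or a nested list for a bridge (the list holds the devices on the bus
    that the bridge creates). Returns the highest bus number found at or
    below this bus.
    """
    highest = bus_number
    for dev in devices:
        if isinstance(dev, list):  # found a bridge
            # The new bus gets the next unused number, larger than this bus.
            record = {"pri": bus_number, "sec": highest + 1, "sub": None}
            bridges.append(record)
            # Descend into the new bus before scanning the rest of this one.
            record["sub"] = enumerate_bus(dev, record["sec"], bridges)
            highest = record["sub"]
    return highest


# Hypothetical topology: two root ports on Bus 0, each leading to a switch.
topology = [
    [[["EP"], ["EP"]]],      # root port -> switch with two downstream ports
    [[["EP"], [], ["EP"]]],  # root port -> switch with three downstream ports
]
bridges = []
last_bus = enumerate_bus(topology, 0, bridges)
for b in bridges:
    print(b)
print("highest bus number:", last_bus)
```

Because the recursion happens before the loop moves to the next device, every
bus beneath a bridge is numbered before its sibling bridges are visited, which
is what makes the search depth first.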

PCIe Devices
PCIe permits up to 32 device attachments on a single PCI bus. However, the
point-to-point nature of PCIe means only a single device can be attached
directly to a PCIe link and that device will always end up being Device 0. Root
Complexes and Switches have Virtual PCI buses which do allow multiple
Devices to be "attached" to the bus. Each Device must implement Function 0
and may contain a collection of up to eight Functions. When two or more
Functions are implemented the Device is called a multi-function device.

PCIe Functions
As previously discussed, Functions are designed into every Device. These
Functions may include hard drive interfaces, display controllers, ethernet
controllers, USB controllers, etc. The Functions in a multi-function Device do
not need to be implemented sequentially. For example, a Device might implement
Functions 0, 2, and 7. As a result, when configuration software detects a
multi-function device, each of the possible Functions must be checked to learn
which of them are present. Each Function also has its own configuration
address space that is used to set up the resources associated with the Function.
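
As a rough illustration of why every Function number has to be probed, the
sketch below scans one Device in a simulated configuration space. The
dictionary standing in for hardware, the helper names and the Vendor ID value
1234h are all made up for the example. It also relies on a convention not
stated in this section: a Vendor ID read that returns FFFFh is taken to mean
that no Function responded.

```python
NO_FUNCTION = 0xFFFF  # Vendor ID value read back when nothing responds


def read_vendor_id(config_space, bus, device, function):
    """Stand-in for a real configuration read of dword 0, bytes 0-1."""
    return config_space.get((bus, device, function), NO_FUNCTION)


def present_functions(config_space, bus, device):
    """Return the Function numbers implemented by one Device."""
    if read_vendor_id(config_space, bus, device, 0) == NO_FUNCTION:
        return []  # Function 0 is mandatory, so there is no Device here
    return [fn for fn in range(8)
            if read_vendor_id(config_space, bus, device, fn) != NO_FUNCTION]


# Hypothetical Device at Bus 3, Device 0 implementing Functions 0, 2 and 7.
fake_config_space = {(3, 0, 0): 0x1234, (3, 0, 2): 0x1234, (3, 0, 7): 0x1234}
print(present_functions(fake_config_space, 3, 0))
```

A scan that stopped at the first missing Function would miss Functions 2 and 7
here, which is why all eight numbers are checked.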


Figure 3-1: Example System

[Figure: a CPU attached to a Root Complex whose Host/PCI Bridge creates Bus 0.
Virtual Bus 0 carries two virtual P2P bridges (Dev 0 Func 0 and Dev 1 Func 0)
and an integrated Endpoint (Dev 2 Func 0). The two P2P bridges lead to Bus 1
and Bus 5, each of which attaches a switch whose internal virtual bus is Bus 2
and Bus 6 respectively. The switches' virtual P2P bridges lead to Buses 3, 4,
7, 8 and 10, which attach Dev 0 Endpoints (one shown with Function 0 and
Function 1). Bus 8 attaches a PCI Express-to-PCI Bridge (Dev 0 Func 0) that
creates PCI Bus 9, which holds PCI Devices 1, 2 and 3, each with Func 0.]


Configuration Address Space


The first PCs required users to set switches and jumpers to assign resources
for each card installed and this frequently resulted in conflicting memory, IO
and interrupt settings. The subsequent IO architectures, Extended ISA (EISA)
and the IBM PS/2 systems, were the first to implement plug-and-play
architectures. In these architectures configuration files were shipped with
each plug-in card that allowed system software to assign basic resources. PCI
extended this capability by implementing standardized configuration registers
that permit generic shrink-wrapped OSs to manage virtually all system
resources. Having a standard way to enable error reporting, interrupt delivery,
address mapping and more allows one entity, the configuration software, to
allocate and configure the system resources, which virtually eliminates
resource conflicts.

PCI defines a dedicated block of configuration address space for each Function.
Registers mapped into the configuration space allow software to discover the
existence of a Function, configure it for normal operation and check the status
of the Function. Most of the basic functionality that needs to be standardized
is in the header portion of the configuration register block, but the PCI
architects realized that it would be beneficial to standardize optional
features, called capability structures (e.g. Power Management, Hot Plug, etc.).
The PCI-Compatible configuration space includes 256 bytes for each Function.

PCI-Compatible Space
Refer to Figure 3-2 on page 89 during the following discussion. The 256 bytes
of PCI-compatible configuration space was so named because it was originally
designed for PCI. The first 16 dwords (64 bytes) of this space are the
configuration header (Header Type 0 or Header Type 1). Type 0 headers are
required for every Function except for the bridge functions that use a Type 1
header. The remaining 48 dwords are used for optional registers including PCI
capability structures. For PCIe Functions, some capability structures are
required. For example, PCIe Functions must implement the following Capability
Structures:

• PCI Express Capability
• Power Management
• MSI and/or MSI-X


Figure 3-2: PCI-Compatible Configuration Register Space

[Figure: the 256-byte configuration register space (per Function) consists of
the 64-byte PCI configuration header followed by 192 bytes for capability
structures. The header layouts are listed below by doubleword number, with
fields ordered from byte 3 down to byte 0. The figure's legend marks the
"Required Config Registers".]

Type 0 Header
  Dword  Fields (byte 3 ... byte 0)
  00     Device ID | Vendor ID
  01     Status | Command
  02     Class Code | Revision ID
  03     BIST | Header Type | Latency Timer | Cache Line Size
  04     Base Address 0
  05     Base Address 1
  06     Base Address 2
  07     Base Address 3
  08     Base Address 4
  09     Base Address 5
  10     CardBus CIS Pointer
  11     Subsystem ID | Subsystem Vendor ID
  12     Expansion ROM Base Address
  13     Reserved | Capabilities Pointer
  14     Reserved
  15     Max_Lat | Min_Gnt | Interrupt Pin | Interrupt Line

Type 1 Header
  Dword  Fields (byte 3 ... byte 0)
  00     Device ID | Vendor ID
  01     Status | Command
  02     Class Code | Revision ID
  03     BIST | Header Type | Latency Timer | Cache Line Size
  04     Base Address 0
  05     Base Address 1
  06     Secondary Latency Timer | Subordinate Bus Number |
         Secondary Bus Number | Primary Bus Number
  07     Secondary Status | I/O Limit | I/O Base
  08     Memory Limit | Memory Base
  09     Prefetchable Memory Limit | Prefetchable Memory Base
  10     Prefetchable Base - Upper 32-bits
  11     Prefetchable Limit - Upper 32-bits
  12     I/O Limit Upper 16-bits | I/O Base Upper 16-bits
  13     Reserved | Capabilities Pointer
  14     Expansion ROM Base Address
  15     Bridge Control | Interrupt Pin | Interrupt Line

Extended Configuration Space


Refer to Figure 3-3 on page 90 during this discussion. When PCIe was
introduced, there was not enough room in the original 256-byte configuration
region to contain all the new capability structures needed. So the size of
configuration space was expanded from 256 bytes per function to 4KB, called the
Extended Configuration Space. The 960-dword Extended Configuration area is only
accessible using the Enhanced configuration mechanism and is therefore not
visible to legacy PCI software. It contains additional optional Extended
Capability registers for PCIe such as those listed in Figure 3-3 (not a
complete list).


Figure 3-3: 4KB Configuration Space per PCI Express Function

  Offset range   Size      Contents
  000h-03Fh      16 DWs    PCI Config Header (the figure shows the Type 0
                           layout, dwords 00-15)
  040h-0FFh      48 DWs    PCI Device-specific & New Capability register sets.
                           The PCIe Capability Structure must be implemented
                           in this register space.
  100h-FFFh      960 DWs   PCIe Extended Configuration Register Space.
                           Optional Extended Capability registers implemented
                           in this space, such as:
                           - Advanced Error Reporting
                           - Virtual Channels
                           - Device Serial Number
                           - Power Budgeting

PCI-Compatible space (000h-0FFh) is accessible by legacy PCI software or the
PCIe Enhanced Configuration Access Mechanism. PCIe Extended space (100h-FFFh)
is only accessible by the PCIe Enhanced Configuration Access Mechanism.

Host-to-PCI Bridge Configuration Registers

General
The Host-to-PCI bridge's configuration registers don't have to be accessible
using either of the configuration mechanisms mentioned in the previous
section. Instead, they are typically implemented as device-specific registers
in memory address space, at a location known to the platform firmware. However,
the configuration register layout and usage must adhere to the standard Type 0
template defined by the PCI 2.3 specification.


Only the Root Sends Configuration Requests


The specification states that only the Root Complex is permitted to originate
Configuration Requests. It acts as the system processor's liaison to inject
Requests into the fabric and pass Completions back. The ability to originate
configuration transactions is restricted to the processor through the Root
Complex to avoid the anarchy that could result if any device had the ability to
change the configuration of other devices.

Since only the Root can initiate these requests, they also can only move
downstream, which means that peer-to-peer Configuration Requests are not
allowed. The Requests are routed based on the target device's ID, meaning its
BDF (Bus number in the topology, Device number on that bus, and Function number
within that Device).

Generating Configuration Transactions


Processors are generally unable to perform configuration read and write
requests directly because they can only generate memory and IO requests. That
means the Root Complex will need to translate certain of those accesses into
configuration requests in support of this process. Configuration space can be
accessed using either of two mechanisms:
• The legacy PCI configuration mechanism, using IO-indirect accesses.
• The enhanced configuration mechanism, using memory-mapped accesses.

Legacy PCI Mechanism


The PCI spec defined an IO-indirect method for instructing the system (the Root
Complex or its equivalent) to perform PCI configuration accesses. As it
happened, the dominant PC processors (Intel x86) were only designed to address
64KB of IO address space. By the time PCI was defined, this limited IO space
had become badly cluttered and only a few address ranges remained available:
0800h-08FFh and 0C00h-0CFFh. Consequently, it wasn't feasible to map the
configuration registers for all the possible Functions directly into IO space.
At the same time, memory address space was also limited in size and mapping all
of configuration space into memory address space was not seen as a good
solution either. So the spec writers chose a commonly used solution to this
problem: use indirect address mapping instead. To do this, one register holds
the target address, while a second holds the data going to or coming from the
target. A write to the address register, followed by a read or write to the
data register, causes a single read or write transaction to the correct
internal address for the target function. This solves the problem of limited
address space nicely, but it means that two IO accesses are needed to create
one configuration access.

The PCI-Compatible mechanism uses two 32-bit IO ports in the Host bridge of
the Root Complex. They are the Configuration Address Port, at IO addresses
0CF8h-0CFBh, and the Configuration Data Port, at IO addresses 0CFCh-0CFFh.

Accessing a Function's PCI-compatible configuration registers is accomplished
by first writing the target Bus, Device, Function and dword numbers into the
Configuration Address Port, setting its Enable bit in the process. Secondly, a
one-, two-, or four-byte IO read or write is sent to the Configuration Data
Port. The host bridge in the Root Complex compares the specified target bus to
the range of buses that exist downstream of the bridge. If the target bus is
within that range, the bridge initiates a configuration read or write request
(depending on whether the IO access to the Configuration Data Port was a read
or a write).

Configuration Address Port


The Configuration Address Port only latches information when the processor
performs a full 32-bit write to the port, as shown in Figure 3-4, and a 32-bit
read from the port returns its contents. The information written to the
Configuration Address Port must conform to the template illustrated in
Figure 3-4 and described below.

Figure 3-4: Configuration Address Port at 0CF8h

  Bits     Field
  31       Enable Configuration Space Mapping (1 = enabled)
  30:24    Reserved
  23:16    Bus Number
  15:11    Device Number
  10:8     Function Number
  7:2      Doubleword (register pointer, 64 DW)
  1:0      Should always be zeros


• Bits [1:0] are hard-wired, read-only and must return zeros when read. The
  location is dword-aligned and no byte-specific offset is allowed.
• Bits [7:2] identify the target dword (also called the Register Number) in
  the target Function's PCI-compatible configuration space. This mechanism is
  limited to the compatible configuration space (i.e., the first 64
  doublewords of a Function's configuration space).
• Bits [10:8] identify the target Function number (0-7) within the target
  device.
• Bits [15:11] identify the target Device number (0-31).
• Bits [23:16] identify the target Bus number (0-255).
• Bits [30:24] are reserved and must be zero.
• Bit [31] must be set to 1b to enable translation of the subsequent IO access
  to the Configuration Data Port into a configuration access. If bit 31 is
  zero and an IO read or write is sent to the Configuration Data Port, the
  transaction is treated as an ordinary IO Request.
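
To make the bit layout concrete, the following Python sketch packs the fields
listed above into the 32-bit value that software would write to the
Configuration Address Port. The function name config_address and its parameter
names are illustrative assumptions, and the code only computes the value; it
performs no IO.

```python
def config_address(bus, device, function, dword, enable=True):
    """Build the 32-bit value for the Configuration Address Port (0CF8h)."""
    if not (0 <= bus <= 255 and 0 <= device <= 31
            and 0 <= function <= 7 and 0 <= dword <= 63):
        raise ValueError("field out of range for the legacy mechanism")
    return ((int(enable) << 31)   # bit 31: enable translation
            | (bus << 16)         # bits 23:16: target Bus number
            | (device << 11)      # bits 15:11: target Device number
            | (function << 8)     # bits 10:8: target Function number
            | (dword << 2))       # bits 7:2: target dword; bits 1:0 stay 00


# Bus 4, Device 0, Function 0, dword 0
print(hex(config_address(4, 0, 0, 0)))
```

The dword index is shifted left by two so that bits [1:0] are always zero,
which reflects the dword-aligned nature of the port.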

Bus Compare and Data Port Usage


The Host Bridge within the Root Complex, shown in Figure 3-5 on page 95,
implements a Secondary Bus Number register and a Subordinate Bus Number
register. The Secondary Bus Number is the bus number of the bus immediately
beneath the bridge. The Subordinate Bus Number is the highest bus number that
lives downstream of the bridge.
In a single Root Complex system, the bridge may have a Secondary Bus Number
register that is hardwired to 0, a read/write register that reset forces to 0,
or it may just implicitly know that the first accessible bus will be Bus 0. If
bit 31 in the Configuration Address Port (see Figure 3-4 on page 92) is set to
1b, the bridge will compare the target bus number to the range of buses that
exists downstream.
When a Request is seen, the Bridge evaluates whether the target bus number is
within the range of bus numbers downstream, from the value of the Secondary
Bus number to the Subordinate Bus number, inclusive. If the target bus matches
the Secondary Bus, then that bus is targeted and the Request is passed through
as a Type 0 Configuration Request. When devices see a Type 0 Request, they
know that a device local to that bus is the target device (rather than one on
a subordinate bus downstream).
If the target bus is larger than the bridge's Secondary Bus number, but less
than or equal to the bridge's Subordinate Bus number, the Request will be
forwarded as a Type 1 configuration request on the bridge's secondary bus. A
Type 1 configuration access is understood to mean that, even though the Request
has to go across this bus, it does not target a device on this bus. Instead,
the request will be forwarded downstream by one of the Bridges on this bus,
whose Secondary and Subordinate bus number range contains the target bus
number. For that reason, only Bridge devices pay attention to Type 1
configuration Requests. See "Configuration Requests" on page 99 for additional
information regarding Type 0 and Type 1 configuration Requests.
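
The bus comparison just described reduces to a small decision function. The
sketch below is a simplified model in Python; the function name classify and
the string return values are assumptions chosen for readability, not
terminology from the specification.

```python
def classify(target_bus, secondary, subordinate):
    """Decide how a bridge handles a configuration access.

    Returns "Type 0" when the target is the bridge's secondary bus,
    "Type 1" when it lies further downstream, and None when the target
    bus is outside the bridge's range and the access is not forwarded.
    """
    if target_bus == secondary:
        return "Type 0"
    if secondary < target_bus <= subordinate:
        return "Type 1"
    return None


# Host bridge with Secondary = 0 and Subordinate = 9
for bus in (0, 4, 12):
    print(bus, classify(bus, 0, 9))
```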

Single Host System


The information written to the Configuration Address Port is latched by the
Host/PCI bridge within the Root Complex, as shown in Figure 3-1 on page 87. If
bit 31 is 1b and the target bus is within the downstream range of bus numbers,
the bridge translates a subsequent processor access targeting its Configuration
Data Port into a configuration request on bus 0. The processor then initiates
an IO read or write transaction to the Configuration Data Port at 0CFCh. This
causes the bridge to generate a Configuration Request that is a read when the
IO access to the Configuration Data Port was a read, or a Configuration write
if the IO access was a write. It will be a Type 0 configuration transaction if
the target bus is bus 0, or a Type 1 for another bus within the range, or not
forwarded at all if the target bus is outside of the range.


Figure 3-5: Single-Root System

[Figure: a Processor attached to a Root Complex. The bus-number registers
shown for each bridge are:]

  Bridge                    Pri   Sec   Sub
  Host/PCI Bridge           -     0     9
  P2P on Bus 0 (left)       0     1     4
  P2P on Bus 0 (right)      0     5     9
  P2P on Bus 1              1     2     4
  P2P on Bus 2              2     3     3
  P2P on Bus 2              2     4     4
  P2P on Bus 5              5     6     9
  P2P on Bus 6              6     7     7
  P2P on Bus 6              6     8     8
  P2P on Bus 6              6     9     9

[Device 0, Function 0 is shown attached to each of Buses 3, 4, 7 and 9.]


Multi-Host System
If there are multiple Root Complexes (refer to Figure 3-6 on page 97), the
Configuration Address and Data ports can be duplicated at the same IO addresses
in each of their respective Host/PCI bridges. In order to prevent contention,
only one of the bridges responds to the processor's accesses to the
configuration ports.
1. When the processor initiates the IO write to the Configuration Address
   Port, the host bridges are configured so that only one will actively
   participate in the transaction.
2. During enumeration, software discovers and numbers all the buses under
   the active bridge. When that's done, it enables the inactive host bridge
   and assigns a bus number to it that is outside the range already assigned
   to the active bridge and continues the enumeration process. Both host
   bridges see the Requests, but since they have non-overlapping bus numbers
   they only respond to the appropriate bus number requests and so there's no
   conflict.
3. Accesses to the Configuration Address Port go to both host bridges after
   that, and a subsequent read or write access to the Configuration Data Port
   is only accepted by the host/PCI bridge that is the gateway to the target
   bus. This bridge responds to the processor's transaction and the other
   ignores it.
   o If the target bus is the Secondary Bus, the bridge converts the access to
     a Type 0 configuration access.
   o Otherwise, it converts it into a Type 1 configuration access.

Enhanced Configuration Access Mechanism


General
When the spec writers were choosing how PCI-X and, later, PCIe, would access
Configuration space, there were two concerns. First, the 256-byte space per
Function limited vendors who wanted to put proprietary information there, as
well as future spec writers who would need room for more standardized
capability structures. To solve that problem, the space was simply extended
from 256 bytes to 4KB per Function. Secondly, when PCI was developed there were
few multi-processor systems in use. When there's only one CPU and it's only
running one thread, the fact that the old model takes two steps to generate one
access isn't a problem. But newer machines using multi-core, multi-threaded
CPUs present a problem for the IO-indirect model because there's nothing to
stop multiple threads from trying to access Configuration space at the same
time. Consequently, the two-step model will no longer work without some
locking semantics. With no locking semantics, once thread A writes a value into
the Configuration Address Port (CF8h), there is nothing to prevent thread B
from overwriting that value before thread A can perform its corresponding
access to the Configuration Data Port (CFCh).
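
The hazard can be illustrated with a toy model of the two ports. The Python
sketch below replays one unlucky interleaving by hand rather than with real
threads, so the outcome is deterministic, and then shows a lock held around
both steps as one possible form of "locking semantics". The class, the helper
locked_read and the register contents are all invented for the illustration.

```python
import threading


class LegacyConfigPorts:
    """Toy model of the Configuration Address (CF8h) and Data (CFCh) ports."""

    def __init__(self, registers):
        self.registers = registers  # address-port value -> dword contents
        self.address = 0

    def write_address(self, value):
        self.address = value

    def read_data(self):
        return self.registers.get(self.address, 0xFFFFFFFF)


ADDR_A = 0x80040000  # bus 4, device 0, function 0, dword 0
ADDR_B = 0x80070000  # bus 7, device 0, function 0, dword 0
ports = LegacyConfigPorts({ADDR_A: 0xAAAA1111, ADDR_B: 0xBBBB2222})  # made-up data

# Unlucky interleaving: B's address write lands between A's address write
# and A's data read, so A receives the contents of B's register.
ports.write_address(ADDR_A)      # thread A, step 1
ports.write_address(ADDR_B)      # thread B, step 1
stale_for_a = ports.read_data()  # thread A, step 2

config_lock = threading.Lock()


def locked_read(ports, address):
    """Keep the two-step sequence together by holding a lock around it."""
    with config_lock:
        ports.write_address(address)
        return ports.read_data()


print(hex(stale_for_a), hex(locked_read(ports, ADDR_A)))
```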

Figure 3-6: Multi-Root System

[Figure: two Processors linked by Inter-Processor Communications, each attached
to its own Root Complex. The bus-number registers shown for each bridge are:]

  Bridge                            Pri   Sec   Sub
  First Root Complex
    Host/PCI Bridge                 -     0     9
    P2P on Bus 0 (left)             0     1     4
    P2P on Bus 0 (right)            0     5     9
    P2P on Bus 1                    1     2     4
    P2P on Bus 2                    2     3     3
    P2P on Bus 2                    2     4     4
    P2P on Bus 5                    5     6     9
    P2P on Bus 6                    6     7     7
    P2P on Bus 6                    6     8     8
    P2P on Bus 6                    6     9     9
  Second Root Complex
    Host/PCI Bridge                 -     64    65
    P2P on Bus 64                   64    65    65

[Device 0, Function 0 is shown attached to each of Buses 3, 4, 7, 9 and 65.]


To solve this new problem, the spec writers decided to take a different
approach. Rather than try to conserve address space, they would create a
single-step, uninterruptable process by mapping all of configuration space into
memory addresses. That allows a single command sequence, since one memory
request in the specified address range will generate one Configuration Request
on the bus. The trade-off now is address size. Mapping 4KB per Function for all
the possible implementations requires allocating 256MB of memory address
space. The difference in that regard today is that modern architectures
typically support anywhere between 36 and 48 bits of physical memory address
space. With these memory address space sizes, 256MB is insignificant.
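
The 256MB figure follows directly from the limits given earlier in the chapter
(256 buses, 32 Devices per bus, 8 Functions per Device, 4KB per Function). A
short Python check of that arithmetic, with constant names chosen only for
this example:

```python
BUSES = 256
DEVICES_PER_BUS = 32
FUNCTIONS_PER_DEVICE = 8
BYTES_PER_FUNCTION = 4 * 1024  # 4KB of configuration space per Function

total_functions = BUSES * DEVICES_PER_BUS * FUNCTIONS_PER_DEVICE
total_bytes = total_functions * BYTES_PER_FUNCTION

print(total_functions, "Functions")
print(total_bytes // (1024 * 1024), "MB")
```

The total is 2^28 bytes, which is why the base address of the range is aligned
on a 256MB boundary and only address bits above bit 27 are needed to locate it.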

To handle this mapping, each Function's 4KB configuration space starts at a
4KB-aligned address within the 256MB memory address space set aside for
configuration access, and the address bits now carry the identifying
information about which Function is targeted (refer to Table 3-1 on page 98).

Some Rules
A Root Complex is not required to support an access to enhanced configuration
memory space if it crosses a dword address boundary (straddles two adjacent
memory dwords). Nor is it required to support the bus locking protocol
that some processor types use for an atomic, or uninterrupted, series of
commands. Software should avoid both of these situations when accessing
configuration space unless it is known that the Root Complex does support them.

Table 3-1: Enhanced Configuration Mechanism Memory-Mapped Address Range

  Memory Address
  Bit Field        Description

  A[63:28]         Upper bits of the 256MB-aligned base address of the
                   256MB memory-mapped address range allocated for the
                   Enhanced Configuration Mechanism.
                   The manner in which the base address is allocated is
                   implementation-specific. It is supplied to the OS by
                   system firmware (typically through the ACPI tables).

  A[27:20]         Target Bus Number (0-255).

  A[19:15]         Target Device Number (0-31).

  A[14:12]         Target Function Number (0-7).

  A[11:2]          This range can address one of 1024 dwords, whereas the
                   legacy method is limited to addressing only one of
                   64 dwords.

  A[1:0]           Defines the access size and the Byte Enable setting.
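
The address fields in Table 3-1 can be combined with a few shifts. The Python
sketch below computes the memory address for a given Function and byte offset;
the function name ecam_address is an assumption for this example, and the base
address E0000000h is simply a sample value, since the real base is supplied by
system firmware.

```python
def ecam_address(base, bus, device, function, offset=0):
    """Memory address of a configuration register in the enhanced mechanism.

    `base` is the 256MB-aligned start of the memory-mapped range and
    `offset` is the byte offset (0-4095) within the Function's 4KB space.
    """
    if base % (1 << 28) != 0:
        raise ValueError("base must be 256MB-aligned")
    if not (0 <= bus <= 255 and 0 <= device <= 31
            and 0 <= function <= 7 and 0 <= offset <= 0xFFF):
        raise ValueError("field out of range")
    return (base
            | (bus << 20)       # A[27:20]
            | (device << 15)    # A[19:15]
            | (function << 12)  # A[14:12]
            | offset)           # A[11:0]: dword number and byte within it


print(hex(ecam_address(0xE0000000, 4, 0, 0)))
```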

Configuration Requests
Two request types, Type 0 or Type 1, may be generated by bridges in response
to a configuration access. The type used depends on whether the target Bus
number matches the bridge's Secondary Bus Number, as described below.

Type 0 Configuration Request


If the target bus number matches the Secondary Bus Number, a Type 0
configuration read or write is forwarded to the secondary bus and:

1. Devices on that Bus check the Device Number to see which of them is the
   target device. Note that Endpoints on an external Link will always be
   Device 0.
2. The selected Device checks the Function Number to see which Function is
   selected within the device.
3. The selected Function uses the Register Number field to select the target
   dword in its configuration space, and uses the First Dword Byte Enable
   field to select which bytes to read or write within the selected dword.

Figure 3-7 illustrates the Type 0 configuration read and write Request header
formats. In both cases, the Type field = 00100, while the Format field
indicates whether it's a read or a write.


Figure 3-7: Type 0 Configuration Read and Write Request Headers

[Figure: two 3DW headers that differ only in the Fmt field. Fields are listed
from byte +0, bit 7 to byte +3, bit 0 of each dword.]

  Bytes 0-3    Fmt (000 = Read, 010 = Write) | Type (00100) | R | TC (000) |
               R | Attr | R | TH | TD | EP | Attr (00) | AT (00) |
               Length (0000000001)
  Bytes 4-7    Requester ID | Tag | Last BE (0000) | 1st DW BE
  Bytes 8-11   Bus Number | Device Number | Function Number | R |
               Register Number | R

Type 1 Configuration Request


When a bridge sees a configuration access whose target bus number does not
match its Secondary Bus Number but is in the range between its Secondary and
Subordinate Bus Numbers, it forwards the packet as a Type 1 Request to its
Secondary Bus. Devices that are not bridges (Endpoints) know to ignore Type 1
Requests since the target resides on a different bus, but bridges that see it
will make the same comparison of the target bus number to the range of buses
downstream (see Figure 3-1 on page 87 and Figure 3-6 on page 97).


• If the target bus matches the Bridge's secondary bus, the packet is converted
  from Type 1 to Type 0 and passed to the secondary bus. Devices local to that
  bus then check the packet header as previously described.
• If the target bus is not the Bridge's secondary bus but is within its range,
  the packet is forwarded to the Bridge's secondary bus as a Type 1 Request.

Figure 3-8 illustrates the Type 1 configuration read and write request header
formats. In both cases, the Type field = 00101, while the Fmt field indicates
whether it's a read or a write.

Figure 3-8: Type 1 Configuration Read and Write Request Headers

[Figure: two 3DW headers that differ only in the Fmt field. Fields are listed
from byte +0, bit 7 to byte +3, bit 0 of each dword.]

  Bytes 0-3    Fmt (000 = Read, 010 = Write) | Type (00101) | R | TC (000) |
               R | Attr | R | TH | TD | EP | Attr (00) | AT (00) |
               Length (0000000001)
  Bytes 4-7    Requester ID | Tag | Last BE (0000) | 1st DW BE
  Bytes 8-11   Bus Number | Device Number | Function Number | R |
               Register Number | R
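
As a rough illustration of how the header fields fit together, the Python
sketch below packs a 3DW configuration request header into 12 bytes. It is a
simplified model, not a reference encoder: the function and parameter names
are invented, the TC, Attr, TH, TD, EP and AT fields are simply left at zero,
and the exact bit positions within each byte (for example Device Number in
bits 7:3 and Function Number in bits 2:0 of byte 9, and the dword number in
bits 11:2 of bytes 10-11) come from general knowledge of the PCIe header
layout rather than from the figures in this section.

```python
import struct


def build_cfg_request(is_write, cfg_type, requester_id, tag,
                      bus, device, function, dword, first_be):
    """Pack a Type 0 (cfg_type=0) or Type 1 (cfg_type=1) config request."""
    fmt = 0b010 if is_write else 0b000    # 3DW with data / 3DW without data
    type_field = 0b00100 | cfg_type       # 00100 = Type 0, 00101 = Type 1
    dw0 = (((fmt << 5) | type_field) << 24) | 1   # Length is always 1 dword
    dw1 = (requester_id << 16) | (tag << 8) | (first_be & 0xF)  # Last BE = 0000
    dw2 = ((bus << 24) | (device << 19) | (function << 16)
           | ((dword & 0x3FF) << 2))
    return struct.pack(">III", dw0, dw1, dw2)     # byte 0 is sent first


# Type 0 read of the first two bytes of dword 0 on Bus 4, Device 0, Function 0
header = build_cfg_request(False, 0, 0x0000, 0, 4, 0, 0, 0, 0b0011)
print(header.hex())
```

Changing only cfg_type switches the Type field between 00100 and 00101, which
mirrors the point that the two request types share one header format.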


Example PCI-Compatible Configuration Access


Refer to Figure 3-9 on page 104. To illustrate the concept of generating a
Configuration Request using the legacy CF8h/CFCh mechanism, consider the
following x86 assembly code sample, which will cause the Root Complex to
perform a 2-byte read from Bus 4, Device 0, Function 0, Register 0 (Vendor ID).
mov dx,0CF8h       ;set dx = config address port address
mov eax,80040000h  ;enable=1, bus 4, dev 0, func 0, DW 0
out dx,eax         ;IO write to set up address port
mov dx,0CFCh       ;set dx = config data port address
in  ax,dx          ;2-byte read from config data port
1. The out instruction generates an IO write from the processor targeting the
   Configuration Address Port in the Root Complex Host bridge (0CF8h), as
   shown in Figure 3-4 on page 92.
2. The Host bridge compares the target bus number (4) specified in the
   Configuration Address Port to the range of buses (0 through 10) that reside
   downstream. The target bus falls within the range, so the bridge is primed
   with the destination of the next configuration request.
3. The in instruction generates an IO read transaction from the processor
   targeting the Configuration Data Port in the Root Complex Host bridge. It's
   a 2-byte read from the first two locations in the Configuration Data Port.
4. Since the target bus is not bus 0, the Host/PCI bridge initiates a Type 1
   Configuration read on bus 0.
5. All of the devices on bus 0 latch the transaction request and see that it's
   a Type 1 Configuration Request. As a result, both of the virtual PCI-to-PCI
   bridges in the Root Complex compare the target bus number in the Type 1
   request to the range of buses downstream from each of them.
6. The destination bus (4) is within the range of buses downstream of the
   left-hand bridge, so it passes the packet through to its secondary bus, but
   as a Type 1 request because the destination bus doesn't match the Secondary
   Bus Number.
7. The upstream port on the left-hand switch receives the packet and delivers
   it to the upstream PCI-to-PCI bridge.
8. The bridge determines that the destination bus resides beneath it, but is
   not targeting its secondary bus, so it passes the packet to bus 2 as a
   Type 1 request.
9. Both of the bridges on bus 2 receive the Type 1 request packet. The
   right-hand bridge determines that the destination bus matches its Secondary
   Bus Number.


10. Thebridgepassestheconfigurationreadrequestthroughtobus4,butcon
verts into a Type 0 Configuration Read request because the packet has
reachedthedestinationbus(targetbusnumbermatchesthesecondarybus
number).
11. Device0onbus4receivesthepacketanddecodesthetargetDevice,Func
tion,andRegisterNumberfieldstoselectthetargetdwordinitsconfigura
tionspace(seeFigure33onpage90).
12. Bits0and1intheFirstDwordByteEnablefieldareasserted,sotheFunc
tionreturnsitsfirsttwobytes,(VendorIDinthiscase)intheCompletion
packet. The Completion packet is routed to the Host bridge using the
RequesterIDfieldobtainedfromtheType0requestpacket.
13. Thetwobytesofreaddataaredeliveredtotheprocessor,thuscompleting
theexecutionoftheininstruction.TheVendorIDisplacedintheproces
sorsAXregister.
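
The decision each bridge makes in steps 5 through 10 can be sketched in a few
lines of Python. This is only an illustrative model, not code from any real
configuration stack: the Bridge class, the route_type1 function and the
variable names are hypothetical, and the register values are the ones shown
for the relevant bridges in Figure 3-9.

```python
from dataclasses import dataclass


@dataclass
class Bridge:
    """Bus number registers of a PCI-to-PCI bridge (illustrative model)."""
    primary: int
    secondary: int
    subordinate: int


def route_type1(bridge: Bridge, target_bus: int) -> str:
    """What a bridge does with a Type 1 request seen on its primary side."""
    if target_bus == bridge.secondary:
        return "convert to Type 0"   # request has reached its destination bus
    if bridge.secondary < target_bus <= bridge.subordinate:
        return "forward as Type 1"   # destination is further downstream
    return "ignore"                  # destination is not beneath this bridge


# Values taken from Figure 3-9 for the bridges on the path to bus 4.
root_left = Bridge(primary=0, secondary=1, subordinate=4)
root_right = Bridge(primary=0, secondary=5, subordinate=10)
switch_upstream = Bridge(primary=1, secondary=2, subordinate=4)
switch_downstream_right = Bridge(primary=2, secondary=4, subordinate=4)

for name, b in [("root_left", root_left), ("root_right", root_right),
                ("switch_upstream", switch_upstream),
                ("switch_downstream_right", switch_downstream_right)]:
    print(name, "->", route_type1(b, target_bus=4))
```

With a target bus of 4, the left-hand bridge in the Root Complex and the
switch's upstream bridge forward the request as Type 1, the right-hand bridge
in the Root Complex ignores it, and the bridge whose Secondary Bus Number is 4
converts it to Type 0.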

Example Enhanced Configuration Access


Refer to Figure 3-9 on page 104. The following x86 code sample causes the Root
Complex to perform a read from Bus 4, Device 0, Function 0, Register 0 (Vendor
ID). Before this will work, the Host Bridge must have been assigned a base
address value. This example assumes that the 256MB-aligned base address of
the Enhanced Configuration memory-mapped range is E0000000h:

    mov ax, [E0400000h]    ; memory-mapped Config read

• Address bits 63:28 indicate the upper 36 bits of the 256MB-aligned base
  address of the overall Enhanced Configuration address range (in this case,
  00000000E0000000h).
• Address bits 27:20 select the target bus (in this case, 4).
• Address bits 19:15 select the target device (in this case, 0) on the bus.
• Address bits 14:12 select the target Function (in this case, 0) within the
  device.
• Address bits 11:2 select the target dword (in this case, 0) within the
  selected Function's configuration space.
• Address bits 1:0 define the start byte location within the selected dword
  (in this case, 0).

The processor initiates a 2-byte memory read starting from memory location
E0400000h, and this is latched by the Host Bridge in the Root Complex. The
Host Bridge recognizes that the address matches the area designated for
Configuration and generates a Configuration read Request for the first two
bytes in dword 0, Function 0, device 0, bus 4. The remainder of the operation
is the same as that described in the previous section.
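
The address arithmetic described above can be checked with a short Python
sketch. The function names ecam_address and decode_ecam_address are
hypothetical helpers written for this illustration; only the bit positions
and the E0000000h base address come from the text.

```python
ECAM_ALIGNMENT = 256 * 1024 * 1024  # the range covers 256 buses x 1MB each


def ecam_address(base, bus, device, function, offset=0):
    """Memory address that maps to a Bus/Device/Function/register offset."""
    if base % ECAM_ALIGNMENT:
        raise ValueError("base address must be 256MB-aligned")
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("Bus/Device/Function number out of range")
    if not 0 <= offset < 4096:
        raise ValueError("offset must fall within address bits 11:0")
    return base | (bus << 20) | (device << 15) | (function << 12) | offset


def decode_ecam_address(address):
    """Split an address in the Enhanced Configuration range into its fields."""
    return {
        "bus": (address >> 20) & 0xFF,      # bits 27:20
        "device": (address >> 15) & 0x1F,   # bits 19:15
        "function": (address >> 12) & 0x7,  # bits 14:12
        "dword": (address >> 2) & 0x3FF,    # bits 11:2
        "byte": address & 0x3,              # bits 1:0
    }


address = ecam_address(0xE0000000, bus=4, device=0, function=0)
print(hex(address))                  # 0xe0400000
print(decode_ecam_address(address))
```

Bus 4 contributes 4 x 100000h = 400000h to the base, which is where the
E0400000h in the code sample comes from.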


Figure 3-9: Example Configuration Read Access

[Figure: a Processor above a Root Complex containing the Host/PCI bridge and
two virtual PCI-to-PCI (P2P) bridges (Device 0 and Device 1 on Bus 0), with a
switch below each. Bus number register values shown in the figure:]

Bridge                                   Pri   Sec   Sub
Host/PCI bridge                           -     0     10
Bus 0, left-hand P2P (Device 0)           0     1     4
Bus 0, right-hand P2P (Device 1)          0     5     10
Bus 1, P2P                                1     2     4
Bus 2, left-hand P2P                      2     3     3
Bus 2, right-hand P2P                     2     4     4
Bus 5, P2P                                5     6     10
Bus 6, left-hand P2P                      6     7     7
Bus 6, middle P2P                         6     8     9
Bus 6, right-hand P2P                     6     10    10
Bus 8, PCI Express-to-PCI bridge          8     9     9

Function 0 of an Endpoint is shown on each of Buses 3, 4, 7 and 10, and three
PCI devices are shown on PCI Bus 9.

Enumeration - Discovering the Topology


After a system reset or power-up, configuration software has to scan the PCIe
fabric to discover the machine topology and learn how the fabric is populated.
Before that happens, as shown in Figure 3-10 on page 105, the only thing that
software can know for sure is that there will be a Host/PCI bridge and that
bus number 0 will be on the secondary side of that bridge. Note that the
upstream side of a bridge device is called its primary bus, while the
downstream side is referred to as its secondary bus. The process of scanning
the PCI Express fabric to discover its topology is referred to as the
enumeration process.

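Figure 3-10: Topology View At Startup

[Figure: a Processor above a Root Complex containing the Host/PCI bridge and
Bus 0; everything below Bus 0 is marked with question marks. Annotation in
the figure: "Root Complex has bus number zero assigned. The remaining
topology has yet to be discovered and numbered."]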

Discovering the Presence or Absence of a Function


The configuration software executing on the processor normally discovers the
existence of a Function by reading from its Vendor ID register. A unique
16-bit value is assigned to each vendor by the PCI-SIG and is hardwired into
the Vendor ID register of each Function designed by that vendor. By reading
this register in all of the possible combinations of Bus, Device, and
Function numbers in the system, enumeration software can search through the
entire topology to learn which devices are present. This process is fairly
simple, but there are two problems that can arise: a targeted device may not
be present, or it may be present but unprepared to respond. Handling these
two cases is described next.

Device not Present


It can happen several times during the process of discovery that the targeted
device doesn't actually exist in the system, and when that happens it needs to
be understood correctly. In PCI, the Configuration Read Request would time out
on the bus and generate a Master Abort error condition. Since no device was
driving the bus and all the signals were pulled up, the data bits on the bus
would be seen as all ones and that would become the data value seen. The
resulting Vendor ID of FFFFh is reserved. If enumeration software saw that
result for the read, it understood that the device wasn't present. Since this
wasn't really an error condition, the Master Abort would not be reported as
an error during the enumeration process.

For PCIe, a Configuration Read Request to a non-existent device will result
in the bridge above the target device returning a Completion without data
that has a status of UR (Unsupported Request). For backward compatibility
with the legacy enumeration model, the Root Complex returns all ones (FFFFh)
to the processor for the data when this Completion is seen during
enumeration. Note that enumeration software depends on receiving a value of
all 1s for a Configuration Read Request that returns an Unsupported Request
when probing for the existence of Functions in the system.

It's important to avoid accidentally reporting an error for this case. Even
though this timeout or UR result would be seen as an error during runtime,
it's an expected result that isn't considered an error during enumeration. To
help avoid confusion on this, devices are usually not enabled to signal
errors until later. For PCIe it may still be useful to make a note of this
event, and that's why a fourth error status bit, called Unsupported Request
Status, is given in the PCIe Capability register block (refer to
"Enabling/Disabling Error Reporting" on page 678 for more on this). That
allows this condition to be noted without marking it as an error, and that's
important because a detected error might stop the enumeration process to call
the system error handler. The error handling software might have only limited
capabilities during this time and thus have trouble resolving the problem.
The enumeration software could fail in that case, since it's typically
written to execute before the OS or other error handling software is
available. To avoid this risk, errors should not normally be reported during
enumeration.

Device not Ready


Another problem that can arise is that the targeted device is present but
isn't ready to respond to a configuration access. There is a timing
consideration for configuration because of the time it takes devices to
prepare for access. If the data rate is 5.0 GT/s or less, software must wait
100 ms after reset before initiating a Configuration Request. If the rate is
higher than 5.0 GT/s (Gen3 speed), software must wait until 100 ms after Link
training completes before attempting this. The reason for the longer delay
for the higher speeds is that the Gen3 Equalization Process during Link
training can take a long time (on the order of 50 ms; see "Link Equalization
Overview" on page 577 for more on this topic).

As defined in the PCI 2.3 spec, Initialization Time (Trhfa, Time from Reset
High to First Access) begins when RST# is deasserted and completes 2^25 PCI
clocks later. That works out to one full second during which the Function is
preparing for its first configuration access, and that value has been carried
forward for PCIe as 1.0 s (+50%/-0%). A Function could use that time to
populate its configuration registers by loading the contents from an external
serial EEPROM, for example. That might take a while to load and the Function
would be unprepared for a successful access until it finished. In PCI, if a
configuration access was seen before the Function was ready, it had three
choices: ignore the Request, Retry the Request, or accept the Request but
postpone delivering its response until it was fully ready. That last response
could cause trouble for Hot-plug systems because the shared bus could end up
being stalled for one second until the Request resolved.
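
The "one full second" figure can be reproduced with a quick calculation. The
33 MHz clock rate below is an assumption made for this sketch (the nominal
PCI clock); the text above gives only the clock count.

```python
PCI_CLOCK_HZ = 33_000_000   # assumption: nominal 33 MHz PCI clock
TRHFA_CLOCKS = 2 ** 25      # Initialization Time, in PCI clocks

trhfa_seconds = TRHFA_CLOCKS / PCI_CLOCK_HZ

# PCIe carries the value forward as 1.0 s with a +50%/-0% tolerance.
PCIE_MIN_SECONDS = 1.0
PCIE_MAX_SECONDS = PCIE_MIN_SECONDS * 1.5

print(f"{TRHFA_CLOCKS} clocks at {PCI_CLOCK_HZ} Hz = {trhfa_seconds:.3f} s")
print(f"PCIe window: {PCIE_MIN_SECONDS} s to {PCIE_MAX_SECONDS} s")
```

Under that assumption the result is slightly more than one second, which is
consistent with the rounded 1.0 s value carried into PCIe.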

In PCIe we have the same problem, but the process is a little different now.
First, PCIe Functions must always give a Completion with a specific status
when they are temporarily unable to respond to a configuration access, which
is the Configuration Request Retry Status (CRS). This status is only legal in
response to a configuration request and may optionally be considered a
Malformed Packet error if seen in response to other Requests. This response
is also only valid for the one second after reset because the Function is
supposed to respond by then and can be considered broken if it won't.

The way the Root Complex handles a CRS Completion in response to a
Configuration Read Request is implementation-specific, except for the period
following a system reset. During that time, there are two options for what
the Root will do next, based on the setting of the CRS Software Visibility
bit in its Root Control Register, shown in Figure 3-11 on page 108:

• If the bit is set and the Request was a Configuration Read to both bytes of
  the Vendor ID register (as an enumeration access would do to discover the
  presence of a Function), the Root must give the host an artificial value of
  0001h for this register, and all 1s for any additional bytes in this
  Request. This Vendor ID is not used for any real devices and will be
  interpreted by software as an indication of a potentially lengthy delay in
  accessing this device. This can be helpful because software could choose to
  go on to another task and make better use of the time that would otherwise
  be spent waiting for the device to respond, returning to query this device
  later. For this to work, software must ensure that its first access to a
  Function after a reset condition is a Configuration Read of both bytes of
  the Vendor ID.
• For configuration writes or any other configuration reads, the Root must
  automatically re-issue the Configuration Request as a new request.
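
From the software side, the two special Vendor ID values discussed so far
(FFFFh for a missing device and 0001h for a device that answered with CRS)
can be handled as in the following sketch. The names ProbeResult,
classify_vendor_id and probe are hypothetical, the read_vendor_id callable
stands in for whatever configuration access routine a real system provides,
and the Vendor ID 1234h in the usage lines is a made-up value.

```python
from enum import Enum


class ProbeResult(Enum):
    NOT_PRESENT = "not present"
    RETRY_LATER = "present but not ready (CRS)"
    PRESENT = "present"


VENDOR_ID_NONE = 0xFFFF  # all ones: returned when the Completion status is UR
VENDOR_ID_CRS = 0x0001   # artificial value when CRS Software Visibility is set


def classify_vendor_id(vendor_id):
    """Interpret a 16-bit Vendor ID read during enumeration."""
    if vendor_id == VENDOR_ID_NONE:
        return ProbeResult.NOT_PRESENT
    if vendor_id == VENDOR_ID_CRS:
        return ProbeResult.RETRY_LATER
    return ProbeResult.PRESENT


def probe(read_vendor_id, max_attempts=10):
    """Re-read the Vendor ID while the Function keeps reporting CRS."""
    for _ in range(max_attempts):
        result = classify_vendor_id(read_vendor_id())
        if result is not ProbeResult.RETRY_LATER:
            return result
        # Real software could go do other work here and come back later.
    return ProbeResult.RETRY_LATER


# Simulated device: reports CRS twice, then returns a made-up Vendor ID.
responses = iter([0x0001, 0x0001, 0x1234])
print(probe(lambda: next(responses)))
```

A real implementation would also bound the retries by the one-second limit
described above rather than by a simple attempt count.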


Figure 3-11: Root Control Register in PCIe Capability Block

Bit(s)   Field
15:5     RsvdP
4        CRS Software Visibility Enable
3        PME Interrupt Enable
2        System Error on Fatal Error Enable
1        System Error on Non-Fatal Error Enable
0        System Error on Correctable Error Enable

Determining if a Function is an Endpoint or Bridge


A critical part of the enumeration process is being able to determine if a
function is a bridge or an endpoint. As seen in Figure 3-12 on page 108, the
lower 7 bits of the Header Type register (offset 0Eh in config space header)
identify the basic category of the Function, and three values are defined:

• 0 = not a bridge (Endpoint in PCIe)
• 1 = PCI-to-PCI bridge (abbreviated as P2P) connecting two buses
• 2 = CardBus bridge (legacy interface not often used today)

In Figure 3-1 on page 87, the Header Type field (DW3, byte 2) in each of the
Virtual P2Ps would return a value of 1, as would the PCI Express-to-PCI
bridge (Bus 8, Device 0), while the Endpoints would return a Header Type of
zero.
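
Decoding the Header Type register is a matter of masking off bit 7 and
looking up the remaining 7 bits. The sketch below uses hypothetical names
(decode_header_type, HEADER_LAYOUTS); the field positions and the three
defined values are the ones given in the text.

```python
HEADER_TYPE_OFFSET = 0x0E  # offset of the register in the config space header

HEADER_LAYOUTS = {
    0: "Endpoint (Type 0 header)",
    1: "PCI-to-PCI bridge (Type 1 header)",
    2: "CardBus bridge",
}


def decode_header_type(value):
    """Return (is_multi_function, description) for a Header Type byte."""
    is_multi_function = bool(value & 0x80)  # bit 7
    layout = value & 0x7F                   # bits 6:0
    return is_multi_function, HEADER_LAYOUTS.get(layout, "undefined value")


print(decode_header_type(0x01))  # single-function bridge
print(decode_header_type(0x80))  # multi-function Endpoint
```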

Figure 3-12: Header Type Register

Bit(s)   Field
7        0 = single-function device, 1 = multi-function device
6:0      Header Type (Configuration Header Format)


Single Root Enumeration Example


Now that we've discussed the basic elements involved in the enumeration
process, let's walk through an example of the process. Figure 3-13 on page
113 illustrates an example system after the buses and devices have been
enumerated. The discussion that follows assumes that the configuration
software uses either of the two configuration access mechanisms defined in
this chapter to achieve this result. At startup time, the configuration
software executing on the processor performs enumeration as described below.
1. Software updates the Host/PCI bridge Secondary Bus Number to zero and the
Subordinate Bus Number to 255. Setting this to the max value means that it
won't have to be changed again until all the bus numbers downstream have been
identified. For the moment, buses 0 through 255 are identified as being
downstream.
2. Starting with Device 0 (bridge A), the enumeration software attempts to
read the Vendor ID from Function 0 in each of the 32 possible devices on
bus 0. If a valid Vendor ID is returned from Bus 0, Device 0, Function 0, the
device exists and contains at least one Function. If not, go on to probe
bus 0, device 1, Function 0.
3. The Header Type field in this example (Figure 3-12 on page 108) contains
the value one (01h) indicating this is a PCI-to-PCI bridge. The Multifunction
bit (bit 7) in the Header Type register is 0, indicating that Function 0 is
the only Function in this bridge. The spec doesn't preclude implementing
multiple Functions within this Device and each of these Functions, in turn,
could represent other virtual PCI-to-PCI bridges or even non-bridge
functions.
4. Now that software has found a bridge, it performs a series of
configuration writes to set the bridge's bus number registers as follows:
   • Primary Bus Number Register = 0
   • Secondary Bus Number Register = 1
   • Subordinate Bus Number Register = 255
The bridge is now aware that the number of the bus directly attached
downstream is 1 (Secondary Bus Number = 1) and that the largest bus number
downstream of it is 255 (Subordinate Bus Number = 255).
5. Enumeration software must perform a depth-first search. Before proceeding
to discover additional Devices/Functions on bus 0, it must proceed to search
bus 1.
6. Software reads the Vendor ID of Bus 1, Device 0, Function 0, which targets
bridge C in our example. A valid Vendor ID is returned, indicating that
Device 0, Function 0 exists on Bus 1.
7. The Header Type field in the Header register contains the value one
(0000001b) indicating another PCI-to-PCI bridge. As before, bit 7 is a 0,
indicating that bridge C is a single-function device.
8. Software now performs a series of configuration writes to set bridge C's
bus number registers as follows:
   • Primary Bus Number Register = 1
   • Secondary Bus Number Register = 2
   • Subordinate Bus Number Register = 255
9. Continuing the depth-first search, a read is performed from bus 2,
device 0, Function 0's Vendor ID register. The example assumes that bridge D
is Device 0, Function 0 on Bus 2.
10. A valid Vendor ID is returned, indicating bus 2, device 0, Function 0
exists.
11. The Header Type field in the Header register contains the value one
(0000001b) indicating that this is a PCI-to-PCI bridge, and bit 7 is a 0,
indicating that bridge D is a single-function device.
12. Software now performs a series of configuration writes to set bridge D's
bus number registers as follows:
   • Primary Bus Number Register = 2
   • Secondary Bus Number Register = 3
   • Subordinate Bus Number Register = 255
13. Continuing the depth-first search, a read is performed from bus 3,
device 0, Function 0's Vendor ID register.
14. A valid Vendor ID is returned, indicating bus 3, device 0, Function 0
exists.
15. The Header Type field in the Header register contains the value zero
(0000000b) indicating that this is an Endpoint function. Since this is an
endpoint and not a bridge, it has a Type 0 header and there are no
PCI-compatible buses beneath it. This time, bit 7 is a 1, indicating that
this is a multifunction device.
16. Enumeration software performs accesses to the Vendor ID of all 8 possible
functions in bus 3, device 0 and determines that only Function 1 exists in
addition to Function 0. Function 1 is also an Endpoint (Type 0 header), so
there are no additional buses beneath this device.
17. Enumeration software continues scanning across on bus 3 to look for valid
functions on devices 1-31 but does not find any additional functions.
18. Having found every function there was to find downstream of bridge D,
enumeration software updates bridge D with the real Subordinate Bus Number
of 3. Then it backs up one level (to bus 2) and continues scanning across on
that bus looking for valid functions. The example assumes that bridge E is
device 1, Function 0 on bus 2.
19. A valid Vendor ID is returned, indicating that this Function exists.
20. The Header Type field in bridge E's Header register contains the value
one (0000001b) indicating that this is a PCI-to-PCI bridge, and bit 7 is a 0,
indicating a single-function device.
21. Software now performs a series of configuration writes to set bridge E's
bus number registers as follows:
   • Primary Bus Number Register = 2
   • Secondary Bus Number Register = 4
   • Subordinate Bus Number Register = 255
22. Continuing the depth-first search, a read is performed from bus 4,
device 0, Function 0's Vendor ID register.
23. A valid Vendor ID is returned, indicating that this Function exists.
24. The Header Type field in the Header register contains the value zero
(0000000b) indicating that this is an Endpoint device, and bit 7 is a 0,
indicating that this is a single-function device.
25. Enumeration software scans bus 4 to look for valid functions on devices
1-31 but does not find any additional functions.
26. Having reached the bottom of this tree branch, enumeration software
updates the bridge above that bus, E in this case, with the real Subordinate
Bus Number of 4. It then backs up one level (to bus 2) and moves on to read
the Vendor ID of the next device (device 2). The example assumes that devices
2-31 are not implemented on bus 2, so no additional devices are discovered on
bus 2.
27. Enumeration software updates the bridge above bus 2, C in this case, with
the real Subordinate Bus Number of 4 and backs up to the previous bus (bus 1)
and attempts to read the Vendor ID of the next device (device 1). The example
assumes that devices 1-31 are not implemented on bus 1, so no additional
devices are discovered on bus 1.
28. Enumeration software updates the bridge above bus 1, A in this case, with
the real Subordinate Bus Number of 4, then backs up to the previous bus
(bus 0) and moves on to read the Vendor ID of the next device (device 1). The
example assumes that bridge B is device 1, function 0 on bus 0.
29. In the same manner as previously described, the enumeration software
discovers bridge B and performs a series of configuration writes to set
bridge B's bus number registers as follows:
   • Primary Bus Number Register = 0
   • Secondary Bus Number Register = 5
   • Subordinate Bus Number Register = 255
30. Bridge F is then discovered and a series of configuration writes are
performed to set its bus number registers as follows:
   • Primary Bus Number Register = 5
   • Secondary Bus Number Register = 6
   • Subordinate Bus Number Register = 255
31. Bridge G is then discovered and a series of configuration writes are
performed to set its bus number registers as follows:
   • Primary Bus Number Register = 6
   • Secondary Bus Number Register = 7
   • Subordinate Bus Number Register = 255
32. A single-function Endpoint device is discovered at bus 7, device 0,
function 0, so the Subordinate Bus Number of Bridge G is updated to 7.
33. Bridge H is then discovered and a series of configuration writes are
performed to set its bus number registers as follows:
   • Primary Bus Number Register = 6
   • Secondary Bus Number Register = 8
   • Subordinate Bus Number Register = 255
34. Bridge J is discovered and a series of configuration writes are performed
to set its bus number registers as follows:
   • Primary Bus Number Register = 8
   • Secondary Bus Number Register = 9
   • Subordinate Bus Number Register = 255
35. All devices and their respective Functions on bus 9 are discovered and
none of them are bridges, so the Subordinate Bus Numbers of bridges H and J
are updated to 9.
36. Bridge I is then discovered and a series of configuration writes are
performed to set its bus number registers as follows:
   • Primary Bus Number Register = 6
   • Secondary Bus Number Register = 10
   • Subordinate Bus Number Register = 255
37. A single-function Endpoint device is discovered at bus 10, device 0,
function 0.
38. Since software has reached the bottom of this branch of the tree
structure required for PCIe topologies, the Subordinate Bus Number registers
for bridges B, F, and I are updated to 10, and so is the Host/PCI bridge's
Subordinate Bus Number register.

The final values encoded into each bridge's Primary, Secondary and
Subordinate Bus Number fields can be found in Figure 3-9 on page 104.
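
The walk-through above is a depth-first traversal, and its bus numbering can
be reproduced with a small simulation. The sketch below is not enumeration
software: it replaces the Vendor ID and Header Type reads with a Python data
structure, and the names Function, enumerate_bus, endpoint and bridge are
hypothetical. The device numbers chosen for bridges F, G, H and I are
assumptions, since the text does not state them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Function:
    """One Function in the simulated topology (illustrative model only)."""
    name: str
    downstream: Optional[dict] = None  # None = Endpoint; dict = bus below a bridge
    primary: int = 0
    secondary: int = 0
    subordinate: int = 0


def enumerate_bus(bus, bus_number, bridges):
    """Depth-first scan of one bus; returns the highest bus number assigned."""
    highest = bus_number
    for device_number in sorted(bus):        # scan devices in ascending order
        for function in bus[device_number]:  # then each Function in the device
            if function.downstream is None:
                continue                     # Endpoint: no bus beneath it
            function.primary = bus_number
            function.secondary = highest + 1  # next unused bus number
            function.subordinate = 255        # placeholder during the search
            highest = enumerate_bus(function.downstream,
                                    function.secondary, bridges)
            function.subordinate = highest    # real value once the branch is done
            bridges[function.name] = function
    return highest


def endpoint(name):
    return [Function(name)]


def bridge(name, downstream):
    return [Function(name, downstream)]


# Each bus is a dict of {device number: [Functions]}, following Figure 3-13.
bus0 = {
    0: bridge("A", {0: bridge("C", {
        0: bridge("D", {0: [Function("bus 3 fn 0"), Function("bus 3 fn 1")]}),
        1: bridge("E", {0: endpoint("bus 4 fn 0")}),
    })}),
    1: bridge("B", {0: bridge("F", {
        0: bridge("G", {0: endpoint("bus 7 fn 0")}),
        1: bridge("H", {0: bridge("J", {
            1: endpoint("PCI dev 1"),
            2: endpoint("PCI dev 2"),
            3: endpoint("PCI dev 3"),
        })}),
        2: bridge("I", {0: endpoint("bus 10 fn 0")}),
    })}),
}

bridges = {}
host_subordinate = enumerate_bus(bus0, 0, bridges)
for name in sorted(bridges):
    b = bridges[name]
    print(f"bridge {name}: Pri={b.primary} Sec={b.secondary} Sub={b.subordinate}")
print(f"Host/PCI bridge: Sec=0 Sub={host_subordinate}")
```

The recursion mirrors the "back up one level" steps: a bridge's Subordinate
Bus Number is only finalized after everything beneath it has been visited.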


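Figure 3-13: Single-Root System

[Figure: the example system after enumeration. A Processor sits above a Root
Complex whose Host/PCI bridge connects to Bus 0. Labels shown in the figure:]

• Bus 0: virtual P2P bridge A (Dev 0, Func 0) leading to Bus 1, and virtual
  P2P bridge B (Dev 1, Func 0) leading to Bus 5
• Bus 1: virtual P2P bridge C leading to Bus 2
• Bus 2: virtual P2P bridges D and E leading to Bus 3 and Bus 4
• Bus 3: Dev 0 with Function 0 and Function 1
• Bus 4: Dev 0, Function 0
• Bus 5: virtual P2P bridge F leading to Bus 6
• Bus 6: virtual P2P bridges G, H and I leading to Bus 7, Bus 8 and Bus 10
• Bus 7: Dev 0, Function 0
• Bus 8: PCI Express-to-PCI bridge J (Dev 0, Func 0) leading to PCI Bus 9
• Bus 9 (PCI bus): PCI devices Dev 1, Dev 2 and Dev 3, each with Func 0
• Bus 10: Dev 0, Function 0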


Multi-Root Enumeration Example

General
Consider the Multi-Root System shown in Figure 3-14 on page 116. In this
system, each Root Complex:
• Implements the Configuration Address Port and the Configuration Data Port
  at the same IO addresses (an x86-based system).
• Implements the Enhanced Configuration Mechanism.
• Contains a Host/PCI bridge.
• Implements the Secondary Bus Number and Subordinate Bus Number registers at
  separate addresses known to the configuration software.
In the illustration, each Root Complex is a chipset member and one of them is
designated as the bridge to bus 0 (the primary Root Complex) while the other
is designated as the bridge to bus 255 (secondary Root Complex).

Multi-Root Enumeration Process


During enumeration of the left-hand tree structure in Figure 3-14 on page
116, the Host/PCI bridge in the secondary Root Complex ignores all
configuration accesses because the targeted bus number is no greater than 9.
Note that, although detected and numbered, Bus 8 has no device attached. Once
that enumeration process has been completed, the enumeration software takes
the following steps to enumerate the secondary Root Complex:
1. The enumeration software changes the Secondary and Subordinate Bus Number
values in the secondary Root Complex's Host/PCI bridge to bus 64 in this
example. (The values of 64 and 128 are commonly used as the starting bus
number in multi-root systems, but this is just a software convention. There
are no PCI or PCIe rules requiring that configuration. There would be nothing
wrong with starting the secondary Root Complex's bus numbers at 10 in this
example.)
2. Enumeration software then starts searching on bus 64 and discovers the
bridge attached to the downstream Root Port.
3. A series of configuration writes are performed to set its bus number
registers as follows:
   • Primary Bus Number Register = 64
   • Secondary Bus Number Register = 65
   • Subordinate Bus Number Register = 255
The bridge is now aware that the number of the bus directly attached to its
downstream side is 65 (Secondary Bus Number = 65) and that, for the moment,
the largest bus number downstream of it is 255 (Subordinate Bus Number =
255).
4. Device 0, which implements only Function 0, is discovered on Bus 65, and
further searching reveals no other Devices are present on Bus 65, so the
bridge's Subordinate Bus Number is updated to 65 and the search process moves
back up one Bus level.
5. Enumeration continues on bus 64 and no additional devices are discovered,
so the Host/PCI bridge's Subordinate Bus Number is updated to 65.
6. This completes the enumeration process.


Figure 3-14: Multi-Root System

[Figure: two Processors, linked by Inter-Processor Communications, each above
its own Root Complex. Bus number register values shown in the figure:]

Bridge                                       Pri   Sec   Sub
Primary Root Complex, Host/PCI bridge         -     0     9
Bus 0, left-hand P2P (Device 0)               0     1     4
Bus 0, right-hand P2P (Device 1)              0     5     9
Bus 1, P2P                                    1     2     4
Bus 2, left-hand P2P                          2     3     3
Bus 2, right-hand P2P                         2     4     4
Bus 5, P2P                                    5     6     9
Bus 6, left-hand P2P                          6     7     7
Bus 6, middle P2P                             6     8     8
Bus 6, right-hand P2P                         6     9     9
Secondary Root Complex, Host/PCI bridge       -     64    65
Bus 64, P2P (Device 0)                        64    65    65

Device 0, Function 0 is shown on each of Buses 3, 4, 7, 9 and 65. No device is
shown on Bus 8.

Hot-Plug Considerations
In a hot-plug environment, meaning one in which add-in cards can be added or
removed during runtime, the situation illustrated by Bus number 8 in Figure
3-14 on page 116 can potentially cause trouble. A problem can occur if the
system has been enumerated and is up and running and then a card is plugged
into Bus 8 that has a bridge on it. The bridge would need to have bus numbers
assigned for its Secondary and Subordinate Bus Numbers that are higher than
the bus number on its primary bus and completely inclusive. The reason is
that the bus numbers have to be within the Secondary and Subordinate Bus
Numbers of the bridge upstream of the new card.

One approach is to assign the Bus number(s) required for the bridge residing
on Bus number 8 and increment the current Bus number 9 to a number that is
one greater than the previous bus number, thereby making room for the new
bus(es). Swizzling the bus numbers around during runtime can be done, but
experienced people say it's hard to get it to work very well.

There is a simpler solution to this potential problem: simply leave a bus
number gap whenever an unpopulated slot is found. For example, when Bus 8 is
assigned but then an open slot is seen below it, give the next discovered bus
a higher number, like 19 instead of 9, so as to leave room for these add-in
situations to be resolved easily. Then, if a card with a bridge is added, the
new bus number can be assigned as Bus 9 without causing any trouble. In most
cases, leaving a bus number gap will not be an issue since the system can
assign up to 256 bus numbers in total.
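
The gap idea amounts to a one-line change in how the next bus number is
chosen. The sketch below is only an illustration: next_bus_number is a
hypothetical helper, and the gap of 10 is simply the size implied by the
"19 instead of 9" example, not a value required by any spec.

```python
MAX_BUS_NUMBER = 255


def next_bus_number(last_assigned, open_slot_below_last, gap=10):
    """Pick the number for the next discovered bus.

    When the previously assigned bus leads only to an unpopulated slot,
    skip `gap` numbers so a bridge added later can use them.
    """
    candidate = last_assigned + 1
    if open_slot_below_last:
        candidate += gap  # reserve numbers for a future add-in bridge
    if candidate > MAX_BUS_NUMBER:
        raise ValueError("ran out of bus numbers")
    return candidate


print(next_bus_number(7, open_slot_below_last=False))  # 8
print(next_bus_number(8, open_slot_below_last=True))   # 19
```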

MindShare Arbor: Debug/Validation/Analysis and Learning Software Tool

General
MindShare Arbor is a computer system debug, validation, analysis and learning
tool that allows the user to read and write any memory, IO or configuration
space address. The data from these address spaces can be viewed in a clean
and informative style.

The book authors made a decision to not include detailed descriptions of all
configuration registers summarized in a single chapter. Rather, registers are
described throughout the book in associated chapters where they are relevant.

In lieu of a configuration register space description chapter in this book,
MindShare Arbor is an excellent reference learning tool to quickly understand
configuration registers and structures implemented in PCI, PCI-X and PCI
Express devices. All the register and field definitions are up-to-date with
the latest version of the PCI Express spec. Several other types of structures
(e.g. x86 MSRs, ACPI, USB, NVM Express) can also be viewed with MindShare
Arbor (or will be coming soon).

Visit www.mindshare.com/arbor to download a free trial version of MindShare
Arbor.

Figure 3-15: Partial Screenshot of MindShare Arbor


MindShare Arbor Feature List


• Description of all config registers included in the PCIe 3.0 spec
• Scan config space for all PCI-visible functions in system and a description
  of every one of these registers displayed in an easily readable format
• Directly access any memory or IO address
• Write to any config space location, memory address or IO address
• View standard and non-standard structures in a decoded format
  o Decode info included for standard PCI, PCI-X and PCI Express structures
  o Decode info included for some x86-based structures and device-specific
    registers
• Create your own XML-based decode files to drive Arbor's display
  o Create decode files for structures in config space, memory address space
    and IO space
• Save system scans for viewing later or on other systems
  o Saved system scans are XML-based and open-format
• New features that are either already in or coming soon:
  o Difference checking between scans
  o Post-processing scans for illegal or non-optimal settings
  o Scripting support for automation
  o Decode for x86 structures (MSRs, paging, segmentation, interrupt tables,
    etc.)
  o Decode for ACPI structures
  o Decode for USB structures
  o Decode for NVM Express structures


4 Address Space & Transaction Routing
The Previous Chapter
The previous chapter provides an introduction to configuration in the PCI
Express environment. This includes the space in which a Function's
configuration registers are implemented, how a Function is discovered, how
configuration transactions are generated and routed, the difference between
PCI-compatible configuration space and PCIe extended configuration space, and
how software differentiates between an Endpoint and a Bridge.

This Chapter
This chapter describes the purpose and methods of a function requesting
address space (either memory address space or IO address space) through Base
Address Registers (BARs) and how software must set up the Base/Limit
registers in all bridges to route TLPs from a source port to the correct
destination port. The general concepts of TLP routing in PCI Express are also
discussed, including address-based routing, ID-based routing and implicit
routing.

The Next Chapter
The next chapter describes Transaction Layer Packet (TLP) content in detail.
We describe the use, format, and definition of the TLP packet types and the
details of their related fields.

I Need An Address
Almost all devices have internal registers or storage locations that software
(and potentially other devices) need to be able to access. These internal
locations may control the device's behavior, report the status of the device,
or may be a location to hold data for the device to process. Regardless of
the purpose of the internal registers/storage, it is important to be able to
access them from outside the device itself. This means these internal
locations need to be addressable. Software must be able to perform a read or
write operation with an address that will access the appropriate internal
location within the targeted device. In order to make this work, these
internal locations need to be assigned addresses from one of the address
spaces supported in the system.

PCI Express supports the exact same three address spaces that were supported
in PCI:

• Configuration
• Memory
• IO

Configuration Space
As we saw in Chapter 1, configuration space was introduced with PCI to allow
software to control and check the status of devices in a standardized way.
PCI Express was designed to be software backwards-compatible with PCI, so
configuration space is still supported and used for the same reason as it was
in PCI. More info about configuration space (purpose of, how to access, size,
contents, etc.) can be found in Chapter 3.

Even though configuration space was originally meant to hold standardized
structures (PCI-defined headers, capability structures, etc.), it is very
common for PCIe devices to have device-specific registers mapped into their
config space. In these cases, the device-specific registers mapped into
config space are often control, status or pointer registers as opposed to
data storage locations.

Memory and IO Address Spaces


General
In the early days of PCs, the internal registers/storage in IO devices were
accessed via IO address space (as defined by Intel). However, because of
several limitations and undesirable effects related to IO address space,
which we will not go into here, that address space quickly lost favor with
software and hardware vendors. This resulted in the internal
registers/storage of IO devices being mapped into memory address space
(commonly referred to as memory-mapped IO, or MMIO). However, because early
software was written to use IO address space to access internal
registers/storage on IO devices, it became common practice to map the same
set of device-specific registers in memory address space as well as in IO
address space. This allows new software to access the internal locations of a
device using memory address space (MMIO), while allowing legacy (old)
software to continue to function because it can still access the internal
registers of devices using IO address space.

Newer devices that do not rely on legacy software or have legacy
compatibility issues typically just map internal registers/storage through
memory address space (MMIO), with no IO address space being requested. In
fact, the PCI Express specification actually discourages the use of IO
address space, indicating that it is only supported for legacy reasons and
may be deprecated in a future revision of the spec.

A generic memory and IO map is shown in Figure 4-1 on page 125. The size of
the memory map is a function of the range of addresses that the system can
use (often dictated by the CPU addressable range). The size of the IO map in
PCIe is limited to 32 bits (4GB), although in many computers using
Intel-compatible (x86) processors, only the lower 16 bits (64KB) are used.
PCIe can support memory addresses up to 64 bits in size.

The mapping example in Figure 4-1 is only showing MMIO and IO space being
claimed by Endpoints, but that ability is not exclusive to Endpoints. It is
very common for Switches and Root Complexes to also have device-specific
registers accessed via MMIO and IO addresses.

Prefetchable vs. Non-prefetchable Memory Space


Figure41showstwodifferenttypesofMMIObeingclaimedbyPCIedevices:
Prefetchable MMIO (PMMIO) and NonPrefetchable MMIO (NPMMIO). Its
important to describe the distinction between prefetchable and nonprefetch
ablememoryspace.Prefetchablespacehastwoverywelldefinedattributes:

Readsdonothavesideeffects
Writemergingisallowed

Defining a region of MMIO as prefetchable allows the data in that region to be
speculatively fetched ahead in anticipation that a Requester might need more
data in the near future than was actually requested. The reason it's safe to do
this minor caching of the data is that reading the data doesn't change any state
info at the target device. That is to say there are no side effects from the act of
reading the location. For example, if a Requester asks to read 128 bytes from an
address, the Completer might prefetch the next 128 bytes as well in an effort to
improve performance by having it on hand when it's requested. However, if the
Requester never asks for the extra data, the Completer will eventually have to
discard it to free up the buffer space. If the act of reading the data changed the
value at that address (or had some other side effect), it would be impossible to
recover the discarded data. However, for prefetchable space, the read had no
side effects, so it is always possible to go back and get it later since the original
data would still be there.

You may be wondering what sort of memory space might have read side
effects? One example would be a memory-mapped status register that was
designed to automatically clear itself when read to save the programmer the
extra step of explicitly clearing the bits after reading the status.
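
The difference between the two kinds of location can be illustrated with a small
software model. This is only a sketch: the class and function names below are
hypothetical and are not taken from any specification or driver API. It models a
Completer that reads a location speculatively, throws the data away, and then
services the read the Requester really wanted.

```python
class ClearOnReadStatus:
    """Hypothetical status register whose bits clear themselves when read."""

    def __init__(self, value):
        self.value = value

    def read(self):
        current = self.value
        self.value = 0  # side effect: reading clears the status bits
        return current


class PlainBuffer:
    """Hypothetical prefetchable location: reads leave the contents unchanged."""

    def __init__(self, value):
        self.value = value

    def read(self):
        return self.value


def prefetch_then_discard(location):
    """Model a speculative read whose data is dropped before the real read."""
    location.read()         # speculative read; the result is thrown away
    return location.read()  # the read the Requester actually asked for


status_seen = prefetch_then_discard(ClearOnReadStatus(0b1010))
buffer_seen = prefetch_then_discard(PlainBuffer(0b1010))
```

In this model the status bits are lost by the time the real read arrives, while
the plain buffer still returns its original contents, which is why only the
second kind of location is safe to mark as prefetchable.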

Making this distinction was more important for PCI than it is for PCIe because
transactions in that bus protocol did not include a transfer size. That wasn't a
problem when the devices exchanging data were on the same bus, because
there was a real-time handshake to indicate when the requester was finished
and did not need any more data, therefore knowing the byte count wasn't so
important. But when the transfer had to cross a bridge it wasn't as easy because
for reads, the bridge would need to guess the byte count when gathering data
on the other bus. Guessing wrong on the transfer size would add latency and
reduce performance, so having permission to prefetch could be very helpful.
That's why the notion of memory space being designated as prefetchable was
helpful in PCI. Since PCIe requests do include a transfer size it's less interesting
than it was, but it's carried forward for backward compatibility.

Figure 4-1: Generic Memory And IO Address Maps

[Figure: a CPU and Root Complex with System Memory (DRAM), and a Switch
below it with a Legacy Endpoint and a PCIe Endpoint. The Endpoints claim IO,
MMIO (NP) and MMIO (P) ranges. The Memory Map runs from 0 to 2^32 or 2^64
and contains System Memory (DRAM), MMIO (Prefetchable) and MMIO
(Non-Prefetchable). The IO Map runs from 0 to 2^32, with the IO Ports located
below 2^16. Figure note: PCIe Functions may have registers and buffers mapped
into IO and Memory address space.]

Base Address Registers (BARs)

General
Each device in a system may have different requirements in terms of the
amount and type of address space needed. For example, one device may have
256 bytes worth of internal registers/storage that should be accessible through
IO address space and another device may have 16KB of internal registers/storage
that should be accessible through MMIO.

PCI-based devices are not allowed to decide on their own which addresses
should be used to access their internal locations; that is the job of system
software (i.e. BIOS and OS kernel). So the devices must provide a way for system
software to determine the address space needs of the device. Once software
knows what the device's requirements are in terms of address space, then
assuming the request can be fulfilled, software will simply allocate an available
range of addresses, of the appropriate type (IO, NP-MMIO or P-MMIO), to that
device.

This is all accomplished through the Base Address Registers (BARs) in the
header of configuration space. As shown in Figure 4-2 on page 127, a Type 0
header has six BARs available (each one being 32 bits in size), while a Type 1
header has only two BARs. Type 1 headers are found in all bridge devices,
which means every switch port and root complex port has a Type 1 header.
Type 0 headers are in non-bridge devices like endpoints. An example of this can
be seen in Figure 4-3 on page 128.

System software must first determine the size and type of address space being
requested by a device. The device designer knows the collective size of the
internal registers/storage that should be accessible via IO or MMIO. The device
designer also knows how the device will behave when those registers are
accessed (i.e. do reads have side effects or not). This will determine whether
prefetchable MMIO (reads have no side effects) or non-prefetchable MMIO
(reads do have side effects) should be requested. Knowing this information, the
device designer hard-codes the lower bits of the BARs to certain values
indicating the type and size of the address space being requested.

The upper bits of the BARs are writable by software. Once system software
checks the lower bits of the BARs to determine the size and type of address
space requested, system software will then write the base address of the address
range being allocated to this device into the upper bits of the BAR. Since a single
Endpoint (Type 0 header) has six BARs, up to six different address space
requests can be made. However, this is not common in the real world. Most
devices will request 1-3 different address ranges.

Not all BARs have to be implemented. If a device does not need all the BARs to
map its internal registers, the extra BARs are hard-coded with all 0s, notifying
software that these BARs are not implemented.

Figure 4-2: BARs in Configuration Space

Each row is one 32-bit register (bits 31:0); fields are listed from the
most-significant to the least-significant.

Offset  Type 0 Header                         Type 1 Header

00h     Device ID | Vendor ID                 Device ID | Vendor ID
04h     Status | Command                      Status | Command
08h     Class Code | Rev ID                   Class Code | Rev ID
0Ch     BIST | Header Type |                  BIST | Header Type |
        Latency Timer | Cache Line Size       Latency Timer | Cache Line Size
10h     Base Address 0 (BAR0)                 Base Address 0 (BAR0)
14h     Base Address 1 (BAR1)                 Base Address 1 (BAR1)
18h     Base Address 2 (BAR2)                 Secondary Lat Timer | Subordinate Bus # |
                                              Secondary Bus # | Primary Bus #
1Ch     Base Address 3 (BAR3)                 Secondary Status | IO Limit | IO Base
20h     Base Address 4 (BAR4)                 (Non-Prefetchable) Memory Limit |
                                              (Non-Prefetchable) Memory Base
24h     Base Address 5 (BAR5)                 Prefetchable Memory Limit |
                                              Prefetchable Memory Base
28h     CardBus CIS Pointer                   Prefetchable Memory Base Upper 32 Bits
2Ch     Subsystem Device ID |                 Prefetchable Memory Limit Upper 32 Bits
        Subsystem Vendor ID
30h     Expansion ROM Base Address            IO Limit Upper 16 Bits |
                                              IO Base Upper 16 Bits
34h     Reserved | Capability Pointer         Reserved | Capability Pointer
38h     Reserved                              Expansion ROM Base Address
3Ch     Max Lat | Min Gnt |                   Bridge Control |
        Interrupt Pin | Interrupt Line        Interrupt Pin | Interrupt Line

Once the BARs have been programmed, the internal registers or local memory
within the device can be accessed via the address ranges programmed into the
BARs. Any time the device sees a request with an address that maps to one of its
BARs, it will accept that request because it is the target.

Figure 4-3: PCI Express Devices And Type 0 And Type 1 Header Use

[Figure: a CPU and Root Complex with System Memory (DRAM), a Switch, and two
PCIe Endpoints. Each Root Complex port and each Switch port is shown as a
virtual PCI-PCI bridge (P2P) with a Type 1 header; each PCIe Endpoint has a
Type 0 header.]

BAR Example 1: 32-bit Memory Address Space Request

Figure 4-4 on page 130 shows the basic steps in setting up a BAR, which in this
example, is requesting a 4KB block of non-prefetchable memory (NP-MMIO). In
the figure, the BAR is shown at three points in the configuration process:

1. In (1) of Figure 4-4, we see the uninitialized state of the BAR. The device
   designer has fixed the lower bits to indicate the size and type, but the upper
   bits (which are read-write) are shown as Xs to indicate their value is not
   known. System software will first write all 1s to every BAR (using config
   writes) to set all writable bits. (Of course, the hard-coded lower bits are
   unaffected by any configuration writes.) The second view of the BAR,
   shown in (2) of Figure 4-4, shows how it looks after configuration software
   has written all 1s to it.
   Writing all 1s is done to determine what the least-significant writable bit is.
   This bit position indicates the size of the address space being requested. In
   this example, the least-significant writable bit is bit 12, so this BAR is
   requesting 2^12 (or 4KB) of address space. If the least-significant writable bit
   had been bit 20, then the BAR would have been requesting 2^20 (or
   1MB) of address space.
2. After writing all 1s to the BARs, software turns around and reads the value
   of each BAR, starting with BAR0, to determine the type and size of the
   address space being requested. Table 4-1 on page 129 summarizes the
   results of the configuration read of BAR0 for this example.
3. The final step in this process is for system software to allocate an address
   range to BAR0 now that software knows the size and type of the address
   space being requested. The third view of the BAR, in (3) of Figure 4-4,
   shows how it looks after software has written the start address for the
   allocated block of addresses. In this example, the start address is F900_0000h.

At this point the configuration of BAR0 is complete. Once software enables
memory address decoding in the Command register (offset 04h), this device
will accept any memory requests it receives that fall within the range from
F900_0000h - F900_0FFFh (4KB in size).
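
The sizing steps above can be sketched in a few lines. The function names here
are hypothetical and the code only models the arithmetic; it assumes the value
passed in is what a configuration read returned after all 1s were written to a
32-bit memory BAR.

```python
def decode_memory_bar32(readback):
    """Decode a 32-bit memory BAR value read back after writing all 1s."""
    if readback & 0x1:
        raise ValueError("bit 0 is set: this is an IO BAR, not a memory BAR")
    decode_type = (readback >> 1) & 0x3   # bits 2:1: 00b = 32-bit, 10b = 64-bit
    prefetchable = bool(readback & 0x8)   # bit 3
    address_bits = readback & 0xFFFFFFF0  # drop the four low type bits
    # The lowest set address bit is the least-significant writable bit,
    # and its weight is the size of the request.
    size = address_bits & -address_bits
    return decode_type, prefetchable, size


def claims(bar_base, size, address):
    """True if the address falls inside the range programmed into the BAR."""
    return bar_base <= address < bar_base + size


# Read-back value from this example: bits 31:12 are 1s, bits 11:0 are 0s.
decode_type, prefetchable, size = decode_memory_bar32(0xFFFFF000)
```

With the example's base address of F900_0000h, `claims(0xF9000000, size, addr)`
is true for every address up to F900_0FFFh and false from F900_1000h upward.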

Table 4-1: Results of Reading the BAR after Writing All 1s To It

BAR Bits  Meaning

0         Read as 0b, indicating a memory request. Since this is a memory
          request, bits 3:1 also have an encoded meaning.

2:1       Read as 00b, indicating the target only supports decoding a 32-bit
          address.

3         Read as 0b, indicating the request is for non-prefetchable memory
          (meaning reads do have side effects); NP-MMIO.

11:4      Read as all 0s, indicating the size of the request (these bits are
          hard-coded to 0).

31:12     Read as all 1s because software has not yet programmed the upper
          bits with a start address for the block. Since bit 12 is the
          least-significant bit that could be written, the memory size
          requested is 2^12 = 4KB.

Figure 4-4: 32-Bit Non-Prefetchable Memory BAR Set Up

[Figure: BAR0 (offset 10h in the Type 0 header) shown at three points.
(1) Uninitialized BAR: bits 31:12 = X (unknown), bits 11:4 = 0s, bit 3 = 0,
    bits 2:1 = 00, bit 0 = 0.
(2) BAR written with all 1s: bits 31:12 = all 1s, lower bits unchanged
    (FFFF_F000h).
(3) BAR written with base address: bits 31:12 = F9000h, lower bits unchanged
    (F900_0000h).
Bit 0: 0 = Memory request, 1 = IO request.
Bits 2:1: 00 = 32-bit decoding, 10 = 64-bit decoding.
Bit 3: 0 = non-prefetchable, 1 = prefetchable.
Bits 31:12: upper 20 bits of the 4KB-aligned start address (lower 12 bits
assumed to be 0).
This example: 4KB of non-prefetchable memory; the address range must be below
4GB (32-bit decode).
Note: if the memory address is assigned below the 4GB boundary, the 3DW header
must be used when targeting this device.]

BAR Example 2: 64-bit Memory Address Space Request

In the previous example, we saw BAR0 being used to request non-prefetchable
memory address space (NP-MMIO). In this example, as shown in Figure 4-5 on
page 132, BAR1 and BAR2 are being used to request a 64MB block of
prefetchable memory address space. Two sequential BARs are being used here because
the device supports a 64-bit address for this request, meaning that software can
allocate the requested address space above the 4GB address boundary if it
wants to (but that is not a requirement). Since the address can be a 64-bit
address, two sequential BARs must be used together.

As before, the BARs are shown at three points in the configuration process:

1. In (1) of Figure 4-5, we see the uninitialized state of the BAR pair. The
   device designer has hard-coded the lower bits of the lower BAR (BAR1 in
   our example) to indicate the request type and size, while the bits of the
   upper BAR (BAR2) are all read-write. System software's first step was to
   write all 1s to every BAR. In (2) of Figure 4-5, we see the BARs after having
   all 1s written to them.
2. As described in the previous example, system software already evaluated
   BAR0. So software's next step is to read the next BAR (BAR1) and evaluate it
   to see if the device is requesting additional address space. Once BAR1 is
   read, software realizes that more address space is being requested and this
   request is for prefetchable memory address space that can be allocated
   anywhere in the 64-bit address range. Since it supports a 64-bit address, the
   next sequential BAR (BAR2 in this case) is treated as the upper 32 bits of
   BAR1. So software now also reads in the contents of BAR2. However,
   software does not evaluate the lower bits of BAR2 in the same way it did for
   BAR1, because it knows BAR2 is simply the upper 32 bits of the 64-bit
   address request started in BAR1. Table 4-2 on page 132 summarizes the
   results of these configuration reads.
3. The final step in this process is for system software to allocate an address
   range to the BARs now that software knows the size and type of the address
   space being requested. The third view of the BARs in (3) of Figure 4-5
   shows the result after software has used two configuration writes to
   program the 64-bit start address for the allocated range. In this example, bit 1 of
   the Upper BAR (address bit 33 in the BAR pair) is set and bit 30 of the
   Lower BAR (address bit 30 in the BAR pair) is set to indicate a start address
   of 2_4000_0000h. All other writable bits in both BARs are cleared.

At this point, the configuration of the BAR pair (BAR1 & BAR2) is complete.
Once software enables memory address decoding in the Command register
(offset 04h), this device will accept any memory requests it receives that fall
within the range from 2_4000_0000h - 2_43FF_FFFFh (64MB in size).
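
For a 64-bit request the same arithmetic is applied to the two BARs treated as
one 64-bit value. The sketch below uses hypothetical helper names and the
read-back values from this example (FC00_000Ch in the lower BAR, FFFF_FFFFh in
the upper BAR); it also shows how the 64-bit start address is split across the
two configuration writes.

```python
def size_from_bar_pair(lower_readback, upper_readback):
    """Size requested by a 64-bit memory BAR pair after writing all 1s to both."""
    if (lower_readback >> 1) & 0x3 != 0b10:
        raise ValueError("bits 2:1 are not 10b: not a 64-bit memory BAR")
    # The upper BAR supplies address bits 63:32; mask the type bits of the lower BAR.
    combined = (upper_readback << 32) | (lower_readback & 0xFFFFFFF0)
    return combined & -combined  # weight of the least-significant writable bit


def split_base_address(base, type_bits):
    """Values held by the lower and upper BAR once a 64-bit base is programmed."""
    lower = (base & 0xFFFFFFF0) | type_bits  # hard-coded low bits are unaffected
    upper = base >> 32
    return lower, upper


size = size_from_bar_pair(0xFC00000C, 0xFFFFFFFF)
lower, upper = split_base_address(0x2_4000_0000, 0xC)
last_address = 0x2_4000_0000 + size - 1
```

The result matches the figures in this example: a 64MB request, BAR1 holding
4000_000Ch, BAR2 holding 0000_0002h, and a last owned address of 2_43FF_FFFFh.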

Figure 4-5: 64-Bit Prefetchable Memory BAR Set Up

[Figure: the BAR pair BAR1 (offset 14h, lower) and BAR2 (offset 18h, upper) in
the Type 0 header shown at three points.
(1) Uninitialized BAR pair: BAR2 bits 31:0 = X (unknown); BAR1 bits 31:26 = X,
    bits 25:4 = 0s, bit 3 = 1, bits 2:1 = 10, bit 0 = 0.
(2) BAR pair written with all 1s: BAR2 = FFFF_FFFFh, BAR1 = FC00_000Ch.
(3) BAR pair written with base address: BAR2 = 0000_0002h, BAR1 = 4000_000Ch.
Bit 0: 0 = Memory request, 1 = IO request.
Bits 2:1: 00 = 32-bit decoding, 10 = 64-bit decoding.
Bit 3: 0 = non-prefetchable, 1 = prefetchable.
Upper 38 bits of the 64MB-aligned start address (lower bits assumed to be 0):
0000 0002 4000 0000h.
This example: 64MB of prefetchable memory; the address range may be above the
4GB boundary (64-bit decode).]

Table 4-2: Results Of Reading the BAR Pair after Writing All 1s To Both

BAR    BAR Bits  Meaning

Lower  0         Read as 0b, indicating a memory request. Since this is a
                 memory request, bits 3:1 also have an encoded meaning.

Lower  2:1       Read as 10b, indicating the target supports a 64-bit address
                 decoder, and that the next sequential BAR contains the upper
                 32 bits of the address information.

Lower  3         Read as 1b, indicating the request is for prefetchable memory
                 (meaning reads do not have side effects); P-MMIO.

Lower  25:4      Read as all 0s, indicating the size of the request (these
                 bits are hard-coded to 0).

Lower  31:26     Read as all 1s because software has not yet programmed the
                 upper bits with a start address for the block. Note that
                 because bit 26 was the least-significant writable bit, the
                 memory address space request size is 2^26, or 64MB.

Upper  31:0      Read as all 1s. These bits will be used as the upper 32 bits
                 of the 64-bit start address programmed by system software.

BAR Example 3: IO Address Space Request


Continuing from the previous two examples, this same function is also
requesting IO space, as shown in Figure 4-6 on page 134. In the diagram, the requesting
BAR (BAR3 in the example) is shown at three points in the configuration
process:

1. In (1) of Figure 4-6, we see the uninitialized state of the BAR. System
   software has previously written all 1s to every BAR and has evaluated BAR0,
   then BAR1 and BAR2. Now software is going to see if this device is
   requesting additional address space with BAR3. State (2) of Figure 4-6 shows the
   state of BAR3 after the write of all 1s.
2. Software now reads in BAR3 to evaluate the size and type of the request.
   Table 4-3 on page 134 summarizes the results of this configuration read.
3. Now that software knows this is a request for 256 bytes of IO address space,
   the final step is to program the BAR with the base address of the IO address
   range being allocated to this device, specifically this BAR. State (3) of Figure
   4-6 shows the state of the BAR after this step. In our example, the device
   start address is 16KB, so bit 14 is written resulting in a base address of
   4000h; all other upper bits are cleared.

At this point, the configuration of BAR3 is complete. Once software enables IO
address decoding in the Command register (offset 04h), the device will accept
and respond to IO transactions within the range 4000h - 40FFh (256 bytes in
size).

Figure 4-6: IO BAR Set Up

[Figure: BAR3 (offset 1Ch in the Type 0 header) shown at three points.
(1) Uninitialized IO BAR: bits 31:8 = X (unknown), bits 7:2 = 0s, bit 1 = 0
    (reserved), bit 0 = 1.
(2) IO BAR written with all 1s: bits 31:8 = all 1s, lower bits unchanged
    (FFFF_FF01h).
(3) IO BAR written with base address: bits 31:8 = 000040h, lower bits unchanged
    (0000_4001h).
Bit 0: 0 = Memory request, 1 = IO request. Bit 1: Reserved (0).
Bits 31:8: upper 24 bits of the 256-byte-aligned start address (lower 8 bits
assumed to be 0): 0000 4000h.
This example: 256 bytes of IO address space; software assigns the start address
at 16KB in the IO address map.
Note: Only Legacy PCIe devices should make requests for IO address space.]

Table 4-3: Results Of Reading the IO BAR after Writing All 1s To It

BAR Bits  Meaning

0         Read as 1b, indicating an IO request. Since this is an IO request,
          bit 1 is reserved.

1         Reserved. Hard-coded to 0b.

7:2       Read as 0s. Indicates the size of the request (these bits are
          hard-coded to 0).

31:8      Read as 1s because software has not yet programmed the upper bits
          with a start address for the block. Note that because bit 8 was the
          least-significant writable bit, the IO request size is 2^8, or 256
          bytes.


All BARs Must Be Evaluated Sequentially


After going through the previous three examples, it becomes clear that software
must evaluate BARs in a sequential fashion.

Most of the time, functions do not need all six BARs. Even in the examples we
went through, only four of the six available BARs were used. If the function in
our example did not need to request any additional address space, the device
designer would hard-code all bits of BAR4 and BAR5 to 0s. So even though
software writes those BARs with all 1s, the writes have no effect. After evaluating
BAR3, software would move on to evaluating BAR4. Once it detected that none
of the bits were set, software would know this BAR is not being used and move
on to evaluating the next BAR.

All BARs must be evaluated, even if software finds a BAR that is not being used.
There are no rules in PCI or PCIe that state that BAR0 must be the first BAR
used for address space requests. If a device designer chooses to, they can use
BAR4 for an address space request and hard-code BAR0, BAR1, BAR2, BAR3
and BAR5 to all 0s. This means software must evaluate every BAR in the header.
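
A sequential walk over the six BARs might look like the following sketch. It is
not taken from any real enumeration code: the function name and the list-based
interface are assumptions made for illustration, and the input is assumed to be
the six values read back after all 1s were written to every BAR. Note how a
64-bit request consumes two BARs and how an unimplemented BAR is skipped
without stopping the walk.

```python
def walk_bars(readbacks):
    """Evaluate six BAR read-back values (taken after writing all 1s).

    Returns a list of (bar_index, kind, size_in_bytes) for implemented BARs.
    """
    requests = []
    index = 0
    while index < len(readbacks):
        value = readbacks[index]
        if value == 0:
            index += 1  # all bits hard-coded to 0: not implemented, keep going
            continue
        if value & 0x1:  # bit 0 = 1: IO request, bit 1 is reserved
            bits = value & 0xFFFFFFFC
            requests.append((index, "IO", bits & -bits))
            index += 1
            continue
        kind = "P-MMIO" if value & 0x8 else "NP-MMIO"
        bits = value & 0xFFFFFFF0
        if (value >> 1) & 0x3 == 0b10:  # 64-bit: next BAR holds the upper 32 bits
            bits |= readbacks[index + 1] << 32
            requests.append((index, kind, bits & -bits))
            index += 2  # skip the upper half of the pair
        else:
            requests.append((index, kind, bits & -bits))
            index += 1
    return requests


# Read-back values for BAR0..BAR5 from the three examples in this section.
example = [0xFFFFF000, 0xFC00000C, 0xFFFFFFFF, 0xFFFFFF01, 0x00000000, 0x00000000]
found = walk_bars(example)

# A function that only implements BAR4 is still discovered.
only_bar4 = walk_bars([0, 0, 0, 0, 0xFFFF0000, 0])
```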

Resizable BARs
The 2.1 version of the PCI Express specification added support for changing the
size of the requested address space in the BARs by defining a new capability
structure in extended config space. The new structure allows the function to
advertise what address space sizes it can operate with and then have software
enable one of the sizes based on the available system resources. For example, if
a function would ideally like to have 2GB of prefetchable memory address
space, but it could still operate with only 1GB, 512MB or 256MB of P-MMIO,
system software may only enable the function to request 256MB of address
space if software would not be able to accommodate a request of a larger size.


Base and Limit Registers

General
Once a function's BARs are programmed, the function knows what address
range(s) it owns, which means that function will claim any transactions it sees
that are targeting an address range it owns, that is, an address range programmed
into one of its BARs. This is good, but it's important to realize that the only way
that function is going to see the transactions it should claim is if the bridge(s)
upstream of it forward those transactions downstream to the appropriate link
that the target function is connected to. Therefore, each bridge (e.g. switch ports
and root complex ports) needs to know what address ranges live beneath it so it
can determine which requests should be forwarded from its primary interface
(upstream side) to its secondary interface (downstream side). If the request is
targeting an address that is owned by a BAR in a function beneath the bridge,
the request should be forwarded to the bridge's secondary interface.

It is the Base and Limit registers in the Type 1 headers that are programmed
with the range of addresses that live beneath this bridge. There are three sets
of Base and Limit registers found in each Type 1 header. Three sets of registers
are needed because there can be three separate address ranges living below a
bridge:

• Prefetchable Memory space (P-MMIO)
• Non-Prefetchable Memory space (NP-MMIO)
• IO space (IO)
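
The forwarding decision a bridge makes with these three ranges can be sketched
as a simple range check. The function name, the dictionary layout and the
window values below are assumptions chosen for illustration only; a real bridge
performs this comparison in hardware using the contents of its Base and Limit
registers.

```python
def forwards_downstream(space, address, windows):
    """Decide whether a request seen on the primary interface goes downstream.

    space is "memory" or "io"; windows maps a window name to an inclusive
    (base, limit) pair, or to None when nothing of that type lives below.
    """
    if space == "io":
        candidates = [windows["IO"]]
    else:
        candidates = [windows["P-MMIO"], windows["NP-MMIO"]]
    return any(w is not None and w[0] <= address <= w[1] for w in candidates)


# Illustrative windows for one bridge (values assumed for this sketch).
windows = {
    "P-MMIO": (0x2_4000_0000, 0x2_43FF_FFFF),
    "NP-MMIO": (0xF900_0000, 0xF90F_FFFF),
    "IO": (0x4000, 0x4FFF),
}
```

Memory requests are only compared against the two memory windows and IO
requests only against the IO window, so IO address 4010h is forwarded while a
memory request to address 4010h is not.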

To explain how these Base and Limit registers work, let's continue the example
from the previous section and place that programmed function (an endpoint)
beneath a switch as shown in Figure 4-7 on page 137. The figure also lists the
address ranges owned by the BARs of that function.

The Base and Limit registers of every bridge upstream of the endpoint will need
to be programmed, but to start out, we're going to focus on the bridge that is
connected to the endpoint (Port B).

Figure 4-7: Example Topology for Setting Up Base and Limit Values

[Figure: a CPU and Root Complex with System Memory (DRAM). Below the Root
Complex is a Switch with Port A on its upstream side and Ports B and C on its
downstream side, each port being a virtual PCI-PCI bridge (P2P). A PCIe
Endpoint is attached below each downstream port. The address ranges owned by
the BARs of the example Endpoint are listed below.]

BAR0:    NP-MMIO (4KB)    F900_0000h - F900_0FFFh
BAR1-2:  P-MMIO (64MB)    2_4000_0000h - 2_43FF_FFFFh
BAR3:    IO (256 bytes)   4000h - 40FFh
BAR4-5:  Not Used (All 0s)

Prefetchable Range (P-MMIO)


Type 1 headers have two pairs of prefetchable memory base/limit registers. The
Prefetchable Memory Base/Limit registers store address info for the lower 32
bits of the prefetchable address range. If this bridge supports decoding 64-bit
addresses, then the Prefetchable Memory Base/Limit Upper 32 Bits registers are
also used and hold the upper 32 bits (bits [63:32]) of the address range. Figure
4-8 on page 138 shows the values software would program into these registers to
indicate that the prefetchable address range of 2_4000_0000h - 2_43FF_FFFFh
lives beneath that bridge (Port B). The meaning of each field in those registers is
summarized in Table 4-4.

Figure 4-8: Example Prefetchable Memory Base/Limit Register Values

[Figure: the Type 1 header with the prefetchable registers expanded.
Prefetchable Memory Base (offset 24h, bits 15:0) = 4001h: bits 15:4 (RW) hold
bits 31:20 of the prefetchable base address; bits 3:0 (RO) are 0h = 32-bit,
1h = 64-bit.
Prefetchable Base Upper 32 Bits (offset 28h) = 0000_0002h: (RW) bits 63:32 of
the prefetchable base address.
Prefetchable range base address: 0000 0002 4000 0000h. Bits 19:0 are always 0s
for Base.
Prefetchable Memory Limit (offset 24h, bits 31:16) = 43F1h: bits 15:4 (RW) hold
bits 31:20 of the prefetchable limit address; bits 3:0 (RO) are 0h = 32-bit,
1h = 64-bit.
Prefetchable Limit Upper 32 Bits (offset 2Ch) = 0000_0002h: (RW) bits 63:32 of
the prefetchable limit address.
Prefetchable range limit address: 0000 0002 43FF FFFFh. Bits 19:0 are always Fs
for Limit.
Prefetchable Memory Range: 2_4000_0000h - 2_43FF_FFFFh]

Table 4-4: Example Prefetchable Memory Base/Limit Register Meanings

Register              Value      Use

Prefetchable Memory   4001h      The upper 12 bits of this register hold the
Base                             upper 12 bits of the 32-bit BASE address (bits
                                 [31:20]). The lower 20 bits of the base address
                                 are implied to be all 0s, meaning the base
                                 address is always aligned on a 1MB boundary.
                                 The lower 4 bits of this register indicate
                                 whether a 64-bit address decoder is supported
                                 in the bridge, meaning the Upper Base/Limit
                                 Registers are used.

Prefetchable Memory   43F1h      Similarly, the upper 12 bits of this register
Limit                            hold the upper 12 bits of the 32-bit LIMIT
                                 address (bits [31:20]). The lower 20 bits of
                                 the limit address are implied to be all Fs.
                                 The lower 4 bits of this register have the same
                                 meaning as the lower 4 bits of the base
                                 register.

Prefetchable Memory   00000002h  Holds the upper 32 bits of the 64-bit BASE
Base Upper 32 Bits               address for Prefetchable Memory downstream
                                 of this port.

Prefetchable Memory   00000002h  Holds the upper 32 bits of the 64-bit LIMIT
Limit Upper 32 Bits              address for Prefetchable Memory downstream
                                 of this port.
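
The register values in this example can be reproduced with a little bit
manipulation. The helper below is a hypothetical sketch, not a real API; it
returns the values as they would read back, including the read-only low nibble
that reports whether 64-bit decoding is supported.

```python
def encode_prefetchable_window(base, limit, supports_64bit=True):
    """Return (Base, Limit, Base Upper 32 Bits, Limit Upper 32 Bits) values.

    Address bits 19:0 are implied (0s for base, Fs for limit), so the window
    must start on a 1MB boundary and end one byte before a 1MB boundary.
    """
    if base & 0xFFFFF or (limit & 0xFFFFF) != 0xFFFFF:
        raise ValueError("window must be a whole number of 1MB blocks")
    decode = 0x1 if supports_64bit else 0x0  # read-only low nibble
    # Address bits 31:20 are stored in register bits 15:4.
    base_reg = ((base >> 16) & 0xFFF0) | decode
    limit_reg = ((limit >> 16) & 0xFFF0) | decode
    return base_reg, limit_reg, base >> 32, limit >> 32


regs = encode_prefetchable_window(0x2_4000_0000, 0x2_43FF_FFFF)
```

For the range 2_4000_0000h - 2_43FF_FFFFh this yields 4001h, 43F1h, 00000002h
and 00000002h, matching the values listed in the table.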

Non-Prefetchable Range (NP-MMIO)


Unlike the prefetchable memory range, the non-prefetchable memory range can
only support 32-bit addresses. So there is only one register for the base and one
register for the limit. Following the example in Figure 4-7, the Non-Prefetchable
Memory Base/Limit registers of Port B would be programmed with the values
shown in Figure 4-9 on page 140. The meaning of these values is summarized in
Table 4-5.

Figure 4-9: Example Non-Prefetchable Memory Base/Limit Register Values

[Figure: the Type 1 header with the non-prefetchable memory registers expanded.
(Non-Prefetchable) Memory Base (offset 20h, bits 15:0) = F900h: bits 15:4 (RW)
hold bits 31:20 of the non-prefetchable base address; bits 3:0 (RO) must be 0.
Non-prefetchable range base address: F900 0000h. Bits 19:0 are always 0s for
Base.
(Non-Prefetchable) Memory Limit (offset 20h, bits 31:16) = F900h: bits 15:4 (RW)
hold bits 31:20 of the non-prefetchable limit address; bits 3:0 (RO) must be 0.
Non-prefetchable range limit address: F90F FFFFh. Bits 19:0 are always Fs for
Limit.
Non-Prefetchable Memory Range: F900_0000h - F90F_FFFFh]

Table 4-5: Example Non-Prefetchable Memory Base/Limit Register Meanings

Register              Value      Use

(Non-Prefetchable)    F900h      The upper 12 bits of this register hold the
Memory Base                      upper 12 bits of the 32-bit BASE address (bits
                                 [31:20]). The lower 20 bits of the base address
                                 are implied to be all 0s, meaning the base
                                 address is always aligned on a 1MB boundary.
                                 The lower 4 bits of this register must be 0s.

(Non-Prefetchable)    F900h      Similarly, the upper 12 bits of this register
Memory Limit                     hold the upper 12 bits of the 32-bit LIMIT
                                 address (bits [31:20]). The lower 20 bits of
                                 the limit address are implied to be all Fs.
                                 The lower 4 bits of this register must be 0s.

This example shows an interesting case where the non-prefetchable address
range programmed in Port B's configuration space indicates a much larger
range (1MB) than the NP-MMIO range (4KB) owned by the endpoint living
downstream. This is because the memory base/limit registers in the Type 1
header can only be used to specify address bits 20 and above ([31:20]). The
lower 20 address bits, [19:0], are implied. So the smallest address range that can
be specified with the memory base/limit registers is 1MB.

In our example, the endpoint requested, and was granted, 4KB of NP-MMIO
(F900_0000h - F900_0FFFh). Port B was programmed with values indicating
1MB, or 1024KB, of NP-MMIO lived downstream of that port (F900_0000h -
F90F_FFFFh). This means 1020KB (F900_1000h - F90F_FFFFh) of memory
address space is wasted. This address space CANNOT be allocated to another
endpoint because the routing of the packets would not work.
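
The rounding to 1MB granularity and the resulting waste can be checked with a
short calculation. This is a sketch with hypothetical names; it simply rounds
the endpoint's range outward to the nearest 1MB boundaries, which is what the
implied lower 20 address bits force software to do.

```python
MB = 1 << 20


def memory_window_for(first, last):
    """Smallest bridge memory window (1MB granularity) covering first..last."""
    base = first & ~(MB - 1)  # round down: bits 19:0 of the base are implied 0s
    limit = last | (MB - 1)   # round up: bits 19:0 of the limit are implied Fs
    return base, limit


# The endpoint's 4KB NP-MMIO range from this example.
base, limit = memory_window_for(0xF900_0000, 0xF900_0FFF)

# Address bits 31:20 land in register bits 15:4; the low nibble must be 0.
base_reg = (base >> 16) & 0xFFF0
limit_reg = (limit >> 16) & 0xFFF0

wasted_bytes = (limit - base + 1) - 4096
```

The window comes out as F900_0000h - F90F_FFFFh with both registers holding
F900h, and the unused remainder is 1020KB.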

IO Range
Like with the prefetchable memory range, Type 1 headers have two pairs of IO
base/limit registers. The IO Base/Limit registers store address info for the lower
16 bits of the IO address range. If this bridge supports decoding 32-bit IO
addresses (which is rare in real-world devices), then the IO Base/Limit Upper 16
Bits registers are also used and hold the upper 16 bits (bits [31:16]) of the IO
address range. Following our example, Figure 4-10 on page 142 shows the
values software would program into these registers to indicate that the IO address
range of 4000h - 4FFFh lives beneath that bridge (Port B). The meaning of each
field in those registers is summarized in Table 4-6.

Figure 4-10: Example IO Base/Limit Register Values

[Figure: the Type 1 header with the IO registers expanded.
IO Base (offset 1Ch, bits 7:0) = 40h: bits 7:4 (RW) hold bits 15:12 of the IO
base address; bits 3:0 (RO) are 0h = 16-bit, 1h = 32-bit.
IO Base Upper 16 Bits (offset 30h, bits 15:0) = 0000h: (RW) bits 31:16 of the IO
base address (if used).
IO range base address: 4000h. Bits 11:0 are always 0s for IO Base.
IO Limit (offset 1Ch, bits 15:8) = 40h: bits 7:4 (RW) hold bits 15:12 of the IO
limit address; bits 3:0 (RO) are 0h = 16-bit, 1h = 32-bit.
IO Limit Upper 16 Bits (offset 30h, bits 31:16) = 0000h: (RW) bits 31:16 of the
IO limit address (if used).
IO range limit address: 4FFFh. Bits 11:0 are always Fs for IO Limit.
IO Range: 4000h - 4FFFh]

Table 4-6: Example IO Base/Limit Register Meanings

Register              Value      Use

IO Base               40h        The upper 4 bits of this register hold the
                                 upper 4 bits of the 16-bit BASE address (bits
                                 [15:12]). The lower 12 bits of the base address
                                 are implied to be all 0s, meaning the base
                                 address is always aligned on a 4KB boundary.
                                 The lower 4 bits of this register indicate
                                 whether a 32-bit IO address decoder is
                                 supported in the bridge, meaning the Upper
                                 Base/Limit Registers are used.

IO Limit              40h        Similarly, the upper 4 bits of this register
                                 hold the upper 4 bits of the 16-bit LIMIT
                                 address (bits [15:12]). The lower 12 bits of
                                 the limit address are implied to be all Fs.
                                 The lower 4 bits of this register have the same
                                 meaning as the lower 4 bits of the base
                                 register.

IO Base Upper 16      0000h      Holds the upper 16 bits of the 32-bit BASE
Bits                             address for IO downstream of this port.

IO Limit Upper 16     0000h      Holds the upper 16 bits of the 32-bit LIMIT
Bits                             address for IO downstream of this port.

In this example, we see another situation where the address range programmed
into the upstream bridge far exceeds the actual address range owned by the
downstream function. The endpoint in our example owns 256 bytes of IO
address space (specifically 4000h - 40FFh). Port B has been programmed with
values indicating that 4KB of IO address space lives downstream (addresses
4000h - 4FFFh). Again, this is simply a limitation of Type 1 headers. For IO
address space, the lower 12 bits (bits [11:0]) have implied values, so the
smallest range of IO addresses that can be specified is 4KB. This limitation
turns out to be more serious than the 1MB minimum window for memory ranges.
In x86-based (Intel-compatible) systems, the processors only support 16 bits
of IO address space, and since only bits [15:12] of the IO address range can
be specified in a bridge, that means that there can be a maximum of 16 (2^4)
different IO address ranges in a system.
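
As a rough illustration of the register meanings in Table 4-6, the following Python sketch turns raw IO Base and IO Limit register values into the address range they describe. The function names (decode_io_window, supports_32bit_io) and their parameters are hypothetical helpers invented for this example; they are not part of any real driver or library.

```python
def decode_io_window(io_base, io_limit, base_upper16=0, limit_upper16=0):
    """Return (first, last) IO address described by a bridge's IO window."""
    # The upper nibble of each 8-bit register supplies address bits [15:12].
    # Bits [11:0] are implied: all 0s for the base, all Fs for the limit.
    base = (base_upper16 << 16) | ((io_base & 0xF0) << 8)
    limit = (limit_upper16 << 16) | ((io_limit & 0xF0) << 8) | 0xFFF
    return base, limit


def supports_32bit_io(io_base):
    """Lower nibble: 0h = 16-bit IO decoding, 1h = 32-bit IO decoding."""
    return (io_base & 0x0F) == 0x1


# Values from the example: IO Base = 40h, IO Limit = 40h, upper registers 0.
first, last = decode_io_window(0x40, 0x40)
print(f"IO Range: {first:04X}h - {last:04X}h")  # IO Range: 4000h - 4FFFh
```

Because only the upper nibble contributes to bits [15:12], the smallest window this decoding can express is 4KB, which matches the limitation described above.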


Unused Base and Limit Registers


Not every PCIe device will use all three types of address space. In fact, the
PCI Express specification actually discourages the use of IO address space,
indicating that it is only supported for legacy reasons and may be deprecated
in a future revision of the spec.

In the cases where an endpoint does not request all three types of address
space, what are the base and limit registers of the bridges upstream of those
devices programmed with? They can't be programmed with all 0s because the
lower address bits would still be implied to be different (base = 0s; limit =
Fs), which would represent a valid range. So to handle these cases, the limit
register must be programmed with a lower address than the base. For example,
if an endpoint does not request IO address space, then the bridge immediately
upstream of that function would have its IO Base register programmed to F0h
and its IO Limit register programmed with 00h. Since the limit address is
lower than the base address, the bridge understands this is an invalid
setting and takes it to mean that there are no functions downstream of it
that own IO address space.

This method of invalidating base and limit registers is valid for all three
base and limit pairs, not just for the IO base/limit registers.
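
A minimal sketch of how an unused IO window can be recognized, assuming the convention labeled "Not Used (Base > Limit)" in Figure 4-11: the window is treated as empty when the decoded base is above the decoded limit. The helper names io_window and claims_io are hypothetical, and only 16-bit IO decoding is modeled.

```python
def io_window(io_base, io_limit):
    """Decode a 16-bit IO window; return None if the window is unused."""
    base = (io_base & 0xF0) << 8              # bits [11:0] implied 0s
    limit = ((io_limit & 0xF0) << 8) | 0xFFF  # bits [11:0] implied Fs
    if base > limit:
        return None                           # Base > Limit: nothing downstream
    return base, limit


def claims_io(io_base, io_limit, addr):
    """True if the bridge would treat addr as living downstream of it."""
    window = io_window(io_base, io_limit)
    return window is not None and window[0] <= addr <= window[1]


print(io_window(0x00, 0x00))  # (0, 4095): all 0s still describes a valid 4KB range
print(io_window(0xF0, 0x00))  # None: base F000h is above limit 0FFFh
```

The first call shows why writing all 0s is not enough to disable a window: the implied low-order bits still yield the range 0000h - 0FFFh.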

Sanity Check: Registers Used For Address Routing


To ensure that you understand the rules and methods for setting up BARs and
Base/Limit registers, please look over Figure 4-11 on page 145 to make sure
it makes sense. We have simply extended the example system to include
additional address space requests from the other endpoint, as well as from
one of the switch ports (Port A). Remember that Type 1 headers also have BARs
(two of them to be exact) and can request address space too. The Base/Limit
registers in a bridge do NOT include the addresses owned by that same
bridge's BARs. The Base/Limit registers only represent the addresses that
live downstream of that bridge.


Figure 4-11: Final Example Address Routing Setup

[Figure: CPU, Root Complex, and system memory (DRAM) above a three-port
switch (upstream Port A, downstream Ports B and C), with one PCIe Endpoint
below each downstream port. Programmed values shown:]

Root Complex port (P2P)
  BAR0-1: Not Used (All 0s)
  IO Range: 4000h - 5FFFh
  NP-MMIO Range: F900_0000h - F90F_FFFFh
  P-MMIO Range: 2_3E00_0000h - 2_440F_FFFFh

Switch Port A (upstream, P2P)
  BAR0-1: P-MMIO (1KB), 2_3E00_0000h - 2_3E00_03FFh
  IO Range: 4000h - 5FFFh
  NP-MMIO Range: F900_0000h - F90F_FFFFh
  P-MMIO Range: 2_4000_0000h - 2_440F_FFFFh

Switch Port B (downstream, P2P)
  BAR0-1: Not Used (All 0s)
  IO Range: 4000h - 4FFFh
  NP-MMIO Range: F900_0000h - F90F_FFFFh
  P-MMIO Range: 2_4000_0000h - 2_43FF_FFFFh

Switch Port C (downstream, P2P)
  BAR0-1: Not Used (All 0s)
  IO Range: 5000h - 5FFFh
  NP-MMIO Range: Not Used (Base > Limit)
  P-MMIO Range: 2_4400_0000h - 2_440F_FFFFh

PCIe Endpoint below Port B
  BAR0: NP-MMIO (4KB), F900_0000h - F900_0FFFh
  BAR1-2: P-MMIO (64MB), 2_4000_0000h - 2_43FF_FFFFh
  BAR3: IO (256 bytes), 4000h - 40FFh
  BAR4-5: Not Used (All 0s)

PCIe Endpoint below Port C
  BAR0-1: P-MMIO (8KB), 2_4400_0000h - 2_4400_1FFFh
  BAR2-4: Not Used (All 0s)
  BAR5: IO (4 bytes), 5000h - 5003h
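
One way to sanity-check the values in Figure 4-11 is to compute, for each bridge, the smallest aligned window that covers everything downstream of it. The sketch below assumes the granularities stated earlier (4KB for IO windows, 1MB for memory windows). The function covering_window is a hypothetical helper for this example, not how any particular enumeration software is written.

```python
IO_GRANULARITY = 4 * 1024        # IO windows are 4KB-aligned
MEM_GRANULARITY = 1024 * 1024    # memory windows are 1MB-aligned


def covering_window(ranges, granularity):
    """Smallest aligned (base, limit) window covering all (first, last) ranges."""
    first = min(lo for lo, _ in ranges)
    last = max(hi for _, hi in ranges)
    base = first & ~(granularity - 1)   # round down to the granularity
    limit = last | (granularity - 1)    # round up to the end of a block
    return base, limit


# Port B: one endpoint with 256 bytes of IO and 4KB of NP-MMIO.
print([hex(v) for v in covering_window([(0x4000, 0x40FF)], IO_GRANULARITY)])
print([hex(v) for v in covering_window([(0xF900_0000, 0xF900_0FFF)], MEM_GRANULARITY)])

# Port A: must cover the IO ranges behind both Port B and Port C.
print([hex(v) for v in covering_window([(0x4000, 0x40FF), (0x5000, 0x5003)], IO_GRANULARITY)])
```

Note that Port A's own BAR range (2_3E00_0000h - 2_3E00_03FFh) is deliberately left out of its windows, because a bridge's Base/Limit registers only describe addresses downstream of that bridge.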

TLP Routing Basics


The purpose of setting up the BARs and Base/Limit registers as described in
the previous sections is to ensure that traffic targeting a function will be
routed correctly so the targeted function can see the transactions and claim
them. In shared-bus architectures like PCI, all the traffic is visible to
every device. The only time routing of requests happens is when the target is
on another bus and must cross a bridge. Since PCIe Links are point-to-point,
more routing will be needed to deliver transactions between devices.


Figure 4-12: Multi-Port PCIe Devices Have Routing Responsibilities

[Figure: CPU, Root Complex, system memory, a Switch, a Legacy Endpoint, and
two PCIe Endpoints connected by point-to-point links. Each link interface is
labeled IN (ingress port) and OUT (egress port). Traffic types arriving at a
port: Physical Layer Ordered Sets, Data Link Layer Packets (DLLPs), and
Transaction Layer Packets (TLPs). The receiving device must decide whether
the traffic is for internal use.]

As illustrated in Figure 4-12 on page 146, a PCI Express topology consists of
independent, point-to-point links connecting each device with one or more
neighbors. As traffic arrives at the inbound side of a link interface (called
the ingress port), the port checks for errors, then makes one of three
decisions:

1. Accept the traffic and use it internally
2. Forward the traffic to the appropriate outbound (egress) port
3. Reject the traffic because it is neither the intended target, nor an
   interface to it (Note that there are other reasons why traffic may be
   rejected)


Receivers Check For Three Types of Traffic


Assuming a link is fully operational, the receiver interface of each device
(ingress port) must detect and evaluate the arrival of the three types of
link traffic: Ordered Sets, Data Link Layer Packets (DLLPs), and Transaction
Layer Packets (TLPs). Ordered Sets and DLLPs are local to a link and thus are
never routed to another link. TLPs can and do move from link to link, based
on routing information contained in the packet headers.

Routing Elements
Devices with multiple ports, like Root Complexes and Switches, can forward
TLPs between the ports and are sometimes called Routing Agents or Routing
Elements. They accept TLPs that target internal resources and forward TLPs
between ingress and egress ports.

Interestingly, peer-to-peer routing support is required in Switches, but for
a Root Complex it's optional. Peer-to-peer traffic is typically where one
Endpoint sends packets that target another Endpoint.

Endpoints have only one Link and never expect to see ingress traffic other
than what is targeting them. They simply accept or reject incoming TLPs.

Three Methods of TLP Routing


General
TLPs can be routed based on address (either memory or IO), based on ID
(meaning Bus, Device, Function number), or routed implicitly. The routing
method used is based on the TLP type. Table 4-7 on page 147 summarizes the
TLP types and the routing methods used for each.

Table 4-7: PCI Express TLP Types And Routing Methods

TLP Type                                       Routing Method Used

Memory Read [Lock], Memory Write, AtomicOp     Address Routing
IO Read and Write                              Address Routing
Configuration Read and Write                   ID Routing
Message, Message With Data                     Address Routing, ID Routing,
                                               or Implicit Routing
Completion, Completion With Data               ID Routing

Messages are the only TLP type that supports more than one routing method.
Most of the message TLPs defined in the PCI Express spec use implicit
routing; however, the vendor-defined messages could use address routing or ID
routing if desired.

Purpose of Implicit Routing and Messages


In implicit routing, neither address nor ID routing information applies;
instead, the packet is routed based on a code in the packet header indicating
a destination with a known location in the topology, such as the Root
Complex. This simplifies routing of messages in the cases where a type of
implicit routing applies.

Why Messages? Message transactions were not defined in PCI or PCI-X, but were
introduced with PCIe. The main reason for adding Messages as a packet type
was to pursue the PCIe design goal to drastically reduce the number of
sideband signals implemented in PCI (e.g. interrupt pins, error pins, power
management signals, etc.). Consequently, most of the sideband signals were
replaced with in-band packets in the form of Message TLPs.

How Implicit Routing Helps: Using in-band messages in place of sideband
signals requires a means of routing them to the proper recipient in a
topology consisting of numerous point-to-point links. Implicit routing takes
advantage of the fact that Switches and other routing elements understand the
concept of upstream and downstream, and that the Root Complex is found at the
top of the topology while Endpoints are found at the bottom. As a result, a
Message can use a simple code to show that it should go to the Root Complex,
for example, or be sent to all devices downstream. This ability eliminates
the need to define address ranges or ID lists specifically used as the target
of different message transactions.

The different types of implicit routing can be found in "Implicit Routing" on
page 163.


Split Transaction Protocol


Like most other serial technologies, PCI Express uses the split transaction
protocol, which allows a target device to receive one or more requests and
then respond to each request with a separate completion. This is a
significant improvement over the PCI bus protocol that used wait-states or
delayed transactions (retries) to deal with latencies in accessing targets.
Instead of testing to see when the target becomes ready to do a long-latency
transfer, the target initiates the response whenever it's ready. This results
in at least two separate TLPs per transaction: the Request and the Completion
(as will be discussed later, a single read request may result in multiple
completion TLPs being sent back). Figure 4-13 on page 149 illustrates the
Request-Completion components of a split transaction. This example shows
software reading data from an Endpoint.

Figure 4-13: PCI Express Transaction Request And Completion TLPs

[Figure: CPU and Root Complex (with system memory) connected through a Switch
to a PCIe Endpoint. Each link interface has IN and OUT sides.
1) Request TLP (Memory Read): STP (K27.7), SEQ, HDR, LCRC, END (K29.7).
   Receiver decode of the STP symbol indicates the start of a TLP. SEQ is the
   TLP Sequence Number (2 bytes), HDR is the TLP Header (3DW or 4DW), LCRC is
   the Link CRC (4 bytes), and END is the END byte.
2) Completion w/Data TLP: STP (K27.7), SEQ, HDR, Data, LCRC, END (K29.7).]


Posted versus Non-Posted


To mitigate the penalty of the Request-Completion latency, memory write
transactions are "posted", meaning the transaction is considered completed
from the Requester's perspective as soon as the request leaves the Requester.
If helpful, you can associate the term "posting" with the postal system,
where posting a memory write is analogous to posting a letter in the mail.
Once you've placed a letter in the postal box, you put your faith in the
system to deliver it and don't wait for verification of delivery. This
approach can be much faster than waiting for the entire Request-Completion
transit, but, as in all posting schemes, uncertainty exists concerning when
(and if) the transaction completed successfully at the ultimate recipient.

In PCIe, the small amount of uncertainty involved in making all memory writes
posted is considered acceptable in exchange for the performance gained. By
contrast, writes to IO and configuration space almost always affect device
behavior and have a timeliness associated with them. Consequently, it is
important to know when (and if) those write requests completed. Because of
this, IO writes and configuration writes are always non-posted and a
completion will always be returned to report the status of the operation.

In summary, non-posted transactions require a completion. Posted transactions
do not require, and should never receive, a completion. Table 4-8 on page 150
lists which PCIe transactions are posted and non-posted.

Table 4-8: Posted and Non-Posted Transactions

Request                How Request Is Handled

Memory Write           All memory write requests are posted. No completions
                       are expected or sent.

Memory Read            All memory read requests are non-posted. A completion
Memory Read Lock       with data (made of one or more TLPs) will be returned
                       by the Completer to deliver both the requested data
                       and the status of the memory read. In the event of an
                       error, a completion without data will be returned
                       reporting the status.

AtomicOp               All AtomicOp requests are non-posted. A completion
                       with data will be returned by the Completer containing
                       the original value of the target location.

IO Read                All IO requests are non-posted. A completion without
IO Write               data will be returned for writes or failed reads, and
                       a completion with data will be returned for successful
                       reads.

Configuration Read     All configuration requests are non-posted. A
Configuration Write    completion without data will be returned for writes
                       and failed reads, while a completion with data will be
                       returned for successful reads.

Message                All messages are posted. The routing method depends on
                       the Message type, but they're all considered posted
                       requests.
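
The rules in Table 4-8 can be summarized in a few lines of code. This is only a sketch: the request names are the mnemonics used in Table 4-9, the function expected_completion is hypothetical, and the behavior for a failed AtomicOp (a completion without data) is an assumption made by analogy with failed reads, since the table does not state it.

```python
POSTED = {"MWr", "Msg", "MsgD"}
READS = {"MRd", "MRdLk", "IORd", "CfgRd0", "CfgRd1"}
WRITES = {"IOWr", "CfgWr0", "CfgWr1"}
ATOMIC_OPS = {"FetchAdd", "Swap", "CAS"}


def expected_completion(request, success=True):
    """Return the kind of completion a Requester should expect, or None."""
    if request in POSTED:
        return None                           # posted: no completion, ever
    if request in READS or request in ATOMIC_OPS:
        # Assumption: a failed AtomicOp is reported like a failed read.
        return "completion with data" if success else "completion without data"
    if request in WRITES:
        return "completion without data"      # status only
    raise ValueError(f"unknown request type: {request}")


print(expected_completion("MWr"))                  # None
print(expected_completion("CfgWr0"))               # completion without data
print(expected_completion("MRd", success=False))   # completion without data
```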

Header Fields Define Packet Format and Type


General
As shown in Figure 4-14 on page 152, each TLP contains a three- or
four-doubleword (12- or 16-byte) header. This includes Format and Type fields
that define the content of the rest of the header and indicate the routing
method to be used for the TLP as it traverses the topology.


Figure 4-14: Transaction Layer Packet Generic 3DW And 4DW Headers

[Figure: A Transaction Layer Packet (TLP) consists of Framing (STP), Sequence
Number, Header, Data, Digest, LCRC, and Framing (END). In the Generic 3DW
(12-byte) Header, bytes 0-3 hold, in order, the Fmt, Type, R, TC, R, Attr, R,
TH, TD, EP, Attr, AT, and Length fields, while bytes 4-7 and bytes 8-11 vary
with the Type field. The Generic 4DW (16-byte) Header has the same first
doubleword, and bytes 4-7, 8-11, and 12-15 vary with the Type field.]


Header Format/Type Field Encodings


Table 4-9 on page 153 below summarizes the encodings used in TLP header
Format and Type fields.

Table 4-9: TLP Header Format and Type Field Encodings

TLP                                          FMT[2:0]               TYPE[4:0]

Memory Read Request (MRd)                    000 = 3DW, no data     00000
                                             001 = 4DW, no data
Memory Read Lock Request (MRdLk)             000 = 3DW, no data     00001
                                             001 = 4DW, no data
Memory Write Request (MWr)                   010 = 3DW, w/ data     00000
                                             011 = 4DW, w/ data
IO Read Request (IORd)                       000 = 3DW, no data     00010
IO Write Request (IOWr)                      010 = 3DW, w/ data     00010
Config Type 0 Read Request (CfgRd0)          000 = 3DW, no data     00100
Config Type 0 Write Request (CfgWr0)         010 = 3DW, w/ data     00100
Config Type 1 Read Request (CfgRd1)          000 = 3DW, no data     00101
Config Type 1 Write Request (CfgWr1)         010 = 3DW, w/ data     00101
Message Request (Msg)                        001 = 4DW, no data     10RRR*
Message Request w/ Data (MsgD)               011 = 4DW, w/ data     10RRR*
Completion (Cpl)                             000 = 3DW, no data     01010
Completion w/ Data (CplD)                    010 = 3DW, w/ data     01010
Completion Locked (CplLk)                    000 = 3DW, no data     01011
Completion Locked w/ Data (CplDLk)           010 = 3DW, w/ data     01011
Fetch and Add AtomicOp Request (FetchAdd)    010 = 3DW, w/ data     01100
                                             011 = 4DW, w/ data
Unconditional Swap AtomicOp Request (Swap)   010 = 3DW, w/ data     01101
                                             011 = 4DW, w/ data
Compare and Swap AtomicOp Request (CAS)      010 = 3DW, w/ data     01110
                                             011 = 4DW, w/ data
Local TLP Prefix (LPrfx)                     100 = 1DW              0LLLL
End-to-End TLP Prefix (EPrfx)                100 = 1DW              1EEEE

* For RRR, see the routing subfield in "Message Type Field Summary" on
  page 164.
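
To make the encodings in Table 4-9 concrete, here is a small decoding sketch. It assumes the layout shown in Figure 4-14, where byte 0 of the header carries Fmt in its upper three bits and Type in its lower five bits. The dictionary ENCODINGS and the function decode_fmt_type are hypothetical names used only for this illustration.

```python
# (Fmt, Type) -> mnemonic, transcribed from Table 4-9.
ENCODINGS = {
    (0b000, 0b00000): "MRd",      (0b001, 0b00000): "MRd",
    (0b000, 0b00001): "MRdLk",    (0b001, 0b00001): "MRdLk",
    (0b010, 0b00000): "MWr",      (0b011, 0b00000): "MWr",
    (0b000, 0b00010): "IORd",     (0b010, 0b00010): "IOWr",
    (0b000, 0b00100): "CfgRd0",   (0b010, 0b00100): "CfgWr0",
    (0b000, 0b00101): "CfgRd1",   (0b010, 0b00101): "CfgWr1",
    (0b000, 0b01010): "Cpl",      (0b010, 0b01010): "CplD",
    (0b000, 0b01011): "CplLk",    (0b010, 0b01011): "CplDLk",
    (0b010, 0b01100): "FetchAdd", (0b011, 0b01100): "FetchAdd",
    (0b010, 0b01101): "Swap",     (0b011, 0b01101): "Swap",
    (0b010, 0b01110): "CAS",      (0b011, 0b01110): "CAS",
}


def decode_fmt_type(byte0):
    """Map the first header byte to a TLP mnemonic from Table 4-9."""
    fmt = (byte0 >> 5) & 0b111
    tlp_type = byte0 & 0b11111
    if fmt == 0b100:                                    # TLP prefixes
        return "EPrfx" if tlp_type & 0b10000 else "LPrfx"
    if (tlp_type >> 3) == 0b10 and fmt in (0b001, 0b011):
        return "MsgD" if fmt == 0b011 else "Msg"        # Type = 10RRR
    return ENCODINGS.get((fmt, tlp_type), "unknown")


print(decode_fmt_type(0x40))  # MWr  (Fmt 010, Type 00000)
print(decode_fmt_type(0x4A))  # CplD (Fmt 010, Type 01010)
print(decode_fmt_type(0x30))  # Msg  (Fmt 001, Type 10000)
```

Bit 1 of Fmt distinguishes "no data" from "with data" and bit 0 distinguishes a 3DW header from a 4DW header, which is why MRd and MWr can share the same Type value.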

TLP Header Overview


When TLPs are received at an ingress port, they are first checked for errors
at the Physical and Data Link Layers. If there are no errors, the TLP is
examined at the Transaction Layer to learn which routing method is to be
used. The basic steps are:

1. Format and Type fields determine the header size, format and type of the
   packet.
2. Depending on the routing method associated with the packet type, the
   device determines whether it's the intended recipient. If so, it will
   accept (consume) the TLP, but if not, it will forward the TLP to the
   appropriate egress port, subject to the rules for ordering and flow
   control for that egress port.
3. If this device is not the intended recipient nor is it in the path to the
   intended recipient, it will generally reject the packet as an Unsupported
   Request (UR).


Applying Routing Mechanisms


Once the system addresses have been configured and transactions are enabled,
devices examine incoming TLPs and use the corresponding configuration fields
to route the packet. The following sections describe the basic
features/functionality of each routing mechanism used in routing TLPs through
the PCI Express fabric.

ID Routing
ID routing is used to target the logical position (Bus Number, Device Number,
Function Number, typically referred to as BDF) of a Function within the
topology. It's compatible with routing methods used in the PCI and PCI-X
protocols for configuration transactions. In PCIe, it is still used for
routing configuration packets and is also used to route completions and some
messages.

Bus Number, Device Number, Function Number Limits


PCI Express supports the same topology limits as PCI and PCI-X:

1. Eight bits are used to give the bus number, so a maximum of 256 busses are
   possible in a system. This includes internal busses created by Switches.
2. Five bits give the device number, so a maximum of 32 devices are possible
   per bus. An older PCI bus or an internal bus in a switch or root complex
   may host more than one downstream device. However, external PCIe links
   are always point-to-point and there's only one downstream device on the
   link. The device number for an external link is forced by the downstream
   port to always be Device 0, so every external Endpoint will always be
   Device 0 (unless using Alternative Routing-ID Interpretation (ARI), in
   which case, there are no device numbers; more about ARI can be found in
   the section on "IDO (ID-based Ordering)" on page 909).
3. Three bits give the function number, so a maximum of 8 internal functions
   is possible per device.
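
The field widths above (8 + 5 + 3 bits) mean a BDF fits in a 16-bit ID. The sketch below packs and unpacks such an ID, assuming the bus number occupies the most significant byte, followed by the device and function numbers, as the ID routing header figures lay them out. The function names are hypothetical, and ARI (where the device field is reinterpreted) is not modeled.

```python
def pack_bdf(bus, device, function):
    """Pack Bus/Device/Function numbers into a 16-bit ID."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("BDF field out of range")
    return (bus << 8) | (device << 3) | function


def unpack_bdf(bdf):
    """Split a 16-bit ID into (bus, device, function)."""
    return (bdf >> 8) & 0xFF, (bdf >> 3) & 0x1F, bdf & 0x7


bdf = pack_bdf(bus=3, device=0, function=1)
print(f"{bdf:04X}h")      # 0301h
print(unpack_bdf(bdf))    # (3, 0, 1)
```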

Key TLP Header Fields in ID Routing


If the Type field in a received TLP indicates ID routing is to be used, then
the ID fields in the header (Bus, Device, Function) are used to perform the
routing check. There are two cases: ID routing with a 3DW header and ID
routing with a 4DW header (only possible in messages). Figure 4-15 on page
156 illustrates a TLP using ID routing and the 3DW header, while Figure 4-16
on page 156 shows the 4DW header for ID routing.


Figure 4-15: 3DW TLP Header ID Routing Fields

[Figure: 3DW Header Using ID Routing. Byte 0: Fmt (shown as 0x0), Type, R,
TC, R, Attr, R, TH, TD, EP, Attr, AT, Length. Byte 4: bytes 4-7 vary with the
Type field. Byte 8: Bus Number, Device, Func (the Device and Func fields
together are the Function Number with ARI); bytes 10-11 vary with the Type
field.]

Figure 4-16: 4DW TLP Header ID Routing Fields

[Figure: 4DW Header Using ID Routing. Byte 0: Fmt (shown as 0x1), Type, R,
TC, R, Attr, R, TH, TD, EP, Attr, AT, Length. Byte 4: bytes 4-7 vary with the
Type field. Byte 8: Bus Number, Device, Func (the Device and Func fields
together are the Function Number with ARI); bytes 10-11 vary with the Type
field. Byte 12: bytes 12-15 vary with the Type field.]

Endpoints: One Check


For ID routing, an Endpoint simply checks the ID field in the packet header
against its own BDF. Each function captures its own Bus and Device Number
from bytes 8-9 in the TLP Header every time a Type 0 configuration write is
seen on its link. Where the captured Bus and Device Number information should
be stored is not specified, only that functions must save it. The saved Bus
and Device numbers are used as the Requester ID in TLP requests that this
Endpoint initiates so the Completer of that request can include the Requester
ID value in the completion packet(s). The Requester ID in a completion packet
is used to route the completion.

Switches (Bridges): Two Checks Per Port


For an ID-routed TLP, a switch port first checks to see whether it is the
intended target by comparing the target ID in the TLP Header against its own
BDF, as shown by (1) in Figure 4-17 on page 158. As was true for an Endpoint,
each switch port captures its own Bus and Device number every time a
configuration write (Type 0) is detected on its Upstream Port. If the target
ID field in the TLP agrees with the ID of the switch port, it consumes the
packet. If the ID field doesn't match, it then checks to see if the TLP is
targeting a device below this switch port. It does this by checking the
Secondary and Subordinate Bus Number registers to see if the target Bus
Number in the TLP is within this range (inclusive). If so, then the TLP
should be forwarded downstream. This check is indicated by (2) in Figure 4-17
on page 158. If the packet was moving downstream (arrived on the Upstream
Port) and doesn't match the BDF of the Upstream Port or fall within the
Secondary-Subordinate bus range, it will be handled as an Unsupported Request
on the Upstream Port.

If the Upstream Port determines that a TLP it received is for one of the
devices beneath it (because the target bus number was within the range of its
Secondary-Subordinate bus number range), then it forwards it downstream and
all the downstream ports of the switch perform the same checks. Each
downstream port checks to see if the TLP is targeting them. If so, the
targeted port will consume the TLP and the other ports ignore it. If not, all
downstream ports check to see if the TLP is targeting a device beneath their
port. The one port that returns true on that check will forward the TLP to
its Secondary Bus and the other downstream ports ignore the TLP.

In this section, it is important to remember that each port on a switch is a
Bridge, and thus has its own configuration space with a Type 1 Header. Even
though Figure 4-17 on page 158 only shows a single Type 1 Header, in reality,
each port (each P2P Bridge) has its own Type 1 Header and performs the same
two checks on TLPs when they are seen by that port.
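
The two checks described above can be sketched as follows for a TLP moving downstream into a switch's Upstream Port. The BridgePort class and route_by_id function are hypothetical names for this example, and the bus numbers used are made up for illustration rather than taken from the figures.

```python
from dataclasses import dataclass


@dataclass
class BridgePort:
    bdf: tuple             # (bus, device, function) captured by this port
    secondary_bus: int     # Secondary Bus Number register
    subordinate_bus: int   # Subordinate Bus Number register


def route_by_id(port, target_bdf):
    """Decide what a port does with an ID-routed TLP arriving from upstream."""
    # Check 1: is the packet for me?
    if target_bdf == port.bdf:
        return "consume"
    # Check 2: is the packet for someone beneath me? (inclusive bus range)
    if port.secondary_bus <= target_bdf[0] <= port.subordinate_bus:
        return "forward downstream"
    return "unsupported request"


upstream_port = BridgePort(bdf=(1, 0, 0), secondary_bus=2, subordinate_bus=4)
print(route_by_id(upstream_port, (1, 0, 0)))  # consume
print(route_by_id(upstream_port, (3, 0, 0)))  # forward downstream
print(route_by_id(upstream_port, (7, 0, 0)))  # unsupported request
```

Each downstream port of the switch would run the same function with its own BDF and bus-number registers, and only one of them should find the target within its range.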


Figure 4-17: Switch Checks Routing Of An Inbound TLP Using ID Routing

[Figure: The example topology (CPU, Root Complex, system memory, a switch
made of P2P bridges, and two PCIe Endpoints) shown beside a Type 1
configuration header. The switch port compares the TLP ID field (BDF) in two
steps: 1. "Packet for me?" and 2. "Packet for someone beneath me?", the
latter using the Secondary Bus # and Subordinate Bus # registers at offset
18h.]

Address Routing
TLPs that use address routing refer to the same memory (system memory and
memory-mapped IO) and IO address maps that PCI and PCI-X transactions do.
Memory requests targeting an address below 4GB (i.e. a 32-bit address) must
use a 3DW header, and requests targeting an address at or above 4GB (i.e. a
64-bit address) must use a 4DW header. IO requests are restricted to 32-bit
addresses and are only implemented to support legacy functionality.


Key TLP Header Fields in Address Routing


When the Type field indicates address routing is to be used for a TLP, then
the Address Fields in the header are used to perform the routing check. These
can be 32-bit addresses or 64-bit addresses.

TLPs with 32-Bit Address: For IO or 32-bit memory requests, a 3DW header is
used as shown in Figure 4-18. The memory-mapped registers targeted with these
TLPs will therefore reside below the 4GB memory or IO address boundary.

TLPs with 64-Bit Address: For 64-bit memory requests, a 4DW header is used as
shown in Figure 4-19 on page 160. The memory-mapped registers targeted with
these TLPs are able to reside above the 4GB memory boundary.
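
A short sketch of the header-size rule for memory requests: addresses that fit in 32 bits use a 3DW header with one address doubleword, and larger addresses use a 4DW header with the upper 32 bits first. The low two address bits are cleared because the header carries Address[31:2] followed by reserved bits. The function address_header is a hypothetical helper that returns only the header size and the address doublewords, not a complete TLP header.

```python
def address_header(addr):
    """Return (header size in DWs, list of address doublewords) for a memory request."""
    if addr < 0 or addr >> 64:
        raise ValueError("address must fit in 64 bits")
    low = addr & 0xFFFF_FFFC          # Address[31:2], low 2 bits reserved
    high = addr >> 32                 # Address[63:32]
    if high == 0:
        return 3, [low]               # 3DW header: address in bytes 8-11
    return 4, [high, low]             # 4DW header: bytes 8-11, then 12-15


print(address_header(0xF900_0000))    # (3, [4177526784])
print(address_header(0x2_4000_0004))  # (4, [2, 1073741828])
```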

Figure 4-18: 3DW TLP Header Address Routing Fields

[Figure: 3DW Header Using Address Routing. Byte 0: Fmt (shown as 0x0), Type,
R, TC, R, Attr, R, TH, TD, EP, Attr, AT, Length. Byte 4: bytes 4-7 vary with
the Type field. Byte 8: Address [31:2], followed by reserved bits (R).]

Figure 4-19: 4DW TLP Header Address Routing Fields

[Figure: 4DW Header Using Address Routing. Byte 0: Fmt (shown as 0x1), Type,
R, TC, R, Attr, R, TH, TD, EP, Attr, AT, Length. Byte 4: bytes 4-7 vary with
the Type field. Byte 8: Address [63:32]. Byte 12: Address [31:2], followed by
reserved bits (R).]

Endpoint Address Checking


If an Endpoint receives a TLP that uses address routing, then it checks the
address in the header against each of its implemented Base Address Registers
(BARs) in its configuration header, as shown in Figure 4-20. Since Endpoints
only have one link interface, it will either accept the packet or reject it.
The Endpoint will accept the packet if the target address in the TLP matches
one of the ranges programmed into its BARs. More info on how the BARs are
used can be found in section "Base Address Registers (BARs)" on page 126.
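
The Endpoint check can be sketched as a simple range match. The BAR ranges below are the ones shown in Figure 4-11 for the Endpoint beneath Port B. The function endpoint_accepts and the tuple layout (address space, first address, last address) are assumptions made for this example.

```python
def endpoint_accepts(bars, space, addr):
    """True if addr in the given space ('mem' or 'io') falls within a BAR range."""
    return any(s == space and first <= addr <= last for s, first, last in bars)


# Endpoint beneath Port B in Figure 4-11.
bars = [
    ("mem", 0xF900_0000, 0xF900_0FFF),      # BAR0: NP-MMIO (4KB)
    ("mem", 0x2_4000_0000, 0x2_43FF_FFFF),  # BAR1-2: P-MMIO (64MB)
    ("io", 0x4000, 0x40FF),                 # BAR3: IO (256 bytes)
]

print(endpoint_accepts(bars, "mem", 0xF900_0800))  # True
print(endpoint_accepts(bars, "io", 0x4100))        # False
```

The second call is a reminder that the upstream bridge's 4KB IO window (4000h - 4FFFh) is wider than what the Endpoint actually owns: the bridge would forward an IO request to 4100h, but the Endpoint would not accept it.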


Figure 4-20: Endpoint Checks Incoming TLP Address

[Figure: The example topology (CPU, Root Complex, system memory, a switch
made of P2P bridges, and two PCIe Endpoints) shown beside a Type 0
configuration header. A TLP carrying an address travels down to an Endpoint,
which asks "Packet for me?" by comparing the address against Base Address 0
(BAR0) through Base Address 5 (BAR5) at offsets 10h - 24h. The TLP Address
field should match a BAR within a PCIe Function.]

Switch Routing
If an incoming TLP uses address routing, a Switch Port first checks to see if
the address is local within the Port itself by comparing the address in the
packet header against its two BARs in its Type 1 configuration header, as
shown in Step 1 of Figure 4-21 on page 162. If it matches one of these BARs,
the switch port is the target of the TLP and consumes the packet. If not, the
port then checks its Base/Limit register pairs to see if the TLP is targeting
a function beneath (downstream of) this bridge. If the Request targets IO
space, it will check the IO Base and Limit registers, as shown in Step 2a.
However, if the Request targets memory space, it will check the
Non-prefetchable Memory Base/Limit registers and the Prefetchable Memory
Base/Limit registers, as indicated by Step 2b in Figure 4-21 on page 162.
More info on how the Base/Limit register pairs are evaluated can be found in
section "Base and Limit Registers" on page 136.


Figure 4-21: Switch Checks Routing Of An Inbound TLP Using Address

[Figure: The example topology (CPU, Root Complex, system memory, a switch
made of P2P bridges, and two PCIe Endpoints) shown beside a Type 1
configuration header. The switch port checks the TLP address in steps:
1. "Packet for me?" (BAR0 and BAR1); 2a. "IO Packet for someone beneath me?"
(IO Base and IO Limit registers); 2b. "Mem Packet for someone beneath me?"
(Non-Prefetchable Memory Base/Limit and Prefetchable Memory Base/Limit
registers).]

To understand routing of address-based TLPs in switches, it is good to
remember that each switch port is its own bridge. Below are the steps that a
bridge (switch port) takes upon receiving an address-based TLP:

Downstream Traveling TLPs (Received on Primary Interface)

1. IF the target address in the TLP matches one of the BARs, then this bridge
   (switch port) consumes the TLP because it is the target of the TLP.
2. IF the target address in the TLP falls in the range of one of its
   Base/Limit register sets, the packet will be forwarded to the secondary
   interface (downstream).
3. ELSE the TLP will be handled as an Unsupported Request on the primary
   interface. (This is true if no other bridges on the primary interface
   claim the TLP either.)


Upstream Traveling TLPs (Received on Secondary Interface)

1. IF the target address in the TLP matches one of the BARs, then this bridge
   (switch port) consumes the TLP because it is the target of the TLP.
2. IF the target address in the TLP falls in the range of one of its
   Base/Limit register sets, the TLP will be handled as an Unsupported
   Request on the secondary interface. (This is true unless this port is the
   upstream port of the switch. In these cases, the packet may be a
   peer-to-peer transaction and will be forwarded downstream on a different
   downstream port than the one it was received on.)
3. ELSE the TLP will be forwarded to the primary interface (upstream) given
   that the TLP address is not for this bridge and is not for any function
   beneath this bridge.
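
The downstream and upstream steps above can be sketched as one decision function. This is a simplified model under stated assumptions: it treats all ranges as belonging to one address space, it ignores the peer-to-peer exception for a switch's upstream port, and route_by_address is a hypothetical name. The windows in the usage example are Port B's memory ranges from Figure 4-11.

```python
def in_any(ranges, addr):
    """True if addr falls within any inclusive (first, last) range."""
    return any(first <= addr <= last for first, last in ranges)


def route_by_address(bars, windows, addr, received_on):
    """received_on: 'primary' (TLP moving downstream) or 'secondary' (moving upstream)."""
    if received_on not in ("primary", "secondary"):
        raise ValueError("received_on must be 'primary' or 'secondary'")
    if in_any(bars, addr):
        return "consume"                                  # step 1, both directions
    beneath = in_any(windows, addr)
    if received_on == "primary":
        return "forward to secondary" if beneath else "unsupported request"
    # Received on the secondary interface: an address that lives beneath this
    # bridge should not be arriving from below it.
    return "unsupported request" if beneath else "forward to primary"


port_b_windows = [(0xF900_0000, 0xF90F_FFFF), (0x2_4000_0000, 0x2_43FF_FFFF)]
print(route_by_address([], port_b_windows, 0xF900_0010, "primary"))      # forward to secondary
print(route_by_address([], port_b_windows, 0x2_4400_0000, "secondary"))  # forward to primary
```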

Multicast Capabilities
The 2.1 version of the PCI Express specification added support for specifying
a range of addresses that provide multicast functionality. Any packets
received that fall within the address range specified as the multicast range
are routed/accepted according to the multicast rules. This address range
might not be reserved in a function's BARs and might not be within a bridge's
Base/Limit register pair, but would still need to be accepted/forwarded
appropriately. More info can be found on the multicast functionality in the
section on "Multicast Capability Registers" on page 889.

Implicit Routing
Implicit routing, used in some message packets, is based on the awareness of
routing elements that the topology has upstream and downstream directions and
a single Root Complex at the top. This allows some simple routing methods
without the need to assign a target address or ID. Since the Root Complex
generally integrates power management, interrupt, and error handling logic,
it is either the source or recipient of most PCI Express messages.

Only for Messages


Some messages use address or ID routing rather than implicit routing, and for
them, the routing mechanisms are applied in the same way as described in
those sections. However, most messages use implicit routing. The purpose of
implicit routing is to mimic sideband signal behavior since a design goal for
PCIe was to eliminate as many sideband signals from PCI as possible. These
sideband signals in PCI were typically either the host notifying all devices
of an event or devices notifying the host of an event. In PCIe, we have
Message TLPs to convey these events. The types of events that PCIe has
defined messages for are:
- Power Management
- INTx legacy interrupt signaling
- Error signaling
- Locked Transaction support
- Hot Plug signaling
- Vendor-specific signaling
- Slot Power Limit settings

Key TLP Header Fields in Implicit Routing


For implicit routing, the routing subfield in the header is used to determine
the message destination. Figure 4-22 on page 164 illustrates a message TLP
using implicit routing.

Figure 4-22: 4DW Message TLP Header Implicit Routing Fields

[Figure: 4DW Header for Messages. Byte 0: Fmt (shown as 0x1) and Type = 10rrr,
where rrr is the message routing subfield, followed by the R, TC, R, Attr, R,
TH, TD, EP, Attr, AT, and Length fields. Byte 4: Requester ID, Tag, and
Message Code. Bytes 8-11 and bytes 12-15 vary with the Message Code field.]

Message Type Field Summary


Table 4-10 on page 165 shows how the TLP header Type field for Messages is
interpreted. As shown, the upper two bits indicate the packet is a Message
while the lower three bits specify the routing method to apply. Note that
Message TLPs always use a 4DW header regardless of the routing option
selected. For address routing, bytes 8-15 contain up to a 64-bit address, and
for ID routing, bytes 8 and 9 contain the target BDF.

Table 4-10: Message Request Header Type Field Usage

Type Field Bits     Description

Bit 4:3             Defines the type of transaction:
                    10b = Message TLP

Bit 2:0             Message Routing Subfield R[2:0]
                    000b = Implicit - Route to the Root Complex
                    001b = Route by Address (bytes 8-15 of header contain address)
                    010b = Route by ID (bytes 8-9 of header contain ID)
                    011b = Implicit - Broadcast downstream
                    100b = Implicit - Local: terminate at receiver
                    101b = Implicit - Gather & route to the Root Complex
                    110b-111b = Reserved: terminate at receiver
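
The Type field interpretation in Table 4-10 can be sketched in a few lines of Python. This is only an illustration: the function name decode_message_type and the description strings are made up for this example and do not come from the spec.

```python
# Routing subfield values transcribed from Table 4-10 (Type[2:0] of a Message TLP).
MESSAGE_ROUTING = {
    0b000: "Implicit - Route to the Root Complex",
    0b001: "Route by Address",
    0b010: "Route by ID",
    0b011: "Implicit - Broadcast downstream",
    0b100: "Implicit - Local: terminate at receiver",
    0b101: "Implicit - Gather & route to the Root Complex",
}


def decode_message_type(type_field: int) -> str:
    """Return the routing method encoded in a 5-bit Message Type field.

    Hypothetical helper for illustration; raises if Type[4:3] is not 10b.
    """
    if (type_field >> 3) & 0b11 != 0b10:
        raise ValueError("Type[4:3] is not 10b, so this is not a Message TLP")
    routing = type_field & 0b111
    # 110b and 111b are reserved and terminate at the receiver.
    return MESSAGE_ROUTING.get(routing, "Reserved: terminate at receiver")
```

For example, decode_message_type(0b10011) reports a downstream broadcast, while a Type value of 00000b (a memory request) is rejected.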

Endpoint Handling
For implicit routing, an Endpoint simply checks whether the routing subfield is
appropriate for it. For example, an Endpoint will accept a Broadcast Message or
a Message that terminates at the receiver, but not Messages that implicitly target
the Root Complex.

Switch Handling
Routing elements like Switches consider the port on which the TLP arrived
and whether the routing subfield code is appropriate for it. For example:

1. A Switch Upstream Port may legitimately receive a Broadcast Message. It
   will duplicate that and forward it to all its Downstream Ports. An implicitly
   routed Broadcast Message received on a Downstream Port of a Switch
   (meaning the message was traveling upstream) would be an error that
   would be handled as a Malformed TLP.
2. A Switch may receive implicitly routed Messages for the Root Complex on
   Downstream Ports and will forward these to its Upstream Port because the
   location of the Root Complex is understood to be upstream. It would not
   accept Messages received on its Upstream Port (meaning the message was
   traveling downstream) that are implicitly routed to the Root Complex.

3. If an implicitly routed Message indicates it should terminate at the receiver,
   then the receiving switch port will consume the message rather than
   forward it.
4. For messages routed using address or ID routing, a Switch will simply
   perform normal address or ID checks in deciding whether to accept or forward
   it.
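
The four Switch cases listed above can be summarized as a small decision function. This is a minimal sketch under stated assumptions: the function name, the "upstream"/"downstream" strings, and the returned action strings are all hypothetical, and the Gather code (101b) is deliberately left out because its Switch behavior is not described in this list.

```python
def switch_handle_message(ingress_port: str, routing: int) -> str:
    """Decide what a Switch does with a Message TLP.

    ingress_port: "upstream" or "downstream", the Switch port the TLP arrived on.
    routing:      the 3-bit routing subfield R[2:0] from the Type field.
    """
    if ingress_port not in ("upstream", "downstream"):
        raise ValueError("ingress_port must be 'upstream' or 'downstream'")

    if routing in (0b001, 0b010):
        # Case 4: address- or ID-routed Messages use the normal routing checks.
        return "apply normal address/ID routing checks"
    if routing == 0b011:
        # Case 1: a Broadcast is only legal when it is traveling downstream.
        if ingress_port == "upstream":
            return "forward a copy to all Downstream Ports"
        return "error: handle as Malformed TLP"
    if routing == 0b000:
        # Case 2: the Root Complex is understood to be upstream.
        if ingress_port == "downstream":
            return "forward to Upstream Port"
        return "do not accept"
    if routing in (0b100, 0b110, 0b111):
        # Case 3 (and the reserved codes, per Table 4-10): terminate here.
        return "consume at receiving port"
    raise NotImplementedError("Gather (101b) handling is not covered by this sketch")
```

Keeping the ingress port as an explicit argument mirrors the point made in the text: the same routing code can be legal on one port of a Switch and an error on another.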

DLLPs and Ordered Sets Are Not Routed


DLLP and Ordered Set traffic is not routed from ingress ports to egress ports of
switches or root complexes. These packets move from port to port across a link
from Physical Layer to Physical Layer.

DLLPs originate at the Data Link Layer of a PCI Express port, pass through the
Physical Layer, exit the port, traverse the Link and arrive at the neighboring
port. At this port, the packet passes through the Physical Layer and ends up at
the Data Link Layer where it is processed and consumed. DLLPs do not
proceed further up the port to the Transaction Layer and hence are not routed.

Similarly, Ordered Set packets originate at the Physical Layer, exit the port,
traverse the Link and arrive at the neighboring port. At this port, the packet
arrives at the Physical Layer where it is processed and consumed. Ordered Sets
do not proceed further up the port to the Data Link Layer and Transaction
Layer and hence are not routed.

As has been discussed in this chapter, only TLPs are routed through switches
and root complexes. They originate at the Transaction Layer of a source port and
end up at the Transaction Layer of a destination port.

Part Two:

Transaction Layer

5 TLP Elements
The Previous Chapter
The previous chapter describes the purpose and methods of a function requesting
address space (either memory address space or IO address space) through
Base Address Registers (BARs) and how software must set up the Base/Limit
registers in all bridges to route TLPs from a source port to the correct destination
port. The general concepts of TLP routing in PCI Express are also discussed,
including address-based routing, ID-based routing and implicit routing.

This Chapter
Information moves between PCI Express devices in packets. The three major
classes of packets are Transaction Layer Packets (TLPs), Data Link Layer Packets
(DLLPs) and Ordered Sets. This chapter describes the use, format, and definition
of the variety of TLPs and the details of their related fields. DLLPs are described
separately in Chapter 9, entitled "DLLP Elements," on page 307.

The Next Chapter


The next chapter discusses the purposes and detailed operation of the Flow
Control Protocol. Flow control is designed to ensure that transmitters never
send Transaction Layer Packets (TLPs) that a receiver can't accept. This prevents
receive buffer overruns and eliminates the need for PCI-style inefficiencies like
disconnects, retries, and wait states.

Introduction to Packet-Based Protocol

General
Unlike parallel buses, serial transport buses like PCIe use no control signals to
identify what's happening on the Link at a given time. Instead, the bit stream
they send must have an expected size and a recognizable format to make it
possible for the receiver to understand the content. In addition, PCIe does not use
any immediate handshake for the packet while it is being transmitted.

With the exception of the Logical Idle symbols and Physical Layer packets
called Ordered Sets, information moves across an active PCIe Link in fundamental
chunks called packets that are comprised of symbols. The two major classes
of packets exchanged are the high-level Transaction Layer Packets (TLPs), and
low-level Link maintenance packets called Data Link Layer Packets (DLLPs). The
packets and their flow are illustrated in Figure 5-1 on page 170. Ordered Sets are
packets too; however, they are not framed with a start and end symbol like
TLPs and DLLPs are. They are also not byte striped like TLPs and DLLPs are.
Ordered Set packets are instead replicated on all Lanes of a Link.

Figure 5-1: TLP And DLLP Packets

[Figure: PCIe Device A and PCIe Device B, each with a Device Core, Transaction
Layer, Data Link Layer, and Physical Layer (RX/TX), exchange TLPs and DLLPs
across the Link.]

Transaction Layer Packet (TLP):
  STP | Seq Num | HDR | Data | Digest | CRC | End
TLP Types:
  - Memory Read / Write
  - IO Read / Write
  - Configuration Read / Write
  - Completion
  - Message
  - AtomicOp

Data Link Layer Packet (DLLP):
  Framing (SDP) | DLLP | CRC | Framing (END)
DLLP Types:
  - TLP Ack/Nak
  - Power Management
  - Link Flow Control
  - Vendor-Specific

Motivation for a Packet-Based Protocol


There are three distinct advantages to using a packet-based protocol, especially
when it comes to data integrity:

1. Packet Formats Are Well Defined


Earlier buses like PCI allow transfers of indeterminate size, making identification
of payload boundaries impossible until the end of the transfer. In addition,
either device is able to terminate the transfer before it completes, making it
difficult for the sender to calculate and send a checksum or CRC covering an entire
payload. Instead, PCI uses a simple parity scheme and checks it on each data
phase.

By comparison, PCIe packets have a known size and format. The packet header
at the beginning indicates the packet type and contains the required and
optional fields. The size of the header fields is fixed except for the address,
which can be 32 bits or 64 bits in size. Once a transfer commences, the recipient
can't pause or terminate it early. This structured format allows including
information in the TLPs to aid in reliable delivery, including framing symbols, CRC,
and a packet Sequence Number.

2. Framing Symbols Define Packet Boundaries


When using 8b/10b encoding in Gen1 and Gen2 mode of operation, each TLP
and DLLP packet sent is framed by Start and End control symbols, clearly
defining the packet boundaries for the receiver. This is a big improvement over
PCI and PCI-X, where the assertion and de-assertion of the single FRAME# signal
indicates the beginning and end of a transaction. A glitch on that signal (or
any of the other control signals) could cause a target to misconstrue bus events.
A PCIe receiver must properly decode a complete 10-bit symbol before concluding
Link activity is beginning or ending, so unexpected or unrecognized symbols
are more easily recognized and handled as errors.

For the 128b/130b encoding used in Gen3, control characters are no longer
employed and there are no framing symbols as such. For more on the differences
between Gen3 encoding and the earlier versions, see Chapter 12, entitled
"Physical Layer - Logical (Gen3)," on page 407.

3. CRC Protects Entire Packet


Unlike the sideband parity signals used by PCI during the address and data
phases of a transaction, the in-band CRC value of PCIe verifies error-free delivery
of the entire packet. TLP packets also have a Sequence Number appended to
them by the transmitter's Data Link Layer so that if an error is detected at the
Receiver, the problem packet can be automatically resent. The transmitter
maintains a copy of each TLP sent in a Retry Buffer until it has been acknowledged by
the receiver. This TLP acknowledgement mechanism, called the Ack/Nak Protocol
(and described in Chapter 10, entitled "Ack/Nak Protocol," on page 317),
forms the basis of Link-level TLP error detection and correction. This Ack/Nak
Protocol error recovery mechanism allows for a timely resolution of the problem
at the place or Link where the problem occurred, but requires a local hardware
solution to support it.

Transaction Layer Packet (TLP) Details


In PCI Express, high-level transactions originate in the device core of the
transmitting device and terminate at the core of the receiving device. The Transaction
Layer acts on these requests to assemble outbound TLPs in the Transmitter and
interpret them at the Receiver. Along the way, the Data Link Layer and Physical
Layer of each device also contribute to the final packet assembly.

TLP Assembly And Disassembly


The general flow of TLP assembly at the transmit side of a Link and disassembly
at the receiver is shown in Figure 5-2 on page 173. Let's now walk through
the steps from creation of a packet to its delivery to the core logic of the receiver.
The key stages in Transaction Layer Packet assembly and disassembly are listed
below. The list numbers correspond to the numbers in Figure 5-2 on page 173.

Transmitter:
1. The core logic of Device A sends a request to its PCIe interface. How this is
   accomplished is outside the scope of the spec or this book. The request
   includes:
   • Target address or ID (routing information)
   • Source information such as Requester ID and Tag
   • Transaction type/packet type (Command to perform, such as a memory
     read.)
   • Data payload size (if any) along with data payload (if any)
   • Traffic Class (to assign packet priority)
   • Attributes of the Request (No Snoop, Relaxed Ordering, etc.)

2. Based on that request, the Transaction Layer builds the TLP header,
   appends any data payload, and optionally calculates and appends the
   digest (End-to-End CRC, ECRC) if that's supported and has been enabled.
   At this point the TLP is placed into a Virtual Channel buffer. The Virtual
   Channel manages the sequence of TLPs according to the Transaction Ordering
   rules and also verifies that the receiver has enough flow control credits
   to accept a TLP before it can be passed down to the Data Link Layer.
3. When it arrives at the Data Link Layer, the TLP is assigned a Sequence
   Number and then a Link CRC is calculated based on the contents of the TLP
   and that Sequence Number. A copy of the resulting packet is saved in the
   Retry Buffer in case of transmission errors while it is also passed on to the
   Physical Layer.

Figure 5-2: PCIe TLP Assembly/Disassembly

[Figure: Device A (transmitter) and Device B (receiver), each with a Device Core,
Transaction Layer, Data Link Layer, and Physical Layer (RX/TX). The numbered
stages correspond to the list items in this section.]

(1)      Outbound From Transmitter Core: requests to write/read data,
         Completions, Messages, etc.
(2), (7) Transaction Layer:  HDR | Data | Digest
(3), (6) Data Link Layer:    Seq Num | HDR | Data | Digest | CRC
(4), (5) Physical Layer:     STP | Seq Num | HDR | Data | Digest | CRC | End
(8)      Inbound To Receiver Core: data R/W requests, Completions,
         Messages, etc.

4. The Physical Layer does several things to prepare the packet for serial
   transmission, including byte striping, scrambling, encoding, and serializing
   the bits. For Gen1 and Gen2 devices, when using 8b/10b encoding, the control
   characters STP and END are added to either end of the packet. Finally,
   the packet is transmitted across the Link. In Gen3 mode, an STP token is added
   to the front end of a TLP, but END is not added to the end of the packet.
   Rather, the STP token contains information about the TLP packet size.

Receiver:
5. At the Receiver (Device B in this example), everything done to prepare the
   packet for transmission must now be undone. The Physical Layer deserializes
   the bit stream, decodes the resulting symbols, and unstripes the bytes.
   The control characters are removed here because they only have meaning at
   the Physical Layer, and then the packet is forwarded to the Data Link Layer.
6. The Data Link Layer calculates the CRC and compares it to the received
   CRC. If that matches, the Sequence Number is checked. If there are no
   errors, the CRC and Sequence Number are removed, the TLP is passed
   to the Transaction Layer of the receiver, and the sender is notified of good
   reception by the return of an Ack DLLP. In the event of an error a Nak will be
   returned instead, and the transmitter will replay TLPs in its Retry Buffer.
7. At the Transaction Layer, the TLP is decoded and the information is passed
   to the core logic for appropriate action. If the receiving device is the final
   target of this packet, it checks for ECRC errors and reports any related
   ECRC error condition to the core logic should there be any.
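
The layering in steps 2 through 6 can be mimicked in software to show which layer adds and removes which field. The sketch below is only a model under loud assumptions: zlib.crc32 stands in for the real LCRC calculation (the actual bit ordering and inversion rules differ), the single-byte STP and END markers are placeholders for what are really 8b/10b control symbols, the 2-byte sequence field is my assumption about the layout, and the receiver check is simplified to "anything unexpected gets a Nak". All function names are hypothetical.

```python
import struct
import zlib

# Placeholder framing markers; real Gen1/Gen2 links use 8b/10b control symbols.
STP, END = b"\xFB", b"\xFD"


def data_link_wrap(tlp: bytes, seq: int) -> bytes:
    """Step 3: prepend a Sequence Number and append a CRC over both."""
    body = struct.pack(">H", seq & 0x0FFF) + tlp
    return body + struct.pack(">I", zlib.crc32(body))  # stand-in for the LCRC


def physical_frame(packet: bytes) -> bytes:
    """Step 4 (Gen1/Gen2 style): add start and end framing."""
    return STP + packet + END


def physical_unframe(frame: bytes) -> bytes:
    """Step 5: strip the framing, which only has meaning at the Physical Layer."""
    if frame[:1] != STP or frame[-1:] != END:
        raise ValueError("bad framing")
    return frame[1:-1]


def data_link_unwrap(packet: bytes, expected_seq: int):
    """Step 6: check CRC, then Sequence Number; return (DLLP to send, TLP or None)."""
    body, (crc,) = packet[:-4], struct.unpack(">I", packet[-4:])
    if zlib.crc32(body) != crc:
        return "Nak", None
    (seq,) = struct.unpack(">H", body[:2])
    if seq != expected_seq:
        return "Nak", None  # simplification of the real sequence-number rules
    return "Ack", body[2:]
```

A round trip through data_link_wrap, physical_frame, physical_unframe, and data_link_unwrap returns the original TLP bytes with an Ack, while flipping any single bit of the framed packet body yields a Nak.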

TLP Structure
The basic usage of each field in a Transaction Layer Packet is defined in Table 5-1
on page 174.

Table 5-1: TLP Header Type Field Defines Transaction Variant

TLP Component   Protocol Layer      Component Use

Header          Transaction Layer   3 or 4 DW (12 or 16 bytes) in size. Format varies with
                                    type, but Header defines parameters, including:
                                    • Transaction type
                                    • Target address, ID, etc.
                                    • Transfer size (if any), Byte Enables
                                    • Attributes
                                    • Traffic Class

Data            Transaction Layer   Optional 1-1024 DW Payload, which is qualified
                                    with Byte Enables or byte-aligned start and end
                                    addresses. Note that a length of zero can't be
                                    specified, but a zero-length read (useful in some cases)
                                    can be approximated by specifying a length of 1 DW
                                    and Byte Enables of all zero. The resulting data from
                                    the Completer will be undefined but the Requester
                                    doesn't use it, so the result is the same.

Digest/ECRC     Transaction Layer   Optional. When present, ECRC is always 1 DW in
                                    size.

Generic TLP Header Format


General
Figure 5-3 on page 175 illustrates the format and contents of a generic TLP 4DW
header. In this section, fields common to nearly all transactions are summarized.
Header format differences associated with specific transaction types are
covered later.

Figure 5-3: Generic TLP Header Fields

Transaction Layer Packet (TLP):
  Framing (STP) | Sequence Number | Header | Data | Digest | LCRC | Framing (End)

Header (fields listed from bit 7 of byte +0 to bit 0 of byte +3):
Byte 0:   Fmt | Type | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:   Bytes 4-7 vary with Type | Last DW BE | 1st DW BE
Byte 8:   Bytes 8-11 vary with Type
Byte 12:  Bytes 12-15 vary with Type (not always required)

Generic Header Field Summary


Table 5-2 on page 176 summarizes the size and use of each of the generic TLP
header fields. Note that fields marked "R" in Figure 5-3 on page 175 are
reserved and should be set to zero.

Table 5-2: Generic Header Field Summary

Header Field            Header Location     Field Use

Fmt[2:0]                Byte 0 Bit 7:5      These bits encode information about header size and
(Format)                                    whether a data payload will be part of the TLP:
                                            00b = 3DW header, no data
                                            01b = 4DW header, no data
                                            10b = 3DW header, with data
                                            11b = 4DW header, with data
                                            An address below 4GB must use a 3DW header. The
                                            spec states that receiver behavior is undefined if a
                                            4DW header is used for an address below 4GB with
                                            the upper 32 bits of the 64-bit address set to zero.

Type[4:0]               Byte 0 Bit 4:0      These bits encode the transaction variant used with
                                            this TLP. The Type field is used with Fmt[1:0] field
                                            to specify transaction type, header size, and whether
                                            data payload is present. See "Generic Header Field
                                            Details" on page 178 for details.

TC[2:0]                 Byte 1 Bit 6:4      These bits encode the traffic class to be applied to
(Traffic Class)                             this TLP and to the completion associated with it (if
                                            any):
                                            000b = Traffic Class 0 (Default)
                                            .
                                            .
                                            111b = Traffic Class 7
                                            TC0 is the default class, while TC1-7 are used to
                                            provide differentiated services. See "Traffic Class
                                            (TC)" on page 247 for additional information.

Attr[2]                 Byte 1 Bit 2        This third Attribute bit indicates whether ID-based
(Attributes)                                Ordering is to be used for this TLP. To learn more,
                                            see "ID Based Ordering (IDO)" on page 301.

TH                      Byte 1 Bit 0        Indicates when TLP Hints have been included to
(TLP Processing                             give the system some idea about how best to handle
Hints)                                      this TLP. See "TPH (TLP Processing Hints)" on
                                            page 899 for a discussion on their usage.

TD                      Byte 2 Bit 7        If TD=1, the optional 4-byte TLP Digest has been
(TLP Digest)                                included with this TLP as the ECRC value.
                                            Some rules:
                                            • Presence of the Digest field must be checked by all
                                              receivers based on this bit.
                                            • A TLP with TD=1 but no Digest is handled as a
                                              Malformed TLP.
                                            • If a device supports checking ECRC and TD=1, it
                                              must perform the ECRC check.
                                            • If a device does not support checking ECRC
                                              (optional) at the ultimate destination, it must
                                              ignore the digest.
                                            For more on this topic see "CRC" on page 653 and
                                            "ECRC Generation and Checking" on page 657.

EP                      Byte 2 Bit 6        If EP=1, the data accompanying this TLP should be
(Poisoned Data)                             considered invalid although the transaction is being
                                            allowed to complete normally. For more on poisoned
                                            packets, refer to "Data Poisoning" on page 660.

Attr[1:0]               Byte 2 Bit 5:4      Bit 5 = Relaxed ordering: When set to 1, PCI-X
(Attributes)                                relaxed ordering is enabled for this TLP. If 0, then
                                            strict PCI ordering is used.
                                            Bit 4 = No Snoop: When set to 1, Requester is
                                            indicating that no host cache coherency issues exist for this
                                            TLP. System hardware can thus save time by
                                            skipping the normal processor cache snoop for this
                                            request. When 0, PCI-type cache snoop protection is
                                            required.

Address Type[1:0]       Byte 2 Bit 3:2      For Memory and Atomic Requests, this field
                                            supports address translation for virtualized systems.
                                            The translation protocol is described in a separate
                                            spec called Address Translation Services, where it can
                                            be seen that the field encodes as:
                                            00 = Default/Untranslated
                                            01 = Translation Request
                                            10 = Translated
                                            11 = Reserved

Length[9:0]             Byte 2 Bit 1:0      TLP data payload transfer size, in DW. Encoding:
                        Byte 3 Bit 7:0      00 0000 0001b = 1DW
                                            00 0000 0010b = 2DW
                                            .
                                            .
                                            11 1111 1111b = 1023DW
                                            00 0000 0000b = 1024DW

Last DW                 Byte 7 Bit 7:4      These four high-true bits map one-to-one to the
Byte Enables[3:0]                           bytes within the last double word of payload.
                                            Bit 7 = 1: Byte 3 in last DW is valid; otherwise not
                                            Bit 6 = 1: Byte 2 in last DW is valid; otherwise not
                                            Bit 5 = 1: Byte 1 in last DW is valid; otherwise not
                                            Bit 4 = 1: Byte 0 in last DW is valid; otherwise not

First DW                Byte 7 Bit 3:0      These four high-true bits map one-to-one to the
Byte Enables[3:0]                           bytes within the first double word of payload.
                                            Bit 3 = 1: Byte 3 in first DW is valid; otherwise not
                                            Bit 2 = 1: Byte 2 in first DW is valid; otherwise not
                                            Bit 1 = 1: Byte 1 in first DW is valid; otherwise not
                                            Bit 0 = 1: Byte 0 in first DW is valid; otherwise not
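
The bit positions in Table 5-2 translate directly into shifts and masks. The following sketch decodes the first eight header bytes into the generic fields; the function name and dictionary keys are invented for this illustration, and the Byte Enable fields in byte 7 are only meaningful for TLP types that carry them (such as memory and IO requests).

```python
def decode_generic_header(hdr: bytes) -> dict:
    """Decode the generic TLP header fields from the first 8 header bytes.

    Hypothetical helper: bit positions follow Table 5-2, key names are made up.
    """
    if len(hdr) < 8:
        raise ValueError("need at least the first two DWs of the header")
    b0, b1, b2, b3 = hdr[0], hdr[1], hdr[2], hdr[3]
    length = ((b2 & 0x03) << 8) | b3
    return {
        "fmt": b0 >> 5,                    # Byte 0 bits 7:5
        "type": b0 & 0x1F,                 # Byte 0 bits 4:0
        "tc": (b1 >> 4) & 0x7,             # Byte 1 bits 6:4
        "attr2_ido": (b1 >> 2) & 0x1,      # Byte 1 bit 2
        "th": b1 & 0x1,                    # Byte 1 bit 0
        "td": (b2 >> 7) & 0x1,             # Byte 2 bit 7
        "ep": (b2 >> 6) & 0x1,             # Byte 2 bit 6
        "relaxed_ordering": (b2 >> 5) & 0x1,
        "no_snoop": (b2 >> 4) & 0x1,
        "at": (b2 >> 2) & 0x3,             # Byte 2 bits 3:2
        # A Length encoding of all zeros means 1024 DW.
        "length_dw": length if length else 1024,
        "last_dw_be": hdr[7] >> 4,         # Byte 7 bits 7:4
        "first_dw_be": hdr[7] & 0xF,       # Byte 7 bits 3:0
    }
```

Note how the Length field is the only one here that needs special handling: the value 0 is reinterpreted as 1024 DW rather than as an empty payload.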

Generic Header Field Details


In the following sections, we describe details of each TLP Header field depicted
in Figure 5-3 on page 175.

Header Type/Format Field Encodings


Table 5-3 on page 179 summarizes the encodings used in TLP header Type and
Format (Fmt) fields.

Table 5-3: TLP Header Type and Format Field Encodings

TLP                                          FMT[2:0]                TYPE[4:0]

Memory Read Request (MRd)                    000 = 3DW, no data      00000
                                             001 = 4DW, no data

Memory Read Lock Request (MRdLk)             000 = 3DW, no data      00001
                                             001 = 4DW, no data

Memory Write Request (MWr)                   010 = 3DW, w/ data      00000
                                             011 = 4DW, w/ data

IO Read Request (IORd)                       000 = 3DW, no data      00010

IO Write Request (IOWr)                      010 = 3DW, w/ data      00010

Config Type 0 Read Request (CfgRd0)          000 = 3DW, no data      00100

Config Type 0 Write Request (CfgWr0)         010 = 3DW, w/ data      00100

Config Type 1 Read Request (CfgRd1)          000 = 3DW, no data      00101

Config Type 1 Write Request (CfgWr1)         010 = 3DW, w/ data      00101

Message Request (Msg)                        001 = 4DW, no data      10rrr*
                                                                     (see routing field)

Message Request W/Data (MsgD)                011 = 4DW, w/ data      10rrr*
                                                                     (see routing field)

Completion (Cpl)                             000 = 3DW, no data      01010

Completion W/Data (CplD)                     010 = 3DW, w/ data      01010

Completion-Locked (CplLk)                    000 = 3DW, no data      01011

Completion-Locked W/Data (CplDLk)            010 = 3DW, w/ data      01011

Fetch and Add AtomicOp Request               010 = 3DW, w/ data      01100
                                             011 = 4DW, w/ data

Unconditional Swap AtomicOp Request          010 = 3DW, w/ data      01101
                                             011 = 4DW, w/ data

Compare and Swap AtomicOp Request            010 = 3DW, w/ data      01110
                                             011 = 4DW, w/ data

Local TLP Prefix                             100 = TLP Prefix        0 L3 L2 L1 L0

End-to-End TLP Prefix                        100 = TLP Prefix        1 E3 E2 E1 E0
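
Because the Fmt and Type fields must be read together, a lookup keyed on the pair is a natural way to classify a TLP. The sketch below transcribes the rows of Table 5-3 that have mnemonics; the AtomicOp rows are omitted for brevity, and the function name and the "unknown" fallback are assumptions made for this example.

```python
# (Fmt[2:0], Type[4:0]) -> mnemonic, transcribed from Table 5-3.
FMT_TYPE_NAMES = {
    (0b000, 0b00000): "MRd",    (0b001, 0b00000): "MRd",
    (0b000, 0b00001): "MRdLk",  (0b001, 0b00001): "MRdLk",
    (0b010, 0b00000): "MWr",    (0b011, 0b00000): "MWr",
    (0b000, 0b00010): "IORd",   (0b010, 0b00010): "IOWr",
    (0b000, 0b00100): "CfgRd0", (0b010, 0b00100): "CfgWr0",
    (0b000, 0b00101): "CfgRd1", (0b010, 0b00101): "CfgWr1",
    (0b000, 0b01010): "Cpl",    (0b010, 0b01010): "CplD",
    (0b000, 0b01011): "CplLk",  (0b010, 0b01011): "CplDLk",
}


def tlp_mnemonic(fmt: int, type_: int) -> str:
    """Classify a TLP from its Fmt and Type fields (illustrative helper)."""
    if fmt == 0b100:
        # TLP Prefix: Type bit 4 distinguishes End-to-End from Local.
        return "End-to-End TLP Prefix" if type_ & 0b10000 else "Local TLP Prefix"
    if type_ >> 3 == 0b10:
        # Message: Type[2:0] is the routing subfield, so match on Type[4:3] only.
        if fmt == 0b001:
            return "Msg"
        if fmt == 0b011:
            return "MsgD"
    return FMT_TYPE_NAMES.get((fmt, type_), "unknown")
```

The prefix check comes first on purpose: an End-to-End TLP Prefix has Type bit 4 set, which would otherwise be mistaken for the 10b Message pattern in Type[4:3].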

Digest / ECRC Field


The TLP Digest bit reports the presence of the End-to-End CRC (ECRC). If this
optional feature is supported and enabled by software, devices calculate and
apply an ECRC for all TLPs they originate. Note that using ECRC requires
devices to include the optional Advanced Error Reporting registers, since the
capability and control registers for it are located there.

ECRC Generation and Checking. ECRC covers all fields that do not
change as the TLP is forwarded across the fabric. However, there are two bits
that can legally change as a packet makes its way across a topology:

• Bit 0 of the Type field: this changes when a configuration transaction is
  forwarded across a bridge and changes from a type 1 to a type 0 configuration
  transaction because it has reached the targeted bus. This is accomplished by
  changing bit 0 of the Type field.
• Error/Poisoned (EP) bit: this can change as a TLP traverses the fabric if
  the data associated with the packet is seen as corrupted. This is an optional
  feature referred to as error forwarding.
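
One way to keep a CRC valid while two bits are allowed to change is to force those bits to a fixed value before calculating it. To my understanding the spec does this by treating both variant bits as 1b for the ECRC calculation, but that detail is not stated in this section, so treat it as an assumption. In the sketch below, zlib.crc32 is only a stand-in for the real ECRC algorithm, and the function names are hypothetical.

```python
import zlib


def ecrc_input(tlp: bytes) -> bytes:
    """Return the TLP bytes with the two variant bits forced to 1 (assumed rule)."""
    masked = bytearray(tlp)
    masked[0] |= 0x01  # Type bit 0 lives in Byte 0, bit 0
    masked[2] |= 0x40  # EP lives in Byte 2, bit 6
    return bytes(masked)


def ecrc_stand_in(tlp: bytes) -> int:
    """CRC over the invariant view of the TLP; not the real ECRC polynomial rules."""
    return zlib.crc32(ecrc_input(tlp))
```

With this masking, a CfgRd1 header (Type 00101b) and the CfgRd0 header it becomes at the target bus (Type 00100b) produce the same value, as does a TLP whose EP bit was set in flight, while a change to any other bit still alters the result.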

Who Checks ECRC? The intended target of an ECRC is the ultimate recipient
of the TLP. Checking the LCRC verifies no transmission errors across a
given Link, but that gets recalculated for the packet at the egress port of a routing
element (Switch or Root Complex) before being forwarded to the next Link,
which could mask an internal error in the routing element. To protect against
that, the ECRC is carried forward unchanged on its journey between the
Requester and Completer. When the target device checks the ECRC, any error
possibilities along the way have a high probability of being detected.

The spec makes two statements regarding a Switch's role in ECRC checking:
• A Switch that supports ECRC checking performs this check on TLPs destined
  to a location within the Switch itself. On all other TLPs a Switch must
  preserve the ECRC (forward it untouched) as an integral part of the TLP.
• Note that a Switch may perform ECRC checking on TLPs passing through
  the Switch. ECRC Errors detected by the Switch are reported in the same
  way any other device would report them, but do not alter the TLP's passage
  through the Switch.

Using Byte Enables


General. Like PCI, PCIe needs a mechanism to reconcile its DW-aligned
addresses with the need, at times, for transfer sizes or starting/ending addresses
that are not DW-aligned. Toward this end, PCI Express makes use of the two
Byte Enable fields introduced earlier in Figure 5-3 on page 175 and in Table 5-2
on page 176. The First DW Byte Enable field and the Last DW Byte Enable field
allow the Requester to qualify the bytes of interest within the first and last double
words transferred.

Byte Enable Rules
1. Byte enable bits are high-true. A value of 0 indicates the corresponding byte
   in the data payload should not be used by the Completer. A value of 1
   indicates it should.
2. If the valid data is all within a single double word, the Last DW Byte Enable
   field must be = 0000b.
3. If the header Length field indicates a transfer is more than 1DW, the First
   DW Byte Enable must have at least one bit enabled.
4. If the Length field indicates a transfer of 3DW or more, then the First DW
   Byte Enable field and the Last DW Byte Enable field must have contiguous
   bits set. In these cases, the Byte Enables are only being used to give the byte
   offset of the effective starting and ending address from the DW-aligned
   address.
5. Discontinuous byte enable bit patterns in the First DW Byte Enable field are
   allowed if the transfer is 1DW.
6. Discontinuous byte enable bit patterns in both the First and Last DW
   Byte Enable fields are allowed if the transfer is between one and two DWs.
7. A write request with a transfer length of 1DW and no byte enables set is
   legal, but has no effect on the Completer.
8. If a read request of 1DW has no byte enables set, the completer returns a
   1DW data payload of undefined data. This may be used as a Flush mechanism
   that takes advantage of transaction ordering rules to force all previously
   posted writes out to memory before the completion is returned.
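
Rules 2 through 4 lend themselves to a simple checker. The sketch below encodes one reading of them: for transfers of 3DW or more it accepts only First DW patterns that run up to byte 3 and Last DW patterns that start at byte 0, which is my interpretation of "only being used to give the byte offset" of the start and end addresses. The function name and the message strings are hypothetical.

```python
# Patterns that mark only a start offset (First DW) or an end offset (Last DW).
FIRST_BE_OFFSET_ONLY = {0b1111, 0b1110, 0b1100, 0b1000}
LAST_BE_OFFSET_ONLY = {0b1111, 0b0111, 0b0011, 0b0001}


def check_byte_enables(length_dw: int, first_be: int, last_be: int) -> list:
    """Return a list of rule violations (empty if none are found)."""
    problems = []
    if length_dw == 1 and last_be != 0b0000:
        problems.append("rule 2: Last DW BE must be 0000b for a 1DW transfer")
    if length_dw > 1 and first_be == 0b0000:
        problems.append("rule 3: First DW BE needs at least one bit set")
    if length_dw >= 3 and (
        first_be not in FIRST_BE_OFFSET_ONLY or last_be not in LAST_BE_OFFSET_ONLY
    ):
        problems.append("rule 4: byte enables must be contiguous for 3DW or more")
    return problems
```

A 1DW request with First DW BE = 1010b passes (rule 5), and so does a 1DW request with no byte enables set (rules 7 and 8), whereas the same 1010b pattern on a 4DW request is reported under rule 4.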

Byte Enable Example. An example of byte enable use in this case is illustrated
in Figure 5-4 on page 182. Note that the transfer length must extend from
the first DW with any valid byte enabled to the last DW with any valid bytes
enabled. Because the transfer is more than 2DW, the byte enables may only be
used to specify the start address location (2d) and end address location (34d) of
the transfer.

Figure 5-4: Using First DW and Last DW Byte Enable Fields
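
The arithmetic behind the example can be worked through in code. Given an inclusive byte range, the sketch below derives the Length in DW and the two Byte Enable fields. The function name is hypothetical, and the values quoted afterwards are what this sketch computes for the 2d to 34d range, not a transcription of the figure.

```python
def byte_enables_for_range(start: int, end: int):
    """Return (length_dw, first_dw_be, last_dw_be) for byte addresses start..end."""
    if end < start:
        raise ValueError("end must not be below start")
    length_dw = end // 4 - start // 4 + 1
    if length_dw == 1:
        # All valid bytes sit in one DW, so the Last DW BE field is 0000b.
        first_be = sum(1 << b for b in range(start % 4, end % 4 + 1))
        return 1, first_be, 0b0000
    first_be = (0xF << (start % 4)) & 0xF  # bytes from the start offset up to byte 3
    last_be = 0xF >> (3 - end % 4)         # bytes from byte 0 up to the end offset
    return length_dw, first_be, last_be
```

For start address 2 and end address 34, this yields a Length of 9 DW, a First DW Byte Enable of 1100b (bytes 2 and 3 valid), and a Last DW Byte Enable of 0111b (bytes 0 through 2 valid).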

Transaction Descriptor Fields


As transactions move between requester and completer, it's necessary to
uniquely identify a transaction, since many split transactions may be queued up
from the same Requester at any instant. To help with this, the spec defines
several important header fields that form a unique Transaction Descriptor, as
illustrated in Figure 5-5.

Figure 5-5: Transaction Descriptor Fields

Header (fields listed from bit 7 of byte +0 to bit 0 of byte +3):
Byte 0:   Fmt | Type | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:   Completer ID | Cmpl Status | BCM | Byte Count
Byte 8:   Requester ID | Tag | R | Lower Addr

While the Transaction Descriptor fields are not in adjacent header locations,
collectively they describe key transaction attributes, including:

Transaction ID. The combination of the Requester ID (Bus, Device, and
Function Number of the Requester) and the Tag field of the TLP.

Traffic Class. The Traffic Class (TC) is added by the requester based on the
core logic request and travels unmodified through the topology to the Completer.
On every Link, the TC is mapped to one of the Virtual Channels.

Transaction Attributes. The ID-based Ordering, Relaxed Ordering, and
No Snoop bits also travel with the Request packet to the Completer.
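
To see why the Transaction ID needs both parts, consider how a Requester might match returning Completions to outstanding requests. The sketch below is a hypothetical bookkeeping model, not a description of any real device: the names make_requester_id and OutstandingRequests are invented, and the Requester ID packing follows the Bus/Device/Function layout described for the Requester ID field.

```python
def make_requester_id(bus: int, device: int, function: int) -> int:
    """Pack Bus (8 bits), Device (5 bits), and Function (3 bits) into 16 bits."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("Bus/Device/Function number out of range")
    return (bus << 8) | (device << 3) | function


class OutstandingRequests:
    """Track split transactions keyed by Transaction ID = (Requester ID, Tag)."""

    def __init__(self):
        self._pending = {}

    def issue(self, requester_id: int, tag: int, description: str) -> None:
        key = (requester_id, tag)
        if key in self._pending:
            raise ValueError("Transaction ID is already in use")
        self._pending[key] = description

    def complete(self, requester_id: int, tag: int) -> str:
        # A Completion carries the same Requester ID and Tag as its Request.
        return self._pending.pop((requester_id, tag))
```

Two requests from the same function can be in flight at once as long as their Tags differ, and the same Tag value can be reused by a different function without ambiguity.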

Additional Rules For TLPs With Data Payloads


The following rules apply when a TLP includes a data payload.

1. The Length field refers only to the data payload.
2. The first byte of data in the payload (immediately after the header) is
   always associated with the lowest (start) address.
3. The Length field always represents an integral number of DWs transferred.
   Partial DWs are qualified using First and Last Byte Enable fields.
4. The spec states that, when multiple transactions are returned by a completer
   in response to a single memory request, each intermediate transaction
   must end on naturally-aligned 64- or 128-byte address boundaries for a
   Root Complex. This is controlled by a configuration bit called the Read
   Completion Boundary (RCB). All other devices follow the PCI-X protocol
   and break such transactions at naturally-aligned 128-byte boundaries. This
   makes buffer management simpler in bridges.
5. The Length field is reserved when sending Message Requests unless the
   message is the version with data (MsgD).
6. The TLP data payload must not exceed the current value in the
   Max_Payload_Size field of the Device Control Register. Only write transactions
   have data payloads, so this restriction doesn't apply to read requests.
   A receiver is required to check for violations of the Max_Payload_Size limit
   during writes, and violations are treated as Malformed TLPs.
7. Receivers also must check for discrepancies between the value in the Length
   field and the actual amount of data transferred in a TLP. This type of
   violation is also treated as a Malformed TLP.
8. Requests must not mix combinations of start address and transfer length
   that would cause a memory access to cross a 4KB boundary. While checking
   for this is optional, if seen it's treated as a Malformed TLP.
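
Rules 6 and 8 reduce to two small arithmetic checks that a Requester could apply before issuing a request. This is a minimal sketch: the function names are hypothetical, addresses are assumed to be DW-aligned, and max_payload_bytes stands for whatever byte count is currently programmed in Max_Payload_Size.

```python
def crosses_4kb_boundary(start_addr: int, length_dw: int) -> bool:
    """True if a transfer of length_dw DWs starting at start_addr crosses 4KB."""
    last_byte = start_addr + length_dw * 4 - 1
    # Two addresses are in the same 4KB page when their bits above bit 11 match.
    return (start_addr >> 12) != (last_byte >> 12)


def payload_exceeds_limit(length_dw: int, max_payload_bytes: int) -> bool:
    """True if a write payload is larger than the programmed Max_Payload_Size."""
    return length_dw * 4 > max_payload_bytes
```

For instance, a 1DW access at address 0xFFC stays inside the first 4KB page, but a 2DW access at the same address spills into the next page and would have to be split into two requests.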

Specific TLP Formats: Request & Completion TLPs


In this section, the formats of the 3DW and 4DW headers used to accomplish specific
transaction types are described. Many of the generic fields described previously
apply, but an emphasis is placed on the fields which are handled differently
with specific transaction types. Detailed descriptions of the TLP Header format are
given in the sections that follow for these TLP types: 1) IO Requests, 2) Memory
Requests, 3) Configuration Requests, 4) Completions and 5) Message Requests.

IO Requests
While the spec discourages the use of IO transactions, allowance is made for
Legacy devices and for software that may need to rely on a compatible device
residing in the system IO map rather than the memory map. While the IO
transactions can technically access a 32-bit IO range, in reality many systems (and
CPUs) restrict IO access to the lower 16 bits (64KB) of this range. Figure 5-6 on
page 185 depicts the system IO map and the 16- and 32-bit address boundaries.
Devices that don't identify themselves as Legacy devices are not permitted to
request IO address space in their configuration Base Address Registers.

Figure 5-6: System IO Map

IO Request Header Format. A 3DW IO request header is shown in Figure 5-7
on page 185 and each of the fields is described in the section that follows.

Figure 5-7: 3DW IO Request Header Format

[Figure: the CPU and Root Complex exchange an IO Request TLP with a Legacy
Endpoint.]

IO Request TLP:
  Framing (STP) | Sequence Number | Header | Data | Digest | LCRC | Framing (End)

Header (fields listed from bit 7 of byte +0 to bit 0 of byte +3):
Byte 0:   Fmt (0x0) | Type (00010) | R | TC (000) | R | Attr (0) | R | TH (0) |
          TD | EP | Attr (00) | AT (00) | Length (0000000001)
Byte 4:   Requester ID | Tag | Last DW BE (0000) | 1st DW BE
Byte 8:   Address [31:2] | R

IO Request Header Fields. The location and use of each field in an IO
request header is described in Table 5-4 on page 186.

Table 5-4: IO Request Header Fields

Field Name                Header Byte/Bit     Function

Fmt[2:0]                  Byte 0 Bit 7:5      Packet Format for IO requests:
(Format)                                      000b = IO Read (3DW without data)
                                              010b = IO Write (3DW with data)

Type[4:0]                 Byte 0 Bit 4:0      Packet type is 00010b for IO requests

TC[2:0]                   Byte 1 Bit 6:4      Traffic Class for IO requests is always
(Traffic Class)                               zero, ensuring that these packets will
                                              never interfere with any high-priority
                                              packets.

Attr[2]                   Byte 1 Bit 2        ID-based Ordering doesn't apply for
(Attributes)                                  IO requests and this bit is reserved.

TH                        Byte 1 Bit 0        TLP Processing Hints don't apply to
(TLP Processing Hints)                        IO requests and this bit is reserved.

TD                        Byte 2 Bit 7        Indicates the presence of a digest field
(TLP Digest)                                  (ECRC) at the end of the TLP.

EP                        Byte 2 Bit 6        Indicates whether the data payload (if
(Poisoned Data)                               present) is poisoned.

Attr[1:0]                 Byte 2 Bit 5:4      Relaxed Ordering and No Snoop bits
(Attributes)                                  don't apply for IO requests and are
                                              always zero.

AT[1:0]                   Byte 2 Bit 3:2      Address Type doesn't apply for IO
(Address Type)                                requests and these bits must be zero.

Length[9:0]               Byte 2 Bit 1:0      Indicates data payload size in DW.
                          Byte 3 Bit 7:0      For IO requests, this field is always
                                              just 1 since no more than 4 bytes can
                                              be transferred. The First DW Byte
                                              Enables qualify which bytes are used.

Requester ID[15:0]        Byte 4 Bit 7:0      Identifies the Requester's "return
                          Byte 5 Bit 7:0      address" for corresponding Completion.
                                              Byte 4, 7:0 = Bus Number
                                              Byte 5, 7:3 = Device Number
                                              Byte 5, 2:0 = Function Number

Tag[7:0]                  Byte 6 Bit 7:0      These bits identify the specific
                                              requests from the requester. A unique
                                              tag value is assigned to each outgoing
                                              Request. By default, only bits 4:0 are
                                              used, but the Extended Tag and Phantom
                                              Functions options can extend that
                                              to 11 bits, permitting up to 2048
                                              outstanding requests to be in progress
                                              simultaneously.

Last DW BE[3:0]           Byte 7 Bit 7:4      These bits must be 0000b because IO
(Last DW Byte Enables)                        requests can only be one DW in size.

1st DW BE[3:0]            Byte 7 Bit 3:0      These bits qualify the bytes in the one
(First DW Byte Enables)                       DW payload. For IO requests, any bit
                                              combination is valid (including all
                                              zeros).

Address[31:2]             Byte 8 Bit 7:0      The upper 30 bits of the 32-bit start
                          Byte 9 Bit 7:0      address for the IO transfer. The lower
                          Byte 10 Bit 7:0     two bits of the 32-bit address are
                          Byte 11 Bit 7:2     reserved (00b), forcing a DW-aligned
                                              start address.
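
Putting the fields of Table 5-4 together, a 3DW IO request header can be assembled with a handful of shifts. The sketch below is illustrative only: build_io_request and its parameters are hypothetical names, and the function packs just the 12 header bytes (no payload, digest, or link-layer fields).

```python
import struct


def build_io_request(write: bool, requester_id: int, tag: int,
                     address: int, first_be: int,
                     td: int = 0, ep: int = 0) -> bytes:
    """Pack a 3DW IO request header following the layout in Table 5-4."""
    if address & 0x3:
        raise ValueError("IO address must be DW-aligned (bits 1:0 are reserved)")
    fmt = 0b010 if write else 0b000       # IO Write carries data, IO Read does not
    byte0 = (fmt << 5) | 0b00010          # Type is 00010b for IO requests
    byte1 = 0                             # TC = 0; Attr[2] and TH are reserved
    byte2 = (td << 7) | (ep << 6)         # Attr[1:0] = 00, AT = 00, Length[9:8] = 00
    byte3 = 1                             # Length is always 1 DW
    byte7 = first_be & 0xF                # Last DW BE = 0000b, First DW BE as given
    return struct.pack(">BBBBHBBI", byte0, byte1, byte2, byte3,
                       requester_id & 0xFFFF, tag & 0xFF, byte7,
                       address & 0xFFFFFFFF)
```

For example, build_io_request(True, 0x0100, 5, 0x3F8, 0b0001) produces a 12-byte header whose first byte is 0x42 (Fmt 010b, Type 00010b) and whose last four bytes hold the address 0x000003F8.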

Memory Requests
PCI Express memory transactions include two classes: Read Requests with their
corresponding Completions, and Write Requests. The system memory map
shown in Figure 5-8 on page 188 depicts both a 3DW and 4DW memory request
packet. Keep in mind a point that the spec reiterates several times: a memory
transfer is never permitted to cross a 4KB address boundary.

Figure 5-8: 3DW And 4DW Memory Request Header Formats

[Figure: the CPU, Root Complex, Memory, and a PCIe Endpoint, with a System
Memory Map running from 0 through 4GB (2^32) up to 2^64. The 3DW header
addresses the range below 4GB; the 4DW header addresses the range above it.]

3DW or 4DW Memory Request TLP:
  Framing (STP) | Sequence Number | Header | Data | Digest | LCRC | Framing (End)

4DW Memory Request Header (fields listed from bit 7 of byte +0 to bit 0 of byte +3):
Byte 0:   Fmt (0x1) | Type | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:   Requester ID | Tag | Last DW BE | 1st DW BE
Byte 8:   Address [63:32]
Byte 12:  Address [31:2] | R

3DW Memory Request Header:
Byte 0:   Fmt (0x0) | Type | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:   Requester ID | Tag | Last DW BE | 1st DW BE
Byte 8:   Address [31:2] | R

Memory Request Header Fields. The location and use of each field in a
4DW memory request header is listed in Table 5-5 on page 189. Note that the
difference between a 3DW header and a 4DW header is simply the location and
size of the starting Address field.


Table 5-5: 4DW Memory Request Header Fields

Field Name                  Header Byte/Bit   Function

Fmt[2:0]                    Byte 0 Bit 7:5    Packet Formats:
(Format)                                      000b = Memory Read (3DW w/o data)
                                              010b = Memory Write (3DW w/ data)
                                              001b = Memory Read (4DW w/o data)
                                              011b = Memory Write (4DW w/ data)
                                              1xxb = TLP Prefix has been added to the
                                              beginning of the packet. See "TPH (TLP
                                              Processing Hints)" on page 899 for more
                                              on this.

Type[4:0]                   Byte 0 Bit 4:0    TLP packet Type field:
                                              00000b = Memory Read or Write
                                              00001b = Memory Read Locked
                                              Type field is used with Fmt[1:0] field to
                                              specify transaction type, header size, and
                                              whether data payload is present.

TC[2:0]                     Byte 1 Bit 6:4    These bits encode the traffic class to be
(Traffic Class)                               applied to a Request and to any associated
                                              Completion.
                                              000b = Traffic Class 0 (Default)
                                              ...
                                              111b = Traffic Class 7
                                              See "Traffic Class (TC)" on page 247 for
                                              more on this.

Attr[2]                     Byte 1 Bit 2      Indicates whether ID-based Ordering is to
(Attributes)                                  be used for this TLP. To learn more, see
                                              "ID Based Ordering (IDO)" on page 301.

TH                          Byte 1 Bit 0      Indicates whether TLP Hints have been
(TLP Processing Hints)                        included. See "TPH (TLP Processing Hints)"
                                              on page 899 for a discussion on these hints.

TD                          Byte 2 Bit 7      If 1, the optional TLP Digest field is
(TLP Digest)                                  included with this TLP. Some rules:
                                              - The presence of the Digest field must be
                                                checked by all receivers (using this bit).
                                              - TLPs with TD = 1 but no Digest field are
                                                treated as Malformed.
                                              - If the TD bit is set, recipient must
                                                perform the ECRC check if enabled.
                                              - If a Receiver doesn't support the
                                                optional ECRC checking, it must ignore
                                                the digest field.

EP                          Byte 2 Bit 6      If 1, the data accompanying this packet
(Poisoned Data)                               should be considered to have an error,
                                              although the transaction is allowed to
                                              complete normally.

Attr[1:0]                   Byte 2 Bit 5:4    Bit 5 = Relaxed Ordering.
(Attributes)                                  When set = 1, PCI-X relaxed ordering is
                                              enabled for this TLP. Otherwise, strict
                                              PCI ordering is used.
                                              Bit 4 = No Snoop.
                                              If 1, system hardware is not required to
                                              cause processor cache snoop for coherency
                                              for this TLP. Otherwise, cache snooping is
                                              required.

Address Type[1:0]           Byte 2 Bit 3:2    This field supports address translation for
                                              virtualized systems. The translation
                                              protocol is described in a separate spec
                                              called Address Translation Services, where
                                              it can be seen that the field encodes as:
                                              00 = Default/Untranslated
                                              01 = Translation Request
                                              10 = Translated
                                              11 = Reserved

Length[9:0]                 Byte 2 Bit 1:0    TLP data payload transfer size, in DW.
                            Byte 3 Bit 7:0    Maximum size is 1024 DW (4KB), encoded as:
                                              00 0000 0001b = 1 DW
                                              00 0000 0010b = 2 DW
                                              ...
                                              11 1111 1111b = 1023 DW
                                              00 0000 0000b = 1024 DW

Requester ID[15:0]          Byte 4 Bit 7:0    Identifies a Requester's return address for
                            Byte 5 Bit 7:0    a completion:
                                              Byte 4, 7:0 = Bus Number
                                              Byte 5, 7:3 = Device Number
                                              Byte 5, 2:0 = Function Number

Tag[7:0]                    Byte 6 Bit 7:0    These identify each outstanding request
                                              issued by the Requester. By default only
                                              bits 4:0 are used, allowing up to 32
                                              requests to be in progress at a time. If
                                              the Extended Tag bit in the Control
                                              Register is set, then all 8 bits may be
                                              used (256 tags).

Last BE[3:0]                Byte 7 Bit 7:4    These qualify bytes within the last DW of
(Last DW Byte Enables)                        data transferred.

1st DW BE[3:0]              Byte 7 Bit 3:0    These qualify bytes within the first DW of
(First DW Byte Enables)                       the data payload.

Address[63:32]              Byte 8 Bit 7:0    The upper 32 bits of the 64-bit start
                            Byte 9 Bit 7:0    address for the memory transfer.
                            Byte 10 Bit 7:0
                            Byte 11 Bit 7:0

Address[31:2]               Byte 12 Bit 7:0   The lower 32 bits of the 64-bit start
                            Byte 13 Bit 7:0   address for the memory transfer. The lower
                            Byte 14 Bit 7:0   two bits of the address are reserved,
                            Byte 15 Bit 7:2   forcing a DW-aligned start address.
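
To make the byte positions in the table concrete, here is a rough sketch that packs a Memory Request header into bytes. The function names and defaults are illustrative assumptions, not anything defined by the spec or this book. The sketch leaves TH, TD, EP, Attr and AT at zero, uses the 4DW format only when the address needs bits 63:32, and clears the Last DW Byte Enables for a one-DW request; both of those choices are assumptions of this example.

```python
import struct

def encode_length(dw_count):
    """Encode a payload size of 1..1024 DW into the 10-bit Length field."""
    if not 1 <= dw_count <= 1024:
        raise ValueError("payload must be 1..1024 DW")
    return dw_count & 0x3FF          # 1024 DW is encoded as 00 0000 0000b

def mem_request_header(addr, dw_count, requester_id, tag,
                       first_be=0xF, last_be=0xF, write=False, tc=0):
    """Pack a Memory Request header with TH, TD, EP, Attr and AT all zero."""
    if addr & 0x3:
        raise ValueError("start address must be DW-aligned")
    if dw_count == 1:
        last_be = 0                  # assumption: one-DW requests use only 1st DW BE
    four_dw = addr >> 32 != 0        # assumption: 4DW only when Address[63:32] is needed
    fmt = (0b010 if write else 0b000) | (0b001 if four_dw else 0b000)
    length = encode_length(dw_count)
    head = struct.pack(
        ">BBBBHBB",
        fmt << 5,                    # Byte 0: Fmt[2:0], Type = 00000b
        (tc & 0x7) << 4,             # Byte 1: TC in bits 6:4
        length >> 8,                 # Byte 2: Length[9:8] in bits 1:0
        length & 0xFF,               # Byte 3: Length[7:0]
        requester_id & 0xFFFF,       # Bytes 4-5: Bus, Device, Function
        tag & 0xFF,                  # Byte 6: Tag
        ((last_be & 0xF) << 4) | (first_be & 0xF),   # Byte 7: byte enables
    )
    if four_dw:
        return head + struct.pack(">II", addr >> 32, addr & 0xFFFFFFFC)
    return head + struct.pack(">I", addr)
```

For instance, a 1024 DW read from address 0x1000 yields a 12-byte header whose Length field is all zeros, while a one-DW write above 4GB yields a 16-byte header with Fmt = 011b.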


Memory Request Notes. Features of memory requests include:
1. Memory data transfers are not permitted to cross a 4KB boundary.
2. All memory-mapped writes are posted to improve performance.
3. Either 32- or 64-bit addressing may be used.
4. Data payload size is between 0 and 1024 DW (0 to 4KB).
5. Quality of Service features may be used, including up to 8 Traffic Classes.
6. The No Snoop attribute can be used to relieve the system of the need to
   snoop processor caches when transactions target main memory.
7. The Relaxed Ordering attribute may be used to allow devices in the packet's
   path to apply the relaxed ordering rules in hopes of improving performance.

Configuration Requests
PCI Express uses both Type 0 and Type 1 configuration requests the same way
PCI did to maintain backward compatibility. A Type 1 cycle propagates
downstream until it reaches the bridge whose secondary bus matches the target
bus. At that point, the configuration transaction is converted from Type 1 to
Type 0 by the bridge. The bridge knows when to forward and convert configuration
cycles based on the previously programmed bus number registers: Primary,
Secondary, and Subordinate Bus Numbers. For more on this topic, refer to the
section "Legacy PCI Mechanism" on page 91.
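
The forwarding decision described above can be sketched as a comparison against the bridge's bus number registers. The `Bridge` class, the function name and the returned strings below are hypothetical names chosen for this example; real hardware would act on the TLP rather than return a label.

```python
from dataclasses import dataclass

@dataclass
class Bridge:
    primary: int       # bus on the upstream side
    secondary: int     # bus directly below the bridge
    subordinate: int   # highest bus number reachable below the bridge

def route_type1_config(bridge, target_bus):
    """Decide what a bridge does with a Type 1 request arriving from upstream."""
    if target_bus == bridge.secondary:
        return "convert to Type 0"    # target sits on the bus just below
    if bridge.secondary < target_bus <= bridge.subordinate:
        return "forward as Type 1"    # target is further downstream
    return "not claimed"              # outside this bridge's bus range
```

With a bridge programmed as Primary = 0, Secondary = 1, Subordinate = 4, a request for bus 1 is converted, a request for bus 3 is forwarded unchanged, and a request for bus 5 is not claimed.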


Figure 5-9: 3DW Configuration Request And Header Format

[Figure: the CPU and Root Complex send a Type 1 Configuration Request to a
Switch, which passes a Type 0 Configuration Request to the PCIe Endpoint.
Configuration Request TLP layout: Framing (STP), Sequence Number, Header, Data,
Digest, LCRC, Framing (End).]

3DW Configuration Request Header:
  Byte 0:   Fmt (0x0), Type (0010x), R, TC (000), R, Attr, R, TH (0), TD, EP,
            Attr (00), AT (00), Length (00 0000 0001)
  Byte 4:   Requester ID, Tag, Last DW BE (0000), 1st DW BE
  Byte 8:   Bus Number, Device Number, Function Number (or Function Number
            with ARI), Rsvd, Ext Reg Number, Register Number, R

In Figure 5-9 on page 193, a Type 1 configuration cycle is shown making its way
downstream, where it is converted to Type 0 by the bridge for that bus
(accomplished by changing bit 0 of the Type field). Note that, unlike PCI, only
one device can reside downstream on a Link. Consequently, no IDSEL or other
hardware indication is needed to tell the device that it should claim the Type 0
cycle; any Type 0 configuration cycle a device sees on its Upstream Link will be
understood as targeting that device.

Definitions Of Configuration Request Header Fields. Table 5-6 on
page 194 describes the location and use of each field in the configuration
request header illustrated in Figure 5-9 on page 193.


Table 5-6: Configuration Request Header Fields

Field Name                  Header Byte/Bit   Function

Fmt[2:0]                    Byte 0 Bit 7:5    Always a 3DW header.
(Format)                                      000b = configuration read (no data)
                                              010b = configuration write (with data)

Type[4:0]                   Byte 0 Bit 4:0    00100b = Type 0 Config Request
                                              00101b = Type 1 Config Request

TC[2:0]                     Byte 1 Bit 6:4    Traffic Class must be zero for
(Traffic Class)                               Configuration requests, ensuring that these
                                              packets will never interfere with any
                                              high-priority packets.

Attr[2]                     Byte 1 Bit 2      Reserved; must be zero for Config Requests.
(Attributes)

TH                          Byte 1 Bit 0      Reserved; must be zero for Config Requests.
(TLP Processing Hints)

TD                          Byte 2 Bit 7      Indicates the presence of a digest field
(TLP Digest)                                  (1 DW) at the end of the TLP.

EP                          Byte 2 Bit 6      Indicates that data payload is poisoned.
(Poisoned Data)

Attr[1:0]                   Byte 2 Bit 5:4    Relaxed Ordering and No Snoop bits are both
(Attributes)                                  always = 0 in configuration requests.

AT[1:0]                     Byte 2 Bit 3:2    Address Type is reserved for config
(Address Type)                                requests and must be zero.

Length[9:0]                 Byte 2 Bit 1:0    Data payload size in DW is always = 1 for
                            Byte 3 Bit 7:0    configuration requests. Byte Enables
                                              qualify bytes within the DW and any
                                              combination is legal.

Requester ID[15:0]          Byte 4 Bit 7:0    Identifies the Requester's return address
                            Byte 5 Bit 7:0    for a completion:
                                              Byte 4, 7:0 = Bus Number
                                              Byte 5, 7:3 = Device Number
                                              Byte 5, 2:0 = Function Number

Tag[7:0]                    Byte 6 Bit 7:0    These bits identify each outstanding
                                              request issued by the Requester. By
                                              default, only bits 4:0 are used (32
                                              outstanding transactions at a time), but if
                                              the Extended Tag bit in the Control
                                              Register is set = 1, then all 8 bits may be
                                              used (256 tags).

Last BE[3:0]                Byte 7 Bit 7:4    These qualify bytes in the last data DW
(Last DW Byte Enables)                        transferred. Since config requests can only
                                              be one DW in size, these bits must be zero.

1st DW BE[3:0]              Byte 7 Bit 3:0    These high-true bits qualify bytes in the
(First DW Byte Enables)                       first data DW transferred. For config
                                              requests, any bit combination is valid
                                              (including none active).

Completer ID[15:0]          Byte 8 Bit 7:0    Identifies the completer being accessed
                            Byte 9 Bit 7:0    with this configuration cycle.
                                              Byte 8, 7:0 = Bus Number
                                              Byte 9, 7:3 = Device Number
                                              Byte 9, 2:0 = Function Number

Ext Register Number[3:0]    Byte 10 Bit 3:0   These provide the upper 4 bits of DW space
(Extended Register Number)                    for accessing the extended config space.
                                              They're combined with Register Number to
                                              create the 10-bit address needed to access
                                              the 1024 DW (4096 byte) space. For
                                              PCI-compatible config space, this field
                                              must be zero.

Register Number[5:0]        Byte 11 Bit 7:2   As the lower 6 bits of the 10-bit
                                              configuration DW address, these specify the
                                              register number. The two lowest bits of
                                              Byte 11 are always zero, forcing
                                              DW-aligned accesses.

Configuration Request Notes. Configuration requests always use the 3DW header
format and are routed based on the target Bus, Device and Function numbers. All
devices capture their Bus and Device Number from the target numbers in the
Request whenever they receive a Type 0 configuration write cycle. They do so
because they'll need those numbers later to use as their Requester ID when they
send requests of their own.
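
As a small illustration of how the target fields in bytes 8 through 11 fit together, the sketch below packs a Bus/Device/Function and a byte offset into those four bytes. The function name is hypothetical. It assumes the Register Number occupies the upper six bits of Byte 11, which is consistent with the two lowest bits always being zero.

```python
def config_target_bytes(bus, device, function, offset):
    """Pack bytes 8-11 of a configuration request header.

    offset is the byte offset into the 4096-byte configuration space.
    """
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus/device/function out of range")
    if not 0 <= offset < 4096 or offset & 0x3:
        raise ValueError("offset must be DW-aligned and below 4096")
    dw_index = offset >> 2              # 10-bit DW address, 0..1023
    ext_reg = (dw_index >> 6) & 0xF     # upper 4 bits: Ext Register Number
    reg_num = dw_index & 0x3F           # lower 6 bits: Register Number
    return bytes([bus, (device << 3) | function, ext_reg, reg_num << 2])
```

An offset of 0x100, the first DW of extended configuration space, produces Ext Register Number = 1 and Register Number = 0; any offset below 0x100 leaves the extended field at zero, as required for PCI-compatible space.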

Completions
Completions are expected in response to non-posted Requests, unless errors
prevent them. For example, Memory, IO, or Configuration Read requests usually
result in Completions with data. On the other hand, IO or Configuration Write
requests usually result in a completion without data that merely reports the
status of the transaction.

Many fields in the Completion use the same values as the associated request,
including Traffic Class, Attribute bits, and the original Requester ID (used to
route the completion back to the Requester). Figure 5-10 on page 197 shows a
completion returned for a non-posted request, and the 3DW header format it
uses. Completions also supply the Completer ID in the header. Completer ID is
not interesting during normal operation, but knowing where the Completion
came from could be useful for error diagnosis during system debug.


Figure 5-10: 3DW Completion Header Format

[Figure: a PCIe Endpoint returns a Completion for a Non-Posted Request through
the Switch to the Root Complex and CPU. Completion TLP layout: Framing (STP),
Sequence Number, Header, Data, Digest, LCRC, Framing (End).]

3DW Completion Header:
  Byte 0:   Fmt (0x0), Type (01010), R, TC, R, Attr, R, TH (0), TD, EP, Attr,
            AT (00), Length
  Byte 4:   Completer ID, Compl. Status, BCM, Byte Count
  Byte 8:   Requester ID, Tag, R, Lower Address

Definitions Of Completion Header Fields. Table 5-7 on page 197
describes the location and use of each field in a completion header.

Table 5-7: Completion Header Fields

Field Name                  Header Byte/Bit   Function

Fmt[2:0]                    Byte 0 Bit 7:5    Packet Format (always a 3DW header):
(Format)                                      000b = Completion without data (Cpl)
                                              010b = Completion with data (CplD)

Type[4:0]                   Byte 0 Bit 4:0    Packet type is 01010b for Completions.

TC[2:0]                     Byte 1 Bit 6:4    Completions must use the same value here as
(Traffic Class)                               the corresponding Request.

Attr[2]                     Byte 1 Bit 2      Indicates whether ID-based Ordering is to
(Attributes)                                  be used for this TLP. To learn more, see
                                              "ID Based Ordering (IDO)" on page 301.

TH                          Byte 1 Bit 0      Reserved for Completions.
(TLP Processing Hints)

TD                          Byte 2 Bit 7      If = 1, indicates the presence of a digest
(TLP Digest)                                  field at the end of the TLP.

EP                          Byte 2 Bit 6      If = 1, indicates the data payload is
(Poisoned Data)                               poisoned.

Attr[1:0]                   Byte 2 Bit 5:4    Completions must use the same values here
(Attributes)                                  as the corresponding Request.

AT[1:0]                     Byte 2 Bit 3:2    Address Type is reserved for Completions
(Address Type)                                and must be zero, but Receivers are not
                                              required or even encouraged to check this.

Length[9:0]                 Byte 2 Bit 1:0    Indicates data payload size in DW. For
                            Byte 3 Bit 7:0    Completions, this field reflects the size
                                              of the data payload associated with this
                                              completion.

Completer ID[15:0]          Byte 4 Bit 7:0    Identifies the Completer to support
                            Byte 5 Bit 7:0    debugging problems.
                                              Byte 4, 7:0 = Completer Bus #
                                              Byte 5, 7:3 = Completer Dev #
                                              Byte 5, 2:0 = Completer Function #

Compl. Status[2:0]          Byte 6 Bit 7:5    These bits indicate status for this
(Completion Status Code)                      Completion.
                                              000b = Successful Completion (SC)
                                              001b = Unsupported Request (UR)
                                              010b = Config Req Retry Status (CRS)
                                              100b = Completer Abort (CA)
                                              All other codes are reserved. See "Summary
                                              of Completion Status Codes" on page 200.

BCM                         Byte 6 Bit 4      This is only used by PCI-X Completers and
(Byte Count Modified)                         indicates that the Byte Count field reports
                                              only the first payload rather than the
                                              total payload remaining. See "Using The
                                              Byte Count Modified Bit" on page 201.

Byte Count[11:0]            Byte 6 Bit 3:0    Byte count remaining to satisfy a read
                            Byte 7 Bit 7:0    request, as derived from the original
                                              request Length field. See "Data Returned
                                              For Read Requests:" on page 201 for special
                                              cases caused by multiple completions.

Requester ID[15:0]          Byte 8 Bit 7:0    Copied from the Request for use as the
                            Byte 9 Bit 7:0    return address (target) for this
                                              Completion.
                                              Byte 8, 7:0 = Requester Bus #
                                              Byte 9, 7:3 = Requester Device #
                                              Byte 9, 2:0 = Requester Function #

Tag[7:0]                    Byte 10 Bit 7:0   This must be the Tag value received with
                                              the Request. Requester associates this
                                              Completion with a pending Request based on
                                              the Tag.

Lower Address[6:0]          Byte 11 Bit 6:0   The lower 7 bits of address for the first
                                              data returned for a read request.
                                              Calculated from Request Length and Byte
                                              Enables, it assists buffer management by
                                              showing how many bytes can be transferred
                                              before reaching the next Read Completion
                                              Boundary. See "Calculating Lower Address
                                              Field" on page 200.

Summary of Completion Status Codes.
- 000b (SC) Successful Completion: the Request was serviced properly.
- 001b (UR) Unsupported Request: Request is not legal or was not recognized
  by the Completer. This is an error condition, but how the Completer
  responds depends on the spec revision to which it was designed. Before the
  1.1 spec, this was considered an uncorrectable error, but for 1.1 and later
  it's treated as an Advisory Non-Fatal Error. See "Unsupported Request
  (UR) Status" on page 663 for details.
- 010b (CRS) Configuration Request Retry Status: Completer is temporarily
  unable to service a configuration request, and the request should be
  attempted again later.
- 100b (CA) Completer Abort: Completer should have been able to service
  the request but has failed for some reason. This is an uncorrectable error.
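
A receiver's decode of the status field might look like the sketch below. The enum and function names are assumptions for illustration only. Reserved codes are mapped to UR here because the receiver rules later in this chapter say a reserved status is treated as if it were UR.

```python
from enum import Enum

class CplStatus(Enum):
    SC = 0b000    # Successful Completion
    UR = 0b001    # Unsupported Request
    CRS = 0b010   # Configuration Request Retry Status
    CA = 0b100    # Completer Abort

def decode_cpl_status(byte6):
    """Extract Compl. Status from bits 7:5 of Completion header byte 6."""
    code = (byte6 >> 5) & 0x7
    try:
        return CplStatus(code)
    except ValueError:
        return CplStatus.UR   # reserved codes are handled as if they were UR
```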

Calculating The Lower Address Field. This field is set up by the Completer
to reflect the byte-aligned address of the first enabled byte of data being
returned in the Completion payload. Hardware calculates this by considering
both the DW start address and the Byte Enable pattern in the First DW Byte
Enable field provided in the original request.

For Memory Read Requests, the address is an offset from the DW start address:
- If the First DW Byte Enable field is 1111b, all bytes are enabled in the
  first DW and the offset is 0. This field matches the DW-aligned start address.
- If the First DW Byte Enable field is 1110b, the upper three bytes are enabled
  in the first DW and the offset is 1. This field is the DW start address + 1.
- If the First DW Byte Enable field is 1100b, the upper two bytes are enabled
  in the first DW and the offset is 2. This field is the DW start address + 2.
- If the First DW Byte Enable field is 1000b, only the upper byte is enabled in
  the first DW and the offset is 3. This field is the DW start address + 3.

Once calculated, the lower 7 bits are placed in the Lower Address field of the
Completion header to facilitate the case in which the read completion is smaller
than the entire payload and needs to stop at the first RCB. Breaking a
transaction must be done on RCBs, and the number of bytes transferred to reach
the first one is based on start address.

For AtomicOp Completions, the Lower Address field is reserved. For all other
Completion types, it's set to zero.
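
The four cases listed for Memory Read Requests amount to finding the lowest set bit of the First DW Byte Enables. A minimal sketch follows; the function name is hypothetical, and returning an offset of 0 when no byte is enabled is an assumption of this example rather than something stated above.

```python
def lower_address(dw_start_addr, first_be):
    """Compute Lower Address[6:0] for a Memory Read completion."""
    be = first_be & 0xF
    if be == 0:
        offset = 0   # assumption: no byte enabled, so no offset is applied
    else:
        # Index of the lowest set bit = offset of the first enabled byte.
        offset = (be & -be).bit_length() - 1
    return ((dw_start_addr & ~0x3) + offset) & 0x7F
```

For a request starting at DW address 0x1044 with First DW Byte Enables of 1100b, the result is 0x46: the low seven bits of the start address (0x44) plus an offset of 2.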

Using The Byte Count Modified Bit. This bit is only set by PCI-X Completers,
but they could exist in a PCIe topology if a bridge from PCIe to PCI-X is
used. Rules for its assertion include:
1. It's only set by a PCI-X Completer if a read request is going to be broken
   into multiple completions.
2. It's only set for the first Completion of the series, and only then to
   indicate that the first Completion contains a Byte Count field that reflects
   the first Completion payload rather than the total remaining (as it normally
   would). The Requester understands that, even though the Byte Count appears to
   show that this is the last Completion for this request, this Completion will
   instead be followed by others to satisfy the original request as required.
3. For subsequent Completions in the series, the BCM bit must be deasserted
   and the Byte Count field will reflect the total remaining count as it
   normally would.
4. Devices receiving Completions with the BCM bit set must interpret this
   case properly.
5. The Lower Address field is set by the Completer during completions with
   data to reflect the address of the first enabled byte of data being returned.
Data Returned For Read Requests:
1. A read request may require multiple completions to be fulfilled, but total
   data transfer must eventually equal the size of the original request, or a
   Completion Timeout error will probably result.
2. A given Completion can only service one Request.
3. IO and Configuration reads are always 1 DW, and will always be satisfied
   with a single Completion.
4. A Completion with a Status Code other than SC (successful) terminates a
   transaction.
5. The Read Completion Boundary (RCB) must be observed when handling a
   read request with multiple completions. The RCB is 64 bytes or 128 bytes
   for the Root Complex, since it is allowed to modify the size of packets
   flowing between its ports, and the value used is visible in a configuration
   register.
6. Bridges and endpoints may implement a bit for selecting the RCB size (64 or
   128 bytes) under software control.
7. Completions that are entirely within an aligned RCB boundary must complete
   in one transfer, since the transfer won't reach the RCB, which is the
   only place it can legally stop early.
8. Multiple Completions for a single read request must return data in
   increasing address order.
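
To illustrate rules 5, 7 and 8, the sketch below splits a read into Completions that each stop at the next RCB. This is only one legal pattern, chosen for simplicity; a Completer may well return larger pieces, and the function name and tuple layout are assumptions made for this example.

```python
def split_read_completions(start_addr, byte_count, rcb=64):
    """Return one (lower_address, payload_bytes, byte_count_remaining)
    tuple per Completion, breaking the data at every RCB."""
    if rcb not in (64, 128):
        raise ValueError("RCB is 64 or 128 bytes")
    completions = []
    addr, remaining = start_addr, byte_count
    while remaining > 0:
        to_boundary = rcb - (addr % rcb)    # bytes up to the next RCB
        chunk = min(remaining, to_boundary)
        completions.append((addr & 0x7F, chunk, remaining))
        addr += chunk                       # data returns in increasing address order
        remaining -= chunk
    return completions
```

A 100-byte read starting at 0x1010 with a 64-byte RCB produces two Completions: 48 bytes up to the boundary at 0x1040, then the remaining 52 bytes. A 32-byte read starting at 0x2000 never reaches an RCB and so completes in one transfer.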

Receiver Completion Handling Rules:
1. A received Completion that doesn't match a pending request is an
   Unexpected Completion and treated as an error.
2. Completions with a completion status other than SC or CRS will be handled
   as errors and buffer space associated with them will be released.
3. When the Root Complex receives a CRS status during a configuration cycle,
   the request is terminated. What happens next is implementation specific,
   but if the Root supports it, the action is defined by the setting of its CRS
   Software Visibility bit in the Root Control register.
   - If CRS Software Visibility is not enabled, the Root will reissue the
     config request for an implementation-specific number of times before
     giving up and concluding the target has a problem.
   - If CRS Software Visibility is enabled, software designed to support it
     will always read both bytes of the Vendor ID field first. If the hardware
     then receives a CRS for that Request, it returns the value 0001h for the
     Vendor ID. This value, reserved for this use by the PCI-SIG, doesn't
     correspond to any valid Vendor ID and informs software about this event.
     This allows software to go on to some other task while waiting for the
     target to become ready (which could take as long as 1 second after reset)
     rather than being stalled. Any other config read or write will simply be
     automatically retried by the Root as a new Request for the design-specific
     number of iterations.
4. A CRS status in response to a request other than configuration is illegal and
   may be reported as a Malformed TLP.
5. Completions with status = reserved code are treated as if the code was UR.
6. If a Read Completion or an AtomicOp Completion is received with a status
   other than SC, no data is included with the completion and the Requester
   must consider this Request terminated. How the Requester handles this
   case is implementation specific.
7. In the event multiple completions are being returned for a read request, a
   completion status other than SC ends the transaction. Device handling of
   data received prior to the error is implementation specific.
8. For compatibility with PCI, a Root Complex may be required to synthesize
   a read value of all 1s when a configuration cycle ends with a completion
   indicating an Unsupported Request. This is analogous to a PCI Master
   Abort that happens when enumeration software attempts to read from
   devices that are not present.

Message Requests
Message Requests replace many of the interrupt, error, and power management
sideband signals used on PCI and PCI-X. All Message Requests use the 4DW
header format, but not all of the fields are used in every Message type. Fields
in bytes 8 through 15 are not defined for some Messages and are reserved for
those cases. Messages are treated much like posted Memory Write transactions but
their routing can be based on address, ID, and in some cases the routing can be
implicit. The routing subfield (Byte 0, bits 2:0) in the packet header indicates
which routing method is used and which additional header registers are
defined. The general Message Request header format is shown in Figure 5-11 on
page 203.

Figure 5-11: 4DW Message Request Header Format

4DW Header for Messages:
  Byte 0:   Fmt (0x1), Type (10rrr), R, TC, R, Attr, R, TH (0), TD, EP,
            Attr (00), AT (00), Length
  Byte 4:   Requester ID, Tag, Message Code
  Byte 8:   Bytes 8-11 vary with Message Code field
  Byte 12:  Bytes 12-15 vary with Message Code field


Message Request Header Fields.

Table 5-8: Message Request Header Fields

Field Name                  Header Byte/Bit   Function

Fmt[2:0]                    Byte 0 Bit 7:5    Packet Format. Always a 4DW header.
(Format)                                      001b = Message Request without data
                                              011b = Message Request with data

Type[4:0]                   Byte 0 Bit 4:0    TLP packet type field. Set to:
                                              Bit 4:3:
                                              10b = Msg
                                              Bit 2:0 (Message Routing Subfield):
                                              000b = Implicitly Routed to RC (Root
                                                     Complex)
                                              001b = Routed by address
                                              010b = Routed by ID
                                              011b = Implicitly Broadcast from RC
                                              100b = Local; terminate at receiver
                                              101b = Gather & route to RC
                                              Others = Reserved, treated as Local

TC[2:0]                     Byte 1 Bit 6:4    TC is zero for most Message Requests,
(Traffic Class)                               ensuring that they don't interfere with
                                              high-priority packets.

Attr[2]                     Byte 1 Bit 2      Indicates whether ID-based Ordering is to
(Attributes)                                  be used for this TLP. To learn more, see
                                              "ID Based Ordering (IDO)" on page 301.

TH                          Byte 1 Bit 0      Reserved, except as noted.
(TLP Processing Hints)

TD                          Byte 2 Bit 7      If = 1, indicates the presence of a digest
                                              field (1 DW) at the end of the TLP
                                              (preceding LCRC and END).

EP                          Byte 2 Bit 6      If = 1, indicates the data payload (if
                                              present) is poisoned.

Attr[1:0]                   Byte 2 Bit 5:4    Except as noted, these are always reserved
(Attributes)                                  in Message Requests.

AT[1:0]                     Byte 2 Bit 3:2    Address Type is reserved for Messages and
(Address Type)                                must be zero, but Receivers are not
                                              required or even encouraged to check this.

Length[9:0]                 Byte 2 Bit 1:0    Indicates data payload size in DW. For
                            Byte 3 Bit 7:0    Message Requests, this field is always 0
                                              (no data) or 1 (one DW of data).

Requester ID[15:0]          Byte 4 Bit 7:0    Identifies the Requester sending the
                            Byte 5 Bit 7:0    message.
                                              Byte 4, 7:0 = Requester Bus #
                                              Byte 5, 7:3 = Requester Device #
                                              Byte 5, 2:0 = Requester Function #

Tag[7:0]                    Byte 6 Bit 7:0    Since all Message Requests are posted and
                                              don't receive Completions, no tag is
                                              assigned to them. These bits should be
                                              zero.

Message Code[7:0]           Byte 7 Bit 7:0    This field contains the code indicating the
                                              type of message being sent.
                                              0000 0000b = Unlock Message
                                              0001 0000b = Lat. Tolerance Reporting
                                              0001 0010b = Optimized Buffer Flush/Fill
                                              0001 xxxxb = Power Mgt. Message
                                              0010 0xxxb = INTx Message
                                              0011 00xxb = Error Message
                                              0100 xxxxb = Ignored Messages
                                              0101 0000b = Set Slot Power Message
                                              0111 111xb = Vendor-Defined Messages

Address[63:32]              Byte 8 Bit 7:0    If address routing was selected for the
                            Byte 9 Bit 7:0    message (see Type 4:0 field above), then
                            Byte 10 Bit 7:0   this field contains the upper 32 bits of
                            Byte 11 Bit 7:0   the 64-bit starting address. Otherwise,
                                              this field is not used.

Address[31:2]               Byte 12 Bit 7:0   If address routing is selected (see Type
                            Byte 13 Bit 7:0   field above), then this field contains the
                            Byte 14 Bit 7:0   lower part of the 64-bit starting address.
                            Byte 15 Bit 7:2   If ID routing is selected, Bytes 8 and 9
                                              form the target ID. Otherwise, this field
                                              is not used.
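
The Type and Message Code encodings above lend themselves to a simple decoder. The sketch below is illustrative only: the names `ROUTING`, `message_group` and `decode_message` are hypothetical, and the exact-match codes are tested before the wider 0001 xxxxb Power Management pattern because they fall inside its range.

```python
ROUTING = {
    0b000: "Implicitly Routed to RC",
    0b001: "Routed by address",
    0b010: "Routed by ID",
    0b011: "Implicitly Broadcast from RC",
    0b100: "Local; terminate at receiver",
    0b101: "Gather & route to RC",
}

def message_group(code):
    """Classify a Message Code[7:0] value into its message group."""
    if code == 0x00:
        return "Unlock"
    if code == 0x10:
        return "Latency Tolerance Reporting"
    if code == 0x12:
        return "Optimized Buffer Flush/Fill"
    if code & 0xF0 == 0x10:
        return "Power Management"
    if code & 0xF8 == 0x20:
        return "INTx"
    if code & 0xFC == 0x30:
        return "Error"
    if code & 0xF0 == 0x40:
        return "Ignored"
    if code == 0x50:
        return "Set Slot Power Limit"
    if code & 0xFE == 0x7E:
        return "Vendor Defined"
    return "unknown"

def decode_message(byte0, byte7):
    """Return (has_data, routing, group) from header bytes 0 and 7."""
    fmt, typ = byte0 >> 5, byte0 & 0x1F
    if fmt not in (0b001, 0b011) or typ >> 3 != 0b10:
        raise ValueError("not a Message Request")
    # Reserved routing values are treated as Local.
    routing = ROUTING.get(typ & 0x7, ROUTING[0b100])
    return fmt == 0b011, routing, message_group(byte7)
```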

Message Notes: The following tables specify the message coding used for
each of the nine message groups, and are based on the message code field listed
in Table 5-8 on page 204. The defined message groups include:

1. INTx Interrupt Signaling
2. Power Management
3. Error Signaling
4. Locked Transaction Support
5. Slot Power Limit Support
6. Vendor-Defined Messages
7. Ignored Messages (related to Hot-Plug support in spec revision 1.1)
8. Latency Tolerance Reporting (LTR)
9. Optimized Buffer Flush and Fill (OBFF)

INTx Interrupt Messages. Many devices are capable of using the PCI 2.3
Message Signaled Interrupt (MSI) method of delivering interrupts, but older
devices may not support it. For these cases, PCIe defines a "virtual wire"
alternative in which devices simulate the assertion and deassertion of the PCI
interrupt pins (INTA-INTD) by sending Messages. The interrupting device sends
the first Message to inform the upstream device that an interrupt has been
asserted. Once the interrupt has been serviced, the interrupting device sends a
second Message to communicate that the signal has been released. For more on
this protocol, refer to the section called "Virtual INTx Signaling" on page 805.


Table 5-9: INTx Interrupt Signaling Message Coding

INTx Message        Message Code 7:0    Routing 2:0

Assert_INTA         0010 0000b          100b (Local, Terminate at Rx)
Assert_INTB         0010 0001b          100b
Assert_INTC         0010 0010b          100b
Assert_INTD         0010 0011b          100b
Deassert_INTA       0010 0100b          100b
Deassert_INTB       0010 0101b          100b
Deassert_INTC       0010 0110b          100b
Deassert_INTD       0010 0111b          100b

Rules regarding the use of INTx Messages:

1. They have no data payload and so the Length field is reserved.
2. They're only issued by Upstream Ports. Checking this rule for received
   packets is optional but, if checked, violations will be handled as Malformed
   TLPs.
3. They're required to use the default traffic class TC0. Receivers must check
   for this and violations will be handled as Malformed TLPs.
4. Components at both ends of the Link must track the current state of the four
   virtual interrupts. If the logical state of one interrupt changes at the
   Upstream Port, it must send the appropriate INTx message.
5. INTx signaling is disabled when the Interrupt Disable bit of the Command
   Register is set = 1 (as would be the case for physical interrupt lines).
6. If any virtual INTx signals are active when the Interrupt Disable bit is set
   in the device, the Upstream Port must send corresponding Deassert_INTx
   messages.
7. Switches must track the state of the four INTx signals independently for
   each Downstream Port and combine the states for the Upstream Port.
8. The Root Complex must track the state of the four INTx lines independently
   and convert them into system interrupts in an implementation-specific way.
9. They use the routing type "Local, Terminate at Receiver" to allow a Switch
   to remap the designated interrupt pin when necessary (see "Mapping and
   Collapsing INTx Messages" on page 808). Consequently, the Requester ID
   in an INTx message may be assigned by the last transmitter.
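
Rules 4 and 7 describe a Switch that tracks each virtual wire per Downstream Port and reports only changes in the combined state. A rough sketch of that bookkeeping follows. The class and method names are hypothetical, and the interrupt pin remapping mentioned in rule 9 is deliberately left out.

```python
class IntxCollapser:
    """Tracks virtual INTx wires per Downstream Port and combines them
    (logical OR) for the Upstream Port. Pin remapping is not modeled."""

    def __init__(self, num_ports):
        # state[port][wire], where wire 0..3 stands for INTA..INTD
        self.state = [[False] * 4 for _ in range(num_ports)]

    def receive(self, port, wire, asserted):
        """Process an Assert_INTx or Deassert_INTx seen on a Downstream Port.

        Returns the name of the message to send upstream, or None when
        the combined state of that wire did not change.
        """
        before = any(p[wire] for p in self.state)
        self.state[port][wire] = asserted
        after = any(p[wire] for p in self.state)
        if before == after:
            return None
        name = "ABCD"[wire]
        return f"Assert_INT{name}" if after else f"Deassert_INT{name}"
```

If two Downstream Ports both assert INTA, only the first assertion produces an upstream Assert_INTA, and the upstream Deassert_INTA is sent only after both ports have deasserted.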

Power Management Messages. PCI Express is compatible with PCI
power management, and adds hardware-based Link power management as
well. Messages are used to convey some of this information, but to learn how
the overall PCIe power management protocol works, refer to Chapter 16,
entitled "Power Management," on page 703. Table 5-10 on page 208 summarizes the
four power management message types.

Table 5-10: Power Management Message Coding

Power Management Message    Message Code 7:0    Routing 2:0

PM_Active_State_Nak         0001 0100b          100b
PM_PME                      0001 1000b          000b
PME_Turn_Off                0001 1001b          011b
PME_TO_Ack                  0001 1011b          101b

Power Management Message Rules:

1. Power Management Messages don't have a data payload, so the Length
   field is reserved.
2. They're required to use the default traffic class TC0. Receivers must check
   for this and handle violations as Malformed TLPs.
3. PM_Active_State_Nak is sent from a Downstream Port after it observes a
   request from the Link neighbor to change the Link power state to L1 but it
   has chosen not to do so ("Local, Terminate at Receiver" routing).
4. PM_PME is sent upstream by the component requesting a Power Management
   Event ("Implicitly Routed to the Root Complex").
5. PME_Turn_Off is sent downstream to all endpoints ("Implicitly Broadcast
   from the Root Complex" routing).
6. PME_TO_Ack is sent upstream by endpoints. For switches with multiple
   Downstream Ports, this message won't be forwarded upstream until all
   Downstream Ports have received it ("Gather and Route to the Root Complex"
   routing).
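
The gathering behavior in rule 6 can be sketched as a set of ports still owing an acknowledgment. The class below is a hypothetical illustration, not a description of any real switch design; the port identifiers are arbitrary.

```python
class PmeToAckGatherer:
    """Collects PME_TO_Ack messages from a switch's Downstream Ports and
    signals once, when every port has reported, that one can go upstream."""

    def __init__(self, downstream_ports):
        self.pending = set(downstream_ports)
        self.forwarded = False

    def receive_ack(self, port):
        """Record a PME_TO_Ack from a port.

        Returns True exactly once: when the last outstanding port reports.
        """
        self.pending.discard(port)
        if not self.pending and not self.forwarded:
            self.forwarded = True
            return True
        return False
```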


Error Messages. Error Messages are sent upstream (Implicitly Routed to the
Root Complex) by enabled components that detect errors. To assist software in
knowing how to service the error, the Error Message identifies the requesting
agent in the Requester ID field of the message header. Table 5-11 on page 209
describes the three error message types.

Table 5-11: Error Message Coding

Error Message                               Message Code 7:0    Routing 2:0

ERR_COR (Correctable)                       0011 0000b          000b
ERR_NONFATAL (Uncorrectable, Non-fatal)     0011 0001b          000b
ERR_FATAL (Uncorrectable, Fatal)            0011 0011b          000b

Error Signaling Message Rules:

1. They're required to use the default traffic class TC0. Receivers must check
   for this and handle violations as Malformed TLPs.
2. They don't have a data payload, so the Length field is reserved.
3. The Root Complex converts Error Messages into system-specific events.

Locked Transaction Support. The Unlock Message is used as part of the
Locked transaction protocol defined for PCI and still available to Legacy
Devices. The protocol begins with a Memory Read Locked Request. When that
Request is seen by Ports along the path to the target device, they implement an
atomic read-modify-write protocol by locking out other Requesters from using
VC0 until the Unlock Message is received. This Message is sent to the target to
release all the Ports in the path to it and finish the Locked Transaction
sequence. Table 5-12 on page 209 summarizes the coding for this message.

Table 5-12: Unlock Message Coding

Unlock Message      Message Code 7:0    Routing 2:0

Unlock              0000 0000b          011b


Unlock Message Rules:

1. They're required to use the default traffic class TC0. Receivers must check
   for this and handle violations as Malformed TLPs.
2. They don't have a data payload, and the Length field is reserved.

Set Slot Power Limit Message. This is sent from a Downstream Port to
the device plugged into the slot. This power limit is stored in the endpoint in
its Device Capabilities Register. Table 5-13 summarizes the message coding.

Table 5-13: Slot Power Limit Message Coding

Slot Power Limit Message    Message Code 7:0    Routing 2:0

Set_Slot_Power_Limit        0101 0000b          100b

Set_Slot_Power_Limit Message Rules:

1. They're required to use the default traffic class TC0. Receivers must check
   for this and handle violations as Malformed TLPs.
2. The data payload is 1 DW and so the Length field is set to one. Only the
   lower 10 bits of the 32-bit data payload are used for slot power scaling; the
   upper payload bits must be set to zero.
3. This message is sent automatically any time the Data Link Layer transitions
   to DL_Up status or if a configuration write to the Slot Capabilities Register
   occurs while the Data Link Layer is already reporting DL_Up status.
4. If the card in the slot already consumes less power than the power limit
   specified, it's allowed to ignore the Message.

Vendor-Defined Message 0 and 1. These are intended to allow expansion
of the PCIe messaging capabilities either by the spec or by vendor-specific
extensions. The header for them is shown in Figure 5-12 on page 211, and the
codes are given in Table 5-14 on page 211.


Figure 5-12: Vendor-Defined Message Header

  Byte 0:   Fmt = 0x1 (binary, x = 0 or 1), Type = 10rrr (rrr = routing),
            TC, Attr, TH, TD, EP, Attr, AT, Length
  Byte 4:   Requester ID, Tag, Message Code = 0111111x
  Byte 8:   Target BDF if ID Routing is used, otherwise Reserved; Vendor ID
  Byte 12:  For Vendor Definition

Table 5-14: Vendor-Defined Message Coding

Vendor-Defined Message      Message Code [7:0]      Routing [2:0]

Vendor-Defined Message 0    01111110b               000b, 010b, 011b, 100b
Vendor-Defined Message 1    01111111b               000b, 010b, 011b, 100b

Vendor-Defined Message Rules:

1. A data payload may or may not be included with either type.
2. Messages are distinguished by the Vendor ID field.
3. Attribute bits [2] and [1:0] are not reserved.
4. If the Receiver doesn't recognize the Message:
   • Type 1 Messages are silently discarded
   • Type 0 Messages are treated as an Unsupported Request error condition
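Rule 4 can be sketched as a small decision function. This is only an illustration: the function name, the Outcome enum, and the idea that a Receiver "recognizes" a Message purely by looking up its Vendor ID in a set are assumptions made for the example, not something the spec prescribes. The two Message Codes are taken from Table 5-14.

```python
from enum import Enum

# Message Codes from Table 5-14
VENDOR_DEFINED_TYPE0 = 0b0111_1110
VENDOR_DEFINED_TYPE1 = 0b0111_1111


class Outcome(Enum):
    HANDLED = "handled"
    DISCARDED = "silently discarded"
    UNSUPPORTED_REQUEST = "Unsupported Request error"


def handle_vendor_message(message_code, vendor_id, known_vendor_ids):
    """Decide what a Receiver does with a Vendor-Defined Message.

    known_vendor_ids is a hypothetical set of Vendor IDs whose messages
    this Receiver understands.
    """
    if message_code not in (VENDOR_DEFINED_TYPE0, VENDOR_DEFINED_TYPE1):
        raise ValueError("not a Vendor-Defined Message code")
    if vendor_id in known_vendor_ids:
        return Outcome.HANDLED
    # Unrecognized Message: Type 1 is dropped quietly, Type 0 is an error.
    if message_code == VENDOR_DEFINED_TYPE1:
        return Outcome.DISCARDED
    return Outcome.UNSUPPORTED_REQUEST
```

The asymmetry is the point of having two codes: a sender that can tolerate being ignored uses Type 1, while Type 0 makes an unrecognized Message visible as an error.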

Ignored Messages. Listing an entire category of Messages that are to be
ignored sounds a little strange without the context for it. These were formerly
Hot-Plug Signaling messages that supported devices that had Hot-Plug
indicators and push buttons on the add-in card itself rather than on the system
board. This Message type was defined through spec rev 1.0a, but this option was
no longer supported beginning with the 1.1 spec release, so the details are only
included here for reference. As the name now suggests, Transmitters are
strongly encouraged not to send these messages, and Receivers are strongly
encouraged to ignore them if they are seen. If they're still going to be used
anyway, they must conform to the 1.0a spec details.

Table 5-15: Hot-Plug Message Coding

Hot-Plug Message            Message Code [7:0]      Routing [2:0]

Attention_Indicator_On      01000001b               100b
Attention_Indicator_Blink   01000011b               100b
Attention_Indicator_Off     01000000b               100b
Power_Indicator_On          01000101b               100b
Power_Indicator_Blink       01000111b               100b
Power_Indicator_Off         01000100b               100b
Attention_Button_Pressed    01001000b               100b

Hot-Plug Message Rules:
• The indicator Messages are driven by a Downstream Port to the card in the slot.
• The Attention Button Message is driven upstream by a slot device.

Latency Tolerance Reporting Message. LTR Messages are used to
optionally report acceptable read/write service latencies for a device. To learn
more about this power management technique, see the section called "LTR
(Latency Tolerance Reporting)" on page 784.

Figure 5-13: LTR Message Header

  Byte 0:   Fmt = 001, Type = 10100, TC, Attr, TH, TD, EP, Attr, AT, Length
  Byte 4:   Requester ID, Tag, Message Code = 00010000
  Byte 8:   Reserved
  Byte 12:  No-Snoop Latency, Snoop Latency


Table 5-16: LTR Message Coding

Latency Tolerance Reporting Message     Message Code [7:0]      Routing [2:0]

LTR                                     00010000b               100b

LTR Message Rules:

1. They're required to use the default traffic class, TC0. Receivers must check
   for this and handle violations as Malformed TLPs.
2. They don't have a data payload, and the Length field is reserved.

Optimized Buffer Flush and Fill Messages. OBFF Messages are used
to report platform power status to Endpoints and facilitate more effective
system power management. To learn more about this technique, see the discussion
called "OBFF (Optimized Buffer Flush and Fill)" on page 776.

Figure 5-14: OBFF Message Header

  Byte 0:   Fmt = 001, Type = 10100, TC, Attr, TH, TD, EP, Attr, AT, Length
  Byte 4:   Requester ID, Tag, Message Code = 00010010
  Byte 8:   Reserved
  Byte 12:  Reserved, OBFF Code

Table 5-17: OBFF Message Coding

Optimized Buffer Flush/Fill Message     Message Code [7:0]      Routing [2:0]

OBFF                                    00010010b               100b


OBFF Message Rules:

1. They're required to use the default traffic class, TC0. Receivers must check
   for this and handle violations as Malformed TLPs.
2. They don't have a data payload, and the Length field is reserved.
3. The Requester ID must be set to the Transmitting Port's ID.


6 Flow Control
The Previous Chapter
The previous chapter discusses the three major classes of packets: Transaction
Layer Packets (TLPs), Data Link Layer Packets (DLLPs) and Ordered Sets. That
chapter describes the use, format, and definition of the variety of TLPs and the
details of their related fields. DLLPs are described separately in Chapter 9,
entitled "DLLP Elements," on page 307.

This Chapter
This chapter discusses the purposes and detailed operation of the Flow Control
Protocol. Flow control is designed to ensure that transmitters never send
Transaction Layer Packets (TLPs) that a receiver can't accept. This prevents
receive buffer overruns and eliminates the need for PCI-style inefficiencies like
disconnects, retries, and wait states.

The Next Chapter


The next chapter discusses the mechanisms that support Quality of Service and
describes the means of controlling the timing and bandwidth of different
packets traversing the fabric. These mechanisms include application-specific
software that assigns a priority value to every packet, and optional hardware that
must be built into each device to enable managing transaction priority.

Flow Control Concept


Ports at each end of every PCIe Link must implement Flow Control. Before a
packet can be transmitted, flow control checks must verify that the receiving
port has sufficient buffer space to accept it. In parallel bus architectures like PCI,
transactions are attempted without knowing whether the target is prepared to
handle the data. If the request is rejected due to insufficient buffer space, the
transaction is repeated (retried) until it completes. This is the "Delayed
Transaction Model" of PCI and, while it works, the efficiency is poor.


Flow Control mechanisms can improve transmission efficiency if multiple
Virtual Channels (VCs) are used. Each Virtual Channel carries transactions that are
independent from the traffic flowing in other VCs because flow control buffers
are maintained separately. Therefore, a full Flow Control buffer in one VC will
not block access to other VC buffers. PCIe supports up to 8 Virtual Channels.

Flow Control uses a credit-based mechanism that allows the
transmitting port to be aware of buffer space available at the receiving port. As
part of its initialization, each receiver reports the size of its buffers to the
transmitter on the other end of the Link, and then during run-time it regularly
updates the number of credits available using Flow Control DLLPs. Technically,
of course, DLLPs are overhead because they don't convey any data payload, but
they are kept small (always 8 symbols in size) to minimize their impact on
performance.

Flow control logic is actually a shared responsibility between two layers: the
Transaction Layer contains the counters, but the Link Layer sends and receives
the DLLPs that convey the information. Figure 6-1 on page 217 illustrates that
shared responsibility. In the process of making flow control work:

• Devices Report Available Buffer Space: The receiver of each port reports
  the size of its Flow Control buffers in units called credits. The number of
  credits within a buffer is sent from the receive-side transaction layer to the
  transmit side of the Link Layer. At the appropriate times, the Link Layer
  creates a Flow Control DLLP to forward this credit information to the
  receiver at the other end of the Link for each Flow Control Buffer.
• Receivers Register Credits: The receiver gets Flow Control DLLPs and
  transfers the credit values to the transmit side of the transaction layer. This
  completes the transfer of credits from one link partner to the other. These
  actions are performed in both directions until all flow control information
  has been exchanged.
• Transmitters Check Credits: Before it can send a TLP, a transmitter
  checks the Flow Control Counters to learn whether sufficient credits are
  available. If so, the TLP is forwarded to the Link Layer but, if not, the
  transaction is blocked until more Flow Control credits are reported.


Figure 6-1: Location of Flow Control Logic

[Figure: PCIe Device A and PCIe Device B connected by a Link. Each device
shows a Device Core, a PCIe core with its hardware/software interface, a
Transaction Layer holding the FC Counters and FC Buffers, a Data Link Layer,
and a Physical Layer with (TX) and (RX) sides.]

Flow Control Buffers and Credits


Flow control buffers are implemented for each VC resource supported by a
port. Recall that ports at each end of the Link may not support the same number
of VCs; therefore, the maximum number of VCs configured and enabled by
software is the highest common number between the two ports.


VC Flow Control Buffer Organization


Each VC Flow Control buffer at the receiver is managed for each category of
transaction flowing through the virtual channel. These categories are:
• Posted Transactions: Memory Writes and Messages
• Non-Posted Transactions: Memory Reads, Configuration Reads and
  Writes, and I/O Reads and Writes
• Completions: Read and Write Completions
In addition, each of these categories is separated into header and data portions
for transactions that have both header and data. This yields six different buffers,
each of which implements its own flow control (see Figure 6-2 on page 218).
Some transactions, like read requests, consist of a header only while others, like
write requests, have both a header and data. The transmitter must ensure that
both header and data buffer space is available as needed for a transaction before
it can be sent. Note that transaction ordering must be maintained within a VC
Flow Control buffer when the transactions are forwarded to software or to an
egress port in the case of a switch. Consequently, the receiver must also track
the order of header and data components within the buffer.

Figure 6-2: Flow Control Buffer Organization

[Figure: PCIe Device A and PCIe Device B connected by a Link, with the
receiver's six Flow Control buffers called out: (PH), (PD), (NPH), (NPD),
(CPLH), (CPLD). The transmitter holds FC Counters and the receiver holds RCV
Buffers for each of P, NP, and CPL.
Legend: P = Posted Request, NP = Non-Posted Request, CPL = Completion.]


Flow Control Credits


Buffer space is reported by the receiver in units called Flow Control credits. The
unit value of Flow Control Credits (FCCs) for header and data buffers is:

• Header credits: maximum header size + digest
  - 4 DWs for completions
  - 5 DWs for requests
• Data credits: 4 DWs (aligned 16 bytes)

Flow Control DLLPs communicate this information, and do not require Flow
Control credits themselves. That's because they originate and terminate at the
Link Layer and don't use the Transaction Layer buffers.
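As a rough illustration of these unit sizes, the sketch below computes how many credits a single TLP would consume. The function name is made up for the example, and rounding a partial 16-byte chunk up to a whole data credit is an assumption drawn from the "aligned 16 bytes" wording above.

```python
DATA_CREDIT_BYTES = 16  # one data credit = 4 DWs


def credits_required(payload_bytes):
    """Return (header_credits, data_credits) consumed by one TLP.

    Every TLP uses exactly one header credit; the payload, if any, is
    charged in whole 16-byte data credits (assumed to round up).
    """
    if payload_bytes < 0:
        raise ValueError("payload size cannot be negative")
    header_credits = 1
    data_credits = -(-payload_bytes // DATA_CREDIT_BYTES)  # ceiling division
    return header_credits, data_credits
```

For example, a read request with no payload needs one header credit and no data credits, while a 64-byte write needs one header credit and four data credits.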

Initial Flow Control Advertisement


During Flow Control initialization, PCIe devices communicate their buffer sizes
by advertising their buffer space via flow control credits. PCIe also defines an
infinite Flow Control credit value that is required for some buffers. A receiver
that advertises infinite buffer space is effectively guaranteeing that its buffer
space will never overflow.

Minimum and Maximum Flow Control Advertisement


The specification defines the minimum number of credits that can be reported
for the different Flow Control buffer types as listed in Table 6-1. However,
devices normally advertise considerably more credits than the minimum.
Table 6-2 on page 220 lists the maximum advertisement allowed by the
specification.

Table 6-1: Required Minimum Flow Control Advertisements

Credit Type                         Minimum Advertisement

Posted Request Header (PH)          1 unit. Credit Value = one 4DW HDR + Digest = 5DW.

Posted Request Data (PD)            Largest possible setting of the Max_Payload_Size, in
                                    credits. Example: If the largest Max_Payload_Size value
                                    supported is 1024 bytes, the smallest permitted initial
                                    credit value would be 040h.

Non-Posted Request HDR (NPH)        1 unit. Credit Value = one 4DW HDR + Digest = 5DW.

Non-Posted Request Data (NPD)       1 unit. Credit Value = 4DW.

                                    2 units. Receivers supporting AtomicOp routing or
                                    AtomicOp Completer capability have a credit value of 02h.

Completion HDR (CPLH)               1 unit. Credit Value = one 3DW HDR + Digest = 4DW;
                                    for Root Complex with peer-to-peer support and
                                    Switches.

                                    Infinite units. Initial Credit Value = all 0s for Root
                                    Complex with no peer-to-peer support and Endpoints.

Completion Data (CPLD)              n units. Value of the largest possible setting of
                                    Max_Payload_Size or size of the largest Read Request
                                    (whichever is smaller) divided by FC Unit Size (4DW);
                                    for Root Complex with peer-to-peer support and
                                    Switches.

                                    Infinite units. Initial Credit Value = all 0s; for Root
                                    Complex with no peer-to-peer support and Endpoints.
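The Posted Data and Completion Data minimums in Table 6-1 are simple divisions by the 16-byte credit size. The sketch below reproduces the 1024-byte example from the table; the function names are illustrative only.

```python
FC_UNIT_BYTES = 16  # FC Unit Size for data credits = 4 DWs


def min_posted_data_credits(largest_max_payload_size):
    """Smallest permitted initial PD advertisement, in credits."""
    return largest_max_payload_size // FC_UNIT_BYTES


def min_completion_data_credits(largest_max_payload_size, largest_read_request):
    """Smallest permitted finite CPLD advertisement, in credits.

    Applies to ports that forward Completions (Switches, Root Complex
    with peer-to-peer support); other ports advertise infinite credits.
    """
    smaller = min(largest_max_payload_size, largest_read_request)
    return smaller // FC_UNIT_BYTES


# 1024-byte Max_Payload_Size -> 64 credits = 040h, as in Table 6-1
print(f"{min_posted_data_credits(1024):03X}h")
```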

Table 6-2: Maximum Flow Control Advertisements

Credit Type                         Maximum Advertisement

Posted Request Header (PH)          128 units. 128 credits @ 5 DWs = 2,560 bytes.

Posted Request Data (PD)            2048 units. The Max_Payload_Size (4096 bytes) times
                                    the number of functions supported by a device (8) is
                                    32,768 bytes; divided by the credit size (4 DWs), that
                                    is 2048 credits.
                                    2048 credits @ 4 DWs = 32,768 bytes.

Non-Posted Request HDR (NPH)        128 units. 128 credits @ 5 DWs = 2,560 bytes.

Non-Posted Request Data (NPD)       The authors could not find a precise value for the
                                    maximum number of credits for Non-Posted Data. The
                                    maximum number of credits listed for Data is 2048.
                                    However, a more reasonable approach might use the
                                    Non-Posted header limit of 128 credits, because
                                    Non-Posted Data is always associated with Non-Posted
                                    Headers.

Completion HDR (CPLH)               128 units. 128 credits @ 4 DWs = 2,048 bytes. This is
                                    the limit for ports that do not originate transactions
                                    (e.g., Root Complex with peer-to-peer support and
                                    Switches).

                                    Infinite units. Initial Credit Value = all 0s for ports
                                    that originate transactions (e.g., Root Complex with no
                                    peer-to-peer support and Endpoints).

Completion Data (CPLD)              2048 units. The Max_Payload_Size (4096 bytes) times
                                    the number of functions supported by a device (8) is
                                    32,768 bytes; divided by the credit size (4 DWs), that
                                    is 2048 credits.
                                    2048 credits @ 4 DWs = 32,768 bytes.

                                    Infinite units. Initial Credit Value = all 0s for ports
                                    that originate transactions (e.g., Root Complex with no
                                    peer-to-peer support and Endpoints).

Infinite Credits
Note that a flow control value of 00h will be understood to mean infinite credits
during initialization. Following Flow Control initialization no further
advertisements are made. Devices that originate transactions must reserve buffer space
for the data or status information that will return during split transactions.
These transaction combinations include:

• Non-posted Read requests and return of Completion Data
• Non-posted Read requests and return of Completion Status
• Non-posted Write requests and return of Completion Status

Special Use for Infinite Credit Advertisements.


The specification points out a special consideration for devices that implement
only VC0. For example, the only Non-Posted writes are I/O Writes and
Configuration Writes, both of which are permitted only on VC0. Thus, Non-Posted data
buffers are not used for VC1-VC7 and an infinite value can be advertised for
those buffers. However, the Non-Posted Header buffer must still operate and its
header credits must still be updated.


Flow Control Initialization

General
Prior to sending any transactions, flow control initialization is needed. In fact,
TLPs cannot be sent across the Link until Flow Control Initialization is
performed successfully. Initialization occurs on every Link in the system and
involves a handshake between the devices at each end of a link. This process
begins as soon as the Physical Layer link training has completed. The Link
Layer knows the Physical Layer is ready when it observes that the LinkUp signal
is active, as illustrated in Figure 6-3.

Figure 6-3: Physical Layer Reports That It's Ready

[Figure: PCIe Device A and PCIe Device B connected by a Link. In each device
the Physical Layer's LTSSM drives the LinkUp signal to the DLCMSM in the
Data Link Layer (DLL), below the Transaction Layer and Device Core.]

Once started, the Flow Control initialization process is fundamentally the same
for all Virtual Channels and is controlled by hardware once a VC has been
enabled. VC0 is always enabled by default, so its initialization is automatic.
That allows configuration transactions to traverse the topology and carry out
the enumeration process. Other VCs only initialize when configuration
software has set up and enabled them at both ends of the Link.

The FC Initialization Sequence


The flow control initialization process involves the Link Layer's DLCMSM
(Data Link Control and Management State Machine). As shown in Figure 6-4 on
page 223, a reset puts the state machine into the DL_Inactive state. While in the
DL_Inactive state, DL_Down is signaled to both the Link and Transaction
Layers. Meanwhile, it waits to see LinkUp from the Physical Layer to indicate that
the LTSSM has finished its work and the Physical Layer is ready. That causes a
transition to the DL_Init state, which contains two substates that handle flow
control initialization: FC_INIT1 and FC_INIT2.

Figure 6-4: The Data Link Control & Management State Machine

  Reset        -> DL_Inactive
  DL_Inactive  : Report DL_Down to Link and Transaction Layers
  DL_Inactive  -> DL_Init      when Physical LinkUp = 1 & Link Enabled
  DL_Init      : FC_Init1, then FC_Init2
                 Report DL_Down (during FC_Init1)
                 Report DL_Up to Link and Transaction Layers (during FC_Init2)
  DL_Init      -> DL_Active    when FC_Init Complete & Physical LinkUp = 1
  DL_Active    : Report DL_Up
  DL_Init or DL_Active -> DL_Inactive   when Physical LinkUp = 0
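The state machine can be sketched as a pure transition function. This is a simplified model, not the spec's definition: the enum, the function names, and the two boolean inputs init1_done and init2_done (standing in for the conditions that end each FC_Init substate) are assumptions made for the illustration.

```python
from enum import Enum, auto


class DlState(Enum):
    DL_INACTIVE = auto()
    FC_INIT1 = auto()   # first substate of DL_Init
    FC_INIT2 = auto()   # second substate of DL_Init
    DL_ACTIVE = auto()


def next_state(state, link_up, link_enabled=True,
               init1_done=False, init2_done=False):
    """Return the DLCMSM state after evaluating the current inputs."""
    if not link_up:
        # Losing Physical LinkUp always returns to DL_Inactive.
        return DlState.DL_INACTIVE
    if state is DlState.DL_INACTIVE:
        return DlState.FC_INIT1 if link_enabled else DlState.DL_INACTIVE
    if state is DlState.FC_INIT1:
        return DlState.FC_INIT2 if init1_done else DlState.FC_INIT1
    if state is DlState.FC_INIT2:
        return DlState.DL_ACTIVE if init2_done else DlState.FC_INIT2
    return DlState.DL_ACTIVE


def reported_status(state):
    """Status signaled to the Link and Transaction Layers."""
    if state in (DlState.FC_INIT2, DlState.DL_ACTIVE):
        return "DL_Up"
    return "DL_Down"
```

Note that DL_Up is already reported in FC_Init2, before the state machine reaches DL_Active.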


FC_Init1 Details
During the FC_INIT1 state, devices continuously send a sequence of 3 InitFC1
Flow Control DLLPs advertising their receiver buffer sizes (see Figure 6-5).
According to the spec, the packets must be sent in this order: Posted,
Non-posted, and Completions, as illustrated in Figure 6-6 on page 225. The
specification strongly encourages that these be repeated frequently to make it easier for
the receiving device to see them, especially if there are no TLPs or DLLPs to
send. Each device should also receive this sequence from its neighbor so it can
register the buffer sizes. Once a device has sent its own values and received the
complete sequence enough times to be confident that the values were seen
correctly, it's ready to exit FC_INIT1. To do that, it records the received values in its
transmit counters, sets an internal flag (FL1), and changes to the FC_INIT2 state
to begin the second initialization step.

Figure 6-5: INIT1 Flow Control DLLP Format and Contents

  Byte 0:   DLLP type (xxxx), 0, VC ID V[2:0], R, HdrFC, R, DataFC
  Byte 4:   16-bit CRC

  DLLP type (xxxx) encodings:
    0100   Init 1 Posted
    0101   Init 1 Non-Posted
    0110   Init 1 Completion
    1100   Init 2 Posted
    1101   Init 2 Non-Posted
    1110   Init 2 Completion


Figure 6-6: Devices Send InitFC1 in the DL_Init State

[Figure: PCIe Device A and PCIe Device B, each with FC Counters and RCV
Buffers for P, NP, and CPL in the Transaction Layer, exchanging InitFC1
DLLPs across the Link in both directions. Note the required order of InitFC
transmission: InitFC1-P, InitFC1-NP, InitFC1-Cpl.]

FC_Init2 Details
In this state a device continuously sends InitFC2 DLLPs. These are sent in the
same sequence as the InitFC1s and contain the same credit information, but
they also confirm that FC initialization has succeeded at the sender. Since the
device has already registered the values from the neighbor it doesn't need any
more credit information and will ignore any incoming InitFC1s while it waits to
see InitFC2s. It can even send TLPs at this point, even though initialization
hasn't completed for the other side of the Link, and this is indicated to the
Transaction Layer by the DL_Up signal (see Figure 6-7).

Why is this second initialization step needed? The simple answer is that
neighboring devices may finish FC initialization at different times and this method
ensures that the late one will continue to receive the FC information it needs
even if the neighbor finishes early. Once a device receives an FC_INIT2 packet
for any buffer type, it sets an internal flag (FL2). (It doesn't wait to receive an
FC_Init2 for each type.) Note that FL2 is also set upon receipt of an UpdateFC
packet or TLP. When both sides are done and have sent InitFC2s, the DLCMSM
transitions to the DL_Active state and the Link Layer is ready for normal
operation.
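The two flags can be modeled with a small receive-side tracker. This is a sketch under simplifying assumptions: the class and method names are invented, FL1 is set as soon as one complete P/NP/Cpl set has been seen (a real design may wait for several repetitions), and InitFC2 DLLPs that arrive before FL1 is set are used for their credit values since they carry the same information.

```python
class FcInitTracker:
    """Tracks the FL1 and FL2 flags for one Virtual Channel."""

    TYPES = {"P", "NP", "Cpl"}

    def __init__(self):
        self.credit_limit = {}   # fc_type -> (header credits, data credits)
        self.fl1 = False         # set when leaving FC_INIT1
        self.fl2 = False         # set when FC_INIT2 is complete

    def receive(self, kind, fc_type=None, hdr=0, data=0):
        """kind is 'InitFC1', 'InitFC2', 'UpdateFC' or 'TLP'."""
        if not self.fl1:
            # FC_INIT1: record advertised credits until all types are seen.
            if kind in ("InitFC1", "InitFC2"):
                self.credit_limit[fc_type] = (hdr, data)
                if set(self.credit_limit) == self.TYPES:
                    self.fl1 = True
        elif kind in ("InitFC2", "UpdateFC", "TLP"):
            # FC_INIT2: any one of these proves the neighbor is done too.
            self.fl2 = True
        # Incoming InitFC1s are ignored once FL1 is set.
```

A device that finishes early keeps repeating the same credit values in its InitFC2s, so a slower neighbor still in FC_INIT1 can complete its own set from them.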

Figure 6-7: FC Values Registered - Send InitFC2s, Report DL_Up

[Figure: PCIe Device A and PCIe Device B connected by a Link. The DLCMSM in
the Data Link Layer (DLL) reports DL_Up to the Transaction Layer while
InitFC2-P, InitFC2-NP, and InitFC2-Cpl DLLPs are sent across the Link.]

Rate of FC_INIT1 and FC_INIT2 Transmission


The specification defines the latency between sending FC_INIT DLLPs as
follows:

• VC0. Hardware-initiated flow control of VC0 requires that FC_INIT1 and
  FC_INIT2 packets be transmitted continuously at the maximum rate
  possible. That is, the resend timer is set to a value of zero.
• VC1-VC7. When software initiates flow control initialization for other VCs,
  the FC_INIT sequence is repeated when no other TLPs or DLLPs are
  available for transmission. However, the latency between the beginning of one
  sequence to the next can be no greater than 17 µs.

Violations of the Flow Control Initialization Protocol


A violation of the flow control initialization protocol can be optionally checked
by a device. An error detected can be reported as a Data Link Layer protocol
error.

Introduction to the Flow Control Mechanism

General
The specification defines the requirements of the Flow Control mechanism
using registers, counters, and mechanisms for reporting, tracking, and
calculating whether a transaction can be sent. These elements are not required and the
actual implementation is left to the device designer. This section introduces the
specification model and serves to explain the concepts and to define the
requirements.

The Flow Control Elements


Figure 6-8 illustrates the elements used for managing flow control. The diagram
shows transactions flowing in a single direction across a Link, and another set
of these elements supports transfers in the opposite direction. The primary
function of each element is listed below. While these Flow Control elements are
duplicated for all six receive buffers, for simplicity this example only deals with
non-posted header flow control.

One final element associated with managing flow control is the Flow Control
Update DLLP. This is the only Flow Control packet that is used during normal
transmission. The format of the FC Update packet is illustrated in Figure 6-9 on
page 229.


Figure 6-8: Flow Control Elements

[Figure: Device A (transmitter) holds the Transactions Pending Buffer (VC0),
the Credits Consumed and Credit Limit counters, and the FC Gating Logic,
which computes CC + PTLP = CR and sends the pending TLP only if
CL - CR <= 2^8/2. Device B (receiver) holds the VC0 FC Buffer (NP Hdr), the
CredAlloc counter, and the optional Credits Rcv counter with its optional
error check. TLPs flow from A to B across the Link; FC DLLPs flow from B to A.]

Transmitter Elements
• Transactions Pending Buffer: holds transactions that are waiting to be
  sent in the same virtual channel.
• Credits Consumed counter: contains the credit sum of all transactions
  sent for this buffer. This count is abbreviated "CC."
• Credit Limit counter: initialized by the receiver with the size of the
  corresponding Flow Control buffer. After initialization, Flow Control update
  packets are sent periodically to update the Flow Control credits as they
  become available at the receiver. This value is abbreviated "CL."
• Flow Control Gating Logic: performs the calculations to determine if the
  receiver has sufficient Flow Control credits to accept the pending TLP
  (PTLP). In essence, this logic checks that the CREDITS_CONSUMED (CC)
  plus the credits required for the next Pending TLP (PTLP) does not exceed
  the CREDIT_LIMIT (CL). The specification defines the following equation
  for performing the check, with all values represented in credits.

  $(CL - (CC + PTLP)) \bmod 2^{FieldSize} \le 2^{FieldSize} / 2$

  For an example application of this equation, see "Stage 1 Flow Control
  Following Initialization" on page 230.
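The gating equation translates almost directly into code. The following is a minimal sketch rather than a hardware description; the function name and argument names are chosen for the example, and the field size defaults to the 8 bits used for header credits.

```python
def can_send(cl, cc, ptlp, field_size=8):
    """Transmitter gating check, all values in credits.

    cl   -- CREDIT_LIMIT most recently reported by the receiver
    cc   -- CREDITS_CONSUMED so far
    ptlp -- credits required by the pending TLP
    """
    modulus = 1 << field_size
    cr = (cc + ptlp) % modulus            # CREDITS_REQUIRED
    # Python's % always returns a non-negative result, which matches the
    # unsigned modulo arithmetic the equation calls for.
    return (cl - cr) % modulus <= modulus // 2
```

With CL = 66h and CC = 00h, a one-credit TLP passes the check; once CC has reached 66h the same TLP is held back until CL is raised.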

Receiver Elements
• Flow Control Buffer: stores incoming headers or data.
• Credit Allocated: tracks the total Flow Control credits that have been
  allocated (made available). It's initialized by hardware to reflect the size of
  the associated Flow Control buffer. The buffer fills as transactions arrive but
  then they are eventually removed from the buffer by the core logic at the
  receiver. When they are removed, the number of Flow Control credits is
  added to the CREDIT_ALLOCATED counter. Thus the counter tracks the
  number of credits currently available.
• Credits Received counter (optional): tracks the total credits of all TLPs
  received into the Flow Control buffer. When flow control is functioning
  properly, the CREDITS_RECEIVED count should be equal to or less than
  the CREDIT_ALLOCATED count. If this test ever becomes false, a flow
  control buffer overflow has occurred and an error is detected. The spec
  recommends that this optional mechanism be implemented and notes that a
  failure here will be considered a fatal error.

Figure 6-9: Types and Format of Flow Control DLLPs

  Byte 0:   DLLP type (xxxx), 0, VC ID V[2:0], R, HdrFC, R, DataFC
  Byte 4:   16-bit CRC

  DLLP type (xxxx) encodings:
    1000   Update Posted
    1001   Update Non-Posted
    1010   Update Completion
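The type encodings for the InitFC1, InitFC2, and UpdateFC DLLPs can be gathered into one lookup. The sketch below decodes only the first byte of a Flow Control DLLP and assumes, as the figures suggest, that the type code sits in bits 7:4, bit 3 is 0, and the VC ID sits in bits 2:0. The function and table names are illustrative.

```python
FC_DLLP_TYPES = {
    0b0100: "InitFC1-P",
    0b0101: "InitFC1-NP",
    0b0110: "InitFC1-Cpl",
    0b1100: "InitFC2-P",
    0b1101: "InitFC2-NP",
    0b1110: "InitFC2-Cpl",
    0b1000: "UpdateFC-P",
    0b1001: "UpdateFC-NP",
    0b1010: "UpdateFC-Cpl",
}


def decode_fc_dllp_byte0(byte0):
    """Return (dllp_type_name, vc_id) for the first byte of an FC DLLP."""
    type_code = (byte0 >> 4) & 0xF
    if byte0 & 0x08 or type_code not in FC_DLLP_TYPES:
        raise ValueError(f"not a Flow Control DLLP: {byte0:#04x}")
    return FC_DLLP_TYPES[type_code], byte0 & 0x07
```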


Flow Control Example


The following example describes the non-posted header Flow Control buffer,
and attempts to capture the nuances of the flow control implementation in
several situations. The discussion of Flow Control is described with a series of basic
stages as follows:

Stage One: Immediately following initialization a transaction is transmitted
and tracked to explain the basic operation of the counters and registers.

Stage Two: The transmitter sends transactions faster than the receiver can
process them and the buffer becomes full.

Stage Three: When counters roll over to zero, the mechanism still works but
there are a couple of issues to consider.

Stage Four: The optional receiver error check for a buffer overflow.

Stage 1 Flow Control Following Initialization


Once flow control initialization has completed, the devices are ready for normal
operation. The Flow Control buffer in our example is 2KB in size. We're
describing the non-posted header buffer, and each credit is 5 dwords in size or 20 bytes.
That means 102d (66h) Flow Control units are available. Figure 6-10 on page 231
illustrates the elements involved, including the values that would be in each
counter and register following flow control initialization.

When the transmitter is ready to send a TLP, it must first check Flow Control
credits. Our example is simple because a non-posted header is the only packet
being sent and it always requires just one Flow Control credit, and we are also
assuming that no data is included in the transaction.

The header credit check is made using unsigned arithmetic (2's complement),
and must satisfy the following formula:

  $(CL - (CC + PTLP)) \bmod 2^{FieldSize} \le 2^{FieldSize} / 2$

Substituting values from Figure 6-10 yields:

  $(66h - (00h + 01h)) \bmod 2^{8} \le 2^{8} / 2$
  $(66h - 01h) \bmod 256 \le 80h$


Figure 6-10: Flow Control Elements Following Initialization

[Figure: same elements as Figure 6-8, with the values following
initialization. Transmitter: CC = 00h, CL = 66h. Receiver (VC0 FC Buffer,
NP Hdr): CrRcv = 00h, CrAl = 66h.
Legend: CC = Credits Consumed, CL = Credit Limit, PTLP = Pending TLP,
CrAl = Credits Allocated, CrRcv = Credits Received.]

In this case, the current CREDITS_CONSUMED count (CC) is added to the
PTLP credits required, to determine the CREDITS_REQUIRED (CR), and that
gives 00h + 01h = 01h. The CREDITS_REQUIRED count is subtracted from the
CREDIT_LIMIT count (CL) to determine whether or not sufficient credits are
available.

The following description incorporates a brief review of 2's complement
subtraction. When performing subtraction using 2's complement, the number to be
subtracted is complemented (1's complement) and 1 is added (2's complement).
This value is then added to the number from which we wish to subtract. Any
carry due to the addition is dropped.


Credit Check:

    CL 01100110b (66h) - CR 00000001b (01h) = n

CR is converted to 2's complement:

    00000001b       (CR)
    11111110b       (CR inverted)
    11111110b + 1
    11111111b       (2's complement)

2's complement added to CL:

    01100110b       (CL)
  + 11111111b       (2's complement of CR)
    01100101b = 65h (carry bit is dropped)

Is result <= 80h? Yes. If the subtraction result is equal to or less than half the max
value, which is tracked with a modulo 256 counter (128), then we know there is
sufficient space in the receiver buffer and this packet can be sent. The decision
to use only half the counter value avoids a potential count alias problem. See
"Stage 3 Counters Roll Over" on page 234.
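The subtraction steps above can be reproduced in a few lines. This sketch mirrors the manual procedure (invert, add one, add to CL, drop the carry) instead of using a built-in subtraction; the function name is made up for the example.

```python
def twos_complement_subtract(a, b, bits=8):
    """Compute (a - b) the way the worked example does, in 'bits' bits."""
    mask = (1 << bits) - 1
    inverted = ~b & mask               # 1's complement of b
    twos = (inverted + 1) & mask       # 2's complement of b
    return (a + twos) & mask           # masking drops the carry bit


result = twos_complement_subtract(0x66, 0x01)
print(f"{result:08b}b = {result:02X}h")   # 01100101b = 65h
```

The result, 65h, is not greater than 80h, so the credit check passes.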

Figure 6-11: Flow Control Elements After First TLP Sent

[Figure: same elements as Figure 6-8, after the first TLP has been sent.
Transmitter (Device A): CC = 01h, CL = 66h. Receiver (Device B, VC0 FC
Buffer, NP Hdr): CrRcv = 01h, CrAl = 66h.]


Stage 2 Flow Control Buffer Fills Up


Assume now that the receiver has been unable to remove transactions from the
Flow Control buffer for some time. Perhaps the device core logic was
temporarily busy and unable to process transactions. Eventually, the Flow Control
buffer becomes completely full, as shown in Figure 6-12 on page 234. If the
transmitter wishes to send another TLP and checks the Flow Control credits:

    Credit Limit (CL) = 66h
    Credits Required (CR) = 67h

    CL   01100110b (66h)
  + CR   10011001b (2's complement of 67h)
         11111111b = FFh <= 80h? (not true; don't send packet)

This channel is blocked until an Update Flow Control DLLP is received with a
new CREDIT_LIMIT value of 67h or greater. When the new value is loaded
into the CL register, the transmitter credit check will pass the test and a TLP can
be sent.

    CL   01100111b (67h)
  + CR   10011001b (2's complement of 67h)
         00000000b = 00h <= 80h? (true; send transaction)
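The blocked-then-released behavior can be simulated with a small transmitter model for the non-posted header buffer. The class and its method names are hypothetical; each TLP is assumed to cost exactly one header credit, as in the example.

```python
class NpHeaderTransmitter:
    """Transmit-side counters for one non-posted header buffer."""

    FIELD_SIZE = 8

    def __init__(self, credit_limit):
        self.cl = credit_limit   # CREDIT_LIMIT from initialization
        self.cc = 0              # CREDITS_CONSUMED

    def try_send(self):
        """Send one header-only TLP if the credit check passes."""
        modulus = 1 << self.FIELD_SIZE
        cr = (self.cc + 1) % modulus                 # CREDITS_REQUIRED
        if (self.cl - cr) % modulus > modulus // 2:
            return False                             # blocked
        self.cc = cr
        return True

    def on_update_fc(self, new_limit):
        """Load the value carried by an UpdateFC DLLP into CL."""
        self.cl = new_limit % (1 << self.FIELD_SIZE)


tx = NpHeaderTransmitter(0x66)
sent = sum(tx.try_send() for _ in range(0x70))   # try more than will fit
print(sent, tx.try_send())                       # 102 False
tx.on_update_fc(0x67)
print(tx.try_send())                             # True
```

Exactly 66h (102) TLPs get through before the channel blocks, and a single additional credit releases exactly one more.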


Figure 6-12: Flow Control Elements with Flow Control Buffer Filled

[Figure: same elements as Figure 6-8, with the receive buffer full.
Transmitter (Device A): CC = 66h, CL = 66h. Receiver (Device B):
CrRcv = 66h, CrAl = 66h.]

Stage 3 Counters Roll Over


Since the Credit Limit (CL) and Credits Required (CR) counts only increment
upward, they eventually roll over back to zero. When CL rolls over and CR has
not, the credit check (CL - CR) subtracts a large CR value from a small CL value.
However, what might appear to be a problem is not one when using unsigned
arithmetic. As described in the previous examples, the results are handled correctly
when performing 2's complement subtraction. Figure 6-13 on page 235 shows
the CL and CR counts before and after CL rollover along with the 2's
complement results.


Figure 6-13: Flow Control Rollover Problem

  Before CL rollover:   CL = F8h, CR = E8h. The available credit is the span
                        from CR up to CL.
  After CL rollover:    CL = 08h, CR = F8h. The available credit is the sum of
                        two parts: from CR up to FFh, and from 00h up to CL.

  Before CL rollover, using 2's complement:

      CL   11111000b (F8h)
    + CR   00011000b (2's complement of E8h)
         = 00010000b (10h)

  After CL rollover, using 2's complement:

      CL   00001000b (08h)
    + CR   00001000b (2's complement of F8h)
         = 00010000b (10h)
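A short experiment shows why rollover is harmless: as long as both counters are kept modulo 256, the difference between them is unchanged when they wrap. The helper name below is illustrative.

```python
def credits_available(cl, cr, field_size=8):
    """Unsigned (CL - CR) modulo 2**field_size."""
    return (cl - cr) % (1 << field_size)


# The two cases from Figure 6-13
print(f"{credits_available(0xF8, 0xE8):02X}h")   # before CL rollover
print(f"{credits_available(0x08, 0xF8):02X}h")   # after CL rollover

# Advance both counters together through a full wrap: the gap never changes.
for step in range(300):
    cl = (0xF8 + step) % 256
    cr = (0xE8 + step) % 256
    assert credits_available(cl, cr) == 0x10
```

Both cases print 10h. The restriction to half the counter range is what keeps this unambiguous: a difference above 80h is read as "CR is ahead of CL" rather than as a very large number of available credits.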

Stage 4 FC Buffer Overflow Error Check


Although it's optional to do so, the specification recommends implementation
of the FC buffer overflow error checking mechanism. Figure 6-14 on page 236
shows the elements associated with the overflow error check that include:

• Credits Received (CR) counter
• Credits Allocated (CA) counter
• Error Check Logic

This permits the receiver to track Flow Control credits in the same manner as
the transmitter. If flow control is working correctly, the transmitter's Credits
Consumed count will never exceed its Credit Limit, and the receiver's Credits
Received count will never exceed its Credits Allocated count.


An overflow condition is detected if the following formula evaluates true. Note
that the field size is either 8 (headers) or 12 (data):

$$(CA - CR) \bmod 2^{\mathit{FieldSize}} > \frac{2^{\mathit{FieldSize}}}{2}$$

If it does evaluate true, then more credits have been sent to the FC buffer
than were available and an overflow has occurred. Note that the 1.0a version of
the specification defines the equation with ≥ rather than > as shown above.
That appears to be an error, because when CA = CR no overflow condition exists.
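As a minimal sketch of the receiver-side check, the formula can be written directly in Python. The function name and the sample counter values are illustrative assumptions, not taken from the specification.

```python
def fc_buffer_overflow(credits_allocated: int, credits_received: int,
                       field_size: int) -> bool:
    """Evaluate (CA - CR) mod 2^FieldSize > 2^FieldSize / 2."""
    modulus = 1 << field_size          # 2^8 for headers, 2^12 for data
    return (credits_allocated - credits_received) % modulus > modulus // 2


# Receiver has allocated 66h header credits but has received 67h: overflow.
print(fc_buffer_overflow(0x66, 0x67, field_size=8))
# Receiver has allocated 69h and received 66h: three credits still unused.
print(fc_buffer_overflow(0x69, 0x66, field_size=8))
```

Because the subtraction is modular, the check keeps working after either counter rolls over.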

Figure 6-14: Buffer Overflow Error Check

[Figure: Device A (transmitter) and Device B (receiver) flow control elements
for VC0, with the receiver's Credits Received (CrRcv) and Credits Allocated
(CrAl) counters feeding the optional error check logic.]


Flow Control Updates


The receiver must regularly update its neighboring device with Flow Control
credits that become available when transactions are removed from the buffer.
Figure 6-15 on page 238 illustrates an example where the transmitter was
previously blocked from sending header transactions because the buffer was
full. In the illustration, the receiver has just removed three headers from the
Flow Control buffer. More space is now available, but the neighboring device is
unaware of this. As headers are removed from the buffer, the CREDITS_ALLOCATED
count increments from 66h to 69h. This new count is reported to the
CREDIT_LIMIT register of the neighboring device using a Flow Control update
packet. Once the credit limit has been updated, transmission of additional TLPs
can proceed.

An interesting note here is that the update reports the actual value of the
Credits Allocated register. It would have worked to report just the change in
the register, as perhaps "+3 credits on NP Headers" for example, but that
represents a potential problem. To understand the risk, consider what would
happen if the DLLP containing that increment information was lost for some
reason. There is no replay mechanism for DLLPs; if an error occurs the packet
is simply dropped. In this case, the increment information would be lost
without a means of recovering it.

If, on the other hand, the actual value of the register is reported instead and
the DLLP fails, the next DLLP that succeeds will get the counters back in
synchronization. In that case some time might be wasted if the transmitter is
waiting on the FC credits before it can send the next TLP, but no information
is lost.
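The difference between the two reporting styles can be illustrated with a small simulation. This is a hypothetical model written for this discussion: the function, its parameters, and the sequence of counter values are all assumptions, and only the absolute-value style corresponds to what PCIe actually does.

```python
def apply_updates(initial_limit, allocated_values, delivered, absolute):
    """Return the transmitter's CREDIT_LIMIT after a series of update DLLPs.

    allocated_values: receiver's CREDITS_ALLOCATED count at each update
    delivered:        True if that DLLP arrived, False if it was dropped
    absolute:         True to report the count itself, False to report deltas
    """
    limit = initial_limit
    previous = initial_limit
    for value, arrived in zip(allocated_values, delivered):
        delta = (value - previous) % 256   # 8-bit header credit field
        previous = value
        if not arrived:
            continue                        # DLLPs are never replayed
        limit = value if absolute else (limit + delta) % 256
    return limit


# Receiver frees 3 headers (66h -> 69h), then 2 more (69h -> 6Bh).
# The first update DLLP is lost, the second one arrives.
counts, arrivals = [0x69, 0x6B], [False, True]
print(hex(apply_updates(0x66, counts, arrivals, absolute=True)))
print(hex(apply_updates(0x66, counts, arrivals, absolute=False)))
```

With absolute reporting the transmitter ends up back in step with the receiver; with incremental reporting the three credits carried by the lost DLLP are never recovered.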


Figure 6-15: Flow Control Update Example

[Figure: Device B (receiver) reports its updated count to Device A
(transmitter) in an FC Update packet for VC0. Transmitter: CC = 66h, CL = 69h.
Receiver: CrRcv = 66h, CrAl = 69h.]

FC_Update DLLP Format and Content


Recall that Flow Control update packets, like the Flow Control initialization
packets, contain two credit fields, one for header and one for data, as shown
in Figure 6-16 on page 239. The receiver's credit values reported in the HdrFC
and DataFC fields may have been updated many times or not at all since the last
update packet was sent.


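Figure 6-16: Update Flow Control Packet Format and Contents

[Figure: FC_Update DLLP layout. The first bytes carry the DLLP type (Update
Posted, Update Non-Posted, or Update Completion), the VC ID [2:0], the HdrFC
field, and the DataFC field, with reserved (R) bits between them; the final
bytes carry the 16-bit CRC. HdrFC holds the CREDITS_ALLOCATED count from the
Header Flow Control Logic, and DataFC holds the CREDITS_ALLOCATED count from
the Data Flow Control Logic.]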

Flow Control Update Frequency


The specification defines a variety of rules and suggested implementations that
govern when and how often Flow Control Update DLLPs should be sent. These are
motivated by a desire to:

• Notify the transmitting device as early as possible about new credits
  allocated, especially if any transactions were previously blocked.
• Establish worst-case latency between FC Packets.
• Balance the requirements associated with flow control operation, such as:
  - the need to report credits often enough to prevent transaction blocking
  - the desire to reduce the Link bandwidth needed for FC_Update DLLPs
  - selecting the optimum buffer size
  - selecting the maximum data payload size
• Detect violations of the maximum latency between Flow Control packets.

Flow Control updates are permitted only when the Link is in the active state
(L0 or L0s). All other Link states represent more aggressive power management
and have longer recovery latencies.

Immediate Notification of Credits Allocated


When a Flow Control buffer is so full that maximum-sized packets cannot be
sent, the spec requires immediate delivery of an FC_Update DLLP when more space
becomes available. Two cases exist:

• Maximum Packet Size = 1 Credit. When packet transmission is blocked due to a
  buffer-full condition for non-infinite NPH, NPD, PH, and CPLH buffer types,
  an UpdateFC packet must be scheduled for transmission when one or more
  credits are made available (allocated) for that buffer type.
• Maximum Packet Size = Max_Payload_Size. Flow Control buffer space may
  decrease to the extent that a maximum-sized packet cannot be sent for
  non-infinite PD and CPLD credit types. In this case, when one or more
  additional credits are allocated, an Update FCP must be scheduled for
  transmission.

Maximum Latency Between Update Flow Control DLLPs


Update FCPs for each (non-infinite) FC credit type must be scheduled for
transmission at least once every 30 μs (-0%/+50%). If the Extended Sync bit
within the Link Control register is set, updates must be scheduled no later
than every 120 μs (-0%/+50%). Note that Update FCPs may be scheduled for
transmission more frequently than is required.

Calculating Update Frequency Based on Payload Size and Link Width


The specification offers a formula for calculating the frequency at which
update packets need to be sent for maximum data payload sizes and Link widths.
The formula, shown below, defines FC Update delivery intervals in symbol times.
For reference, a symbol time is defined as the time it takes to deliver one
symbol: 4 ns for Gen1, 2 ns for Gen2, 1 ns for Gen3. Table 6-3, Table 6-4 and
Table 6-5 show the unadjusted FC Update values for each speed.


$$\frac{(\mathit{MaxPayloadSize} + \mathit{TLPOverhead}) \times \mathit{UpdateFactor}}{\mathit{LinkWidth}} + \mathit{InternalDelay}$$

• MaxPayloadSize = the value in the Max_Payload_Size field of the Device
  Control register
• TLPOverhead = the constant value (28 symbols) representing the additional
  TLP components that consume Link bandwidth (TLP Prefix, Sequence Number,
  Packet Header, LCRC, Framing Symbols)
• UpdateFactor = the number of maximum-size TLPs sent during the interval
  between UpdateFC Packets received. This number is intended to balance Link
  bandwidth efficiency and receive buffer sizes; the value varies with
  Max_Payload_Size and Link width
• LinkWidth = the number of Lanes the Link is using
• InternalDelay = a constant value of 19 symbol times that represents the
  internal processing delays for received TLPs and transmitted DLLPs (the
  Gen2 and Gen3 values in Table 6-4 and Table 6-5 correspond to delays of 70
  and 115 symbol times, respectively)
The relationship defined by the formula shows that the interval between update
packets decreases (that is, updates are delivered more frequently) as the Link
width increases, and suggests a timer that triggers scheduling of update
packets. Note that this formula does not account for delays associated with the
receiver or transmitter being in the L0s power management state.
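The formula is easy to evaluate in code. The sketch below is illustrative only: the function and constant names are invented here, the result is truncated to a whole number of symbol times because that is what reproduces the values in the tables of this section, and the Gen2 and Gen3 InternalDelay values (70 and 115) are inferred from those tables rather than stated in the text above.

```python
import math

TLP_OVERHEAD = 28                          # symbols
INTERNAL_DELAY = {1: 19, 2: 70, 3: 115}    # symbol times per generation (Gen2/Gen3 inferred)
SYMBOL_TIME_NS = {1: 4, 2: 2, 3: 1}        # ns per symbol


def fc_update_interval(max_payload: int, link_width: int,
                       update_factor: float, gen: int = 1) -> int:
    """Unadjusted FC Update interval in symbol times."""
    symbols = (max_payload + TLP_OVERHEAD) * update_factor / link_width
    return math.floor(symbols + INTERNAL_DELAY[gen])


# Gen1, x1 Link, 128-byte payload, UF = 1.4
interval = fc_update_interval(128, 1, 1.4, gen=1)
print(interval, "symbol times =", interval * SYMBOL_TIME_NS[1], "ns")
```

Widening the Link while keeping the other inputs fixed shortens the interval, which matches the trend across each row of the tables.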

Table 6-3: Gen1 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)

Max Payload   x1 Link   x2 Link   x4 Link   x8 Link   x12 Link  x16 Link  x32 Link

128 Bytes     237       128       73        67        58        48        33
              UF=1.4    UF=1.4    UF=1.4    UF=2.5    UF=3.0    UF=3.0    UF=3.0

256 Bytes     416       217       118       107       90        72        45
              UF=1.4    UF=1.4    UF=1.4    UF=2.5    UF=3.0    UF=3.0    UF=3.0

512 Bytes     559       289       154       86        109       86        52
              UF=1.0    UF=1.0    UF=1.0    UF=1.0    UF=2.0    UF=2.0    UF=2.0

1024 Bytes    1071      545       282       150       194       150       84
              UF=1.0    UF=1.0    UF=1.0    UF=1.0    UF=2.0    UF=2.0    UF=2.0

2048 Bytes    2095      1057      538       278       365       278       148
              UF=1.0    UF=1.0    UF=1.0    UF=1.0    UF=2.0    UF=2.0    UF=2.0

4096 Bytes    4143      2081      1050      534       706       534       276
              UF=1.0    UF=1.0    UF=1.0    UF=1.0    UF=2.0    UF=2.0    UF=2.0

Table 6-4: Gen2 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)

Max Payload   x1 Link   x2 Link   x4 Link   x8 Link   x12 Link  x16 Link  x32 Link

128 Bytes     288       179       124       118       109       99        84
              UF=1.4    UF=1.4    UF=1.4    UF=2.5    UF=3.0    UF=3.0    UF=3.0

256 Bytes     467       268       169       158       141       123       96
              UF=1.4    UF=1.4    UF=1.4    UF=2.5    UF=3.0    UF=3.0    UF=3.0

512 Bytes     610       340       205       137       160       137       103
              UF=1.0    UF=1.0    UF=1.0    UF=1.0    UF=2.0    UF=2.0    UF=2.0

1024 Bytes    1122      596       333       201       245       201       135
              UF=1.0    UF=1.0    UF=1.0    UF=1.0    UF=2.0    UF=2.0    UF=2.0

2048 Bytes    2146      1108      589       329       416       329       199
              UF=1.0    UF=1.0    UF=1.0    UF=1.0    UF=2.0    UF=2.0    UF=2.0

4096 Bytes    4194      2132      1101      585       757       585       327
              UF=1.0    UF=1.0    UF=1.0    UF=1.0    UF=2.0    UF=2.0    UF=2.0

Table 6-5: Gen3 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)

Max Payload   x1 Link   x2 Link   x4 Link   x8 Link   x12 Link  x16 Link  x32 Link

128 Bytes     333       224       169       163       154       144       129
              UF=1.4    UF=1.4    UF=1.4    UF=2.5    UF=3.0    UF=3.0    UF=3.0

256 Bytes     512       313       214       203       186       168       141
              UF=1.4    UF=1.4    UF=1.4    UF=2.5    UF=3.0    UF=3.0    UF=3.0

512 Bytes     655       385       250       182       205       182       148
              UF=1.0    UF=1.0    UF=1.0    UF=1.0    UF=2.0    UF=2.0    UF=2.0

1024 Bytes    1167      641       378       246       290       246       180
              UF=1.0    UF=1.0    UF=1.0    UF=1.0    UF=2.0    UF=2.0    UF=2.0

2048 Bytes    2191      1153      634       374       461       374       244
              UF=1.0    UF=1.0    UF=1.0    UF=1.0    UF=2.0    UF=2.0    UF=2.0

4096 Bytes    4239      2177      1146      630       802       630       372
              UF=1.0    UF=1.0    UF=1.0    UF=1.0    UF=2.0    UF=2.0    UF=2.0

The specification recognizes that the formula will be inadequate for many
applications, such as those that stream large blocks of data. These
applications may require buffer sizes larger than the minimum specified, as
well as a more sophisticated update policy in order to optimize performance and
reduce power consumption. Because a given solution is dependent on the
particular requirements of an application, no definition for such policies is
provided.

Error Detection Timer: A Pseudo Requirement


The specification defines an optional timeout mechanism for Flow Control
packets that is highly recommended and may become a requirement in future
versions of the specification. The maximum latency between FC packets for a
given credit type is 120 μs, and this timeout has a maximum limit of 200 μs. A
separate timer is implemented for each FC credit type (P, NP, Cpl), and each
timer is reset when an FC Update DLLP of the corresponding type is received.
Note that a timer associated with infinite FC credit values must not report an
error.

Apart from the infinite case, a timeout implies a serious problem with the
Link. If it occurs, the Physical Layer is signaled to go into the Recovery
state and retrain the Link in hopes of clearing the error condition. Timer
characteristics include:

• Operates only when the Link is in an active state (L0 or L0s).
• Max time limited to 200 μs (-0%/+50%).
• Timer is reset when any Init or Update FCP is received, or optionally by
  receipt of any DLLP.
• Timeout forces the Physical Layer to enter Link Training and Status State
  Machine (LTSSM) Recovery state.
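The timer behavior listed above can be modeled in a few lines. This is a simplified software sketch with invented names; real hardware implements one such timer per credit type, and the model ignores the +50% tolerance by expiring at exactly 200 μs.

```python
class FcWatchdog:
    """Toy model of the FC error detection timer for one credit type."""

    LIMIT_US = 200.0

    def __init__(self, infinite_credits: bool = False):
        self.infinite_credits = infinite_credits
        self.elapsed_us = 0.0

    def on_fc_dllp(self) -> None:
        """An Init or Update FCP for this credit type was received."""
        self.elapsed_us = 0.0

    def tick(self, delta_us: float, link_active: bool = True) -> bool:
        """Advance time; return True if the Link should enter Recovery."""
        if self.infinite_credits or not link_active:
            return False               # timer does not run or never reports
        self.elapsed_us += delta_us
        return self.elapsed_us >= self.LIMIT_US


timer = FcWatchdog()
print(timer.tick(120))     # update arrives in time below
timer.on_fc_dllp()
print(timer.tick(150))
print(timer.tick(60))      # 210 us without an update
```

The last call reports a timeout because 210 μs have passed since the most recent FC packet.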

7 Quality of Service
The Previous Chapter
The previous chapter discusses the purposes and detailed operation of the Flow
Control Protocol. Flow control is designed to ensure that transmitters never
send Transaction Layer Packets (TLPs) that a receiver can't accept. This
prevents receive buffer overruns and eliminates the need for PCI-style
inefficiencies like disconnects, retries, and wait-states.

This Chapter
This chapter discusses the mechanisms that support Quality of Service and
describes the means of controlling the timing and bandwidth of different
packets traversing the fabric. These mechanisms include application-specific
software that assigns a priority value to every packet, and optional hardware
that must be built into each device to enable managing transaction priority.

The Next Chapter


The next chapter discusses the ordering requirements for transactions in a PCI
Express topology. These rules are inherited from PCI. The Producer/Consumer
programming model motivated many of them, so its mechanism is described here.
The original rules also took into consideration possible deadlock conditions
that must be avoided.

Motivation
Many computer systems today don't include mechanisms to manage bandwidth for
peripheral traffic, but there are some applications that need it. One example
is streaming video across a general-purpose data bus, which requires that data
be delivered at the right time. In embedded guidance control systems, timely
delivery of video data is also critical to system operation. Foreseeing those
needs, the original PCIe spec included Quality of Service (QoS) mechanisms that
can give preference to some traffic flows. The broader term for this is
Differentiated Service, since packets are treated differently based on an
assigned priority, and it allows for a wide range of service preferences. At
the high end of that range, QoS can provide predictable and guaranteed
performance for applications that need it. That level of support is called
"isochronous" service, a term derived from the two Greek words "isos" (equal)
and "chronos" (time) that together mean something that occurs at equal time
intervals. To make that work in PCIe requires both hardware and software
elements.

Basic Elements
Supporting high levels of service places requirements on system performance.
For example, the transmission rate must be high enough to deliver sufficient
data within a time frame that meets the demands of the application while
accommodating competition from other traffic flows. In addition, the latency
must be low enough to ensure timely arrival of packets and avoid delay
problems. Finally, error handling must be managed so that it doesn't interfere
with timely packet delivery. Achieving these goals requires some specific
hardware elements, one of which is a set of configuration registers called the
Virtual Channel Capability Block, as shown in Figure 7-1.

Figure 7-1: Virtual Channel Capability Registers

[Figure: configuration space map. The PCI Compatible Space (0d to 255d) holds
the Header (0d to 63d, with the CapPtr) and the PCIe Capability Block. The PCIe
Extended Capability Space (up to 4095d) holds the Virtual Channel Capability
Structure: PCIe Enhanced Capability Register; Port VC Cap Register 1 (with Ext
VC Cnt); VAT Offset and Port VC Cap Register 2; Port VC Status Reg and Port VC
Control Reg; then, for each VC resource 0 through n, PAT Offset and VC Resource
Cap, VC Resource Control Reg, and VC Resource Status; followed by the VC
Arbitration Table (VAT) and Port Arbitration Tables 0 through n (PAT0 to
PATn).]


Traffic Class (TC)


The first thing we need is a way to differentiate traffic; something to
distinguish which packets have high priority. This is accomplished by
designating Traffic Classes (TCs) that define eight priorities specified by a
3-bit TC field within each TLP header (with ascending priority; TC 0-7). The
32-bit memory request header in Figure 7-2 reveals the location of the TC
field. During initialization, the device driver communicates the level of
services to the isochronous management software, which returns the appropriate
TC values to use for each type of packet. The driver then assigns the correct
TC priority for the packet. The TC value defaults to zero so packets that don't
need priority service won't accidentally interfere with those that do.

Figure 7-2: Traffic Class Field in TLP Header

[Figure: 32-bit memory request header, fields listed from the most significant
bit of byte +0 to the least significant bit of byte +3.]

Byte 0:  Fmt | Type | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:  Requester ID | Tag | Last DW BE | 1st DW BE
Byte 8:  Address [31:2] | R

Configuration software that's unaware of PCIe won't recognize the new registers
and will use the default TC0/VC0 combination for all transactions. In addition,
there are some packets that are always required to use TC0/VC0, including
Configuration, I/O, and Message transactions. If these packets are thought of
as maintenance-level traffic, then it makes sense that they would need to be
confined to VC0 and kept out of the path of high-priority packets.

Virtual Channels (VCs)


VCs are hardware buffers that act as queues for outgoing packets. Each port
must include the default VC0, but may have as many as eight (from VC0 to VC7).
Each channel represents a different path available for outgoing packets.

The motivation for multiple paths is analogous to that of a toll road in which
drivers purchase a radio tag that lets them take one of several high-priority
lanes at the toll booth. Those who don't purchase a tag can still use the road,
but they'll have to stop at the booth and pay cash each time they go through,
and that takes longer. If there were only one path, everyone's access time
would be limited by the slowest driver, but having multiple paths available
means that those who have priority are not delayed by those who don't.

Assigning TCs to each VC: TC/VC Mapping


The Traffic Class value assigned to each packet travels unchanged to the
destination and must be mapped to a VC at each service point as it traverses
the path to the target. VC mapping is specific to a Link and can change from
one Link to another. Configuration software establishes this association during
initialization using the TC/VC Map field of the VC Resource Control Register.
This 8-bit field permits TC values to be mapped to a selected VC, where each
bit position represents the corresponding TC value (bit 0 = TC0, bit 1 = TC1,
etc.). Setting a bit assigns the corresponding TC value to the VC ID. Figure
7-3 on page 249 shows a mapping example where TC0 and TC1 are mapped to VC0 and
TC2:TC4 are mapped to VC3.

Software has a great deal of flexibility in assigning VC IDs and mapping the
TCs, but there are some rules regarding the TC/VC mapping:

• TC/VC mapping must be identical for the two ports attached on either end of
  the same Link.
• TC0 will automatically be mapped to VC0.
• Other TCs may be mapped to any VC.
• A TC may not be mapped to more than one VC.

The number of virtual channels used depends on the greatest capability shared
by the two devices attached to a given link. Software assigns an ID for each VC
and maps one or more TCs to the VCs.
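The TC/VC Map encoding and the mapping rules above can be sketched as follows. The helper names are hypothetical and the code only models the bit field; it does not touch real configuration registers.

```python
def tc_vc_map(traffic_classes):
    """Build the 8-bit TC/VC Map value: bit n set means TCn uses this VC."""
    mask = 0
    for tc in traffic_classes:
        if not 0 <= tc <= 7:
            raise ValueError(f"invalid Traffic Class: {tc}")
        mask |= 1 << tc
    return mask


def validate_mapping(maps):
    """Check a {vc_id: tc_vc_map} dictionary against the mapping rules."""
    seen = 0
    for vc_id, mask in maps.items():
        if mask & seen:
            raise ValueError("a TC may not be mapped to more than one VC")
        seen |= mask
    if not maps.get(0, 0) & 0x01:
        raise ValueError("TC0 must be mapped to VC0")
    return True


# The example mapping: TC0 and TC1 on VC0, TC2 through TC4 on VC3.
mapping = {0: tc_vc_map([0, 1]), 3: tc_vc_map([2, 3, 4])}
print({vc: format(mask, "08b") for vc, mask in mapping.items()})
print(validate_mapping(mapping))
```

The same dictionary would have to be programmed into the ports at both ends of the Link, since the mapping must be identical on each side.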


Figure 7-3: TC to VC Mapping Example

[Figure: Virtual Channel Capability Structure with the VC Resource Control
Registers for VC0 and VC3 expanded. In each register the VC ID field occupies
bits 26:24 and the TC/VC Map field occupies bits 7:0.]

VC0 Resource Control Register:  VC ID = 000b, TC/VC Map = 00000011b
VC3 Resource Control Register:  VC ID = 011b, TC/VC Map = 00011100b

Determining the Number of VCs to be Used


Software checks the number of VCs supported by the devices attached to a common
link and would usually assign the greatest number of VCs that both devices can
support. Consider the example topology in Figure 7-4 on page 250. Here, the
switch supports all 8 VCs on each of its ports, while Device A supports only
the default VC0, Device B supports 4 VCs, and Device C supports 8 VCs. Note
that even though switch port A supports all 8 VCs, Device A only supports VC0,
so 7 VCs are left unused in switch port A. Similarly, only 4 VCs are used by
switch port B.

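Figure 7-4: Multiple VCs Supported by a Device

[Figure: a Root Complex connected to a Switch that supports 8 VCs on each
port. Switch port A connects to Device A (1 VC supported; 1 VC used on the
Link), port B to Device B (4 VCs supported; 4 VCs used), and port C to Device C
(8 VCs supported; 8 VCs used).]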

Configuration software determines the maximum number of VCs supported by each
port interface by reading the Extended VC Count field in the Virtual Channel
Capability registers, as shown in Figure 7-5 on page 251. Software checks the
Extended VC Count at both ends of the Link and selects the highest common
count. Using all the available VCs is not mandatory, though. Software may
choose to enable fewer VCs as well.
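Selecting the highest common count amounts to taking the smaller of the two Extended VC Count values. A minimal sketch, with a function name made up for this illustration and the port capabilities taken from the example topology:

```python
def usable_vc_count(ext_vc_count_a: int, ext_vc_count_b: int) -> int:
    """Largest number of VCs both ports on a Link can support.

    Each argument is an Extended VC Count value (0-7), which counts the
    VCs implemented in addition to VC0.
    """
    return min(ext_vc_count_a, ext_vc_count_b) + 1


SWITCH_PORT = 7                    # 8 VCs on each switch port
for name, ext_count in {"Device A": 0, "Device B": 3, "Device C": 7}.items():
    print(name, usable_vc_count(SWITCH_PORT, ext_count))
```

Software remains free to enable fewer VCs than this upper bound.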


Figure 7-5: Extended VCs Supported Field

[Figure: Virtual Channel Capability Structure with the Extended VC Count field
(bits 2:0) expanded.]

Extended VC Count:
  0   = only VC0 supported
  1-7 = number of additional VCs supported

Assigning VC Numbers (IDs)


Configuration software assigns a number (ID) to each of the VCs, except VC0
which is always hardwired. As shown in Figure 7-3 on page 249, the VC
Capabilities registers include 12 bytes of configuration registers for each VC.
The first set of registers always applies to VC0. The Extended VC Count field
defines the number of additional VCs implemented by this port, each of which
will have a set of registers. The value "n" represents the number of additional
VCs implemented. For example, if the Extended VC Count contains a value of 3,
then there are three VCs and register sets in addition to VC0.

Software assigns a number for each of the additional VCs via the VC ID field.
(See Figure 7-3 on page 249.) The IDs don't have to be contiguous, but each
number can only be used once.
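Because every VC has a 12-byte register set, the location of each set follows a simple pattern. The sketch below uses the offsets shown in Figure 7-8 (10h + n*0Ch for the Resource Capability register, and so on), relative to the start of the VC Capability structure; the function name and dictionary keys are illustrative. Note that n here is the index of the register set, not the VC ID that software later assigns.

```python
def vc_resource_offsets(n: int) -> dict:
    """Offsets of the nth VC register set within the VC Capability structure."""
    base = 0x10 + n * 0x0C
    return {
        "resource_capability": base,        # also holds the PAT Offset
        "resource_control": base + 0x04,    # VC ID and TC/VC Map live here
        "resource_status": base + 0x08,     # status in the upper half, rest reserved
    }


ext_vc_count = 3                            # three VCs in addition to VC0
for n in range(ext_vc_count + 1):
    offsets = vc_resource_offsets(n)
    print(n, {name: hex(off) for name, off in offsets.items()})
```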

VC Arbitration

General
If a device has more than one VC and they all have a packet ready to send, VC
arbitration determines the order of packet transmission. Any of several schemes
can be chosen by software from among the options implemented by hardware. The
goals are to implement the desired service policy and ensure that all
transactions are making forward progress to prevent inadvertent timeouts. In
addition, VC Arbitration is affected by the requirements associated with flow
control and transaction ordering. These topics are discussed in other chapters,
but they affect arbitration, too, because:

• Each supported VC provides its own buffers and flow control.
• Transactions mapped to the same VC are normally passed along in strict order
  (although there are exceptions, such as when a packet has the "Relaxed
  Ordering" attribute bit set).
• Transaction ordering only applies within a VC, so there's no ordering
  relationship among packets assigned to different VCs.

The example in Figure 7-6 on page 253 illustrates two VCs (VC0 and VC1) with a
transmission priority based on a 3:1 ratio, meaning three VC1 packets are sent
for every one VC0 packet. The device core sends requests (including a TC value)
to the TC/VC Mapping logic. Based on the programmed mapping, the packet is
placed into the appropriate VC buffer for transmission. Finally, the VC arbiter
determines the VC priority for forwarding the packets. This example illustrates
the flow in one direction, but the same logic exists for transmitting in the
opposite direction at the same time.

The VC capability registers provide three basic VC arbitration approaches:

1. Strict Priority Arbitration: the highest-numbered VC with a packet ready
   always wins.
2. Group Arbitration: VCs are divided by hardware into one low-priority group
   and one high-priority group. The low-priority group uses an arbitration
   method selected by software from the available choices, while the
   high-priority group always uses strict priority arbitration.
3. Hardware-Fixed: an arbitration scheme built into the hardware.


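Figure 7-6: VC Arbitration Example

[Figure: a Root Complex (with CPU and Memory) linked to a Device. On each side,
the TC/VC Mapping logic places packets into the VC1 and VC0 buffers, and an
Arbiter selects between them. VC arbitration in this example yields a 3 to 1
ratio for transmitting VC1 and VC0.]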

Strict Priority VC Arbitration


The default priority scheme is based on the inherent priority of VC IDs (VC0 =
lowest priority and VC7 = highest priority). The mechanism is automatic and
requires no configuration. Figure 7-7 on page 254 illustrates a strict priority
arbitration example that includes all VCs. The VC ID governs the order in which
transactions are sent. The maximum number of VCs that use strict priority
arbitration cannot be greater than the value in the Extended VC Count field.
(See Figure 7-5 on page 251.) Furthermore, if the designer has chosen strict
priority arbitration for all VCs supported, the Low Priority Extended VC Count
field of Port VC Capability Register 1 is hardwired to zero. (See Figure 7-8 on
page 255.)

Figure 7-7: Strict Priority Arbitration

VC Resources Priority Order

8th VC VC7 Highest

7th VC VC6
6th VC VC5
5th VC VC4
4th VC VC3
3rd VC VC2
2nd VC VC1
1st VC VC0 Lowest

Strict priority requires that higher-numbered VCs always get precedence over
lower-priority VCs. For example, if all eight VCs are governed by strict
priority, then packets in VC0 can only be sent when no other VCs have packets
pending. This achieves the goal of giving the highest-priority packets very
high bandwidth with minimal latencies. However, strict priority has the
potential to starve low-priority channels for bandwidth, so care must be taken
to ensure this doesn't happen. The spec requires that high-priority traffic be
regulated to avoid starvation, and gives two possible methods of regulation:

• The originating port can restrict the injection rate of high-priority packets
  to allow more bandwidth for lower-priority transactions.
• Switches can regulate multiple traffic flows at the egress port. This method
  may limit the throughput from high-bandwidth applications and devices that
  attempt to exceed the limitations of the available bandwidth.

A device designer may also limit the number of VCs that participate in strict
priority by splitting the VCs into a low-priority group and a high-priority
group, as discussed in the next section.
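Strict priority selection is simple enough to express in a few lines. The following is a sketch only; the function name and the queue-depth dictionary are assumptions made for illustration.

```python
def strict_priority_select(pending):
    """Return the highest-numbered VC with a packet ready, or None.

    pending maps a VC ID to the number of packets waiting in that VC.
    """
    ready = [vc for vc, count in pending.items() if count > 0]
    return max(ready) if ready else None


queues = {0: 2, 3: 1, 7: 0}
print(strict_priority_select(queues))    # VC3 wins while it has a packet
queues[3] = 0
print(strict_priority_select(queues))    # VC0 is served only when others are empty
```

The starvation risk is visible here: as long as any higher-numbered queue stays non-empty, VC0 is never selected.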


Group Arbitration
Figure 7-8 illustrates the Low Priority Extended VC Count field within VC
Capability Register 1. This read-only field specifies a VC ID that identifies
the upper limit of the low-priority arbitration group for this device. For
example, if this value is 4, then VC0-VC4 are members of the low-priority group
and VC5-VC7 are in the high-priority group. Note that a Low Priority Extended
VC Count of 7 means that no strict priority is used.
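The split can be expressed as a small helper. This is an illustrative sketch, assuming for simplicity that VC IDs 0 through the Extended VC Count are all in use; the function name is made up.

```python
def split_arbitration_groups(low_priority_ext_vc_count: int,
                             ext_vc_count: int = 7):
    """Return (low_priority_vcs, high_priority_vcs) as lists of VC IDs."""
    low = list(range(0, low_priority_ext_vc_count + 1))
    high = list(range(low_priority_ext_vc_count + 1, ext_vc_count + 1))
    return low, high


print(split_arbitration_groups(4))   # VC0-VC4 low priority, VC5-VC7 strict priority
print(split_arbitration_groups(7))   # every VC is in the low-priority group
```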

Figure 7-8: Low Priority Extended VCs

Virtual Channel Capability Structure (offsets from the start of the structure):

Offset         Register
00h            PCI Express Extended Capability Header
04h            Port VC Capability Register 1
08h            Port VC Capability Register 2
0Ch            Port VC Status Register | Port VC Control Register
10h            PAT Offset | VC0 Resource Capability Register
14h            VC0 Resource Control Register
18h            VC0 Resource Status Reg | Reserved
10h+(n*0Ch)    PAT Offset | VCn Resource Capability Register
14h+(n*0Ch)    VCn Resource Control Register
18h+(n*0Ch)    VCn Resource Status Reg | Reserved
(n = one of the extended VCs)

Port VC Capability Register 1 fields:

Bits     Field
31:12    RsvdP
11:10    Port Arbitration Table Entry Size
9:8      Reference Clock
7        RsvdP
6:4      Low Priority Extended VC Count
3        RsvdP
2:0      Extended VC Count


As depicted in Figure 7-10 on page 257, the high-priority VCs continue to use
strict priority arbitration, while the low-priority arbitration group uses one
of the other arbitration methods supported by the device. VC Capability
Register 2 reports which alternate methods are supported for this group, as
shown in Figure 7-9, and the VC Control Register permits selection of the
method to be used. The low-priority arbitration schemes include:

• Hardware-Based Fixed Arbitration
• Weighted Round Robin Arbitration (WRR)

Figure 7-9: VC Arbitration Capabilities

Port VC Capability Register 2:

Bits     Field
31:24    VC Arbitration Table Offset
23:8     RsvdP
7:0      VC Arbitration Capability

VC Arbitration Capability field:

Bits     Meaning
7:4      RsvdP
3        WRR with 128 Phases (011b)
2        WRR with 64 Phases (010b)
1        WRR with 32 Phases (001b)
0        Hardware Fixed Arbitration Scheme (000b)

Port VC Control Register:

Bits     Field
15:4     RsvdP
3:1      VC Arbitration Select (000b-111b)
0        Load VC Arbitration Table


Figure 7-10: VC Arbitration Priorities

VC Resources   VC IDs   Split Priority

8th VC         VC7      Highest
7th VC         VC6      High-Priority (Strict Priority Scheme)
6th VC         VC5

5th VC         VC4      Low-Priority VC ID = 4
4th VC         VC3      Low-Priority (Alternate Priority Scheme)
3rd VC         VC2      (Selected by Software)
2nd VC         VC1
1st VC         VC0      Lowest

Hardware Fixed Arbitration Scheme


This selection defines a hardware-based method and requires no additional
software setup. This method can be anything the hardware designer chooses to
build in, and could be based on anticipated loading or bandwidth needs for the
device. A simple example might be an ordinary round-robin sequence, in which
each VC gets an equal turn at sending packets during the rotation.

Weighted Round Robin Arbitration Scheme


This is a scheme in which some VCs can be weighted more (given higher priority)
than others by giving them more entries in the sequence than others. The spec
defines three WRR options, each with a different number of entries (called
phases). The table size is selected by writing the corresponding value into the
VC Arbitration Select field of the Port VC Control Register (see Figure 7-9 on
page 256). Each entry in the table represents one phase that software loads
with a low-priority VC number. The VC arbiter will repeatedly scan all table
entries in a sequential fashion and send packets from the VC specified in the
table entries. Once a packet has been sent, the arbiter immediately proceeds to
the next phase. Figure 7-11 on page 258 shows an example of a WRR arbitration
table with 64 entries.

Figure 7-11: WRR VC Arbitration Table

(Arbitration logic scans the table entries in order.)

Phase   VC ID
0       VC 4
1       VC 3
2       VC 2
3       VC 1
4       VC 4
5       VC 3
6       VC 0
7       VC 4
8       VC 3
9       VC 2
10      VC 1
11      VC 4
...
62      VC 3
63      VC 0
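The table scan can be sketched as follows. The function is hypothetical, the table is shortened to four phases for readability (real tables have 32, 64, or 128 entries), and skipping a phase whose VC has nothing to send is an assumption of this model.

```python
def wrr_schedule(table, pending, max_packets):
    """Scan the phase table repeatedly, sending one packet per matching phase.

    table:   list of VC IDs, one per phase
    pending: {vc_id: packets waiting}; a copy is made so the caller's dict is kept
    """
    pending = dict(pending)
    sent, phase, idle_phases = [], 0, 0
    while len(sent) < max_packets and idle_phases < len(table):
        vc = table[phase]
        if pending.get(vc, 0) > 0:
            pending[vc] -= 1
            sent.append(vc)
            idle_phases = 0
        else:
            idle_phases += 1            # nothing ready for this phase, move on
        phase = (phase + 1) % len(table)
    return sent


# Three phases for VC1 and one for VC0 give the 3:1 ratio described earlier.
print(wrr_schedule([1, 1, 1, 0], {0: 2, 1: 6}, max_packets=8))
```

Giving a VC more phases in the table is what gives it more weight; the arbiter itself treats every phase identically.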

Setting up the Virtual Channel Arbitration Table


The location of the VC Arbitration Table (VAT) in configuration space is given
as an offset from the base address of the VC Capability Structure, as shown in
Figure 7-12 on page 259.

As shown in Figure 7-13 on page 260, each entry in the VAT is a 4-bit field
that identifies the VC number of the buffer that is scheduled to deliver data
during that phase. The table length is selected by the arbitration option shown
in Figure 7-9 on page 256.
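Since configuration writes operate on dwords, software has to pack eight 4-bit phase entries into each dword before writing the table. The sketch below follows the layout in Figure 7-13 (Phase[0] in bits 3:0 of the first dword); the function name is an assumption and the code only computes values, it does not perform any configuration accesses.

```python
def pack_vat(phases):
    """Pack a list of VC IDs (one per phase) into 32-bit table dwords."""
    if len(phases) not in (32, 64, 128):
        raise ValueError("WRR tables have 32, 64, or 128 phases")
    dwords = []
    for start in range(0, len(phases), 8):
        value = 0
        for slot, vc_id in enumerate(phases[start:start + 8]):
            if not 0 <= vc_id <= 7:
                raise ValueError(f"invalid VC ID: {vc_id}")
            value |= vc_id << (4 * slot)     # bit 3 of each entry stays zero (RsvdP)
        dwords.append(value)
    return dwords


# First eight phases taken from the WRR example, remaining phases set to VC0.
table = [4, 3, 2, 1, 4, 3, 0, 4] + [0] * 24
print([format(d, "08X") for d in pack_vat(table)])
```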


Figure 7-12: VC Arbitration Table Offset and Load VC Arbitration Table Fields

[Figure: Port VC Capability Register 2 (bits 31:24 VC Arbitration Table Offset,
bits 23:8 RsvdP, bits 7:0 VC Arbitration Capability) shown against the
configuration space map. The PCI Compatible Space (0d to 255d) holds the Header
(0d to 63d, with CapPtr) and the PCIe Cap Structure (CapID = 10h). The PCIe
Extended Capability Space (up to 4095d) holds the VC Capability Structure,
within which the VAT Offset locates the VC Arbitration Table (VAT).]

The table is loaded by configuration software to achieve the desired priority
order for the virtual channels. Hardware sets the VC Arbitration Table Status
bit whenever any changes are made to the table, giving software a way to verify
whether changes have been made but not yet applied to the hardware. Once the
table is loaded, software sets the Load VC Arbitration Table bit in the Port VC
Control register. That causes hardware to load, or apply, the new values to the
VC Arbiter. Hardware clears the VC Arbitration Table Status bit when table
loading is complete, signaling to software that loading has finished. This
method is probably motivated by the desire to change the table contents during
run time without disruption. The problem is that configuration writes are only
able to update a dword at a time and are relatively slow transactions, which
means it could take a long time to finish making changes, during which the
table is only partially updated. That, in turn, could result in unexpected
behavior by the device as it continues to operate during this time. To avoid
that, this mechanism allows software to complete all the changes to the table
and then apply them all at once to the hardware arbiter.
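The handshake can be modeled with a small class that keeps the software-visible table separate from the copy the arbiter actually uses. This is a toy model with invented names; it represents the status and load bits as plain attributes and methods rather than register fields.

```python
class VcArbiterModel:
    """Toy model of the VC Arbitration Table load handshake."""

    def __init__(self, phases):
        self.table = list(phases)       # configuration-space copy
        self.active = list(phases)      # copy the arbiter is scanning
        self.table_status = False       # VC Arbitration Table Status bit

    def write_entry(self, phase: int, vc_id: int) -> None:
        """Software updates one table entry; hardware sets the status bit."""
        self.table[phase] = vc_id
        self.table_status = True

    def set_load_bit(self) -> None:
        """Software sets Load VC Arbitration Table; hardware applies and clears status."""
        self.active = list(self.table)
        self.table_status = False


arbiter = VcArbiterModel([0] * 32)
arbiter.write_entry(0, 4)
arbiter.write_entry(1, 3)
print(arbiter.table_status, arbiter.active[:2])   # changes pending, arbiter unchanged
arbiter.set_load_bit()
print(arbiter.table_status, arbiter.active[:2])   # changes applied together
```

Until the load bit is set, the arbiter keeps running on the old table, so a partially written table is never used.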

Figure 7-13: Loading the VC Arbitration Table Entries

32-Phase Virtual Channel Arbitration Table (each 4-bit entry: bit 3 = RsvdP,
bits 2:0 = VC ID):

Offset  31:28      27:24      23:20      19:16      15:12      11:8       7:4        3:0
00h     Phase[7]   Phase[6]   Phase[5]   Phase[4]   Phase[3]   Phase[2]   Phase[1]   Phase[0]
04h     Phase[15]  Phase[14]  Phase[13]  Phase[12]  Phase[11]  Phase[10]  Phase[9]   Phase[8]
08h     Phase[23]  Phase[22]  Phase[21]  Phase[20]  Phase[19]  Phase[18]  Phase[17]  Phase[16]
0Ch     Phase[31]  Phase[30]  Phase[29]  Phase[28]  Phase[27]  Phase[26]  Phase[25]  Phase[24]

1. Configuration software loads the VC Arbitration Table.
2. The VC Arbitration Table Status bit is set when any table entry is updated.
3. Software sets the Load VC Arbitration Table bit.
4. Hardware applies table contents to VC Arbiter.
5. Hardware clears the VC Arbitration Table Status bit when the table has been
   loaded into the Arbiter.

Port VC Status Register:   bits 15:1 RsvdZ, bit 0 VC Arbitration Table Status
Port VC Control Register:  bits 15:4 RsvdP, bits 3:1 VC Arbitration Select
                           (000b-111b), bit 0 Load VC Arbitration Table


Port Arbitration

General
Switch ports and root ports will often receive incoming packets that need to be
routed to another port. Since packets arriving from multiple ports can all target
the same VC in the same outgoing port, arbitration is needed to decide which
incoming port's packet gets next access to that VC. Like VC arbitration, port
arbitration has several optional schemes available for selection by configuration
software. The combination of TCs, VCs, and arbitration supports a range of
service levels that fall into two broad categories:

1. Asynchronous: Packets get "best effort" service and may receive no
preference at all. Many devices and applications, like mass storage devices,
have no stringent requirements for bandwidth or latency and don't need special
timing mechanisms. On the other hand, packets generated by more demanding
applications can still be prioritized without much trouble by establishing a
hierarchy of traffic classes for different packets. Differentiated service is
still considered to be asynchronous until the level of service requires
guarantees. Naturally, asynchronous service is always available and doesn't
need any special software or hardware options.

2. Isochronous: When latency and bandwidth guarantees are needed, we move into
the isochronous category. This would be useful when a synchronous connection
would normally be required between two devices. For example, a CD-ROM sourcing
data from a music CD uses a synchronous connection when a headset is plugged
directly into the drive. However, when the audio must be routed across a
general-purpose bus like PCIe to get to external speakers, the connection
cannot be synchronous because other traffic may also need to use the same data
stream. To achieve an equivalent result, isochronous service must guarantee
proper delivery of the time-sensitive audio data without preventing other
traffic from using the Link during the same time. Not surprisingly, specialized
software and hardware are needed to support this.

The concept of port arbitration is pictured in Figure 7-14 on page 262. Note that
port arbitration exists in several places in a system:

• Egress ports of switches
• Root Complex ports when peer-to-peer transactions are supported
• Root Complex egress ports that lead to targets such as main memory


Port arbitration will usually need software configuration for each virtual
channel supported by a switch or root egress port. In the example below, root
port 2 supports peer-to-peer transfers from root ports 1 and 3 and therefore
needs port arbitration. It should be noted, though, that peer-to-peer support
between root ports is optional, so it may be that not every root egress port
would need port arbitration.

The connection to system memory is an interesting path. There will likely be
packets from multiple ingress ports trying to access this port at the same time,
so it needs to support port arbitration. However, it doesn't use a PCIe port, so
it doesn't have the set of PCIe registers to support arbitration that we're
describing here. Instead, the root will need to supply a vendor-specific set of
registers called a Root Complex Register Block (RCRB) to provide the same
functionality.

Because port arbitration is managed independently for each VC of the egress
port, a separate table is required for each VC that supports programmable port
arbitration, as shown in Figure 7-15 on page 263. Port arbitration tables are
supported only by switches and root ports and are not allowed in endpoints.

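Figure 7-14: Port Arbitration Concept

[Figure: a CPU and Memory attached to a Root Complex with root ports 1, 2,
and 3. Port arbitration toward Memory is configured via the RCRB. Below the
Root Complex, a Switch whose port arbitration is configured via its PPB
registers connects over VC0 to Endpoints A, B, C, and D.]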


Figure 7-15: Port Arbitration Tables for Each VC

[Figure: register layout of the VC capability structure, top to bottom]

Extended Capability Header
Port VC Capability 1 (contains Ext. VC Count)
Port VC Capability 2 (contains VAT Offset)
Port VC Status | Port VC Control
VC Resource Cap (0) (contains PAT0 Offset)
VC Resource Control (0)
VC Resource Status (0) | RsvdP
...
VC Resource Cap (n) (contains PATn Offset)
VC Resource Control (n)
VC Resource Status (n) | RsvdP
VC Arbitration Table (VAT)
Port Arbitration Table 0 (PAT0)
...
Port Arbitration Table n (PATn)

Although it isn't stated in the spec, the process of arbitrating between different
packet streams also implies the use of additional buffers to accumulate traffic
from each port in the egress port, as illustrated in Figure 7-16 on page 264. This
example illustrates two ingress ports (1 and 2) whose transactions are routed to
an egress port (3). The actions taken by the switch include the following:

1. Packets arriving at the ingress ports are directed to the appropriate flow
   control buffers (VC) based on the TC/VC mapping.
2. Packets are forwarded from the flow control buffers to the routing logic,
   which determines and routes them to the proper egress port.
3. Packets routed to the egress port (3) use TC/VC mapping to determine into
   which VC buffer they should be placed.
4. A set of buffers is associated with each of the ingress ports, allowing the
   ingress port number to be tracked until port arbitration can be done.
5. Port arbitration logic determines the order in which transactions are sent
   from each group of ingress buffers.


Figure 7-16: Port Arbitration Buffering

[Figure: ingress ports 1 and 2 place arriving packets into VC buffers through
their TC/VC mapping and hand them to routing logic. At egress port 3, TC/VC
mapping sorts the routed packets into buffers labeled by port number for
each VC (VC0 and VC7 are shown). A Port Arbiter per VC selects among those
buffers, and the VC Arbiter then selects between the VCs.]

Port Arbitration Mechanisms


The actual port arbitration mechanisms defined are similar to the models used
for VC arbitration. Configuration software determines the capability for a port
by reading the registers shown in Figure 7-17 on page 265 and selects the port
arbitration scheme to use for each VC.



Figure 7-17: Software Selects Port Arbitration Scheme

VCn Resource Capability Register
  Bits 31:24  Port Arbitration Table Offset
  Bit  23     RsvdP
  Bits 22:16  Maximum Time Slots
  Bit  15     Reject Snoop Transactions
  Bit  14     Undefined
  Bits 13:8   RsvdP
  Bits 7:0    Port Arbitration Capability

Port Arbitration Capability (one bit per supported scheme; the value in
parentheses is the matching Port Arbitration Select encoding)
  Bit  0      Hardware Fixed Arbitration Scheme (000b)
  Bit  1      WRR with 32 Phases (001b)
  Bit  2      WRR with 64 Phases (010b)
  Bit  3      WRR with 128 Phases (011b)
  Bit  4      Time-based WRR with 128 Phases (100b)
  Bit  5      WRR with 256 Phases (101b)
  Bits 7:6    Rsvd

VCn Resource Control Register
  Bit  31     VC Enable
  Bits 30:27  RsvdP
  Bits 26:24  VC ID
  Bits 23:20  RsvdP
  Bits 19:17  Port Arbitration Select
  Bit  16     Load Port Arbitration Table
  Bits 15:8   RsvdP
  Bits 7:0    TC/VC Map

Hardware-Fixed Arbitration
This mechanism doesn't require software setup. Once selected, it's managed
solely by hardware. The actual arbitration scheme is chosen by the hardware
designer, possibly based on the expected demands for the device. This
may simply ensure fairness or it may optimize some aspect of the design,
but it doesn't support differentiated or isochronous services.

Weighted Round Robin Arbitration


Just like the weighted round robin mechanism in VC arbitration, software
can set up the port arbitration table so that some ports receive more
opportunities than others. This approach assigns different weights to traffic
coming from different ports.

As the table is scanned, each phase specifies the port number from which
the next packet is received. Once the packet is delivered, the arbitration
logic immediately proceeds to the next phase. If no transaction is pending
transmission for the selected port, the arbiter advances immediately to the
next phase. There is no time value associated with these entries.

Four table lengths (32, 64, 128, or 256 phases) are given for WRR port
arbitration. Presumably, a larger number of entries in the table allows for
more interesting ratios of arbitration selection. On the other hand, a smaller
number of entries would use less storage and cost less.
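
The table scan described above can be sketched as a small simulation. This is
an illustration only, assuming each phase holds a port number and each ingress
port has a simple queue; wrr_arbitrate and its parameters are hypothetical
names, not anything defined by the spec.

```python
from collections import deque


def wrr_arbitrate(table, queues, max_packets):
    """Scan the phase table, taking one packet per phase when one is pending.

    table:  list of port numbers, one per phase
    queues: dict mapping port number -> deque of pending packets
    """
    sent = []
    phase = 0
    idle_phases = 0
    # Stop once enough packets are sent or a full pass finds nothing pending.
    while len(sent) < max_packets and idle_phases < len(table):
        port = table[phase]
        queue = queues.get(port)
        if queue:
            sent.append((port, queue.popleft()))
            idle_phases = 0
        else:
            idle_phases += 1  # nothing pending: move on with no delay
        phase = (phase + 1) % len(table)
    return sent
```

With a four-phase table of [0, 0, 0, 1], port 0 gets three opportunities for
every one given to port 1, which is the weighting effect the text describes.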

Time-Based, Weighted Round Robin Arbitration (TBWRR)


This mechanism is required for isochronous support. As the name implies,
time-based weighted round robin adds the element of time to each arbitration
phase. Just as in WRR, the port arbiter delivers one transaction from the
ingress port VC buffer indicated by the Port Number of the current phase.
Now though, rather than immediately advancing to the next phase, the
time-based arbiter waits until the current virtual timeslot elapses before
advancing. This ensures that transactions are accepted from the ingress port
buffer at regular intervals. If the selected port does not have a packet ready
to send then nothing will be sent until the next timeslot. Note that the
timeslot does not govern the duration of the transfer, but rather the interval
between transfers. The maximum duration of a transaction is the time it
takes to complete the round robin and return to the original timeslot. The
length of the timeslot may change in the future, but currently has the value
of 100ns.

Time-based WRR arbitration supports a maximum table length of 128
phases, but the actual number of table entries available for a given VC may
be less than that. The value is hardware-initialized and reported in the
Maximum Time Slots field of each virtual channel that supports TBWRR, as
shown in Figure 7-18 on page 267.
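
The difference from plain WRR can be seen in a short sketch. It assumes one
table phase per 100ns time slot, as stated above, and ignores how long each
transfer takes; tbwrr_schedule is a hypothetical name used only for this
illustration.

```python
from collections import deque

SLOT_NS = 100  # current time slot length given in the text


def tbwrr_schedule(table, queues, num_slots):
    """Return (start_time_ns, port, packet) for each slot that sends a packet.

    Unlike plain WRR, an empty slot is not skipped: the arbiter waits for
    the slot to elapse before moving to the next phase.
    """
    events = []
    for slot in range(num_slots):
        port = table[slot % len(table)]
        queue = queues.get(port)
        if queue:
            events.append((slot * SLOT_NS, port, queue.popleft()))
        # otherwise the slot passes unused and nothing is sent
    return events
```

In a two-phase table of [0, 1] where port 1 has nothing queued, port 0's second
packet still waits until 200ns, so packets from a given port are accepted at
regular intervals rather than back to back.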


Figure 7-18: Maximum Time Slots Register

VC Resource Capability Register
  Bits 31:24  Port Arbitration Table Offset
  Bit  23     RsvdP
  Bits 22:16  Maximum Time Slots
  Bit  15     Reject Snoop Transactions
  Bit  14     Undefined
  Bits 13:8   RsvdP
  Bits 7:0    Port Arbitration Capability

Loading the Port Arbitration Tables


The actual size and format of the Port Arbitration Tables are a function of the
number of phases and the number of ingress ports supported by the Switch,
RCRB, or Root Port that supports peer-to-peer transfers. The maximum number
of ingress ports supported by the Port Arbitration Table is 256 ports. The actual
number of bits within each table entry is design dependent and governed by the
number of ingress ports whose transactions can be delivered to the egress port.
The size of each table entry is reported in the 2-bit Port Arbitration Table Entry
Size field of Port VC Capability Register 1. The permissible values are:

• 00b: 1 bit (selects between 2 ports)
• 01b: 2 bits (4 ports)
• 10b: 4 bits (16 ports)
• 11b: 8 bits (256 ports)

Configuration software loads each table with port numbers to accomplish the
desired port priority for each VC supported. As illustrated in Figure 7-19 on
page 268, the table format depends on the size of each entry and the number of
phases supported by this design.
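
As a rough illustration of how entry size and phase count determine the table
format, the sketch below decodes the entry size and packs port numbers into
dwords. The function names are hypothetical. Reading the field from bits 11:10
of Port VC Capability Register 1 follows the register layout in Figure 7-19,
and placing the lowest-numbered phase in the least significant bits follows
the 4-bit layout shown there; applying the same rule to the other entry sizes
is an assumption of this sketch.

```python
ENTRY_BITS = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}


def pat_entry_bits(port_vc_cap1):
    """Decode the Port Arbitration Table Entry Size field (bits 11:10)."""
    return ENTRY_BITS[(port_vc_cap1 >> 10) & 0x3]


def pack_pat(port_numbers, entry_bits):
    """Pack one port number per phase into 32-bit dwords, low phase first."""
    per_dword = 32 // entry_bits
    dwords = [0] * (-(-len(port_numbers) // per_dword))  # ceiling division
    for phase, port in enumerate(port_numbers):
        if not 0 <= port < (1 << entry_bits):
            raise ValueError("port number does not fit in one table entry")
        dwords[phase // per_dword] |= port << (entry_bits * (phase % per_dword))
    return dwords
```

For example, a 32-phase table with 4-bit entries occupies four dwords, while a
128-phase table with 8-bit entries needs 32 dwords, which is why smaller
entries and shorter tables cost less storage.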


Figure 7-19: Format of Port Arbitration Tables

32-Phase Port Arbitration Table with 4-bit entries

Offset  31:28      27:24      23:20      19:16      15:12      11:8       7:4        3:0
00h     Phase[7]   Phase[6]   Phase[5]   Phase[4]   Phase[3]   Phase[2]   Phase[1]   Phase[0]
04h     Phase[15]  Phase[14]  Phase[13]  Phase[12]  Phase[11]  Phase[10]  Phase[9]   Phase[8]
08h     Phase[23]  Phase[22]  Phase[21]  Phase[20]  Phase[19]  Phase[18]  Phase[17]  Phase[16]
0Ch     Phase[31]  Phase[30]  Phase[29]  Phase[28]  Phase[27]  Phase[26]  Phase[25]  Phase[24]

1. Configuration software loads the Port Arbitration Table.
2. Changes to the table automatically set the Port Arbitration Table Status bit.
3. Software sets the Load Port Arbitration Table bit to apply the table
   contents to the hardware.
4. Hardware loads table contents into the Port Arbiter, then automatically
   clears the Port Arbitration Table Status bit when the table has been loaded.

Port Arbitration Table Entry Size
  00b  PAT entry is 1 bit (2 ports)
  01b  PAT entry is 2 bits (4 ports)
  10b  PAT entry is 4 bits (16 ports)
  11b  PAT entry is 8 bits (256 ports)

VC Resource Status Register
  Bits 15:2   RsvdP
  Bit  1      VC Negotiation Pending
  Bit  0      Port Arbitration Table Status

Port VC Capability Register 1
  Bits 31:12  RsvdP
  Bits 11:10  Port Arbitration Table Entry Size
  Bits 9:8    Reference Clock
  Bit  7      RsvdP
  Bits 6:4    Low Priority Extended VC Count
  Bit  3      RsvdP
  Bits 2:0    Extended VC Count

VC Resource Control Register
  Bit  31     VC Enable
  Bits 30:27  RsvdP
  Bits 26:24  VC ID
  Bits 23:20  RsvdP
  Bits 19:17  Port Arbitration Select
  Bit  16     Load Port Arbitration Table
  Bits 15:8   RsvdP
  Bits 7:0    TC/VC Map


Switch Arbitration Example


Let's consider an example of a three-port switch to illustrate both Port and VC
arbitration. The example presumes that packets arriving on ingress ports 0 and
1 are moving in the upstream direction and port 2 is the egress port facing
upstream (toward the Root Complex). Refer to Figure 7-20 on page 270 during
the following discussion.

1. Packets arriving at ingress port 0 are placed in a receiver VC based on the
   TC/VC mapping for port 0. As shown, TLPs with traffic class TC0 or TC1
   are sent to the VC0 buffers. TLPs carrying traffic class TC3 or TC5 are sent
   to the VC1 buffers. No other TCs are permitted on this link. As an aside, if a
   packet does arrive with a TC that has not been mapped to an existing VC, it
   will be treated as an error.
2. Packets arriving at ingress port 1 are placed in a VC based on TC/VC
   mapping, too, but it's not the same for this port. As indicated, TLPs carrying
   traffic class TC0 are sent to VC0, while TLPs carrying traffic class TC2-TC4
   are sent to VC3. No other TCs are permitted on this link.
3. In both ports, the target egress port is determined from routing information
   in each packet. For example, address routing is used in memory or IO
   request TLPs.
4. All packets destined for egress port 2 are submitted to the TC/VC mapping
   logic for that port. As shown, TLPs carrying traffic class TC0-TC2 are placed
   into buffers for VC0 that are labeled with their ingress port number, while
   TLPs carrying traffic class TC3-TC7 are managed for VC1.
5. Port Arbitration is applied independently to queued-up packets to decide
   which port's packets will get loaded next into the real VC.
6. Finally, VC arbitration determines the order in which transactions in the VC
   buffers will be sent across the link.
7. Note that the VC arbiter selects packets for transmission only if sufficient
   flow control credits exist.
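
The TC/VC mappings in this example can be written out as lookup tables. The
sketch below is only an illustration of steps 1, 2, and 4: the names
INGRESS_MAP, EGRESS_MAP, classify, and tc_vc_map_field are hypothetical.
tc_vc_map_field shows how a set of TCs would be expressed as the 8-bit TC/VC
Map value, one bit per TC.

```python
# Ingress TC -> VC mappings from steps 1 and 2 of the example.
INGRESS_MAP = {
    0: {0: 0, 1: 0, 3: 1, 5: 1},   # port 0: TC0,1 -> VC0; TC3,5 -> VC1
    1: {0: 0, 2: 3, 3: 3, 4: 3},   # port 1: TC0 -> VC0; TC2-4 -> VC3
}
# Egress port 2 mapping from step 4: TC0-2 -> VC0, TC3-7 -> VC1.
EGRESS_MAP = {tc: (0 if tc <= 2 else 1) for tc in range(8)}


def classify(ingress_port, tc):
    """Return (ingress VC, egress VC) for a TLP, or raise on an unmapped TC."""
    port_map = INGRESS_MAP[ingress_port]
    if tc not in port_map:
        raise ValueError(f"TC{tc} is not mapped on ingress port {ingress_port}")
    return port_map[tc], EGRESS_MAP[tc]


def tc_vc_map_field(tcs):
    """Build an 8-bit TC/VC Map value with bit n set for each TCn listed."""
    value = 0
    for tc in tcs:
        value |= 1 << tc
    return value
```

A TC3 packet arriving on port 1 uses VC3 on the ingress Link but VC1 on the
egress Link, which shows that the VC assignment is local to each port while
the TC stays with the packet.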


Figure 7-20: Arbitration Examples in a Switch

[Figure: a three-port switch with the numbered steps from the text.
(1) Ingress port 0 TC/VC mapping: TC0,1 to VC0 and TC3,5 to VC1.
(2) Ingress port 1 TC/VC mapping: TC0 to VC0 and TC2-4 to VC3.
(3) Routing information in each TLP determines the egress port.
(4) Egress port 2 TC/VC mapping: TC0-2 to VC0 and TC3-7 to VC1.
(5) Egress port 2 port arbitration, done separately for VC0 and VC1,
    selects between packets from port 0 and port 1.
(6), (7) Egress port 2 VC arbitration selects between VC0 and VC1.
This logic is replicated for each egress port.]

Arbitration in Multi-Function Endpoints


Another set of registers called the Multi-Function Virtual Channel (MFVC)
capability is defined for the specific case of endpoints that will implement QoS
in a device with multiple functions. Not surprisingly, this case presents the same
arbitration issues internally that a switch port must handle.

There are two cases described in the spec for this arbitration. In the first case,
shown in Figure 7-21 on page 271, there are two Functions but only Function 0
includes VC Capability registers and the assignments made there are implicitly
the same for all functions. For this option, arbitration between the functions will
be handled in some vendor-specific manner. That's the simplest approach, but
it doesn't include a standard structure to define priority between requests from
different functions and so it doesn't support QoS.


Figure 7-21: Simple Multi-Function Arbitration

[Figure: Function 0 (with a VC Capability, ID 0002h) and Function 1 each
connect over an internal link to vendor-specific arbitration logic that feeds
the shared egress port.]

If QoS support is desired, then an MFVC is implemented in Function 0 and each
function has its own unique set of VC Capability registers. To preserve software
backward compatibility, the spec states that the VC Capability ID for a device
that does not use MFVC must be 0002h, while the VC Capability ID for a device
that does implement an MFVC structure must be 0009h.

Figure 7-22 on page 272 shows the MFVC register block and a block diagram of
an example with two functions in an endpoint whose port supports two VCs.
Each function has a Transaction Layer and its own VC Capability registers, but
doesn't implement the lower layers. Instead, the functions connect to the
Transaction Layer of the shared port that does have all the layers. Sharing the
hardware interface results in lower cost, of course, and the addition of MFVC
allows the functions to handle isochronous traffic.

As can be seen in the figure, the MFVC registers reside in Function 0 only and
define the VCs and arbitration methods to be used for this interface. The MFVC
registers look very much the same as VC capability registers and support VC
arbitration and Function arbitration. Since packets from multiple functions can
attempt to access the same VC at the same time, Function Arbitration decides
the priorities among them. That should look familiar by now because it's the
same concept as port arbitration and even uses the same arbitration options,
including TBWRR. VC arbitration options are also the same as they are in the
single-function VC registers.
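
Software could tell the two cases apart by walking a function's extended
capability list and checking the capability IDs. The sketch below is a
hypothetical illustration: read_dword stands in for whatever configuration
read routine a platform provides, and the 0008h value for MFVC is taken from
Figure 7-22. It relies on the standard extended capability header layout
(capability ID in bits 15:0, next pointer in bits 31:20, list starting at
offset 100h).

```python
VC_CAP_ID = 0x0002            # VC capability, device without MFVC
MFVC_CAP_ID = 0x0008          # MFVC capability (value shown in Figure 7-22)
VC_WITH_MFVC_CAP_ID = 0x0009  # VC capability in a device that has MFVC


def find_ext_caps(read_dword):
    """Walk the extended capability list; return {capability ID: offset}."""
    caps = {}
    offset = 0x100
    seen = set()
    while offset and offset not in seen:
        seen.add(offset)
        header = read_dword(offset)
        if header in (0, 0xFFFFFFFF):   # no capability present
            break
        caps[header & 0xFFFF] = offset
        offset = (header >> 20) & 0xFFC  # next pointer, dword aligned
    return caps


def function_arbitration_style(caps):
    """Describe how arbitration between functions is handled."""
    if MFVC_CAP_ID in caps:
        return "MFVC"
    if VC_CAP_ID in caps:
        return "vendor-specific"
    return "none"
```

Run against Function 0, the first result indicates standard Function
Arbitration through MFVC, while the second corresponds to the simpler case of
Figure 7-21.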


Figure 7-22: QoS Support in Multi-Function Arbitration

[Figure, upper part: the MFVC register block. Its layout matches the VC
capability structure (Extended Capability Header, Port VC Capability 1 and 2,
Port VC Status and Control, VC Resource Cap/Control/Status for each VC, and
the VC Arbitration Table), except that each VC Resource Cap register holds the
offset of a Function Arbitration Table (0 through n) instead of a Port
Arbitration Table.]

[Figure, lower part: Function 0 holds the MFVC Capability (ID 0008h) and a VC
Capability (ID 0009h); Function 1 holds its own VC Capability (ID 0009h).
Each function connects over an internal link to the shared port, where TC/VC
mapping places its packets into per-source buffers for each VC (VC0 and VC7
are shown). A Function Arbiter per VC selects among them, and the VC Arbiter
then selects between the VCs for the egress port.]

Isochronous Support
As mentioned earlier, not every machine or application needs isochronous
support, but there are some that can't get by without it. Since PCIe was
designed to support it from the beginning, let's consider what would need to be
in place to make this work.


Timing is Everything
Consider the example shown in Figure 7-23 on page 274, where a synchronous
connection would be desirable but isn't possible. Instead, we emulate a
synchronous path with isochronous mechanisms. In this example, isochrony
defines the amount of data that will be delivered within each Service Interval
to achieve the required service. The following sequence describes the operation:

1. The synchronous source (video camera and PCI Express interface)
   accumulates data in Buffer A during the first of the equal service intervals
   (SI 1).
2. The camera delivers all of the accumulated data across the general-purpose
   bus during the next service interval (SI 2) while it accumulates the next
   block of data in Buffer B.
   Clearly, the system must be able to guarantee that the entire contents of
   Buffer A can be delivered during the service interval, regardless of whether
   other traffic is in flight on the Link. This is handled by assigning a high
   priority to the time-sensitive packets and programming arbitration schemes
   so they'll be handled first any time there is competition with other traffic.
   Also note that, as long as all the data is delivered within the time window,
   it doesn't matter exactly when it arrives. It might be spread out across the
   interval or bunched up in one place inside it. As long as it's all delivered
   within the Service Interval, the guarantees can still be met.
3. During SI 2, the tape deck receives and buffers the incoming data, which
   can then be delivered to storage for recording during SI 3. The camera
   unloads Buffer B onto the Link during SI 3 while accumulating new data
   into Buffer A, and the cycle repeats.
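
The alternating use of the camera's two buffers in the steps above can be
tabulated with a few lines of code. This is just a sketch of the sequence for
the camera side; ping_pong is a hypothetical name.

```python
def ping_pong(num_intervals):
    """Return (SI number, buffer being filled, buffer being sent) per interval."""
    rows = []
    for si in range(1, num_intervals + 1):
        filling = "A" if si % 2 == 1 else "B"
        # Nothing has been accumulated yet during the first interval.
        sending = None if si == 1 else ("B" if filling == "A" else "A")
        rows.append((si, filling, sending))
    return rows
```

Each buffer must be drained completely within one service interval, because
the camera starts overwriting it in the interval that follows.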


Figure 7-23: Example Application of Isochronous Transaction

[Figure: a Camera with Buffer A and Buffer B behind a PCI Express interface,
and a Storage device (e.g., tape) with its own Buffer A and Buffer B, shown
against a timeline of equal Service Intervals (SI 1, SI 2, SI 3).
Camera: SI 1, data accumulated in Buffer A. SI 2, data from Buffer A
delivered while next data accumulates in Buffer B. SI 3, data from Buffer B
delivered while next data accumulates in Buffer A.
Storage: SI 2, data received into Buffer A. SI 3, data from Buffer A delivered
to storage while data is received into Buffer B.]

How Timing is Defined


Isochronous timing is defined in PCIe by the time slot used in the Time-Based
Weighted Round Robin port arbitration scheme. At present, the time for each
slot is 100ns, and represents one entry of the 128 entries in the TBWRR table.
Once set up, the arbiter will repeatedly cycle through this table once every
12.8µs (128 x 100ns), which represents the overall Service Interval.

Making an isochronous path work as intended requires a few considerations.
First, the data packets must be delivered with predictable timing at regular
intervals. Second, the maximum amount of isochronous data to be delivered
must be known ahead of time and packets must not be allowed to exceed that
limit. Third, the Link bandwidth must be sufficient to support the amount of
data that needs to be delivered in a given time slot.


Consider the following example. A single-Lane Link running at 2.5Gb/s delivers
one symbol every 4ns. That allows it to send 25 symbols within a 100ns time
slot, but is that enough to be useful? In many cases it's not, because a TLP may
need 28 bytes of overhead for the combination of header, sequence number,
LCRC, and so forth. That would mean there isn't even time to finish sending the
overhead, much less any data payload, in 100ns. If we needed to send 128 bytes
of data, then the bandwidth requirement would be 128 + overhead = 156 bytes.
One option for solving this problem would be to increase the Link width to 8
Lanes, allowing eight times as many bytes to be sent at once. That change
would deliver 200 bytes in 100ns and allow a single time slot to deliver all the
isochronous data. Another solution would be to use a single Lane but give the
port more time slots, since 8 time slots at the lower Link width would deliver
the same amount of data. The choice of solution depends on cost and
performance constraints, but the system designer must know the timing and
bandwidth requirements of the isochronous path to be able to set it up correctly.
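
The arithmetic in this example is easy to check with a short calculation. The
sketch below assumes 8b/10b encoding (one 10-bit symbol per byte) and the
100ns slot from the text; symbols_per_slot and slots_needed are hypothetical
helper names.

```python
import math


def symbols_per_slot(rate_gbps, lanes=1, slot_ns=100, bits_per_symbol=10):
    """Symbols (bytes under 8b/10b) a Link can carry in one time slot."""
    per_lane = math.floor(slot_ns * rate_gbps / bits_per_symbol)
    return per_lane * lanes


def slots_needed(total_bytes, rate_gbps, lanes=1):
    """Time slots needed to carry one packet of total_bytes."""
    return math.ceil(total_bytes / symbols_per_slot(rate_gbps, lanes))
```

By this count, seven x1 time slots (175 symbols) would already cover the
156-byte packet, and the eight slots mentioned in the text match the full 200
bytes that the x8 Link carries in one slot.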

How Timing is Enforced


When timing is an integral part of the proper operation of a design, as in the
previous example, it is enforced by the combination of things we've discussed
so far. First, high-priority TCs must be selected in software and VCs set up in
hardware with the mappings between them defined so that only the correct
packets will be placed into the high-priority VCs. Then the desired timing is a
matter of programming the arbitration schemes to accommodate the needed
bandwidth in the specified time. For example, the choice for VC arbitration
would probably be the Strict Priority option, since it's the only choice that can
ensure that a high-priority packet won't be delayed by other packets. For Port
arbitration the choice must be TBWRR to enforce timing.

Software Support
Supporting isochronous service requires some coordination between the
software elements in the system. In a PC system, device drivers will report
isochronous requirements and capabilities to the OS, which will then evaluate
the overall system demands and allocate resources appropriately. Embedded
systems will be different, because all the pieces are known at the outset and
software can be simpler. In the following discussion we'll describe the PC case
since an embedded system should simply be a simpler subset of that.


Device Drivers
A device driver must be able to report its timing requirements to the software
that oversees isochronous operation and obtain permission before trying to use
isochronous packets. It's important to note that driver-level software should not
directly change hardware assignments or arbitration policies on its own, even
though it could, because the result would be chaos. If multiple drivers were
each independently trying to do this, the last one to make changes would
overwrite any previous assignments. To avoid that, an OS-level program called
an Isochronous Broker receives the timing requests from the system devices and
assigns system resources in a coordinated way that accommodates them all.

Isochronous Broker
This program manages the end-to-end flow of isochronous packets. It receives
the isochronous timing requests from device drivers and allocates system
resources in a way that accommodates the requests through the target path. In
the spec this is referred to as establishing an isochronous contract between the
requester/completer pair and the PCIe fabric. Doing so requires verifying that
the intended path can indeed support isochronous traffic, and then
programming the appropriate arbitration schemes to ensure it works within the
specified timing requirements.

Bringing it all together


By now it should be reasonably clear what needs to be done to support
isochronous traffic flow in a system, but let's look at one last example to bring
all the pieces together. If we expand on the previous video capture example to
show a more complex system, like the one in Figure 7-24 on page 277, we'll be
able to discuss all the parts that must be in place if the video camera is going to
be able to deliver captured data into system memory. This would be a difficult
environment for isochronous service because there are so many devices that can
compete for bandwidth in the path, but that also makes it useful to illustrate the
various things that must be considered.

Endpoints
Starting at the bottom, what will be needed in the PCIe interface for the video
endpoint device itself? In hardware, more than one VC will be required if we're
going to differentiate packets. Let's assume a single-function device for
simplicity. The device driver would need to report the device capabilities and
isochronous timing requirements to the OS-level Isochronous Broker, which
would evaluate the system and then report back whether an isochronous
contract was possible and which TCs the software should use.


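Figure 7-24: Example Isochronous System

[Figure: a Processor, GFX device, and System Memory attached to the Root
Complex, with Switch 2 and Switch 1 below it. The switches connect a Slot, a
Video Camera sourcing time-sensitive data, and a SCSI device sourcing
lower-priority data.]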

The driver would then program VC numbers and map the appropriate TCs to
each VC. It would also most likely program the VC arbitration to be Strict
Priority for the high-priority channels. The one caveat here is that the
arbitration must still be fair, meaning the low-priority channels won't get
starved for access. That means the high-priority VCs can't have traffic pending
constantly but instead must spread out packet injection over time.

One other observation regarding Link operation is necessary before we finish
our discussion of endpoints, and that is regarding Flow Control. The receive
buffers of devices in the isochronous path must be large enough to handle the
expected packet flow without causing any back pressure as long as packets are
injected uniformly according to the Isochronous Contract. In addition, Flow
Control Updates must be returned quickly enough to avoid stalls.


Switches
Next, consider what would need to be present in each of the switches that reside
between the endpoint and the Root Complex. Switches don't commonly have
device drivers, so it would fall to OS-level software like the Isochronous Broker
to read their configuration information and determine what service they
support. First, all the ports in the isochronous path must support more than one
VC, and the TC/VC mapping must match on both ends of each Link. Remember
that once the packet gets into the Transaction Layer of the Switch port, only the
TC remains with the packet, and the VC assignment for that TC is specific to
each port. The TC/VC mapping of the downstream port of Switch 1 must match
the mapping of the endpoint, but the other switch port mappings may be
different to match the other end of their Links.

Arbitration Issues. The choices for arbitration are straightforward. In
our example, the isochronous path is shown as carrying traffic in only one
direction for simplicity. It is possible to have isochronous traffic flowing in
both directions in the case of a memory read, for example, but our example
was chosen to resemble the video streaming case.

VC arbitration for the isochronous egress port will most likely need to use
the Strict Priority scheme for the same reasons the endpoint does. Port
arbitration will need to use the Time-Based WRR scheme, and that means
software must understand the proper access ratios and program the Port
Arbitration Tables to implement them. This might not be as simple as it
sounds if multiple switches are in the path because even though they'll all
use the same TBWRR arbitration scheme, it's not clear how the service
intervals for each of them would be coordinated. If the SIs are not aligned,
meeting timing guarantees could be more difficult, depending on how busy
the Links are. Coordinating the service intervals wasn't considered in the
spec, though, so it would again involve a non-standard method. Clearly,
this problem would be much simpler if we didn't have multiple switches in
an isochronous path.

Timing Issues. Figure 7-25 on page 279 shows the timing of packets being
delivered by the two endpoints for our example. Packets from the video
device, with a known size and delivered in regular and predictable
intervals, are shown as the heavier arrows. The smaller, lighter arrows
represent packets from the SCSI drive that are lower priority and whose
timing is not predictable. In the endpoint, the packets simply need to have the
proper TC assigned to them, but a switch needs to ensure that the proper
timing policy is enforced. This is done by using TBWRR, which specifies which
port will have access at a given time and for how long. Knowing the size and
frequency of the isochronous packets allows software to properly arrange the
timing, but what kind of timing is needed?

Figure 7-25: Injection of Isochronous Packets

[Figure: packet arrivals plotted on a time axis divided into Service Intervals
SI 1, SI 2, and SI 3.]

First, let's review the parameters involved by considering a simple example.
Recall that PCIe bases a time slot on the reference clock period given by the
Port VC Capability Register 1 field called Reference Clock. At present the only
option for that field is 100ns, and the TBWRR table has no options besides 128
entries. The length of the Service Interval is the product of those, making it
12.8µs. The bandwidth for a given device can be expressed by the equation
below, where Y is the data to be delivered in one time slot (the spec states that
the Max Payload Size programmed during configuration must be used for this
bandwidth calculation), M is the number of time slots, and T is the overall
Service Interval. If we choose 128 bytes as the payload, and we know that SI is
12.8µs, then the BW = 10MB/s for each time slot allocated.

$$\mathrm{BW} = \frac{Y \cdot M}{T}$$
Now let's consider a more realistic example. Let's say that our Links are running
at the Gen2 speed, that the video device needs to have a guaranteed bandwidth
of 100MB/s, and that it will send 512-byte packets. Filling in the equation shows
M = 2.5 instances of 512 bytes are needed:

$$100 \times 10^{6} = \frac{512 \cdot M}{12.8 \times 10^{-6}} \quad\Rightarrow\quad M = \frac{(100 \times 10^{6})(12.8 \times 10^{-6})}{512} = 2.5$$

But how much data can actually be sent in one time slot? The answer depends
on speed and Link width, of course. At 5.0Gb/s it takes 2ns to send each 10-bit
symbol, so 50 symbols can be delivered per Lane in 100ns. If the packet size is
512 bytes of data plus another 28 or so for the header, then 11 time slots would
be needed to deliver 550 symbols for one packet using a x1 Link. It is possible
to give one port several contiguous slots if needed, so that's one solution. Since
the packet size that will be sent is always the same, we can't really program 2.5
instances of it, so we'd have to use 3 instead. From our equation, 3 instances of
512 bytes each results in an actual bandwidth of 120MB/s. That's higher than we
need, but it solves the problem. The number of time slots used would then be
11 x 3 = 33, leaving 95 for other use in the Service Interval. Each group of 11
time slots would need to be contiguous but the groups could be spaced out over
the service interval.

Another solution would be to increase the Link width. Although the hardware
would cost more, using 11 Lanes would allow delivery of all the data in one
time slot. The CEM spec doesn't currently support a x11 option, but a x12 option
is available and would work for our example. Using a wide Link like that means
software would only need to program one time slot for each packet, and just
three over the whole service interval to support isochronous traffic for this
device. Unlike the x1 case, now we wouldn't need contiguous time slots.
Instead, they could be spaced over the service interval in some optimal fashion.
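
The numbers in this example can be reproduced with a short calculation. The
sketch below uses the 100ns slot, the 128-entry table, and the figure of 50
symbols per Lane per slot at Gen2 from the text; the function names and the
28-byte overhead default are assumptions for illustration. Integer arithmetic
in nanoseconds is used so that the rounding up to whole packet instances is
exact.

```python
import math

SLOT_NS = 100
TABLE_SLOTS = 128
SERVICE_INTERVAL_NS = SLOT_NS * TABLE_SLOTS  # 12,800ns = 12.8 microseconds


def bandwidth_mb_per_s(payload_bytes, instances):
    """BW = Y * M / T, reported in MB/s (10**6 bytes per second)."""
    return payload_bytes * instances * 1000 / SERVICE_INTERVAL_NS


def instances_needed(target_bytes_per_s, payload_bytes):
    """Smallest whole number of packets per Service Interval meeting the target."""
    numerator = target_bytes_per_s * SERVICE_INTERVAL_NS
    denominator = payload_bytes * 10**9
    return -(-numerator // denominator)  # integer ceiling division


def plan(target_bytes_per_s, payload_bytes, symbols_per_lane_slot,
         lanes=1, overhead_bytes=28):
    """Summarize the time slot budget for one isochronous device."""
    instances = instances_needed(target_bytes_per_s, payload_bytes)
    per_slot = symbols_per_lane_slot * lanes
    slots_each = math.ceil((payload_bytes + overhead_bytes) / per_slot)
    used = instances * slots_each
    return {
        "instances": instances,
        "slots_per_packet": slots_each,
        "slots_used": used,
        "slots_free": TABLE_SLOTS - used,
        "actual_mb_per_s": bandwidth_mb_per_s(payload_bytes, instances),
    }
```

Calling plan(100_000_000, 512, 50) gives the x1 case (3 instances, 11 slots
each, 33 used, 95 free, 120MB/s), and adding lanes=12 gives the x12 case with
one slot per packet.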
Bandwidth Allocation Problems. The TBWRR table must be programmed to
guarantee sufficient timely bandwidth for isochronous traffic and to ensure
that other traffic won't be allowed to interfere. In Figure 7-25 on page
279, the SCSI controller is shown as sending one packet in SI 1 and another
in SI 3. If the timing was such that one packet from that endpoint per SI was
allowed then this works fine.

Now let's say the SCSI controller attempts to inject more packets than it has
permission to do in SI 1, illustrated in Figure 7-26 on page 280. This is the
first of two bandwidth allocation problems mentioned in the spec and is
called oversubscription. This could interfere with isochronous traffic
flow, but programming the TBWRR table readily avoids that problem
because the arbitration only allows a packet from that port at specific times.
If more packets from that port are queued up, they simply have to wait
until the next available time, which might be in SI 2, as shown in this
example. Eventually, this can result in flow control back pressure at the
sending agent.
Figure 7-26: Over-Subscribing the Bandwidth
[Figure: packet timeline across Service Intervals SI 1, SI 2 and SI 3 (SI = Service Interval)]


The second timing problem is called congestion and happens when too
many isochronous requests are sent within a given time window, as shown
in Figure 7-27 on page 281. This is a similar problem but now there is no
simple solution. Unlike the previous case, postponing high-priority packets
until another time slot is not an option, so the system must make an effort to
handle them all. The result is that some requests may experience excessive
service latencies. To correct this, software would need to change the
distribution of packets so that they can be supported by the available hardware
bandwidth.

Figure 7-27: Bandwidth Congestion
[Figure: packet timeline across Service Intervals SI 1, SI 2 and SI 3 (SI = Service Interval)]

Latency Issues. Managing latency for packet delivery is an important
part of isochrony, and involves the combination of the fabric latency and the
Completer latency. Fabric latency depends on all the characteristics of the
Link between the various components in the system, especially the Link
width and frequency of operation. A simple way to minimize this value is
to constrain the complexity of the PCIe topology for isochronous paths.
Completer latency depends on the target endpoint's internal characteristics,
such as memory speed and internal arbitration.

Root Complex
The RC has the same arbitration and timing requirements as a switch. It receives
packets on several downstream ports and forwards them to the target in a way
that's consistent with the rules for isochrony described earlier. However, much
of how this is done will be vendor specific because the spec doesn't define the
RC or how it should be programmed.

Problem: Snooping. One interesting thing affecting timing and latency in
the root that we haven't yet discussed is the process of snooping. Normally,
any time an access to system memory takes place it will be to a location that
the processor considers cacheable, meaning it has permission to store a
temporary copy in its local caches. If an external device attempts to access
that area of memory, the chipset must first check the processor caches
before allowing the access because a cached copy may have been modified.
If so, the modified data will need to be written back to memory before it
will be available for the device access. Although it's necessary to ensure
memory coherency, the problem is that snooping takes time. How long it
takes is typically bounded but not predictable because it depends on what
else the CPUs are doing at that time. Depending on the timing requirements,
that kind of uncertainty could ruin an isochronous data flow.

Snooping Solutions. One way to avoid snooping is for devices to only
access areas of memory that have been designated as uncacheable. Another
option is for software to set the No Snoop attribute bit in the high-priority
packet headers. That forces the chipset to skip the snoop step regardless of
the memory type and go directly to memory because software has guaranteed
that doing so won't cause a problem. To enforce this as a requirement
for the isochronous path, another bit, called Reject Snoop Transactions, can
be initialized by hardware in the root port for the high-priority VC (see
the VC Resource Capability Register in Figure 7-17 on page 265). The purpose
of this is to allow only transactions for that VC that have the No Snoop
attribute set. Any incoming packets that don't have it set are discarded to
ensure that the timing will never be violated by waiting for a snoop.

Power Management
It's a simple observation, but if timing is important for a path in PCIe, then
power management (PM) mechanisms for devices in that path will need to be
handled carefully. Configuration software can read the latencies associated with
every PM condition and select those cases that the timing budget will permit.
The simplest approach, though, would just be to disable all PM options in an
isochronous path. Fortunately, this is easily done using existing configuration
registers. Devices can be placed into the device state D0 and left there, while the
hardware-controlled Link PM mechanism can be disabled (for more on PM, see
Chapter 16, entitled "Power Management," on page 703).

Error Handling
Finally, there is one last issue: what to do when errors occur on the Link. The
ACK/NAK protocol, covered in a later chapter, provides an automatic,
hardware-based retry mechanism to correct packets that encounter transmission
problems. This otherwise desirable feature presents a problem for isochrony
because retrying takes time, and how long it takes to resolve an error can vary
widely depending on things like how the problem was detected.


To decide this question we have to know how much time uncertainty the system
can tolerate and still deliver isochronous data. If the latency budget is too
tight, there simply won't be time for retrying failed packets and the ACK/NAK
protocol will have to be disabled. Interestingly, the spec writers evidently didn't
consider that possibility because no configuration bits are included for
disabling it or deciding how to handle packets that would have been retried but
now won't be. Therefore disabling this will require non-standard mechanisms
like vendor-specific registers.

If there isn't enough time available for retries, the target agent may simply
choose to discard any bad packets. Another option would be to use the bad
packets as they are, errors and all. For some applications using isochronous
support that isn't as counter-intuitive as it sounds. An error in video streaming,
for example, might cause an occasional glitch on the display, but that could be
considered an acceptable risk.

If there is enough time in the Service Interval to allow retries, a limit could be
placed on the possible latency they might add by adding a timer to track the
time until the end of the Service Interval and using that to decide whether a retry
could be attempted. Errors shouldn't happen very often, of course, so this might
be sufficient to correct the occasional transmission fault while still maintaining
isochronous timing.
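
The timer idea can be sketched in a few lines. This is a hypothetical
illustration, not a mechanism defined by the spec: the 12.8 µs Service Interval
comes from the earlier example, while `retry_latency_ns` is an assumed worst-case
estimate that a designer would have to supply.

```python
SERVICE_INTERVAL_NS = 12_800  # 128 time slots of 100 ns each


def retry_fits(elapsed_ns, retry_latency_ns, interval_ns=SERVICE_INTERVAL_NS):
    """Return True if a retry started now would finish before the interval ends.

    elapsed_ns: time already used in the current Service Interval.
    retry_latency_ns: assumed worst-case time to replay the failed packet.
    """
    return elapsed_ns + retry_latency_ns <= interval_ns


# A retry costing 2 us is attempted early in the interval but skipped near its end.
print(retry_fits(4_000, 2_000), retry_fits(11_500, 2_000))
# prints: True False
```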


8 Transaction Ordering
The Previous Chapter
The previous chapter discusses the mechanisms that support Quality of Service
and describes the means of controlling the timing and bandwidth of different
packets traversing the fabric. These mechanisms include application-specific
software that assigns a priority value to every packet, and optional hardware
that must be built into each device to enable managing transaction priority.

This Chapter
This chapter discusses the ordering requirements for transactions in a PCI
Express topology. These rules are inherited from PCI. The Producer/Consumer
programming model motivated many of them, so its mechanism is described
here. The original rules also took into consideration possible deadlock
conditions that must be avoided.

The Next Chapter


The next chapter describes Data Link Layer Packets (DLLPs). We describe the
use, format, and definition of the DLLP packet types and the details of their
related fields. DLLPs are used to support the Ack/Nak protocol, power
management, and the flow control mechanism, and can be used for
vendor-defined purposes.

Introduction
As with other protocols, PCI Express imposes ordering rules on transactions of
the same traffic class (TC) moving through the fabric at the same time.
Transactions with different TCs do not have ordering relationships. The reasons
for these ordering rules related to transactions of the same TC include:

• Maintaining compatibility with legacy buses (PCI, PCI-X, and AGP).
• Ensuring that the completion of transactions is deterministic and in the
  sequence intended by the programmer.
• Avoiding deadlock conditions.
• Maximizing performance and throughput by minimizing read latencies and
  managing read and write ordering.

Implementation of the specific PCI/PCIe transaction ordering is based on the
following features:

1. Producer/Consumer programming model on which the fundamental
   ordering rules are based.
2. Relaxed Ordering option that allows an exception to this when the
   Requester knows that a transaction does not have any dependencies on
   previous transactions.
3. ID Ordering option that allows switches to permit requests from one
   device to move ahead of requests from another device because unrelated
   threads of execution are being performed by these two devices.
4. Means for avoiding deadlock conditions and supporting PCI legacy
   implementations.

Definitions
There are three general models for ordering transactions in a traffic flow:

1. Strong Ordering: PCI Express requires strong ordering of transactions
   flowing through the fabric that have the same Traffic Class (TC) assignment.
   Transactions that have the same TC value assigned to them are mapped to a
   given VC, therefore the same rules apply to transactions within each VC.
   Consequently, when multiple TCs are assigned to the same VC all
   transactions are typically handled as a single TC, even though no ordering
   relationship exists between different TCs.
2. Weak Ordering: Transactions stay in sequence unless reordering would be
   helpful. Maintaining the strong ordering relationship between transactions
   can result in all transactions being blocked due to dependencies associated
   with a given transaction model (e.g., the Producer/Consumer Model).
   Some of the blocked transactions very likely are not related to the
   dependencies and can safely be reordered ahead of blocking transactions.
3. Relaxed Ordering: Transactions can be reordered, but only under certain
   controlled conditions. The benefit is improved performance like the
   weak-ordered model, but only when specified by software so as to avoid
   problems with dependencies. The drawback is that only some transactions will
   be optimized for performance. There is some overhead for software to
   enable transactions for Relaxed Ordering (RO).


Simplified Ordering Rules


The 2.1 revision of the spec introduced a simplified version of the Ordering
Table as shown in Table 8-1 on page 289. The table can be segmented on a
per-topic basis as follows:

• Producer/Consumer rules (page 290)
• Relaxed Ordering rules (page 296)
• Weak Ordering rules (page 299)
• ID Ordering rules (page 301)
• Deadlock avoidance (page 303)

These sections provide details associated with the ordering models, operation,
rationales, conditions and requirements.

Ordering Rules and Traffic Classes (TCs)


PCI Express ordering rules apply to transactions of the same Traffic Class (TC).
Transactions moving through the fabric that have different TCs have no
ordering requirement and are considered to be associated with unrelated
applications. As a result, there is no transaction-ordering-related performance
degradation associated with packets of different TCs.

Packets that do share the same TC may experience performance degradation as
they flow through the PCIe fabric. This is because switches and devices must
support ordering rules that may require packets to be delayed or forwarded in
front of packets previously sent.

As discussed in Chapter 7, entitled "Quality of Service," on page 245,
transactions of different TCs may map to the same VC. The TC-to-VC mapping
configuration determines which packets of a given TC map to a specific VC. Even
though the transaction ordering rules apply only to packets of the same TC, it
may be simpler to design endpoint devices/switches/root complexes that apply
the transaction ordering rules to all packets within a VC even though multiple
TCs are mapped to the same VC.

As one would expect, there are no ordering relationships between packets that
map to different VCs no matter their TC.


Ordering Rules Based On Packet Type


Ordering relationships defined by the PCIe spec are based on TLP type. TLPs
are divided into three categories: 1) Posted, 2) Completion and 3) Non-Posted
TLPs.

The Posted category of TLPs includes memory write requests (MWr) and
Messages (Msg/MsgD). The Completion category of TLPs includes Cpl and CplD.
The Non-Posted category of TLPs includes MRd, IORd, IOWr, CfgRd0, CfgRd1,
CfgWr0 and CfgWr1.
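
As a small illustration, the three categories can be written as a lookup. This
is a sketch only: the TLP mnemonics are the ones listed above, while the function
name and the string labels are hypothetical.

```python
POSTED = {"MWr", "Msg", "MsgD"}
NON_POSTED = {"MRd", "IORd", "IOWr", "CfgRd0", "CfgRd1", "CfgWr0", "CfgWr1"}
COMPLETION = {"Cpl", "CplD"}


def tlp_category(tlp_type):
    """Map a TLP mnemonic to the ordering category used by the rules."""
    if tlp_type in POSTED:
        return "Posted"
    if tlp_type in NON_POSTED:
        return "Non-Posted"
    if tlp_type in COMPLETION:
        return "Completion"
    raise ValueError(f"TLP type not covered by this sketch: {tlp_type}")


print(tlp_category("MWr"), tlp_category("IOWr"), tlp_category("CplD"))
# prints: Posted Non-Posted Completion
```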

The transaction ordering rules are described by a table in the following section,
"The Simplified Ordering Rules Table" on page 288. As you will notice, the
table shows TLPs listed according to the three categories mentioned above with
their ordering relationships defined.

The Simplified Ordering Rules Table


The table is organized in a Row Pass Column fashion. All of the rules are
summarized following the Simplified Ordering Table. Each rule or group of rules
defines the actions that are required.

In Table 8-1 on page 289, columns 2-5 represent transactions that have
previously been delivered by a PCI Express device, while rows A-D represent a new
transaction that has just arrived. For outbound transactions, the table specifies
whether a transaction represented in the row (A-D) is allowed to pass a previous
transaction represented by the column (2-5). A "No" entry means the
transaction in the row is not allowed to pass the transaction in the column. A "Yes"
entry means the transaction in the row must be allowed to pass the transaction
in the column to avoid a deadlock. A "Yes/No" entry means a transaction in a
row is allowed to pass the transaction in the column but is not required to do so.
The entries in the table have the following meanings.


Table 8-1: Simplified Ordering Rules Table

Row pass Column?      Posted Request   Non-Posted Request               Completion
(Col 1)               (Col 2)          Read Request    NPR with Data    (Col 5)
                                       (Col 3)         (Col 4)
--------------------  ---------------  --------------  ---------------  ----------
Posted Request        a) No            Yes             Yes              a) Y/N
(Row A)               b) Y/N                                            b) Yes

Non-Posted Request:
  Read Request        a) No            Y/N             Y/N              Y/N
  (Row B)             b) Y/N

  NPR with Data       a) No            Y/N             Y/N              Y/N
  (Row C)             b) Y/N

Completion            a) No            Yes             Yes              a) Y/N
(Row D)               b) Y/N                                            b) No

• A2a, B2a, C2a, D2a: To enforce the Producer/Consumer model, a subsequent
  transaction is not allowed to pass a Posted Request.
• A2b, D2b: If RO is set, a Posted Request (A2b) or a Read Completion (D2b) is
  permitted to pass a previously queued Memory Write or Message Request.
• A2b, B2b, C2b, D2b: If the optional IDO is being used, a subsequent
  transaction is allowed to pass a Posted Request, as long as their Requester
  IDs are different.
• A3, A4: A Memory Write or Message Request must be allowed to pass
  Non-Posted Requests to avoid deadlocks.
• A5a: A Posted Request is permitted but not required to pass Completions.
• A5b: Deadlock avoidance case. In a PCIe-to-PCI/PCI-X bridge, for
  transactions going from PCIe to PCI or PCI-X, a Posted Request must be able to
  pass a Completion, or a deadlock may occur.
• B3, B4, B5, C3, C4, C5: These cases implement weak ordering without
  risking any ordering-related problems.
• D3, D4: Completions must be allowed to pass Read and I/O or Configuration
  Write Requests (Non-Posted Requests) to avoid deadlocks.
• D5a: Completions with different Transaction IDs may pass each other.
• D5b: Completions with the same Transaction ID are not allowed to pass
  each other. This ensures that multiple completions for a single request will
  remain in ascending address order.
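
The table and the rule list can be captured in a small lookup function. The
sketch below follows only the rules as summarized in this section, not the full
text of the spec; the function name, the category labels ("P", "NP_READ",
"NP_DATA", "CPL") and the keyword arguments are all hypothetical.

```python
def may_pass(row, col, ro=False, ido_diff_id=False,
             same_transaction_id=False, pcie_to_pci_bridge=False):
    """Return "No", "Yes" or "Y/N" for a new TLP (row) passing an older one (col).

    row, col: "P" (Posted), "NP_READ", "NP_DATA" or "CPL".
    ro: the passing TLP has Relaxed Ordering set.
    ido_diff_id: IDO is in use and the two TLPs have different Requester IDs.
    same_transaction_id: both TLPs are Completions for the same request.
    pcie_to_pci_bridge: the A5b deadlock-avoidance case applies.
    """
    if col == "P":                       # column 2: rules x2a / x2b
        relaxed = ido_diff_id or (ro and row in ("P", "CPL"))
        return "Y/N" if relaxed else "No"
    if row == "P":                       # row A
        if col == "CPL":                 # A5a / A5b
            return "Yes" if pcie_to_pci_bridge else "Y/N"
        return "Yes"                     # A3, A4
    if row == "CPL":                     # row D
        if col == "CPL":                 # D5a / D5b
            return "No" if same_transaction_id else "Y/N"
        return "Yes"                     # D3, D4
    return "Y/N"                         # rows B and C, columns 3-5


print(may_pass("CPL", "P"), may_pass("CPL", "P", ro=True), may_pass("P", "NP_READ"))
# prints: No Y/N Yes
```

Returning the table's own three values, rather than a boolean, keeps the
difference between "must be allowed to pass" and "may pass" visible to the caller.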


Producer/Consumer Model
This section describes the operation of the Producer/Consumer model and the
associated ordering rules required for proper operation. Figure 8-1 on page 291
simply illustrates a sample topology. Subsequent examples using this topology
describe the operation of the Producer/Consumer model with proper ordering,
followed by an example of the model failing due to improper ordering.
The Producer/Consumer model is the common method for data delivery in PCI
and PCIe. The model comprises five elements as depicted in Figure 8-1:
• Producer of data
• Memory data buffer
• Flag semaphore indicating data has been sent by the Producer
• Consumer of data
• Status semaphore indicating the Consumer has read the data
The specification states that the Producer/Consumer model will work
regardless of the arrangement of all the elements involved. In this example, the Flag
and Status elements reside in the same physical device, but could be located in
different devices.


Figure 8-1: Example Producer/Consumer Topology
[Figure: the Consumer (Processor) and Memory attach to the Root Complex; a PCIe
Switch below the Root Complex connects to the Producer, which holds the Flag and
Status semaphores. Each port shows Posted (P), Non-Posted (NP) and Completion
(CPL) buffers.]

Producer/Consumer Sequence: No Errors

Refer to Figure 8-2 on page 293 during the following discussion. The example
presumes that the Flag and Status elements are cleared to start with. These
semaphores are included within the same device in this example. The sequence of
numbered events in the description below and depicted in Figure 8-2 on page
293 reflects the correct ordering in this Part 1 sequence.


1. In the example, a device called the Producer performs one or more Memory
   Write transactions (Posted Requests) targeting a Data Buffer in memory.
   Some delay can occur as the data flows through Posted buffers.
2. The Consumer periodically checks the Flag by initiating a Memory Read
   transaction (Non-Posted Request) to determine if data has been delivered
   by the Producer.
3. The Flag semaphore is read by the device and a Memory Read Completion
   is returned to the Consumer, indicating that notification of data delivery
   has not been performed by the Producer (Flag = 0) yet.
4. The Producer sends a Memory Write Transaction (Posted Request) to
   update the Flag to 1.
5. Once again, the Consumer checks the Flag by performing the same
   transaction performed in step 2.
6. When the Flag semaphore is read this time, the Flag is set to 1, indicating to
   the Consumer, via the Completion, that all of the data has been delivered by
   the Producer to memory.
7. Next, the Consumer performs a Memory Write transaction (Posted Request)
   to clear the Flag semaphore back to zero.
Figure 8-3 on page 294 continues the example in this Part 2 sequence.
8. The Producer, having more data to send, periodically checks the Status
   semaphore by initiating a Memory Read transaction (Non-Posted Request).
9. The Status semaphore is read by the Producer and a Memory Read
   Completion is returned to the Producer, indicating that the Consumer has not
   read the memory buffer contents and updated Status (Status = 0).
10. The Consumer, knowing that the memory buffer has data available,
    performs one or more Memory Read Requests (Non-Posted Requests) to get
    the contents from the buffer.
11. Memory contents are read and returned to the Consumer.
12. Upon completing the data transfer, the Consumer initiates a Memory Write
    Request (Posted Request) to set the Status semaphore to 1.
13. Once again, the Producer checks the Status semaphore by delivering a
    Memory Read Request (Non-Posted Request).
14. The device reads the Status and this time it is set to 1. The Completion is
    returned to the Producer, thereby indicating data can be sent to Memory.
15. The Producer sends a Memory Write to clear the Status semaphore to 0.
16. The sequence of events starting with step 1 is repeated by the Producer.
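
The handshake in the steps above can be modeled in software to make the roles
of the two semaphores concrete. This is a simplified sketch, not PCIe code:
every access is treated as instantaneous and in order, and the names `memory`,
`device`, `produce`, `consume` and `producer_may_send` are hypothetical.

```python
memory = {"buffer": None}          # the memory data buffer
device = {"flag": 0, "status": 0}  # semaphores, both cleared to start with


def produce(data):
    memory["buffer"] = data        # step 1: write the data buffer
    device["flag"] = 1             # step 4: set Flag to announce the data


def consume():
    if device["flag"] != 1:        # steps 2-3 and 5-6: poll the Flag
        return None                # no data announced yet
    device["flag"] = 0             # step 7: clear the Flag
    data = memory["buffer"]        # steps 10-11: read the buffer
    device["status"] = 1           # step 12: report that the data was read
    return data


def producer_may_send():
    if device["status"] != 1:      # steps 8-9 and 13-14: poll Status
        return False
    device["status"] = 0           # step 15: clear Status
    return True                    # step 16: the sequence can start again


print(consume())            # prints: None
produce("block 0")
print(consume())            # prints: block 0
print(producer_may_send())  # prints: True
```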


Figure 8-2: Producer/Consumer Sequence Example, Part 1
[Figure: the topology of Figure 8-1 annotated with steps 1 through 7. The Flag
changes from 0 to 1 while Status remains 0. Legend: P = Posted Request,
NP = Non-Posted Request, CPL = Completion.]


Figure 8-3: Producer/Consumer Sequence Example, Part 2
[Figure: the topology of Figure 8-1 annotated with steps 8 through 15. Flag and
Status are each shown with the values 0 and 1. Legend: P = Posted Request,
NP = Non-Posted Request, CPL = Completion.]


Producer/Consumer Sequence: Errors

The previous example was handled correctly without a discussion of the
ordering rules; however, it may have been apparent that race conditions can cause the
Producer/Consumer sequence to fail. Figure 8-4 on page 296 illustrates a simple
sequence to demonstrate one of several problems that can arise without
ordering rules being enforced. Refer to Figure 8-4 on page 296 during the
following discussion.

1. The Producer performs a Memory Write request (Posted Request) to the
   memory buffer. Let us assume that the memory write data is temporarily stuck
   in the Switch upstream port Posted Flow Control buffer.
2. The Producer sends a Memory Write Transaction (Posted Request) to
   update the Flag to 1.
3. The Consumer initiates a Memory Read Request (Non-Posted Request) to
   check if the Flag has been set to 1.
4. The contents of the Flag are returned to the Consumer via a Completion.
5. Knowing that data has been delivered to memory, the Consumer performs
   a memory read request to fetch the data. However, the Consumer is
   unaware that the data is temporarily stuck in a Posted Flow Control buffer
   due to lack of flow control credits associated with the link between the
   upstream switch port and the Root Complex. Consequently, the Consumer
   receives old data when the Completion is returned to the Consumer.

The problem is avoided with ordering rules supported by virtual PCI bridges
within the topology. In this example, when the Consumer performed the
Memory Read transaction in steps 3 and 4, the Virtual PCI bridge at the upstream
switch port should not allow the contents of the flag (Completion 4) to be
forwarded ahead of the previously posted data.
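
The race can be reproduced with a toy model of the switch upstream port. The
sketch below is a deliberately simplified assumption: it models one queue that
holds the stuck data write and the Flag Completion, treats the Consumer's buffer
read as happening the moment the Completion arrives, and uses hypothetical names
throughout.

```python
from collections import deque


def pick_next(queue, posted_credits, enforce_ordering):
    """Index of the next TLP the port may forward, or None if it must wait."""
    for i, (kind, _) in enumerate(queue):
        if kind == "P" and not posted_credits:
            if enforce_ordering:
                return None    # nothing may pass the blocked Posted write
            continue           # ordering ignored: look past the blocked write
        return i
    return None


def consumer_sees(enforce_ordering):
    memory = {"buffer": "old data"}
    # Oldest first: the data write (step 1), then the Flag Completion (step 4).
    queue = deque([("P", "new data"), ("CPL", 1)])
    seen = None
    for tick in range(4):
        posted_credits = tick >= 2   # Posted credits only arrive at tick 2
        i = pick_next(queue, posted_credits, enforce_ordering)
        if i is None:
            continue
        kind, value = queue[i]
        del queue[i]
        if kind == "P":
            memory["buffer"] = value      # the data finally reaches memory
        else:
            seen = memory["buffer"]       # Flag = 1 arrives; Consumer reads buffer
    return seen


print(consumer_sees(enforce_ordering=False))  # prints: old data
print(consumer_sees(enforce_ordering=True))   # prints: new data
```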


Figure 8-4: Producer/Consumer Sequence with Error
[Figure: the topology of Figure 8-1 annotated with the numbered steps of the
failing sequence and the note "Retries Slow Delivery of data" at the Switch
upstream port. The Flag changes from 0 to 1 while Status remains 0. Legend:
P = Posted Request, NP = Non-Posted Request, CPL = Completion.]

Relaxed Ordering
PCI Express supports the Relaxed Ordering (RO) mechanism added for PCI-X.
RO allows switches in the path between the Requester and Completer to
reorder some transactions when doing so would improve performance.


The ordering rules that support the Producer/Consumer model may result in
transactions being blocked in cases when they're unrelated to any Producer/
Consumer transaction sequence. To alleviate this problem, a transaction can
have its RO attribute bit set, indicating that software verifies it to be unrelated to
other transactions, and that allows it to be reordered ahead of other
transactions. For example, if a posted write is delayed because the target's buffer space
is unavailable, then all subsequent transactions must wait until that finally
resolves and the write is delivered. If a subsequent transaction was known by
software to be unrelated to previous ones and the RO bit was set to show that,
then it could be allowed to go before the write without risking a problem.

The RO bit (bit 5 of byte 2 of dword 0 in the TLP header as shown in Figure 8-5
on page 297) may be used by the device if its device driver has enabled it to do
so. Request packets are then allowed to use this attribute as directed by
software when it requests that a packet be sent. When switches or the Root
Complex see a packet with this attribute bit set, they have permission to reorder it
although it's not required that they should.

Figure 8-5: Relaxed Ordering Bit in a 32-bit Header

Fields are listed from bit 7 of byte +0 to bit 0 of byte +3 in each row:

Byte 0:  Fmt | Type | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:  Requester ID | Tag | Last DW BE | 1st DW BE
Byte 8:  Address [31:2] | R
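
A short sketch shows how the RO bit can be tested or set in the first dword of
a header, using the position given in the text (bit 5 of byte 2). The helper
names are hypothetical, and the sample header bytes are made up for
illustration rather than taken from a captured packet.

```python
RO_BYTE = 2      # byte 2 of dword 0
RO_MASK = 0x20   # bit 5 of that byte


def relaxed_ordering_set(header):
    """Return True if the RO attribute bit is set in a TLP header."""
    return bool(header[RO_BYTE] & RO_MASK)


def with_relaxed_ordering(header):
    """Return a copy of the header with the RO attribute bit set."""
    updated = bytearray(header)
    updated[RO_BYTE] |= RO_MASK
    return bytes(updated)


dw0 = bytes([0x40, 0x00, 0x00, 0x01])   # illustrative first dword, RO clear
print(relaxed_ordering_set(dw0), relaxed_ordering_set(with_relaxed_ordering(dw0)))
# prints: False True
```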

RO Effects on Memory Writes and Messages


Switches and Root Complexes must observe the setting of the RO bit in
transactions. Memory writes and Messages are both posted writes, both are received
into the same Posted buffer, and both are subject to the same ordering
requirements. When the RO bit is set, switches handle these transactions as follows:

• Switches are permitted to reorder memory write transactions just posted
  ahead of previously posted memory write transactions or message
  transactions. Similarly, message transactions just posted may be ordered ahead of
  previously posted memory write or message transactions. Switches must
  also forward the RO bit unmodified. The RO bit is ignored by PCI-X
  bridges, which always forward writes in order (there would be little
  purpose in allowing them to go out of order anyway; if one is blocked for some
  reason, the next will be blocked, too). Another difference is that message
  transactions had not been defined for PCI-X, either.
• The Root Complex is permitted to reorder posted write transactions (here it
  makes sense because the Root could write to different areas of memory so,
  if one area is busy, it can write to a different one). Also, when receiving
  writes with RO set, the Root is permitted to write each byte to memory in
  any address order.

RO Effects on Memory Read Transactions


All read transactions in PCI Express are handled as split transactions. When a
device issues a memory read request with the RO bit set, the Completer returns
the requested read data in a series of one or more split completion transactions,
and uses the same RO setting as in the request. Switch behavior in this case is as
follows:

1. A switch that receives a memory read with RO forwards the request in the
   order received, and must not reorder it ahead of memory write transactions
   that were previously posted. That guarantees that all write transactions
   moving in the direction of the read request are pushed ahead of the read.
   This is part of the Producer/Consumer example shown earlier, and software
   may depend on this flushing action for proper operation. The RO bit must
   not be modified by the switch.
2. When the Completer receives the memory read, it fetches the requested
   data and delivers one or more Completions that also have the RO bit set (its
   value is copied from the original request).
3. A switch receiving the Completions is allowed to reorder them ahead of
   previously posted memory writes moving in the direction of the
   Completion. If the writes were blocked (for example, due to flow control), then the
   Completions will be allowed to go ahead of them. Relaxed ordering in this
   case improves read performance. Table 8-2 summarizes the relaxed
   ordering behavior allowed by switches.


Table 8-2: Transactions That Can Be Reordered Due to Relaxed Ordering

These Transactions with RO=1 Can Pass    These Transactions
-------------------------------------    ----------------------
Memory Write Request                     Memory Write Request
Message Request                          Memory Write Request
Memory Write Request                     Message Request
Message Request                          Message Request
Read Completion                          Memory Write Request
Read Completion                          Message Request

Weak Ordering
Temporary transaction blocking can occur when strong ordering rules are
rigorously enforced. Modifications that don't violate the Producer/Consumer
programming model can eliminate some blocking conditions and improve link
efficiency. Implementing the Weakly Ordered model can alleviate this problem.

Transaction Ordering and Flow Control


The motivation behind splitting the VC buffers of a given VC number into
flow-controlled sub-buffers P, NP and CPL is that it simplifies processing of the
transaction ordering rules once TLPs have been parsed or binned into their
respective buffers. The transaction ordering processing logic then applies
ordering rules between these three sub-buffers or to each sub-buffer.

Since TLPs are binned into their respective three sub-buffers in order to process
transaction ordering rules, it is necessary to define the flow control mechanism
between each virtual channel sub-buffer (P, NP, CPL) of neighboring ports at
opposite ends of the Link. In fact, you may recall that there is an independent
flow control mechanism between Header (Hdr) and Data (D) sub-buffers of
each sub-buffer category (P, NP, CPL) of each virtual channel number.


Transaction Stalls
Strong ordering can result in instances where all transactions are blocked due to
a single full receive buffer. For example, the ordering requirements for the
Producer/Consumer model cannot be changed, but ordering for transactions that
aren't part of that model can. To improve performance, let's consider a weakly
ordered scheme; one that puts the minimal requirements on transaction
ordering.

This example depicts transmit and receive buffers associated with the delivery
of transactions in a single direction for a single VC. Recall that each of the
transaction types (Posted, Non-Posted, and Completions) has independent flow
control within the same VC. The numbers in the transmit buffers show the
order in which these transactions were issued, and the non-posted receive
buffer is currently full. Consider the following sequence.

1. Transaction 1 (memory read) is the next transaction to send, but there aren't
   enough flow control credits so it must wait.
2. Transaction 2 (posted memory write) is the next subsequent transaction. If
   strong ordering is enforced, a memory write must not pass a previously
   queued read transaction.
3. This restriction applies to all subsequent transactions, too, with the result
   that they're all stalled until the first one finishes.

Figure 8-6: Strongly Ordered Example Results in Temporary Stall
[Figure: transmit buffers for one VC hold pending transactions numbered in the
order they were issued: Posted 2, 4, 7; Non-Posted 1, 5; Completions 3, 6, 8.
The Non-Posted receive buffer at the other end of the Link is full.]
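
The stall shown in the figure can be reproduced with a few lines of code. This
sketch uses the transaction numbers from the figure; the strictly in-order
"strong" policy is the hypothetical one described in the example, and the
function name and labels are assumptions made for illustration.

```python
# Pending transactions in issue order, as numbered in the figure.
PENDING = [(1, "NP"), (2, "P"), (3, "CPL"), (4, "P"),
           (5, "NP"), (6, "CPL"), (7, "P"), (8, "CPL")]


def sendable(pending, full_buffers, weak_ordering):
    """List the transactions that can be sent while some receive buffers are full."""
    sent = []
    for number, kind in pending:
        if kind in full_buffers:
            if not weak_ordering:
                break          # strong ordering: everything behind it stalls
            continue           # weak ordering: other buffer types keep moving
        sent.append(number)
    return sent


print(sendable(PENDING, {"NP"}, weak_ordering=False))  # prints: []
print(sendable(PENDING, {"NP"}, weak_ordering=True))   # prints: [2, 3, 4, 6, 7, 8]
```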


VC Buffers Offer an Advantage


Transaction ordering is managed within Virtual Channel buffers. These buffers
are grouped into Posted, Non-Posted, and Completion transactions, and flow
control is managed independently for each group. That makes weak ordering
more useful because, as in our example, even if one buffer was full, others could
still have space available.

ID Based Ordering (IDO)


Another opportunity for optimizing ordering and improving performance is
related to the nature of traffic streams. Packets from different requesters are
very unlikely to have dependencies; after all, one device could hardly know
when the other had finished certain steps based on ordering because they could
have different paths to their shared resource. Bearing this in mind, the 2.1
revision of the PCIe spec introduced what is called ID-based Ordering to improve
performance.

The Solution
If the packet source isn't taken into account for transaction ordering then
performance can suffer, as shown in Figure 8-7 on page 302. In the illustration,
transaction 1 makes its way to the upstream port of the switch but is blocked from
further progress by a buffer-full condition for that packet type in the Root port
(which would be indicated by insufficient Flow Control credits). To use the spec
terminology, packets from the same Requester are called a TLP stream. In this
example, the path shown for Transaction 1 might include several TLPs as part of
a TLP stream. Transaction 2 then arrives at the same egress port and is also
blocked from moving forward because it must stay in order with Transaction 1.
Since the packets came from different sources (different TLP streams), this
delay is almost certainly unnecessary; it's very unlikely they could have
dependencies between them, but the normal ordering model doesn't take this into
account. To get improved performance, we need another option.

The solution is simple: allow packets to be reordered if they don't use the same
Requester ID (or Completer ID, for Completion packets). This optional
capability allows software to enable a device to use IDO, and a switch port can
recognize that the packets are part of different TLP streams. This is done by setting
the enable bits in the Device Control 2 Register.
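
The decision a switch port makes can be sketched as a single comparison. This
is an illustration only: the dictionary keys and the function name are
hypothetical, and the "stream_id" field stands for the Requester ID of a Request
or the Completer ID of a Completion.

```python
def ido_may_pass(later, earlier):
    """Return True if `later` may be reordered ahead of `earlier` under IDO.

    Each TLP is a dict with an "ido" flag (the header attribute bit) and a
    "stream_id" (Requester ID, or Completer ID for a Completion).
    """
    if not later["ido"]:
        return False                      # normal ordering applies
    return later["stream_id"] != earlier["stream_id"]


blocked_write = {"ido": True, "stream_id": 0x0100}   # Transaction 1
other_stream = {"ido": True, "stream_id": 0x0200}    # Transaction 2
same_stream = {"ido": True, "stream_id": 0x0100}

print(ido_may_pass(other_stream, blocked_write), ido_may_pass(same_stream, blocked_write))
# prints: True False
```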


Figure 8-7: Different Sources are Unlikely to Have Dependencies
[Figure: a Posted Write is blocked at the switch because the write buffer in the
Root port is full]

When to use IDO


ThespechighlyrecommendsthatbothIDOandRObeusedwheneversafely
possible.Forexample,itshouldbesafeforEndpointstouseIDOforallTLPs
whencommunicatingdirectlywithonlyoneotherentity,suchastheRootCom
plex.Ontheotherhand,itwouldnotbesafetouseitiftheEndpointiscommu
nicating with multiple agents. An example failure case for this from the spec
beginswithonedevicedoingaDMAwritetomemoryandthendoingapeer
topeerwritetoaflaginanotherdevice.Whentheseconddevicereceivesthe
flag, it also initiates a DMA write to the same area of memory. Normally, the
twoDMAoperationswouldstayinorder,butwithIDOthatorderingcantbe
guaranteed because upstream devices will see them as coming from different
device IDs. Similarly, it would not be safe to use RO with packets that are
involvedincontroltraffic.

For Completers, if IDO is enabled it's recommended that it be used for all
Completions unless there is a specific reason not to do so.


Software Control
Software can enable the use of IDO for Requests or Completions from a given
port by setting the appropriate bits in its Device Control 2 Register. As with RO,
there are no capability bits to let software find out what the device supports,
just enable bits, so software would need to know by some other means that the
device was capable of doing this. These bits enable the use of IDO for that
packet type, but software must still decide whether each individual packet will
have its IDO bit set. A new attribute bit in the header indicates whether a TLP is
using IDO, as shown in Figure 8-8 on page 303. This brings up another related
point: Completions normally inherit all the attribute bits of the Request that
generated them, but this may not be true for IDO, since it can be enabled
independently by the Completer. In other words, Completions may use IDO
even if the Request that initiated them did not.
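As a rough sketch of what "setting the appropriate bits" looks like, the helper below ORs the two IDO enable bits into a 16-bit Device Control 2 value. The text does not give the bit positions; the values used here (bit 8 for IDO Request Enable, bit 9 for IDO Completion Enable) are an assumption and should be confirmed against the PCIe Base Specification. The function name is hypothetical, and the config-space access itself is left out.

```python
IDO_REQUEST_ENABLE = 1 << 8     # assumed bit position: IDO Request Enable
IDO_COMPLETION_ENABLE = 1 << 9  # assumed bit position: IDO Completion Enable


def enable_ido(dev_ctl2: int, requests: bool = True, completions: bool = True) -> int:
    """Return a new Device Control 2 value with the chosen IDO enables set.

    Only the bit manipulation is shown; reading and writing the register
    through configuration space is platform specific and not modeled here.
    """
    value = dev_ctl2 & 0xFFFF  # the register is 16 bits wide
    if requests:
        value |= IDO_REQUEST_ENABLE
    if completions:
        value |= IDO_COMPLETION_ENABLE
    return value
```

A read-modify-write like this leaves the other Device Control 2 fields unchanged, which matters because the same register also holds unrelated controls.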

Figure 8-8: IDO Attribute in 64-bit Header

Byte 0    Fmt (0x1) | Type | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4    Requester ID | Tag | Last DW BE | 1st DW BE
Byte 8    Address [63:32]
Byte 12   Address [31:2] | R

Deadlock Avoidance
Because the PCI bus employs delayed transactions, and because a PCI Express
Memory Read Request may be blocked due to a lack of flow control credits,
several deadlock scenarios can develop. Deadlock avoidance rules are included
in PCI Express ordering to ensure that no deadlocks occur regardless of
topology. Adhering to the ordering rules prevents problems when boundary
conditions develop due to unanticipated topologies (e.g., two PCI Express-to-PCI
bridges connected across the PCI Express fabric). Refer to the MindShare book
entitled PCI System Architecture, Fourth Edition (published by Addison-Wesley)
for a detailed explanation of the scenarios that are the basis for the PCI Express
ordering rules related to deadlock avoidance. Table 8-1 on page 289 lists the
deadlock avoidance ordering rules, which are identified as entries A3, A4, D3,
D4 and A5b. Note that avoiding the deadlocks involves "Yes" entries in each of
these 5 cases. If blocking occurs due to lack of flow control credits associated
with the Non-Posted Request buffer identified in column 3 or 4, the Posted
Requests associated with row A or the Completions associated with row D
must be moved ahead of the Non-Posted Requests specified in the column 3 or 4
where the "Yes" entry exists. Note also that the "Yes" entry in A5b applies only
to PCI Express-to-PCI or PCI-X Bridges.

Essentially, this deadlock avoidance rule can be summarized as follows:
later-arriving Memory Write Requests or Completions must be allowed to pass
earlier blocked Non-Posted Requests; otherwise a deadlock could result.
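The summarized rule can be illustrated with a small egress-selection sketch. It assumes a single arrival-ordered list of pending TLP kinds ("P", "NP", "CPL") and a per-kind credit flag; both, along with the function name, are hypothetical. Only the case from the text is modeled: a blocked Non-Posted Request may be passed. For a blocked Posted Request or Completion the sketch conservatively lets nothing pass, which is stricter than the full ordering table.

```python
from typing import Dict, List, Optional


def next_to_send(pending: List[str], has_credits: Dict[str, bool]) -> Optional[int]:
    """Return the index of the TLP to transmit next, or None if nothing can go.

    pending     -- TLP kinds in arrival order: "P", "NP", or "CPL"
    has_credits -- whether the link partner has advertised credits for each kind
    """
    for index, kind in enumerate(pending):
        if has_credits[kind]:
            return index
        if kind != "NP":
            # Conservative simplification: nothing passes a blocked P or CPL here.
            return None
        # A blocked Non-Posted Request: keep scanning so that a later Posted
        # Request or Completion can pass it (the deadlock avoidance rule).
    return None
```

For example, with Non-Posted credits exhausted, a queue of `["NP", "P"]` yields index 1: the Memory Write goes out ahead of the stalled read.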


Part Three: Data Link Layer

9 DLLP Elements
The Previous Chapter
The previous chapter discussed the ordering requirements for transactions in a
PCI Express topology. These rules are inherited from PCI, and the
Producer/Consumer programming model motivated many of them, so its mechanism
was described there. The original rules also took into consideration possible
deadlock conditions that must be avoided, but did not include any means to avoid
the performance problems that could result.

This Chapter
In this chapter we describe the other major category of packets, Data Link Layer
Packets (DLLPs). We describe the use, format, and definition of the DLLP
types and the details of their related fields. DLLPs are used to support the
Ack/Nak protocol, power management, and the flow control mechanism, and can
even be used for vendor-defined purposes.

The Next Chapter


The following chapter describes a key feature of the Data Link Layer: an
automatic, hardware-based mechanism for ensuring reliable transport of TLPs
across the Link. Ack DLLPs confirm good reception of TLPs while Nak DLLPs
indicate a transmission error. We describe the normal rules of operation when
no TLP or DLLP error is detected as well as error recovery mechanisms
associated with both TLP and DLLP errors.

General
The Data Link Layer can be thought of as managing the lower-level Link
protocol. Its primary responsibility is to assure the integrity of TLPs moving
between devices, but it also plays a part in TLP flow control, Link initialization
and power management, and conveys information between the Transaction Layer
above it and the Physical Layer below it.


In performing these jobs, the Data Link Layer exchanges packets known as Data
Link Layer Packets (DLLPs) with its neighbor. DLLPs are communicated
between the Data Link Layers of each device. Figure 9-1 on page 308 illustrates
a DLLP exchanged between devices.

Figure 9-1: Data Link Layer Sends A DLLP

[Diagram: PCIe Device A and PCIe Device B, each with a Device Core, PCIe Core,
Hardware/Software Interface, Transaction Layer, Data Link Layer, and Physical
Layer (RX/TX). A DLLP with its CRC, framed by SDP and END, crosses the Link
between the two Data Link Layers.]

DLLPs Are Local Traffic


DLLPs have a simple packet format and are a fixed size, 8 bytes total, including
the framing bytes. Unlike TLPs, they carry no target or routing information
because they are only used for nearest-neighbor communications and don't get
routed at all. They're also not seen by the Transaction Layer since they're not
part of the information exchanged at that level.


Receiver handling of DLLPs


When DLLPs are received, several rules apply:

1. They're immediately processed at the Receiver. In other words, their flow
   cannot be controlled the way it is for TLPs (DLLPs are not subject to flow
   control).
2. They're checked for errors; first at the Physical Layer, and then at the Data
   Link Layer. The 16-bit CRC included with the packet is checked by
   calculating what the CRC should be and comparing it to the received value.
   DLLPs that fail this check are discarded. How will the Link recover from this
   error? DLLPs still arrive periodically, and the next one of that type that
   succeeds will update the missing information.
3. Unlike TLPs, there's no acknowledgement protocol for DLLPs. Instead, the
   spec defines time-out mechanisms to facilitate recovery from failed DLLPs.
4. If there are no errors, the DLLP type is determined and passed to the
   appropriate internal logic to manage:
   - Ack/Nak notification of TLP status
   - Flow Control notification of buffer space available
   - Power Management settings
   - Vendor-specific information

Sending DLLPs

General
These packets originate at the Data Link Layer and are passed to the Physical
Layer. If 8b/10b encoding is in use (Gen1 and Gen2 mode), framing symbols will
be added to both ends of the DLLP at this level before the packet is sent. In Gen3
mode, an SDP token of two bytes is added to the front end of the DLLP, but no
END is added to the end of the DLLP. Figure 9-2 on page 310 shows a generic
(Gen1/Gen2) DLLP in transit, showing the framing symbols and the general
contents of the packet.


Figure 9-2: Generic Data Link Layer Packet Format

[Diagram: Device A and Device B layer stacks as in Figure 9-1, with a DLLP and
its CRC framed by SDP and END in transit on the Link.]

Byte 0    DLLP Type | (Fields Vary With DLLP Type)
Byte 4    16-bit CRC

DLLP Packet Size is Fixed at 8 Bytes


Data Link Layer Packets are always 8 bytes long for both 8b/10b and 128b/130b
and consist of the following components:

1. A 1 DW core (4 bytes) containing the one-byte DLLP Type field and three
   additional bytes of attributes. The attributes vary with the DLLP type.
2. A 2-byte CRC value that is calculated based on the core contents of the
   DLLP. It is important to point out that this CRC is different from the LCRCs
   added to TLPs. This CRC is only 16 bits in size and is calculated differently
   than the 32-bit LCRCs in TLPs. This CRC is appended to the core DLLP and
   then these 6 bytes are passed to the Physical Layer.
3. If 8b/10b encoding is in use, a Start of DLLP (SDP) control symbol and an
   End Good (END) control symbol are added to the beginning and end of the
   packet. As usual, the Physical Layer encodes the bytes into 10-bit symbols
   before transmission.
4. In Gen3 mode, when 128b/130b encoding is in use, a 2-byte SDP Token is
   added to the front of the packet to create the 8-byte packet and there is no
   END symbol or token.

Note that there is never a data payload with a DLLP; all the information is
carried in the core four bytes of the packet.

DLLP Packet Types


Four groups of DLLPs are defined: three deal with Ack/Nak, Power
Management, and Flow Control, and the fourth is a Vendor Specific version. Some
of these have several variants, and Table 9-1 on page 311 summarizes each
variant as well as its DLLP Type field encoding.

Table 9-1: DLLP Types

DLLP Type                         Type Field Encoding   Purpose
Ack (TLP Acknowledge)             00000000b             TLP transmission integrity
Nak (TLP Negative Acknowledge)    00010000b             TLP transmission integrity
PM_Enter_L1                       00100000b             Power Management
PM_Enter_L23                      00100001b             Power Management
PM_Active_State_Request_L1        00100011b             Power Management
PM_Request_Ack                    00100100b             Power Management
Vendor Specific                   00110000b             Vendor Defined
InitFC1-P                         01000xxxb             TLP Flow Control
InitFC1-NP                        01010xxxb             TLP Flow Control
InitFC1-Cpl                       01100xxxb             TLP Flow Control
InitFC2-P                         11000xxxb             TLP Flow Control
InitFC2-NP                        11010xxxb             TLP Flow Control
InitFC2-Cpl                       11100xxxb             TLP Flow Control
UpdateFC-P                        10000xxxb             TLP Flow Control
UpdateFC-NP                       10010xxxb             TLP Flow Control
UpdateFC-Cpl                      10100xxxb             TLP Flow Control
Reserved                          Others                Reserved

(xxx = VC number)
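The encodings in the table map directly onto a small decoder. The sketch below is illustrative only; the function and dictionary names are hypothetical, and the values are taken from the table above.

```python
from typing import Optional, Tuple

# Encodings with a fully fixed Type byte.
FIXED_TYPES = {
    0x00: "Ack",
    0x10: "Nak",
    0x20: "PM_Enter_L1",
    0x21: "PM_Enter_L23",
    0x23: "PM_Active_State_Request_L1",
    0x24: "PM_Request_Ack",
    0x30: "Vendor Specific",
}

# Flow Control encodings: upper nibble selects the type, bit 3 is 0,
# and bits [2:0] carry the VC number.
FC_TYPES = {
    0x4: "InitFC1-P", 0x5: "InitFC1-NP", 0x6: "InitFC1-Cpl",
    0xC: "InitFC2-P", 0xD: "InitFC2-NP", 0xE: "InitFC2-Cpl",
    0x8: "UpdateFC-P", 0x9: "UpdateFC-NP", 0xA: "UpdateFC-Cpl",
}


def decode_dllp_type(type_byte: int) -> Tuple[str, Optional[int]]:
    """Return (name, vc) for a DLLP Type byte; vc is None for non-FC types."""
    if type_byte in FIXED_TYPES:
        return FIXED_TYPES[type_byte], None
    upper = type_byte >> 4
    if upper in FC_TYPES and (type_byte & 0x08) == 0:
        return FC_TYPES[upper], type_byte & 0x07
    return "Reserved", None
```

For instance, `decode_dllp_type(0x43)` identifies an InitFC1-P for VC 3, while any value not listed in the table falls through to "Reserved".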

Ack/Nak DLLP Format


The format of the DLLP used by a device to Ack (acknowledge) or Nak
(negatively acknowledge) the receipt of a TLP is illustrated in Figure 9-3, and its
fields are described in "Ack/Nak DLLP Fields" on page 313. For more
discussion on how these are used to handle the Ack/Nak protocol, refer to
Chapter 10, entitled "Ack/Nak Protocol," on page 317.

Figure 9-3: Ack Or Nak DLLP Format

Byte 0    DLLP Type (0000 0000b = Ack, 0001 0000b = Nak) | Reserved | AckNak_Seq_Num
Byte 4    16-bit CRC


Table 9-2: Ack/Nak DLLP Fields

Field Name        Header Byte/Bit   DLLP Function
DLLP Type         Byte 0, [7:0]     Indicates the type of DLLP:
                                      0000 0000b = Ack
                                      0001 0000b = Nak
AckNak_Seq_Num    Byte 2, [3:0]     If a good TLP was received:
                  Byte 3, [7:0]     - If incoming Sequence Number =
                                      NEXT_RCV_SEQ (matched what was
                                      expected), schedule Ack DLLP with that
                                      number.
                                    - If incoming Sequence Number was earlier
                                      than NEXT_RCV_SEQ count (a duplicate
                                      TLP was received), schedule Ack DLLP
                                      with NEXT_RCV_SEQ - 1 (effectively, this
                                      is the number of the last good TLP).
                                    For a TLP received with a problem:
                                    - If the TLP had an error, or its Sequence
                                      Number was higher than NEXT_RCV_SEQ,
                                      schedule a Nak DLLP with
                                      NEXT_RCV_SEQ - 1.
16-bit CRC        Byte 4, [7:0]     This 16-bit CRC protects the contents of
                  Byte 5, [7:0]     this DLLP. Calculation is based on Bytes
                                    0-3 of the Ack/Nak.
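The field positions in the table can be exercised with a short pack/unpack sketch for the 4-byte DLLP core. The function names are hypothetical, the reserved bits are written as zero, and the 16-bit CRC is deliberately left out because its calculation is not described here.

```python
from typing import Tuple

ACK_TYPE = 0x00  # 0000 0000b
NAK_TYPE = 0x10  # 0001 0000b


def pack_acknak_core(is_nak: bool, seq_num: int) -> bytes:
    """Build bytes 0-3 of an Ack or Nak DLLP (CRC not included)."""
    if not 0 <= seq_num <= 0xFFF:
        raise ValueError("AckNak_Seq_Num is a 12-bit field")
    return bytes([
        NAK_TYPE if is_nak else ACK_TYPE,  # Byte 0: DLLP Type
        0x00,                              # Byte 1: reserved
        (seq_num >> 8) & 0x0F,             # Byte 2 [3:0]: Seq_Num[11:8]
        seq_num & 0xFF,                    # Byte 3 [7:0]: Seq_Num[7:0]
    ])


def unpack_acknak_core(core: bytes) -> Tuple[str, int]:
    """Return ("Ack" or "Nak", sequence number) from bytes 0-3."""
    kind = {ACK_TYPE: "Ack", NAK_TYPE: "Nak"}[core[0]]
    seq_num = ((core[2] & 0x0F) << 8) | core[3]
    return kind, seq_num
```

Packing an Ack for Sequence Number 0xABC gives the bytes 00 00 0A BC, with the upper four bits of the number landing in the low nibble of Byte 2.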

Power Management DLLP Format


Power management DLLP information is shown in Figure 9-4, and its fields are
described in Table 9-3 on page 314. To learn more about the use of these packets
in power management, refer to Chapter 16, entitled "Power Management," on
page 703.


Figure 9-4: Power Management DLLP Format

Byte 0    DLLP Type (00100xxx) | Reserved
Byte 4    16-bit CRC

Table 9-3: Power Management DLLP Fields

Field Name    Header Byte/Bit   DLLP Function
DLLP Type     Byte 0, [7:0]     Indicates DLLP type. For Power Management DLLPs:
                                  00100000b = PM_Enter_L1
                                  00100001b = PM_Enter_L23
                                  00100011b = PM_Active_State_Request_L1
                                  00100100b = PM_Request_Ack
16-bit CRC    Byte 4, [7:0]     A 16-bit CRC used to protect DLLP contents.
              Byte 5, [7:0]     Calculation is based on Bytes 0-3, regardless
                                of whether fields are used.

Flow Control DLLP Format


Like many other serial transport buses, PCIe improves transport efficiency by
using a credit-based flow control scheme. This topic is covered in detail in
Chapter 6, entitled "Flow Control," on page 215. DLLPs are used to
communicate flow control credit information. A variety of different DLLPs
initialize flow control credits. Another category, the update DLLPs, manages
credits at runtime as receiver buffer space is recovered. There are two
Flow Control Initialization DLLPs called InitFC1 and InitFC2, and one Flow
Control Update DLLP called UpdateFC.

The packet format for all three variants is illustrated in Figure 9-5 on page 315,
while Table 9-4 on page 315 describes the fields contained in it.


Figure 9-5: Flow Control DLLP Format

Byte 0    Type (xxxx) | 0 | VC ID | R | HdrFC | R | DataFC
Byte 4    16-bit CRC

Table 9-4: Flow Control DLLP Fields

Field Name    Header Byte/Bit   DLLP Function
DLLP Type     Byte 0, [7:4]     This code indicates the type of FC DLLP:
                                  0100b = InitFC1-P (Posted Requests)
                                  0101b = InitFC1-NP (Non-Posted Requests)
                                  0110b = InitFC1-Cpl (Completions)
                                  1100b = InitFC2-P (Posted Requests)
                                  1101b = InitFC2-NP (Non-Posted Requests)
                                  1110b = InitFC2-Cpl (Completions)
                                  1000b = UpdateFC-P (Posted Requests)
                                  1001b = UpdateFC-NP (Non-Posted Requests)
                                  1010b = UpdateFC-Cpl (Completions)
              Byte 0, [3]       Must be 0b as part of flow control encoding.
              Byte 0, [2:0]     VC ID. Indicates the Virtual Channel (VC 0-7)
                                to be updated with these credits.
HdrFC         Byte 1, [5:0]     Contains the credit count for header storage
              Byte 2, [7:6]     for the specified Virtual Channel. Each credit
                                represents space for 1 header + the optional
                                TLP Digest (ECRC).
DataFC        Byte 2, [3:0]     Contains the credit count for data storage for
              Byte 3, [7:0]     the specified Virtual Channel. Each credit
                                represents 16 bytes.
16-bit CRC    Byte 4, [7:0]     A 16-bit CRC that protects the contents of this
              Byte 5, [7:0]     DLLP. Calculation is based on Bytes 0-3,
                                regardless of whether all fields are used.
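A minimal sketch of how the Flow Control fields pack into the 4-byte DLLP core, following the byte and bit positions in the table. The function names are hypothetical, reserved bits are written as zero, and the CRC is not computed.

```python
from typing import Tuple

DATA_CREDIT_BYTES = 16  # each DataFC credit represents 16 bytes


def pack_fc_core(type_code: int, vc: int, hdr_fc: int, data_fc: int) -> bytes:
    """Build bytes 0-3 of a Flow Control DLLP (CRC not included).

    type_code -- 4-bit code from Byte 0 [7:4], e.g. 0b1000 for UpdateFC-P
    vc        -- Virtual Channel number, 0-7
    hdr_fc    -- 8-bit header credit count
    data_fc   -- 12-bit data credit count
    """
    if not (0 <= type_code <= 0xF and 0 <= vc <= 7):
        raise ValueError("type_code is 4 bits and vc is 3 bits")
    if not (0 <= hdr_fc <= 0xFF and 0 <= data_fc <= 0xFFF):
        raise ValueError("HdrFC is 8 bits and DataFC is 12 bits")
    return bytes([
        (type_code << 4) | vc,                     # Byte 0: type, 0, VC ID
        hdr_fc >> 2,                               # Byte 1 [5:0]: HdrFC[7:2]
        ((hdr_fc & 0x03) << 6) | (data_fc >> 8),   # Byte 2: HdrFC[1:0], DataFC[11:8]
        data_fc & 0xFF,                            # Byte 3: DataFC[7:0]
    ])


def unpack_fc_core(core: bytes) -> Tuple[int, int, int, int]:
    """Return (type_code, vc, hdr_fc, data_fc) from bytes 0-3."""
    type_code = core[0] >> 4
    vc = core[0] & 0x07
    hdr_fc = ((core[1] & 0x3F) << 2) | (core[2] >> 6)
    data_fc = ((core[2] & 0x0F) << 8) | core[3]
    return type_code, vc, hdr_fc, data_fc
```

As a usage example, a DataFC value of 64 advertises 64 x 16 = 1024 bytes of data buffer space for the named Virtual Channel.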

Vendor-Specific DLLP Format


The last defined DLLP type is used for vendor-specific purposes. Therefore only
the DLLP Type field is defined by the spec (00110000b), leaving the remaining
contents available for vendor-defined use.

Figure 9-6: Vendor-Specific DLLP Format

Byte 0    DLLP Type (00110000) | Vendor-Defined
Byte 4    16-bit CRC


10 Ack/Nak Protocol
The Previous Chapter
The previous chapter described Data Link Layer Packets (DLLPs): the use,
format, and definition of the DLLP types and the details of their related
fields. DLLPs are used to support the Ack/Nak protocol, power management,
and the flow control mechanism, and can be used for vendor-defined purposes.

This Chapter
This chapter describes a key feature of the Data Link Layer: an automatic,
hardware-based mechanism for ensuring reliable transport of TLPs across the Link.
Ack DLLPs confirm successful reception of TLPs while Nak DLLPs indicate a
transmission error. We describe the normal rules of operation when no TLP or
DLLP error is detected as well as error recovery mechanisms associated with
both TLP and DLLP errors.

The Next Chapter


The next chapter describes the Logical sub-block of the Physical Layer, which
prepares packets for serial transmission and reception. Several steps are needed
to accomplish this and they are described in detail. That chapter covers the logic
associated with the first two spec versions, Gen1 and Gen2, which use 8b/10b
encoding. The logic for Gen3 does not use 8b/10b encoding and is described
separately in the chapter called "Physical Layer Logical (Gen3)" on page 407.

Goal: Reliable TLP Transport


The function of the Data Link Layer (shown in Figure 10-1 on page 318) is to
ensure reliable delivery of TLPs. The spec requires a BER (Bit Error Rate) of no
worse than 10^-12, but errors will still happen often enough to cause trouble, and
a single bit error will corrupt an entire packet. This problem will only become
more pronounced as Link rates continue to increase with new generations.
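To put the BER figure in perspective, the short calculation below estimates the average time between bit errors on one lane running at exactly the worst-case rate of one error per 10^12 bits. It is back-of-the-envelope arithmetic under stated assumptions: the raw transfer rates for the three generations are used as the bit rate, and the function name is hypothetical.

```python
def mean_seconds_between_bit_errors(bits_per_second: float, ber: float = 1e-12) -> float:
    """Average time between bit errors for a lane running at the given BER."""
    return 1.0 / (bits_per_second * ber)


# Assumed raw per-lane transfer rates for each generation.
RATES = {
    "Gen1 (2.5 GT/s)": 2.5e9,
    "Gen2 (5.0 GT/s)": 5.0e9,
    "Gen3 (8.0 GT/s)": 8.0e9,
}

for name, rate in RATES.items():
    seconds = mean_seconds_between_bit_errors(rate)
    print(f"{name}: about {seconds:.0f} s between bit errors per lane")
```

At the worst-case BER this works out to roughly 400, 200, and 125 seconds per lane, so a wide, fast Link could see an error every few seconds, which is why an automatic replay mechanism is needed.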


Figure 10-1: Data Link Layer

[Diagram: transmit and receive paths through the Software, Transaction, Data
Link, and Physical layers of a Port. The Transaction Layer handles TLPs (Header,
Data Payload, ECRC) with Flow Control, Virtual Channel Management, Ordering,
and per-VC transmit and receive buffers. The Data Link Layer forms the Link
Packet (Sequence, TLP, LCRC), keeps a copy in the TLP Replay Buffer, checks
received TLPs for errors, and exchanges DLLPs such as Ack/Nak with CRC. The
Physical Layer adds Start and End framing, encodes or decodes, converts
between parallel and serial, and handles Link Training.]

To facilitate this goal, an error detection code called an LCRC (Link Cyclic
Redundancy Code) is added to each TLP. The first step in error checking is
simply to verify that this code still evaluates correctly at the receiver. If each
packet is given a unique incremental Sequence Number as well, then it will be easy
to sort out which packet, out of several that have been sent, encountered an error.
Using that Sequence Number, we can also require that TLPs must be
successfully received in the same order they were sent. This simple rule makes it
easy to detect missing TLPs at the Receiver's Data Link Layer.

The basic blocks in the Data Link Layer associated with the Ack/Nak protocol
are shown in greater detail in Figure 10-2 on page 319. Every TLP sent across
the Link is checked at the receiver by evaluating the LCRC (first) and Sequence
Number (second) in the packet. The receiving device notifies the transmitting
device that a good TLP has been received by returning an Ack. Reception of an
Ack at the transmitter means that the receiver has received at least one TLP
successfully. On the other hand, reception of a Nak by the transmitter indicates
that the receiver has received at least one TLP in error. In that case, the
transmitter will resend the appropriate TLP(s) in hopes of a better result this
time. This is sensible, because things that would cause a transmission error
would likely be transient events and a replay will have a very good chance of
solving the problem.

Figure 10-2: Overview of the Ack/Nak Protocol

[Diagram: Transmit Device A and Receiver Device B. In Device A, TLPs from the
Transaction Layer receive a Sequence Number and LCRC, are copied into the
Replay Buffer, and are sent across the Link. In Device B, the Error Check block
examines each TLP (Sequence, TLP, LCRC) before passing it to the Transaction
Layer, and an Ack/Nak DLLP is returned across the Link to Device A.]

Since both the sending and receiving devices in the protocol have both a
transmit and a receive side, this chapter will use the terms:
- Transmitter to mean the device that sends TLPs
- Receiver to mean the device that receives TLPs


Elements of the Ack/Nak Protocol


The major Ack/Nak protocol elements of the Data Link Layer are shown in
Figure 10-3 on page 320. There's too much to consider all at once, though, so let's
begin by focusing on just the transmitter elements, which are shown in a larger
view in Figure 10-4 on page 322.

Figure 10-3: Elements of the Ack/Nak Protocol

[Diagram: transmitter-side elements (NEXT_TRANSMIT_SEQ, LCRC Generator,
Retry (Replay) Buffer, REPLAY_TIMER, REPLAY_NUM, AckD_SEQ, and the Ack/Nak
CRC check) and receiver-side elements (LCRC check, NEXT_RCV_SEQ, Sequence
Number comparison, NAK_SCHEDULED flag, AckNak Latency Timer, and Ack/Nak
Generator), connected by TLPs and Ack/Nak DLLPs across the Link. The
transmitter and receiver halves are shown separately in Figures 10-4 and 10-5.]

Transmitter Elements
As TLPs arrive from the Transaction Layer, several things are done to prepare
them for robust error detection at the receiver. As shown in the diagram, TLPs
are first assigned the next sequential Sequence Number, obtained from the
12-bit NEXT_TRANSMIT_SEQ counter.


NEXT_TRANSMIT_SEQ Counter
This counter generates the Sequence Number that will be assigned to the next
incoming TLP. It's a 12-bit counter that is initialized to 0 at reset or when the
Link Layer reports DL_Down (Link Layer is inactive). Since it increments
continuously with each TLP and only counts forward, the counter eventually
reaches its max value of 4095 and rolls over to 0 as it continues to count.

This Sequence Number assigned to the TLP will be used in the Ack or Nak sent
by the receiver to reference this TLP in the Replay Buffer. One might think that
such a large counter means that a large number of unacknowledged TLPs could
be in flight, but in practice this is very unlikely. The main reason is that the
receiver has a requirement to send an Ack back for successfully received TLPs
within a certain amount of time. That amount of time is discussed in detail in
"AckNak_LATENCY_TIMER" on page 328, but is typically only long enough to
transmit a few max-sized packets.

LCRC Generator
This block generates a 32-bit CRC (Cyclic Redundancy Check) code based on
the header and data to be sent and adds it to the end of the outgoing packet to
facilitate error detection. The name is derived from the fact that this check code
(calculated from the packet to be sent) is redundant (adds no information), and is
derived from cyclic codes. Although a CRC doesn't supply enough information
for the Receiver to do automatic error correction the way ECC (Error Correcting
Code) methods can, it does provide robust error detection. CRCs are commonly
used in serial transports because they're easy to implement in hardware, and
because they're good at detecting burst errors: a string of incorrect bits. Since
this is more likely to happen in a serial design than a parallel model, it helps
explain why a CRC is a good choice for error detection in serial transports. The
CRC code is calculated using all fields of the TLP, including the Sequence
Number. The receiver will make the same calculation and compare its result to the
LCRC field in the TLP. If they don't match, an error is detected in the Receiver's
Link Layer.
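The append-then-recompute pattern can be demonstrated with Python's standard `zlib.crc32`. This is only a stand-in for illustration: the actual LCRC seed, bit ordering, and field coverage are defined by the PCIe spec and are not reproduced here, and the helper names and sample bytes are hypothetical.

```python
import zlib


def append_crc(packet: bytes) -> bytes:
    """Transmitter side: compute a 32-bit CRC over the packet and append it."""
    return packet + zlib.crc32(packet).to_bytes(4, "little")


def crc_ok(frame: bytes) -> bool:
    """Receiver side: recompute the CRC and compare it with the received one."""
    body, received = frame[:-4], frame[-4:]
    return zlib.crc32(body).to_bytes(4, "little") == received


# Two bytes standing in for the Sequence Number, followed by header and data.
frame = append_crc(bytes([0x00, 0x2A]) + b"header+payload")

corrupted = bytearray(frame)
corrupted[3] ^= 0xFF  # a burst error: eight consecutive bits flipped in transit
```

The intact frame passes the check and the corrupted one fails it. A 32-bit CRC catches any single burst no longer than 32 bits, which is the property the text refers to.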

Replay Buffer
The replay buffer, or retry buffer, stores TLPs, including the Sequence Number
and LCRC, in the order of their transmission. When the transmitter receives an
Ack indicating that TLPs have reached the receiver successfully, it purges from
the Replay Buffer those TLPs whose Sequence Number is equal to or earlier
than the number in the Ack. In this way, the design allows one Ack to represent
several successful TLPs, reducing the number of Acks that must be sent. Since
the packets must always be seen in order, then if an Ack is received with a
Sequence Number of 7, then not only was TLP 7 received successfully, but all
the packets before it must also have been received successfully, so there is no
reason to keep a copy of them in the replay buffer.

If a Nak is received, the Sequence Number in the Nak still indicates the last good
packet received. So even receiving a Nak can cause the transmitter to purge
TLPs from the replay buffer. However, because it is a Nak, it means that
something was not received successfully at the receiver, so after purging all the
acknowledged TLPs, the transmitter must replay everything still in the replay
buffer in order. For example, if a Nak is received with a Sequence Number of 9,
then packet 9 and all prior packets are purged from the replay buffer, because
the receiver acknowledged that they have been successfully received. However,
because it is a Nak, the transmitter must then replay all the remaining TLPs in
the replay buffer in order, starting with packet 10.
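The purge and replay behavior just described can be modeled in a few lines. This is a software sketch of what is hardware logic in a real device; the class and method names are hypothetical, and "equal to or earlier" is evaluated modulo 4096 so the comparison still works after the 12-bit counter wraps.

```python
from collections import deque
from typing import List

SEQ_MOD = 4096  # Sequence Numbers are 12 bits wide


class ReplayBuffer:
    def __init__(self) -> None:
        self._tlps = deque()  # (seq_num, tlp) pairs in transmission order

    def store(self, seq_num: int, tlp: bytes) -> None:
        self._tlps.append((seq_num, tlp))

    def pending(self) -> List[int]:
        """Sequence Numbers still awaiting acknowledgement, oldest first."""
        return [seq for seq, _ in self._tlps]

    def _purge_through(self, acked_seq: int) -> None:
        # Entries are in order, so drop from the front while the entry is
        # equal to or earlier than the acknowledged number (modulo 4096).
        while self._tlps and (acked_seq - self._tlps[0][0]) % SEQ_MOD < SEQ_MOD // 2:
            self._tlps.popleft()

    def on_ack(self, seq_num: int) -> None:
        self._purge_through(seq_num)

    def on_nak(self, seq_num: int) -> List[int]:
        """Purge acknowledged TLPs, then return what must be replayed, in order."""
        self._purge_through(seq_num)
        return self.pending()
```

With TLPs 5 through 12 stored, an Ack for 7 leaves 8 through 12 pending, and a following Nak for 9 purges 8 and 9 and returns 10, 11, 12 for replay, matching the examples in the text.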

Figure 10-4: Transmitter Elements Associated with the Ack/Nak Protocol

[Diagram: TLPs from the Transaction Layer (TX) are blocked, and a DLL protocol
error reported, if (NTS - AS) reaches 2048; TLPs are also blocked during a
replay. Otherwise each TLP is assigned a Sequence Number from
NEXT_TRANSMIT_SEQ (NTS), passes through the LCRC Generator, and is copied
into the Retry (Replay) Buffer before going to the Link. An incoming Ack/Nak
that fails its CRC check is discarded. One that passes is compared with
AckD_SEQ (AS): on forward progress, AS is updated, older TLPs are purged, and
REPLAY_TIMER and REPLAY_NUM are both reset. A Nak triggers a replay, which
increments REPLAY_NUM.]


REPLAY_TIMER Count
This timer is effectively a watchdog timer. It makes sure that the transmitter is
receiving Ack/Nak packets for TLPs that have been transmitted. If this timer
expires, it means that the transmitter has sent one or more TLPs that it has not
received an acknowledgement for in the expected time frame. The fix is to
retransmit everything in the replay buffer and restart the REPLAY_TIMER.

This timer is running anytime a TLP has been transmitted but not yet
acknowledged. If the REPLAY_TIMER is not currently running, it is started when the
last Symbol of any TLP is transmitted. If the timer is already running, then
sending additional TLPs does not reset the timer value. When an Ack or Nak is
received that acknowledges TLPs in the replay buffer, the timer resets back to 0,
and if there are still TLPs in the replay buffer (TLPs that have been transmitted,
but not yet acknowledged), it immediately starts counting again. However, if an
Ack is received that acknowledges the last TLP in the replay buffer, meaning
the replay buffer is now empty, the REPLAY_TIMER resets to 0 but does not
count. It will not begin counting again until the last Symbol of the next TLP is
transmitted.

REPLAY_NUM Count
This 2-bit counter tracks the number of replay attempts after reception of a Nak
or a REPLAY_TIMER time-out. When the REPLAY_NUM count rolls over from
11b to 00b (indicating 4 failed attempts to deliver the same set of TLPs), the Data
Link Layer automatically forces the Physical Layer to retrain the Link (LTSSM
goes to the Recovery state). When retraining is finished, it will attempt to send
the failed TLPs again. The REPLAY_NUM counter is initialized to 00b at reset,
or when the Link Layer is inactive. It is also reset whenever an Ack DLLP is
received with a Sequence Number that is more recent than the last one seen,
meaning forward progress is being made.
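The rollover behavior of the 2-bit counter can be sketched as follows. The class and method names are hypothetical; the retrain request is represented only by a return value, since Link retraining itself is a Physical Layer action.

```python
class ReplayNum:
    """Model of the 2-bit REPLAY_NUM counter."""

    def __init__(self) -> None:
        self.count = 0b00  # initialized to 00b at reset or when the Link Layer is inactive

    def on_replay(self) -> bool:
        """Count one replay attempt; return True if the Link must be retrained.

        The counter wraps from 11b to 00b on the fourth consecutive replay,
        which is the signal to force Link retraining.
        """
        self.count = (self.count + 1) & 0b11
        return self.count == 0b00

    def on_forward_progress(self) -> None:
        """An Ack with a more recent Sequence Number arrived: start over."""
        self.count = 0b00
```

Four replays in a row without forward progress produce three `False` results followed by `True`; any Ack that makes forward progress in between resets the count.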

ACKD_SEQ Register
This 12-bit register stores the Sequence Number of the most recently received
Ack or Nak. It is initialized to all 1s at reset, or when the Data Link Layer is
inactive. This register is updated with the AckNak_Seq_Num [11:0] field of a
received Ack or Nak. The ACKD_SEQ count is compared with the Sequence
Number in the last received Ack or Nak to check for forward progress. If the
latest Ack/Nak had a Sequence Number later than the ACKD_SEQ register, then
we're making forward progress.


As an aside, we use the term "later" Sequence Number to account for the fact
that, like most counters in PCIe, the Sequence Number counters only count
forward, meaning that they'll eventually roll over back to zero. Technically, a later
number would mean a numerically higher value, but we have to remember that
when the counter reaches 4095 (it's a 12-bit counter), the next higher number
will be zero. This wrap-around effect will be easier to see in the examples later,
as in "Ack/Nak Examples" on page 331.

As shown in Figure 10-4 on page 322, when an Ack or Nak makes forward
progress it causes TLPs with Sequence Numbers equal to or older than the
value in the DLLP to be purged out of the Replay Buffer. It also resets both the
REPLAY_TIMER and the REPLAY_NUM count. If no forward progress is made,
no TLPs can be purged so we only check to see if it's a Nak that would
necessitate a replay.

This is a good place to mention a potential problem with the counters: the
number of TLPs sent might theoretically become much larger than the number that
have been acknowledged by the receiver. As mentioned earlier, this is very
unlikely; it's only mentioned here for completeness. The problem is basically the
same as it is for the Flow Control counters (see "Stage 3 Counters Roll Over" on
page 234) and has the same solution: the NEXT_TRANSMIT_SEQ and
ACKD_SEQ counters are never allowed to be separated by more than half their
total count value. If a large number of TLPs are sent without acknowledgement
so that the NEXT_TRANSMIT_SEQ count value is later than the ACKD_SEQ count
by 2048, no more TLPs will be accepted from the Transaction Layer until this is
resolved by receiving more Acks. If the difference between the Sequence
Number sent and the acknowledged count ever did exceed half the maximum count
value, a Data Link Layer protocol error would be reported. (For more on error
reporting, see "Data Link Layer Errors" on page 655.)
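Both the "later" comparison and the half-range limit come down to modulo-4096 arithmetic. The sketch below shows one way to express them; the function names are hypothetical, and treating a distance of exactly 2048 as "not later" in `is_later` is a choice made for this illustration.

```python
SEQ_MOD = 4096        # 12-bit Sequence Numbers
HALF = SEQ_MOD // 2   # 2048


def is_later(a: int, b: int) -> bool:
    """True if Sequence Number `a` is later than `b`, allowing for wrap-around."""
    return 0 < (a - b) % SEQ_MOD < HALF


def must_block_new_tlps(next_transmit_seq: int, ackd_seq: int) -> bool:
    """True if the transmitter has run 2048 or more ahead of the last Ack/Nak."""
    return (next_transmit_seq - ackd_seq) % SEQ_MOD >= HALF
```

Right after reset, NEXT_TRANSMIT_SEQ is 0 and ACKD_SEQ is all 1s (4095), so the distance is 1 and TLPs flow normally. Sequence Number 0 also counts as later than 4095, which is the wrap-around case described above.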

DLLP CRC Check


This block checks for errors in the 16-bit CRC of DLLPs. If an error is detected,
the DLLP is discarded and a Correctable Error may be reported, if enabled. No
further action is taken because there is no mechanism to replay or correct failed
DLLPs. Instead, we simply wait for the next successful Ack/Nak, which will get
the counters back up to date and allow normal operation to continue.

Receiver Elements
Incoming TLPs are first checked for LCRC errors and then for Sequence
Numbers. If there are no errors, the TLP is forwarded to the receiver's Transaction
Layer. If there are errors, the TLP is discarded and a Nak will be scheduled
unless there was already a Nak outstanding.

Figure 10-5 on page 325 illustrates the receiver Data Link Layer elements
associated with processing of inbound TLPs and outbound Ack/Nak DLLPs.

Figure 10-5: Receiver Elements Associated with the Ack/Nak Protocol

[Diagram: a TLP arriving from the Link first goes through the LCRC check. If it
fails and the Nak flag is clear, NAK_SCHEDULED is set and a Nak is sent. If it
passes, its Sequence Number is compared with NEXT_RCV_SEQ (NRS): equal
means a good TLP, which is passed to the Transaction Layer (RX), increments
NRS, and clears the Nak flag; less than NRS means a duplicate TLP, for which
an Ack is scheduled; greater than NRS means a lost TLP, for which a Nak is
sent. The Ack/Nak Generator, paced by the AckNak Latency Timer, returns
AckNak_Seq_Num[11:0] = NRS - 1.]

LCRC Error Check


This block checks for transmission errors in the received TLP by verifying the
32-bit LCRC. This block calculates an LCRC value based on the received bits of
the TLP and then compares the calculated LCRC to the received LCRC. If they
match, then all the bits of the packet were received exactly as they were
transmitted. If they don't match, then there was a bit error in the TLP so it gets
dropped and a Nak will be sent to get a replay of that packet and any TLPs sent
after the bad packet.


NEXT_RCV_SEQ Counter
The 12-bit NEXT_RCV_SEQ (Next Receive Sequence number) counter keeps
track of the expected Sequence Number and is used to verify sequential packet
reception. It's initialized to 0 at reset or when the Data Link Layer is inactive,
and is incremented once for each good TLP forwarded to the Transaction Layer.
TLPs that have errors or were nullified are not sent to the Transaction Layer and
therefore don't increment this counter.

Sequence Number Check


If the LCRC check was OK, the TLP's Sequence Number is checked against the
expected count (the NRS number). As can be seen in Figure 10-5 on page 325,
there are three possible outcomes of this check:

1. The TLP Sequence Number equals the NRS count (the number we're
   expecting). In this case, everything is good: the TLP is accepted and
   forwarded to the Transaction Layer and the NRS count is incremented. The
   Receiver schedules an Ack, but it doesn't have to be sent until the
   AckNak_LATENCY_TIMER expires. In the meantime, other good TLPs
   may be received, incrementing the NEXT_RCV_SEQ counter. Then, once
   the timer expires, a single Ack is sent with the Sequence Number of the last
   good TLP received (NRS - 1). That allows one Ack to represent several
   successful TLPs and reduces overhead, since a dedicated Ack is not required
   for every TLP.
2. If the TLP's Sequence Number is earlier than the NRS count (smaller than
   expected), this TLP has been seen before and is a duplicate. As long as the
   expected Sequence Number and received Sequence Number don't get
   separated by more than half the total count value (2048), this is not an error,
   but is seen as a duplicate, meaning the TLP has already been accepted earlier.
   In this case, the TLP is silently dropped (no Nak, no error reporting) and an
   Ack is sent with the Sequence Number of the last good TLP it received
   (NRS - 1). Why would this situation happen? The transmitter may not have
   received a transmitted Ack, so its REPLAY_TIMER expired and it
   retransmitted everything in its Replay Buffer. By sending the transmitter an
   Ack with the Sequence Number of the last good packet we received, we're
   notifying it of the furthest progress we've made.
3. If the TLP's Sequence Number is a later Sequence Number than the
   NEXT_RCV_SEQ count (larger than expected), then the Link Layer has
   missed a TLP. For example, if we're expecting Sequence Number 30 and the
   incoming TLP has Sequence Number 31 we know there's a problem. The
   numbers must be sequential and, since they aren't, one must have failed
   and been dropped, as might happen at the Physical Layer. This out-of-order
   TLP is discarded, whether or not it had any other errors, because we must
   accept TLPs in order, and a Nak will be sent if there wasn't one already
   outstanding.
Figure 10-6 shows how the expected sequence number (NRS) increments as new
TLPs are successfully received, and how that moves the sliding windows for
the invalid range of sequence numbers and the duplicate range of sequence
numbers.

Figure 10-6: Examples of Sequence Number Ranges

[Figure: three number lines running from 0 to 4095, with the Next Receive
Sequence (NRS) Number at 30, 31, and 32. In each case the values from just
above NRS up to the boundary (2078, 2079, and 2080 respectively) are marked
Invalid (out of sequence), and the values on either side of that window are
marked Duplicate.]
NAK_SCHEDULED Flag
This flag is set whenever the receiver schedules a Nak, and is cleared when the
receiver successfully receives the TLP with the expected Sequence Number
(NRS). The spec is clear that the receiver must not schedule additional Nak
DLLPs while the NAK_SCHEDULED flag remains set. The author's opinion is
that this is intended to prevent the possibility of an endless loop; a case in which
the transmitter begins to replay some packets but the receiver sends another
Nak before the replays finish and causes it to restart sending them again.
Whatever the motivation, once a Nak has been sent there will be no more Naks
forthcoming until the problem is resolved by successful receipt of the replayed TLP
with the correct Sequence Number.

AckNak_LATENCY_TIMER
This timer is running anytime a receiver successfully receives a TLP that it has
not yet acknowledged. The receiver is required to send an Ack once the timer
expires. The length of time the AckNak Latency Timer runs is dictated by the
spec (see "AckNak_LATENCY_TIMER" on page 328) and determines how long
a receiver can coalesce Acks. Once the AckNak Latency Timer expires, an Ack
with sequence number NRS - 1 is generated and sent, which indicates the last
good packet it received. This timer is reset whenever an Ack or Nak is sent
and it only restarts once a new good TLP is received.

Ack/Nak Generator
Ack or Nak DLLPs are scheduled by the error checking blocks and contain a
12-bit AckNak_Seq_Num field as illustrated in Figure 10-7 on page 328. The
generator calculates this number by subtracting one from the NRS count, which
results in reporting the last good Sequence Number received. That's because a
good TLP received increments NRS before scheduling the Ack, while a failed TLP just
schedules a Nak without incrementing NRS. This method makes it easier to
handle failed packets because the error in the TLP might have been in the
Sequence Number, so that number can't be used in the Nak. Instead, it uses the
number of the last good TLP; what we're expecting minus one. The only case
where this value doesn't represent the last good TLP is for the first TLP after a
reset. If that first TLP, using Sequence Number 0, fails, the resulting Nak will
have an AckNak_Seq_Num value of zero minus one, which results in all 1s.

Figure 10-7: Ack Or Nak DLLP Format

[Figure: DLLP layout, bits 7..0 of bytes +0 to +3.
Byte 0: DLLP Type (0000 0000b = Ack, 0001 0000b = Nak), followed by
        Reserved bits and the 12-bit AckNak_Seq_Num.
Byte 4: 16-bit CRC.]


Table 10-1: Ack or Nak DLLP Fields

Field Name        Header Byte/Bit    DLLP Function

DLLP Type         Byte 0, [7:0]      Indicates the type:
                                     0000 0000b = Ack
                                     0001 0000b = Nak

AckNak_Seq_Num    Byte 2, [3:0]      This value will always be
                  Byte 3, [7:0]      NEXT_RCV_SEQ count - 1.

16-bit CRC        Byte 4, [7:0]      16-bit CRC used to protect the
                  Byte 5, [7:0]      contents of this DLLP.
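To make the field layout concrete, the sketch below packs and unpacks the first four bytes of an Ack or Nak DLLP as laid out in Table 10-1. It is illustrative only: the function names are invented for this example, the 16-bit CRC in bytes 4 and 5 is deliberately left out, and placing the upper four bits of the sequence number in Byte 2 [3:0] is an assumption read from the table.

```python
ACK_TYPE = 0b0000_0000
NAK_TYPE = 0b0001_0000

def acknak_seq_num(next_rcv_seq: int) -> int:
    """AckNak_Seq_Num is NEXT_RCV_SEQ - 1, kept to 12 bits."""
    return (next_rcv_seq - 1) & 0xFFF

def pack_acknak(is_nak: bool, next_rcv_seq: int) -> bytes:
    """Build bytes 0-3 of an Ack or Nak DLLP (CRC bytes 4-5 not included)."""
    seq = acknak_seq_num(next_rcv_seq)
    return bytes([
        NAK_TYPE if is_nak else ACK_TYPE,  # Byte 0: DLLP Type
        0x00,                              # Byte 1: Reserved
        (seq >> 8) & 0x0F,                 # Byte 2: [7:4] Reserved, [3:0] upper bits
        seq & 0xFF,                        # Byte 3: lower eight bits
    ])

def unpack_acknak(dllp: bytes) -> tuple:
    """Return (type name, AckNak_Seq_Num) from bytes 0-3."""
    kind = {ACK_TYPE: "Ack", NAK_TYPE: "Nak"}[dllp[0]]
    seq = ((dllp[2] & 0x0F) << 8) | dllp[3]
    return kind, seq

# First TLP after reset fails: NEXT_RCV_SEQ is still 0, so the Nak carries all 1s.
print(unpack_acknak(pack_acknak(True, 0)))
```

The last line shows the reset corner case described earlier: a Nak sent while NEXT_RCV_SEQ is 0 carries a sequence number of 4095 (all 1s).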

Ack/Nak Protocol Details


This section describes the detailed transmitter and receiver behavior in
processing TLPs and Ack/Nak DLLPs. Several examples are used to demonstrate
various cases that may occur.

Transmitter Protocol Details


Sequence Number
Referring back to Figure 10-4 on page 322, when TLPs are delivered by the
Transaction Layer to the Link Layer, one of the first steps is to append a 12-bit
Sequence Number. Keep in mind that the next incremental Sequence Number
may actually be smaller, as will happen when the counter rolls over back to zero
after it reaches a maximum value of 4095. Consequently, a value of zero can
actually be larger than a value of 4095, for example. It may help to think of the
Sequence Number comparison as evaluating a window of numbers that
consistently moves upward and rolls over. To clarify this concept, such a count
rollover is used in several of the upcoming examples.

32-Bit LCRC
The transmitter also generates and appends a 32-bit LCRC (Link CRC) based on
the TLP contents (Sequence Number, Header, Data Payload and ECRC).


Replay (Retry) Buffer

General. Before a device transmits a TLP, it stores a copy of the TLP in the
Replay Buffer. (Note that the spec uses the term Retry Buffer, but in this book
Replay was chosen instead of Retry to more clearly distinguish this mechanism
from the old PCI Retry mechanism.) Each buffer entry stores a complete TLP
with all of its fields, including the Sequence Number (12 bits wide, it
occupies two bytes), Header (up to 16 bytes), an optional Data Payload (up to
4KB), an optional ECRC (four bytes) and the LCRC field (four bytes).

It is important to note that the spec describes the Replay Buffer in this
fashion, but it is NOT a spec requirement that it be implemented this way. As
long as your device can replay a sequence of TLPs if required, as defined by
the spec, then how that is accomplished within a device is completely up to
the designer. Having a Replay Buffer that behaves as described above is one
way to accomplish this.

Replay Buffer Sizing. The spec writers chose not to specify the Replay
Buffer size, leaving it as an optimization for the device designers. It should
be made big enough to store TLPs that haven't yet been acknowledged by
Acks so that under normal operating conditions it doesn't become full and
stall new TLPs coming in from the Transaction Layer, but also small enough
to keep the cost down. To determine the optimal buffer size, a designer will
consider:
• Ack DLLP Latency from the receiver.
• Delays caused by the physical Link.
• Receiver L0s exit latency to L0. In other words, the buffer should be big
  enough to hold TLPs without stalling while the Link returns from the
  L0s state to L0.

When the transmitter receives an Ack, it purges TLPs from the Replay
Buffer with Sequence Numbers equal to or earlier than the Sequence
Number in the Ack (normally this term would be "smaller than" but the counter
rollover behavior will sometimes make that an incorrect evaluation, so the term
"earlier than" was chosen instead). Similarly, when the transmitter receives a
Nak, it still purges the Replay Buffer of TLPs with Sequence Numbers that are
equal to or earlier than the Sequence Number that arrives in the Nak, but
then it also replays (resends) TLPs of later Sequence Numbers (the
remaining TLPs in the Replay Buffer).
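The "equal to or earlier than" test is easy to get wrong when the counter rolls over, so a short sketch may help. It is a simplified model, not an implementation from the spec: the Replay Buffer is reduced to a queue of Sequence Numbers, the helper names are hypothetical, and it assumes the buffer never holds Sequence Numbers that are 2048 or more apart, so that the modular comparison is unambiguous.

```python
from collections import deque

SEQ_MODULUS = 4096

def is_earlier_or_equal(seq_num: int, acknak_seq: int) -> bool:
    """True if seq_num equals acknak_seq or precedes it, allowing for rollover."""
    return (acknak_seq - seq_num) % SEQ_MODULUS < SEQ_MODULUS // 2

def purge_acknowledged(replay_buffer: deque, acknak_seq: int) -> list:
    """Remove and return the oldest entries covered by an Ack or Nak."""
    purged = []
    while replay_buffer and is_earlier_or_equal(replay_buffer[0], acknak_seq):
        purged.append(replay_buffer.popleft())
    return purged

# Oldest entry first, with the counter rolling over from 4095 to 0.
buffer = deque([4094, 4095, 0, 1, 2])
print(purge_acknowledged(buffer, 1), list(buffer))
```

An Ack carrying Sequence Number 1 removes 4094, 4095, 0 and 1 even though three of those are numerically larger or smaller in the "wrong" direction, and leaves TLP 2 in the buffer.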


Transmitter's Response to an Ack DLLP


A single Ack returned by the receiver may acknowledge multiple TLPs; it
isn't necessary that every TLP transmitted receive a dedicated Ack. The
receiver can get multiple good TLPs and send one Ack with the Sequence
Number of the last good TLP received. The transmitter's response to an Ack
that makes forward progress (has a Sequence Number that is later than the
last one seen) is to load the AckD_SEQ register with the Sequence Number
of the new Ack. It also resets the REPLAY_NUM counter and
REPLAY_TIMER, and purges the Replay Buffer of all TLPs that were
acknowledged by that Ack.

Ack/Nak Examples

Example 1. Consider Figure 10-8 on page 332 for the following
discussion.

1. Device A transmits TLPs with Sequence Numbers 3, 4, 5, 6, 7.
2. Device B successfully receives TLP 3 and increments its
NEXT_RCV_SEQ counter from 3 to 4. Since Device B had previously
acknowledged all successfully received TLPs, the
AckNak_LATENCY_TIMER was not running. Having received TLP 3,
Device B has now successfully received a TLP that it has not
acknowledged, so the AckNak_LATENCY_TIMER is started (this is equivalent
to scheduling an Ack).
3. Device B successfully receives TLPs 4 and 5 before the
AckNak_LATENCY_TIMER expires. Receiving TLPs 4 and 5 does NOT
reset the AckNak_LATENCY_TIMER.
4. Once the AckNak_LATENCY_TIMER expires, Device B sends a single
Ack with the Sequence Number 5, the last good TLP received. The
AckNak_LATENCY_TIMER is reset but does not restart until Device B
successfully receives TLP 6.
5. Device A receives Ack 5 and resets the REPLAY_TIMER and
REPLAY_NUM counter, because forward progress is being made. It also
purges TLPs from the Replay Buffer that have Sequence Numbers
earlier than or equal to 5.
6. Once Device B receives TLPs 6 and 7 and its
AckNak_LATENCY_TIMER expires again, it will send an Ack with a
Sequence Number of 7, which will purge the last two TLPs in the Replay
Buffer of Device A (according to this example).
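The coalescing behavior in the steps above can be sketched as a small timeline model. Everything about the model is an assumption made for illustration: arrival times are in arbitrary units, the timeout value is hypothetical, every TLP is taken to be good and in order, and the function name is invented.

```python
def coalesce_acks(arrivals, timeout):
    """Return the Sequence Numbers carried by the Acks the receiver sends.

    arrivals: list of (time, sequence_number) for good, in-order TLPs.
    timeout:  AckNak_LATENCY_TIMER limit in the same time units.
    """
    acks = []
    timer_start = None   # None means the latency timer is not running
    last_good = None
    for time, seq_num in arrivals:
        # If the timer ran out before this TLP arrived, one Ack went out
        # for everything received so far and the timer was reset.
        if timer_start is not None and time >= timer_start + timeout:
            acks.append(last_good)
            timer_start = None
        last_good = seq_num
        # A good TLP that is not yet acknowledged starts the timer,
        # but later TLPs do not restart it.
        if timer_start is None:
            timer_start = time
    if timer_start is not None:
        acks.append(last_good)   # the timer eventually expires for the tail
    return acks

# TLPs 3, 4, 5 arrive close together; 6 and 7 arrive after the first timeout.
print(coalesce_acks([(0, 3), (1, 4), (2, 5), (10, 6), (11, 7)], timeout=5))
```

For the arrival pattern of Example 1 this yields two Acks, carrying Sequence Numbers 5 and 7, rather than five separate ones.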


Figure 10-8: Example 1 - Example of Ack

[Figure: Device A's Replay Buffer holds TLPs 3 (earliest) through 7 (latest)
with NEXT_TRANSMIT_SEQ = 8. Device B has received TLPs 3, 4 and 5 as good
TLPs, NEXT_RCV_SEQ = 6 and NAK_SCHEDULED = 0. Device B returns Ack 5,
which purges TLPs 3, 4 and 5 from the Replay Buffer. TLPs 6 and 7 are still
on the Link.]

Example 2. This example shows exactly the same behavior as Example 1,
but points out the rollover behavior for the Sequence Numbers,
as shown in Figure 10-9 on page 333.

1. Device A transmits TLPs with Sequence Numbers 4094, 4095, 0, 1, and 2,
where TLP 4094 is the first TLP sent and TLP 2 is the last TLP sent in
this example.
2. Device B successfully receives TLPs with Sequence Numbers 4094,
4095, 0, 1 in that order. Reception of TLP 4094 causes the
AckNak_LATENCY_TIMER to start. TLPs 4095, 0 and 1 are received
before the AckNak_LATENCY_TIMER expires. TLP 2 is still en route.
3. Because the AckNak_LATENCY_TIMER expires, Device B sends an Ack
with a Sequence Number of 1 to acknowledge receipt of TLP 1 and all
prior TLPs (0, 4095 and 4094 in this example).
4. Device A successfully receives Ack 1, purges TLPs 4094, 4095, 0, and 1
from the Replay Buffer and resets the REPLAY_TIMER and
REPLAY_NUM count.


Figure 10-9: Example 2 - Ack with Sequence Number Rollover

[Figure: Device A's Replay Buffer holds TLPs 4094 (earliest), 4095, 0, 1 and
2 (latest) with NEXT_TRANSMIT_SEQ = 3. Device B has received TLPs 4094,
4095, 0 and 1 as good TLPs, NEXT_RCV_SEQ = 2 and NAK_SCHEDULED = 0.
Device B returns Ack 1, which purges TLPs 4094, 4095, 0 and 1 from the
Replay Buffer.]

Transmitter's Response to a Nak


A Nak indicates that a problem has occurred. When a transmitter receives one,
it first purges from the Replay Buffer any TLPs with earlier or equal Sequence
Numbers and then replays the remaining TLPs starting with the Sequence
Number immediately after the Sequence Number in the Nak. If the Nak caused
at least one TLP to be purged from the buffer, then we've made forward
progress. In that case, the transmitter resets the REPLAY_NUM counter and
REPLAY_TIMER and loads the AckD_SEQ register with the Sequence Number
of the Nak.

TLP Replay
When a Replay becomes necessary, the transmitter blocks acceptance of new
TLPs from its Transaction Layer. It then replays the necessary TLPs in the buffer
in the same order they were placed into the buffer (like a FIFO). After the replay
event, the Data Link Layer unblocks acceptance of new TLPs from its
Transaction Layer. The replayed TLPs remain in the buffer until they are finally
acknowledged at some later time.

Efficient TLP Replay


Ack or Nak DLLPs received during replay must be processed, and there are two
main options for doing so. The transmitter may hold them until the replay is
finished and then evaluate the Acks or Naks and take the appropriate steps.
Another option would be to begin processing the new Ack/Nak DLLPs even during
the replay. If this option is used, the newly received Acks might purge some
entries from the buffer while replay is in progress, possibly reducing the number of
TLPs that need to be replayed and saving time on the Link. This is allowed, but
it is important to remember that once a TLP is started for transmission, it must
be completed.

Example of a Nak
Consider Figure 10-10 on page 335.

1. Device A transmits TLPs with Sequence Number 4094, 4095, 0, 1, and 2.
2. Device B receives TLP 4094 without error and increments the
NEXT_RCV_SEQ count to 4095 and starts the
AckNak_LATENCY_TIMER.
3. Device B detects a CRC error in the next TLP received (TLP 4095) and
sets the NAK_SCHEDULED flag, which will cause a Nak to be sent
with Sequence Number 4094 (NEXT_RCV_SEQ count - 1). Device B
does NOT wait until the AckNak_LATENCY_TIMER expires before
sending the Nak. It will typically be sent on the next packet boundary.
In fact, since a Nak is scheduled for transmission, the
AckNak_LATENCY_TIMER is stopped and reset.
4. Device B will continue evaluating incoming TLPs looking for TLP 4095.
However, because Device A did not know there was a problem yet, it
had sent packets 0, 1 and 2, which Device B will receive. Device B
will not accept them, even though they may be good TLPs
(meaning they did not fail the LCRC check). This is because all packets
have to be accepted in order. So Device B will simply drop those
packets because they are considered out of sequence, but no additional Nak
will be sent. Even if one or more of these TLPs fail the LCRC check, no
additional Nak is sent. The NAK_SCHEDULED flag is already set and
it will only be cleared once Device B successfully receives the TLP it is
expecting (TLP 4095 in this example).


5. Device A receives Nak 4094 and purges TLP 4094 and earlier TLPs
(none in this example) from the Replay Buffer. Also, since forward
progress was made, it resets the REPLAY_TIMER and REPLAY_NUM
count.
6. Since the acknowledge DLLP received was a Nak and not an Ack,
Device A then replays all remaining TLPs in the Replay Buffer (TLPs
4095, 0, 1, and 2) and restarts the REPLAY_TIMER and increments the
REPLAY_NUM count by one.
7. Once Device B receives the replayed TLP 4095, it will clear the
NAK_SCHEDULED flag, increment the NEXT_RCV_SEQ count and
start the AckNak_LATENCY_TIMER.
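Steps 5 and 6 on the transmitter side can be modeled in a few lines. This is a sketch under simplifying assumptions rather than a description of real hardware: the class and method names are invented, the Replay Buffer holds only Sequence Numbers, timers are omitted, and nothing is done here when REPLAY_NUM rolls over.

```python
from collections import deque

class NakTransmitterModel:
    """Toy model of a transmitter reacting to a Nak DLLP."""

    def __init__(self, unacknowledged):
        self.replay_buffer = deque(unacknowledged)  # oldest TLP first
        self.replay_num = 0                         # 2-bit REPLAY_NUM counter

    def on_nak(self, nak_seq):
        """Purge what the Nak acknowledges, then replay everything left."""
        purged = []
        # "Earlier than or equal" must be evaluated modulo 4096.
        while (self.replay_buffer
               and (nak_seq - self.replay_buffer[0]) % 4096 < 2048):
            purged.append(self.replay_buffer.popleft())
        if purged:
            self.replay_num = 0          # forward progress was made
        replayed = list(self.replay_buffer)   # TLPs stay buffered until Acked
        if replayed:
            self.replay_num = (self.replay_num + 1) & 0b11
        return purged, replayed

device_a = NakTransmitterModel([4094, 4095, 0, 1, 2])
print(device_a.on_nak(4094), device_a.replay_num)
```

For the example above, Nak 4094 purges only TLP 4094, causes 4095, 0, 1 and 2 to be replayed in their original order, and leaves REPLAY_NUM at 1.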

Figure 10-10: Example of a Nak

[Figure: Device A's Replay Buffer holds TLPs 4094 (earliest), 4095, 0, 1 and
2 (latest) with NEXT_TRANSMIT_SEQ = 3. Device B receives TLP 4094 as a
good TLP, so NEXT_RCV_SEQ = 4095. TLP 4095 fails the LCRC check,
NAK_SCHEDULED is set to 1, and TLPs 0, 1 and 2 are treated as out of
sequence. Device B returns Nak 4094, which purges TLP 4094 from the Replay
Buffer and triggers a replay of TLPs 4095, 0, 1 and 2 on the Link.]

Repeated Replay of TLPs

General. Each time the transmitter receives a Nak, it replays the buffer
contents, and the 2-bit REPLAY_NUM counter is incremented to keep track
of the number of replay events. The replay caused by a Nak in the previous
example will increment REPLAY_NUM.


If the replay doesn't clear the problem, though, we enter a new situation.
The receiver has set the Nak Scheduled Flag and cannot send any more
Acks or Naks until it sees the offending TLP correctly received. If the replay
doesn't make that happen for some reason, then there will be no response
from the receiver. What saves us now is the transmitter's REPLAY_TIMER.
When it times out, the entire contents of the Replay Buffer will be resent,
the REPLAY_NUM counter will be incremented and the REPLAY_TIMER
will be reset and restarted. If the REPLAY_TIMER expires without receiving
an Ack or Nak indicating forward progress, this replay process can be
repeated up to three times. If after the third replay, there is still no forward
progress and the REPLAY_TIMER expires again, this would cause the
REPLAY_NUM counter to roll over from 3 back to 0.

Replay Number Rollover. When this happens, the assumption is
that there must be something wrong with the Link, so the Link Layer
triggers the Physical Layer to retrain the Link, causing it to go into the
Recovery State (see "Recovery State" on page 571). If the optional Advanced Error
Reporting registers are implemented, the Replay Number Rollover error
status bit will also be set ("Advanced Correctable Error Handling" on
page 688). The Replay Buffer contents are preserved and the Link Layer is
not initialized during the retraining process (this is simply retraining the
Link, not performing a reset of the Link). When retraining completes, the
transmitter resumes the same replay process again in hopes that the
problem has been cleared and the TLPs can now be replayed successfully.

The spec does not describe how a device might handle repeated rollover
events if the Link training doesn't clear the problem. The author has seen
commercially available hardware that had no mechanism to detect this
condition and got stuck in an endless loop of retraining. It seems good,
therefore, to recommend that a device track the number of retrain attempts.
After sufficient attempts, the device could signal an Uncorrectable Fatal
Error or an interrupt as a way to notify software of this condition.
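The 2-bit rollover and the suggested retrain limit can be sketched together. This is a hypothetical illustration of the recommendation above and not behavior defined by the spec: the class name, the returned action strings, and the default limit of three retrains are all invented for the example.

```python
class ReplayTracker:
    """Counts replay events and suggests when to retrain or give up."""

    def __init__(self, max_retrains=3):
        self.replay_num = 0              # 2-bit REPLAY_NUM counter
        self.retrain_attempts = 0        # not in the spec; suggested tracking
        self.max_retrains = max_retrains # hypothetical threshold

    def on_forward_progress(self):
        """An Ack or Nak purged at least one TLP from the Replay Buffer."""
        self.replay_num = 0
        self.retrain_attempts = 0

    def on_replay(self):
        """Called for every replay event (Nak received or REPLAY_TIMER expired)."""
        self.replay_num = (self.replay_num + 1) & 0b11
        if self.replay_num != 0:
            return "replay"
        # The counter rolled over from 3 back to 0: the Link is suspect.
        self.retrain_attempts += 1
        if self.retrain_attempts >= self.max_retrains:
            return "report_error"
        return "retrain_then_replay"

tracker = ReplayTracker()
print([tracker.on_replay() for _ in range(4)])
```

Three replays are attempted normally and the fourth event rolls the counter over, which in this model requests Link retraining before the replay continues.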

Replay Timer
The transmitter REPLAY_TIMER is running anytime there are TLPs that have
been transmitted but have not yet been acknowledged. The goal of the
REPLAY_TIMER is to ensure that TLPs are being acknowledged in a timely
fashion. If this timer expires, it indicates that an Ack or Nak should have been
received by that point in time, so something must have gone wrong and the fix
from the transmitter's point of view is to perform a replay, meaning to resend
everything in the Replay Buffer.


Based on the purpose of this timer, it makes sense that its timeout value should
be correlated with the AckNak_LATENCY_TIMER in the receiver. In fact, the
REPLAY_TIMER is simply three times longer than the
AckNak_LATENCY_TIMER.

A formula in the spec determines the timer's count value. Its expiration triggers
a replay event and increments the REPLAY_NUM counter. Two cases in which a
timeout may arise are when an Ack or Nak is lost en route, or when an error
in the receiver prevents it from returning an Ack or Nak. Timer-related rules:

• If not already running, the timer starts when the last symbol of any TLP is
  transmitted
• The timer is reset and restarted when:
  - An Ack indicating forward progress is received, AND there are still
    unacknowledged TLPs in the Replay Buffer
  - A Replay event occurs and the last symbol of the first replayed TLP is
    sent
• The timer is reset and held when:
  - There are no TLPs to transmit, or the Replay Buffer is empty
  - A Nak is received; it restarts when the last symbol of the first replayed
    TLP is sent
  - The timer expires; it restarts when the last symbol of the first replayed
    TLP is sent
  - The Data Link Layer is inactive
• The timer is held during Link training or retraining

REPLAY_TIMER Equation. The timeout value depends primarily on
the max data payload and the width of the Link. The equation to calculate
the REPLAY_TIMER value in symbol times is given below. Note that the
value is simply three times the Ack/Nak Latency value.

\[
\text{REPLAY\_TIMER} =
\left( \frac{(\text{Max\_Payload\_Size} + \text{TLPOverhead}) \times \text{AckFactor}}{\text{LinkWidth}}
+ \text{InternalDelay} \right) \times 3 + \text{Rx\_L0s\_Adjustment}
\]

(The Rx_L0s_Adjustment term was removed for Gen2 and later.)
The equation fields are defined as follows:
• Max_Payload_Size: the value in the Device Control Register. In the
  case of multiple Functions with different Max_Payload_Size values, the
  spec recommends using the smallest one of them.
• TLPOverhead: the additional TLP fields beyond the data payload
  (sequence number, header, digest, LCRC and Start/End framing
  symbols). In the spec, the overhead value is treated as a constant of 28
  symbols.
• AckFactor (AF): basically a fudge factor representing the number of
  max payload-sized TLPs that can be received before an Ack must be
  sent. The AF value ranges from 1.0 to 3.0 and is intended to balance
  Link bandwidth efficiency and Replay Buffer size. The table in Figure
  10-11 on page 339 shows the Ack Factor values for various link widths
  and payload sizes. These Ack Factor values are chosen to allow
  implementations to achieve good performance without requiring a large
  uneconomical buffer.
• LinkWidth: ranges from x1 (1 bit wide) to x32 (32 bits wide).
• InternalDelay: the internal delay of processing a TLP within the
  receiver and DLLPs (Acks) within the transmitter. This value is defined
  in the spec in symbol times, and depends on the Link speed: Gen1 = 19,
  Gen2 = 70, Gen3 = 115.
• Rx_L0s_Adjustment: a value that was included in the 1.x PCIe
  specs but was dropped for 2.0 and later PCIe specs. It could be used to
  account for the time required by the receive circuits to exit from L0s to
  L0. Setting the Extended Sync bit of the Link Control register affects the
  exit time from L0s and must be taken into account in this adjustment.
  Interestingly, the spec writers chose to assume this to be zero when
  creating their table of Replay Timer values. More on this in the following
  section.

REPLAY_TIMER Summary Table. Figure 10-11 on page 339 is a
summary table for the Gen1 rate that shows timer load values for various
values of the variables in the REPLAY_TIMER equation. The numbers have
changed for the newer generations of the spec, and the new tables and a
discussion of them can be found in the section called "Timing Differences for
Newer Spec Versions" on page 350. The tolerance for all of the table values
is -0% to +100%.

Note that the table values in the spec (copied here for convenience) are
considered "unadjusted" because they leave out the last item of the equation
involving the time to recover from L0s. No explanation is given for this in
the spec, but if the Link had to wake up from L0s to L0 just to replay a
packet in case the timeout might have been an error, that would be poor
power management.


A simple way to avoid this problem altogether is for the transmitter to
ensure that the Replay Buffer is empty before entering L0s. The spec
requires that step for entry into L1 but not L0s, and the reason probably has
to do with the relative risk involved. Going to L1 requires a longer recovery
process back to L0 that has some small risk of failure. If it fails to recover,
the Physical Layer state machine will have to do more of the Link training, a
process that clears the LinkUp flag to the Link Layer, causing the Link
Layer to reinitialize. If there were entries in the Replay Buffer when that
happened they'd be lost and problems could result. The recovery risk from
L0s was evidently considered low enough not to warrant that requirement.
Still, the L0s latency was left out when the table was constructed, leaving
the reader to wonder about this. In the author's opinion, the spec writers
expected designers to take steps to ensure that a Replay Timer timeout
either doesn't occur while in L0s (by emptying the Replay Buffer before L0s
entry), or will be delayed if the path for the Acks is observed to be in L0s.

Figure 10-11: Gen1 Unadjusted REPLAY_TIMER Values

Max_Payload   x1       x2      x4      x8      x12     x16     x32
Size          Link     Link    Link    Link    Link    Link    Link
128 Bytes     711      384     219     201     174     144     99
256 Bytes     1248     651     354     321     270     216     135
512 Bytes     1677     867     462     258     327     258     156
1024 Bytes    3213     1635    846     450     582     450     252
2048 Bytes    6285     3171    1614    834     1095    834     444
4096 Bytes    12,429   6243    3150    1602    2118    1602    828

The table summarizes values calculated using the equation, minus the
Rx_L0s_Adjustment term.

Example: Assume a 2-lane link with a Max_Payload of 2048 bytes.

\[
\left( \frac{(\text{Max\_Payload\_Size} + \text{TLPOverhead}) \times \text{AckFactor}}{\text{LinkWidth}}
+ \text{InternalDelay} \right) \times 3
= \left( \frac{(2048 + 28) \times 1.0}{2} + 19 \right) \times 3 = 3171
\]

(about a 12.7 µs timeout period)
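The worked example can be checked with a few lines of code. The sketch below is an illustration only: the function name is invented, the AckFactor is passed in tenths so that the arithmetic stays in integers, and truncating the per-Ack term to whole symbol times before multiplying by three is an assumption that happens to reproduce the table values tried here. The 4 ns Gen1 symbol time and the AckFactor values used in the checks are the Gen1 figures given in this chapter.

```python
TLP_OVERHEAD = 28          # symbols, treated as a constant by the spec
GEN1_INTERNAL_DELAY = 19   # symbol times
GEN1_SYMBOL_TIME_NS = 4    # one symbol time at the Gen1 rate

def gen1_replay_timer(max_payload_size, link_width, ack_factor_tenths):
    """Unadjusted Gen1 REPLAY_TIMER limit in symbol times.

    ack_factor_tenths is AckFactor * 10 (for example 14 for AF = 1.4).
    """
    scaled = (max_payload_size + TLP_OVERHEAD) * ack_factor_tenths
    per_ack = scaled // (10 * link_width)        # truncated to whole symbols
    return (per_ack + GEN1_INTERNAL_DELAY) * 3   # Rx_L0s_Adjustment taken as 0

limit = gen1_replay_timer(2048, 2, 10)           # x2 Link, 2048 bytes, AF = 1.0
print(limit, limit * GEN1_SYMBOL_TIME_NS / 1000) # symbol times, microseconds
```

For the x2 Link with a 2048-byte payload this gives 3171 symbol times, or 12.684 µs at 4 ns per symbol, in line with the figure quoted in the example.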


Transmitter DLLP Handling


The Ack/Nak Error Checking block determines whether there is an error in the
16-bit CRC of a received DLLP. If an error is detected, the DLLP is discarded.
This is considered a correctable error and may have been set up to be reported
in the optional Advanced Error Reporting registers (see "Bad DLLP" in
"Advanced Correctable Error Handling" on page 688), but no further action is
taken because this isn't really a problem. The next successfully received DLLP
of that type will bring the counters back up to speed. Consequently, TLPs might
be purged a little later than they would have been or a replay may happen at a
later time, but no information is lost. Of course, if the delay between successful
Acks becomes too large, the REPLAY_TIMER could expire, causing the TLPs to
be replayed.

Receiver Protocol Details


Physical Layer
TLPs received at the Physical Layer are checked for receiver errors (such as
framing, disparity, and invalid symbols). If there are errors at this level, the TLP
is discarded and the Link Layer may be informed by some design-specific
method so it can schedule a Nak and have the packet replayed. If the Link Layer
is not informed, then eventually it will detect a Sequence Number violation and
that will cause a Nak and a replay.


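Figure 10-12: Ack/Nak Receiver Elements

[Figure: receiver flow from the Link up to the Transaction Layer (RX). An
incoming TLP first goes through the LCRC check and then has its Sequence
Number compared with NEXT_RCV_SEQ (NRS). Seq Num = NRS: the good TLP is
passed to the Transaction Layer, NRS is incremented, the Nak flag is cleared
and an Ack is scheduled. Seq Num < NRS (duplicate TLP): an Ack is scheduled.
Seq Num > NRS (lost TLP) or LCRC failure: if the NAK_SCHEDULED flag is
clear, it is set and a Nak is sent. The Ack/Nak Generator, paced by the
AckNak Latency Timer, uses (NRS - 1) as AckNak_Seq_Num[11:0].]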

TLP LCRC Check


If there were no Physical Layer errors, the Link Layer checks first for CRC
errors. The receiver calculates an expected LCRC value from the received TLP
(excluding the LCRC field) and compares this value with the TLP's 32-bit LCRC.
If the two match, the TLP is good. Otherwise, the TLP is discarded and the
receiver schedules a Nak.

Next Received TLP's Sequence Number


If the LCRC was correct, the receiver next compares the NEXT_RCV_SEQ
counter against the Sequence Number that should be in the newly received TLP.
Under normal operational conditions, these two numbers will match. If they do,
the receiver forwards the TLP to the Transaction Layer, increments the
NEXT_RCV_SEQ counter, and schedules an Ack.


If the received TLP's Sequence Number turns out to be earlier or later than the
NEXT_RCV_SEQ count, we have one of two cases: a duplicate TLP or an
out-of-sequence TLP.

Duplicate TLP. If the Sequence Number of the incoming packet is
earlier (logically smaller) than the expected value, it means the transmitter
decided to resend a packet that the receiver has already seen before. This
duplicate packet is not an error, although we are wasting time on the Link
by resending it. This might be caused by a timeout at the transmitter if the
Ack or Nak for a previous TLP failed. When this is seen at the receiver, the
duplicate packet is discarded and an Ack is scheduled with the Sequence
Number of the last good TLP it has received (which is probably not the
same as the Sequence Number in the replayed TLP).

Out of Sequence TLP. If the Sequence Number of the incoming
packet is later (logically larger) than the expected value, the only
explanation is that a TLP must have been lost. This is a correctable error and is
handled by sending a Nak. It doesn't matter if the incoming packet is good
because packets can only be accepted in correct Sequence Number order. The
packet is discarded and the receiver waits for a TLP with the expected
Sequence Number.

The NEXT_RCV_SEQ counter is not incremented when a received TLP has a
CRC error, was nullified, or fails the Sequence Number check.

A transmitter orders TLPs according to the PCI ordering rules to maintain
correct program flow and avoid potential deadlock and livelock conditions (see
Chapter 8, entitled "Transaction Ordering," on page 285). The Receiver is
required to preserve this order and applies these three rules:

• When the receiver detects a bad TLP, it discards the TLP and all new TLPs
  that follow in the pipeline until the replayed TLPs are detected.
• Duplicate TLPs are discarded.
• TLPs received while waiting for a lost or corrupt TLP are discarded.

Receiver Schedules An Ack DLLP


If the Data Link Layer of the receiver does not detect an error in an incoming
TLP, it forwards the TLP to the Transaction Layer. The NEXT_RCV_SEQ
counter is incremented and the receiver starts the AckNak_LATENCY_TIMER
(assuming it was not already running). This is the equivalent of scheduling an
Ack. The receiver is allowed to continue receiving good TLPs without sending
an Ack until the AckNak_LATENCY_TIMER expires. When the timer expires it
sends just one Ack with the Sequence Number of the last good TLP,
acknowledging good receipt of all received TLPs up to the Sequence Number in the
current Ack. This technique improves Link efficiency by reducing Ack/Nak traffic.
For review, recall that this technique works because the TLPs must always be
successfully received in order.

Receiver Schedules a Nak


As mentioned earlier in the discussion of the receiver logic (see "Receiver
Elements" on page 324), when the receiver detects an error on a TLP, it discards the
bad packet and sets the NAK_SCHEDULED flag if it was clear, which will cause
a Nak to be scheduled with the Sequence Number of NEXT_RCV_SEQ count
- 1. Since a Nak is now scheduled, the AckNak_LATENCY_TIMER is reset and
halted. Scheduling a Nak can be thought of as being an edge-triggered event
instead of a level-triggered event. It is seeing the rising edge of the
NAK_SCHEDULED flag that causes a Nak to be scheduled. Another Nak
cannot be sent until the next rising edge, which means the NAK_SCHEDULED flag
must be cleared (falling edge) first. There are only two events that will clear the
NAK_SCHEDULED flag. The first is successfully receiving the expected next
TLP (TLP with a Sequence Number that matches the NEXT_RCV_SEQ count).
The second is a reset of the link (not retraining, but reset).

Although it's important to get the Nak to the transmitter quickly (no other TLPs
can be accepted until the failed one is seen without errors), other outgoing
TLPs, DLLPs or Ordered Sets may already be in progress or have a higher priority
than the Nak, which means the receiver would have to delay the transmission of
the Nak until they're done (see "Recommended Priority To Schedule Packets"
on page 350). In the meantime, if other TLPs arrive at the receiver they are
discarded and no additional Acks or Naks will be scheduled while the
NAK_SCHEDULED flag is set.
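The receiver rules described so far can be collected into a small model that shows the edge-triggered nature of the Nak. It is only a sketch: the class name, method name and action labels are invented, timers and DLLP priorities are left out, and acknowledging a duplicate even while a Nak is outstanding is an assumption of this model rather than something stated above.

```python
class AckNakReceiverModel:
    """Toy model of the receiver's Sequence Number and Nak handling."""

    def __init__(self, next_rcv_seq=0):
        self.next_rcv_seq = next_rcv_seq
        self.nak_scheduled = False

    def _last_good(self):
        return (self.next_rcv_seq - 1) % 4096   # AckNak_Seq_Num = NRS - 1

    def receive(self, seq_num, lcrc_ok=True):
        """Process one TLP; return (action, AckNak_Seq_Num or None)."""
        if lcrc_ok and seq_num == self.next_rcv_seq:
            # Expected TLP: forward it, advance NRS, clear the Nak flag.
            self.next_rcv_seq = (self.next_rcv_seq + 1) % 4096
            self.nak_scheduled = False
            return "accept", self._last_good()
        if lcrc_ok and (self.next_rcv_seq - seq_num) % 4096 <= 2048:
            # Duplicate: drop silently, re-acknowledge the last good TLP.
            return "duplicate", self._last_good()
        # Bad LCRC or out of sequence: Nak only on the flag's rising edge.
        if self.nak_scheduled:
            return "discard", None
        self.nak_scheduled = True
        return "nak", self._last_good()

device_b = AckNakReceiverModel(next_rcv_seq=4094)
for seq, ok in [(4094, True), (4095, False), (0, True), (1, True), (2, True), (4095, True)]:
    print(seq, device_b.receive(seq, lcrc_ok=ok))
```

Running the sequence from the earlier Nak example shows a single Nak 4094 for the corrupted TLP 4095, silent discards for TLPs 0, 1 and 2, and normal acceptance once the replayed TLP 4095 arrives.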

AckNak_LATENCY_TIMER
This timer defines how long a receiver can wait before it must send an Ack for a
successfully received TLP (or sequence of TLPs). As stated before, this timer is
running anytime a receiver successfully receives a TLP that it has not yet
acknowledged. Once the timer expires, an Ack is scheduled for transmission
with the Sequence Number of the last good TLP it received. Scheduling an Ack
resets the AckNak_LATENCY_TIMER and it only starts counting again once
the next TLP is successfully received.


AckNak_LATENCY_TIMER Equation.
The timeout value for the AckNak_LATENCY_TIMER is defined by the
spec and varies based on the Negotiated Link Width and Max Payload Size
Enabled. The equation which defines the timeout is shown below:

\[
\text{AckNak\_LATENCY\_TIMER} =
\frac{(\text{Max\_Payload\_Size} + \text{TLPOverhead}) \times \text{AckFactor}}{\text{LinkWidth}}
+ \text{InternalDelay} + \text{Tx\_L0s\_Adjustment}
\]

(The Tx_L0s_Adjustment term was removed for Gen2 and later.)
The value in the timer is given in symbol times, the time it takes to send one
symbol across the Link: 4 ns for Gen1, 2 ns for Gen2, and 1 ns for Gen3.
The equation fields are:
• Max_Payload_Size: the value in the Device Control Register. In the
  case of multiple Functions with different Max_Payload_Size values, the
  spec recommends using the smallest one of them.
• TLPOverhead: the additional TLP fields beyond the data payload
  (sequence number, header, digest, LCRC and Start/End framing
  symbols). In the spec, the overhead value is treated as a constant of 28
  symbols.
• AckFactor (AF): basically a fudge factor representing the number of
  max payload-sized TLPs that can be received before an Ack must be
  sent. The AF value ranges from 1.0 to 3.0 and is intended to balance
  Link bandwidth efficiency and Replay Buffer size. The table in Figure
  10-11 on page 339 shows the Ack Factor values for various link widths
  and payload sizes. These Ack Factor values are chosen to allow
  implementations to achieve good performance without requiring a large
  uneconomical buffer.
• LinkWidth: ranges from x1 (1 bit wide) to x32 (32 bits wide), that is,
  from 1 to 32 Lanes.
• InternalDelay: the internal delay of processing a TLP within the
  receiver and DLLPs (Acks) within the transmitter. This value is defined
  in the spec in symbol times, and depends on the Link speed: Gen1 = 19,
  Gen2 = 70, Gen3 = 115.
• Tx_L0s_Adjustment: a value that was included in the 1.x PCIe
  specs but was dropped for 2.0 and later PCIe specs. It could be used to
  account for the time required by the receive circuits to exit from L0s to
  L0. Setting the Extended Sync bit of the Link Control register affects the
  exit time from L0s and must be taken into account in this adjustment.
  Interestingly, the spec writers chose to assume this to be zero when
  creating their table of Replay Timer values.


AckNak_LATENCY_TIMER Summary Table. Table 10-2 on
page 345 shows the Gen1 timer load values for all the possible values used
in the AckNak_LATENCY_TIMER equation. Higher data rates change the
equation and the resulting table (see "Timing Differences for Newer Spec
Versions" on page 350). Like the Replay Timer table, this table is
constructed by assuming the L0s adjustment in the equation is zero and then
referring to the resulting values as "unadjusted". Note that the tolerance for
all of the table values is -0% to +100%.

Table 10-2: Gen1 Unadjusted Ack Transmission Latency (Symbol Times)

Max_Payload   x1         x2         x4         x8         x12        x16        x32
Size          Link       Link       Link       Link       Link       Link       Link
128 Bytes     237        128        73         67         58         48         33
              (AF=1.4)   (AF=1.4)   (AF=1.4)   (AF=2.5)   (AF=3.0)   (AF=3.0)   (AF=3.0)
256 Bytes     416        217        118        107        90         72         45
              (AF=1.4)   (AF=1.4)   (AF=1.4)   (AF=2.5)   (AF=3.0)   (AF=3.0)   (AF=3.0)
512 Bytes     559        289        154        86         109        86         52
              (AF=1.0)   (AF=1.0)   (AF=1.0)   (AF=1.0)   (AF=2.0)   (AF=2.0)   (AF=2.0)
1024 Bytes    1071       545        282        150        194        150        84
              (AF=1.0)   (AF=1.0)   (AF=1.0)   (AF=1.0)   (AF=2.0)   (AF=2.0)   (AF=2.0)
2048 Bytes    2095       1057       538        278        365        278        148
              (AF=1.0)   (AF=1.0)   (AF=1.0)   (AF=1.0)   (AF=2.0)   (AF=2.0)   (AF=2.0)
4096 Bytes    4143       2081       1050       534        706        534        276
              (AF=1.0)   (AF=1.0)   (AF=1.0)   (AF=1.0)   (AF=2.0)   (AF=2.0)   (AF=2.0)
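
The table entries can be reproduced from the terms defined above. The following is a minimal sketch, assuming the equation has the form ((Max_Payload_Size + TLPOverhead) x AckFactor / LinkWidth) + InternalDelay, with the L0s adjustment taken as zero and the result truncated to a whole symbol time. The function names are illustrative, and the Ack Factor is held in tenths only to avoid floating-point rounding.

```python
# Sketch only: names are hypothetical; the equation form is an assumption
# checked against the tabulated values, not quoted from the spec.
TLP_OVERHEAD = 28                                   # symbols (constant in the spec)
INTERNAL_DELAY = {"Gen1": 19, "Gen2": 70, "Gen3": 115}   # symbol times


def ack_factor_tenths(max_payload, link_width):
    """Ack Factor multiplied by 10, following the AF entries in the table."""
    if max_payload <= 256:
        if link_width <= 4:
            return 14
        return 25 if link_width == 8 else 30
    return 10 if link_width <= 8 else 20


def ack_latency(max_payload, link_width, gen="Gen1"):
    """Unadjusted AckNak_LATENCY_TIMER value in symbol times."""
    af10 = ack_factor_tenths(max_payload, link_width)
    transfer = (max_payload + TLP_OVERHEAD) * af10 // (10 * link_width)
    return transfer + INTERNAL_DELAY[gen]
```

For example, ack_latency(128, 1) evaluates (128 + 28) x 1.4 = 218.4, truncates to 218 and adds 19 to give the 237 symbol times listed for a x1 Link.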

More Examples
In the classroom setting, examples often make it much easier to grasp
the Ack/Nak process, and so some of them are presented here to
illustrate special cases.

Lost TLPs
Consider Figure 10-13 on page 346, showing how a lost TLP is detected
and handled.
1. Device A transmits TLPs 4094, 4095, 0, 1, and 2.
2. Device B successfully receives TLP 4094 so it starts its
   AckNak_LATENCY_TIMER and increments its NEXT_RCV_SEQ count. After
   that, it also receives TLPs 4095 and 0.
3. After receiving TLP 0, the AckNak_LATENCY_TIMER expires, which
   causes it to schedule an Ack with a Sequence Number of 0.
4. Seeing Ack 0, Device A purges TLPs 4094, 4095, and 0 from its
   replay buffer.
5. TLP 1 is lost en route for some reason (maybe the Physical Layer
   dropped it), and TLP 2 arrives instead. The Sequence Number check
   shows Device B that TLP 2's Sequence Number is not equal to the
   NEXT_RCV_SEQ count but is in the out-of-sequence range.
6. Device B discards TLP 2 and sets the NAK_SCHEDULED flag, which
   will send a Nak 0 (NEXT_RCV_SEQ count - 1) in this case.
7. Upon receipt of Nak 0, Device A replays TLPs 1 and 2. It would purge
   TLP 0 and any earlier ones in the Replay Buffer, but they were
   removed earlier so that becomes unnecessary.
8. TLPs 1 and 2 arrive without error at Device B and are forwarded to
   the Transaction Layer.
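
The receiver's Sequence Number check in steps 2 through 8 can be sketched as follows. This is an illustrative model only: the class and method names are hypothetical, and the 2048-entry window used to separate duplicates from out-of-sequence TLPs is an assumption about how the 12-bit wraparound is resolved.

```python
SEQ_MOD = 4096   # 12-bit Sequence Numbers wrap from 4095 back to 0


class Receiver:
    """Toy model of the Data Link Layer receive-side sequence check."""

    def __init__(self, next_rcv_seq=0):
        self.next_rcv_seq = next_rcv_seq
        self.nak_scheduled = False

    def receive(self, seq):
        """Classify one TLP that has already passed its LCRC check."""
        if seq == self.next_rcv_seq:
            self.next_rcv_seq = (self.next_rcv_seq + 1) % SEQ_MOD
            self.nak_scheduled = False
            return "accept"
        # Assumed window: up to 2048 numbers behind counts as a duplicate.
        if (self.next_rcv_seq - seq) % SEQ_MOD <= 2048:
            return "duplicate"
        if self.nak_scheduled:
            return "discard"           # a Nak is already outstanding
        self.nak_scheduled = True
        return "nak %d" % ((self.next_rcv_seq - 1) % SEQ_MOD)
```

Starting with NEXT_RCV_SEQ at 4094 and feeding in TLPs 4094, 4095, 0 and then 2 yields three accepts followed by "nak 0", matching the walkthrough; the replayed TLPs 1 and 2 are then accepted in order.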
Figure 10-13: Handling Lost TLPs
[Diagram: Device A (Replay Buffer, REPLAY_TIMER, NEXT_TRANSMIT_SEQ = 3)
and Device B (Receive Buffer, NEXT_RCV_SEQ = 1, NAK_SCHEDULED, Ack/Nak
Latency Timer, Ack/Nak Generator) joined by the Link. TLPs 4094, 4095
and 0 are received as good, TLP 2 arrives out of sequence, Nak 0 is
returned, and TLPs 1 and 2 are replayed.]


Bad Ack
Figure 10-14 on page 347 shows the protocol for handling a corrupt Ack.
1. Device A transmits TLPs 4094, 4095, 0, 1, and 2.
2. Device B receives TLPs 4094, 4095, and 0, sets NEXT_RCV_SEQ to 1, and
   returns Ack 0 because the AckNak_LATENCY_TIMER had expired.
3. Ack 0 suffers a bit error during its flight on the Link, so when
   Device A checks its 16-bit CRC, it fails the check and is discarded.
   This means TLPs 4094, 4095, and 0 remain in Device A's Replay Buffer.
4. TLPs 1 and 2 arrive at Device B and are good, so the NEXT_RCV_SEQ
   count increments to 3 and Ack 2 is returned once the
   AckNak_LATENCY_TIMER expires again.
5. Ack 2 arrives safely at Device A, which purges its Replay Buffer of
   TLPs 4094, 4095, 0, 1, and 2.
If Ack 2 is also lost or corrupted and no further Ack or Nak DLLPs are
returned to Device A, its REPLAY_TIMER expires, causing a replay of its
entire buffer. Device B sees TLPs 4094, 4095, 0, 1 and 2 and considers
them to be duplicates [their sequence numbers are earlier than the
NEXT_RCV_SEQ count (3)]. They are discarded and another Ack 2 would be
returned to Device A because of the duplicate packets.
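
The transmitter side of this example can be modeled in a few lines. The sketch below is a hedged illustration, not the spec's implementation: ReplayBuffer and its methods are hypothetical names, and treating anything within half the sequence space behind the Ack as "equal to or earlier" is an assumption about how wraparound is handled.

```python
from collections import deque

SEQ_MOD = 4096   # 12-bit Sequence Numbers


def acked(seq, ack_seq):
    """True if seq is equal to or earlier than ack_seq, allowing for wrap."""
    return (ack_seq - seq) % SEQ_MOD < SEQ_MOD // 2


class ReplayBuffer:
    def __init__(self):
        self.tlps = deque()            # sequence numbers, oldest first

    def send(self, seq):
        self.tlps.append(seq)          # keep a copy until it is acknowledged

    def on_ack(self, ack_seq):
        """Purge every TLP at or before ack_seq; later ones stay buffered."""
        while self.tlps and acked(self.tlps[0], ack_seq):
            self.tlps.popleft()

    def on_replay_timeout(self):
        """REPLAY_TIMER expired: everything still buffered is resent."""
        return list(self.tlps)
```

Because a corrupted Ack 0 is simply discarded, on_ack() is never called for it, and the later Ack 2 purges all five TLPs in one step.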

Figure 10-14: Handling Bad Ack
[Diagram: Device A (Replay Buffer, REPLAY_TIMER, NEXT_TRANSMIT_SEQ) and
Device B (Receive Buffer, NEXT_RCV_SEQ, NAK_SCHEDULED, Ack/Nak Latency
Timer, Ack/Nak Generator) joined by the Link.]


Bad Nak
Figure 10-15 on page 349 shows the protocol for handling a corrupt Nak.
1. Device A transmits TLPs 4094, 4095, 0, 1, and 2.
2. Device B receives TLPs 4094, 4095, and 0 all successfully (and the
   AckNak_LATENCY_TIMER has not yet expired). The next TLP that it
   receives fails the LCRC check, so Device B sets the NAK_SCHEDULED
   flag, and resets and holds the AckNak_LATENCY_TIMER. The Nak is
   transmitted back with the Sequence Number of the last good TLP
   received, 0.
3. Nak 0 fails the 16-bit CRC check at Device A and is discarded.
4. At this point, Device B will not be sending any more Acks or Naks
   until it successfully receives the next TLP it is expecting, TLP 1 in
   this example. However, this will require a replay. Device A does not
   yet know that a replay is required because the one Nak that was sent
   back was corrupted and discarded. This gets resolved by the
   REPLAY_TIMER. The REPLAY_TIMER will eventually expire because it has
   not seen an Ack or Nak that makes forward progress in the specified
   time frame.
5. Once the REPLAY_TIMER expires, Device A will replay all TLPs in the
   Replay Buffer, increment the REPLAY_NUM count, and reset and restart
   the REPLAY_TIMER.
6. Device B will receive TLPs 4094, 4095 and 0 and recognize that they
   are duplicates. The duplicate TLPs will be dropped and an Ack will be
   scheduled with a Sequence Number of 0 (indicating the furthest
   progress made).
7. Once TLP 1 is successfully received by Device B, it will clear the
   NAK_SCHEDULED flag, increment the NEXT_RCV_SEQ count and restart the
   AckNak_LATENCY_TIMER because it has successfully received a TLP that
   it has not yet acknowledged.


Figure 10-15: Handling Bad Nak
[Diagram: Device A (Replay Buffer, REPLAY_TIMER expires,
NEXT_TRANSMIT_SEQ = 3) and Device B (Receive Buffer, NEXT_RCV_SEQ = 1,
NAK_SCHEDULED, Ack/Nak Latency Timer, Ack/Nak Generator) joined by the
Link. TLP 1 fails its LCRC check, TLP 2 is out of sequence, the Nak
fails its CRC check at Device A, and TLPs 4094, 4095, 0, 1 and 2 are
replayed.]

Error Situations Handled by Ack/Nak


The Ack/Nak protocol guarantees reliable delivery of TLPs despite
several possible errors. The list of errors below includes the
correction mechanism used to resolve them.

LCRC error in a TLP. Solution: The receiver detects the LCRC error and
schedules a Nak that contains the NEXT_RCV_SEQ count - 1. In response,
the transmitter replays at least one TLP, starting with the one that
failed.

TLPs lost en route to the receiver's Data Link Layer (e.g. the Physical
Layer detects an issue with a packet and drops it). Solution: The
receiver checks the Sequence Number on all received TLPs, expecting
them to arrive with the next sequential Sequence Number. If a TLP is
lost, the Sequence Number of the next one that succeeds will be out of
sequence. In response, the Receiver schedules a Nak with the
NEXT_RCV_SEQ count - 1, and the transmitter replays at least one TLP,
starting with the missing one.

Corrupted Ack or Nak en route to the transmitter. Solution: The
Transmitter detects a CRC error in the DLLP (see "Receiver handling of
DLLPs" on page 309), discards the packet and simply waits for the next
one.
    Ack Case: A subsequent Ack received with a later Sequence Number
    causes the transmitter Replay Buffer to purge all TLPs with Sequence
    Numbers equal to or earlier than it. The transmitter is unaware that
    anything was wrong (except for a potential case of the Replay Buffer
    temporarily filling up).
    Nak Case: The receiver, having set the Nak Scheduled flag, will not
    send another Nak or any Acks until it successfully receives the next
    expected TLP, meaning a replay is needed. Of course, the transmitter
    doesn't know it needs to replay if the Nak was lost. In this case,
    the REPLAY_TIMER will eventually expire and trigger the replay.

No Ack/Nak seen within the expected time. Solution: REPLAY_TIMER
timeout triggers a replay.

Receiver fails to send Ack/Nak for a received TLP. Solution: Again, the
transmitter's REPLAY_TIMER will expire and result in a replay.

Recommended Priority To Schedule Packets


A device may have many types of TLPs, DLLPs and Ordered Sets to
transmit on a given Link. The recommended priority for scheduling
packets is:

1. Completion of any TLP or DLLP currently in progress (highest priority)
2. Ordered Set
3. Nak
4. Ack
5. Flow Control
6. Replay Buffer retransmissions
7. TLPs that are waiting in the Transaction Layer
8. All other DLLP transmissions (lowest priority)
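
As a rough illustration of how a transmitter might apply this ordering, the sketch below picks the highest-priority category that has something pending. The category labels and the next_to_send() helper are hypothetical names chosen for this example; a real arbiter would also have to account for Flow Control credits and timers.

```python
# Recommended scheduling order, highest priority first (labels are illustrative).
PRIORITY = [
    "in_progress",    # completion of any TLP or DLLP currently in progress
    "ordered_set",
    "nak",
    "ack",
    "flow_control",
    "replay",         # Replay Buffer retransmissions
    "new_tlp",        # TLPs waiting in the Transaction Layer
    "other_dllp",     # all other DLLP transmissions
]


def next_to_send(pending):
    """Return the highest-priority category present in `pending`, or None."""
    for kind in PRIORITY:
        if kind in pending:
            return kind
    return None
```

With an Ack, a replay and a new TLP all waiting, next_to_send({"new_tlp", "ack", "replay"}) selects the Ack first.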

Timing Differences for Newer Spec Versions


As mentioned earlier, the timer values for the Ack/Nak protocol are
different for Gen2 and later versions of the spec. To improve
readability of the text, only the Gen1 versions (2.5 GT/s rate) were
included in the earlier discussion, but all three versions are included
here for convenience.


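As before, the values given are in symbol times, so the actual time is
that value multiplied by the time needed to deliver one symbol over the
Link at that rate. For review, the time to transmit one symbol (known
as a Symbol Time) is 4 ns for Gen1, 2 ns for Gen2, and 1 ns to transmit
1 byte for Gen3 (8 bits at 8.0 GT/s, not counting the 128b/130b Sync
Header overhead).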

Ack Transmission Latency (AckNak Latency)


One interesting difference between the spec versions is the way the L0s
recovery time is considered. In the 1.x specs, an argument is included
in the AckNak_LATENCY_TIMER equation to account for this, but the
tables in the spec based on that equation put its value at zero and
call the resulting values "unadjusted". Beginning with the 2.0 spec,
the L0s recovery value is dropped from the equation altogether and the
text states that the receiver is not required to adjust Ack scheduling
based on L0s exit latency or the value of the Extended Sync bit. None
of the table values contain an L0s recovery component and they are
therefore all still called "unadjusted".

Note that, since the AF (Ack Factor) values are the same in all the
tables and were shown in the earlier presentation of the Gen1 table,
they're not included in the tables here.

Also, as it was for Gen1, the tolerance for all of the table values is
-0% to +100%. To illustrate this, Table 10-3 on page 351 lists the time
for a x1 Link and Max Payload size of 128 Bytes as 237 symbol times.
Legal values would therefore range from no less than 237 symbol times
to no more than 474.

2.5 GT/s Operation

Table 10-3: Gen1 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)

Max Payload   x1      x2      x4      x8      x12     x16     x32
              Link    Link    Link    Link    Link    Link    Link
128 Bytes     237     128     73      67      58      48      33
256 Bytes     416     217     118     107     90      72      45
512 Bytes     559     289     154     86      109     86      52
1024 Bytes    1071    545     282     150     194     150     84
2048 Bytes    2095    1057    538     278     365     278     148
4096 Bytes    4143    2081    1050    534     706     534     276

5.0 GT/s Operation


Table 10-4: Gen2 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)

Max Payload   x1      x2      x4      x8      x12     x16     x32
              Link    Link    Link    Link    Link    Link    Link
128 Bytes     288     179     124     118     109     99      84
256 Bytes     467     268     169     158     141     123     96
512 Bytes     610     340     205     137     160     137     103
1024 Bytes    1122    596     333     201     245     201     135
2048 Bytes    2146    1108    589     329     416     329     199
4096 Bytes    4194    2132    1101    585     757     585     327

8.0 GT/s Operation

Table 10-5: Gen3 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)

Max Payload   x1      x2      x4      x8      x12     x16     x32
              Link    Link    Link    Link    Link    Link    Link
128 Bytes     333     224     169     163     154     144     129
256 Bytes     512     313     214     203     186     168     141
512 Bytes     655     385     250     182     205     182     148
1024 Bytes    1167    641     378     246     290     246     180
2048 Bytes    2191    1153    634     374     461     374     244
4096 Bytes    4239    2177    1146    630     802     630     372


Replay Timer
Much like the AckNak Latency Timer calculation, L0s recovery time is
considered differently for the Replay Timer in newer spec versions. In
the 1.x specs, an argument is included in the Replay Timer equation to
account for this, but the tables in the spec based on that equation put
its value at zero and call the resulting values "unadjusted". Beginning
with the 2.0 spec, the argument is dropped from the equation altogether
and the text states that the transmitter should compensate for L0s exit
if it will be used, either by statically adding that time to the table
values or by sensing when the Link is in that state and allowing extra
time in that case. The table values still don't contain an L0s
component and are still called "unadjusted".

As a final word on this topic, the spec strongly recommends that a
transmitter should not do a replay on a Replay Timer timeout if it's
possible that the delay in receiving an Ack was caused by the other
device's transmitter being in the L0s state.

Note that, just like for the Ack Latency Timer tables, the tolerance
for all of the table values is -0% to +100%. To illustrate this, Table
10-6 on page 353 lists the time for a x1 Link and Max Payload size of
128 Bytes as 711 symbol times. Legal values would therefore range from
no less than 711 symbol times to no more than 1422.
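
The Replay Timer tables that follow are numerically consistent with each entry being three times the corresponding unadjusted Ack latency entry. That relationship is inferred here from the published numbers rather than quoted from the text, so treat the sketch below as an observation about the tables; the helper names are hypothetical.

```python
# Gen1 x1 Link unadjusted Ack latency values (symbol times), keyed by payload size.
GEN1_ACK_LATENCY_X1 = {128: 237, 256: 416, 512: 559,
                       1024: 1071, 2048: 2095, 4096: 4143}


def replay_timer_from_ack_latency(ack_latency):
    """Assumed relationship: unadjusted REPLAY_TIMER = 3 x unadjusted Ack latency."""
    return 3 * ack_latency


def legal_range(unadjusted):
    """Apply the -0% to +100% tolerance to an unadjusted table value."""
    return unadjusted, 2 * unadjusted


gen1_replay_x1 = {size: replay_timer_from_ack_latency(value)
                  for size, value in GEN1_ACK_LATENCY_X1.items()}
```

For the 128-byte, x1 case this gives 3 x 237 = 711 symbol times, with a legal range of 711 to 1422.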

2.5 GT/s Operation

Table 10-6: Gen1 Unadjusted REPLAY_TIMER Values in Symbol Times

Max Payload   x1      x2      x4      x8      x12     x16     x32
              Link    Link    Link    Link    Link    Link    Link
128 Bytes     711     384     219     201     174     144     99
256 Bytes     1248    651     354     321     270     216     135
512 Bytes     1677    867     462     258     327     258     156
1024 Bytes    3213    1635    846     450     582     450     252
2048 Bytes    6285    3171    1614    834     1095    834     444
4096 Bytes    12429   6243    3150    1602    2118    1602    828


5.0 GT/s Operation


Table 10-7: Gen2 Unadjusted REPLAY_TIMER Values in Symbol Times

Max Payload   x1      x2      x4      x8      x12     x16     x32
              Link    Link    Link    Link    Link    Link    Link
128 Bytes     864     537     372     354     327     297     252
256 Bytes     1401    804     507     474     423     369     288
512 Bytes     1830    1020    615     411     480     411     309
1024 Bytes    3366    1788    999     603     735     603     405
2048 Bytes    6438    3324    1767    987     1248    987     597
4096 Bytes    12582   6396    3303    1755    2271    1755    981

8.0 GT/s Operation


Table 10-8: Gen3 Unadjusted REPLAY_TIMER Values

Max Payload   x1      x2      x4      x8      x12     x16     x32
              Link    Link    Link    Link    Link    Link    Link
128 Bytes     999     672     507     489     462     432     387
256 Bytes     1536    939     642     609     558     504     423
512 Bytes     1965    1155    750     546     615     546     444
1024 Bytes    3501    1923    1134    738     870     738     540
2048 Bytes    6573    3459    1902    1122    1383    1122    732
4096 Bytes    12717   6531    3438    1890    2406    1890    1116

Switch Cut-Through Mode


Now that we've described how the protocol works, this is a good time to
explain an exception to its general operation. PCIe supports a Switch
feature, called cut-through mode, that can be used to improve the
transfer latency for large TLPs through a Switch.


Background
Consider an example where a large TLP needs to pass through a Switch as
shown in Figure 10-16 on page 357. Since the Ingress Switch Port can't
tell whether there was an error in the packet until it has seen the
whole TLP, it'll normally store the entire packet and check it for
errors before forwarding it to the Egress Port. This store-and-forward
method works but, for large packets, the latency to get through the
Switch can be large, which may be an issue for some applications. It
would be nice to minimize this latency if possible.

A Latency Improvement Option


Since the first part of the TLP contains the header with the routing
information for the packet, one option would be to assume that the
packet is a good packet and start evaluating the routing info in the
header even before the entire packet is received. This would allow a
Switch to begin forwarding the TLP to the Egress Port as soon as that
routing is evaluated. The Egress Port could then go ahead and start
sending it out onto its Link, as long as doing so will not cause an
underflow condition within the Switch. (A potential underflow case
could easily happen if the Ingress Port is x1 and the Egress Port is
x16. The Egress Port would be sending the packet out much faster than
it is being received.)

Of course, the Ingress Port can't check for errors in the packet until
it receives the LCRC at the end of the packet, so there is a small risk
involved that the TLP being forwarded out may actually contain an
error. Eventually, the end of the TLP arrives at the Ingress Port and
the packet can be checked. If it turns out there was an error, the
Ingress Port takes the normal behavior for a bad TLP and simply sends a
Nak to have the packet replayed. However, we now have to deal with the
problem that most of a packet that we now know is bad has already been
forwarded on to the Egress Port. What are our options at this point? We
could finish forwarding the packet and wait for a Nak from the
neighboring receiver when it sees the error, but the packet in the
replay buffer would be the bad one, and so a replay there won't fix the
problem. We might truncate the bad packet in flight, but the spec
doesn't allow for that possibility. To make this work, we need another
option, and that's where the Cut-Through option comes into play.


Cut-Through Operation
Cut-through mode provides the solution to the forwarding problem
described in the previous section: if an error is seen in the incoming
packet, the packet that is already on its way out must be "nullified".

A nullified packet is terminated with an EDB (end bad) symbol instead
of an END (end good) symbol and, to make the condition very clear, the
TLP's 32-bit LCRC is inverted (1's complement) from the original
calculated value. In essence, a nullified packet is handled as though
it had never existed. On the Switch Egress Port, that means the replay
buffer discards the packet and the NEXT_TRANSMIT_SEQ counter is
decremented by one (rolled back).

When a device receives a TLP that it recognizes as being a nullified
TLP, it simply drops the packet and treats it as if it never existed.
The NEXT_RCV_SEQ count is not incremented, the AckNak_LATENCY_TIMER is
not started, and the NAK_SCHEDULED flag is not set. The receiving
device silently discards the nullified TLP and does not return an
Ack/Nak for it.
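
The nullification rule can be sketched as a pair of small functions. This is an illustrative model under stated assumptions: zlib.crc32 stands in for the real LCRC calculation (whose bit ordering and details are not reproduced here), the END and EDB framing symbols are represented as strings, and the function names are hypothetical.

```python
import zlib

END, EDB = "END", "EDB"
MASK32 = 0xFFFFFFFF


def lcrc(tlp_bytes):
    # Stand-in only: a generic 32-bit CRC, not the exact PCIe LCRC procedure.
    return zlib.crc32(tlp_bytes) & MASK32


def transmit(tlp_bytes, nullify=False):
    """Return (bytes, crc, end symbol); a nullified TLP gets inverted LCRC + EDB."""
    crc = lcrc(tlp_bytes)
    if nullify:
        return tlp_bytes, crc ^ MASK32, EDB
    return tlp_bytes, crc, END


def receive(tlp_bytes, rx_crc, end_symbol):
    """Classify a received TLP as good, nullified, or bad."""
    expected = lcrc(tlp_bytes)
    if end_symbol == EDB and rx_crc == expected ^ MASK32:
        return "nullified"     # drop silently: no Ack, no Nak, no counter change
    if end_symbol == END and rx_crc == expected:
        return "good"
    return "bad"               # normal error handling applies (schedule a Nak)
```

Requiring both the EDB symbol and the inverted LCRC is what lets the receiver tell a deliberately nullified packet apart from one that was merely corrupted on the Link.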

Example of Cut-Through Operation


Figure 10-16 on page 357 illustrates a TLP coming in from the left,
going through the Switch, and ending up at an Endpoint on the right. A
TLP error occurs on the left Link. The steps are as follows:
1. An incoming TLP is seen at the Switch Ingress Port. It has become
   corrupted in flight but that isn't known yet.
2. The TLP header arrives, is decoded, and the packet is forwarded to
   the destination Egress Port in cut-through operation.
3. Eventually, the end of the packet arrives and the Switch Ingress Port
   is able to complete the LCRC error check. It finds a CRC error and
   returns a Nak to the TLP source.
4. At the Egress Port, the Switch replaces the END framing symbol at the
   end of the bad TLP with EDB and inverts the calculated LCRC value.
   The TLP is now nullified and the Switch discards it from the Replay
   Buffer.
5. The nullified packet arrives at the Endpoint. The Endpoint detects
   the EDB symbol and inverted LCRC and silently discards the packet. It
   does not return a Nak.
Now let's say the TLP source device replays the packet and no error
occurs. As before, the TLP is forwarded to the Egress Port with very
short latency. When the rest of the TLP arrives at the Switch, there is
no error, so an Ack is returned to the TLP source, which then purges
this TLP from its Replay Buffer. This time the Switch Egress Port keeps
a copy of the TLP in its Replay Buffer. When the TLP reaches the
destination, the packet has no errors and the Endpoint returns an Ack.
Based on that, the Switch purges the copy of the TLP from its Replay
Buffer and the sequence is complete.

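Figure 10-16: Switch Cut-Through Mode Showing Error Handling
[Diagram: a TLP framed by STP and END enters the Switch after an error
occurs on the left Link (1, 2); the Switch returns a NAK to the source
(3) and sends the TLP on to the Endpoint terminated with EDB (4); the
Endpoint discards the packet (5) and returns no ACK or NAK (6).]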


Part Four: Physical Layer

11 Physical Layer - Logical (Gen1 and Gen2)
The Previous Chapter
The previous chapter describes the Ack/Nak Protocol: an automatic,
hardware-based mechanism for ensuring reliable transport of TLPs across
the Link. Ack DLLPs confirm good reception of TLPs while Nak DLLPs
indicate a transmission error. The chapter describes the normal rules
of operation as well as error recovery mechanisms.

This Chapter
This chapter describes the Logical sub-block of the Physical Layer.
This prepares packets for serial transmission and recovery. Several
steps are needed to accomplish this and they are described in detail.
This chapter covers the logic associated with the Gen1 and Gen2
protocol that use 8b/10b encoding. The logic for Gen3 does not use
8b/10b encoding and is described separately in the chapter called
"Physical Layer - Logical (Gen3)" on page 407.

The Next Chapter


The next chapter describes the Physical Layer characteristics for the
third generation (Gen3) of PCIe. The major change includes the ability
to double the bandwidth relative to Gen2 without needing to double the
frequency by eliminating the need for 8b/10b encoding. More robust
signal compensation is necessary at Gen3 speed. Making these changes is
more complex than might be expected.


Physical Layer Overview


This Physical Layer Overview introduces the relationships between the
Gen1, Gen2 and Gen3 implementations. Thereafter the focus is the
logical Physical Layer implementation associated with Gen1 and Gen2.
The logical Physical Layer implementation for Gen3 is described in the
next chapter.

The Physical Layer resides at the bottom of the interface between the
external physical link and the Data Link Layer. It converts outbound
packets from the Data Link Layer into a serialized bit stream that is
clocked onto all Lanes of the Link. This layer also recovers the bit
stream from all Lanes of the Link at the receiver. The receive logic
de-serializes the bits back into a Symbol stream, re-assembles the
packets, and forwards TLPs and DLLPs up to the Data Link Layer.

Figure 11-1: PCIe Port Layers
[Diagram: transmit and receive paths through a Port. Software layer:
sends and receives address and transaction information. Transaction
layer: TLP (Header, Data Payload, ECRC), Flow Control, Virtual Channel
Management, Ordering, transmit and receive buffers per VC. Data Link
layer: Link Packet (Sequence, TLP, LCRC), DLLPs such as Ack/Nak with
CRC, TLP Retry Buffer, TLP Error Check, Mux and De-mux. Physical layer:
Physical Packet (Start, Link Packet, End), Encode/Decode,
Parallel-to-Serial and Serial-to-Parallel, Link Training, Differential
Driver and Receiver.]


The contents of the layers are conceptual and don't define precise
logic blocks, but to the extent that designers do partition them to
match the spec their implementations can benefit, because the
constantly increasing data rates affect the Physical Layer more than
the others. Partitioning a design by layered responsibilities allows
the Physical Layer to be adapted to the higher clock rates while
changing as little as possible in the other layers.

The 3.0 revision of the PCIe spec does not use specific terms to
distinguish the different transmission rates defined by the versions of
the spec. With that in mind, the following terms are defined and used
in this book.

Gen1: the first generation of PCIe (rev 1.x) operating at 2.5 GT/s
Gen2: the second generation (rev 2.x) operating at 5.0 GT/s
Gen3: the third generation (rev 3.x) operating at 8.0 GT/s

The Physical Layer is made up of two sub-blocks: the Logical part and
the Electrical part, as shown in Figure 11-2. Both contain independent
transmit and receive logic, allowing dual-simplex communication.

Figure 11-2: Logical and Electrical Sub-Blocks of the Physical Layer
[Diagram: two Physical Layers, each with Logical and Electrical
sub-blocks containing separate Tx and Rx paths, connected across the
Link by differential pairs (Tx+/Tx- to Rx+/Rx-) through CTX coupling
capacitors.]


Observation
The spec describes the functionality of the Physical Layer but is
purposefully vague regarding implementation details. Evidently, the
spec writers were reluctant to give details or example implementations
because they wanted to leave room for individual vendors to add value
with clever or creative versions of the logic. For our discussion
though, an example is indispensable, and one was chosen that
illustrates the concepts. It's important to make clear that this
example has not been tested or validated, nor should a designer feel
compelled to implement a Physical Layer in such a manner.

Transmit Logic Overview


For simplicity, let's begin with a high-level overview of the transmit
side of this layer, shown in Figure 11-3 on page 365. Starting at the
top, we can see that packet bytes entering from the Data Link Layer
first go into a buffer. It makes sense to have a buffer here because
there will be times when the packet flow from the Data Link Layer must
be delayed to allow Ordered Set packets and other items to be injected
into the flow of bytes.

For Gen1 and Gen2 operation, these injected items are control and data
characters used to mark packet boundaries and create ordered sets. To
differentiate between these two types of characters, a D/K# bit (Data
or "Kontrol") is added. The logic can see what value D/K# should take
on based on the source of the character.

The Gen3 mode of operation doesn't use control characters, so data
patterns are used to make up the ordered sets that identify if
transmitted bytes are associated with TLPs/DLLPs or Ordered Sets. A
2-bit Sync Header is inserted at the beginning of a 128-bit (16-byte)
block of data. The Sync Header informs the receiver whether the
received block is a Data Block (TLP- or DLLP-related bytes) or an
Ordered Set Block. Since there are no control characters in Gen3 mode,
the D/K# bit is not needed.



Figure 11-3: Physical Layer Transmit Details
[Diagram: from the Data Link Layer (N*8 data, Packet Boundary
Indicator, Throttle) into the Tx Buffer; a Mux selects among the Tx
Buffer, Control/Token Characters, Ordered Sets and Logical Idle; Byte
Striping distributes bytes with D/K# to Lanes 0 through N; each Lane
has a Scrambler and 8b/10b Encoder (Gen1/Gen2) or a Gen3 Scrambler and
Gen3 Sync Bits Generator, followed by Muxes, a Serializer clocked by
the Tx Local PLL (Tx Clk), and the Tx output.]

Next, the parallel data bytes coming from the upper layers are sent to
Byte Striping logic where they are spread out, or striped, onto all the
lanes of this link. One byte of the packet is transferred per lane, and
all active lanes are used for each packet going out. The Lanes of the
Link are all transmitting at the same time, so the bytes must come into
this logic fast enough to accommodate that. For example, if there are
eight Lanes, eight bytes of parallel data from the upper layers may
arrive at the byte-striping logic, allowing data to be clocked onto all
lanes simultaneously.
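
As a simple illustration of striping, the sketch below deals consecutive bytes out to the lanes in order and then reverses the process as a receiver would. The function names are hypothetical, and the example assumes the byte count is a multiple of the lane count; framing, ordered sets and padding are ignored.

```python
def stripe(data, num_lanes):
    """Deal bytes out in order: byte 0 to Lane 0, byte 1 to Lane 1, and so on."""
    return [list(data[lane::num_lanes]) for lane in range(num_lanes)]


def unstripe(lanes):
    """Receiver side: interleave the per-lane bytes back into one stream."""
    out = []
    for column in zip(*lanes):     # one byte from every lane per symbol time
        out.extend(column)
    return bytes(out)
```

On a x4 Link, stripe(bytes(range(8)), 4) places bytes 0 and 4 on Lane 0, bytes 1 and 5 on Lane 1, and so on, so each symbol time carries four bytes in parallel.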


Next is the Scrambler, which XORs a pseudo-random pattern onto the
outgoing data bytes to mix up the bits. Although it would seem that
this might introduce problems, it doesn't because the scrambling
pattern is predictable and not truly random, so the receiver can use
the same algorithm to easily recover the original data. If the
scramblers get out of step then the Receiver won't be able to make
sense of the bit stream so, to guard against that problem, the
scrambler is reset periodically (Gen1 and Gen2). That way, if the
scramblers do get out of step with each other it won't be long before
they're both re-initialized and back in step again. For Gen1 and Gen2
modes that re-initialization happens whenever the COM character is
detected. For Gen3 mode, it happens whenever an EIEOS ordered set is
seen. A more sophisticated 24-bit based scrambler is utilized in Gen3
mode, hence the alternate path through the Gen3 scrambler, as depicted
in Figure 11-3 on page 365.
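
The idea that the same operation both scrambles and descrambles, and that a reset brings two scramblers back into step, can be demonstrated with a generic LFSR. This is a sketch of the concept only: the 16-bit register, seed, taps and bit ordering below are illustrative assumptions and are not the PCIe-defined scrambler.

```python
class Scrambler:
    """Toy XOR scrambler driven by a 16-bit Galois LFSR (illustrative taps)."""

    SEED = 0xFFFF

    def __init__(self):
        self.reset()

    def reset(self):
        """Re-initialize, as would happen when a COM character is detected."""
        self.state = self.SEED

    def _next_byte(self):
        out = 0
        for _ in range(8):
            lsb = self.state & 1
            out = (out << 1) | lsb
            self.state >>= 1
            if lsb:
                self.state ^= 0xB400   # illustrative taps, not the PCIe polynomial
        return out

    def process(self, data):
        """XOR data with the pseudo-random stream (scrambles or descrambles)."""
        return bytes(b ^ self._next_byte() for b in data)
```

If the transmitter's and receiver's instances start from the same seed, rx.process(tx.process(msg)) returns msg. Advance one side by an extra byte and the output becomes garbage until both are reset.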
For Gen1 and Gen2 mode, the scrambled 8-bit characters are then encoded
for transmission by the 8b/10b Encoder. Recall that a Character is an
8-bit un-encoded byte, while a Symbol is the 10-bit encoded output of
the 8b/10b logic. There are several advantages to 8b/10b encoding, but
it does add overhead.
For Gen3, a separate path is shown bypassing the encoder. In other
words, scrambled bytes of a packet are transmitted without 8b/10b
encoding. The Sync Bit Generator adds a 2-bit Sync Header prior to
every 16-byte block of a packet. The added 2-bit Sync Header identifies
the following 16-byte block to be either a data block or an ordered set
block. This addition of a 2-bit Sync Header every 16 bytes (128 bits)
is the basis of Gen3's 128b/130b encoding scheme.
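
A quick calculation shows why dropping 8b/10b matters. The sketch below computes the per-Lane data rate left after encoding overhead for each generation; the helper name is hypothetical, and packet framing, DLLP traffic and other protocol overheads are ignored.

```python
def payload_rate_gbps(gt_per_s, data_bits, total_bits):
    """Per-Lane data rate after encoding overhead, in Gb/s."""
    return gt_per_s * data_bits / total_bits


gen1 = payload_rate_gbps(2.5, 8, 10)      # 8b/10b: 8 data bits per 10 sent
gen2 = payload_rate_gbps(5.0, 8, 10)
gen3 = payload_rate_gbps(8.0, 128, 130)   # 128b/130b: 2-bit Sync Header per block
```

Gen1 and Gen2 come out at 2.0 and 4.0 Gb/s per Lane, while Gen3 delivers roughly 7.88 Gb/s: close to double Gen2 even though the transfer rate rose only from 5.0 to 8.0 GT/s.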
Finally, the Symbols are serialized into a bit stream and forwarded to
the electrical sub-block of the Physical Layer and transmitted to the
other end of the link.

Receive Logic Overview


Figure 11-4 on page 367 shows the key elements that make up the
receiver logic. The process described below is performed for each lane.
Starting at the bottom this time, the first thing to mention is the
receiver Clock and Data Recovery (CDR). The first step in this process
is to recover the clock based on transitions in the incoming bit
stream. This recovered clock faithfully reproduces the Transmitter's
clock that was used to send the data and is used to latch the incoming
bits into a deserializing buffer.

The next steps in the CDR process are to find the Gen1/Gen2 Symbol
boundaries and divide the recovered clock by 10 to latch the 10-bit
Symbols into the Elastic Buffer. For Gen3, the next step is to acquire
Block Lock and then latch the 8-bit Symbols associated with each of the
16 bytes in the block into the Elastic Buffer (more on this in the next
chapter).


Logic controlling the Elastic Buffer adjusts for minor clock variations
between the recovered clock and the local clock of the receiver by
adding or removing SKP Symbols as needed when an SOS (SKP Ordered Set)
is detected. Finally, the Receiver's local clock moves each Symbol out
of the Elastic Buffer.

Figure 11-4: Physical Layer Receive Logic Details
[Diagram: for each of Lanes 0 through N, the Rx input feeds CDR Logic,
then the 8b/10b Decoder and De-Scrambler (Gen1/Gen2) or the Gen3
De-Scrambler with Gen3 Block Type detection, selected by a Mux; Byte
Un-Striping combines the Lanes (N*8 data with D/K# and Block Type);
Packet Filtering passes data and a TLP/DLLP Indicator to the Rx Buffer
and on to the Data Link Layer.]

Using the 8b/10b Decoder, Gen1/Gen2 Symbols are decoded, thus
converting the 10-bit symbols to 8-bit characters. The descrambler
applies the same scrambling method used at the transmitter to recover
the original data. Finally, the bytes from each Lane are un-striped to
form a byte stream that will be forwarded up to the Data Link Layer.
Only TLPs and DLLPs are loaded into the receive buffer and sent to the
Data Link Layer.


Transmit Logic Details (Gen1 and Gen2 Only)


This section provides more detail associated with the steps identified
in the previous section. Refer to the block diagram in Figure 11-5 on
page 369 during this discussion.

Tx Buffer
Starting from the top of the diagram once again, the buffer accepts
TLPs and DLLPs from the Data Link Layer, along with "Control"
information that specifies when a new packet begins. As mentioned, the
buffer allows us to stall the flow of characters from time to time in
order to insert control characters and ordered sets. A "throttle"
signal is also shown going back up to the Data Link Layer to stop the
flow of characters if the buffer should become full.

Mux and Control Logic


The multiplexer, shown in Figure 11-6 on page 370, is used to insert
special control (K) characters into the data flow coming from the
buffer. Only the Physical Layer uses K control characters; they are
inserted during transmission and removed at the receiver. The four
different inputs to the mux are:

Transmit Data Buffer. When the Data Link Layer supplies a packet, the
mux gates the character stream through. All of the characters coming
from the buffer are D characters, so the D/K# signal is driven high
when Tx Buffer contents are gated.

Start and End characters. These Control characters are added to the
start and end of every TLP and DLLP (see Figure 11-7 on page 371) and
allow a receiver to readily detect the boundaries of a packet. There
are two Start characters: STP indicates the start of a TLP, while SDP
indicates the start of a DLLP. An indicator from the Data Link Layer,
along with the packet type, determines what type of framing character
to insert. There are also two end characters, the End Good character
(END) for normal transmission, and the End Bad character (EDB) to
handle some error cases. Start and End characters are K characters, so
the D/K# signal is driven low when the Start and End characters are
inserted (see Table 11-1 on page 386 for a list of Control characters).


Figure 11-5: Physical Layer Transmit Logic Details (Gen1 and Gen2 Only)
[Diagram: from the Data Link Layer (N*8 data, Packet Boundary
Indicator, Throttle) into the Tx Buffer; a Mux selects among the Tx
Buffer, Control/Token Characters, Ordered Sets and Logical Idle; Byte
Striping distributes bytes with D/K# to Lanes 0 through N; each Lane
has a Scrambler, 8b/10b Encoder and Serializer clocked by the Tx Local
PLL (Tx Clk), followed by the Tx output.]

• Ordered Sets. As mentioned earlier, control characters are only used by
  the Physical Layer and are not seen by the higher layers. Some
  communication across the Link is necessary to initiate and maintain Link
  operation, and that is accomplished by exchanging Ordered Sets. Every
  ordered set starts with a K character called a comma (COM), and contains
  other K or D characters depending on the type of Ordered Set being
  delivered. Ordered Sets are always aligned on four-byte boundaries and are
  transmitted during a variety of circumstances including:
  - Error recovery, initiating events (such as Hot Reset), or exit from
    low-power states. In these cases, the Training Sequence 1 and 2 (TS1 and
    TS2) ordered sets are exchanged across the Link.
  - At periodic intervals, the mux inserts the SKIP ordered set pattern to
    facilitate clock tolerance compensation in the receiver. For a detailed
    description of this process, refer to "Clock Compensation" on page 391.


  - When a device wants to place its transmitter in the Electrical Idle
    state, it must inform the remote receiver at the other end of the Link.
    The mux inserts an Electrical Idle ordered set to accomplish this.
  - When a device wants to change the Link power state from the L0s
    low-power state to the L0 full-on power state, it sends a set of Fast
    Training Sequence (FTS) ordered sets to the receiver. The receiver uses
    these ordered sets to resynchronize its PLL to the transmitter clock.
• Logical Idle Sequence. When there are no packets ready to transmit and no
  ordered sets to send, the link is logically idle. In order to keep the
  receiver PLL locked onto the transmitter's frequency, it's important that
  the transmitter keep sending something, so Logical Idle characters are
  inserted for that case. Logical Idle is very simple, and consists of
  nothing more than a string of Data 00h characters.

Figure 11-6: Transmit Logic Multiplexer

[Diagram: detail of the Mux from Figure 11-5. Its four inputs are the Tx
Buffer (D characters); the STP, SDP, END and EDB framing characters (K
characters); Ordered Sets such as TS1, TS2, SKIP and Electrical Idle (K and
D characters); and Logical Idle (D characters). The Mux output is N*8 data
bits plus D/K#.]


Figure 11-7: TLP and DLLP Packet Framing with Start and End Control
Characters

Transaction Layer Packet (TLP):
  STP | Sequence | Header | Data Payload | ECRC | LCRC | END
  (STP and END are K characters; everything between them is D characters)

Data Link Layer Packet (DLLP):
  SDP | DLLP Type | Misc. | CRC | END
  (SDP and END are K characters; everything between them is D characters)

Byte Striping (for Wide Links)


The next step shown in our example is Byte Striping, although this is only
needed if the port implements more than one Lane (called a wide Link).
Striping means that each consecutive outbound character in a character
stream is routed onto consecutive Lanes. The number of Lanes used is
configured during the Link training process based on what is supported by
both devices that share the Link.

Three examples of byte striping are illustrated in the following diagrams.
In Figure 11-8 on page 372, a single-lane link (x1) is shown. This is not a
very interesting case, since the packet enters the Physical Layer a byte at
a time and goes out the same way, but it illustrates the way the sequence of
characters will be drawn.

Figure 11-9 on page 372 shows a four-lane (x4) link receiving Dword packets
from the multiplexer. Each byte of a Dword is directed to the corresponding
lane. Finally, Figure 11-10 on page 373 illustrates an eight-lane (x8) link.
In this example, two Dwords are required to populate all 8 lanes. This
requires the Dwords to arrive at twice the rate of the previous example. The
format of the data being sent across each lane is described in the sections
that follow.
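
The striping rule can be sketched in a few lines of Python. This is only an
illustration of the round-robin idea described above, not hardware-accurate
logic; the function name stripe and the use of integers as stand-ins for
characters are assumptions made for the example.

```python
def stripe(characters, num_lanes):
    """Route consecutive characters onto consecutive Lanes (round-robin).

    Character i goes to Lane i % num_lanes. Returns one list per Lane.
    """
    lanes = [[] for _ in range(num_lanes)]
    for index, character in enumerate(characters):
        lanes[index % num_lanes].append(character)
    return lanes


# Characters 0..15 striped across a x4 Link:
# Lane 0 carries 0, 4, 8, 12; Lane 1 carries 1, 5, 9, 13; and so on.
x4_lanes = stripe(list(range(16)), 4)
```

With num_lanes set to 1 the stream passes through unchanged, which matches
the x1 case.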


Figure 11-8: x1 Byte Striping

[Diagram: the packet byte stream from the Mux block (Character 0, Character
1, Character 2, ...), 8 bits plus D/K#, passes through x1 Byte Striping
unchanged and goes to the single Scrambler.]

Figure 11-9: x4 Byte Striping

[Diagram: the packet Dword stream from the Mux block arrives four characters
wide (Characters 0-3, then 4-7, 8-11, 12-15), each byte with its own D/K#.
Characters 0, 4, 8, 12 go to the Lane 0 Scrambler; 1, 5, 9, 13 to Lane 1;
2, 6, 10, 14 to Lane 2; and 3, 7, 11, 15 to Lane 3.]


Figure 11-10: x8 Byte Striping with DWord Parallel Data

[Diagram: Dwords arrive from the Mux block four characters wide (Characters
0-3, 4-7, ..., 20-23). x8 Byte Striping spreads two Dwords across the eight
Lanes: Characters 0, 8, 16 go to the Lane 0 Scrambler, Characters 1, 9, 17
to Lane 1, and so on through Characters 7, 15, 23 on Lane 7.]

Packet Format Rules


General Rules
• The total packet length (including Start and End characters) of each
  packet is always a multiple of four characters. This is a natural
  extension of the fact that the data length is measured in dwords.
• TLPs start with the STP character and finish with either an END or EDB
  character.
• DLLPs start with SDP, terminate with the END character, and are exactly 8
  characters long (SDP + 6 characters + END).
• STP and SDP characters are placed on Lane 0 when starting the
  transmission of a packet after the transmission of Logical Idles. In other
  cases, they may start on a Lane number divisible by 4.
• The receiver's Physical Layer is allowed to watch for violation of these
  rules and may report them as Receiver Errors to the Data Link Layer.
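
As a rough check of the length rule, the framing can be sketched as below.
The helper names are hypothetical, and the example TLP assumes a 2-byte
sequence number, a 3-dword header, no payload, and a 4-byte LCRC; those
sizes come from general PCIe knowledge rather than from this section.

```python
def frame_tlp(d_characters, bad=False):
    """Wrap a TLP's D characters (sequence number through LCRC) in framing."""
    return ["STP"] + list(d_characters) + ["EDB" if bad else "END"]


def frame_dllp(d_characters):
    """Wrap a DLLP's six D characters in SDP ... END."""
    if len(d_characters) != 6:
        raise ValueError("a DLLP carries exactly 6 characters between SDP and END")
    return ["SDP"] + list(d_characters) + ["END"]


def obeys_length_rule(framed_packet):
    """True if the framed packet length is a multiple of four characters."""
    return len(framed_packet) % 4 == 0


# Assumed sizes: 2 (sequence) + 12 (3-dword header) + 4 (LCRC) = 18 D characters.
tlp = frame_tlp([0x00] * 18)
dllp = frame_dllp([0x00] * 6)
```

Because the header, payload, ECRC and LCRC are all whole dwords, the
sequence number plus the two framing characters contribute the remaining
four characters, so the total stays a multiple of four.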


Example: x1 Format
The example shown in Figure 11-11 on page 374 illustrates the format of
packets transmitted over a x1 link (a link with only one lane operational).
A sequence of packets is shown interspersed with one SKIP Ordered Set.
Logical Idles are shown at the end to represent the case when the
transmitter has no more packets to send and uses idle characters as filler.

Figure 11-11: x1 Packet Format

[Diagram: time-ordered character sequence on Lane 0 showing TLPs (STP ...
END) and DLLPs (SDP ... END), one SKIP Ordered Set (COM, SKP, SKP, SKP), and
finally Idle (00h) characters.]

x4 Format Rules
• STP and SDP characters are always sent on Lane 0.
• END and EDB characters are always sent on Lane 3.
• When an ordered set such as the SKIP is sent, it must appear on all lanes
  simultaneously.
• When Logical Idles are transmitted, they must be sent on all lanes
  simultaneously.
• Any violation of these rules may be reported as a Receiver Error to the
  Data Link Layer.


Example x4 Format
The example shown in Figure 11-12 on page 375 illustrates the format of
packets sent over a x4 Link (link with four data lanes operational). The
illustration shows one TLP followed by a SKIP ordered set transmitted on all
Lanes for receiver clock compensation. Next is a DLLP, followed by Logical
Idle on all lanes. This example highlights that the packets are always
multiples of 4 characters because the start character always appears in lane
0 and the end character is always in lane 3. It also illustrates that
ordered sets must appear on all the lanes simultaneously.

Figure 11-12: x4 Packet Format

[Diagram: time-ordered rows across Lanes 0-3. A TLP begins with STP on Lane
0 followed by the Sequence number, and ends with the LCRC and END on Lane 3.
Next a SKP Ordered Set (COM, SKP, SKP, SKP) appears on all four Lanes, then
a DLLP (SDP on Lane 0 through END on Lane 3), and finally Logical Idle,
Idle (00h), on all Lanes.]


Large Link-Width Packet Format Rules


The following rules apply when a packet is transmitted over a x8, x12, x16,
or x32 Link:

• STP/SDP characters are always sent on Lane 0 when transmission starts
  after a period during which Logical Idles are transmitted. After that,
  they may only be sent on Lane numbers divisible by 4 when sending
  back-to-back packets (Lane 4, 8, 12, etc.).
• END/EDB characters are sent on Lane numbers that are one less than a
  multiple of 4 (Lane 3, 7, 11, etc.).
• If a packet doesn't end on the last Lane of the Link and there are no more
  packets ready to go, PAD Symbols are used as filler on the remaining lane
  numbers. Logical Idle can't be used for this purpose because it must
  appear on all Lanes at the same time.
• Ordered sets must be sent on all lanes simultaneously.
• Similarly, logical idles must be sent on all lanes when they are used.
• Any violation of these rules may be reported as a Receiver Error to the
  Data Link Layer.

x8 Packet Format Example


The example shown in Figure 11-13 on page 377 illustrates the format of
packets transmitted over a x8 link. The illustration shows a TLP followed by
a SKIP ordered set, a DLLP, and finally a TLP that ends on Lane 3. At that
point, the transmitter has no more packets ready to send but the current
packet doesn't extend to include all the available lanes. One might expect
the extra lanes to be filled with Logical Idle, but it won't work here
because idles must appear on all lanes at the same time. So another fill
character is needed, and the spec writers chose to use the PAD control
character here. The only other place that PAD is used is during the training
process. Finally, since there are still no more packets to send, Logical
Idles are sent on all the lanes.
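
The PAD rule can be illustrated with a small layout sketch. It assumes that
packets are already framed and are multiples of four characters long, and it
uses strings as stand-ins for Symbols; lay_out is a hypothetical helper, not
something defined by the spec.

```python
def lay_out(packets, num_lanes):
    """Place back-to-back framed packets on a wide Link, one row per Symbol time.

    If the last packet does not reach the last Lane, the rest of that row
    is filled with PAD, since Logical Idle must occupy all Lanes at once.
    """
    symbols = [symbol for packet in packets for symbol in packet]
    leftover = len(symbols) % num_lanes
    if leftover:
        symbols += ["PAD"] * (num_lanes - leftover)
    return [symbols[i:i + num_lanes] for i in range(0, len(symbols), num_lanes)]


# A 12-character TLP on a x8 Link ends on Lane 3; Lanes 4-7 carry PAD.
tlp = ["STP"] + ["D"] * 10 + ["END"]
rows = lay_out([tlp], 8)
rows.append(["IDLE"] * 8)  # Logical Idle then goes out on every Lane
```

Because every packet is a multiple of four characters and the Link width is
a multiple of four Lanes, each start character in this sketch lands on a Lane
number divisible by 4 without any extra logic.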


Figure 11-13: x8 Packet Format

[Diagram: time-ordered rows across Lanes 0-7. After Idle (00h) on all Lanes,
a TLP starts with STP on Lane 0 and ends with the LCRC and END. A SKP
Ordered Set (COM, SKP, SKP, SKP) is then sent on all eight Lanes, followed
by a DLLP (SDP ... END). A second TLP ends with END on Lane 3, and PAD fills
Lanes 4-7. Idle (00h) then resumes on all Lanes.]

Scrambler
The next step in our example is scrambling, as shown in Figure 11-5 on page
369, which is intended to prevent repetitive patterns in the data stream.
Repetitive patterns create "pure tones" on the link, meaning energy
concentrated at a consistent frequency set by the pattern, which generates
more noise (EMI) than usual. Reducing this problem by spreading the energy
over a wider frequency range is the primary goal of scrambling. In addition,
though, scrambled transmission on one Lane also reduces interference with
adjacent Lanes on a wide Link. This "spatial frequency decorrelation," or
reduction of crosstalk noise, helps the receiver on each lane to distinguish
the desired signal.


To help the receiver maintain synchronization with the scrambled sequence,
control characters do not get scrambled and are thus recognizable even if
the scramblers get out of sync. In addition, each COM control character
(K28.5) reinitializes the scramblers on both ends of the Link and thus
resynchronizes them.

Scrambler Algorithm
The scrambler described in the spec is shown in Figure 11-14 on page 378.
It's made of a 16-bit Linear Feedback Shift Register (LFSR) with feedback
points that implement the following polynomial:

G(X) = X^{16} + X^5 + X^4 + X^3 + 1

Figure 11-14: Scrambler

[Diagram: a 16-bit LFSR with XOR feedback taps implementing G(X), operating
at the bit rate (2.5 or 5.0 GHz). On each Byte Clock (250 or 500 MHz), eight
LFSR output bits (Scr) are XORed with the data character bits
[H G F E D C B A] to produce the Scrambler output.]

The LFSR is clocked at 8 times the frequency of the clock feeding the data
bytes, and its output is clocked into an 8-bit register that is XORed with
the 8-bit data characters to form the scrambled data output.


Some Scrambler implementation rules:


• On a multi-Lane Link implementation, Scramblers associated with each Lane
  must operate in concert, maintaining the same simultaneous value in each
  LFSR.
• Scrambling is applied to D characters only, meaning those associated with
  TLPs and DLLPs and the Logical Idle (00h) characters. However, those D
  characters that are within the TS1 and TS2 ordered sets are not scrambled.
• Scrambling is never applied to K characters and characters within ordered
  sets, such as TS1, TS2, SKIP, FTS and Electrical Idle. These characters
  bypass the scrambler logic. One reason for this is to ensure they'll still
  be recognizable by the receiver even if the scramblers somehow get out of
  sequence.
• Compliance Pattern characters (used for testing) are also not scrambled.
• The COM character, a control character that does not get scrambled, is
  used to reinitialize the LFSR to FFFFh at both the transmitter and
  receiver.
• Except for the COM character, the LFSR normally will serially advance
  eight times for every D or K character sent, but it does not advance on
  SKP characters associated with the SKIP ordered set. The reason is that a
  receiver may add or delete SKP Symbols to perform clock tolerance
  compensation. Changing the number of characters in the receiver compared
  to the number that were sent would cause the value in the receiver LFSR to
  lose synchronization with the transmitter LFSR value if they were not
  ignored.
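
A bit-serial software model may make these rules easier to follow. The
sketch below is an illustration only, not the spec's reference code. It
assumes the LFSR shifts toward bit 15 with feedback into bits 5, 4, 3 and 0,
and that the first bit shifted out is XORed with data bit 0 (bit A); both
conventions should be checked against the spec before relying on them. The
class and parameter names are hypothetical.

```python
COM = 0xBC  # K28.5, per Table 11-1
SKP = 0x1C  # K28.0, per Table 11-1


class Scrambler:
    """Per-Lane scrambler sketch for G(X) = X^16 + X^5 + X^4 + X^3 + 1."""

    def __init__(self):
        self.lfsr = 0xFFFF

    def _advance(self):
        """Shift the LFSR once and return the bit that left bit 15."""
        out = (self.lfsr >> 15) & 1
        self.lfsr = (self.lfsr << 1) & 0xFFFF
        if out:
            self.lfsr ^= 0x0039  # feedback into bits 5, 4, 3 and 0
        return out

    def process(self, char, is_k=False, in_ts=False):
        """Return the character to send; in_ts marks TS1/TS2 D characters."""
        if is_k and char == COM:
            self.lfsr = 0xFFFF       # COM reinitializes the LFSR
            return char
        if is_k and char == SKP:
            return char              # SKP: LFSR does not advance
        key = 0
        for bit in range(8):         # eight serial shifts per character
            key |= self._advance() << bit
        if is_k or in_ts:
            return char              # LFSR advanced, character bypasses XOR
        return char ^ key


scrambler = Scrambler()
idle = [scrambler.process(0x00) for _ in range(4)]  # scrambled Logical Idle
```

Since descrambling is the same XOR, a receiver could reuse the same model as
long as its LFSR sees the same COM, SKP and character sequence.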

Disabling Scrambling
Scrambling is enabled by default, but the spec allows it to be disabled for
test and debug purposes. That's because testing may require control of the
exact bit pattern sent and, since the hardware handles scrambling, there's
no reasonable way for the software to be able to force a specific pattern.
No specific software mechanism is defined by which to instruct the Physical
Layer to disable scrambling, so this has to be a design-specific
implementation.

If scrambling is disabled by a device, this gets communicated to the
neighboring device by sending at least two TS1s and TS2s that have the
appropriate bit set in the control field as described in "Configuration
State" on page 539. In response, the neighboring device also disables its
scrambling.


8b/10b Encoding
General
The first two generations of PCIe use 8b/10b encoding. Each Lane implements
an 8b/10b Encoder that translates the 8-bit characters into 10-bit Symbols.
This coding scheme was patented by IBM in 1984 and is widely used in many
serial transports today, such as Gigabit Ethernet and Fibre Channel.

Motivation
Encoding accomplishes several desirable goals for serial transmission. Three
of the most important are listed here:

• Embedding a Clock into the Data. Encoding ensures that the data stream
  has enough edges in it to recover a clock at the Receiver, with the result
  that a distributed clock is not needed. This avoids some limitations of a
  parallel bus design, such as flight time and clock skew. It also
  eliminates the need to distribute a high-frequency clock that would cause
  other problems like increased EMI and difficult routing.
  As an example of this process, Figure 11-15 on page 381 shows the encoding
  results of the data byte 00h. As can be seen, this 8-bit character that
  had no transitions converts to a 10-bit Symbol with 5 transitions. 8b/10b
  encoding guarantees enough edges to limit the run length (sequence of
  consecutive ones or zeros) in the bit stream to no more than 5 consecutive
  bits under any conditions.
• Maintaining DC Balance. PCIe uses an AC-coupled link, placing a capacitor
  serially in the path to isolate the DC part of the signal from the other
  end of the Link. This allows the Transmitter and Receiver to use different
  common-mode voltages and makes the electrical design easier for cases
  where the path between them is long enough that they're less likely to
  have exactly the same reference voltages. That DC value, or common-mode
  voltage, can change during run time because the line charges up when the
  signal is driven. Normally, the signal changes so quickly that there isn't
  time for this to cause a problem but, if the signal average is
  predominantly one level or the other, the common-mode value will appear to
  drift. Referred to as "DC Wander", this drifting voltage degrades signal
  integrity at the Receiver. To compensate, the 8b/10b encoder tracks the
  "disparity" of the last Symbol that was sent. Disparity, or inequality,
  simply indicates whether the previous Symbol had more ones than zeros
  (called positive disparity), more zeros than ones (negative disparity), or
  a balance of ones and zeros (neutral disparity). If the previous Symbol
  had negative disparity, for example, the next one should balance that by
  using more ones.
• Enhancing Error Detection. The encoding scheme also facilitates the
  detection of transmission errors. For a 10-bit value, 1024 codes are
  possible, but the character to be encoded only has 256 unique codes. To
  maintain DC balance the design uses two codes for each character, and
  chooses which one based on the disparity of the last Symbol that was sent,
  so 512 codes would be needed. However, many of the neutral disparity
  encodings have the same values (D28.5 is one example), so not all 512 are
  used. As a result, more than half the possible encodings are not used and
  will be considered illegal if seen at a Receiver. If a transmission error
  does change the bit pattern of a Symbol, there's a good chance the result
  would be one of these illegal patterns that can be recognized right away.
  For more on this see the section titled, "Disparity" on page 383.

The major disadvantage of 8b/10b encoding is the overhead it requires. The
actual transmission performance is degraded by 20% from the Receiver's point
of view because 10 bits are sent for each byte, but only 8 useful bits are
recovered at the receiver. This is a non-trivial price to pay but is still
considered acceptable to gain the advantages mentioned.

Figure 11-15: Example of 8-bit Character 00h Encoding

8b Value (Data 00h):  00000000
10b Encoded Value:    0110001011

Properties of 10-bit Symbols


As described in the literature on 8b/10b coding, the design isn't strictly
8 bits to 10 bits. Instead, it's really a 5-to-6 bit encoding followed by a
3-to-4 bit encoding. The sub-blocks are internal to the design but their
existence helps to explain some of the properties for a legal Symbol, as
listed below. A Symbol that doesn't follow these properties is considered
invalid.


• The bit stream never contains more than five continuous 1s or 0s, even
  from the end of one Symbol to the beginning of the next.
• Each 10-bit Symbol contains:
  - Four 0s and six 1s (not necessarily contiguous), or
  - Six 0s and four 1s (not necessarily contiguous), or
  - Five 0s and five 1s (not necessarily contiguous).
• Each 10-bit Symbol is subdivided into two sub-blocks: the first is six
  bits wide and the second is four bits wide.
  - The 6-bit sub-block contains no more than four 1s or four 0s.
  - The 4-bit sub-block contains no more than three 1s or three 0s.
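
These properties translate directly into a simple screening function. The
sketch below checks only the properties listed above for a single Symbol,
written as a 10-character string in transmission order (abcdei fghj).
Passing the check is necessary but not sufficient for a Symbol to be a legal
8b/10b code, and the cross-Symbol run length is not examined. The function
names are assumptions made for the example.

```python
def longest_run(bits):
    """Length of the longest run of identical characters in a non-empty string."""
    longest = run = 1
    for previous, current in zip(bits, bits[1:]):
        run = run + 1 if current == previous else 1
        longest = max(longest, run)
    return longest


def has_legal_shape(symbol):
    """Screen one 10-bit Symbol against the listed properties."""
    if len(symbol) != 10 or set(symbol) - {"0", "1"}:
        return False
    six, four = symbol[:6], symbol[6:]
    return (
        symbol.count("1") in (4, 5, 6)
        and longest_run(symbol) <= 5
        and six.count("1") <= 4 and six.count("0") <= 4
        and four.count("1") <= 3 and four.count("0") <= 3
    )
```

For instance, has_legal_shape("0011111010") accepts one of the K28.5
encodings, while "1111110000" is rejected because its 6-bit sub-block holds
six 1s.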

Character Notation
The 8b/10b scheme uses a special notation shorthand, and Figure 11-16 on
page 382 illustrates the steps to arrive at the shorthand for a given
character:

1. Partition the character into its 3-bit and 5-bit sub-blocks.
2. Transpose the position of the sub-blocks.
3. Create the decimal equivalent for each sub-block.
4. The character takes the form Dxx.y for Data characters, or Kxx.y for
   Control characters. In this notation, xx is the decimal equivalent of the
   5-bit field, and y is the decimal equivalent of the 3-bit field.
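
The four steps reduce to a couple of shifts and masks. The helper names
below are hypothetical; the logic simply follows the steps listed above and
also shows the reverse conversion from a name back to a byte value.

```python
def char_name(byte, is_k=False):
    """Return the Dxx.y / Kxx.y shorthand for an 8-bit character."""
    y = (byte >> 5) & 0x07   # upper 3-bit sub-block (bits H G F)
    xx = byte & 0x1F         # lower 5-bit sub-block (bits E D C B A)
    return f"{'K' if is_k else 'D'}{xx}.{y}"


def char_value(name):
    """Return the 8-bit value for a name such as 'D10.3' or 'K28.5'."""
    xx, y = name[1:].split(".")
    return (int(y) << 5) | int(xx)


print(char_name(0x6A))              # D10.3
print(char_name(0xBC, is_k=True))   # K28.5
print(hex(char_value("K28.0")))     # 0x1c
```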

Figure 11-16: 8b/10b Nomenclature

Step                           8b Designation             Example Data (6Ah)
8b Character                   D/K#  7 6 5 4 3 2 1 0      D  01101010
Partition into sub-blocks      D/K#  H G F  E D C B A     D  011 01010
Flip sub-blocks                D/K#  E D C B A  H G F     D  01010 011
Convert sub-blocks to decimal  D/K   xx . y               D  10 . 3
Final Notation                 D/Kxx.y                    D10.3


Disparity
Definition. Disparity refers to the inequality between the number of ones
and zeros within a 10-bit Symbol and is used to help maintain DC balance on
the link. A Symbol with more zeros is said to have a negative (-) disparity,
while a Symbol with more ones has a positive (+) disparity. When a Symbol
has an equal number of ones and zeros, it's said to have a neutral
disparity. Interestingly, most characters encode into Symbols with + or -
disparity, but some only encode into Symbols with neutral disparity.

CRD (Current Running Disparity). The CRD is the information as to the
current state of disparity on the link. Since it's just a single bit it can
only be positive or negative, and it doesn't always change when the next
Symbol is sent out. To see how it works, remember that the next Symbol can
have negative, neutral, or positive disparity, then consider the following
example. If the CRD was positive, an outgoing Symbol with a negative
disparity would change it to negative, a neutral disparity would leave it as
positive, and a positive disparity would be an error because the CRD is only
one bit and can't be made more positive.

The initial state of the CRD (before any characters are transmitted) may not
match between the sender and receiver but it turns out that it doesn't
matter. When the receiver sees the first Symbol after training is complete,
it will check for a disparity error and, if one is found, just change the
CRD. This won't be considered an error but simply an adjustment of the CRD
to match the receiver and sender. After that, there are only two legal CRD
cases: it can remain the same if the new Symbol has neutral disparity, or it
can flip to the opposite polarity if the new Symbol has the opposite
disparity. What is not legal is for the disparity of the new Symbol to be
the same as the CRD. Such an event would be a disparity error and should
never occur after the initial adjustment unless an error has occurred.
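
A receiver-side view of these rules can be sketched as follows. This is an
illustrative model, not the spec's logic: it treats the CRD as unknown until
the first non-neutral Symbol arrives, which is one way to model the initial
adjustment described above. The class and function names are assumptions.

```python
def symbol_disparity(symbol):
    """Return +1, 0 or -1 for a Symbol given as a string of '0'/'1'."""
    ones = symbol.count("1")
    zeros = symbol.count("0")
    return (ones > zeros) - (ones < zeros)


class CrdChecker:
    """Track the Current Running Disparity and flag disparity errors."""

    def __init__(self):
        self.crd = None  # not yet known; adopted from the incoming stream

    def check(self, symbol):
        """Return True if the Symbol is consistent with the CRD."""
        disparity = symbol_disparity(symbol)
        if disparity == 0:
            return True              # neutral: CRD unchanged
        if self.crd is None or disparity != self.crd:
            self.crd = disparity     # initial adjustment, or legal flip
            return True
        return False                 # same polarity as CRD: disparity error


checker = CrdChecker()
results = [checker.check(s) for s in
           ("0011111010", "1100000101", "0101011100", "1100000101")]
```

In this run the first three Symbols are accepted, and the final negative
Symbol is flagged because the CRD was already negative.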

Encoding Procedure
There are different ways that 8b/10b encoding could be accomplished. The
simplest approach is probably to implement a look-up table that contains all
the possible output values. However, this table can require a comparatively
large number of gates. Another approach is to implement the encoder as a
logic block, and this is usually the preferred choice because it typically
results in a smaller and cheaper solution. The specifics of the encoding
logic are described in detail in the referenced literature, so we'll focus
here on the bigger picture of how it works instead.


An example 8b/10b block diagram is shown in Figure 11-17 on page 384. A new
outgoing Symbol is created based on three things: the incoming character,
the D/K# indication for that character, and the CRD. A new CRD value is
computed based on the outgoing Symbol and is fed back for use in encoding
the next character. After encoding, the resulting Symbol is fed to a
serializer that clocks out the individual bits. Figure 11-18 on page 385
shows some sample 8b/10b encodings that will be useful for the example that
follows.

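Figure 11-17: 8-bit to 10-bit (8b/10b) Encoder

[Block diagram: bytes from the Scrambler (bits 7-0, labeled H G F E D C B A)
and the D/K# indication feed the 8b/10b Encoding Logic along with the
Current Running Disparity (CRD). The logic produces the 10-bit Symbol (bits
j h g f i e d c b a), which feeds both the CRD Calculator and the
Serializer. The Serializer sends the serial stream to the Transmitter using
Tx Clock.]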



Figure 11-18: Example 8b/10b Encodings

[Table: for each example 8-bit character (three Data and two Control
characters), the figure lists the D or K type, the hex byte, the binary bits
(HGF EDCBA), the byte name, and the 10-bit Symbol (abcdei fghj) it encodes
to if the CRD is negative and if the CRD is positive.]

Example Transmission
Figure 11-19 illustrates the encode and transmission of three characters:
the first and second are the control character K28.5 and the third character
is the data character D10.3.

In this example the initial CRD is negative so K28.5 encodes into
001111 1010b. This Symbol has positive disparity (more ones than zeros), and
causes the CRD polarity to flip to positive. The next K28.5 is encoded into
110000 0101b and has a negative disparity. That causes the CRD this time to
flip to negative. Finally, D10.3 is encoded into 010101 1100b. Since its
disparity is neutral, the CRD doesn't change in this case but remains
negative for whatever the next character will be.
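
The same sequence can be reproduced with a two-entry lookup table. This is a
toy sketch that holds only the two characters from Figure 11-19; a real
encoder covers every D and K character. The names ENCODINGS and
encode_stream are assumptions made for the example.

```python
# name: (Symbol sent when CRD is negative, Symbol sent when CRD is positive)
ENCODINGS = {
    "K28.5": ("0011111010", "1100000101"),
    "D10.3": ("0101011100", "0101010011"),
}


def encode_stream(names, crd=-1):
    """Encode a list of character names, tracking CRD (+1 or -1)."""
    symbols = []
    for name in names:
        symbol = ENCODINGS[name][0 if crd < 0 else 1]
        ones = symbol.count("1")
        if ones != 5:                   # non-neutral Symbol flips the CRD
            crd = 1 if ones > 5 else -1
        symbols.append(symbol)
    return symbols, crd


symbols, final_crd = encode_stream(["K28.5", "K28.5", "D10.3"])
```

Starting from a negative CRD, the three Symbols come out as 0011111010,
1100000101 and 0101011100, and the CRD ends negative, matching the
walk-through above.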


Figure 11-19: Example 8b/10b Transmission

Use these two characters in the example below:

D/K#         Hex   Binary Bits  Byte    CRD -         CRD +
Character    Byte  HGF EDCBA    Name    abcdei fghj   abcdei fghj
Control (K)  BC    101 11100    K28.5   001111 1010   110000 0101
Data (D)     6A    011 01010    D10.3   010101 1100   010101 0011

Example Transmission

Character to be transmitted  K28.5 (BCh)   K28.5 (BCh)   D10.3 (6Ah)
CRD before the character     -             +             -
Bit stream transmitted       001111 1010   110000 0101   010101 1100
Symbol disparity             +             -             neutral
CRD after the character      +             -             -

Initialized value of CRD is don't care. The Receiver can determine it from
the incoming bit stream.

Control Characters
The 8b/10b encoding provides several special characters for Link management
and Table 11-1 on page 386 shows their encoding.

Table 11-1: Control Character Encoding and Definition

Character  8b/10b
Name       Name    Description

COM        K28.5   First character in any ordered set. Also used by Rx to
                   achieve Symbol lock during training.
PAD        K23.7   Packet filler
SKP        K28.0   Used in SKIP ordered set for Clock Tolerance
                   Compensation
STP        K27.7   Start of a TLP
SDP        K28.2   Start of a DLLP
END        K29.7   End of Good Packet
EDB        K30.7   End of a bad or nullified TLP
FTS        K28.1   Used to exit from L0s low power state to L0
IDL        K28.3   Used to place Link into Electrical Idle state
EIE        K28.7   Part of the Electrical Idle Exit Ordered Set sent prior
                   to bringing the Link back to full power for speeds higher
                   than 2.5 GT/s

• COM (Comma): One of the main functions of this is to be the first Symbol
  in the physical layer communications called ordered sets (see "Ordered
  sets" on page 388). It has an interesting property that makes both of its
  Symbol encodings easily recognizable at the receiver: they start with two
  bits of one polarity followed by five bits of the opposite polarity
  (001111 1010 or 110000 0101). This property is especially helpful for
  initial training, when the receiver is trying to make sense of the string
  of bits coming in, because it helps the receiver lock onto the incoming
  Symbol stream. See "Link Training and Initialization" on page 405 for more
  on how this works.
• PAD: On a multi-Lane Link, if a packet to be sent doesn't cover all the
  available lanes and there are no more packets ready to send, the PAD
  character is used to fill in the remaining Lanes.
• SKP (Skip): This is used as part of the SKIP ordered set that is sent
  periodically to facilitate clock tolerance compensation.
• STP (Start TLP): Inserted to identify the start of a TLP.
• SDP (Start DLLP): Inserted to identify the start of a DLLP.
• END: Appended to identify the end of an error-free TLP or DLLP.
• EDB (EnD Bad): Inserted to identify the end of a TLP that a forwarding
  device (such as a switch) wishes to nullify. This case can arise when a
  switch using the "cut-through mode" forwards a packet from an ingress port
  to an egress port without buffering the whole packet first. Any error
  detected during the forwarding process creates a problem because a portion
  of the packet is already being delivered before the packet can be checked
  for errors. To handle this case, the switch must cancel the one that's
  already in route to the destination. This is accomplished by nullifying
  it: ending the packet with EDB and inverting the LCRC from what it should
  have been. When a receiver sees a nullified packet, it discards the packet
  and does not return an ACK or NAK. (See the "Example of Cut-Through
  Operation" on page 356.)
• FTS (Fast Training Sequence): Part of the FTS ordered set sent by a device
  to recover a link from the L0s standby state back to the full-on L0 state.
• IDL (Idle): Part of the Electrical Idle ordered set sent to inform the
  receiver that the Link is transitioning into a low-power state.
• EIE (Electrical Idle Exit): Added in the PCIe 2.0 spec and used to help an
  electrically-idle link begin the wake-up process.

Ordered sets
General. Ordered Sets are used for communication between the Physical Layers
of Link partners and may be thought of as Lane management packets. By
definition they are a series of characters that are not TLPs or DLLPs. For
Gen1 and Gen2 they always begin with the COM character. Ordered Sets are
replicated on all Lanes at the same time, because each Lane is technically
an independent serial path. This also allows Receivers to verify alignment
and de-skewing. Ordered Sets are used for things like Link training, clock
tolerance compensation, and changing Link power states.

TS1 and TS2 Ordered Set (TS1OS/TS2OS). Training sequences one and two are
used for Link initialization and training. They allow the Link partners to
achieve bit lock and Symbol lock, negotiate the link speed, and report other
variables associated with Link operation. They are described in more detail
in the section titled "TS1 and TS2 Ordered Sets" on page 510.

Electrical Idle Ordered Set (EIOS). A Transmitter that wishes to go to a
lower-power link state sends this before ceasing transmission. Upon receipt,
Receivers know to prepare for the low power state. The EIOS consists of four
Symbols: the COM Symbol followed by three IDL Symbols. Receivers detect this
Ordered Set and prepare for the Link to go into Electrical Idle by ignoring
input errors until exiting from Electrical Idle. Shortly after sending EIOS,
the Transmitter reduces its differential voltage to less than 20 mV peak.
FTS Ordered Set (FTSOS). A Transmitter sends the proper number of these (the
minimum number was given by the Link neighbor during training) to take a
Link from the low-power L0s state back to the fully-operational L0 state.
The receiver detects the FTSs, recognizes that the Link is exiting from
Electrical Idle, and uses them to recover Bit and Symbol Lock. The FTS
Ordered Set consists of four Symbols: the COM Symbol followed by three FTS
Symbols.

SKP Ordered Set (SOS). This consists of four Symbols: the COM Symbol
followed by three SKP Symbols. It's transmitted at regular intervals and is
used for Clock Tolerance Compensation as described in "Clock Compensation"
on page 391 and "Receiver Clock Compensation Logic" on page 396. Basically,
the Receiver evaluates the SOS and internally adds or removes SKP Symbols as
needed to prevent its elastic buffer from underflowing or overflowing.

Electrical Idle Exit Ordered Set (EIEOS). Added in the PCIe 2.0 spec, this
Ordered Set was defined to provide a lower-frequency sequence required to
exit the electrical idle Link state. The EIEOS for 8b/10b encoding uses
repeated K28.7 control characters to appear as a repeating string of 5 ones
followed by 5 zeros. This low-frequency string produces a low-frequency
signal that allows for higher signal voltages that are more readily detected
at the receiver. In fact, the spec states that this pattern guarantees that
the Receiver will properly detect an exit from Electrical Idle, something
that scrambled data cannot do. For details on electrical idle exit, refer to
the section "Electrical Idle" on page 736.
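
The three simplest ordered sets share one shape: COM followed by three copies
of a single control character. The sketch below builds and recognizes them
using byte values derived from the Kxx.y names in Table 11-1; the helper
names and the dictionary are assumptions for illustration, and TS1, TS2 and
EIEOS are deliberately left out because their contents are not fully
described here.

```python
COM = 0xBC  # K28.5
SKP = 0x1C  # K28.0
FTS = 0x3C  # K28.1
IDL = 0x7C  # K28.3

SET_NAMES = {SKP: "SOS", FTS: "FTSOS", IDL: "EIOS"}


def ordered_set(control_character):
    """Build a four-Symbol ordered set: COM plus three identical K characters."""
    return [COM, control_character, control_character, control_character]


def classify(k_characters):
    """Name a four-character ordered set, or return None if unrecognized."""
    if len(k_characters) != 4 or k_characters[0] != COM:
        return None
    body = set(k_characters[1:])
    if len(body) != 1:
        return None
    return SET_NAMES.get(body.pop())


skip_set = ordered_set(SKP)
```

On a wide Link the same four-Symbol set would be sent on every Lane at the
same time.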

Serializer
The 8b/10b encoder on each lane feeds a serializer that clocks the Symbols
out in bit order (see Figure 11-17 on page 384), with the least significant
bit (a) shifted out first and the most significant bit (j) shifted out last.
For each lane, the Symbols are supplied to the serializer at either 250 MHz
or 500 MHz, and the serial bit rate is 10 times faster than that: 2.5 GHz or
5.0 GHz.

Differential Driver
The differential driver that actually sends the bit stream onto the wire
uses NRZ encoding. NRZ simply means that there are no special or
intermediate voltage levels used. Differential signalling improves signal
integrity and allows for both higher frequencies and lower voltages. Details
regarding the electrical characteristics of the driver are discussed in the
section "Transmitter Voltages" on page 462.


Transmit Clock (Tx Clock)


The serialized output on each Lane is clocked out by the Tx Clock signal. As
mentioned earlier, the clock frequency must be accurate to +/-300 ppm around
the center frequency (600 ppm total variation). There are two options
regarding the source of this clock. It can be generated internally or
derived from an external reference that may optionally be available. The
PCIe spec for peripheral cards includes the definition of a 100 MHz
reference clock supplied by the system board for this purpose. This
reference clock is multiplied internally to derive the local clock that
drives the internal logic and the Tx clock used by the serializer.

Miscellaneous Transmit Topics


Logical Idle
In order to keep the receiver's PLL from drifting, something must be
transmitted during periods when there are no TLPs, DLLPs or ordered sets to
transmit, and it is logical idle characters that are injected into the
character flow during these times. Some properties of the logical idle
character:

• It's an 8-bit Data character with a value of 00h.
• When sent, it goes on all Lanes at the same time and the Link is said to
  be in the logical idle state (not to be confused with electrical Idle, the
  state when the output driver stops transmitting altogether and the
  receiver PLL loses synchronization).
• The logical idle character is scrambled, but a receiver can distinguish it
  from other data because it occurs outside of a packet framing context
  (i.e., after an END or EDB, but before an STP or SDP).
• It is 8b/10b encoded.
• During logical idle transmission, SKIP ordered sets are still sent
  periodically.

Tx Signal Skew
Understandably, the transmitter should introduce a minimal skew between lanes
to leave as much Rx skew budget as possible for routing and other variations.
The spec lists the Tx skew values as 500 ps + 2 UI for Gen1, 500 ps + 4 UI for
Gen2, and 500 ps + 6 UI for Gen3. Recalling that UI (unit interval) represents
one bit time on the Link, this works out as shown in Table 11-2 below.


Table 11-2: Allowable Transmitter Signal Skew

Spec Version    Allowable Tx Skew
Gen1            1300 ps
Gen2            1300 ps
Gen3            1250 ps
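
The table values follow directly from the "500 ps + N UI" rule. As a minimal sketch of that arithmetic (the dictionary and function names are illustrative only, and the UI values are simply the bit times at 2.5, 5.0 and 8.0 GT/s):

```python
# Unit interval (one bit time) in picoseconds at each data rate.
UI_PS = {"Gen1": 400.0, "Gen2": 200.0, "Gen3": 125.0}

# Number of UI added to the fixed 500 ps term, as quoted from the spec above.
UI_COUNT = {"Gen1": 2, "Gen2": 4, "Gen3": 6}


def tx_skew_ps(gen):
    """Allowable transmitter Lane-to-Lane skew in picoseconds."""
    return 500.0 + UI_COUNT[gen] * UI_PS[gen]


for gen in ("Gen1", "Gen2", "Gen3"):
    print(gen, tx_skew_ps(gen), "ps")
```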

Clock Compensation
Background. High-speed serial transports like PCIe have a particular clock
problem to solve. The receiver recovers a clock from the incoming bit stream
and uses that to latch in the data bits, but this recovered clock is not
synchronized with the receiver's internal clock and at some point it has to
begin clocking the data with its own internal clock. Even if they have an
optional common external reference clock, the best they can do is to generate
an internal clock within a specified tolerance of the desired frequency.
Consequently, one of the clocks will almost always have a slightly higher
frequency than the other. If the transmitter clock is faster, the packets will
be arriving faster than they can be taken in. To compensate, the transmitter
must inject some "throwaway" characters in the bit stream that the receiver
can discard if it proves necessary to avoid a buffer overrun condition. For
PCIe, these characters which can be deleted take the form of the SKIP ordered
set, which consists of a COM character followed by three SKP characters (see
Figure 11-20). For more detail on this topic, refer to "Receiver Clock
Compensation Logic" on page 396.

SKIP ordered set Insertion Rules. A transmitter is required to send SKIP
ordered sets on a periodic basis, and the following rules apply:

• The SKIP ordered set must be scheduled for insertion once every 1180 to
  1538 Symbol times (a Symbol time is the time required to send one Symbol
  and is 10 bit times, so at 2.5 GT/s, a Symbol time is 4 ns and at 5.0 GT/s,
  it's 2 ns).
• They are only inserted on packet boundaries (nothing is allowed to
  interrupt a packet) and must go simultaneously on all Lanes. If a packet is
  already in progress the SKIP ordered set will have to wait. The maximum
  possible packet size would require more than 4096 Symbol times, though, and
  during that time several SKIP ordered sets should have been sent. This case
  is handled by accumulating the SKIPs that should have gone out and
  injecting them all at the next packet boundary.
• Since this ordered set must be transmitted on all Lanes simultaneously, a
  multi-lane link may need to add PAD characters on some Lanes to allow the
  ordered set to go on all Lanes simultaneously (see Figure 11-13 on
  page 377).
• During low-power link states, any counters used to schedule SKIP ordered
  sets must be reset. There's no need for them when the transmitter isn't
  signaling, and it wouldn't make sense to wake up the link to send them.
• SKIP ordered sets must not be transmitted while the Compliance Pattern is
  in progress.
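
The accumulate-and-inject behaviour in these rules can be sketched as a small counter model. This is only an illustration under stated assumptions: the class name, its methods and the choice of a single fixed interval are hypothetical, and the time taken to send the SKIP ordered sets themselves is ignored.

```python
class SkipScheduler:
    """Toy model of SKIP ordered set scheduling (illustrative only)."""

    def __init__(self, interval=1538):
        # The scheduling interval must fall in the range given in the text.
        if not 1180 <= interval <= 1538:
            raise ValueError("interval must be 1180..1538 Symbol times")
        self.interval = interval
        self.elapsed = 0   # Symbol times since the last scheduled SKIP
        self.pending = 0   # SKIP ordered sets owed but not yet sent

    def advance(self, symbol_times):
        """Account for Symbol times spent transmitting (e.g. one packet)."""
        self.elapsed += symbol_times
        self.pending += self.elapsed // self.interval
        self.elapsed %= self.interval

    def at_packet_boundary(self):
        """Return how many SKIP ordered sets to inject now."""
        owed, self.pending = self.pending, 0
        return owed

    def enter_low_power(self):
        """Counters are reset when the Link enters a low-power state."""
        self.elapsed = 0
        self.pending = 0


sched = SkipScheduler(interval=1538)
sched.advance(4096 + 28)            # one maximum-size TLP
print(sched.at_packet_boundary())   # SKIP sets accumulated during the TLP
```

With a 1538-Symbol interval, a single maximum-size TLP spans two full intervals, so two SKIP ordered sets are owed at the next packet boundary.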

Figure 11-20: SKIP Ordered Set

Symbol    Encoding
COM       K28.5
SKP       K28.0
SKP       K28.0
SKP       K28.0

Receive Logic Details (Gen1 and Gen2 Only)


Figure 11-21 shows the receiver logic of the Logical Physical Layer. This
section describes packet processing from the time the data is received
serially on each lane until the packet byte stream is clocked into the Data
Link Layer.


Figure 11-21: Physical Layer Receive Logic Details (Gen1 and Gen2 Only)

[Block diagram. Per Lane (Lane 0 through Lane N), from the Link upward: Rx
differential input, Serial-to-Parallel and Elastic Buffer (clocked by Rx Clk
and the Rx Local PLL), 8b/10b Decoder with Error Detect, and De-Scrambler.
The Lanes then merge through Byte Un-Striping, Start/End/Idle/Pad Character
Removal and Packet Alignment Check, and the Rx Buffer, which delivers
received data and control to the Data Link Layer.]

Differential Receiver
The first parts of the receiver logic are shown in Figure 11-22, including the
differential input buffer for each lane. The buffer senses peak-to-peak
voltage differences and determines whether the difference represents a logical
one or zero. For a detailed discussion of receiver characteristics, see
section "Receiver Characteristics" on page 492.

Figure 11-22: Receiver Logic's Front End Per Lane

[Block diagram, per Lane: D+/D- feed the Differential Receiver, which
produces the serial bit stream; the Rx Clock Recovery PLL derives the Rx
Clock; the Serial-to-Parallel Converter achieves Symbol Lock through K28.5
(Comma Symbol) detection; the Elastic Buffer, with clock control, is written
with the Rx Clock and read with the Rx Local Clock from the Local Clock PLL;
the Lane De-skew Delay Circuit outputs 10-bit Symbols (bits a through j).]

Rx Clock Recovery
General
Next the receiver generates an Rx Clock from the data bit transitions in the
input data stream, probably using a PLL. This recovered clock has the same
frequency (2.5 or 5.0 GHz) as that of the Tx Clock that was used to clock the
bit stream onto the wire. The Rx Clock is used to clock the inbound bit stream
into the deserializer. The deserializer has to be aligned to the 10-bit Symbol
boundary (a process called achieving Symbol lock), and then its Symbol stream
output is clocked into the elastic buffer with a version of the Rx Clock
that's been divided by 10. Even though both must be accurate to within
+/-300 ppm of the center frequency, the Rx Clock is probably a little
different from the Local Clock and, if so, compensation is needed.


Achieving Bit Lock


Recall that the 8b/10b encoding scheme guarantees the inbound serial Symbol
stream will contain frequent transitions. The receiver PLL uses those
transitions to create an Rx Clock that is synchronized with the Tx Clock that
was used to clock the bit stream out of the transmitter. When the receiver
locks onto the Tx Clock frequency, the receiver is said to have achieved "Bit
Lock".

During Link training, the transmitter sends a long series of TS1 and TS2
ordered sets to the receiver, which then uses the bit transitions in them to
achieve Bit Lock. There are enough transitions on the Link during normal
operation for the receiver to maintain Bit Lock after that.

Losing Bit Lock


If the Link is put in a low-power state (such as L0s or L1) in which packet
transmission ceases, the receiver will lose synchronization. To avoid having
the error circuit see this as an error, the transmitter sends an electrical
Idle ordered set (EIOS) before going to the lower power state to tell the
receiver to de-gate its input.

Regaining Bit Lock


When the transmitter is ready to wake the Link from the L0s state, it sends a
specific number of FTS ordered sets (the actual number is design specific) and
the receiver uses these to regain bit and Symbol lock. A relatively small
number of FTSs are needed to recover and so the recovery latency is short.
Because the Link is in the L0s state for a short time, the receiver PLL does
not usually drift too far from the Tx Clock before it begins to receive the
FTSs. If the Link was instead in the L1 low-power state, the transmitter
starts transmitting TS1 ordered sets. This results in the Link getting
retrained, and the wake-up time is longer than the L0s wake-up time. Should
the Link have a more serious error and the Ack/Nak mechanism be unsuccessful
in error recovery after four attempts at retrying the TLPs, the Data Link
Layer signals the Physical Layer to retrain the Link. Here again, Bit Lock is
re-established during the retraining process.

Deserializer
General
The incoming data is clocked into each Lane's deserializer (serial-to-parallel
converter) by the Rx clock (see Figure 11-22 on page 394). The 10-bit Symbols
produced are clocked into the Elastic Buffer using a divided-by-10 version of
the Rx Clock.


Achieving Symbol Lock


When the receive logic starts receiving a bit stream, it is JABOB (just a
bunch of bits) with no markers to differentiate Symbols or any boundaries. The
receive logic must have a way to find the start and end of a 10-bit Symbol,
and the Comma (COM) Symbol serves this purpose.

The 10-bit encoding of the COM Symbol contains two bits of one polarity
followed by five bits of the opposite polarity (0011111010b or 1100000101b),
making it easily detectable. Recall that the COM Control character, like all
other Control characters, is also not scrambled by the transmitter, and that
ensures that the desired sequence will be seen at the receiver. Upon detection
of the COM, the logic knows that the next bit received will be the first bit
of the next 10-bit Symbol. At that point, the deserializer is said to have
achieved "Symbol Lock".
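
To make the idea concrete, the following sketch searches a received bit string for either COM encoding and then slices the stream into 10-bit Symbols from that point. It is a software illustration only, not a description of any real deserializer; the function names are hypothetical and the bit strings are written in transmission order.

```python
# The two 10-bit encodings of COM (K28.5), one per running disparity,
# written in the order the bits appear on the wire.
COM_PATTERNS = ("0011111010", "1100000101")


def find_symbol_boundary(bits):
    """Return the index of the first bit of the first COM, or None."""
    hits = [pos for pos in (bits.find(p) for p in COM_PATTERNS) if pos >= 0]
    return min(hits) if hits else None


def split_symbols(bits):
    """Slice the stream into whole 10-bit Symbols, starting at the first COM."""
    start = find_symbol_boundary(bits)
    if start is None:
        return []          # Symbol Lock not achieved yet
    return [bits[i:i + 10] for i in range(start, len(bits) - 9, 10)]


# Three stray bits, then a COM, then one more Symbol.
stream = "101" + "0011111010" + "1010101010"
print(find_symbol_boundary(stream))
print(split_symbols(stream))
```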

The COM Symbol is used to achieve Symbol Lock as follows:

• During Link training when the Link is first established or when retraining
  is needed, and TS1 and TS2 ordered sets are transmitted.
• When FTS ordered sets are sent to inform the receiver to change the state
  of the Link from L0s to L0.

Receiver Clock Compensation Logic


Background
We've observed before that the clocks used by the transmitter and receiver on
either end of a link aren't required to have exactly the same frequencies.
This will be the case whenever the link doesn't use a common reference clock
and introduces the problem that one of them is running slightly faster than
the other. The only requirement is that both clocks must be within +/-300 ppm
(parts per million) of the center frequency. Since one could be +300 ppm and
the other could be -300 ppm in the worst case, the worst separation between
them could be 600 ppm. That difference translates into a gain or loss of one
Symbol clock every 1666 clocks. Once the Link is trained, the receive clock
(Rx Clock) in the receiver is the same as the transmit clock (Tx Clock) at the
other end of the Link (because the receive clock is derived from the bit
stream).
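
The 1666-clock figure is just the reciprocal of the worst-case frequency difference. A minimal sketch of the arithmetic (the function name is illustrative):

```python
def clocks_per_slip(ppm_difference):
    """Number of Symbol clocks after which two clocks differ by one clock."""
    return 1_000_000 / ppm_difference


worst_case_ppm = 300 + 300        # one clock at +300 ppm, the other at -300 ppm
print(int(clocks_per_slip(worst_case_ppm)))
```

Note that the longest SKIP scheduling interval (1538 Symbol times) is shorter than this slip period, which suggests why that interval leaves the receiver a compensation opportunity before a full Symbol of drift accumulates.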


Elastic Buffer's Role


To compensate for that worst-case frequency difference, an elastic buffer (see
Figure 11-22 on page 394) is built into the receive path. Received Symbols are
clocked into it using the recovered clock and clocked out using the receiver's
local clock. The Elastic Buffer compensates for the frequency difference by
adding or removing SKP Symbols. When a SKIP ordered set arrives, logic
watching the status of the elastic buffer makes an evaluation. If the local
clock is running faster, Symbols are being clocked out faster than they're
coming in, so the buffer will be approaching an underflow condition. The logic
will compensate for this by appending an extra SKP Symbol to the ordered set
when it arrives to quickly refill the buffer. On the other hand, if the
recovered clock is running faster, the buffer will be approaching an overflow
condition and the logic will compensate for that by deleting one of the SKP
Symbols to quickly drain the buffer. These actions will make up for the
difference in rates of arrival and consumption of the Symbols and prevent any
confusion or loss of data.
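
The decision made when a SKIP ordered set arrives might be sketched as follows. The fill-level thresholds, symbol representation and function name are assumptions chosen for illustration; the spec does not prescribe an implementation.

```python
COM, SKP = "COM", "SKP"


def compensate(skip_set, fill, low_mark=2, high_mark=6):
    """Return the SKIP ordered set as it should be written into the buffer.

    skip_set  -- list such as [COM, SKP, SKP, SKP]
    fill      -- current number of Symbols held in the elastic buffer
    low_mark  -- at or below this level the buffer is nearing underflow
    high_mark -- at or above this level the buffer is nearing overflow
    (The two marks are hypothetical values for this sketch.)
    """
    if not skip_set or skip_set[0] != COM or any(s != SKP for s in skip_set[1:]):
        raise ValueError("not a SKIP ordered set")
    if fill <= low_mark:
        # Local clock is faster: add a SKP to refill the buffer.
        return skip_set + [SKP]
    if fill >= high_mark and len(skip_set) > 2:
        # Recovered clock is faster: drop a SKP to drain the buffer.
        return skip_set[:-1]
    return list(skip_set)


print(compensate([COM, SKP, SKP, SKP], fill=1))   # one SKP added
print(compensate([COM, SKP, SKP, SKP], fill=7))   # one SKP removed
```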

The transmitter periodically sends the SKIP ordered sets for this purpose. As
the name implies, the SKP characters are really disposable characters.
Deleting or adding a SKP Symbol prevents a buffer overflow or underflow in the
elastic buffer, and then they get discarded along with all the other control
characters when the Symbols are forwarded to the next layer. Consequently,
they use a little bandwidth but don't otherwise affect the flow of packets at
all.

Although the loss of Symbols due to an Elastic Buffer overflow or underflow is
an error condition, it's optional for receivers to check for this. If they do,
and this situation occurs, a Receiver Error will be indicated to the Data Link
Layer.

The transmitter schedules a SKIP ordered set transmission once every 1180 to
1538 Symbol times. However, if the transmitter starts a maximum sized TLP
transmission right at the 1538 Symbol time boundary when a SKIP ordered set is
scheduled to be transmitted, the SKIP ordered set transmission is deferred.
Receivers must be able to tolerate SKIP ordered sets that have a maximum
separation dependent on the maximum packet payload size a device supports. The
formula for the maximum number of Symbols (n) between SKIP ordered sets is:

n = 1538 + (maximum packet payload size + 28)

The number 28 in the equation is the TLP overhead. It is the largest number of
Symbols that would be associated with the header (16 bytes), the optional ECRC
(4 bytes), the LCRC (4 bytes), the sequence number (2 bytes) and the framing
Symbols STP and END (2 bytes).
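
As a worked example of the formula, the sketch below evaluates n for two payload sizes. The payload values are chosen only for illustration; the constant and function names are hypothetical.

```python
# Worst-case TLP overhead in Symbols, itemized as in the text:
# header + ECRC + LCRC + sequence number + framing (STP and END).
TLP_OVERHEAD = 16 + 4 + 4 + 2 + 2


def max_skip_separation(max_payload_bytes):
    """Maximum number of Symbols between SKIP ordered sets."""
    return 1538 + (max_payload_bytes + TLP_OVERHEAD)


print(max_skip_separation(128))    # device supporting a 128-byte payload
print(max_skip_separation(4096))   # device supporting a 4096-byte payload
```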


Lane-to-Lane Skew
Flight Time Will Vary Between Lanes
For wide links, skew between lanes is an issue that can't be avoided and which
must be compensated at the receiver. Symbols are sent simultaneously on all
lanes using the same transmit clock, but they can't be expected to arrive at
the receiver at precisely the same time. Sources of Lane-to-Lane skew include:

• Differences between electrical drivers and receivers
• Printed wiring board impedance variations
• Trace length mismatches

When the serial bit streams carrying a packet arrive at the receiver, this
Lane-to-Lane skew must be removed to receive the bytes in the correct order.
This process is referred to as de-skewing the link.

Ordered sets Help De-Skewing


The unique structure of the ordered sets and the fact that they are sent
simultaneously on all the lanes makes them useful for detecting timing
misalignment between Lanes. The spec doesn't define a method for doing this
but in Gen1 and Gen2 the receiver logic can simply look for the COM character
on each lane. If it doesn't appear at the same time on all Lanes, then the
early-arriving COMs are delayed until they all match up on all Lanes.

Receiver Lane-to-Lane De-Skew Capability


This could be done by adjusting an analog delay line on the incoming signals.
Alternatively, it could be done after the elastic buffer, which has the
advantage that the arrival time differences are now digitized to Symbol times
by the local clock of the receiver (see Figure 11-23 on page 399). If the
input to one lane makes it on a clock edge and another one doesn't, the
early-arrival COMs can simply be delayed by the appropriate number of Symbol
clocks to line them up with the late-arriving COMs. The fact that the maximum
allowable skew at the receiver is a multiple of the clock period implies that
the spec writers probably had an implementation like this in mind (see
Table 11-3 on page 399).
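
A digital de-skew of this kind can be sketched in a few lines: given the Symbol clock on which the COM was seen on each Lane, delay every Lane until it lines up with the latest one. The function names and the arrival values are illustrative assumptions, not taken from the spec.

```python
def deskew_delays(com_arrival):
    """Per-Lane delay (in Symbol clocks) that aligns all COMs to the latest."""
    latest = max(com_arrival)
    return [latest - t for t in com_arrival]


def within_rx_skew(com_arrival, max_skew_symbols):
    """True if the spread of arrival times fits the allowable Rx skew."""
    return max(com_arrival) - min(com_arrival) <= max_skew_symbols


# Symbol clock on which COM was observed on Lanes 0..3 (made-up numbers).
arrivals = [3, 5, 4, 5]
print(deskew_delays(arrivals))        # Lane 0 is delayed the most
print(within_rx_skew(arrivals, 5))    # checked against a 5-clock budget
```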


Table 11-3: Allowable Receiver Signal Skew

Spec Version    Allowable Rx Skew
Gen1            20 ns (5 clocks at 4 ns per Symbol)
Gen2            8 ns (4 clocks at 2 ns per Symbol)
Gen3            6 ns (4 clocks at 1.25 ns per Symbol)

In Gen3 mode there aren't any COM characters to use for de-skewing, but
detecting Ordered Sets can still provide the necessary timing alignment.

De-Skew Opportunities
An unambiguous pattern is needed on all lanes at the same time to perform
de-skewing and any ordered set will do. Link training sends these, but the
SKIP ordered set is sent regularly during normal Link operation. Checking its
arrival time allows the skew to be checked on an ongoing basis in case it
might change based on temperature or voltage. If it does, the Link will need
to transition to the Recovery LTSSM state to correct it. If that happens while
packets are in flight, however, a receiver error may occur and a packet could
be dropped, possibly resulting in replayed TLPs.

Figure 11-23: Receiver's Link De-Skew Logic

[Diagram: on each of Lanes 0 through 3, the received TS1/TS2 or FTS ordered
sets, each beginning with a COM, arrive at different times. A per-Lane delay,
measured in Symbols, realigns the COMs so that they line up across all Lanes.]


8b/10b Decoder
General
The first two generations of PCIe use 8b/10b, while Gen3 does not. Let's
explore the operation of it first and then consider the difference for Gen3.
Refer to Figure 11-24 on page 401. Each receiver Lane incorporates a 10b/8b
decoder which is fed from the Elastic Buffer. The decoder is shown with two
lookup tables (the D and K tables) to decode the 10-bit Symbol stream into
8-bit characters plus the D/K# signal. The state of the D/K# signal indicates
that the received Symbol is a Data (D) character if a match for the received
Symbol is found in the D table, or a Control (K) character if a match for the
received Symbol is discovered in the K table.

Disparity Calculator
The decoder sets the disparity value based on the disparity of the first
Symbol received. After the first Symbol, once Symbol lock has been achieved
and disparity has been initialized, the calculated disparity for each
subsequent Symbol is expected to follow the rules. If it does not, a Receiver
Error is reported.

Code Violation and Disparity Error Detection


General. The error detection logic of the 8b/10b decoder detects illegal
Symbols in the received Symbol stream. Some error checking is optional in the
receiver, but the spec requires that these errors be checked and reported as a
Receiver Error. The two types of errors detected are:

Code Violations.

• Any 6-bit sub-block containing more than four 1s or four 0s is in error.
• Any 4-bit sub-block containing more than three 1s or three 0s is in error.
• Any 10-bit Symbol containing more than six 1s or six 0s is in error.
• Any 10-bit Symbol containing more than five consecutive 1s or five
  consecutive 0s is in error.
• Any 10-bit Symbol that doesn't decode into an 8-bit character is in error.

Disparity Errors.

At the receiver a Symbol cannot have a disparity that doesn't match what it
should be for the CRD. If it does, a disparity error is detected. Some
disparity errors may not be detectable until the subsequent Symbol is
processed (see Figure 11-25 on page 401). For example, if two bits in a Symbol
flip in error, the error may not be visible and the Symbol may decode into a
valid 8-bit character. Such an error won't be detected in the Physical Layer.
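
The counting rules above can be checked mechanically. The sketch below applies the first four code violation rules to a 10-bit Symbol given as a string in transmission order (abcdei fghj) and also reports the Symbol's disparity. It deliberately omits the fifth rule, since that one needs the full D and K look-up tables; all names here are illustrative.

```python
def longest_run(bits):
    """Length of the longest run of identical consecutive bits."""
    best = run = 1
    for prev, cur in zip(bits, bits[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def code_violations(symbol):
    """Return a list of the counting rules that the 10-bit Symbol breaks."""
    if len(symbol) != 10 or set(symbol) - {"0", "1"}:
        raise ValueError("expected a string of ten 0/1 characters")
    six, four = symbol[:6], symbol[6:]      # abcdei and fghj sub-blocks
    problems = []
    if max(six.count("1"), six.count("0")) > 4:
        problems.append("6-bit sub-block has more than four 1s or 0s")
    if max(four.count("1"), four.count("0")) > 3:
        problems.append("4-bit sub-block has more than three 1s or 0s")
    if max(symbol.count("1"), symbol.count("0")) > 6:
        problems.append("Symbol has more than six 1s or 0s")
    if longest_run(symbol) > 5:
        problems.append("more than five consecutive identical bits")
    return problems


def disparity(symbol):
    """Number of 1s minus number of 0s in the Symbol."""
    return symbol.count("1") - symbol.count("0")


print(code_violations("0011111010"), disparity("0011111010"))  # COM: legal
print(code_violations("1111110000"))                           # several rules broken
```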

Figure 11-24: 8b/10b Decoder per Lane

[Block diagram: the 10b Symbol (bits j h g f i e d c b a) from the Elastic
Buffer feeds the 8b/10b look-up table for D characters, the 8b/10b look-up
table for K characters, and the CRD (Current Running Disparity) calculator.
The outputs are the 8b character (bits H G F E D C B A, numbered 7 down to 0)
plus the D/K# signal, sent as bytes to the De-Scrambler, and an error
indication sent to error reporting.]

Figure 11-25: Example of Delayed Disparity Error Detection

                               CRD  Character    CRD  Character    CRD  Character    CRD
Transmitted Character Stream    -   D21.1         -   D10.2         -   D23.5         +
Transmitted Bit Stream          -   101010 1001   -   010101 0101   -   111010 1010   +
Bit Stream After Error          -   101010 1011   +   010101 0101   +   111010 1010   +
Decoded Character Stream        -   D21.0         +   D10.2         +   Invalid       +

The error occurs in the first Symbol but is not detected until the third
Symbol.


Descrambler
The descrambler is fed by the 8b/10b decoder. It only descrambles Data (D)
characters associated with a TLP or DLLP (D/K# is high). It doesn't descramble
Control (K) characters or ordered sets because they're not scrambled at the
transmitter.

Some Descrambler Implementation Rules:


• On a multi-Lane Link, descramblers associated with each Lane must operate
  in concert, maintaining the same simultaneous value in each LFSR.
• Descrambling is applied to D characters associated with TLPs and DLLPs,
  including the Logical Idle (00h) sequence. D characters within ordered sets
  are not descrambled.
• K characters and ordered set characters bypass the descrambler logic.
• Compliance Pattern characters are not descrambled.
• When a COM character enters the descrambler, it reinitializes the LFSR
  value to FFFFh.
• With one exception, the LFSR serially advances eight times for every
  character (D or K character) received. The LFSR does NOT advance on SKP
  characters associated with the SKIP ordered sets received. The reason the
  LFSR is not advanced on detecting SKPs is because there may be a difference
  between the number of SKP characters transmitted and the SKP characters
  exiting the Elastic Buffer (as discussed in "Receiver Clock Compensation
  Logic" on page 396).
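
The LFSR rules above can be illustrated with a small model. This sketch assumes the Gen1/Gen2 scrambler polynomial G(X) = X^16 + X^5 + X^4 + X^3 + 1, which is not restated in this section, and a particular bit ordering (LFSR bit 15 is applied to data bit 0 first). The class and its interface are hypothetical, and the special handling of D characters inside ordered sets and of the Compliance Pattern is left out for brevity.

```python
COM = ("K", 0xBC)   # K28.5
SKP = ("K", 0x1C)   # K28.0


class Descrambler:
    """Toy per-Lane descrambler; scrambling uses the identical operation."""

    def __init__(self):
        self.lfsr = 0xFFFF

    def _advance_8(self):
        """Advance the LFSR eight serial steps and return the 8 key bits."""
        key = 0
        for i in range(8):
            msb = (self.lfsr >> 15) & 1
            key |= msb << i                      # key bit i pairs with data bit i
            self.lfsr = (self.lfsr << 1) & 0xFFFF
            if msb:
                self.lfsr ^= 0x0039              # feedback taps for X^5, X^4, X^3, 1
        return key

    def process(self, kind, value):
        """kind is 'D' or 'K'; value is the 8-bit character."""
        if (kind, value) == COM:
            self.lfsr = 0xFFFF                   # COM reinitializes the LFSR
            return value
        if (kind, value) == SKP:
            return value                         # SKP: LFSR does not advance
        key = self._advance_8()                  # advances for D and other K
        return value ^ key if kind == "D" else value


d = Descrambler()
print([hex(d.process("D", 0x00)) for _ in range(4)])
```

Because the operation is a plain XOR with the LFSR output, feeding 00h Data characters through the model simply exposes the key stream, and running scrambled data through a second instance in the same state recovers the original bytes.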

Disabling Descrambling
By default, descrambling is always enabled, but the spec allows it to be
disabled for test and debug purposes, although no standard software method is
given for disabling it. If the descrambler receives at least two TS1/TS2
ordered sets with the "disable scrambling" bit set on all of its configured
Lanes, it disables the descrambler.

Byte Un-Striping
Figure 11-26 on page 403 shows eight character streams from the descramblers
of a x8 Link being un-striped into a single byte stream which is fed to the
character filter logic.


Figure 11-26: Example of x8 Byte Un-Striping

[Diagram: the De-Scramblers of Lanes 0 through 7 deliver Characters 0-7, then
Characters 8-15, then Characters 16-23. The Byte Un-Striping logic merges them
into a single data stream (Character 0, Character 1, ... Character 7, and so
on) accompanied by the D/K# indicator.]

Filter and Packet Alignment Check


The serial byte stream supplied by the byte un-striping logic contains TLPs,
DLLPs, Logical Idle sequences, Control characters such as STP, SDP, END, EDB,
and PADs, as well as the ordered sets. Of these, the Logical Idle sequence,
the control characters and ordered sets are detected and eliminated before
they get to the next layer. What remains are the TLPs and DLLPs and these are
sent to the Rx Buffer along with an indicator of the start and end of each
packet.

Receive Buffer (Rx Buffer)


The Rx Buffer holds received TLPs and DLLPs after the start and end characters
have been eliminated. The received packets are ready to send to the Data Link
Layer. The interface to the Data Link Layer is not described in the spec, so
the designer is free to decide details like data bus width. As an example, we
can assume an interface clock of 250 MHz and a Gen1 speed on the Link. For
that case, the number of bytes in the data bus between these layers would be
the same as the number of Lanes supported.

Physical Layer Error Handling

General
Physical Layer errors are reported as Receiver Errors to the Data Link Layer.
According to the spec, some errors must be checked and trigger a receiver
error, while others are optional.

Required error checking:

• 8b/10b decode errors: disparity error, illegal Symbol

Optional error checking:

• Loss of Symbol lock (see "Achieving Symbol Lock" on page 396)
• Elastic Buffer overflow or underflow
• Lane de-skew errors (see "Lane-to-Lane Skew" on page 398)
• Packets inconsistent with format rules

Response of Data Link Layer to Receiver Error


If the Physical Layer indicates a Receiver Error to the Data Link Layer, the
Data Link Layer discards the TLP currently being received and frees any
storage allocated for the TLP. It then schedules a NAK to go back to the
transmitter of the TLP. That causes the transmitter to replay TLPs from the
Replay Buffer, which should automatically correct the error. The Data Link
Layer may also direct the Physical Layer to initiate Link retraining.

If the PCI Express Extended Advanced Error Capabilities register set is
implemented, a Receiver Error sets the Receiver Error Status bit in the
Correctable Error Status register. If enabled, the device can send an ERR_COR
(correctable error) message to the Root Complex.


Active State Power Management


There are several Link power states that allow power savings under certain
conditions. These are L0s, L1, L2, and L3, which represent progressively lower
power and also longer recovery time to get the link back to the fully
operational state of L0. The L0s state can only be entered under hardware
control, while L1 can be initiated by hardware or software. Since L0s and L1
can be controlled by hardware, they are referred to by the spec as ASPM
(Active State Power Management) states. For more on the details of link and
device power management see the section "Active State Power Management
(ASPM)" on page 735.

Link Training and Initialization


As we've just briefly mentioned in this chapter, the Physical Layer is also
responsible for initializing the link after a reset. However, this topic is
too big to cover here and is instead covered in Chapter 14, entitled "Link
Initialization & Training," on page 505.


12 Physical Layer - Logical (Gen3)
The Previous Chapter
The previous chapter describes the Gen1/Gen2 logical sub-block of the Physical
Layer. This layer prepares packets for serial transmission and recovery, and
the several steps needed to accomplish this are described in detail. The
chapter covers logic associated with the Gen1 and Gen2 protocol that use
8b/10b encoding/decoding.

This Chapter
This chapter describes the logical Physical Layer characteristics for the
third generation (Gen3) of PCIe. The major change includes the ability to
double the bandwidth relative to Gen2 speed without needing to double the
frequency (Link speed goes from 5 GT/s to 8 GT/s). This is accomplished by
eliminating 8b/10b encoding when in Gen3 mode. More robust signal compensation
is necessary at Gen3 speed.

The Next Chapter


The next chapter describes the Physical Layer electrical interface to the
Link. The need for signal equalization and the methods used to accomplish it
are also discussed there. That chapter combines electrical transmitter and
receiver characteristics for Gen1, Gen2 and Gen3 speeds.

Introduction to Gen3
Recall that when a PCIe Link enters training (i.e., after a reset) it always
begins using Gen1 speed for backward compatibility. If higher speeds were
advertised during the training, the Link will immediately transition to the
Recovery state and attempt to change to the highest commonly supported speed.


The major motivation for upgrading the PCIe spec to Gen3 was to double the
bandwidth, as shown in Table 12-1 on page 408. The straightforward way to
accomplish this would have been to simply double the signal frequency from
5 GT/s to 10 GT/s, but doing that presented several problems:

• Higher frequencies consume substantially more power, a condition
  exacerbated by the need for sophisticated conditioning logic (equalization)
  to maintain signal integrity at the higher speeds. In fact, the power
  demand of this equalizing logic is mentioned in PCI-SIG literature as a big
  motivation for keeping the frequency as low as practical.
• Some circuit board materials experience significant signal degradation at
  higher frequencies. This problem can be overcome with better materials and
  more design effort, but those add cost and development time. Since PCIe is
  intended to serve a wide variety of systems, the goal was that it should
  work well in inexpensive designs, too.
• Similarly, allowing new designs to use the existing infrastructure (circuit
  boards and connectors, for example) minimizes board design effort and cost.
  Using higher frequencies makes that more difficult because trace lengths
  and other parameters must be adjusted to account for the new timing, and
  that makes high frequencies less desirable.

Table 12-1: PCI Express Aggregate Bandwidth for Various Link Widths

Link Width               x1    x2    x4    x8    x12   x16   x32
Gen1 Bandwidth (GB/s)    0.5   1     2     4     6     8     16
Gen2 Bandwidth (GB/s)    1     2     4     8     12    16    32
Gen3 Bandwidth (GB/s)    2     4     8     16    24    32    64

These considerations led to two significant changes to the Gen3 spec compared
with the previous generations: a new encoding model and a more sophisticated
signal equalization model.


New Encoding Model


The logical part of the Physical Layer replaced the 8b/10b encoding with a new
128b/130b encoding scheme. Of course, this meant departing from the
well-understood 8b/10b model used in many serial designs. Designers were
willing to take this step to recover the 20% transmission overhead imposed by
the 8b/10b encoding. Using 128b/130b means the Lanes are now delivering
8 bits/byte instead of 10 bits, and that means an 8.0 GT/s data rate that
doubles the bandwidth. This equates to a bandwidth of roughly 1 GB/s per Lane
in each direction.
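
The effect of the encoding change on usable bandwidth is easy to verify numerically. The sketch below computes the per-Lane, per-direction payload bandwidth for each generation; the function name is illustrative, and the 2-bit Sync field is the only 128b/130b overhead taken into account.

```python
def lane_bandwidth_gbytes(rate_gt_s, payload_bits, total_bits):
    """Per-Lane, per-direction bandwidth in GB/s after encoding overhead."""
    return rate_gt_s * payload_bits / total_bits / 8


gen1 = lane_bandwidth_gbytes(2.5, 8, 10)       # 8b/10b
gen2 = lane_bandwidth_gbytes(5.0, 8, 10)       # 8b/10b
gen3 = lane_bandwidth_gbytes(8.0, 128, 130)    # 128b/130b

print(gen1, gen2, round(gen3, 3))
```

Table 12-1 lists aggregate figures, counting both directions of the Link, so its x1 entries are twice these per-direction values (with Gen3 rounded up to 2 GB/s).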

To illustrate the difference between these two encodings, first consider
Figure 12-1, which shows the general 8b/10b packet construction. The arrows
highlight the Control (K) characters representing the framing Symbols for the
8b/10b packets. Receivers know what to expect by recognizing these control
characters. See "8b/10b Encoding" on page 380 to review the benefits of this
encoding scheme.

Figure 12-1: 8b/10b Lane Encoding

TLP:   STP (K Character) | Sequence | Header | Data Payload | ECRC | LCRC | END (K Character)
DLLP:  SDP (K Character) | DLLP Type | Misc. | CRC | END (K Character)

Everything between the framing K Characters consists of D Characters.

By comparison, Figure 12-2 on page 410 shows the 128b/130b encoding. This
encoding does not affect bytes being transferred; instead the characters are
grouped into blocks of 16 bytes with a 2-bit Sync field at the beginning of
each block. The 2-bit Sync field specifies whether the block includes Data
(10b) or Ordered Sets (01b). Consequently, the Sync field indicates to the
receiver what kind of traffic to expect and when it will begin. Ordered sets
are similar to the 8b/10b version in that they must be driven on all the Lanes
simultaneously. That requires getting the Lanes properly synchronized and this
is part of the training process (see "Achieving Block Alignment" on page 438).


Figure 12-2: 128b/130b Block Encoding

[Diagram: a 2-bit Sync Field (bits 0-1) followed by Symbol 0, Symbol 1, ...
Symbol 15, each 8 bits wide (bits 0-7).]

Sophisticated Signal Equalization


The second change is made to the electrical sub-block of the Physical Layer
and involves more sophisticated signal equalization both at the transmit side
of the Link and optionally at the receiver. Gen1 and Gen2 implementations use
a fixed Tx de-emphasis to achieve good signal quality. However, increasing
transmission frequencies beyond 5 GT/s causes signal integrity problems to
become more pronounced, requiring more transmitter and receiver compensation.
This can be managed somewhat at the board level but the designers wanted to
allow the external infrastructure to remain the same as much as possible, and
instead placed the burden on the PHY transmitter and receiver circuits. For
more details on signal conditioning, refer to "Solution for 8.0 GT/s -
Transmitter Equalization" on page 474.

Encoding for 8.0 GT/s


As previously discussed, the Gen3 128b/130b encoding method uses Link-wide
packets and per-Lane block encoding. This section provides additional details
regarding the encoding.

Lane-Level Encoding
To illustrate the use of Blocks, consider Figure 12-3 on page 411, where a
single-Lane Data Block is shown. At the beginning are the two Sync Header bits
followed by 16 bytes (128 bits) of information, resulting in 130 transmitted
bits. The Sync Header simply defines whether a Data Block (10b) or an Ordered
Set (01b) is being sent. You may have noticed the Data Block in Figure 12-3
has a Sync Header value of 01 rather than the 10b value mentioned above. This
is because the least significant bit of the Sync Header is sent first when
transmitting the block across the link. Notice the symbols following the Sync
Header are also sent with the least significant bit first.
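
The bit ordering can be made explicit with a short sketch that serializes a 16-Symbol block into its 130 wire bits, least significant bit first. The function and constant names are illustrative; Sync Header values are those given in the text.

```python
DATA_BLOCK = 0b10          # Sync Header value for a Data Block
ORDERED_SET_BLOCK = 0b01   # Sync Header value for an Ordered Set Block


def serialize_block(sync, symbols):
    """Return the 130 bits of a block in the order they go on the wire."""
    if len(symbols) != 16:
        raise ValueError("this sketch handles 16-Symbol blocks only")
    bits = [(sync >> i) & 1 for i in range(2)]          # Sync Header, LSB first
    for byte in symbols:
        bits.extend((byte >> i) & 1 for i in range(8))  # each Symbol, LSB first
    return bits


wire = serialize_block(DATA_BLOCK, [0x01] + [0x00] * 15)
print(len(wire), wire[:2], wire[2:10])
```

A Data Block's 10b Sync Header therefore appears on the wire as 0 then 1, matching the figure.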


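Figure 12-3: Sync Header Data Block Example

[Diagram: a Data Block as sent on the wire. The Sync Header bits 0 1 (value
10b sent least significant bit first) start at Time = 0 UI, Symbol 0 starts
at Time = 2 UI, Symbol 1 at Time = 10 UI, and Symbol 15 at Time = 122 UI.
Each Symbol is sent bit 0 first; together the 16 Symbols form the 128-bit
payload.]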

Block Alignment
Like previous implementations, Gen3 achieves Bit Lock first and then attempts
to establish Block Alignment locking. This requires receivers to find the Sync
Header that demarcates the Block boundary. Transmitters establish this
boundary by sending recognizable EIEOS patterns consisting of alternating
bytes of 00h and FFh, as shown in Figure 12-4. Thus, the use of EIEOS has
expanded from simply exiting Electrical Idle to also serving as the
synchronizing mechanism that establishes Block Alignment. Note that the Sync
Header bits immediately precede and follow the EIEOS (not shown in the
illustration). See "Achieving Block Alignment" on page 438 for details
regarding this process.

Figure 12-4: Gen3 Mode EIEOS Symbol Pattern

Symbol    Value
0         00000000
1         11111111
2         00000000
3         11111111
4         00000000
...
13        11111111
14        00000000
15        11111111


Ordered Set Blocks


Ordered Sets have much the same meaning they did in Gen1 and Gen2. They are
used to manage Lane protocol. When an Ordered Set Block is sent it must appear
on all the Lanes at the same time, and it consists of 16 bytes with one
exception. That exception is the SOS (SKP Ordered Set), which can have SKP
Symbols added or removed in groups of four by clock compensation logic
(associated with a Link Repeater, for example) and can therefore legally be 8,
12, 16, 20, or 24 bytes long.

The basic format of the Ordered Set Block is similar to the Data Block, except
that the Sync Header bits are reversed, as shown in Figure 12-5 on page 412.

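Figure 12-5: Gen3 x1 Ordered Set Block Example

[Diagram: an Ordered Set Block as sent on the wire. The Sync Header bits 1 0
(value 01b sent least significant bit first) start at Time = 0 UI, Symbol 0
starts at Time = 2 UI, Symbol 1 at Time = 10 UI, and Symbol 15 at
Time = 122 UI. Each Symbol is sent bit 0 first; together the 16 Symbols form
the 128-bit payload.]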

The spec defines seven Ordered Sets for Gen3 (one additional Ordered Set over
Gen1 and Gen2 PCIe). In most cases, their functionality is the same as it was
for the previous generations.

1. SOS - Skip Ordered Set: used for clock compensation. See "Ordered Set
   Example - SOS" on page 426 for more detail.
2. EIOS - Electrical Idle Ordered Set: used to enter Electrical Idle state
3. EIEOS - Electrical Idle Exit Ordered Set: used for two purposes now:
   • Electrical Idle Exit as before
   • Block alignment indicator for 8.0 GT/s
4. TS1 - Training Sequence 1 Ordered Set
5. TS2 - Training Sequence 2 Ordered Set
6. FTS - Fast Training Sequence Ordered Set
7. SDS - Start of Data Stream Ordered Set: new; see "Data Stream and Data
   Blocks" on page 413 for more

To give the reader an example of the Ordered Set structure, Figure 12-6 shows
the content of an FTS Ordered Set when running at 8.0 GT/s. An Ordered Set
Block is only recognized as an Ordered Set by the Sync Header, and identified
as an FTS type by the first Symbol in the Block. The right-hand side of the
figure lists the Ordered Set Identifiers (the first Symbol for each Ordered
Set) that serve to identify the type of Ordered Set that is being transmitted.

Figure 12-6: Gen3 FTS Ordered Set Example

FTS Ordered Set

Symbol        Value
Sync Header   01b
0             55h
1             47h
2             4Eh
3             C7h
4             CCh
5             C6h
6             C9h
7             25h
8             6Eh
9             ECh
10            88h
11            7Fh
12            80h
13            8Dh
14            8Bh
15            8Eh

Ordered Set Identifiers

Ordered Set   First Symbol
EIEOS         00h
EIOS          66h
FTS           55h
SDS           E1h
TS1           1Eh
TS2           2Dh
SKP           AAh
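
The two-step recognition described above (Sync Header first, then the first
Symbol) can be sketched in a few lines of Python. This is an illustrative
sketch only: the function and constant names are hypothetical, the identifier
values are the ones listed in Figure 12-6, and the Data Block Sync Header value
of 10b is taken from the statement that its bits are the reverse of the
Ordered Set value 01b.

```python
# First-Symbol identifiers as listed in Figure 12-6.
ORDERED_SET_IDS = {
    0x00: "EIEOS",
    0x66: "EIOS",
    0x55: "FTS",
    0xE1: "SDS",
    0x1E: "TS1",
    0x2D: "TS2",
    0xAA: "SKP",
}

SYNC_ORDERED_SET = 0b01  # Sync Header value for an Ordered Set Block
SYNC_DATA_BLOCK = 0b10   # Sync Header value for a Data Block (bits reversed)


def classify_block(sync_header: int, first_symbol: int) -> str:
    """Name the Block type from its Sync Header and first Symbol."""
    if sync_header == SYNC_DATA_BLOCK:
        return "Data Block"
    if sync_header != SYNC_ORDERED_SET:
        return "illegal Sync Header"  # 00b and 11b are not valid
    return ORDERED_SET_IDS.get(first_symbol, "unknown Ordered Set")


if __name__ == "__main__":
    print(classify_block(0b01, 0x55))  # FTS, as in the figure
```

The first Symbol is only consulted once the Sync Header has marked the Block as
an Ordered Set, which mirrors the order of checks in the text.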

Data Stream and Data Blocks


The Link enters a Data Stream by sending an SDS Ordered Set and transitioning
to the L0 Link state. While in a Data Stream, multiple Data Blocks are
transferred until the Data Stream ends with an EDS Token (unless an error ends
it early). An EDS Token always occupies the last four Symbols of the Data Block
that precedes an Ordered Set. An exception is made for Skip Ordered Sets
because they do not interrupt a Data Stream as long as certain conditions are
met that are discussed later. A Data Stream is no longer in effect when the
Link state transitions out of the L0 state to any other Link state, such as
Recovery. For more on Link states, see "Link Training and Status State Machine
(LTSSM)" on page 518.

Data Block Frame Construction


Data Blocks comprise TLPs, DLLPs, and Tokens that are used to deliver the
information. Five types of data structures (called Tokens) are used within a
Data Block. Each has a pattern that is easy for the receiver to detect. Three
of the Tokens may be sent at the beginning of a Block (i.e., immediately
following the Sync Header of a Data Block). These include:

• Start TLP (STP) - followed by a TLP
• Start DLLP (SDP) - followed by a DLLP
• Logical Idle (IDL) - sent when there is no packet activity

The remaining Tokens are delivered at the end of the Data Block:

• End of Data Stream (EDS) - precedes the transition to Ordered Sets
• End Bad (EDB) - reports that a nullified packet has been detected

Figure 12-7 provides an example of a Data Block consisting of a single-Lane TLP
transmission.

Figure 12-7: Gen3 x1 Frame Construction Example

[Figure: a TLP framed on a x1 Link, with bits drawn in transmission (Tx) order.
Symbols 0 through 3 hold the STP Token (1111b, LEN[3:0], LEN[10:4], Frame
Parity, Frame CRC, Sequence Number). They are followed by the Header and Data
Payload (8 bytes, same as 2.0) and the LCRC (4 bytes, same as 2.0), ending in
Symbol 15.]

In summary, the contents of a given Data Block vary depending on the activity:

• IDLs - when no packets are being delivered, Data Blocks consist of nothing
  but IDLs. (The spec designates IDL as one of the Tokens.)
• TLPs - one or more TLPs may be sent in a given Data Block, depending on the
  Link width.
• DLLPs - one or more DLLPs may be sent in a Data Block.
• Combinations of the activity listed above may be delivered in a single Data
  Block.

Framing Tokens
The spec defines five Framing Tokens (or just "Tokens" for short) that are
allowed to appear in a Data Block, and those are repeated for convenience here
in Figure 12-8 on page 417. The five Tokens are:

1. STP - Start TLP: Much like the earlier version, but now includes a dword
   count for the entire packet.
2. SDP - Start DLLP
3. EDB - End Bad: Used to nullify a TLP the way it was in the earlier Gen1 and
   Gen2 designs, but now four EDB Symbols in a row are sent. The END (End Good)
   symbol has been done away with; if not explicitly marked as bad, the TLP
   will be assumed to be good.
4. EDS - End of Data Stream: Last dword of a Data Stream, indicating that at
   least one Ordered Set will follow. Curiously, the Data Stream may not
   actually be ended by this event. If the Ordered Set that follows it is an
   SOS and is immediately followed by another Data Block, the Data Stream
   continues. If the Ordered Set that follows the EDS is anything other than
   SOS, or if the SOS is not followed by a Data Block, the Data Stream ends.
5. IDL - Logical Idle: The Idle Token is simply data zero bytes sent during the
   Link Logical Idle state when no TLPs or DLLPs are ready to transmit.

The difference between the way the spec shows the Tokens and the way they are
represented in Figure 12-8 on page 417 is that this drawing shows both bytes
and bits in little-endian order instead of the big-endian bit representation
used in the spec. The reason it's shown that way is to illustrate the order in
which the bits will actually appear on the Lane.

Packets
The STP and SDP Tokens indicate the start of a packet, as shown in Figure 12-7.

• TLPs. An STP Token consists of a nibble of 1s followed by an 11-bit dword
  Length field. The length counts all the dwords of the TLP, including the
  Token, header, optional data payload, optional digest, and LCRC. That allows
  the receiver to count dwords to recognize where the TLP ends. Consequently,
  it's very important to verify that the Length field doesn't have an error,
  and so it has a 4-bit Frame CRC and an even parity bit that protects both
  the Length and Frame CRC fields. The combination of these bits provides a
  robust triple-bit-flip detection capability for the Token (as many as 3 bits
  could be incorrect and it would still be recognized as an error). The 11-bit
  Length field allows for a TLP of 2K dwords (8 KB) for the entire TLP.
• DLLPs. The SDP Token indicates the beginning of a DLLP and doesn't include a
  length field because the DLLP will always be exactly 8 bytes long: the 2-byte
  Token is followed by 4 bytes of DLLP payload and 2 bytes of DLLP CRC.
  Perhaps coincidentally, this DLLP length is the same as it was in earlier
  PCIe generations, although DLLPs, like TLPs, no longer have an "end good"
  symbol.
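
To make the STP fields concrete, here is a hedged Python sketch that computes
the Length value and the parity bit and packs them into four Symbols. Several
things in it are assumptions rather than statements from this text: the
function names are hypothetical, the exact bit positions within Symbols 1 and 2
are an assumed layout and should be checked against the spec, and the Frame CRC
is passed in by the caller because its equations are not given here.

```python
def tlp_length_dw(header_dw: int, payload_dw: int, has_digest: bool) -> int:
    """Length field value: Token + header + payload + digest + LCRC, in dwords."""
    return 1 + header_dw + payload_dw + (1 if has_digest else 0) + 1


def frame_parity(length_dw: int, frame_crc: int) -> int:
    """Even parity bit covering the 11 Length bits and the 4 Frame CRC bits."""
    covered = (length_dw & 0x7FF) | ((frame_crc & 0xF) << 11)
    return bin(covered).count("1") & 1


def pack_stp(length_dw: int, frame_crc: int, seq_num: int) -> bytes:
    """Pack an STP Token into 4 Symbols (ASSUMED bit layout, see lead-in)."""
    if not 1 <= length_dw <= 0x7FF:
        raise ValueError("Length must be a non-zero 11-bit value")
    parity = frame_parity(length_dw, frame_crc)
    return bytes([
        0x0F | ((length_dw & 0xF) << 4),                    # 1111b, Length[3:0]
        ((length_dw >> 4) & 0x7F) | (parity << 7),          # Length[10:4], parity
        ((seq_num >> 8) & 0xF) | ((frame_crc & 0xF) << 4),  # Seq[11:8], Frame CRC
        seq_num & 0xFF,                                     # Seq[7:0]
    ])


def unpack_stp(token: bytes) -> tuple[int, int, int, bool]:
    """Return (length_dw, frame_crc, seq_num, parity_ok) for a packed Token."""
    length_dw = (token[0] >> 4) | ((token[1] & 0x7F) << 4)
    frame_crc = token[2] >> 4
    seq_num = ((token[2] & 0xF) << 8) | token[3]
    parity_ok = (token[1] >> 7) == frame_parity(length_dw, frame_crc)
    return length_dw, frame_crc, seq_num, parity_ok


if __name__ == "__main__":
    # A 3-dword header with 2 dwords of payload and no digest counts as 7 dwords.
    length = tlp_length_dw(header_dw=3, payload_dw=2, has_digest=False)
    token = pack_stp(length, frame_crc=0xA, seq_num=0x123)
    print(length, token.hex(), unpack_stp(token))
```

The receiver-side view is the reverse: it recovers the Length, checks the
parity (and, in a real design, the Frame CRC), and only then trusts the dword
count to find the end of the TLP.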

The EDB Token is added to the end of TLPs that are nullified. For a normal TLP,
there is no "end good" indication; it's assumed to be good unless explicitly
marked as bad. If the TLP ends up being nullified, the LCRC value is inverted
and an EDB Token is appended as an extension of the TLP, although it's not
included in the length value. Physical Layer receivers must check for the EDB
at the end of every TLP and inform the Link Layer if they see one. Not
surprisingly, receiving an EDB at any time other than immediately after a TLP
will be considered to be a Framing Error.

Figure 12-8: Gen3 Frame Token Examples

[Figure: the five Framing Tokens, with the bits of each Symbol drawn in
transmission (Tx) order, bit 0 first.]

Token   Symbols   Content as drawn (bit 0 first)
STP     0-3       Fields: 1111b, LEN[0:3], LEN[4:10], Parity bit, Frame CRC,
                  Sequence Number
SDP     0-1       0000 1111b  0011 0101b
EDS     0-3       1111 1000b  0000 0001b  0000 1001b  0000 0000b
EDB     0-3       0000 0011b  0000 0011b  0000 0011b  0000 0011b
IDL     0         0000 0000b
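
Because Figure 12-8 draws each Symbol with bit 0 first, the patterns have to be
bit-reversed to obtain the usual hexadecimal byte values. The following small
Python sketch does that conversion for the EDS and EDB patterns in the figure.
The helper name is hypothetical and is used only for illustration.

```python
def tx_order_to_byte(bits: str) -> int:
    """Convert a Symbol drawn bit 0 first (e.g. "1111 1000") to its byte value."""
    cleaned = bits.replace(" ", "")
    if len(cleaned) != 8 or set(cleaned) - {"0", "1"}:
        raise ValueError("expected exactly eight binary digits")
    # The drawing lists bit 0 first, so reverse the string to get the usual
    # most-significant-bit-first notation before converting.
    return int(cleaned[::-1], 2)


# Symbol patterns copied from Figure 12-8 (transmission order).
EDS_AS_DRAWN = ["1111 1000", "0000 0001", "0000 1001", "0000 0000"]
EDB_AS_DRAWN = ["0000 0011"] * 4

if __name__ == "__main__":
    print([f"{tx_order_to_byte(s):02X}h" for s in EDS_AS_DRAWN])
    print([f"{tx_order_to_byte(s):02X}h" for s in EDB_AS_DRAWN])
```

Read this way, the EDS Token is the byte sequence 1Fh, 80h, 90h, 00h and each
EDB Symbol is C0h.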

Transmitter Framing Requirements


To begin this discussion, it will be helpful first to define a couple of
things. First, recall that a Data Stream starts with the first Symbol following
an SDS and it may contain Data Blocks made up of Tokens, TLPs and DLLPs. The
Data Stream finishes with the last Symbol before an Ordered Set other than SOS,
or when a Framing Error is detected. During a Data Stream no Ordered Sets can
be sent except for the SOS.

Secondly, since framing problems will usually result in a Framing Error, it
will help to explain what happens in that case. When Framing Errors occur, they
are considered Receiver Errors and will be reported as such. The Receiver stops
processing the Data Stream in progress and will only process a new Data Stream
when it sees an SDS Ordered Set. In response to the error, a recovery process
is initiated by directing the LTSSM to the Recovery state from L0. The
expectation is that this will be resolved in the Physical Layer and will not
require any action by the upper layers. In addition, the spec states that the
round-trip time to accomplish this is expected to take less than 1 µs from the
time both Ports have entered Recovery.

Now, with that background in place, let's continue with the framing
requirements. While in a Data Stream, a transmitter must observe the following
rules:

• When sending a TLP:
  - An STP Token must be immediately followed by the entire contents of the
    TLP as delivered from the Link Layer, even if it's nullified.
  - If the TLP was nullified, the EDB Token must appear immediately after the
    last dword of the TLP, but must not be included in the TLP length value.
  - An STP cannot be sent more than once per Symbol Time on the Link.
• When sending a DLLP:
  - An SDP Token must be immediately followed by the entire contents of the
    DLLP as delivered from the Data Link Layer.
  - An SDP cannot be sent more than once per Symbol Time on the Link.
• When sending an SOS (SKP Ordered Set) within a Data Stream:
  - Send an EDS Token in the last dword of the current Data Block.
  - Send the SOS as the next Ordered Set Block.
  - Send another Data Block immediately after the SOS. The Data Stream resumes
    with the first Symbol of this subsequent Data Block.
  - If multiple SOSs are scheduled, they can't be back-to-back as they were in
    earlier generations. Instead, each one must be preceded by a Data Block
    that ends with the EDS Token. The Data Block can be filled with TLPs,
    DLLPs or IDLs during this time.
• To end a Data Stream, send the EDS Token in the last dword of the current
  Data Block and follow that with either the EIOS to go into a low-power Link
  state, or an EIEOS for all other cases.
• The IDL Token must be sent on all Lanes if a TLP, DLLP, or other Framing
  Token is not being sent on the Link.
• For multi-Lane Links:
  - After sending an IDL Token, the first Symbol of the next TLP or DLLP must
    be in Lane 0 when it starts. An EDS Token must always be the last dword of
    a Data Block and therefore may not always follow that rule.
  - IDL Tokens must be used to fill in dwords during a Symbol Time that would
    otherwise be empty. For example, if a x8 Link has a TLP that ends in Lane
    3, but the sender doesn't have another TLP or a DLLP ready to start in
    Lane 4, then IDLs must fill in the remaining bytes until the end of that
    Symbol Time.
  - Since packets are still multiples of 4 bytes as they were in the earlier
    generations, they'll start and end on 4-Lane boundaries. For example, a x8
    Link with a DLLP that ends in Lane 3 could start the next TLP by placing
    its STP Token in Lane 4.
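
The multi-Lane rules above boil down to a small amount of arithmetic. The
Python sketch below is only an illustration of those rules; the function names
are hypothetical helpers, and the model ignores everything except the Lane on
which the previous packet ended.

```python
IDL = 0x00  # the IDL Token is a single all-zero Symbol


def pad_symbol_time(last_used_lane: int, link_width: int) -> list[int]:
    """IDL Symbols needed to fill the rest of the current Symbol Time."""
    if not 0 <= last_used_lane < link_width:
        raise ValueError("lane out of range for this Link width")
    return [IDL] * (link_width - 1 - last_used_lane)


def next_start_lane(last_used_lane: int, link_width: int,
                    packet_ready: bool) -> int:
    """Lane in which the next STP or SDP Token may begin."""
    following = last_used_lane + 1
    if packet_ready and following < link_width:
        return following  # continue within the same Symbol Time
    return 0              # after IDL fill (or a full Symbol Time), use Lane 0


if __name__ == "__main__":
    # x8 Link, packet ended in Lane 3, nothing else ready: four IDLs are sent.
    print(len(pad_symbol_time(3, 8)))
    # Same situation, but a TLP is ready: its STP Token can go in Lane 4.
    print(next_start_lane(3, 8, packet_ready=True))
```

Both cases correspond to the x8 examples in the list: either the remainder of
the Symbol Time is filled with IDLs, or the next packet begins on the next
4-Lane boundary.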

Receiver Framing Requirements


When a Data Stream is seen at the Receiver, the following rules apply:

• When Framing Tokens are expected, Symbols that look like anything else will
  be Framing Errors.
• Some error checks and reports shown in the list below are optional, and the
  spec points out that they are independently optional.
• When an STP is received:
  - Receivers must check the Frame CRC and Frame Parity fields, and any
    mismatch will be a Framing Error. (Note that an STP Token with a Framing
    Error isn't considered to be part of a TLP when reporting this error.)
  - The Symbol immediately after the last DW of the TLP is the next Token to
    process, and Receivers must check to see whether it's the start of an EDB
    Token showing that the TLP has been nullified.
  - Optionally check for a length value of zero; if detected, it's a Framing
    Error.
  - Optionally check for the arrival of more than one STP Token in the same
    Symbol Time. If checking and detected, this is a Framing Error.
• When an EDB is received:
  - The Receiver must inform the Link Layer as soon as the first EDB Symbol is
    detected, or after any of the remaining bytes of it have been received.
  - If any Symbols in the Token are not EDBs, the result is a Framing Error.
  - The only legal time for an EDB Token is right after a TLP; any other use
    will be a Framing Error.
  - The Symbol immediately following the EDB Token will be the first Symbol of
    the next Token to be processed.
• When an EDS Token is received as the last DW of a Data Block:
  - Receivers must stop processing the Data Stream.
  - Only a SKP, EIOS, or EIEOS Ordered Set will be acceptable next; receiving
    any other Ordered Set will be a Framing Error.
  - If a SKP Ordered Set is received after an EDS, Receivers must resume Data
    Stream processing with the first Symbol of the Data Block that follows,
    unless a Framing Error was detected.
• When an SDP Token is received:
  - The Symbol immediately after the DLLP is the next Token to be processed.
  - Optionally check for more than one SDP Token in the same Symbol Time. If
    checking and this occurs, it is a Framing Error.
• When an IDL Token is received:
  - The next Token is allowed to begin on any DW-aligned Lane following the
    IDL Token. For Links that are x4 or narrower, that means the next Token
    can only start in Lane 0 of the next Symbol Time. For wider Links there
    are more options. For example, a x16 Link could start the next Token in
    Lane 0, 4, 8, or 12 of the current Symbol Time.
  - The only Token that would be expected in the same Symbol Time as an IDL
    would be another IDL or an EDS.
• While processing a Data Stream, Receivers will see the following as Framing
  Errors:
  - An Ordered Set immediately following an SDS.
  - A Block with an illegal Sync Header (11b or 00b). This can optionally be
    reported in the Lane Error Status register.
  - An Ordered Set Block on any Lane without receiving an EDS Token in the
    previous Block.
  - A Data Block immediately following an EDS Token in the previous Block.
  - Optionally, verify that all Lanes receive the same Ordered Set.
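
A few of the receiver checks listed above are simple enough to express
directly. The following Python sketch covers three of them: the legal Sync
Header values, the Ordered Sets allowed after an EDS, and the DW-aligned Lanes
on which a Token may begin after an IDL. The names are hypothetical, and the
sketch is not a complete framing checker.

```python
LEGAL_SYNC_HEADERS = {0b01: "Ordered Set Block", 0b10: "Data Block"}
ALLOWED_AFTER_EDS = {"SKP", "EIOS", "EIEOS"}


def is_illegal_sync_header(sync_header: int) -> bool:
    """True for 00b and 11b, which are Framing Errors in a Data Stream."""
    return sync_header not in LEGAL_SYNC_HEADERS


def is_acceptable_after_eds(next_block: str) -> bool:
    """True if the Block following an EDS Token is a SKP, EIOS, or EIEOS."""
    return next_block in ALLOWED_AFTER_EDS


def dw_aligned_lanes(link_width: int) -> list[int]:
    """DW-aligned Lanes on which a Token may begin after an IDL."""
    return list(range(0, link_width, 4))


if __name__ == "__main__":
    print(dw_aligned_lanes(16))                   # wider Links have more options
    print(dw_aligned_lanes(4))                    # x4 or narrower: Lane 0 only
    print(is_acceptable_after_eds("Data Block"))  # a Data Block here is an error
```

For a x16 Link the aligned Lanes are 0, 4, 8, and 12, matching the example in
the list, while any Link of x4 or narrower has only Lane 0.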

Recovery from Framing Errors


If a Framing Error is seen while processing a Data Stream, the Receiver must:

• Report a Receiver Error (if the optional Advanced Error Reporting registers
  are available, set the status bit shown in Figure 12-9 on page 421).
• Stop processing the Data Stream. Processing a new Data Stream can begin when
  the next SDS Ordered Set is seen.
• Initiate the error recovery process. If the Link is in the L0 state, that
  will involve a transition to the Recovery state. The spec says that the time
  through the Recovery state is expected to be less than 1 µs.
• Note that recovery from Framing Errors is not necessarily expected to
  directly cause Data Link Layer initiated recovery activity via the Ack/Nak
  mechanism. Of course, if a TLP is lost or corrupted as a result of the
  error, then a replay event will be needed.

Figure 12-9: AER Correctable Error Register

Bit(s)   Field
31:16    RsvdZ
15       Header Log Overflow Status
14       Corrected Internal Error Status
13       Advisory Non-Fatal Error Status
12       Replay Timer Timeout Status
11:9     RsvdZ
8        REPLAY_NUM Rollover Status
7        Bad DLLP Status
6        Bad TLP Status
5:1      RsvdZ
0        Receiver Error Status

Note: all bits designated RW1CS
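
As an illustration of how software might interpret the register in Figure 12-9,
the Python sketch below decodes a raw 32-bit value into the names of the status
bits that are set. The bit positions come from the figure; the function names
are hypothetical, and how the register value is read from the device is outside
the scope of this sketch.

```python
# Bit positions taken from Figure 12-9.
CORRECTABLE_ERROR_BITS = {
    0: "Receiver Error Status",
    6: "Bad TLP Status",
    7: "Bad DLLP Status",
    8: "REPLAY_NUM Rollover Status",
    12: "Replay Timer Timeout Status",
    13: "Advisory Non-Fatal Error Status",
    14: "Corrected Internal Error Status",
    15: "Header Log Overflow Status",
}


def decode_correctable_status(value: int) -> list[str]:
    """List the status bits set in a raw register value, lowest bit first."""
    return [name for bit, name in sorted(CORRECTABLE_ERROR_BITS.items())
            if value & (1 << bit)]


def clear_mask(*bits: int) -> int:
    """Value to write back to clear the given bits (write 1 to clear)."""
    mask = 0
    for bit in bits:
        mask |= 1 << bit
    return mask


if __name__ == "__main__":
    # A Framing Error is reported as a Receiver Error, which is bit 0.
    print(decode_correctable_status(0x0000_0001))
    print(hex(clear_mask(0)))
```

Since the bits are RW1CS, software clears a reported error by writing a 1 to
that bit position rather than a 0.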

Gen3 Physical Layer Transmit Logic


Figure 12-10 on page 422 illustrates a conceptual block diagram of the Physical
Layer transmit logic that supports Gen3 speeds. The overall design is very
similar to Gen2, so there's no need to go through all the details again, but
there are some differences. Those who are new to PCIe are encouraged to review
the earlier chapter called "Physical Layer - Logical (Gen1 and Gen2)" on page
361 to learn the basics of the Physical Layer design. Let's start at the top of
the diagram and explain the changes for Gen3 along the way. As before, it's
important to point out that this implementation is only for instructional
purposes and is not meant to show an actual Gen3 Physical Layer implementation.

Multiplexer
TLPs and DLLPs arrive from the Data Link Layer at the top. The multiplexer
mixes in the STP or SDP Tokens necessary to build a complete TLP or DLLP.
The previous section described the Token formats.

Figure 12-10: Gen3 Physical Layer Transmitter Details

[Figure: conceptual transmit path. TLPs and DLLPs arrive from the Data Link
Layer into the Tx Buffer (with Throttle and Packet Boundary Indicator signals).
A Mux combines the buffer output with Control/Token Characters, Ordered Sets,
and Logical Idle. Byte Striping distributes the result across Lane 0 through
Lane N. Each Lane contains a Gen3 Scrambler alongside the earlier Scrambler and
8b/10b Encoder, a Mux selecting between those paths, a Gen3 Sync Bits
Generator, and a Serializer clocked by Tx Clk from the Tx Local PLL, ending at
the Tx output.]

Gen3 TLP boundaries are defined by the dword count in the Length field of the
STP Token at the beginning of a TLP; therefore, no END framing character is
needed.

When ending a Data Stream or just before sending an SOS, the EDS Token is
muxed into the Data Stream. At regular intervals, based on a Skip timer, an SOS
is inserted into the Data Stream by the multiplexer. Other Ordered Sets such as
TS1, TS2, FTS, EIEOS, EIOS, and SDS may also be muxed in based on Link
requirements, and they are outside the Data Stream.

Packets are transmitted in Blocks, which are identified by the 2-bit Sync
Header. The Sync Header is added by the multiplexer and is then replicated on
all Lanes of a multi-Lane Link by the Byte Striping logic.

When there are no packets or Ordered Sets to send but the Link is to remain
active in the L0 state, IDL (Logical Idle, or data zero) Tokens are used as
fillers. These are scrambled just like other data bytes and are recognized as
filler by the Receiver.

Byte Striping
This logic spreads the bytes to be delivered across all the available Lanes.
The framing rules were described earlier in "Transmitter Framing Requirements"
on page 417, so now let's look at some examples and discuss how the rules
apply.

Consider first the example shown in Figure 12-11 on page 424, where a 4-Lane
Link is illustrated. Notice that the Sync Header bits appear on all the Lanes
at the same time when a new Block begins and define the Block type (a Data
Block in this example). Block encoding is handled independently for each Lane,
but the bytes (or Symbols) are striped across all the Lanes just as they were
for the earlier generations of PCIe.

Figure 12-11: Gen3 Byte Striping x4

Lane     Sync bits   First Symbol   Second Symbol   ...   Last Symbol
Lane 0   0 1         Symbol 0       Symbol 4        ...   Symbol 60
Lane 1   0 1         Symbol 1       Symbol 5        ...   Symbol 61
Lane 2   0 1         Symbol 2       Symbol 6        ...   Symbol 62
Lane 3   0 1         Symbol 3       Symbol 7        ...   Symbol 63

(Each Symbol is drawn bit 0 first, bits 0 through 7.)
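
The Symbol numbering in Figure 12-11 follows a simple round-robin rule: Symbol
number i of the Block goes to Lane i mod 4. A minimal Python sketch of that
distribution is shown below; the function name is a hypothetical helper, and
the sketch deals only with the payload Symbols, not the per-Lane Sync Header.

```python
def stripe(symbols: bytes, lanes: int) -> list[bytes]:
    """Distribute Symbols round-robin: Symbol i is sent on Lane i % lanes."""
    if lanes < 1:
        raise ValueError("a Link has at least one Lane")
    return [symbols[lane::lanes] for lane in range(lanes)]


if __name__ == "__main__":
    # 64 Symbols (numbered 0 to 63) fill one Block on each of four Lanes.
    per_lane = stripe(bytes(range(64)), lanes=4)
    for lane, data in enumerate(per_lane):
        print(f"Lane {lane}: first={data[0]} second={data[1]} last={data[-1]}")
```

With 64 Symbols on a x4 Link, each Lane carries 16 Symbols, so one 16-Symbol
Block per Lane is filled, as in the figure.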

Byte Striping x8 Example


Next, consider the x8 Link shown in Figure 12-12 on page 425, which is an
example from the spec redrawn to make it easier to read. Here the bit stream is
vertical instead of horizontal. At the top we can see that the Sync bits, shown
in little-endian order as required, appear on all Lanes simultaneously and
indicate that a Data Block is starting.

In this example, a TLP is sent first, so the first four bytes (Lanes 0-3 of
Symbol 0) contain the STP framing Token, which includes a length of 7 DW for
the entire TLP including the Token. The receiver needs to know the length of
the TLP because for 8 GT/s speeds there is no END control character. Instead,
the receiver counts the dwords, and if there is no EDB (End Bad) observed, the
TLP is assumed to be good. In this case, the TLP ends on Lane 3 of Symbol 3.

Figure 12-12: Gen3 x8 Example: TLP Straddles Block Boundary

            Lanes 0-3                             Lanes 4-7
Sync        0 0 0 0                               0 0 0 0
            1 1 1 1                               1 1 1 1
Symbol 0    STP Token: Length = 7, CRC,           TLP
            Parity, Seq Num
Symbol 1    TLP                                   TLP
Symbol 2    TLP                                   TLP
Symbol 3    LCRC                                  SDP Token, DLLP
Symbol 4    DLLP                                  IDL IDL IDL IDL (Logical Idle)
Symbol 5    IDL IDL IDL IDL                       IDL IDL IDL IDL
Symbol 6    STP Token: Length = 23, CRC,          Header DW 1
            Parity, Seq Num
Symbol 7    Header DW 2                           Header DW 3
...
Symbol 15   Data DW 14 (DW 18)                    Data DW 15 (DW 19)
Sync        0 0 0 0                               0 0 0 0
            1 1 1 1                               1 1 1 1
            (TLP straddles Block boundary)
Symbol 0    Data DW 16 (DW 20)                    Data DW 17 (DW 21)
Symbol 1    LCRC                                  IDL IDL IDL IDL

Next a DLLP is sent, beginning with the SDP Token on Lanes 4 and 5. Since a
DLLP is always 8 Symbols long, it will finish in Lane 3 of Symbol 4.
Momentarily, there are no other packets to send, so IDL Symbols are transferred
until another packet is ready. When IDLs are sent, the next STP Token can only
start in Lane 0. In the example, the TLP starts in Lane 0 of Symbol 6.

The packet length for the next TLP is 23 DW, and that presents an interesting
situation because there are only 20 dwords available before the next Block
boundary. When the Data Block ends, the transmitter sends Sync and continues
TLP transmission during Symbol 0 of the next Block. In other words, packets
simply straddle Block boundaries when necessary. Finally, the TLP finishes in
Lane 3 of Symbol 1. Once again there are no packets ready to send, so IDLs are
sent.
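
The positions quoted in this example can be checked with a little arithmetic.
The Python sketch below is a hypothetical helper (not part of the spec) that
treats a Block as 16 Symbol Times of `lanes` bytes each and computes where a
packet of a given length ends, including the case where it straddles into the
next Block.

```python
SYMBOLS_PER_BLOCK = 16
BYTES_PER_DW = 4


def packet_end(start_symbol: int, start_lane: int, length_dw: int,
               lanes: int) -> tuple[int, int, int]:
    """Return (blocks_later, symbol, lane) of the last byte of a packet."""
    first_byte = start_symbol * lanes + start_lane
    last_byte = first_byte + length_dw * BYTES_PER_DW - 1
    symbol_time, lane = divmod(last_byte, lanes)
    blocks_later, symbol = divmod(symbol_time, SYMBOLS_PER_BLOCK)
    return blocks_later, symbol, lane


def dwords_left_in_block(start_symbol: int, lanes: int) -> int:
    """Dwords available from Lane 0 of start_symbol to the end of the Block."""
    return (SYMBOLS_PER_BLOCK - start_symbol) * lanes // BYTES_PER_DW


if __name__ == "__main__":
    print(packet_end(0, 0, 7, 8))      # first TLP: 7 DW from Symbol 0, Lane 0
    print(packet_end(3, 4, 2, 8))      # DLLP: 8 bytes from Symbol 3, Lane 4
    print(dwords_left_in_block(6, 8))  # room left when the 23 DW TLP starts
    print(packet_end(6, 0, 23, 8))     # second TLP: ends in the next Block
```

For the x8 example, the 7 DW TLP ends in Lane 3 of Symbol 3, the DLLP ends in
Lane 3 of Symbol 4, and the 23 DW TLP, which has only 20 dwords of room in its
starting Block, ends in Lane 3 of Symbol 1 of the following Block.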

Nullified Packet x8 Example


Nullified TLPs can occur when a TLP is being transferred across a switch to
reduce latency. This is called Switch Cut-Through operation. The reader may
choose to review the section entitled "Switch Cut-Through Mode" on page 354
before proceeding with this discussion.

A nullified TLP can occur when a switch begins forwarding a packet to the
egress port before the entire packet has been received at the ingress port and
before error checking is complete. Because an error was detected in this
example, the TLP must be nullified.

Figure 12-13 illustrates the steps taken to nullify a TLP. The TLP being sent
by the egress port starts in the first Block (Lane 0 of Symbol 6). When the
error is detected, the egress port inverts the CRC (Lanes 0-3 of Symbol 1) and
adds an EDB Token immediately following the TLP (Lanes 4-7 of Symbol 1).
Together, those two changes make it clear to the Receiver that this TLP has
been nullified and should be discarded. Note that the EDB bytes are not
included in the packet length field, because they are dynamically added to a
packet in flight when an error occurs.

Figure 12-13: Gen3 x8 Nullified Packet

            Lanes 0-3                             Lanes 4-7
Sync        0 0 0 0                               0 0 0 0
            1 1 1 1                               1 1 1 1
Symbol 0    STP Token: Length = 7, CRC,           TLP
            Parity, Seq Num
Symbol 1    TLP                                   TLP
Symbol 2    TLP                                   TLP
Symbol 3    LCRC                                  SDP Token, DLLP
Symbol 4    DLLP                                  IDL IDL IDL IDL (Logical Idle)
Symbol 5    IDL IDL IDL IDL                       IDL IDL IDL IDL
Symbol 6    STP Token: Length = 23, CRC,          Header DW 1
            Parity, Seq Num
Symbol 7    Header DW 2                           Header DW 3
...
Symbol 15   Data DW 14 (DW 18)                    Data DW 15 (DW 19)
Sync        0 0 0 0                               0 0 0 0
            1 1 1 1                               1 1 1 1
            (TLP straddles Block boundary)
Symbol 0    Data DW 16 (DW 20)                    Data DW 17 (DW 21)
Symbol 1    LCRC (inverted)                       EDB EDB EDB EDB
            (Nullified TLP)
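
The two nullification steps (invert the LCRC, then append four EDB Symbols) can
be sketched as a byte-level transformation. This Python sketch is illustrative
only: the function name is hypothetical, the input is assumed to be the TLP
bytes ending with the 4-byte LCRC, and the EDB Symbol value C0h is the
"0000 0011b" pattern from Figure 12-8 read most-significant bit first.

```python
EDB_SYMBOL = 0xC0  # drawn as "0000 0011b" (bit 0 first) in Figure 12-8


def nullify_tlp(tlp: bytes) -> bytes:
    """Invert the trailing 4-byte LCRC and append a 4-Symbol EDB Token."""
    if len(tlp) < 4 or len(tlp) % 4 != 0:
        raise ValueError("expected a dword-multiple TLP ending with its LCRC")
    body, lcrc = tlp[:-4], tlp[-4:]
    inverted_lcrc = bytes(value ^ 0xFF for value in lcrc)
    # The EDB Token extends the packet but is not counted in the Length field.
    return body + inverted_lcrc + bytes([EDB_SYMBOL] * 4)


if __name__ == "__main__":
    example = bytes([0x01, 0x02, 0x03, 0x04, 0x12, 0x34, 0x56, 0x78])
    print(nullify_tlp(example).hex())
```

The output is always 4 bytes longer than the input, which is why the receiver
has to look at the Symbols immediately after the counted end of every TLP.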

Ordered Set Example - SOS


Now let's consider an example of Ordered Set transmission. As shown in Figure
12-14 on page 427, an Ordered Set is indicated by the 2-bit Sync Header value
of 01b. The bytes that follow will be understood by the receiver to make up an
Ordered Set that is always 16 bytes (128 bits) in length. The one exception is
the SOS (Skip Ordered Set), because it can be changed by intermediate receivers
in increments of 4 bytes at a time for clock compensation. Consequently, an SOS
is legally allowed to be 8, 12, 16, 20, or 24 Symbols in length. In the absence
of a Link repeater device that adds or deletes SKPs, an SOS will also be made
up of 16 bytes.

Figure 12-14: Gen3 x1 Ordered Set Construction

[Figure: an Ordered Set Block on a x1 Link, drawn against time in UI. The 2-bit
Sync Header is followed by the 128-bit payload, Symbol 0 through Symbol 15,
with each Symbol drawn bit 0 first.]

To illustrate an Ordered Set, let's use an SOS to show the various features and
how they work together. Consider Figure 12-15 on page 428, where a Data Block
is followed by an SOS. The framing rules state that the previous Data Block
must end with an EDS Token in the last dword to let the receiver know that an
Ordered Set is coming. If the current Data Stream is to continue, the Ordered
Set that follows must be an SOS, and that must be followed in turn by another
Data Block. This example doesn't show it, but it's possible that a TLP might be
incomplete at this point and would straddle the SOS by resuming transmission in
the Data Block that must immediately follow the SOS.

Receiving the EDS Token means that the Data Stream is either ending or pausing
to insert an SOS. An EDS is the only Token that can start on a dword-aligned
Lane in the same Symbol Time as an IDL, and this example does just that,
beginning in Lane 4 of Symbol Time 15. Recall that EDS must also be in the last
dword of the Data Block. According to the receiver framing requirements, only
an Ordered Set Block is allowed after an EDS, and it must be an SOS, EIOS, or
EIEOS or else it will be seen as a framing error. As was true for earlier spec
versions, the Ordered Sets must appear on all Lanes at the same time. Receivers
may optionally check to ensure that each Lane sees the same Ordered Set.

In our example, a 16-byte SOS is seen next, and is recognized by the Ordered
Set Sync Header as well as the SKP byte pattern. The last 4 Symbols of the SOS
always consist of SKP_END followed by 3 Symbols (24 bits) that carry the
current scrambler LFSR state. In Symbol 12 the Receiver knows that the SKP
characters have ended and also that the Block has three more bytes to deliver
per Lane. These are the output of the scrambling logic LFSR, as shown in Table
12-2 on page 428.

Figure 12-15: Gen3 x8 Skip Ordered Set (SOS) Example

            Lanes 0-3                             Lanes 4-7
Sync        0 0 0 0                               0 0 0 0
            1 1 1 1                               1 1 1 1
            (Data Block)
Symbol 0    STP Token: Length = 7, CRC,           TLP (7 DW)
            Parity, Seq Num
Symbol 1    TLP                                   TLP
Symbol 2    TLP                                   TLP
Symbol 3    LCRC                                  SDP Token, DLLP
Symbol 4    DLLP                                  IDL IDL IDL IDL
Symbol 5    IDL IDL IDL IDL                       IDL IDL IDL IDL
Symbol 6    SDP Token, DLLP (8 Symbols across all eight Lanes)
Symbol 7    IDL IDL IDL IDL                       IDL IDL IDL IDL
...
Symbol 15   IDL IDL IDL IDL                       EDS Token (End of Data Stream)
Sync        1 1 1 1                               1 1 1 1
            0 0 0 0                               0 0 0 0
            (Ordered Set Block)
Symbol 0    SKP SKP SKP SKP                       SKP SKP SKP SKP
...
Symbol 3    SKP SKP SKP SKP                       SKP SKP SKP SKP
Symbol 4    SKP_END on every Lane (End of SOS)
Symbol 5    LFSR on every Lane (LFSR output as filler)
Symbol 6    LFSR on every Lane
Symbol 7    LFSR on every Lane
Sync        0 0 0 0                               0 0 0 0
            1 1 1 1                               1 1 1 1
            (Data Block)

Table 12-2: Gen3 16-Byte Skip Ordered Set Encoding

Symbol Number   Value      Description
0 to 11         AAh        SKP Symbol. Since Symbol 0 is the Ordered Set
                           Identifier, this is seen as an SOS.
12              E1h        SKP_END Symbol, which indicates that the SOS will be
                           complete after 3 more Symbols.
13              00h-FFh    a) If LTSSM state is Polling.Compliance: AAh
                           b) Else if prior Block was a Data Block:
                                Bit[7] = Data Parity
                                Bit[6:0] = LFSR[22:16]
                           c) Else:
                                Bit[7] = ~LFSR[22]
                                Bit[6:0] = LFSR[22:16]
14              00h-FFh    a) If LTSSM state is Polling.Compliance:
                              Error_Status[7:0]
                           b) Else: LFSR[15:8]
15              00h-FFh    a) If LTSSM state is Polling.Compliance:
                              Error_Status[7:0]
                           b) Else: LFSR[7:0]

The Data Parity bit mentioned in the table is the even parity of all the Data
Block scrambled bytes that have been sent since the most recent SDS or SOS and
is created independently for each Lane. Receivers are required to calculate and
check the parity. If the bits don't match, the Lane Error Status register bit
corresponding to the Lane that saw the error must be set, but this is not
considered a Receiver Error and does not initiate Link retraining.

The 8-bit Error_Status field only has meaning when the LTSSM is in the
Polling.Compliance state (see "Polling.Compliance" on page 529 for more
details). For our example of an SOS following a Data Block, byte 13 is the Data
Parity bit and LFSR[22:16], while the last two bytes are LFSR bits [15:0].
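
The encoding in Table 12-2 can be sketched as a function that assembles the 16
Symbols of an unmodified SOS for one Lane. The Python below is a hedged
illustration: the function and parameter names are hypothetical, the LFSR state
is simply passed in as a 23-bit integer, and the parity helper computes the
even parity of whatever scrambled bytes the caller supplies.

```python
SKP = 0xAA
SKP_END = 0xE1


def data_parity(scrambled_bytes: bytes) -> int:
    """Even parity (XOR of all bits) over the scrambled Data Block bytes."""
    parity = 0
    for value in scrambled_bytes:
        parity ^= bin(value).count("1") & 1
    return parity


def build_sos(lfsr: int, prior_block_was_data: bool, parity: int = 0,
              polling_compliance: bool = False, error_status: int = 0) -> bytes:
    """Assemble a 16-Symbol SOS for one Lane following Table 12-2."""
    symbols = [SKP] * 12 + [SKP_END]  # Symbols 0 to 11, then Symbol 12
    if polling_compliance:
        symbols += [0xAA, error_status & 0xFF, error_status & 0xFF]
    else:
        if prior_block_was_data:
            bit7 = parity & 1
        else:
            bit7 = ((lfsr >> 22) & 1) ^ 1               # ~LFSR[22]
        symbols.append((bit7 << 7) | ((lfsr >> 16) & 0x7F))  # LFSR[22:16]
        symbols.append((lfsr >> 8) & 0xFF)                   # LFSR[15:8]
        symbols.append(lfsr & 0xFF)                          # LFSR[7:0]
    return bytes(symbols)


if __name__ == "__main__":
    # Example LFSR state chosen arbitrarily for illustration.
    sos = build_sos(0x1DBFBC, prior_block_was_data=True, parity=1)
    print(sos.hex())
```

A repeater that adds or removes SKP Symbols would change only the number of
AAh Symbols at the front; the SKP_END and the three trailing Symbols keep the
same meaning.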

Transmitter SOS Rules


The SOS rules for Transmitters when using 128b/130b include:

• An SOS must be scheduled to occur within 370 to 375 blocks. In Loopback
  mode, however, the Loopback Master must schedule two SOSs within that time,
  and they must be no more than two blocks from each other.
• SOSs can still only be sent on packet boundaries and may be accumulated as a
  result. However, consecutive SOSs are not permitted; they must be separated
  by a Data Block.
• It's recommended that SOS timers and counters be reset whenever the
  Transmitter is Electrically Idle.
• The Compliance SOS bit in Link Control Register 2 has no effect when using
  128b/130b. (It's used to disable SOSs during Compliance testing for 8b/10b,
  but that isn't an option for 128b/130b.)

Receiver SOS Rules


The Skip Ordered Set rules for Receivers when using 128b/130b include:

• They must tolerate receiving SOSs at an average interval of 370-375 blocks.
  Note that the first SOS after Electrical Idle may arrive earlier than that,
  since Transmitters are not required to reset SOS timers during Electrical
  Idle time.
• Receivers must check to see that every SOS in a Data Stream is preceded by a
  Data Block that ends with EDS.

Scrambling
The scrambling logic for 128b/130b is modified from the previous PCIe
generations to address the two issues that 8b/10b encoding handled
automatically: maintaining DC Balance and providing a sufficient transition
density. By way of review, recall that DC Balance means the bit stream has an
equal number of ones and zeros. This is intended to avoid the problem of "DC
wander", in which the transmission medium is charged toward one voltage or the
other so much, by a prevalence of ones or zeros, that it becomes difficult to
switch the signal within the necessary time. The other problem is that clock
recovery at the Receiver needs to see enough edges in the input signal to be
able to compare them to the recovered clock and adjust the timing and phase as
needed.

Without 8b/10b to handle these issues, three steps were taken: First, the new
scrambling method improves both transition density and DC Balance over longer
time periods, but doesn't guarantee them over short periods the way 8b/10b did.
Second, the TS1 and TS2 Ordered Set patterns used during training include
fields that are adjusted as needed to improve DC Balance. And third, Receivers
must be more robust and tolerant of these issues than they were in the earlier
generations.

Number of LFSRs
At the lower data rates every Lane was scrambled in the same way, so a single
Linear Feedback Shift Register (LFSR) could supply the scrambling input for all
of them. For Gen3, though, the designers wanted different scrambling values for
neighboring Lanes. The reasons probably include a desire to decrease the
possibility of cross-talk between the Lanes by scrambling their outputs with
respect to each other and avoid having the same value on each Lane, as might
happen when sending IDLs. The spec describes two approaches to achieving this
goal, one that emphasizes lower latency and one that emphasizes lower cost.

First Option: Multiple LFSRs. One solution is to implement a separate LFSR for
each Lane, and initialize each with a different starting value or "seed". This
has the advantage of simplicity and speed, at the cost of adding logic. As
shown in Figure 12-16, each LFSR creates a pseudo-random output based on the
polynomial given in the spec as

    G(X) = X^23 + X^21 + X^16 + X^8 + X^5 + X^2 + 1

This polynomial is longer than the previous version and also behaves a little
differently because of the different seed values. Eight different seed values
are specified, one for each of Lanes 0 through 7, requiring eight different
LFSRs.

Figure 12-16: Gen3 Per-Lane LFSR Scrambling Logic

[Figure: a 23-stage shift register, D0 through D22, in which every stage can be
loaded from its own Seed bit and XOR gates are placed between stages according
to the polynomial. Data In is XORed with the LFSR output to produce Data Out.]

The 24-bit seed value for each Lane is listed in Table 12-3 on page 432. The
series repeats itself, meaning the seed for Lane 8 will be the same as Lane 0,
so only the first 8 values are shown. Every Lane uses the same LFSR design and
the same tap points to create the scrambling output, and the different seed
values give the desired difference.

Table 12-3: Gen3 Scrambler Seed Values

Lane   Seed Value
0      1DBFBCh
1      0607BBh
2      1EC760h
3      18C0DBh
4      010F12h
5      19CFC9h
6      0277CEh
7      1BB807h
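
To show how the per-Lane seeds and the polynomial fit together, here is a
hedged Python sketch of the multiple-LFSR option. Several details are
assumptions made for illustration and are not confirmed by this text: the
Galois-style arrangement (feedback from D22 XORed into D0, D2, D5, D8, D16, and
D21, inferred from the polynomial and the XOR positions in Figure 12-16), the
use of D22 as the scrambling bit, and the bit-0-first processing order. The
class and constant names are hypothetical, and the sketch has not been checked
against reference vectors from the spec.

```python
# Seed values from Table 12-3; the series repeats every eight Lanes.
LANE_SEEDS = [0x1DBFBC, 0x0607BB, 0x1EC760, 0x18C0DB,
              0x010F12, 0x19CFC9, 0x0277CE, 0x1BB807]

LFSR_MASK = (1 << 23) - 1
# Feedback positions for G(X) = X^23 + X^21 + X^16 + X^8 + X^5 + X^2 + 1.
FEEDBACK = (1 << 0) | (1 << 2) | (1 << 5) | (1 << 8) | (1 << 16) | (1 << 21)


class LaneScrambler:
    """Per-Lane LFSR scrambler (ASSUMED Galois form, see lead-in)."""

    def __init__(self, lane: int) -> None:
        self.state = LANE_SEEDS[lane % 8]

    def _next_bit(self) -> int:
        out = (self.state >> 22) & 1            # D22 is the scrambling bit
        self.state = (self.state << 1) & LFSR_MASK
        if out:
            self.state ^= FEEDBACK
        return out

    def scramble_byte(self, value: int) -> int:
        result = 0
        for bit in range(8):                    # bit 0 is processed first
            result |= (((value >> bit) & 1) ^ self._next_bit()) << bit
        return result

    def scramble(self, data: bytes) -> bytes:
        return bytes(self.scramble_byte(value) for value in data)


if __name__ == "__main__":
    idle = bytes(4)  # four IDL Symbols (all zeros)
    for lane in range(2):
        print(lane, LaneScrambler(lane).scramble(idle).hex())
```

Because scrambling is an XOR with the LFSR output, running the same data
through a second scrambler that starts from the same seed restores the original
bytes, and the differing seeds give neighboring Lanes different output even
when they all carry IDLs.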

Second Option: Single LFSR. The alternative solution, illustrated in Figure
12-17 on page 433 for Lanes 2, 10, 18, and 26, is to use just one LFSR and
create the scrambling inputs for each Lane by XORing different tap points
together. Since there's only one LFSR, the seed value is the same for all Lanes
(all ones), but the scrambling Tap Equation for each Lane is derived by
combining different tap points, as shown in Table 12-4 on page 433. The spec
also notes that 4 of the Lanes' Tap Equations can be derived by XORing the tap
values of their neighbors:

• Lane 0 = Lane 7 XOR Lane 1 (note that the process of going to lower Lane
  numbers wraps around, with the result that Lane 7 is considered lower than
  Lane 0)
• Lane 2 = Lane 1 XOR Lane 3
• Lane 4 = Lane 3 XOR Lane 5
• Lane 6 = Lane 5 XOR Lane 7

The single-LFSR solution uses fewer gates than the multi-LFSR version does, but
incurs extra latency through the XOR process, providing a different
cost/performance option.


Figure 12-17: Gen3 Single-LFSR Scrambler

[Figure: the same 23-stage LFSR (D0 through D22, each stage with a Seed input).
An additional XOR combines the tap points of the Tap Equation for Lanes 2, 10,
18, and 26, and that result is XORed with Data In to produce Data Out for those
Lanes.]

Table 12-4: Gen3 Tap Equations for Single-LFSR Scrambler

Lane Numbers     Tap Equation
0, 8, 16, 24     D9 xor D13
1, 9, 17, 25     D1 xor D13
2, 10, 18, 26    D13 xor D22
3, 11, 19, 27    D1 xor D22
4, 12, 20, 28    D3 xor D22
5, 13, 21, 29    D1 xor D3
6, 14, 22, 30    D3 xor D9
7, 15, 23, 31    D1 xor D9
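
As a quick check of the neighbor relationship described for the single-LFSR option, the sketch below encodes Table 12-4 as sets of tap positions and confirms that the even Lanes' equations fall out of XORing their neighbors (a tap that appears in both neighbors cancels). The function names are hypothetical and the bit numbering of the LFSR state (bit n holds Dn) is an assumption made for illustration.

```python
# Tap positions from Table 12-4, keyed by Lane number modulo 8.
TAP_EQUATIONS = {
    0: {9, 13}, 1: {1, 13}, 2: {13, 22}, 3: {1, 22},
    4: {3, 22}, 5: {1, 3},  6: {3, 9},   7: {1, 9},
}


def taps_for_lane(lane):
    """Return the set of LFSR stages XORed together for this Lane."""
    return frozenset(TAP_EQUATIONS[lane % 8])


def scramble_bit(lfsr_state, lane):
    """XOR the Lane's tap points out of a shared LFSR state (bit n = Dn, assumed)."""
    bit = 0
    for tap in taps_for_lane(lane):
        bit ^= (lfsr_state >> tap) & 1
    return bit


def xor_of_neighbors(lane):
    """Combine the neighbors' equations; Lane numbering wraps from 0 back to 7."""
    lower = taps_for_lane((lane - 1) % 8)
    upper = taps_for_lane((lane + 1) % 8)
    return lower ^ upper            # symmetric difference: shared taps cancel
```

For example, Lane 1 (D1 xor D13) XOR Lane 3 (D1 xor D22) cancels D1 and leaves D13 xor D22, which is the Lane 2 entry in the table.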

Scrambling Rules
The Gen3 scrambler LFSRs (whether one or more) do not continually advance,
but only advance based on what is being sent. The scramblers must be
re-initialized periodically and that takes place whenever an EIEOS or FTSOS
is seen. The spec gives several rules for scrambling that are listed here for
convenience:


- Sync Header bits are not scrambled and do not advance the LFSR.
- The Transmitter LFSR is reset when the last EIEOS Symbol has been sent,
  and the Receiver LFSR is reset when the last EIEOS Symbol is received.
- TS1 and TS2 Ordered Sets:
  - Symbol 0 bypasses scrambling
  - Symbols 1 to 13 are scrambled
  - Symbols 14 and 15 may or may not be scrambled. The spec states that
    they will bypass scrambling if necessary to improve DC Balance, but
    otherwise will be scrambled (see "TS1 and TS2 Ordered Sets" on
    page 510 for more details on how DC Balance is maintained).
- All Symbols of the Ordered Sets FTS, SDS, EIEOS, EIOS, and SOS bypass
  scrambling. Despite this, the output data stream will have sufficient
  transition density to allow clock recovery and the symbols chosen for the
  Ordered Sets result in a DC-balanced output.
- Even when bypassed, Transmitters advance their LFSRs for all Ordered Set
  Symbols except for those in the SOS.
- Receivers do the same, checking Symbol 0 of an incoming Ordered Set to
  see whether it is an SOS. If so, the LFSRs are not advanced for any of the
  Symbols in that Block. Otherwise the LFSRs are advanced for all the
  Symbols in that Block.
- All Data Block Symbols are scrambled and advance the LFSRs.
- Symbols are scrambled in little-endian order, meaning the least significant
  bit is scrambled first and the most significant bit is scrambled last.
- The seed value for a per-Lane LFSR depends on the Lane number assigned
  to the Lane when the LTSSM first entered Configuration.Idle (having
  finished the Polling state). The seed values, modulo 8, are shown in
  Table 12-3 on page 432 and, once assigned, won't change as long as
  LinkUp = 1 even if Lane assignments are changed by going back to the
  Configuration state.
- Unlike 8b/10b, scrambling cannot be disabled while using 128b/130b
  encoding because it is needed to help with signal integrity. It's not
  expected that the Link would operate reliably without it, so it must
  always be on.
- A Loopback Slave must not scramble or descramble the looped-back bits.
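
The per-Symbol rules above can be condensed into a small lookup, shown here as a rough sketch rather than anything taken from the spec. The function name, the string labels for Block types, and the dc_balance_bypass flag are all hypothetical; the LFSR reset after the last EIEOS Symbol and the Loopback Slave rule are not modeled.

```python
# Ordered Sets whose Symbols always bypass scrambling (from the rules above).
ALWAYS_BYPASS = {"FTS", "SDS", "EIEOS", "EIOS", "SOS"}


def symbol_rule(block, index, dc_balance_bypass=False):
    """Return (scrambled, lfsr_advances) for Symbol `index` of a Block.

    `block` is a hypothetical label: "DATA", "TS1", "TS2", or one of
    ALWAYS_BYPASS. `dc_balance_bypass` models the case where Symbols 14
    and 15 of a TS1/TS2 bypass scrambling to improve DC Balance.
    """
    if block == "DATA":
        return True, True                  # Data Blocks: scrambled and advance
    advances = block != "SOS"              # every Ordered Set advances except SOS
    if block in ("TS1", "TS2"):
        if index == 0:
            return False, advances         # Symbol 0 bypasses scrambling
        if index >= 14 and dc_balance_bypass:
            return False, advances         # Symbols 14-15 may bypass
        return True, advances              # Symbols 1-13 (and usually 14-15)
    if block in ALWAYS_BYPASS:
        return False, advances
    raise ValueError(f"unknown block type: {block}")
```

A Receiver following the same table only needs to inspect Symbol 0 to decide whether the Block is an SOS and therefore whether to freeze its LFSR for the whole Block.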

Serializer
This shift register works like it does for Gen1/Gen2 data rates except that
it is now receiving 8 bits at a time instead of 10 (i.e., the serializer is
an 8-bit parallel-to-serial shift register).


Mux for Sync Header Bits


Finally, the two Sync Header bits must be injected to distinguish the next
Block of characters as a Data Block or an Ordered Set Block. These are the
first two bits of each 130-bit Block and the logic for them could be added
anywhere in the transmitter that makes sense for the design. In this example
the bits are injected at the end of the process for simplicity. Wherever they
are included, the flow of bytes from above must be stalled to allow time for
them. In this example there will need to be a way to inform the logic above
to pause for two bit times. The flow of incoming packets will just be queued
in the Tx Buffer during the time the Sync bits are being sent.

Gen3 Physical Layer Receive Logic


As in the earlier generations, the Receiver's logic, shown in Figure 12-18 on
page 436, begins with the CDR (Clock and Data Recovery) circuit. This
probably includes a PLL that locks onto the frequency of the Transmitter
clock based on knowledge of the expected frequency and the edges in the bit
stream to generate a recovered clock (Rx Clock). This recovered clock latches
the incoming bits into a deserializing buffer and then, once Block Alignment
has been established (during the Recovery state of the LTSSM), another
version of the recovered clock that is divided by 8.125 (Rx Clock / 8.125)
latches the 8-bit Symbols into the Elastic Buffer. After that, the
descrambler recreates the original data from the scrambled characters. The
bytes bypass the 8b/10b decoder and are delivered directly to the Byte
Un-striping logic. Finally, the Ordered Sets are filtered out, and the
remaining byte stream of TLPs and DLLPs is forwarded up to the Data Link
Layer.

In the following discussion, each part is described working upward from the
bottom. The focus is on describing aspects of the Physical Layer changed for
8.0 GT/s. Sub-blocks unchanged from Gen1/Gen2 will not be described in this
section.

Differential Receiver
The differential receiver logic is unchanged, but there are electrical
changes to improve signal integrity (see "Signal Compensation" on page 468),
as well as training changes to establish signal equalization, which are
covered in "Link Equalization Overview" on page 577.


Figure 12-18: Gen3 Physical Layer Receiver Details
(Diagram: for each Lane 0 through N, the Rx input feeds the CDR Logic, then
either the 8b/10b Decoder and De-Scrambler or the Gen3 De-Scrambler, selected
by a Mux; the Lanes are combined by Byte Un-Striping, followed by Packet
Filtering and the Rx Buffer, which passes N*8 bits plus a TLP/DLLP Indicator
to the Data Link Layer.)


Figure 12-19: Gen3 CDR Logic
(Diagram: the Differential Receiver on D+/D- produces the Serial Bit Stream;
the Rx Clock Recovery PLL generates Rx Clock and Rx Clock / 8.125; the
De-Serializing Register feeds the Block Alignment & Block Type Detect Logic
and, 8 bits at a time, the Elastic Buffer, which is read out with the Local
Clock PLL into the Lane De-skew Delay Circuit.)

CDR (Clock and Data Recovery) Logic


Rx Clock Recovery
Although the new scrambling scheme helps with clock recovery, it doesn't
guarantee good transition density over short intervals. As a result, the CDR
logic must now be able to maintain synchronization for longer periods without
as many edges. No specific method for accomplishing this is given in the
spec, but a more robust PLL (Phase-Locked Loop) or DLL (Delay-Locked Loop)
circuit will likely be needed.

Another aspect of the CDR logic that's different now is that the internal
clock used by the Elastic Buffer is not simply the Rx clock divided by 8 as
one might expect. The reason, of course, is that the input is not a regular
multiple of 8-bit bytes. Instead, it is a 2-bit Sync Header followed by 16
bytes. Those extra two bits must be accounted for somewhere. The spec doesn't
require any particular implementation, but one solution would have the clock
divided by 8.125, as shown in Figure 12-19 on page 437, to produce 16 clock
edges over 130 bit times.


The Block Type Detection logic might then be used to take the extra two bits
out of the deserializer that it needs to examine anyway, when a block
boundary time is reached, ensuring that only 8-bit bytes are delivered to the
Elastic Buffer.

Just to tie up all the loose ends on this discussion, the internal clock for
the 8.0 GT/s data rate will actually be 8.0 GHz / 8.125 = 0.985 GHz. That
results in slightly less than the 1.0 GB/s data rate that's usually used to
describe the Gen3 bandwidth, but the difference is small enough (1.5% less
than 1 GB/s) that it usually isn't mentioned.
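
The arithmetic behind the 8.125 divider can be reproduced with a few lines of Python. The constant and function names below are made up for illustration; the numbers are the ones used in the text.

```python
BITS_PER_BLOCK = 130        # 2 Sync Header bits + 16 Symbols * 8 bits
SYMBOLS_PER_BLOCK = 16


def symbol_clock_divider():
    """Bit times per Symbol clock edge: 16 edges spread over 130 bit times."""
    return BITS_PER_BLOCK / SYMBOLS_PER_BLOCK


def symbol_clock_ghz(bit_rate_ghz=8.0):
    """Symbol clock rate, which is also the per-Lane byte rate in GB/s."""
    return bit_rate_ghz / symbol_clock_divider()


divider = symbol_clock_divider()                  # 8.125
rate = symbol_clock_ghz()                         # about 0.985
shortfall_percent = (1.0 - rate) * 100            # about 1.5
print(divider, round(rate, 3), round(shortfall_percent, 1))
```

The shortfall is simply the 2 Sync Header bits out of every 130, i.e. 2 / 130, or about 1.5%.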

Deserializer
The incoming data is clocked into each Lane's serial-to-parallel converter by
the recovered Rx clock, as shown in Figure 12-19 on page 437. The 8-bit
Symbols are sent to the Elastic Buffer and clocked into the Elastic Buffer by
a version of the Rx Clock that has been divided by 8.125 to properly
accommodate 16 bytes in 130 bits.

Achieving Block Alignment


The EIEOSs sent during training serve to identify boundaries for the 130-bit
blocks. As shown in Figure 12-20 on page 438, this Ordered Set can be
recognized in a bit stream because it appears as a pattern of alternating
bytes of 00h and FFh. When this pattern is seen, the last Symbol of the EIEOS
is interpreted as the Block boundary, and testing the next 130 bits will
reveal whether the boundary is correct. If not, the logic continues to search
for this pattern. This process is described in the spec as occurring in three
phases: Unaligned, Aligned, and Locked.

Figure 12-20: EIEOS Symbol Pattern

Symbol    Value
0         00000000
1         11111111
2         00000000
3         11111111
4         00000000
...
13        11111111
14        00000000
15        11111111


Unaligned Phase. Receivers enter this phase after a period of Electrical
Idle, such as after changing to 8.0 GT/s or exiting from a low-power Link
state. In this phase, the Block Alignment logic watches for the arrival of an
EIEOS, since the end of the alternating bytes must correspond to the end of
the Block. When an EIEOS is seen, the alignment is adjusted and the logic
proceeds to the next phase. Until then, it must also adjust its Block
alignment based on the arrival of any SOS.

Aligned Phase. In this phase Receivers continue to monitor for EIEOS and make
any necessary adjustments to their bit and Block alignment if they see one.
However, since they've tentatively identified block boundaries they can also
now search for an SDS (Start of Data Stream) Ordered Set to indicate the
beginning of a Data Stream. When an SDS is seen, the receiver proceeds to the
Locked phase. Until then, it must also adjust its Block alignment based on
the arrival of SOSs. If an undefined Sync Header is detected (value of 00b or
11b) the Receiver is allowed to return to the Unaligned phase. The spec notes
that this will happen during Link training when EIEOS is followed by a TS
Ordered Set.

Locked Phase. Once a Receiver reaches this phase, it no longer adjusts its
Block alignment. Instead, it now expects to see a Data Block after the SDS
and if the alignment has to be readjusted at this point, some misaligned
data will probably be lost. If an undefined Sync Header is detected the
Receiver is allowed to return to the Unaligned or Aligned phase. Receivers
can be directed to transition out of the Locked phase to one of the others as
long as Data Stream processing is stopped (see "Data Stream and Data Blocks"
on page 413 for the rules regarding Data Streams).

Special Case: Loopback. While discussing Block alignment, the spec describes
what happens when the Link is in Loopback mode. The Loopback Master must be
able to adjust alignment during Loopback, and is allowed to send EIEOS and
adjust its Receiver based on a detected EIEOS when they are echoed back
during Loopback.Active. The Loopback Slave must be able to adjust alignment
during Loopback.Entry but must not adjust alignment during Loopback.Active.
The Slave's Receiver is considered to be in the Locked phase when the Slave
begins to loop back the bit stream.
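
The three alignment phases behave like a small state machine, which the following sketch tries to capture. It is a simplification under stated assumptions: the event names are hypothetical labels, the optional transitions on an undefined Sync Header are modeled as always taken (the text only says they are allowed), the Locked phase falls back to Aligned although Unaligned is equally permitted, and Loopback is not modeled.

```python
from enum import Enum


class Phase(Enum):
    UNALIGNED = "Unaligned"
    ALIGNED = "Aligned"
    LOCKED = "Locked"


def next_phase(phase, event):
    """Return the next Block alignment phase for a (hypothetical) event label.

    Events: "ELECTRICAL_IDLE", "EIEOS", "SDS", "SOS", "BAD_SYNC_HEADER".
    """
    if event == "ELECTRICAL_IDLE":
        return Phase.UNALIGNED                 # alignment must be re-acquired
    if phase is Phase.UNALIGNED:
        # EIEOS gives a tentative boundary; an SOS only adjusts alignment.
        return Phase.ALIGNED if event == "EIEOS" else Phase.UNALIGNED
    if phase is Phase.ALIGNED:
        if event == "SDS":
            return Phase.LOCKED                # Data Stream begins
        if event == "BAD_SYNC_HEADER":
            return Phase.UNALIGNED             # allowed, not mandatory
        return Phase.ALIGNED                   # EIEOS/SOS adjust alignment only
    # Locked: alignment is frozen; a bad Sync Header may drop back
    # (this sketch picks Aligned; Unaligned is also permitted).
    if event == "BAD_SYNC_HEADER":
        return Phase.ALIGNED
    return Phase.LOCKED


def adjusts_alignment(phase, event):
    """True if the Receiver may adjust its Block alignment on this event."""
    if phase is Phase.LOCKED:
        return False
    if phase is Phase.UNALIGNED:
        return event in ("EIEOS", "SOS")
    return event in ("EIEOS", "SOS")           # Aligned phase
```

A typical training sequence would walk Unaligned, Aligned, Locked as an EIEOS and then an SDS arrive.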

Block Type Detection


Once Block Alignment has been achieved, the Receiver can recognize the start
times of the incoming blocks and examine the first two bits to identify which
of the two possible types are coming in. Ordered Set Blocks are only
interesting to the Physical Layer, so they're not forwarded to the higher
layers, but Data


Blocks do get forwarded. When the Sync Header is detected, this information
is signaled to other parts of the Physical Layer to determine whether the
current block should be removed from the byte stream going to the higher
layers. The clock recovery mechanism and Sync Header detection effectively
accomplish the conversion from 130 bits to 128 bits that must take place in
the Physical Layer.

Note that since the block information is the same for every Lane, this logic
may simply be implemented for only one Lane, such as Lane 0 as shown in
Figure 12-18 on page 436. However, if different Link widths and Lane Reversal
were supported then more Lanes would need to include this logic to ensure
that there would always be one active Lane with this logic available. An
example might be that every Lane which is able to operate as Lane 0 would
implement it, but only the one that was currently acting as Lane 0 would use
it. Note also that, since the spec doesn't give details in this regard, the
examples discussed and illustrated here are only educated guesses at a
workable implementation.

Receiver Clock Compensation Logic


Background
The clock requirements for 8.0 GT/s are the same as they were in the earlier
spec versions: the clocks of both Link partners must be within +/- 300 ppm
(parts per million) of the center frequency, which results (in the worst
case) in gaining or losing one clock after every 1666 clocks.
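
The 1666 figure follows directly from the tolerance: if the two clocks are off by 300 ppm in opposite directions, they differ by 600 ppm, and one clock is gained or lost every 1,000,000 / 600 clocks. A minimal sketch of that calculation (the function name is hypothetical):

```python
def clocks_per_slip(tx_offset_ppm, rx_offset_ppm):
    """Number of clocks after which the two clocks differ by one full clock."""
    difference_ppm = abs(tx_offset_ppm - rx_offset_ppm)
    if difference_ppm == 0:
        return float("inf")            # identical clocks never slip
    return 1_000_000 / difference_ppm


worst_case = clocks_per_slip(+300, -300)   # 1666.67, quoted as 1666 in the text
print(int(worst_case))
```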

Elastic Buffer's Role


The received Symbols are clocked into the elastic buffer, as shown in Figure
12-21 on page 441, using the recovered clock and clocked out using the
receiver's local clock. The Elastic Buffer compensates for the frequency
difference by adding or removing SKP Symbols as before, but now it does so
four Symbols at a time instead of only one at a time. When a SKP Ordered Set
arrives, control logic watching the status of the buffer makes an evaluation.
If the local clock is running faster, the buffer will be approaching an
underflow condition and the logic can compensate by appending four extra SKPs
when the SOS arrives to quickly refill the buffer. On the other hand, if the
recovered clock is running faster, the buffer will be approaching an overflow
condition and the logic will compensate for that by deleting four SKPs to
quickly drain the buffer when an SOS is seen.


Figure 12-21: Gen3 Elastic Buffer Logic
(Diagram: the same blocks as Figure 12-19, highlighting the Elastic Buffer,
which is written with Rx Clock / 8.125 from the Rx Clock Recovery PLL and
read with the Local Clock PLL before the Lane De-skew Delay Circuit.)

Gen3 Transmitters schedule an SOS once every 370 to 375 blocks but, as
before, they can only be sent on block boundaries. If a packet is in progress
when SOSs are scheduled, they are accumulated and inserted at the next packet
boundary. However, unlike the lower data rates, two consecutive SOSs are not
allowed at 8.0 GT/s; they must be separated by a Data Block. Receivers must
be able to tolerate SOSs separated by the maximum packet payload size a
device supports.

The fact that adjustments are only made in increments of 4 Symbols may affect
the depth of the Elastic Buffer, since a difference of 4 would need to be
seen before any compensation is applied, and a large packet may be in
progress at what would otherwise be the appropriate time. For that reason,
care will need to be exercised in determining the optimal size of this
buffer, so let's consider an example. The allowed time between SOSs of 375
blocks at 16 Symbols per block equals 6000 Symbol times. Dividing that by the
worst-case time to gain or lose a clock of 1666 means that 3.6 clocks could
be gained or lost during that period. If the largest possible TLP (4KB) had
started just prior to the next SOS being sent, the overall delay for it
becomes about 6000 + 4096 = 10096 Symbol times for a x1 Link, which
translates to a gain or loss of 10096 / 1666 = 6.06 clocks. Consequently, if
TLPs of 4KB in size are supported, the buffer might be designed to handle 7
Symbols too many or too few before an SOS is guaranteed to arrive. It may
happen that two SOSs are scheduled before the first one is sent. At the lower
data rates, the queued SOSs are sent back-to-back, but for 8.0 GT/s they are
not and must be separated by a Data Block. Whenever an SOS does arrive at the
Receiver, it can add or remove 4 SKP Symbols to quickly fill or drain the
buffer and avoid a problem.
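
The buffer-sizing example above is easy to replay in code. The sketch below uses the same figures as the text (375 blocks, 16 Symbols per block, one clock of drift per 1666 clocks, a 4 KB TLP on a x1 Link); the function and parameter names are hypothetical.

```python
import math

SYMBOLS_PER_BLOCK = 16
CLOCKS_PER_SLIP = 1666          # worst case at 600 ppm total difference


def worst_case_drift(max_blocks_between_sos=375, tlp_in_progress_bytes=0):
    """Clocks gained or lost before an SOS is guaranteed to arrive (x1 Link)."""
    symbol_times = max_blocks_between_sos * SYMBOLS_PER_BLOCK + tlp_in_progress_bytes
    return symbol_times / CLOCKS_PER_SLIP


idle_drift = worst_case_drift()                              # about 3.6
tlp_drift = worst_case_drift(tlp_in_progress_bytes=4096)     # about 6.06
margin_symbols = math.ceil(tlp_drift)                        # 7
print(round(idle_drift, 1), round(tlp_drift, 2), margin_symbols)
```

Rounding the worst case up gives the 7 Symbols of margin in each direction mentioned in the text.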

Lane-to-Lane Skew
Flight Time Variance Between Lanes
For multi-Lane Links, the difference in arrival times between lanes is
automatically corrected at the Receiver by delaying the early arrivals until
they all match up. The spec allows this to be accomplished by any means a
designer prefers, but using a digital delay after the elastic buffer has one
advantage in that the arrival time differences are now digitized to the local
Symbol clock of the receiver. If the input to one lane makes it on a clock
edge and another one doesn't, the difference between them will be measured in
clock periods, so the early arrival can simply be delayed by the appropriate
number of clocks to get it to line up with the late-comers (see Figure 12-22
on page 444). The fact that the maximum allowable skew at the receiver is a
multiple of the clock periods makes this easy and implies that the spec
writers may have had this implementation in mind. As defined in the spec, the
receiver must be capable of de-skewing up to 20ns for Gen1 (5 Symbol-time
clocks at 4ns per Symbol), 8ns for Gen2 (4 Symbol-time clocks at 2ns per
Symbol), and 6ns for Gen3 (6 Symbol-time clocks at 1ns per Symbol).

De-skew Opportunities
The same Symbol must be seen on all lanes at the same time to perform
de-skewing, and any Ordered Set will do. However, de-skewing is only
performed in the L0s, Recovery, and Configuration LTSSM states. In
particular, it must be completed as a condition for:

- Leaving Configuration.Complete
- Beginning to process a Data Stream after leaving Configuration.Idle or
  Recovery.Idle
- Leaving Recovery.RcvrCfg
- Leaving Rx_L0s.FTS


If skew values change while in L0 (based on temperature or voltage changes,
for example), a Receiver error may occur and cause replayed TLPs. If the
problem becomes persistent, the Link would eventually transition to the
Recovery state and de-skewing would take place there. The spec notes that,
while devices are not allowed to de-skew their Lanes while in L0, the SOSs
that must be sent periodically in this state contain an LFSR value that is
intended to aid external tools in doing this. These tools, unconstrained by
the rules for Data Streams, can search for the SOSs and use the patterns to
achieve Bit Lock, Block Alignment and Lane-to-Lane de-skew in the midst of a
Data Stream.

The spec notes that when leaving L0s the Transmitter will send an EIEOS, then
the correct number of FTSs with another EIEOS inserted after every 32 FTSs,
then one last EIEOS to assist with Block Alignment and, finally, an SDS
Ordered Set for the purpose of de-skewing in addition to starting the Data
Stream.

Receiver Lane-to-Lane De-skew Capability


Understandably, the transmitter is only allowed to introduce a minimal amount
of skew so as to leave the rest of the skew budget to cover routing
differences and other variations. The amount of allowed skew that can be
corrected at the Receiver is shown in Table 12-5 on page 443, where it can be
seen that this skew corresponds easily to a number of Symbol times for Gen3
just as it did for the earlier data rates. That allows the same option of
using delay registers to accomplish de-skew after the elastic buffer as was
described for Gen1/Gen2 Physical Layer implementations earlier.

Table 12-5: Signal Skew Parameters

Parameter                            Gen1      Gen2      Gen3
Tx max skew                          1.3 ns    1.3 ns    1.1 ns
Rx max skew                          20 ns     8 ns      6 ns
Symbol time period                   4 ns      2 ns      1 ns
Rx skew expressed in Symbol Times    5         4         6

When using 8b/10b encoding, an unambiguous de-skew mechanism is to watch for
the COM control character, which must appear on all Lanes simultaneously.
That option is not available for 128b/130b, but Ordered Sets still arrive at
the same time on all the Lanes, such as the SOS, SDS, and EIEOS. As a result,
the process can be very much the same even though the pattern to search for
when de-skewing the Lanes is different.
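
As an illustration of the delay-register approach, the sketch below takes the local Symbol-clock count at which the same Ordered Set was seen on each Lane and computes how many Symbol times each early Lane should be delayed. The limits come from Table 12-5; the function name and the idea of reporting arrivals as clock counts are assumptions made for this example, not something the spec prescribes.

```python
# Receiver de-skew capability from Table 12-5, in nanoseconds.
RX_MAX_SKEW_NS = {"Gen1": 20, "Gen2": 8, "Gen3": 6}
SYMBOL_TIME_NS = {"Gen1": 4, "Gen2": 2, "Gen3": 1}


def max_skew_symbols(gen):
    """Receiver skew limit expressed in whole Symbol times."""
    return RX_MAX_SKEW_NS[gen] // SYMBOL_TIME_NS[gen]


def deskew_delays(arrival_clocks, gen="Gen3"):
    """Delay (in Symbol clocks) to apply to each Lane so all line up.

    `arrival_clocks[i]` is the local Symbol-clock count at which Lane i saw
    the same Ordered Set (for example an SDS). Early Lanes are delayed to
    match the latest arrival.
    """
    latest = max(arrival_clocks)
    delays = [latest - t for t in arrival_clocks]
    if max(delays) > max_skew_symbols(gen):
        raise ValueError("skew exceeds the Receiver's de-skew capability")
    return delays


print(deskew_delays([3, 5, 4, 5]))   # Lane 0 waits 2 clocks, Lane 2 waits 1
```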


Figure 12-22: Receiver Link De-Skew Logic
(Diagram: Lanes 0 through 3 each pass their Rx stream, carrying the Sync
Header and SOS, SDS, or EIEOS Ordered Sets, through a per-Lane Delay measured
in Symbols so that the Ordered Sets line up at the output.)

Descrambler
General
Receivers follow exactly the same rules for generating the scrambling
polynomial that the Transmitter does and simply XOR the same value to the
input data a second time to recover the original information. Like on the
transmit side, they are allowed to implement a separate LFSR for each Lane or
just one.

Disabling Descrambling
Unlike at Gen1/Gen2 data rates, in Gen3 mode, descrambling cannot be disabled
because of its role in facilitating clock recovery and signal integrity. At
the lower rates, the "disable scrambling" bit in the control byte of TS1s and
TS2s would be used to inform a Link neighbor that scrambling was being turned
off. That bit is reserved for rates of 8.0 GT/s and higher.


Byte Un-Striping
This logic is basically unchanged from the Gen1 or Gen2 implementation. At
some point, the byte streams for Gen3 and for the lower data rates will have
to be muxed together, and the example in Figure 12-23 on page 445 shows that
happening just before the un-striping logic.

Figure 12-23: Physical Layer Receive Logic Details
(Diagram: the same receive path as Figure 12-18, showing the per-Lane Mux
that selects between the 8b/10b Decoder and De-Scrambler path and the Gen3
De-Scrambler path just before Byte Un-Striping, followed by Packet Filtering
and the Rx Buffer to the Data Link Layer.)


Packet Filtering
The serial byte stream supplied by the byte un-striping logic contains TLPs,
DLLPs, Logical Idles (IDLs), and Ordered Sets. The Logical Idle bytes and
Ordered Sets are eliminated here and are not forwarded to the Data Link
layer. What remains are the TLPs and DLLPs, which get forwarded along with an
indicator of their packet type.

Receive Buffer (Rx Buffer)


The Rx Buffer holds received TLPs and DLLPs until the Data Link Layer is able
to accept them. The interface to the Data Link Layer is not described in the
spec, and so a designer is free to choose details like the width of this bus.
The wider the path, the lower the clock frequency will be, but more signals
and logic will be needed to support it.

Notes Regarding Loopback with 128b/130b


The spec makes a special point to describe the operation of Loopback Mode at
the higher rate. The basic rules can be summarized as follows:

- Loopback masters must send actual Ordered Sets or Data Blocks, but
  they aren't required to follow the normal protocol rules when changing
  from Data Blocks to Ordered Sets or vice versa. In other words, the SDS
  Ordered Set and EDS token are not required. Slaves must not expect or
  check for the presence of them.
- Masters must send SOS as usual, and must allow for the number of SKP
  Symbols in the loopback stream to be different because the receiver will
  be performing clock compensation.
- Loopback slaves are allowed to modify the SOS by adding or removing
  4 SKP Symbols at a time as they normally would for clock compensation,
  but the resulting SOS must still follow the proper format rules.
- Everything should be looped back exactly as it was sent except for SOS,
  which can change as just described, and both EIEOS and EIOS, which
  have defined purposes in loopback and should be avoided.
- If a slave is unable to acquire Block alignment, it won't be able to loop
  back all bits as received and is allowed to add or remove Symbols as
  needed to continue operation.


13 Physical Layer - Electrical

The Previous Chapter
The previous chapter describes the logical Physical Layer characteristics for
the third generation (Gen3) of PCIe. The major change includes the ability to
double the bandwidth relative to Gen2 speed without needing to double the
frequency (Link speed goes from 5 GT/s to 8 GT/s). This is accomplished by
eliminating 8b/10b encoding when in Gen3 mode. More robust signal
compensation is necessary at Gen3 speed. Making these changes is more complex
than might be expected.

This Chapter
This chapter describes the Physical Layer electrical interface to the Link,
including some low-level characteristics of the differential Transmitters and
Receivers. The need for signal equalization and the methods used to
accomplish it are also discussed here. This chapter combines electrical
transmitter and receiver characteristics for Gen1, Gen2 and Gen3 speeds.

The Next Chapter


The next chapter describes the operation of the Link Training and Status
State Machine (LTSSM) of the Physical Layer. The initialization process of
the Link is described from Power-On or Reset until the Link reaches the fully
operational L0 state during which normal packet traffic occurs. In addition,
the Link power management states L0s, L1, L2, L3 are discussed along with the
causes of transitions between the states. The Recovery state during which bit
lock, symbol lock or block lock can be re-established is described.


Backward Compatibility
The spec begins the Physical Layer Electrical section with the observation
that newer data rates need to be backward compatible with the older rates.
The following summary defines the requirements:

- Initial training is done at 2.5 GT/s for all devices.
- Changing to other rates requires negotiation between the Link partners to
  determine the peak common frequency.
- Root ports that support 8.0 GT/s are required to support both 2.5 and 5.0
  GT/s as well.
- Downstream devices must obviously support 2.5 GT/s, but all higher rates
  are optional. This means that an 8 GT/s device is not required to support
  5 GT/s.
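
The outcome of the rate negotiation can be pictured as choosing the highest rate both partners advertise. The sketch below models only that end result, not the actual exchange carried out during Link training; the function name and the use of plain sets of rates are assumptions for illustration.

```python
BASE_RATE = 2.5      # GT/s; every device must support it and training starts here


def highest_common_rate(rates_a, rates_b):
    """Return the highest data rate (GT/s) supported by both Link partners."""
    common = set(rates_a) & set(rates_b)
    if BASE_RATE not in common:
        raise ValueError("both partners must support 2.5 GT/s")
    return max(common)


root_port = {2.5, 5.0, 8.0}        # an 8.0 GT/s Root Port supports all three
endpoint = {2.5, 8.0}              # a downstream device may skip 5.0 GT/s
print(highest_common_rate(root_port, endpoint))
```

A device that supports only 2.5 GT/s simply stays at the rate used for initial training.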

In addition, the optional Reference clock (Refclk) remains the same
regardless of the data rate and does not require improved jitter
characteristics to support the higher rates.

In spite of these similarities, the spec does describe some changes for the
8.0 GT/s rate:

- ESD standards: Earlier PCIe versions required all signal and power pins to
  withstand a certain level of ESD (Electro-Static Discharge) and that's true
  for the 3.0 spec, too. The difference is that more JEDEC standards are
  listed and the spec notes that they apply to devices regardless of which
  rates they support.
- Rx powered-off Resistance: The new impedance values specified for 8.0
  GT/s (ZRX-HIGH-IMP-DC-POS and ZRX-HIGH-IMP-DC-NEG) will be applied to
  devices supporting 2.5 and 5.0 GT/s as well.
- Tx Equalization Tolerance: Relaxing the previous spec tolerance on the Tx
  de-emphasis values from +/- 0.5 dB to +/- 1.0 dB makes the -3.5 and -6.0 dB
  de-emphasis tolerance consistent across all three data rates.
- Tx Equalization during Tx Margining: The de-emphasis tolerance was
  already relaxed to +/- 1.0 dB for this case in the earlier specs. The
  accuracy for 8.0 GT/s is determined by the Tx coefficient granularity and
  the TxEQ tolerances for the Transmitter during normal operation.
- VTX-AC-CM and VRX-AC-CM: For 2.5 and 5.0 GT/s these are relaxed to 150
  mVPP for the Transmitter and 300 mVPP for the Receiver.


Component Interfaces
Components from different vendors must work reliably together, so a set of
parameters are specified that must be met for the interface. For 2.5 GT/s it
was implied, and for 5.0 GT/s it was explicitly stated, that the
characteristics of this interface are defined at the device pins. That allows
a component to be characterized independently, without requiring the use of
any other PCIe components. Other interfaces may be specified at a connector
or other location, but those are not covered in the base spec and would be
described in other form-factor specs like the PCI Express Card
Electromechanical Spec.

Physical Layer Electrical Overview


The electrical sub-block associated with each lane, as shown in Figure 13-1
on page 450, provides the physical interface to the Link and contains
differential Transmitters and Receivers. The Transmitter delivers outbound
Symbols on each Lane by converting the bit stream into two single-ended
electrical signals with opposite polarity. Receivers compare the two signals
and, when the difference is sufficiently positive or negative, generate a one
or zero internally to represent the intended serial bit stream to the rest of
the Physical Layer.


Figure 13-1: Electrical Sub-Block of the Physical Layer
(Diagram: two Physical Layers, each with Logical and Electrical Tx/Rx
sub-blocks, connected by a Link; each Tx+/Tx- pair drives the partner's
Rx+/Rx- pair through in-line CTX capacitors.)

When the Link is in the L0 full-on state, the drivers apply the differential
voltage associated with a logical 1 and logical 0 while maintaining the
correct DC common mode voltage. Receivers sense this voltage as the input
stream, but if it drops below a threshold value, it's understood to represent
the Electrical Idle Link condition. Electrical Idle is entered when the Link
is disabled, or when ASPM logic puts the Link into low-power Link states such
as L0s or L1 (see "Electrical Idle" on page 736 for more on this topic).

Devices must support the Transmitter equalization methods required for each
supported data rate so they can achieve adequate signal integrity.
De-emphasis is applied for 2.5 and 5.0 GT/s, and a more complex equalization
process is applied for 8.0 GT/s. These are described in more detail in
"Signal Compensation" on page 468, and "Recovery.Equalization" on page 587.

The drivers and Receivers are short-circuit tolerant, making PCIe add-in
cards suited for hot (powered-on) insertion and removal events in a hot-plug
environment. The Link connecting two components is AC-coupled by adding a
capacitor in-line, typically near the Transmitter side of the Link. This
serves to de-couple the DC part of the signal between the Link partners and
means they don't have to share a common power supply or ground return path,
as when the devices are connected over a cable. Figure 13-1 on page 450
illustrates the placement of this capacitor (CTX) on the Link.

High Speed Signaling


The high-speed signaling environment of PCIe is characterized by the drawing
in Figure 13-2 on page 451. This low-voltage differential signaling
environment is a common method used in many serial transports and one reason
is for the noise rejection it provides. Electrical noise that affects one
signal will also affect the other because they are on adjacent pins and their
traces are very close to each other. Since both signals are influenced, as
shown in Figure 13-3 on page 452, the difference between them doesn't change
much and is therefore not seen at the receiver.

A design goal for the 3.0 spec revision was that the 8.0 GT/s rate should
still work with existing standard FR4 circuit boards and connectors, and that
was achieved by changing the encoding scheme from the old 8b/10b to the new
128b/130b model to keep the frequency low. This goal will probably change
with the next speed step (Gen4).

Figure 13-2: Differential Transmitter/Receiver
(Diagram: one Lane in one direction. The Transmitter drives D+ and D- through
the CTX capacitors to the Receiver, with Detect Logic at the Transmitter and
ZTX and ZRX terminations at each end. Values shown:
VRX-CM = 0 V; VTX-CM = 0 - 3.6 V; ZTX = ZRX = 50 Ohms +/- 20%;
CTX = 75 - 265 nF for Gen1 & Gen2, 176 - 265 nF for Gen3.)


Figure 13-3: Differential Common-Mode Noise Rejection
(Diagram: with a reference voltage shift between Tx and Rx, or with transient
noise on both D+ and D-, the single-ended voltages change but the
differential voltage remains the same.)

Clock Requirements

General
For all data rates, both Transmitter and Receiver clocks must be accurate to
within +/- 300 ppm (parts per million) of the center frequency. In the worst
case, the Transmitter and Receiver could both be off by 300 ppm in opposite
directions, resulting in a maximum difference of 600 ppm. That worst-case
model translates to a gain or loss of 1 clock every 1666 clocks and that's
the difference that a Receiver's clock compensation logic must take into
account.

Devices are allowed to derive their clocks from an external source, and the
100 MHz Refclk is still optionally available for this purpose in the 3.0
spec. Using the Refclk permits both Link partners to readily maintain the 600
ppm accuracy even when Spread Spectrum Clocking is applied.


SSC (Spread Spectrum Clocking)


SSCisanoptionaltechniqueusedtomodulatetheclockfrequencyslowlyover
a prescribed range to spread the signals EMI (ElectroMagnetic Interference)
acrossarangeoffrequenciesratherthanallowingitalltobeconcentratedatthe
center frequency. Spreading the radiated energy helps a device or system to
pass government emissions standards by staying under a threshold value, as
illustratedinFigure134onpage454.Notethatthefrequencyofinterestforthe
signalisonlyhalftheclockratebecausetworisingclockedgesareneededto
createonecycleonthedata,asillustratedinFigure135onpage454.Forexam
ple,a2.5GT/srateusesabitclockof2.5GHz,resultinginafrequencyofinter
estonthetracesof1.25GHz.
The use of SSC is not required by the spec but, if it will be supported, the following
rules apply:
• The clock can be modulated by +0% to -0.5% from nominal (5000 ppm),
  referred to as "down spreading." A frequency modulation envelope is not
  specified, but a sawtooth wave pattern like the one shown in Figure 13-6 on
  page 455 yields good results. Note that there is a tradeoff with down
  spreading, because the average clock frequency will now be 0.25% lower
  than it would have been without SSC, resulting in a slight performance
  reduction.
• The modulation rate must be between 30 kHz and 33 kHz.
• The +/-300 ppm requirement for clock frequency accuracy still holds and
  therefore so does the maximum 600 ppm variation between Link partners.
  The spec states that most implementations will require both Link partners
  to use the same clock source, although it's not required. One way to do that
  would be for them to both use a modulated version of the Refclk to derive
  their own clocks (see "Common Refclk" on page 456).
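The 0.25% average slowdown mentioned in the first rule can be checked numerically. The following is a minimal sketch that assumes a symmetric triangular modulation profile (the spec does not mandate an envelope); the function names and the sampling approach are illustrative assumptions.

```python
def down_spread_offset_ppm(phase: float, depth_ppm: float = 5000.0) -> float:
    """Clock offset in ppm at a given phase (0.0 to 1.0) of one modulation
    period, for an assumed triangular down-spread profile: 0 at the start,
    -depth_ppm at the half period, and back to 0 at the end."""
    phase = phase % 1.0
    triangle = 2 * phase if phase < 0.5 else 2 * (1.0 - phase)
    return -depth_ppm * triangle


def average_offset_ppm(depth_ppm: float = 5000.0, samples: int = 10_000) -> float:
    """Average offset over one modulation period, sampled uniformly."""
    total = sum(down_spread_offset_ppm(i / samples, depth_ppm)
                for i in range(samples))
    return total / samples


# An average of about -2500 ppm is the 0.25% reduction described above.
print(round(average_offset_ppm()))
```

Because the offset never goes above nominal, the average necessarily sits at half the modulation depth for any symmetric profile, which is where the slight performance reduction comes from.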


PCIExpressTechnology

Figure 13-4: SSC Motivation
[Figure: emitted power (dB) versus frequency (GHz), comparing an ordinary signal with a spread-spectrum signal (range = 0.5%) against the EMI power threshold.]

Figure 13-5: Signal Rate Less Than Half the Clock Rate
[Figure: the signal on the wire compared with the Tx clock.]


Figure 13-6: SSC Modulation Example
[Figure: clock frequency versus time, moving between nominal and nominal - 0.5%, with the half modulation period and the full modulation period marked.]

Refclk Overview
Receivers must generate their own clocks to operate their internal logic, but
there are some options for generating the recovered clock for the incoming bit
stream. The details for them have developed with each succeeding version of
the spec and are based on the data rate.

2.5 GT/s
In the early spec versions using the 2.5 GT/s rate, information regarding the
optional Refclk was not included in the base spec but instead in the separate
CEM (Card ElectroMechanical) spec for PCIe. A number of parameters were
specified there and several general terms have been carried forward to the
newer versions of the spec. The Refclk was described as a 100 MHz differential
clock driving a 100 Ω differential load (+/-10%) with a trace length limited to 4
inches. SSC is allowed, as described in "SSC (Spread Spectrum Clocking)" on
page 453.

5.0 GT/s
When the 5.0 GT/s rate was developed, the spec writers chose to include the
Refclk information in the electrical section of the base spec and listed three
options for the clock architecture:


Common Refclk. The first architecture described is one in which both
Link partners make use of the same Refclk, as shown in Figure 13-7 on page
456. There are three straightforward advantages for this implementation:

• First, the jitter associated with the reference clock is the same for
  both Tx and Rx and is thus tracked and accounted for intrinsically.
• Second, the use of SSC will be simplest with this model because
  maintaining the 600 ppm separation between the Tx and Rx clocks
  is easy if both follow the same modulated reference.
• Third, the Refclk remains available during low-power Link states
  L0s and L1, and that allows the Receiver's CDR to maintain a semblance
  of the recovered clock even in the absence of a bit stream to
  supply the edges in the data. That, in turn, keeps the local PLLs
  from drifting as much as they otherwise would, resulting in a
  reduced recovery time back to L0 compared to the other clocking
  options.
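Figure 13-7: Shared Refclk Architecture
[Figure: one Lane in one direction. The Tx register and driver feed the Rx and its register; the Receiver uses a CDR, and a single Refclk feeds the PLLs at both ends of the Link.]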

Data Clocked Rx Architecture. In this clock architecture, the Receiver
doesn't use a reference clock at all, but simply recovers the Transmitter
clock from the data stream, as shown in Figure 13-8 on page 457. This
implementation is clearly the simplest of the three and would therefore
ordinarily be preferred. The spec doesn't prohibit the use of SSC in this
model, but doing so would bring up two issues. First, the Receiver CDR
must remain locked onto the input frequency as it modulates through a
much wider range (5600 ppm instead of the usual 600 ppm), and that could
require more complex logic. And second, the maximum clock frequency
separation of 600 ppm must still be maintained and it's less clear how that
would be done without a common reference.


Figure 13-8: Data Clocked Rx Architecture
[Figure: one Lane in one direction. The Refclk and PLL are present only at the Transmitter; the Receiver recovers its clock from the data stream with its CDR.]

Separate Refclks. Finally, it's also possible for the Link partners to use different
reference clocks, as shown in Figure 13-9 on page 457. However, this
implementation makes substantially tighter demands on the Refclks
because the jitter seen at the Receiver will be the RSS (Root Sum of Squares)
combination of them both, making the timing budget difficult. It also
becomes enormously more difficult to manage SSC in this model and that's
why the spec states that SSC must be turned off in this case. Overall, the
spec gives the impression that this is the least desirable alternative, and
states that it doesn't explicitly define the requirements for this architecture.
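To make the RSS combination concrete, here is a small sketch of how two independent jitter contributions add. The jitter values in the example are made-up illustrations, not spec limits, and the function name is an assumption.

```python
import math


def rss_jitter(*jitters_ps: float) -> float:
    """Root-sum-of-squares combination of independent random jitter
    contributions (all in the same unit, here picoseconds)."""
    return math.sqrt(sum(j * j for j in jitters_ps))


# Two hypothetical Refclks contributing 3 ps and 4 ps RMS respectively.
print(rss_jitter(3.0, 4.0))  # prints 5.0
```

The combined value is always larger than either contribution alone, which is why each Refclk in this architecture has to be tighter than it would need to be in the common Refclk case.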

Figure 13-9: Separate Refclk Architecture
[Figure: one Lane in one direction. Refclk 1 feeds the Transmitter PLL, while a separate Refclk 2 feeds the Receiver PLL and CDR.]

8.0 GT/s
The same three clock architectures are described in the spec for this data rate,
too. One difference is that two types of CDR are defined now: a 1st-order CDR
for the shared Refclk architecture, and a 2nd-order CDR for the data clocked
architecture. This just reflects the fact that, as it was for the lower data rates, the
CDR for the data clocked architecture will need to be more sophisticated to be
able to stay locked when the reference varies over a wide range for SSC.


Transmitter (Tx) Specs

Measuring Tx Signals
The spec notes that the methods for measuring the Tx output are limited at the
higher frequencies. At 2.5 GT/s it's possible to put a test probe very near the pins
of the DUT (Device Under Test), but for the higher rates it's necessary to use a
"breakout channel" with SMA (SubMiniature version A) microwave-type coaxial
connectors, as illustrated at TP1 (Test Point 1), TP2, and TP3 in Figure 13-10
on page 458. Note that it's necessary to have a low-jitter clock source to the
device under test, so that jitter seen at the output is only introduced by the
device itself. The spec also mentions that it's important during testing for the
device to have as many of its Lanes and other outputs in use at the same time as
possible, so as to best simulate a real system.

Since the breakout channel introduces some effects to the signal, for 8.0 GT/s it's
necessary to be able to measure those effects and remove ("de-embed") them from
the signal being tested. One way to accomplish this is for the test board to supply
another signal path that is very similar to the one used for the device pins.
Characterizing this "replica channel" with a known signal gives the needed
information about the channel, allowing its effects to be de-embedded from the
DUT signals so the signal at the component pins can be recovered.

Figure 13-10: Test Circuit Measurement Channels
[Figure: a low-jitter clock source drives the DUT; the breakout channel runs from the DUT to TP1, and a separate replica channel runs between TP2 and TP3.]


Tx Impedance Requirements
For best accuracy, the characteristic differential impedance of the Breakout
Channel should be 100 Ω differential within 10%, with a single-ended impedance
of 50 Ω. To match this environment, Transmitters have a differential low-impedance
value during signaling between 80 and 120 Ω at 2.5 GT/s, and no
more than 120 Ω at 5.0 and 8.0 GT/s. For receivers, the single-ended impedance
is 40-60 Ω at 2.5 or 5.0 GT/s, but for 8.0 GT/s no specific value is given. Instead,
it's simply noted that the single-ended receiver impedance must be 50 Ω within
20% by the time the Detect LTSSM state is entered so that the detect circuit will
sense the Receiver correctly.

Transmitters must also meet the return loss parameters RLTX-DIFF and RLTX-CM
anytime differential signals are sent. As a very brief introduction to this terminology,
"return loss" is a measure of energy transmitted through or reflected
back from a transmission path. Return loss is one of several "Scattering parameters"
(S-parameters) that are used to analyze high-frequency signal environments.
When frequencies are low, a lumped-element description is sufficient,
but when they become high enough that the wavelength approaches the size of
the circuit, a distributed model is needed and that's what S-parameters are used
to represent. The spec describes a number of these to characterize a transmission
path but the details of this high-frequency analysis are really beyond the
scope of this book.
When a signal is not being driven, as would be the case in the low-power Link
states, the Transmitter may go into a high-impedance condition to reduce the
power drain. For that case, it only has to meet the ITX-SHORT value and the differential
impedance is not defined.

ESD and Short Circuit Requirements


All signals and power pins must withstand a 2000 V ESD (ElectroStatic Discharge)
using the Human Body Model and 500 V using the Charged Device
Model. For more details on these models or ESD, see the JEDEC JESD22-A114-A
spec.
The ESD requirement not only protects against electrostatic damage, but facilitates
support of surprise hot insertion and removal events (adding or removing
an add-in card while the power is on). That goal also motivates the requirement
that Transmitters and Receivers be able to withstand sustained short-circuit currents
of ITX-SHORT (see Table 13-5 on page 498).


Receiver Detection
General
The Detect block in the Transmitter shown in Figure 13-11 on page 461 is used
to check whether a Receiver is present at the other end of the Link after coming
out of reset. This step is a little unusual in the serial transport world because it's
easy enough to send packets to the Link partner and test its presence by
whether or not it responds. The motivation for this approach in PCIe, however,
is to provide an automatic hardware assist in a test environment. If the proper
load is detected, but the Link partner refuses to send TS1s and participate in
Link Training, the component will assume that it must be in a test environment
and will begin sending the Compliance Pattern to facilitate testing. Since a Link
will always start operation at 2.5 GT/s after a reset or power-up event, Detect is
only used for the 2.5 GT/s rate. That's why the Receiver's single-ended DC
impedance is specified for that rate (ZRX-DC = 40 to 60 Ω), and why the Detect
logic must be included in every design regardless of its intended operating
speed.

Detection is accomplished by setting the Transmitter's DC common mode voltage
to one value and then changing it to another. Knowing the expected charge
time when a Receiver is present, the logic compares the measured time against
that. If a Receiver is attached, the charge time (RC time constant) is relatively
long due to the Receiver's termination. Otherwise, the charge time is much
shorter.

Detecting Receiver Presence


1. After reset or power-up, Transmitters drive a stable voltage on the D+ and
   D- terminals.
2. Transmitters then change the common mode voltage in a positive direction
   by no more than the VTX-RCV-DETECT amount of 600 mV specified for all
   three data rates.
3. Detect logic measures the charge time:
   - Receiver is absent if the charge time is short.
   - Receiver is present if the charge time is long (dominated by the series
     capacitor and Receiver termination).
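The decision in step 3 can be sketched as a comparison of RC time constants. This is a deliberately simplified model, not the spec's detection circuit: the component values, the pad capacitance, and the threshold below are all illustrative assumptions chosen only to show how far apart the two cases are.

```python
def rc_time_constant(resistance_ohms: float, capacitance_farads: float) -> float:
    """Time constant (seconds) of a simple series RC circuit."""
    return resistance_ohms * capacitance_farads


def receiver_present(measured_tau_s: float, threshold_s: float) -> bool:
    """Detect decision: a long charge time implies a Receiver is attached."""
    return measured_tau_s > threshold_s


# Illustrative values only (assumptions, not numbers taken from the spec):
Z_TX = 50.0        # Transmitter termination, ohms
Z_RX = 50.0        # Receiver termination, ohms
C_TX = 100e-9      # series AC-coupling capacitor, farads
C_PAD = 2e-12      # parasitic pad/trace capacitance with no Receiver, farads
THRESHOLD = 1e-7   # hypothetical decision threshold, seconds

tau_with_rx = rc_time_constant(Z_TX + Z_RX, C_TX)   # about 10 microseconds
tau_without_rx = rc_time_constant(Z_TX, C_PAD)      # about 100 picoseconds

print(receiver_present(tau_with_rx, THRESHOLD))      # prints True
print(receiver_present(tau_without_rx, THRESHOLD))   # prints False
```

With values like these the two cases differ by several orders of magnitude, so the threshold does not need to be precise.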

The spec mentions a possible problem here: the proper load may appear on one
of the differential signals but not the other, and if detection doesn't check both it
could misinterpret the situation. The simple way to avoid that would be to perform
the Detect operation on both D+ and D-. The 3.0 spec does not require this,
but mentions that future spec revisions may. Therefore, it would be wise to
include this functionality in new designs.

Figure 13-11: Receiver Detection Mechanism
[Figure: two cases for one Lane in one direction. Receiver present: the Transmitter's Detect logic sees a long charge time through the CTX capacitors and the Receiver terminations ZRX (VRX-CM = 0 V), with an optional "No Spec" circuit at the Receiver. Receiver absent: only CTX and ZTX are seen, so the charge time is short.]


Transmitter Voltages
Differential signaling (as opposed to the single-ended signaling employed in
PCI and PCI-X) is ideal for high-frequency signaling. Some advantages of differential
signaling are:

• Receivers look at the difference between the signals, so the voltage swing
  for each one individually can be smaller, allowing higher frequencies without
  exceeding the power budget.
• EMI is reduced because of the noise cancellation that results from having
  the two signals side by side, using opposite-polarity voltages.
• Noise immunity is very good, because noise that affects one signal will also
  affect the other in the same way, with the result that the Receiver doesn't
  notice the change (refer to Figure 13-3 on page 452).

DC Common Mode Voltage


After the Detect state of Link training, the Transmitter DC common mode voltage
VTX-DC-CM (see Table 13-3 on page 489) must remain at the same voltage.
The common mode voltage is turned off only in the L2 or L3 low-power Link
states, in which main power to the device is removed. A designer can choose
any common mode voltage in the range from 0 to 3.6 V.

Full-Swing Differential Voltage


The Transmitter output consists of two signals, D+ and D-, that are identical but
use opposite polarities. A logical one is indicated when the D+ signal is high
and the D- signal low, while a logical zero is represented by driving the D+ signal
low and the D- signal high, as shown in Figure 13-13 on page 464.

The differential peak-to-peak voltage driven by the Transmitter, VTX-DIFFp-p (see
Table 13-3 on page 489), is between 800 mV and 1200 mV (1300 mV for 8.0 GT/s).

• Logical 1 is signaled with a positive differential voltage.
• Logical 0 is signaled with a negative differential voltage.

During Electrical Idle the Transmitter holds the differential peak voltage
VTX-IDLE-DIFFp (see Table 13-3 on page 489) very near zero (0-20 mV). During this
time the Transmitter may be in either a low- or high-impedance state.

The Receiver senses a logical one or zero, as well as Electrical Idle, by evaluating
the voltage on the Link. The signal loss expected at high frequency means the
Receiver must be able to sense an attenuated version of the signal, defined as
VRX-DIFFp-p (see Table 13-5 on page 498).

Figure 13-12: Differential Signaling
[Figure: D+ and D- each swing between V+ and V- around Vcm. The Receiver subtracts the D- value from the D+ value to arrive at the differential voltage.]

Differential Notation
A differential signal voltage is defined by taking the difference in the voltage on
the two conductors, D+ and D-. The voltage with respect to ground on each conductor
is VD+ and VD-. The differential voltage is given by VDIFF = VD+ - VD-.
The Common Mode voltage, VCM, is defined as the voltage around which the
signal is switching, which is the mean value given by VCM = (VD+ + VD-) / 2.

The spec uses two terms when discussing differential voltages and confusion
sometimes arises as a result. As illustrated in Figure 13-13 on page 464, the Peak
value is the maximum voltage difference between the signals, while the Peak-to-Peak
voltage is that value plus the maximum in the opposite direction. For a
symmetric signal, the Peak-to-Peak value is simply twice the Peak value.

1. Differential Peak Voltage => VDIFFp = (max |VD+ - VD-|)
2. Differential Peak-to-Peak Voltage => VDIFFp-p = 2 * (max |VD+ - VD-|)

As an example, assume VCM = 0 V. If the D+ value is 300 mV and the D-
value is -300 mV, then VDIFFp would be 300 - (-300) = 600 mV for a logical one.
Similarly, it would be (-300) - (+300) = -600 mV for a logical zero. The VDIFFp-p
for this symmetric case would be 1200 mV. The allowed VDIFFp-p range for 2.5
GT/s and 5.0 GT/s is 800 to 1200 mV, while for 8.0 GT/s it is 800 to 1300 mV
before equalization is applied.
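The definitions above translate directly into a few lines of code. The sketch below reproduces the worked example in millivolts; the function names are illustrative and not taken from the spec.

```python
def v_diff(v_dp: float, v_dn: float) -> float:
    """Differential voltage: VDIFF = VD+ - VD-."""
    return v_dp - v_dn


def v_cm(v_dp: float, v_dn: float) -> float:
    """Common mode voltage: VCM = (VD+ + VD-) / 2."""
    return (v_dp + v_dn) / 2


def v_diff_peak(samples) -> float:
    """VDIFFp = max |VD+ - VD-| over a list of (VD+, VD-) pairs."""
    return max(abs(v_diff(dp, dn)) for dp, dn in samples)


def v_diff_peak_to_peak(samples) -> float:
    """VDIFFp-p = 2 * max |VD+ - VD-| (symmetric signal assumed)."""
    return 2 * v_diff_peak(samples)


# Worked example from the text, in mV: a logical one, then a logical zero.
samples_mv = [(300, -300), (-300, 300)]
print(v_diff(300, -300))                 # prints 600
print(v_diff(-300, 300))                 # prints -600
print(v_diff_peak_to_peak(samples_mv))   # prints 1200
```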


Figure 13-13: Differential Peak-to-Peak (VDIFFp-p) and Peak (VDIFFp) Voltages
[Figure: D+ and D- switching around VCMp above 0 V, showing VDIFFp for a logical 1 and VDIFFp for a logical 0.]
VDIFFp-p = 2 * max |VD+ - VD-| = VDIFFp (Logical 1) + VDIFFp (Logical 0)

Reduced-Swing Differential Voltage


The full-swing voltage is needed for channels that are long or otherwise lossy,
and Transmitters are required to support it. But when the signal environment is
short and low-loss, a high voltage is unnecessary and a power savings can be
realized by reducing it. With this in mind, the spec for 2.5 GT/s and 5.0 GT/s
defines another, reduced-swing voltage for power-sensitive systems where a
short channel is being used. In this mode the voltage is reduced to about half of
its full-swing range. Support for this operation is optional, and the means for
selecting it is not defined and will be implementation specific.

The same is true for 8.0 GT/s signaling, except that in this case it's achieved by
using a limited range of coefficients. For example, the maximum boost for the
reduced-swing case is limited to 3.5 dB. As with the lower data rates, support
for this voltage model is optional, but now the means of achieving it is straightforward:
just set the Tx coefficient values to make it happen.

It should be noted that the Receiver voltage levels are independent of the transmitter,
which is intuitively what we'd expect: the received signal always needs
to meet the normal requirements and so the Transmitter and channel must be
designed to guarantee that it will.

Equalized Voltage
In the interest of maintaining a good flow in this section, this large topic is covered
separately in the section called "Signal Compensation" on page 468.


Voltage Margining
The concept of margining is that Transmitter characteristics like output voltage
can be adjusted across a wide range of values during testing to determine how
well it can handle a signaling environment. The 2.5 GT/s rate didn't include this
capability, but voltage margining was added with the 5.0 GT/s rate and must be
implemented by Transmitters that use that rate or higher. Other parameters,
like de-emphasis or jitter, can optionally be margined as well. The granularity
for the margining adjustments must be controllable on a Link basis and may be
controllable on a Lane basis. This control is accomplished by means of the Link
Control 2 register in the PCIe Capability register block. The transmit margin
field, shown in Figure 13-14 on page 465, contains 3 bits and can thus represent
8 levels. Their values are not defined, and not all of them need to be implemented.
The default value is all zeros, which represents the normal operating
range.

It's important to note that this field is intended only for debug and compliance
testing, and software is only allowed to modify it during those times. At all
other times, the value is required to be set to the default of all
zeros.
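As an illustration of how software might read or update the 3-bit Transmit Margin field, the sketch below operates on a 16-bit Link Control 2 register value. The bit position (bits 9:7) follows the register layout in Figure 13-14; the helper names are hypothetical, and how the register value is actually read from or written to configuration space is outside the scope of this sketch.

```python
TRANSMIT_MARGIN_SHIFT = 7      # field occupies bits 9:7
TRANSMIT_MARGIN_MASK = 0b111   # 3 bits, so 8 possible levels


def get_transmit_margin(link_control_2: int) -> int:
    """Extract the Transmit Margin field from a Link Control 2 value."""
    return (link_control_2 >> TRANSMIT_MARGIN_SHIFT) & TRANSMIT_MARGIN_MASK


def set_transmit_margin(link_control_2: int, margin: int) -> int:
    """Return a new 16-bit register value with the field replaced."""
    if not 0 <= margin <= TRANSMIT_MARGIN_MASK:
        raise ValueError("margin must be in the range 0..7")
    cleared = link_control_2 & ~(TRANSMIT_MARGIN_MASK << TRANSMIT_MARGIN_SHIFT)
    return (cleared | (margin << TRANSMIT_MARGIN_SHIFT)) & 0xFFFF


value = set_transmit_margin(0x0003, 5)    # other fields are left untouched
print(hex(value))                          # prints 0x283
print(get_transmit_margin(value))          # prints 5
```

Outside of debug and compliance testing the field would be written back to zero, the default that represents the normal operating range.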

Figure 13-14: Transmit Margin Field in Link Control 2 Register

Link Control 2 Register
Bits 15:12  Compliance Preset/De-emphasis
Bit 11      Compliance SOS
Bit 10      Enter Modified Compliance
Bits 9:7    Transmit Margin
Bit 6       Selectable De-emphasis
Bit 5       Hardware Autonomous Speed Disable
Bit 4       Enter Compliance
Bits 3:0    Target Link Speed


For 8.0 GT/s, transmitters are required to implement voltage margining and use
the same field in the Link Control 2 register, but equalization adds some constraints
to the options because it can't require finer coefficient or preset resolution
than the 1/24 resolution defined for normal operation.

During Tx margining the equalization tolerance for 2.5 GT/s and 5.0 GT/s is
relaxed from +/- 0.5 dB to +/- 1.0 dB. For the 8.0 GT/s rate, the tolerance is
defined by the coefficient granularity and the normal equalizer tolerances specified
for the transmitter.

Receiver (Rx) Specs

Receiver Impedance
Receivers are required to meet the RLRX-DIFF and RLRX-CM (see Table 13-5 on
page 498) parameters unless the device is powered down, as it would be, for
example, in the L2 and L3 power states or during a Fundamental Reset. In those
cases, a Receiver goes to the high-impedance state and must meet the
ZRX-HIGH-IMP-DC-POS and ZRX-HIGH-IMP-DC-NEG parameters.
(See Table 13-5 on page 498.)

Receiver DC Common Mode Voltage


The Receiver's DC common mode voltage is specified to be 0 V for all data rates,
and that's represented in Figure 13-15 on page 467 by showing the signal terminations
connected to ground. The CTX in-line capacitor permits this voltage to
be something different at the Transmitter, which is specified to be in the range
from 0 to 3.6 V. That's not as interesting when the Transmitter and Receiver are in
the same enclosure and have the same power supply, but if they're connected
over a cable and reside in different machines with different power supplies it
becomes more important. In that case it's difficult to avoid reference voltage differences
between the machines and, since the signal voltages are already small,
such a difference could make the signal difficult to recognize at the Receiver.
The location of this capacitor must be near the Transmitter pins when a connector
of some kind will be used but, if there's no connector, it can be located at any
convenient place on the transmission line. Although it could be integrated into a
device, it's expected that CTX will be external because it would be too big to integrate.


The drawing in Figure 13-15 on page 467 also shows an optional set of resistors
at the Receiver, labeled as "No Spec" because they are not mentioned in the
spec. The story here is that Receiver designers dislike using a common-mode
voltage of zero for the simple reason that it usually requires them to implement
two reference voltages, one above zero and one below it. A preferred implementation
offsets the signal entirely above or below zero, so that only one reference
voltage is needed. The circuit shown within the dotted line accomplishes
this by adding a small-value in-line capacitor to decouple the DC component of
the signal on the wire from that of the Receiver itself. Then, a resistor ladder
serves to offset the Receiver's common-mode voltage in one direction or the
other to accomplish the goal.

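Figure 13-15: Receiver DC Common Mode Voltage Adjustment
[Figure: one Lane in one direction, with Detect logic, CTX capacitors and ZTX terminations at the Transmitter and ZRX terminations at the Receiver (VRX-CM = 0 V). The optional "No Spec" circuit at the Receiver uses small in-line capacitors and a ladder of "Big" resistors; the ratio of the resistors sets the DC common mode voltage.]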


Transmission Loss
The Transmitter drives a minimum differential peak-to-peak voltage
VTX-DIFFp-p of 800 mV. The Receiver sensitivity is designed for a minimum differential
peak-to-peak voltage (VRX-DIFFp-p) of 175 mV. This translates to a
13.2 dB loss budget that a Link is designed for. Although a board designer can
determine the attenuation loss budget of a Link plotted against various frequencies,
the Transmitter and Receiver eye diagram measurements are the ultimate
determinant of loss budget for a Link. Eye diagrams are described in "Eye Diagram"
on page 485. A Transmitter that drives up to the maximum allowed differential
peak-to-peak voltage of 1200 mV can compensate for a lossy Link that
has worst-case attenuation characteristics.
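The 13.2 dB figure comes from expressing the ratio of the two voltages in decibels. A minimal sketch of that calculation follows; the function name is an assumption.

```python
import math


def loss_budget_db(v_tx_mv: float, v_rx_mv: float) -> float:
    """Voltage loss budget in dB between the transmitted and the minimum
    detectable differential peak-to-peak voltages."""
    return 20 * math.log10(v_tx_mv / v_rx_mv)


print(round(loss_budget_db(800, 175), 1))  # prints 13.2
```

The same function shows how much extra margin a Transmitter driving the 1200 mV maximum has over one driving the 800 mV minimum.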

AC Coupling
PCI Express requires in-line AC-coupling capacitors be placed on each Lane,
usually near the Transmitter. The capacitors can be integrated onto the system
board, or integrated into the device itself, although the large size they would
need makes that unlikely. An add-in card with a PCI Express device on it must
place the capacitors on the card close to the Transmitter or integrate the capacitors
into the PCIe silicon. These capacitors provide DC isolation between two
devices on both ends of a Link, thus simplifying device design by allowing
devices to use independent power and ground planes.

Signal Compensation

De-emphasis Associated with Gen1 and Gen2 PCIe


For 2.5 GT/s and 5.0 GT/s transmission, PCIe mandates the use of a fairly simple
form of Transmitter equalization called de-emphasis to reduce the effects of
signal distortion along the Link transmission line. This distortion problem is
always present but gets worse with increased frequency and lossy transmission
lines.

The Problem
As data rates get higher, the Unit Interval (UI, or bit time) becomes smaller, with
the result that it's increasingly difficult to avoid having the value in one bit time
affect the value in another bit time. The channel always resists changes to the
voltage level. The faster we attempt to switch voltage, the more pronounced
that effect becomes. However, when a signal has been held at the same voltage
for several bit times, as when sending several bits in a row of the same polarity,
the channel has more time to approach the target voltage. The resulting higher
voltage makes it difficult to change to the opposite value within the required
time when the polarity does change. This problem of previous bits affecting
subsequent bits is referred to as ISI (inter-symbol interference).

How Does De-Emphasis Help?


De-emphasis reduces the voltage for repeated bits in a bit stream. Although it
sounds counter-intuitive at first because this reduces the signal swing and thus
the energy that reaches the Receiver, reducing the Transmitter voltage for these
cases can substantially improve signal quality. Figure 13-16 on page 469 illustrates
how this works by showing a Transmitter output of "1000010000", where
the repeated bits of the same polarity have been de-emphasized. De-emphasis
can be thought of as a two-tap Tx equalizer, and some rules related to it are:
• When the signal changes to the opposite polarity of the preceding bit it's not
  de-emphasized, but uses the peak-to-peak differential voltage as specified
  by VTX-DIFFp-p (see Table 13-3 on page 489).
• The first bit of a series of same-polarity bits is not de-emphasized.
• Only subsequent bits of the same polarity after the first bit are de-emphasized.
• The de-emphasized voltage is reduced by 3.5 dB from normal for 2.5 GT/s,
  which translates to about a one-third reduction in voltage.
• The Beacon signal is de-emphasized, too, but uses slightly different rules
  (see "Beacon Signaling" on page 483).
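The rules above can be sketched as a short function that assigns a differential voltage to each bit of a stream. This is only an illustration of the rules, not a model of a real driver: the 600 mV full-swing peak value is an example, the function name is hypothetical, and treating the very first bit as a transition is an assumption made for the sketch.

```python
def deemphasis_levels(bits, full_mv: float = 600.0, deemphasis_db: float = 3.5):
    """Return the differential voltage (mV) driven for each bit.

    A bit that differs from the preceding bit uses the full voltage;
    a bit that repeats the preceding bit is reduced by deemphasis_db.
    The first bit has no predecessor and is assumed to be a transition."""
    ratio = 10 ** (-deemphasis_db / 20)   # dB to voltage ratio
    levels = []
    previous = None
    for bit in bits:
        amplitude = full_mv if bit != previous else full_mv * ratio
        levels.append(amplitude if bit == 1 else -amplitude)
        previous = bit
    return levels


pattern = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
for bit, level in zip(pattern, deemphasis_levels(pattern)):
    print(bit, round(level))
```

For this pattern only the second through fourth zeros in each run are reduced, to roughly two thirds of the full value; every bit that follows a polarity change is sent at full voltage.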

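Figure 13-16: Transmission with De-emphasis
[Figure: D+ and D- waveforms for the bit pattern 1 0 0 0 0 1 0 0 0 0 around VTX-CMp = 1 V, with 1 UI = 400 ps. Full-swing bits reach 1.3 V and 0.7 V (VTX-DIFFp = 600 mV); de-emphasized bits, marked as 3.5 dB lower, reach 1.225 V and 0.775 V (de-emphasized VTX-DIFFp = 450 mV).]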


Solution for 2.5 GT/s


For 2.5 GT/s, each subsequent bit transmitted after the first bit of the same
polarity must be de-emphasized by 3.5 dB to accommodate this worst-case loss
budget. Of course, for low-loss environments this is less important and for a
very short path it can even make the received signal look worse. After all, de-emphasis
is essentially distorting the transmitted signal in the opposite way of
the distortion that is expected during transmission so as to cancel it out. If there
turns out to be little or no distortion, then de-emphasis will make the signal
look worse. The spec doesn't describe any way to test the signal environment or
adjust the de-emphasis level, but doesn't prohibit a designer from developing
an implementation-specific method of doing so.

An example of the benefit of de-emphasis is shown in Figure 13-17 on page 471,
which is a scope capture converted into a drawing for clarity. The captures were
taken from a device driving a long path and using a bit stream with several
repeated bits to show the signal distortion. The trace at the top shows that the
bit pattern for one side of the differential pair (also called a single-ended signal)
has 2 bits of one polarity followed by 5 bits of the opposite polarity. Five consecutive
bits is the worst case for 8b/10b, and this particular pattern only appears in
a few characters like the COM character. The channel resists high-speed
changes but will continue to charge up if the driver keeps trying to reach a
higher voltage and that can be seen in this example. When the bits aren't
repeated there isn't time for the voltage to go as far, but repeated bits give more
time for the change. The problem this creates is seen in the bit following the 5th
in a row (highlighted in the oval), which fails to reach a good signal value during
its UI because the voltage difference was too large to overcome in that short
time. The difference between the value it reaches and the value it should have
reached is shown by the line marking the level reached by other bits that aren't
experiencing as much ISI.

In the lower half of the illustration, a de-emphasized version of the signal is captured
and compared to the original. Here we can see that reducing the voltage
for repeated bits prevents the voltage from charging up as much and results in a
cleaner signal because the bits that follow are not influenced as much by the
previous bits. For both the 2 consecutive bits and then the 5 consecutive bits, the
over-charging problem is reduced, which improves the timing jitter as well as
the voltage levels. Consequently, the troublesome bit looks much better with de-emphasis
turned on and the received signal approaches the normal voltage
swing in that bit time.


Figure 13-17: Benefit of De-emphasis at the Receiver
[Figure: single-ended scope traces of a pattern containing 5 bits in a row, shown without de-emphasis (top) and with de-emphasis (bottom).]

In Figure 13-18 on page 472 both positive and negative versions of the differential
signal are shown so as to illustrate the resulting eye opening. The improved
signal quality from de-emphasis is clear because the eye opening at the troublesome
time in the lower trace is so much larger than the one without de-emphasis
in the upper trace.


Figure 13-18: Benefit of De-emphasis at Receiver Shown With Differential Signals
[Figure: both sides of the differential signal shown without de-emphasis (top) and with de-emphasis (bottom).]

Solution for 5.0 GT/s


As one might expect, increasing data rates exacerbates the problem of ISI
because the bit times get progressively smaller, and more aggressive equalization
techniques are needed. The change for 5.0 GT/s is incremental, and consists
of providing three choices regarding the amount of de-emphasis to be applied.

1. When running at 2.5 GT/s speed, 3.5 dB de-emphasis is required.
2. When running at 5.0 GT/s speed, 6.0 dB de-emphasis is recommended,
   while the use of 3.5 dB is optional. The 6.0 dB de-emphasis level is intended to
   compensate for the greater signal attenuation at higher frequency. As Figure
   13-19 on page 473 suggests, a 3.5 dB reduction represents a 33% reduction
   in voltage, while a 6 dB reduction represents a 50% reduction. To avoid
   a possible confusion, note that the dB measures of power and voltage are
   different by a factor of two. A 3 dB reduction represents a 50% change in
   power but only about a 29% change in voltage.


Figure 13-19: De-emphasis Options for 5.0 GT/s
[Figure: waveforms comparing 3.5 dB de-emphasis at 2.5 GT/s with 6.0 dB de-emphasis at 5.0 GT/s.]

3. Normally, a Transmitter operates in the full-swing mode and can use the
   entire available voltage range to help overcome signal attenuation. The voltage
   needs to start out at a higher value to compensate for the loss, as shown
   in the top half of Figure 13-20 on page 474. However, for 5.0 GT/s another
   option is provided called reduced-swing mode. This is intended to support
   short, low-loss signaling environments, as shown in the lower half of Figure
   13-20 on page 474, and reduces the voltage swing by about half to save
   power. This mode also provides the third de-emphasis option by turning off
   de-emphasis entirely, which makes sense because, as mentioned earlier, the
   signal distortion it creates would not be reduced by loss in the path and the
   resulting signal at the Receiver would look worse.
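The relationship between a dB value and the corresponding voltage or power ratio is easy to check numerically. The sketch below uses the standard definitions (20 log10 for voltage, 10 log10 for power); the function names are illustrative.

```python
def db_to_voltage_ratio(reduction_db: float) -> float:
    """Fraction of the original voltage remaining after a dB reduction."""
    return 10 ** (-reduction_db / 20)


def db_to_power_ratio(reduction_db: float) -> float:
    """Fraction of the original power remaining after a dB reduction."""
    return 10 ** (-reduction_db / 10)


for db in (3.0, 3.5, 6.0):
    print(db,
          round(db_to_voltage_ratio(db), 3),   # remaining voltage fraction
          round(db_to_power_ratio(db), 3))     # remaining power fraction
```

A 3.5 dB reduction leaves about 0.668 of the voltage and a 6 dB reduction leaves about 0.501, while 3 dB leaves about 0.501 of the power but about 0.708 of the voltage.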


Figure 13-20: Reduced-Swing Option for 5.0 GT/s with No De-emphasis
[Figure: two cases showing the Tx and Rx waveforms. Full swing (high transmission amplitude): Transmitter and Receiver connected by a long path. Reduced swing (low transmission amplitude): Transmitter and Receiver connected by a short path.]

Solution for 8.0 GT/s - Transmitter Equalization


When going to the 8.0 GT/s data rate, the signal conditioning model changes
significantly. Transmitter equalization becomes more complex and a handshake
training procedure is used to adapt to the actual signaling environment rather
than making assumptions about what will be needed. To learn more about the
process of evaluating the Link, refer to the section called "Recovery.Equalization"
on page 587. Basically, that process allows a Receiver to request that the
Link partner's Transmitter use a certain combination of coefficients, and then the
Receiver tests how well the received signal looks and possibly proposes others if
the result isn't good enough.

Sometimes students ask whether this model is really sufficient to achieve good
error rates, since evaluating a signal across all the possible situations requires
days of testing in the lab to achieve a BER of 10^-15 or better. The answer to this
has two parts. First, even with the handshake process, the coefficients will be an
approximation that worked well when the training was done but may or may
not work as well under other conditions. Extrapolation from a small sample size
is a necessary part of arriving at working values quickly and it works reasonably
well. Second, at the 8.0 GT/s transfer rate it's only necessary to
achieve a minimum BER of 10^-12, and that doesn't take as long to verify as a
BER of 10^-15 would.

Three-Tap Tx Equalizer Required


To accomplish better wave shaping at the Transmitter, the spec requires the use
of a 3-tap FIR (Finite Impulse Response) filter, meaning a filter with 3 bit-time-spaced
inputs. A conceptual drawing of this is shown in Figure 13-21 on page
475, where it can be seen that the output voltage is the sum of three versions of
the input: the original input, a version delayed by one bit time and a third
delayed by another bit time. This type of FIR filter is often used in other SERDES
applications above 6.0 Gb/s, and it's helpful for PCIe because it compensates
for the fact that the channel spreads the signal across a longer time.
Another way of thinking about it is that a given bit is affected by both the bit
value that preceded it and the bit that comes after it.

Figure 13-21: 3-Tap Tx Equalizer
[Figure: the input passes through two 1 UI delays; Tap (-1), Tap (0) and Tap (+1) are weighted by C-1, C0 and C+1 respectively and summed (Σ) to form the output.]

With this in mind, the three inputs can be described by their timing position as
"pre-cursor" for C-1, "cursor" for C0, and "post-cursor" for C+1, which combine
to create an output based on the upcoming input, the current value, and the previous
value. Adjusting the coefficients for the taps allows the output wave to be
optimally shaped. This effect is illustrated by the pulse-response waveform
shown in Figure 13-22 on page 476. Looking at a single pulse allows the adjustment
to the signal to be more easily recognized.


The filter shapes the output according to the coefficient values (or tap
weights) assigned to each tap. The sum of the absolute values of the three
coefficients is defined to be unity, so that only two of them need to be
given for the third one to be calculated. Consequently, only C-1 and C+1 are
given in the spec, and C0 is always implied and is always positive.
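
The unity rule can be written as a one-line calculation. This is a minimal
sketch with an illustrative function name; it assumes the coefficients are
passed as signed fractions of the full swing.

```python
def cursor_coefficient(c_pre: float, c_post: float) -> float:
    """Return the implied cursor coefficient C0 for given C-1 and C+1.

    Uses the rule |C-1| + C0 + |C+1| = 1 with C0 positive.
    """
    c0 = 1.0 - abs(c_pre) - abs(c_post)
    if c0 <= 0.0:
        raise ValueError("pre- and post-cursor taps leave no positive cursor")
    return c0


if __name__ == "__main__":
    # Example values: C-1 = -0.100 and C+1 = -0.200
    print(round(cursor_coefficient(-0.100, -0.200), 3))
```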

Figure 13-22: Tx 3-Tap Equalizer Shaping of an Output Pulse

[Figure: voltage-versus-time pulse response, in UI steps, of the unmodified
signal and of the equalized signal, marking the cursor, the pre-cursor
reduction, and the post-cursor reduction.]

Pre-shoot, De-emphasis, and Boost


The effect of the coefficient values is to adjust the output voltage to
create up to four different voltage levels to accommodate different signaling
environments, as shown in Figure 13-23 on page 477. This waveform was taken
from a test device and shows a representative example, but the voltage levels
depend on whether a Transmitter implements preshoot or de-emphasis or both.

The waveform shows the four general voltages to be transmitted, which are:
maximum-height (Vd), normal (Va), de-emphasized (Vb), and preshoot (Vc).


This scheme is backward-compatible with the 2.5 and 5.0 GT/s model that only
uses de-emphasis, because preshoot and de-emphasis can be defined
independently. The voltages both with and without de-emphasis are the same as
they have been for the lower data rates, except that now there are more
options for the de-emphasis value, ranging from 0 to -6 dB. Preshoot is a new
feature designed to improve the signal in the following bit time by boosting
the voltage in the current bit time. Finally, the maximum value is simply
what the signal would be if both C-1 and C+1 were zero (and C0 was 1.0). As
illustrated by the bit stream shown at the top of the diagram, we may
summarize the strategy for these voltages as follows:

• When the bits on both sides of the cursor have the opposite polarity, the
  voltage will be Vd, the maximum voltage.
• When a repeated string of bits is to be sent:
  - The first bit will use Va, the next lower voltage to the maximum voltage
    Vd.
  - Bits between the first and last bits use Vb, the lowest voltage.
  - The last repeated bit before a polarity change uses Vc, the next higher
    voltage to the lowest voltage Vb.
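
These rules reduce to comparing each bit with its two neighbors. The sketch
below uses an illustrative function name and the bit stream shown at the top
of Figure 13-23; the first and last bits are skipped because the figure does
not show their outer neighbors.

```python
def classify_levels(bits):
    """Label each interior bit with the voltage level it would use.

    Vd: both neighbors differ (isolated bit)
    Va: previous bit differs, next bit is the same (first of a run)
    Vb: both neighbors are the same (middle of a run)
    Vc: previous bit is the same, next bit differs (last of a run)
    """
    labels = []
    for i in range(1, len(bits) - 1):
        prev_same = bits[i - 1] == bits[i]
        next_same = bits[i + 1] == bits[i]
        if not prev_same and not next_same:
            labels.append("Vd")
        elif not prev_same and next_same:
            labels.append("Va")
        elif prev_same and next_same:
            labels.append("Vb")
        else:
            labels.append("Vc")
    return labels


if __name__ == "__main__":
    stream = [1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1]
    print(classify_levels(stream))
```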

Figure 13-23: 8.0 GT/s Tx Voltage Levels

[Figure: captured waveform for the bit stream
1 0 1 0 0 1 1 1 1 1 0 1 0 1 0 1, with the voltage levels Va, Vb, Vc, and Vd
marked.]


Presets and Ratios


As described in "Recovery.Equalization" on page 587, when the Link is
preparing to change from a lower data rate to 8.0 GT/s, the Downstream Port
sends EQ TS2s that give the Upstream Port a set of preset values to use for
its coefficients as a starting point from which to begin testing the Link
signal quality. The list of 11 possible presets along with their
corresponding coefficient values and voltage ratios is given in Table 13-1 on
page 478. Note that the voltages are given as a ratio with respect to the max
value. These values were selected to match the earlier spec versions. As an
example of how that is used, the first entry, P4, uses no de-emphasis or
preshoot, so all the voltage values are equal to the max value and the ratios
are all 1.000.

Table 13-1: Tx Preset Encodings with Coefficients and Voltage Ratios

Preset   Preshoot        De-emphasis        C-1      C+1          Va/Vd   Vb/Vd       Vc/Vd
Number   (dB)            (dB)
P4       0.0             0.0                0.000    0.000        1.000   1.000       1.000
P1       0.0             -3.5 +/- 1 dB      0.000    -0.167       1.000   0.668       0.668
P0       0.0             -6.0 +/- 1.5 dB    0.000    -0.250       1.000   0.500       0.500
P9       3.5 +/- 1 dB    0.0                -0.166   0.000        0.668   0.668       1.000
P8       3.5 +/- 1 dB    -3.5 +/- 1 dB      -0.125   -0.125       0.750   0.500       0.750
P7       3.5 +/- 1 dB    -6.0 +/- 1.5 dB    -0.100   -0.200       0.800   0.400       0.600
P5       1.9 +/- 1 dB    0.0                -0.100   0.000        0.800   0.800       1.000
P6       2.5 +/- 1 dB    0.0                -0.125   0.000        0.750   0.750       1.000
P3       0.0             -2.5 +/- 1 dB      0.000    -0.125       1.000   0.750       0.750
P2       0.0             -4.4 +/- 1.5 dB    0.000    -0.200       1.000   0.600       0.600
P10      0.0             Defined by LF      0.000    (FS-LF)/2    1.000   Not fixed   Not fixed


Equalizer Coefficients
Presets allow a device to choose one of 11 possible starting values for its
partner's Transmitter coefficients when first training to the 8.0 GT/s data
rate. This is accomplished by sending EQ TS1s and EQ TS2s during training,
which gives a coarse adjustment of Tx equalization as a starting point. If
the signal using the preset delivers the desired 10^-12 error rate, no
further training is needed. But if the measured error rate is too high, the
equalization sequence is used to fine-tune the coefficient settings by trying
different C-1 and C+1 values and evaluating the result, repeating the
sequence until the desired signal quality or error rate is achieved.

An 8.0 GT/s transmitter is required to report its range of supported
coefficient values to its neighboring Receiver. There are some constraints on
this:

• Device must support all 11 presets as listed in Table 13-1 on page 478.
• Transmitters must meet the full-swing VTX-EIEOS-FS signaling limits.
• Transmitters may optionally support the reduced swing, and if they do they
  must meet the VTX-EIEOS-RS limits.
• Coefficients must meet the boost limits (VTX-BOOST-FS = 8.0 dB min,
  VTX-BOOST-RS = 2.5 dB min) and resolution limits (EQ-TX-COEFF-RES = 1/24
  max to 1/63 min).

Applying these constraints and using the maximum granularity of 1/24 creates
a list of preshoot, de-emphasis, and boost values for each setting. The spec
presents this as a table, which is partially reproduced here in Table 13-2 on
page 480. The table contains blank entries because the boost value can't
exceed 8.0 +/- 1.5 dB = 9.5 dB. That results in a diagonal boundary where the
boost has reached 9.5 dB for the full-swing case. For reduced swing, the
boundary is at 3.5 dB. The 6 shaded entries along the left and top edges of
the table that go as far as 4/24 are presets supported by full- or
reduced-swing signaling. The other 4 shaded entries are presets supported for
full-swing signaling only.
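
The preshoot, de-emphasis, and boost values for a given pair of coefficients
can be reproduced with a few lines of code. The sketch below assumes that
both C-1 and C+1 are negative (as in the preset table), so that relative to
Vd the levels are Va = 1 - 2|C-1|, Vc = 1 - 2|C+1|, and
Vb = 1 - 2|C-1| - 2|C+1|, and that preshoot, de-emphasis, and boost are
20*log10 of Vc/Vb, Vb/Va, and Vd/Vb respectively. The function name and the
rounding to one decimal place are illustrative choices.

```python
import math


def db(ratio: float) -> float:
    """Voltage ratio expressed in decibels."""
    return 20.0 * math.log10(ratio)


def coefficient_entry(pre: int, post: int, steps: int = 24,
                      max_boost_db: float = 9.5):
    """Return (preshoot, de_emphasis, boost) in dB, or None if out of range.

    pre and post are |C-1| and |C+1| in units of 1/steps.
    """
    va = steps - 2 * pre               # first bit of a run
    vb = steps - 2 * pre - 2 * post    # middle of a run (lowest level)
    vc = steps - 2 * post              # last bit of a run
    if vb <= 0:
        return None
    boost = round(db(steps / vb), 1)
    if boost > max_boost_db:
        return None                    # past the full-swing boost boundary
    return (round(db(vc / vb), 1), round(db(vb / va), 1), boost)


if __name__ == "__main__":
    print(coefficient_entry(2, 5))   # C-1 = 2/24, C+1 = 5/24
    print(coefficient_entry(6, 3))   # beyond the 9.5 dB boundary
```

Passing max_boost_db=3.5 gives the smaller set of settings available under
the reduced-swing boundary described above.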


Table 13-2: Tx Coefficient Table

Each entry lists Preshoot (PS) / De-emphasis (DE) / Boost in dB. Rows select
C-1 and columns select C+1; "-" marks a blank entry.

C-1 \ C+1  0/24          1/24          2/24          3/24          4/24          5/24          6/24
0/24       0.0/0.0/0.0   0.0/-0.8/0.8  0.0/-1.6/1.6  0.0/-2.5/2.5  0.0/-3.5/3.5  0.0/-4.7/4.7  0.0/-6.0/6.0
1/24       0.8/0.0/0.8   0.8/-0.8/1.6  0.9/-1.7/2.5  1.0/-2.8/3.5  1.2/-3.9/4.7  1.3/-5.3/6.0  1.6/-6.8/7.6
2/24       1.6/0.0/1.6   1.7/-0.9/2.5  1.9/-1.9/3.5  2.2/-3.1/4.7  2.5/-4.4/6.0  2.9/-6.0/7.6  3.5/-8.0/9.5
3/24       2.5/0.0/2.5   2.8/-1.0/3.5  3.1/-2.2/4.7  3.5/-3.5/6.0  4.1/-5.1/7.6  4.9/-7.0/9.5  -
4/24       3.5/0.0/3.5   3.9/-1.2/4.7  4.4/-2.5/6.0  5.1/-4.1/7.6  6.0/-6.0/9.5  -             -
5/24       4.7/0.0/4.7   5.3/-1.3/6.0  6.0/-2.9/7.6  7.0/-4.9/9.5  -             -             -
6/24       6.0/0.0/6.0   6.8/-1.6/7.6  8.0/-3.5/9.5  -             -             -             -

Coefficient Example. Let's drill a little deeper on the coefficients by using
preset number P7 from Table 13-1 on page 478 as an example. In this entry,
C-1 = -0.100, and C+1 = -0.200, and since C0 must be positive and the sum of
their absolute values must be one, it's implied that C0 = 0.700.

Matching these values to the table of coefficient space given in the spec is
not straightforward because the coefficients are given as fractions rather
than decimal values, but converting the fractions to their decimal values
matches them pretty closely. The C-1 value of 0.100 is closest to 2/24
(0.083), while C+1 at 0.200 is a little less than 5/24 (0.208). The
coefficient table entry for those fractions is highlighted as one of the
preset values, giving us some confidence that this is on the right track. In
the preset table, P7 lists a preshoot value of 3.5 +/- 1 dB, and the value in
the coefficient table is shown as 2.9 dB. If we correct for the difference in
coefficient values, ((0.083 / 0.1) * 3.5 = 2.9) we arrive at the same
preshoot value. The difference in coefficient values for de-emphasis was much
smaller (0.200 vs. 0.208) and so, as we might expect, both tables show this
as -6.0 dB.


What voltages do the P7 coefficients create? Assuming a full-swing voltage of
Vd as a starting point then, according to the ratios in the preset table, the
other voltages would be Va = 0.8 Vd, Vb = 0.4 Vd, and Vc = 0.6 Vd. How well
do those correspond to the values that would result from using the preshoot
and de-emphasis numbers? De-emphasis was given as -6.0 dB, and we already
know that represents a 50% voltage reduction, so we'd expect that Vb should
be half of Va, which it is. Preshoot was given as 3.5 dB, meaning the ratio
of Vb/Vc is 0.668, and 0.4 / 0.668 = 0.598 Vd for Vc; very close to the
0.6 Vd we expected. Last of all, the Boost value, which is the ratio of
Vd/Vb, is not given in the preset table but, using the formula
20 * log(Vd/Vb), the boost from the preset values turns out to be 7.96 dB, or
roughly 8.0 dB. That's reasonably close to the 7.6 dB value given in the
coefficient table and gives us some confidence that the tables are consistent
among themselves.
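
The decibel arithmetic in this example is easy to check. The sketch below
uses an illustrative helper name and the P7 voltage ratios from the preset
table, with Vd normalized to 1.0.

```python
import math


def ratio_db(numerator: float, denominator: float) -> float:
    """20 * log10 of a voltage ratio."""
    return 20.0 * math.log10(numerator / denominator)


if __name__ == "__main__":
    vd, va, vb, vc = 1.0, 0.8, 0.4, 0.6   # P7 ratios relative to Vd
    print(f"de-emphasis 20*log(Vb/Va) = {ratio_db(vb, va):.2f} dB")
    print(f"preshoot    20*log(Vc/Vb) = {ratio_db(vc, vb):.2f} dB")
    print(f"boost       20*log(Vd/Vb) = {ratio_db(vd, vb):.2f} dB")
```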

So how are the four voltages obtained? There are essentially three
programmable drivers whose output is summed to derive the final signal value
to be launched. If the cursor setting remains unchanged, and the pre- and
post-cursor taps are negative, then the answer can be found by simply adding
the taps as (C0 + C-1 + C+1).

• Vd = (C0 + C-1 + C+1) = (0.700 + 0.100 + 0.200) = 1.0 * max voltage. This
  is the boosted value that results when a bit is both preceded and followed
  by bits of the opposite polarity. In all four voltages listed here, if the
  polarity of the bits is inverted then the values would all be negative.
• Va = (0.700 + (-0.100) + 0.200) = 0.8 * max voltage. This is the value that
  results when a bit is preceded by the opposite polarity but followed by the
  same polarity, meaning it is the first in a repeated string of bits.
• Vb = (0.700 + (-0.100) + (-0.200)) = 0.4 * max voltage. This is the
  de-emphasized value that results when a bit is both preceded and followed
  by bits of the same polarity, meaning it's in the middle of a repeated
  string of bits.
• Vc = (0.700 + 0.100 + (-0.200)) = 0.6 * max voltage. This is the preshoot
  value that results when a bit is preceded by the same polarity but followed
  by the opposite polarity, meaning it's the last bit in a repeated string of
  bits.
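
The same four values fall out of a direct implementation of the 3-tap sum.
The sketch below is an illustration rather than the spec's definition: it
maps bits to +1/-1, uses the P7 coefficients as defaults, and weights the
upcoming bit by C-1 and the previous bit by C+1. The first and last bits are
skipped because their outer neighbors are unknown.

```python
def tx_fir(bits, c_pre=-0.100, c0=0.700, c_post=-0.200):
    """Apply a 3-tap Tx FIR to a bit sequence; output is relative to Vd."""
    symbols = [1.0 if bit else -1.0 for bit in bits]
    output = []
    for n in range(1, len(symbols) - 1):
        level = (c_pre * symbols[n + 1]     # pre-cursor: the upcoming bit
                 + c0 * symbols[n]          # cursor: the current bit
                 + c_post * symbols[n - 1])  # post-cursor: the previous bit
        output.append(round(level, 3))
    return output


if __name__ == "__main__":
    stream = [1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0]
    print(tx_fir(stream))
```

Isolated bits come out at +/-1.0 (Vd), the first bit of a run at +/-0.8
(Va), the middle of a run at +/-0.4 (Vb), and the last bit of a run at
+/-0.6 (Vc).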

What determines when the coefficients are added or subtracted to arrive at
these numbers? This turns out to be fairly simple, since it's just a matter
of the polarity of the time-shifted pre- and post-cursor inputs. This is
illustrated in Figure 13-24 on page 482. The single-ended waveform labeled
"Weighted Cursor (C0)" shows the positive half of the differential bit stream
currently being transmitted. If the waveforms are understood as shifting to
the right with time, then the next lower trace (C+1) is the post-cursor
signal. This version arrives one clock later and is weighted negatively by
its coefficient, causing it to be inverted. The top trace (C-1) arrives a
clock earlier than the cursor and is the pre-cursor value that is also
weighted negatively according to its own coefficient.

Finally, the bottom trace shows the result of summing all three inputs to
arrive at the final signal that is actually launched onto the wire. In the
illustration, this is overlaid with the single-ended output waveform from
Figure 13-23 on page 477 to show that it approximates a real capture fairly
well. Some voltage calculations are shown from our previous example to
demonstrate how the resulting voltages are obtained.

Figure 13-24: Tx 3-Tap Equalizer Output

[Figure: four stacked traces for the bit stream 1 0 1 0 0 1 1 1 1 1 0: the
Weighted Pre-Cursor (C-1), the Weighted Cursor (C0), the Weighted Post-Cursor
(C+1), and the Output (C0 + C-1 + C+1) with the levels Va, Vb, Vc, and Vd
marked. Sample calculations shown on the output trace:
  (0.7 + (-0.1) + (-0.2)) = 0.4
  (-0.7 + (-0.1) + (0.2)) = -0.6
  (-0.7 + (0.1) + (-0.2)) = -0.8
  (-0.7 + (-0.1) + (-0.2)) = -1.0]


The coefficient presets are exchanged before the Link changes to 8.0 GT/s,
and then they may be updated during the Link equalization process (see
"Recovery.Equalization" on page 587 for more details).

EIEOS Pattern. At 8.0 GT/s, some voltages are measured when the signal has a
low frequency because the high-frequency changes won't reach the levels we
want to measure. The EIEOS sequence contains 8 consecutive ones followed by 8
consecutive zeros in a pattern that repeats for 128 bit times. Its purpose is
primarily to serve as an unambiguous indication that a Transmitter is exiting
from Electrical Idle, which scrambled data can't guarantee. Its launch
voltage is defined as VTX-EIEOS-FS for full swing and VTX-EIEOS-RS for
reduced swing signals.
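
To see why this pattern counts as low frequency, it helps to build it. The
sketch below follows the description above (8 ones, then 8 zeros, repeated
for 128 bit times); the function names are illustrative, and any block
framing bits that accompany the pattern on the wire are left out.

```python
def eieos_pattern_bits(total_bits: int = 128, run_length: int = 8):
    """Return the repeating ones/zeros pattern as a list of bits."""
    period = [1] * run_length + [0] * run_length
    repeats = total_bits // len(period)
    return period * repeats


def pattern_frequency_hz(bit_rate: float, run_length: int = 8) -> float:
    """Fundamental frequency of a square wave with this run length."""
    return bit_rate / (2 * run_length)


if __name__ == "__main__":
    bits = eieos_pattern_bits()
    print(len(bits), bits[:16])
    print(pattern_frequency_hz(8.0e9) / 1e6, "MHz")
```

At 8.0 GT/s the pattern toggles at 500 MHz, compared with 4 GHz for an
alternating 1 0 1 0 stream.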

Reduced Swing. Transmitters may support a reduced-swing signal much as they
did for 5.0 GT/s: to achieve both power savings and a better signal over
short, low-loss transmission paths. The output voltage has the same 1300 mV
max value as the full-swing case, but allows a lower minimum voltage of
232 mV as defined for VTX-EIEOS-RS. Operating at reduced swing limits the
number of presets because the maximum boost supported is 3.5 dB.

Beacon Signaling
General
De-emphasis is also applied to the Beacon signal, so a discussion about the
Beacon is included in this section. A device whose Link is in the L2 state
can generate a wake-up event to request that power be restored so it can
communicate with the system. The Beacon is one of two methods available for
this purpose. The other method is to assert the optional sideband WAKE#
signal. An example of what the Beacon might look like is shown in Figure
13-25 on page 484. This version shows the differential signals pulsing and
then decaying in opposite directions and is reminiscent of a flashing beacon
light. Other options are available for the Beacon, but this one illustrates
the concept well.


Figure 13-25: Example Beacon Signal

While a Link is in the L2 power state, its main power source and clock are
turned off but an auxiliary voltage source (Vaux) keeps a small part of the
device working, including the wake-up logic. To signal a wake-up event, a
downstream device can drive the Beacon upstream to start the L2 exit
sequence. A switch or bridge receiving a Beacon on its Downstream Port must
forward notification upstream by sending the Beacon on its Upstream Port or
by asserting the WAKE# pin. See "WAKE#" on page 773.

The motivation for creating two wake-up mechanisms is to provide choices
regarding power consumption. To use the Beacon, all the bridges and switches
between an Endpoint and the Root Complex will need to use Vaux so they can
detect and generate the signal. If a system is always plugged in and
unconcerned about the amount of standby power, the Beacon in-band signal may
be preferred over having to route an extra sideband signal. But in a mobile
system with limited battery life where conserving power is a high priority,
the WAKE# pin is preferred because that approach uses as little Vaux as
possible. The pin could be connected directly from the Endpoint to the Root
Complex and then no other devices would need to be involved or use Vaux.

Properties of the Beacon Signal


• A low-frequency, DC-balanced differential signal consisting of a periodic
  pulse of between 2 ns and 16 µs.
• The maximum time between pulses can be no more than 16 µs.
• The transmitted Beacon signal must meet the electrical voltage specs
  documented in Table 13-3 on page 489.
• The signal must be DC balanced within a maximum time of 32 µs.
• Beacon signaling, like normal differential signaling, must be done with the
  Transmitter in the low-impedance mode (50 Ω single-ended, 100 Ω
  differential impedance).
• When signaled, the Beacon signal must be transmitted on Lane 0, but does
  not have to be transmitted on other Lanes.
• With one exception, the transmitted Beacon signal must be de-emphasized
  according to the rules defined in the previous section. For Beacon pulses
  greater than 500 ns, the Beacon signal voltage must be 6 dB de-emphasized
  from the VTX-DIFFp-p spec. The Beacon signal voltage may be de-emphasized
  by up to 3.5 dB for Beacon pulses smaller than 500 ns.
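
The pulse-width and de-emphasis rules in this list can be captured in a small
helper. This is only a sketch with an illustrative function name. The list
does not say how a pulse of exactly 500 ns is treated, so the code assumes it
falls in the shorter category.

```python
def beacon_de_emphasis_range_db(pulse_width_s: float):
    """Return the allowed (min, max) de-emphasis in dB for a Beacon pulse."""
    if not 2e-9 <= pulse_width_s <= 16e-6:
        raise ValueError("Beacon pulse width must be between 2 ns and 16 us")
    if pulse_width_s > 500e-9:
        return (6.0, 6.0)   # longer pulses: must be 6 dB de-emphasized
    return (0.0, 3.5)       # shorter pulses: up to 3.5 dB (assumed at 500 ns)


if __name__ == "__main__":
    for width in (30e-9, 2e-6):
        print(width, beacon_de_emphasis_range_db(width))
```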

Eye Diagram

Jitter, Noise, and Signal Attenuation


As the bit stream travels from the Transmitter on one end of a link to the
Receiver on the other end, it is subject to the following disruptive
influences:
• Deterministic (i.e., predictable) jitter induced by the Link transmission
  line.
• Data-dependent jitter induced by the dynamic data patterns on the Link.
• Noise induced into the signal pair.
• Signal attenuation due to the impedance effect of the transmission line.

The Eye Test


To verify that a Receiver sees a signal that is within the allowed variation,
an eye test may be performed. The following description of this measurement
was provided by James Edwards from an article he authored for OE Magazine.

"The most common time-domain measurement for a transmission system is the eye
diagram. The eye diagram is a plot of data points repetitively sampled from a
pseudo-random bit sequence and displayed by an oscilloscope. The time window
of observation is two data periods wide. For a [PCI Express link running at
2.5 GT/s], the period is 400 ps, and the time window is set to 800 ps. The
oscilloscope sweep is triggered by every data clock pulse. An eye diagram
allows the user to observe system performance on a single plot.
To observe every possible data combination, the oscilloscope must operate
like a multiple-exposure camera. The digital oscilloscope's display
persistence is set to infinite. With each clock trigger, a new waveform is
measured and overlaid upon all previous measured waveforms. To enhance the
interpretation of the composite image, digital oscilloscopes can assign
different colors to convey information on the number of occurrences of the
waveforms that occupy the same pixel on the display, a process known as color
grading. Modern digital sampling oscilloscopes include the ability to make a
large number of automated measurements to fully characterize the various eye
parameters."
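
The overlay that the oscilloscope performs amounts to folding every sample
time into a window two bit periods wide. The sketch below shows only that
folding step on already-captured samples; the function name and the sample
data are made up for illustration.

```python
def fold_into_eye(samples, ui_seconds: float, window_ui: int = 2):
    """Map (time, voltage) samples onto a window that is window_ui UI wide."""
    window = window_ui * ui_seconds
    return [(time % window, voltage) for time, voltage in samples]


if __name__ == "__main__":
    ui = 400e-12  # bit period at 2.5 GT/s
    captured = [(0.0, 0.10), (900e-12, -0.20), (1700e-12, 0.30)]
    for time, voltage in fold_into_eye(captured, ui):
        print(f"{time * 1e12:6.1f} ps  {voltage:+.2f} V")
```

Plotting all of the folded points on one set of axes produces the composite
eye image described in the quotation.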

Normal Eye Diagram


An ideal trace capture would paint an eye pattern that matched the outline
shown in the center of Figure 13-26 on page 486 labeled "Normal". As long as
the pattern resides entirely within that region, the Transmitter and Link are
within tolerance. Note that the differential voltage parameters and values
shown are peak voltages instead of the peak-to-peak voltages used in the
spec, because only peak voltages can be represented in an eye diagram. Figure
13-27 on page 488 shows a screen capture of a good eye diagram.

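Figure 13-26: Transmitter Eye Diagram

[Figure: Transmitter eye showing the Normal and De-emphasized eyes around the
Minimum Eye region, with Overshoot and Undershoot zones, the VTX-DIFFp-MAX
and VTX-DIFFp-MIN voltage limits, the Eye Opening, and Jitter on either side
of TTX-EYE within one UI (Unit Interval).]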


Effects of Jitter
Jitter (timing uncertainty) is what happens when an edge arrives either
before or after its ideal time, and acts to reduce signal integrity and close
the eye opening. It's caused by a variety of factors, from environmental
effects to the data pattern in flight, to noise or signal attenuation that
causes the signal's voltage level to overshoot or undershoot the normal zone.
At 2.5 GT/s this could be treated as a simple lumped effect, but at higher
data rates it becomes a more significant issue and must be considered in
several different parts. Aiming at this goal, the 8.0 GT/s data rate defines
5 different jitter values. The details of jitter analysis and minimization
are beyond the scope of this book, but let's at least define the terms the
spec uses. Jitter is described as being in one of several categories:
1. Un-correlated: jitter that is not dependent on, or correlated to, the data
   pattern being transmitted.
2. Rj: Random jitter from unpredictable sources that are unbounded and
   usually assumed to fit a Gaussian distribution. Often caused by electrical
   or thermal noise in the system.
3. Dj: Deterministic jitter that's predictable and bounded in its
   peak-to-peak value. Often caused by EMI, crosstalk, power supply noise or
   grounding problems.
4. PWJ: Pulse Width Jitter; uncorrelated, edge-to-edge, high-frequency
   jitter.
5. DjDD: Deterministic Jitter, using the Dual-Dirac approximation. This model
   is a method of quickly estimating total jitter for a low BER without
   requiring the large sample size that would normally be needed. It uses a
   representative sample taken over a relatively short period (an hour or so)
   and extrapolates the curves to arrive at acceptable approximate values.
6. DDj: Data-dependent jitter is a function of the data pattern being sent,
   and the spec states that this is mostly due to package loss and
   reflection. ISI is an example of DDj.
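
The Dual-Dirac extrapolation in item 5 is commonly written as
TJ(BER) = DjDD + 2 * Q(BER) * Rj, where Rj is the RMS random jitter and
Q(BER) is the Gaussian tail multiplier for the target BER. That formula is
the widely used industry convention rather than something quoted in this
chapter, and the jitter numbers below are made-up inputs for illustration.

```python
from statistics import NormalDist


def q_scale(ber: float) -> float:
    """Gaussian tail multiplier: the sigma count whose tail area is ber."""
    return -NormalDist().inv_cdf(ber)


def total_jitter_ps(dj_dd_ps: float, rj_rms_ps: float,
                    ber: float = 1e-12) -> float:
    """Dual-Dirac estimate of total jitter at the target BER."""
    return dj_dd_ps + 2.0 * q_scale(ber) * rj_rms_ps


if __name__ == "__main__":
    print(f"Q(1e-12) = {q_scale(1e-12):.3f}")
    # Hypothetical measurement: 10 ps DjDD and 1 ps RMS Rj
    print(f"TJ = {total_jitter_ps(10.0, 1.0):.2f} ps")
```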
Figure 13-28 on page 488 shows a screen capture of a bad Eye Diagram at 2.5
GT/s. Since this is captured without de-emphasis, the traces should all stay
outside the Minimum Eye area, shown on the screen by the trapezoid shape in
the middle. This example illustrates that jitter can affect both edge arrival
times and voltage levels, causing some trace instances to encroach on the
keep-out area of the diagram.


Figure 13-27: Rx Normal Eye (No De-emphasis)

Figure 13-28: Rx Bad Eye (No De-emphasis)


Transmitter Driver Characteristics


Table 13-3 on this page lists some Transmitter driver characteristics. This
is not intended to replicate the tables from the spec, but to give some basic
parameters to illustrate some differences between the data rates, such as UI,
and to show that some things have remained unchanged, such as the Tx
common-mode voltage.

Table 13-3: Transmitter Specs

Item                   2.5 GT/s              5.0 GT/s              8.0 GT/s              Units  Notes
UI                     399.88 (min)          199.94 (min)          124.9625 (min)        ps     Unit Interval (bit time).
                       400.12 (max)          200.06 (max)          125.0375 (max)
TTX-EYE                0.75 (min)            0.75 (min)            See notes             UI     Transmitter Eye, including all jitter sources. For 8.0 GT/s, five jitter sources are specified separately.
TTX-RF-MISMATCH        Not specified         0.1 (max)             Not specified         UI     Rise and Fall time difference measured from 20% to 80% differentially.
VTX-DIFFp-p            0.8 (min)             0.8 (min)             See Table 13-4        V      Peak-to-peak differential voltage.
                       1.2 (max)             1.2 (max)
VTX-DIFFp-p-LOW        0.4 (min)             0.4 (min)             See Table 13-4        V      Low-power voltage.
                       1.2 (max)             1.2 (max)
VTX-DC-CM              0 to 3.6              0 to 3.6              0 to 3.6              V      DC common-mode voltage at Tx pins.
VTX-DE-RATIO-3.5dB     3 (min)               3 (min)               See Table 13-4        dB     Ratio for 3.5 dB de-emphasized bits.
                       4 (max)               4 (max)
VTX-DE-RATIO-6dB       n/a                   5.5 (min)             See Table 13-4        dB     Ratio for 6 dB de-emphasized bits.
                                             6.5 (max)
ITX-SHORT              90                    90                    90                    mA     Total single-ended current Tx can supply when shorted to ground.
VTX-IDLE-DIFF-AC-p     0 (min)               0 (min)               0 (min)               mV     Peak differential voltage under Electrical Idle state of Link. Must include a band-pass filter passing frequencies from 10 KHz to 1.25 GHz.
                       20 (max)              20 (max)              20 (max)
TTX-IDLE-MIN           20 (min)              20 (min)              20 (min)              ns     Minimum time a Transmitter must be in Electrical Idle.
TTX-IDLE-SET-TO-IDLE   8 (max)               8 (max)               8 (max)               ns     Time allowed for Tx to meet Electrical Idle spec after last bit of required EIOSs.
TTX-IDLE-TO-DIFF-DATA  8                     8                     8                     ns     Max time for Tx to meet differential transmission spec after Electrical Idle exit.
ZTX-DIFF-DC            80 (min)              120 (max)             120 (max)             Ω      DC differential Tx impedance. Typical value is 100 Ω. Min value for 5.0 and 8.0 GT/s is bounded by RLTX-DIFF.
                       120 (max)
RLTX-DIFF              10 (min)              10 (min) for          10 (min) for          dB     Tx package return loss. Note that the frequency is the signal on the wire. Note that at higher rates it becomes necessary to specify different parameters for different frequencies.
                                             0.5-1.25 GHz          0.5-1.25 GHz
                                             8 (min) for           8 (min) for
                                             >1.25-2.5 GHz         >1.25-2.5 GHz
                                                                   4 (min) for
                                                                   >2.5 to 4 GHz
CTX                    75 (min)              75 (min)              176 (min)             nF     Required AC coupling cap on each Lane placed in the media or in the component itself.
                       265 (max)             265 (max)             265 (max)
LTX-SKEW               500 ps + 2 UI (max)   500 ps + 4 UI (max)   500 ps + 6 UI (max)   ps     Skew between any two Lanes in the same Transmitter.
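
The UI limits in Table 13-3 are consistent with a nominal bit time of
1 / (data rate) and a clock tolerance of +/-300 ppm. The tolerance figure is
an assumption brought in from the PCIe reference clock requirements rather
than from this table, and the function name is illustrative.

```python
def ui_limits_ps(rate_gt_per_s: float, tolerance_ppm: float = 300.0):
    """Return (min, max) Unit Interval in picoseconds for a data rate."""
    nominal_ps = 1000.0 / rate_gt_per_s          # 1 / rate, in ps
    delta_ps = nominal_ps * tolerance_ppm * 1e-6
    return nominal_ps - delta_ps, nominal_ps + delta_ps


if __name__ == "__main__":
    for rate in (2.5, 5.0, 8.0):
        low, high = ui_limits_ps(rate)
        print(f"{rate} GT/s: {low:.4f} ps (min), {high:.4f} ps (max)")
```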

Table 13-4: Parameters Specific to 8.0 GT/s

Symbol            Value        Units  Notes
VTX-FS-NO-EQ      1300 (max)   mVPP   No EQ is applied; measured using 64 zeros followed by 64 ones.
                  800 (min)
VTX-RS-NO-EQ      1300 (max)   mVPP   No EQ is applied; measured using 64 zeros followed by 64 ones.
VTX-BOOST-FS      8.0 (min)    dB     Tx boost ratio for full swing. (Assumes +/- 1.5 dB tolerance)
VTX-BOOST-RS      2.5 (min)    dB     Tx boost ratio for reduced swing. (Assumes +/- 1.0 dB tolerance)
EQ-TX-COEFF-RES   1/24 (max)   n/a    Tx coefficient resolution
                  1/63 (min)


Receiver Characteristics

Stressed-Eye Testing
Receivers are tested using a stressed-eye technique, in which a signal with
specific problems is presented to the input pins and the BER is monitored.
The spec presents these for 2.5 and 5.0 GT/s separately from 8.0 GT/s because
of the difference in the methods used, and then gives a third section that
defines parameters common to all the speeds.

2.5 and 5.0 GT/s


At 2.5 GT/s, the parameters are measured at the Receiver pins and there is an
implied correlation between the margins observed and the BER. At 5.0 GT/s,
receiver tolerancing is applied. This is a two-step method in which a test
board is calibrated to show the worst-case signal margins as defined in the
spec. Then, once the calibration is done, the test load is replaced by the
device to be tested and its BER is observed. There are actually two sets of
worst-case numbers based on the clocking scheme: one is defined for the
common-clock architecture and another for the data-clocked architecture. At
higher speeds every element of the signal path must be carefully considered,
and that's true for the device package, too. The effects added to the signal
by the package must be comprehended in the testing process.

The calibration channel itself must be designed with specific characteristics
in mind, but the spec observes that a trace length of 28 inches on an FR4 PCB
should suffice to create the necessary ISI. A signal generator is used to
inject the Compliance Pattern with the appropriate jitter elements included.

8.0 GT/s
The method for testing the stressed eye at 8.0 GT/s is similar, but there are
some differences. One difference is that the signal can't be evaluated at the
device pin and so a replica channel is used to allow measuring the signal as
it would appear at the pin if the device were an ideal termination.

In order to evaluate the Receiver's ability to perform equalization properly,
it's recommended that multiple calibration channels with different insertion
loss characteristics be used so the receiver can be tested in more than one
environment. As with the transmitter at 8.0 GT/s, the calibration channel for
the receiver consists of differential traces terminated at both ends with
coaxial connectors.


To establish the correct correlation between the channel and the receiver
it's necessary to model what the receiver sees internally after equalization
has been applied. That means post-processing must be applied that will model
what happens in the Receiver, including the following items, the details of
which are described in the spec:
• Package insertion loss
• CDR: Clock and Data Recovery logic
• Equalization that accounts for the longest calibration channel, including
  - First-order CTLE (Continuous Time Linear Equalizer)
  - One-tap DFE (Decision Feedback Equalizer)

Receiver (Rx) Equalization


Transmitter equalization is mandatory, but the signal may still suffer enough
degradation going through the longest permissible channel that the eye is
closed and the signal is unrecognizable at the Receiver. To accommodate this
the spec describes receiver equalization logic, but says it is not intended
to serve as an implementation guideline. What it does say is that a version
will be required for calibrating the stressed eye when using the longest
allowed calibration channel. As described earlier, that requirement is
described as a first-order CTLE and a one-tap DFE.

Continuous-Time Linear Equalization (CTLE)


A linear equalizer removes the undesirable frequency components from the
received signal. For PCIe this could be as simple as a passive high-pass
filter that reduces the voltage of the low-frequency components of the
received signal, which the transmission line attenuates less than the
high-frequency components. It could also be done with amplification to open
up the received eye; however, that would amplify the high-frequency noise
along with the signal and create other problems.

One form of receiver equalization would be a circuit like the one shown in
Figure 13-29 on page 494, which is a Discrete-Time Linear Equalizer (DLE).
This is simply an FIR filter, similar to the one used by the transmitter, to
provide wave shaping as a means of compensating for channel distortion. One
difference is that it uses a Sample-and-Hold (S&H) circuit on the front end
to hold the analog input voltage at a sampled value for a time period, rather
than allowing it to constantly change. The spec doesn't mention DLE, and the
reasons may include its higher cost and power compared to CTLE. As with the
transmitter FIR, more taps provide better wave shaping but add cost, so only
a small number are practical.


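Figure 13-29: Rx Discrete-Time Linear Equalizer (DLE)

[Figure: block diagram with the Received Signal feeding a Sample-and-Hold
(S&H) stage, two 1 UI delay elements, taps weighted by C0 and C+1, and a
summing node.]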

In contrast, CTLE is not limited to discrete time intervals and improves the
signal over a longer time interval. A simple RC network can serve as an
example of a CTLE high-pass filter, as shown in Figure 13-30 on page 494.
This serves to reduce the low-frequency distortion caused by the channel
without boosting the noise in the high-frequency range of interest and cleans
the signal for use at the next stage. Figure 13-31 on page 495 illustrates
the attenuation effect of a CTLE high-pass filter on the received
low-frequency component of a signal, e.g. continuous 1s or continuous 0s.
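
The behavior of a simple RC high-pass filter can be sketched numerically.
The code below uses the standard first-order response
|H(f)| = (f/fc) / sqrt(1 + (f/fc)^2) with fc = 1 / (2*pi*R*C). The R and C
values are hypothetical and are not taken from the spec, and a practical CTLE
would be more elaborate than this single RC stage.

```python
import math


def rc_highpass_gain(freq_hz: float, r_ohms: float, c_farads: float) -> float:
    """Magnitude response of a first-order RC high-pass filter."""
    corner_hz = 1.0 / (2.0 * math.pi * r_ohms * c_farads)
    ratio = freq_hz / corner_hz
    return ratio / math.sqrt(1.0 + ratio * ratio)


if __name__ == "__main__":
    r, c = 50.0, 1e-12  # hypothetical values, corner near 3.2 GHz
    for label, freq in (("long runs of 1s or 0s", 0.25e9),
                        ("alternating 1 0 1 0 at 8.0 GT/s", 4.0e9)):
        print(f"{label}: gain {rc_highpass_gain(freq, r, c):.3f}")
```

With these example values the slowly changing content is attenuated far more
than the 4 GHz content, which is the effect described above.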

Figure 13-30: Rx Continuous-Time Linear Equalizer (CTLE)

[Figure: a simple RC network placed between the Channel and the Input.]



Figure 13-31: Effect of Rx Continuous-Time Linear Equalizer (CTLE) on
Received Signal

Decision Feedback Equalization (DFE)


An example one-tap DFE circuit like the one described in the spec is shown in
Figure 13-32 on page 495, where it can be seen that the received signal is
summed with the feedback value and then fed into a data slicer. A slicer is
an A/D circuit that takes the analog-looking input and converts it into a
clean, full-swing digital signal for internal use. It makes its best guess
and decides whether the input is a positive or negative value and outputs
either +1 or -1. This decision is sent into an FIR filter with only one tap,
which is just a delayed version weighted according to a coefficient setting.
The output of this filter is then fed back and summed with the received
signal for use as the new input to the data slicer.
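
The feedback loop can be modeled in a few lines. This is a behavioral sketch
only: the function name, the tap value, and the toy channel (each received
sample is 0.5 times the current symbol plus 0.6 times the previous one) are
assumptions chosen so that the slicer fails without feedback.

```python
def one_tap_dfe(received, d1: float):
    """Slice each sample after subtracting d1 times the previous decision."""
    decisions = []
    previous = 0.0                      # no decision before the first sample
    for sample in received:
        corrected = sample - d1 * previous
        decision = 1.0 if corrected >= 0.0 else -1.0   # the data slicer
        decisions.append(decision)
        previous = decision             # 1 UI delay in the feedback path
    return decisions


if __name__ == "__main__":
    symbols = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, 1.0]
    received = [0.5 * s + 0.6 * p
                for s, p in zip(symbols, [0.0] + symbols[:-1])]
    print("without feedback:", one_tap_dfe(received, 0.0))
    print("with d1 = 0.6:   ", one_tap_dfe(received, 0.6))
```

In this toy case the feedback cancels the previous symbol's contribution
exactly, so every decision is correct as long as the earlier decisions were
correct.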

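Figure 13-32: Rx 1-Tap DFE

[Figure: the Received Signal enters a summing node whose result feeds the
Slicer, producing the Output; the Output passes through a 1 UI delay and is
weighted by the -d1 Coefficient before returning to the summing node.]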


The spec only describes a single-tap filter, but a two-tap version is shown
in Figure 13-33 on page 497 to illustrate another option. The motivation for
including more taps is to create a cleaner output, since each tap reduces the
noise for one more UI. Thus, two taps further reduce the undesirable
components of the signal, as shown in the pulse response waveform at the
bottom of the drawing. This version is also shown as adaptive, meaning it's
able to modify the coefficient values on the fly based on design-specific
criteria.

The coefficients of the filter could be fixed, but if they're adjustable the
receiver is allowed to change them at any time as long as doing so doesn't
interfere with the current operation. In the section called
"Recovery.Equalization" on page 587, Receiver Preset Hints are described as
being delivered by the Downstream Port to the Upstream Port on a Link, using
EQ TS1s. The preset gives a hint, in terms of dB reduction, at a starting
point for choosing these coefficients.

Since the spec doesn't require it, what the Receiver chooses to do regarding
signal compensation will be implementation specific. Industry literature
states that DFE is more effective when working with an open eye, and that's
why it's usually employed after a linear equalizer that serves to clean up
the input enough for DFE to work well.


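Figure 13-33: Rx 2-Tap DFE

[Figure: two-tap version of the DFE, with the Slicer output fed back through
two 1 UI delays weighted by -d1 and -d2 and an Adaptive Coefficient
Adjustment block. The pulse-response plot beneath it (voltage versus time in
UI) compares the original Rx signal with the Rx signal after DFE, marking the
cursor and the 1st-tap and 2nd-tap reductions.]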

Receiver Characteristics
Some selected Receiver characteristics are listed in Table 13-5 on page 498.
The Receiver Eye Diagram in Figure 13-34 on page 499 also illustrates some of
the parameters listed in the table.


Table 13-5: Common Receiver Characteristics

UI (units: ps)
    2.5 GT/s: 399.88 (min), 400.12 (max)
    5.0 GT/s: 199.94 (min), 200.06 (max)
    8.0 GT/s: 124.9625 (min), 125.0375 (max)
    Notes: Unit Interval = bit time.

TRX-EYE (units: UI)
    2.5 GT/s: 0.4 (min)
    Higher rates: Indirectly specified
    Notes: Minimum eye width for a BER of 10^-12. At higher rates and long
    channels the eye is effectively closed, making external measurement
    impractical.

VRX-EYE (units: mVpp diff)
    2.5 GT/s: 300
    5.0 GT/s: 120 (CC), 100 (DC)
    8.0 GT/s: Not specified
    Notes: CC = common clocked, DC = data clocked.

VRX-DIFF-PP-CC (units: mV)
    2.5 GT/s: 175 (min), 1200 (max)
    5.0 GT/s: 120 (min), 1200 (max)
    8.0 GT/s: Indirectly specified
    Notes: Peak-to-peak differential voltage sensitivity of a common-clocked
    Receiver.

VRX-DIFF-PP-DC (units: mV)
    2.5 GT/s: 175 (min), 1200 (max)
    5.0 GT/s: 100 (min), 1200 (max)
    8.0 GT/s: Indirectly specified
    Notes: Peak-to-peak differential voltage sensitivity of a data-clocked
    Receiver.

VRX-IDLE-DET-DIFFp-p (units: mV)
    65 (min), 175 (max)
    Notes: Electrical Idle detect threshold at the Receiver pins.

ZRX-DIFF-DC (units: ohms)
    2.5 GT/s: 80 (min), 120 (max)
    Higher rates: Covered by RLRX-DIFF
    Notes: At higher frequencies impedance can no longer be represented by a
    lumped sum value and must be described in more detail.

ZRX-DC (units: ohms)
    2.5 GT/s: 40 (min), 60 (max)
    5.0 GT/s: 40 (min), 60 (max)
    8.0 GT/s: Bounded by RLRX-CM
    Notes: DC impedance needed for Receiver Detect.

LRX-SKEW (units: ns)
    2.5 GT/s: 20
    5.0 GT/s: 8
    8.0 GT/s: 6
    Notes: Max Lane-to-Lane skew that a Receiver must be able to correct.

RLRX-DIFF (units: dB)
    2.5 GT/s: 10 (min)
    5.0 GT/s: 10 (min) for 0.05 - 1.25 GHz, 8 (min) for >1.25 - 2.5 GHz
    8.0 GT/s: 10 (min) for 0.05 - 1.25 GHz, 8 (min) for >1.25 - 2.5 GHz,
              5 (min) for >2.5 - 4.0 GHz
    Notes: Rx package + Si differential return loss.

RLRX-CM (units: dB)
    2.5 GT/s: 6 (min)
    5.0 GT/s: 6 (min)
    8.0 GT/s: 6 (min) for 0.05 - 2.5 GHz, 5 (min) for >2.5 - 4 GHz
    Notes: Common-mode Rx return loss.
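
The UI limits in Table 13-5 are consistent with a nominal bit time of
1/(data rate) and a clock tolerance of +/-300 ppm. The sketch below
reproduces them; the 300 ppm figure is inferred here from the table values
rather than quoted from this section, and the function name ui_limits_ps is
hypothetical.

```python
def ui_limits_ps(rate_gt_per_s, tolerance_ppm=300):
    """Return (min, max) Unit Interval in picoseconds for a given data rate."""
    nominal_ps = 1000.0 / rate_gt_per_s          # e.g. 2.5 GT/s -> 400 ps
    delta_ps = nominal_ps * tolerance_ppm / 1e6  # allowed clock deviation
    return nominal_ps - delta_ps, nominal_ps + delta_ps


for rate in (2.5, 5.0, 8.0):
    lo, hi = ui_limits_ps(rate)
    print(f"{rate} GT/s: UI {lo:.4f} ps (min), {hi:.4f} ps (max)")
```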

Figure 13-34: 2.5 GT/s Receiver Eye Diagram
[Eye diagram annotated with VRX-DIFFp-MIN = 88 mV, VRX-CM-DC = 0 V, and
TRX-EYE-MIN = 0.4 UI.]


Link Power Management States


Figure 13-35 on page 500 through Figure 13-39 on page 504 illustrate the
electrical state of the Physical Layer while the link is in various power
management states and describe several characteristics. One of these is the
Tx and Rx terminations, which are sometimes implemented as active logic.

Figure 13-35: L0 Full-On Link State
[Diagram of one Lane in one direction: Transmitter ON and Receiver ON, each
with its Clock Source ON; AC-coupling capacitors CTX on D+ and D-;
VRX-CM = 0 V; low impedance termination at the Receiver.]
• Transmission and reception in progress
• Recommended Power Budget about 80 mW per Lane
• One direction of the Link can be in L0 while the other side is in L0s
• Transmitter and Receiver clock PLL are ON
• Transmitter is ON, Receiver is ON
• Low impedance termination at transmitter


Figure 13-36: L0s Low Power Link State
[Diagram of one Lane in one direction: Transmitter ON and Receiver ON, each
with its Clock Source ON; the Lane is held at 0 - 3.6 V DC common mode
voltage; VRX-CM = 0 V; low impedance termination at the Receiver.]
• Transmitter holds Electrical Idle voltage (VTX-DIFFp < 20 mV) and DC
  common mode voltage (VTX-CM-DC 0 - 3.6 V)
• Recommended Power Budget <= 20 mW per Lane
• Recommended exit latency < 50 ns, however designers indicate that a more
  realistic number appears to be 1 us - 2 us
• One direction of the Link can be in L0s while the other is in L0
• Transmitter and Receiver clock PLL are ON but Rx Clock loses sync
• Transmitter is ON, Receiver is ON
• High or Low impedance termination at transmitter


Figure 13-37: L1 Low Power Link State
[Diagram of one Lane in one direction: Transmitter ON and Receiver ON; both
Clock Sources may be OFF; the Lane is held at 0 - 3.6 V DC common mode
voltage; VRX-CM = 0 V; low impedance termination at the Receiver.]
• Transmitter holds Electrical Idle voltage and DC common mode voltage
• Recommended Power Budget <= 5 mW per Lane
• Recommended exit latency < 10 microseconds (may be greater)
• Both directions of the Link must be in L1 at the same time
• Transmitter and Receiver clock PLL may be OFF, but clock to device ON
• Transmitter is ON, Receiver is ON
• High or Low impedance termination at transmitter


Figure 13-38: L2 Low Power Link State
[Diagram of one Lane in one direction: Transmitter OFF and Receiver OFF,
both Clock Sources OFF, low frequency clock for Beacon ON; Transmitter most
likely OFF with no DC value maintained; VRX-CM = 0 V; high impedance
termination at the Receiver.]
• Transmitter holds Electrical Idle voltage, but is not required to hold DC
  common mode voltage. Most likely OFF.
• Recommended Power Budget <= 1 mW per Lane
• Recommended exit latency < 12 - 50 milliseconds
• Both directions of the Link in L2
• Transmitter and Receiver clock PLL OFF, and clock to device OFF
• Low frequency clock for Beacon in transmitter ON
• Main power to device OFF, but Vaux ON
• Transmitter is OFF, Receiver is OFF
• High or Low impedance termination at transmitter, high impedance at
  receiver


Figure 13-39: L3 Link Off State
[Diagram of one Lane in one direction: Transmitter OFF and Receiver OFF,
both Clock Sources OFF, low frequency clock for Beacon OFF; DC common mode
voltage OFF; VRX-CM = 0 V; high impedance termination at both ends.]
• Transmitter does not hold DC common mode voltage
• Recommended Power Budget: zero
• Recommended L3 -> L0 exit latency < 12 - 50 milliseconds after power
  turned ON
• Both directions of the Link in L3
• Transmitter and Receiver clock PLL OFF, and clock to device OFF
• Low frequency clock for Beacon in transmitter OFF
• Main power to device OFF, Vaux OFF
• Transmitter and Receiver OFF
• High impedance termination at transmitter and receiver
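
The power/latency trade-off across these figures can be summarized as a
small lookup. The sketch below is illustrative only: it records the
recommended per-Lane budgets and exit-latency bounds quoted in the figures
for L0 through L2 (using the upper end of the 12 - 50 ms range for L2), and
the names LinkState and deepest_state are hypothetical. Note that the
figures also report that a realistic L0s exit is closer to 1 - 2 us than
the recommended 50 ns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkState:
    name: str
    power_mw_per_lane: float  # recommended budget from the figures
    exit_latency_ns: float    # recommended upper bound on exit latency


STATES = [
    LinkState("L0", 80.0, 0.0),
    LinkState("L0s", 20.0, 50.0),
    LinkState("L1", 5.0, 10_000.0),
    LinkState("L2", 1.0, 50_000_000.0),
]


def deepest_state(max_exit_latency_ns):
    """Lowest-power state whose recommended exit latency fits the budget."""
    candidates = [s for s in STATES if s.exit_latency_ns <= max_exit_latency_ns]
    return min(candidates, key=lambda s: s.power_mw_per_lane)


print(deepest_state(20_000).name)  # a 20 us budget allows L1 but not L2
```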


14 Link Initialization & Training
The Previous Chapter
The previous chapter describes the Physical Layer electrical interface to the
Link, including some low-level characteristics of the differential
Transmitters and Receivers. The need for signal equalization and the methods
used to accomplish it are also discussed there. That chapter combines
electrical transmitter and receiver characteristics for Gen1, Gen2 and Gen3
speeds.

This Chapter
This chapter describes the operation of the Link Training and Status State
Machine (LTSSM) of the Physical Layer. The initialization process of the Link
is described from Power-On or Reset until the Link reaches the fully
operational L0 state, during which normal packet traffic occurs. In addition,
the Link power management states L0s, L1, L2, and L3 are discussed along with
the state transitions. The Recovery state, during which bit lock, symbol lock
or block lock are re-established, is described. Link speed and width change
for Link bandwidth management is also discussed.

The Next Chapter


The next chapter discusses error types that occur in a PCIe Port or Link, how
they are detected, reported, and options for handling them. Since PCIe is
designed to be backward compatible with PCI error reporting, a review of the
PCI approach to error handling is included as background information. Then
we focus on PCIe error handling of correctable, non-fatal and fatal errors.


Overview
Link initialization and training is a hardware-based (not software) process
controlled by the Physical Layer. The process configures and initializes a
device's link and port so that normal packet traffic proceeds on the link.

Figure 14-1: Link Training and Status State Machine Location
[Layered diagram of a Port with transmit and receive paths. Software layer:
Memory, I/O, Configuration R/W Requests or Message Requests or Completions.
Transaction layer: TLPs (Header, Data Payload, ECRC), Flow Control, Virtual
Channel Management, Ordering, transmit and receive buffers per VC. Data Link
layer: Link Packets (Sequence, TLP, LCRC), DLLPs such as ACK/NAK with CRC,
TLP Replay Buffer, Mux/De-mux, TLP Error Check. Physical layer: Physical
Packets (Start, Link Packet, End), Encode/Decode, Parallel-to-Serial and
Serial-to-Parallel conversion, Differential Driver and Receiver, and the
Link Training logic (LTSSM).]


The full training process is automatically initiated by hardware after a
reset and is managed by the LTSSM (Link Training and Status State Machine),
shown in Figure 14-1 on page 506.

Several things are configured during the Link initialization and training
process. Let's consider what they are and define some terms up front.

• Bit Lock: When Link training begins the Receiver's clock is not yet
  synchronized with the transmit clock of the incoming signal, and is unable
  to reliably sample incoming bits. During Link training, the Receiver CDR
  (Clock and Data Recovery) logic recreates the Transmitter's clock by using
  the incoming bit stream as a clock reference. Once the clock has been
  recovered from the stream, the Receiver is said to have acquired Bit Lock
  and is then able to sample the incoming bits. For more on the Bit Lock
  mechanism, see "Achieving Bit Lock" on page 395.
• Symbol Lock: For 8b/10b encoding (used in Gen1 and Gen2), the next step is
  to acquire Symbol Lock. This is a similar problem in that the receiver can
  now see individual bits but doesn't know where the boundaries of the
  10-bit Symbols are found. As TS1s and TS2s are exchanged, Receivers search
  for a recognizable pattern in the bit stream. A simple one to use for this
  is the COM Symbol. Its unique encoding makes it easy to recognize and its
  arrival shows the boundary of both the Symbol and the Ordered Set since a
  TS1 or TS2 must be in progress. For more on this, see "Achieving Symbol
  Lock" on page 396.
• Block Lock: For 8.0 GT/s (Gen3), the process is a little different from
  Symbol Lock: since 8b/10b encoding is not used, there are no COM
  characters. However, Receivers still need to find a recognizable packet
  boundary in the incoming bit stream. The solution is to include more
  instances of the EIEOS (Electrical Idle Exit Ordered Set) in the training
  sequence and use that to locate the boundaries. An EIEOS is recognizable
  as a pattern of alternating 00h and FFh bytes, and it defines the Block
  boundary because, by definition, when that pattern ends the next Block
  must begin.
• Link Width: Devices with multiple Lanes may be able to use different Link
  widths. For example, a device with a x2 port may be connected to one with
  a x4 port. During Link training, the Physical Layer of both devices tests
  the Link and sets the width to the highest common value.
• Lane Reversal: The Lanes on a multi-Lane device's port are numbered
  sequentially beginning with Lane 0. Normally, Lane 0 of one device's port
  connects to Lane 0 of the neighbor's port, Lane 1 to Lane 1, and so on.
  However, sometimes it's desirable to be able to logically reverse the Lane
  numbers to simplify routing and allow the Lanes to be wired directly
  without having to crisscross (see Figure 14-2 on page 508). As long as one
  device supports the optional Lane Reversal feature, this will work. The
  situation is detected during Link training and one device must internally
  reverse its Lane numbering. Since the spec doesn't require support for
  this, board designers will need to verify that at least one of the
  connected devices supports this feature before wiring the Lanes in reverse
  order.

Figure 14-2: Lane Reversal Example (Support Optional)
[Two x4 connections between Device A (Upstream Device) and Device B
(Downstream Device). Example 1: neither device supports Lane Reversal, so
traces must cross to wire the Lanes correctly, adding complexity and cost.
Example 2: Device B supports Lane Reversal; its Lanes are numbered 3, 2, 1,
0 before reversal and 0, 1, 2, 3 after reversal, which allows the Lane
numbers to match directly.]

• Polarity Inversion: The D+ and D- differential pair terminals for two
  devices may also be reversed as needed to make board layout and routing
  easier. Every Receiver Lane must independently check for this and
  automatically correct it as needed during training, as illustrated in
  Figure 14-3 on page 509. To do this, the Receiver looks at Symbols 6 to 15
  of the incoming TS1s or TS2s. If a D21.5 is received instead of a D10.2 in
  a TS1, or a D26.5 instead of the D5.2 expected for a TS2, then the
  polarity of that lane is inverted and must be corrected. Unlike Lane
  reversal, support for this feature is mandatory.


Figure 14-3: Polarity Inversion Example (Support Required)
[Device A (Upstream Device) connected to Device B (Downstream Device) with
the D+ and D- signals of a differential pair swapped between them. The
diagram labels the Receiver's D+ and D- assignments before and after
Polarity Inversion.]

• Link Data Rate: After a reset, Link initialization and training will
  always use the default 2.5 Gbit/s data rate for backward compatibility. If
  higher data rates are available, they are advertised during this process
  and, when the training is completed, devices will automatically go through
  a quick re-training to change to the highest commonly supported rate.
• Lane-to-Lane De-skew: Trace length variations and other factors cause the
  parallel bit streams of a multi-Lane Link to arrive at the Receivers at
  different times, a problem referred to as signal skew. Receivers are
  required to compensate for this skew by delaying the early arrivals as
  needed to align the bit streams (see "Lane-to-Lane Skew" on page 442).
  They must correct a relatively big skew automatically (20 ns difference in
  arrival time is permitted at 2.5 GT/s), and that frees board designers
  from the sometimes difficult constraint of creating equal-length traces.
  Together with Polarity Inversion and Lane Reversal, this greatly
  simplifies the board designer's task of creating a reliable high-speed
  Link.
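
Several of the terms defined above reduce to simple arithmetic, which the
following Python sketch tries to make concrete. It is a minimal
illustration under stated assumptions, not a description of real Receiver
logic: all function names are hypothetical, the supported-width lists are
made-up inputs, the polarity check works on already-decoded byte values of
ordinary (non-EQ) TS1s or TS2s, and arrival times are given as plain
integers in ns.

```python
TS1_ID = 0x4A  # D10.2
TS2_ID = 0x45  # D5.2


def d_code(x, y):
    """Byte value of 8b/10b data character Dx.y (x = low 5 bits, y = high 3 bits)."""
    return (y << 5) | x


def negotiated_width(widths_a, widths_b):
    """Highest Link width that both ports support."""
    return max(set(widths_a) & set(widths_b))


def reversed_lane(lane, num_lanes):
    """Logical Lane number after Lane Reversal on a num_lanes-wide port."""
    return num_lanes - 1 - lane


def polarity_inverted(identifier_bytes):
    """True if Symbols 6-15 arrived as complemented TS1 or TS2 identifiers."""
    inverted = {TS1_ID ^ 0xFF, TS2_ID ^ 0xFF}  # D21.5 (B5h) and D26.5 (BAh)
    first = identifier_bytes[0]
    return first in inverted and all(b == first for b in identifier_bytes)


def deskew_delays(arrival_ns):
    """Delay to add to each Lane so all Lanes line up with the latest arrival."""
    latest = max(arrival_ns)
    return [latest - t for t in arrival_ns]


print(negotiated_width([1, 2], [1, 2, 4]))       # x2 port meets x4 port
print([reversed_lane(n, 4) for n in range(4)])   # Lane map for a reversed x4
print(polarity_inverted([d_code(21, 5)] * 10))   # inverted TS1 identifiers
print(deskew_delays([0, 3, 1]))                  # early Lanes get delayed
```

The polarity check relies on the fact that D21.5 and D26.5 are the bitwise
complements of D10.2 and D5.2, which is why an inverted Lane produces
exactly those characters.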

Ordered Sets in Link Training

General
All of the different types of Physical Layer Ordered Sets were described in
the section called "Ordered sets" on page 388. Training Sequences TS1 and TS2
are of interest during the training process. The format for these when in
Gen1 or Gen2 mode is shown in Figure 14-4 on page 510, while for Gen3 mode of
operation, they are as shown in Figure 14-5 on page 511. A detailed
description of their contents follows.


Figure 14-4: TS1 and TS2 Ordered Sets When In Gen1 or Gen2 Mode

Symbol 0        COM        K28.5
Symbol 1        Link #     0 - 255 = D0.0 - D31.7, PAD = K23.7
Symbol 2        Lane #     0 - 31 = D0.0 - D17.1, PAD = K23.7
Symbol 3        # FTS      # of FTSs required by Receiver for L0s recovery
Symbol 4        Rate ID    Bit 1 must be set, indicates 2.5 GT/s support
Symbol 5        Train Ctl
Symbols 6 - 9   TS ID or   Equalization info when changing to 8.0 GT/s,
                EQ Info    else TS1 or TS2 Identifier
Symbols 10 - 15 TS ID      TS1 Identifier = D10.2, TS2 Identifier = D5.2

TS1 and TS2 Ordered Sets


As seen in the illustrations, TS1s and TS2s consist of 16 Symbols. They are
exchanged during the Polling, Configuration, and Recovery states of the LTSSM
described in "Link Training and Status State Machine (LTSSM)" on page 518.
The Symbols are described below and summarized in Table 14-1 on page 514 for
TS1s and Table 14-2 on page 516 for TS2s.

To make the descriptions a little shorter and easier to read, the term Gen1
will be used to indicate a data rate of 2.5 GT/s, Gen2 to indicate a data
rate of 5.0 GT/s, and Gen3 to indicate a data rate of 8.0 GT/s. Also, note
that the PAD character used in the Link and Lane numbers is represented by
the K23.7 character for the lower data rates, but as the data byte F7h for
Gen3. In our discussion the distinction between the types of PAD is not
interesting and will simply be implied.


Figure 14-5: TS1 and TS2 Ordered Set Block When In Gen3 Mode of Operation

Symbol 0        TS1 = 1Eh, TS2 = 2Dh
Symbol 1        Link #     0 - 31, PAD = F7h
Symbol 2        Lane #     0 - 31, PAD = F7h
Symbol 3        # FTS      # of FTSs required by Receiver for L0s recovery
Symbol 4        Rate ID    Bit 3 indicates 8.0 GT/s support
Symbol 5        Train Ctl
Symbols 6 - 9   EQ Info    Equalization presets and coefficients, or TS2
Symbols 10 - 13 TS ID      TS1 Identifier = 4Ah, TS2 Identifier = 45h
Symbols 14 - 15 TS ID      TS1, TS2, or DC Balance Symbols

Table 14-1 on page 514 and Table 14-2 on page 516 summarize the TS1 and TS2
contents. A more detailed description of the 16 TS1/TS2 Symbols follows:
• Symbol 0:
  - For Gen1 or Gen2, the first Symbol of any Ordered Set is the K28.5 (COM)
    character. Receivers use this character to acquire Symbol Lock. Since it
    must appear on all Lanes at the same time it's also useful for
    de-skewing the Lanes.
  - For Gen3, an Ordered Set is identified by the 2-bit Sync Header that
    must precede the Block (not shown in the illustration), and the first
    Symbol after that indicates which Ordered Set will follow. For a TS1,
    the first Symbol is 1Eh, and for a TS2, it's 2Dh.
• Symbol 1 (Link #): In the Polling state this field contains the PAD
  Symbol, but in the other states a Link Number is assigned.
• Symbol 2 (Lane #): In the Polling state this field contains the PAD
  Symbol, but in the other states a Lane Number is assigned.
• Symbol 3 (N_FTS): Indicates the number of Fast Training Sequences the
  Receiver will need in order to achieve the L0 state when exiting from the
  L0s power state at the current speed. Transmitters will send at least that
  many FTSs to exit L0s. The amount of time needed for this depends on how
  many are needed and the data rate in use. For example, at 2.5 GT/s each
  Symbol takes 4 ns so, if 200 FTSs were needed the required time would be
  200 FTS * 4 Symbols per FTS * 4 ns/Symbol = 3200 ns. If the Extended Synch
  bit is set in the transmitter device, a total of 4096 FTSs must be sent.
  This large number is intended to provide enough time for external Link
  monitoring tools to acquire Bit and Symbol Lock, since some of them may be
  slow in this regard.
• Symbol 4 (Rate ID): Devices report which data rates they support, along
  with a little more information used for hardware-initiated bandwidth
  changes. The 2.5 GT/s rate must always be supported and the Link will
  always train to that speed automatically after reset so that newer
  components will remain backward compatible with older ones. If 8.0 GT/s is
  supported, it's also required that 5.0 GT/s must be available. Other
  information in this Symbol includes the following:
  - Autonomous Change: If set, any requested bandwidth change was initiated
    for power management reasons. If a change is requested and this bit is
    not set, then unreliable operation has been detected at the higher speed
    or wider Link and the change is requested to fix that problem.
  - Selectable De-emphasis:
    - Upstream Ports set this to indicate their desired de-emphasis level at
      5.0 GT/s. How they make this choice is implementation specific. In the
      Recovery.RcvrCfg state, they register the value they receive for this
      bit internally (the spec describes it as being stored in a
      select_deemphasis variable).
    - Downstream Ports and Root Ports: In the Polling.Compliance state the
      select_deemphasis variable must be set to match the received value of
      this bit. In the Recovery.RcvrCfg state, the Transmitter sets this bit
      in its TS2s to match the Selectable De-emphasis field in the Link
      Control 2 register. Since this register bit is hardware-initialized,
      the expectation is that it's assigned to an optimal value at power-up
      by firmware or a strapping option.
    - In Loopback mode at 5.0 GT/s, the Slave de-emphasis value is assigned
      by this bit in the TS1s sent by the Master.
  - Link Upconfigure Capability: Reports whether a wide Link whose width is
    reduced will be capable of going back to the wide case or not. If both
    sides of a Link report this during Configuration.Complete, this fact is
    recorded internally (e.g. an upconfigure_capable bit is set).
• Symbol 5 (Training Control): Communicates special conditions such as a Hot
  Reset, Enable Loopback mode, Disable Link, Disable Scrambling.


• Symbols 6-9 (Equalization Control):
  - For Gen1 or Gen2, Symbols 7-9 are just TS1 or TS2 indicators, and Symbol
    6 usually is, too. However, if bit 7 of Symbol 6 is set to one instead
    of the zero that would be there for the TS1 or TS2 identifier, that
    indicates that this is an EQ TS1 or EQ TS2 sent from the Downstream Port
    (DSP: port that faces downstream, like a Root Port). The EQ label stands
    for equalization, and means that the Link is going to change to 8.0 GT/s
    and so the Upstream Port (USP: port that faces upstream, like an
    Endpoint Port) needs to know what equalizer values to use. For EQ TS1s
    or TS2s, Symbol 6 gives that information to the USP in the form of
    Transmitter Presets and Receiver Preset Hints. Ports that support 8.0
    GT/s must accept either TS type (regular or EQ), but ports that do not
    support it are not required to accept the EQ type. The possible values
    for these presets are listed in Table 14-8 on page 579 and Table 14-9 on
    page 580.
  - For Gen3, Symbols 6-9 provide Preset values and Coefficients for the
    Equalization process. Bit 7 of Symbol 6 in a TS2 can now be used by a
    USP to request that equalization be redone. If it does, bit 6 may also
    be set to indicate that the time needed to repeat the equalization
    process won't cause problems, such as a completion timeout, as long as
    it's done quickly (within 1 ms of returning to L0). This might be
    needed, for example, if a problem was detected with the equalization
    results. A DSP can also use bits 6 and 7 to ask the USP to make such a
    request and guarantee no side effects, although the USP is not required
    to respond to this. For more on the equalization process, see "Link
    Equalization Overview" on page 577.
• Symbols 10-13: TS1 or TS2 identifiers.
• Symbols 14-15 (DC Balance):
  - For Gen1 and Gen2, these are just TS1 or TS2 indicators since DC Balance
    is maintained by 8b/10b encoding.
  - For Gen3, the contents of these two Symbols depend on the DC Balance of
    the Lane. Each Lane of a Transmitter must independently track the
    running DC Balance for all the scrambled bits sent for TS1s and TS2s.
    "Running DC Balance" means the difference between the number of ones
    sent vs. the number of zeroes sent, and Lanes must be capable of
    tracking a difference of up to 511 in either direction. These counters
    saturate at their max value but continue to track reductions. For
    example, if the counter indicates that 511 more ones than zeroes have
    been sent, then no matter how many more ones are sent, the value will
    stay at 511. However, if 2 zeroes are sent, the counter will count down
    to 509. When a TS1 or TS2 is sent, the following algorithm is used to
    determine Symbols 14 and 15:
    - If the running DC Balance value is > 31 at the end of Symbol 11 and
      more ones have been sent, Symbol 14 = 20h and Symbol 15 = 08h. If more
      zeroes have been sent, Symbol 14 = DFh and Symbol 15 = F7h.
    - If the running DC Balance value is > 15, Symbol 14 = the normal
      scrambled TS1 or TS2 identifier, while Symbol 15 = 08h to reduce the
      number of ones, or F7h to reduce the number of zeroes in the DC
      Balance count.
    - Otherwise, the normal TS1 or TS2 identifier Symbols will be sent.
  - Other notes on Gen3 DC Balance:
    - The running DC Balance is reset by an exit from Electrical Idle or an
      EIEOS after a Data Block.
    - The DC Balance Symbols bypass scrambling to ensure that the expected
      bit pattern is sent.
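
The Gen3 DC Balance rules above can be sketched as follows. This is an
illustrative reading of the text rather than a reference implementation:
the class and function names are hypothetical, the balance is modeled as
(ones sent) minus (zeroes sent), scrambling is not modeled (so the "normal"
identifier is returned unscrambled), and the value passed to
dc_balance_symbols is assumed to be the running balance at the point where
the decision is made.

```python
class DcBalanceCounter:
    """Saturating count of (ones sent) - (zeroes sent) for one Lane."""
    LIMIT = 511

    def __init__(self):
        self.value = 0

    def add_bits(self, bits):
        for bit in bits:
            step = 1 if bit else -1
            # Saturate at +/-511, but keep tracking moves back toward zero.
            self.value = max(-self.LIMIT, min(self.LIMIT, self.value + step))

    def reset(self):
        """Models the reset on Electrical Idle exit or EIEOS after a Data Block."""
        self.value = 0


def dc_balance_symbols(balance, ts_identifier):
    """Return (Symbol 14, Symbol 15) for a TS1 (4Ah) or TS2 (45h)."""
    more_ones = balance > 0
    if abs(balance) > 31:
        return (0x20, 0x08) if more_ones else (0xDF, 0xF7)
    if abs(balance) > 15:
        return (ts_identifier, 0x08 if more_ones else 0xF7)
    return (ts_identifier, ts_identifier)


counter = DcBalanceCounter()
counter.add_bits([1] * 600)   # saturates at 511
counter.add_bits([0, 0])      # counts back down to 509
print(counter.value, dc_balance_symbols(counter.value, 0x4A))
```

The replacement bytes make sense when their bit counts are considered: 20h
and 08h each contain a single one, while DFh and F7h each contain a single
zero, so they pull the running balance back toward the middle.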

Table 14-1: Summary of TS1 Ordered Set Contents

Symbol
Number   Description

0        For Gen1 or Gen2, the COM (K28.5) Symbol.
         For Gen3, 1Eh indicates a TS1.

1        Link Number
         Ports that don't support Gen3: 0 - 255, PAD
         Downstream ports that support Gen3: 0 - 31, PAD
         Upstream ports that support Gen3: 0 - 255, PAD

2        Lane Number
         0 - 31, PAD

3        N_FTS
         Number of FTS Ordered Sets required by receiver to achieve L0 when
         exiting L0s: 0 - 255

4        Data Rate Identifier:
         Bit 0: Reserved.
         Bit 1: 2.5 GT/s supported (must be set to 1b)
         Bit 2: 5.0 GT/s supported (must be set if bit 3 is set)
         Bit 3: 8.0 GT/s supported
         Bits 5:4: Reserved
         Bit 6: Autonomous Change/Selectable De-emphasis
           Downstream Ports: Used in Polling.Active,
           Configuration.Linkwidth.Start, and Loopback.Entry LTSSM states,
           and reserved in all other states.
           Upstream Ports: Used in Polling.Active, Configuration, Recovery,
           and Loopback.Entry LTSSM states and reserved in all other states.
         Bit 7: Speed change. This can only be set to one in the
         Recovery.RcvrLock LTSSM state, and is reserved in all other states.

5        Training Control (0 = De-assert, 1 = Assert)
         Bit 0: Hot Reset
         Bit 1: Disable Link
         Bit 2: Loopback
         Bit 3: Disable Scrambling (for 2.5 or 5.0 GT/s; reserved for Gen3)
         Bit 4: Compliance Receive (optional for 2.5 GT/s, required for all
         other rates)
         Bits 7:5: Reserved, set to 0

6        For Gen1 or Gen2:
         TS1 identifier (4Ah) encoded as D10.2
         EQ TS1s encode this as:
           Bits 2:0: Receiver Preset Hint
           Bits 6:3: Transmitter Preset
           Bit 7: Set to 1b
         For Gen3:
         Bits 1:0: Equalization Control (EC). Only used in
         Recovery.Equalization and Loopback LTSSM states; must be 00b in all
         other states.
         Bit 2: Reset EIEOS Interval Count. Only used in
         Recovery.Equalization LTSSM state; reserved in all other states.
         Bits 6:3: Transmitter Preset
         Bit 7: Use Preset. (If one, use the preset values instead of the
         coefficient values. If zero, use the coefficients rather than the
         presets.) Only used in Recovery.Equalization and Loopback LTSSM
         states; reserved in all other states.

7        For Gen1 or Gen2, TS1 identifier (4Ah) encoded as D10.2
         For Gen3:
         Bits 5:0: FS (Full Swing value) when the EC field of Symbol 6 is
         01b, otherwise, Pre-cursor Coefficient.
         Bits 7:6: Reserved.

8        For Gen1 or Gen2, TS1 identifier (4Ah) encoded as D10.2
         For Gen3:
         Bits 5:0: LF (Low Frequency value) when the EC field of Symbol 6 is
         01b, otherwise, Cursor Coefficient.
         Bits 7:6: Reserved.

9        For Gen1 or Gen2, TS1 identifier (4Ah) encoded as D10.2
         For Gen3:
         Bits 5:0: Post-cursor Coefficient.
         Bit 6: Reject Coefficient Values. Only set in specific Phases of
         the Recovery.Equalization LTSSM state; must be 0b otherwise.
         Bit 7: Parity (P). This is the even parity of all bits of Symbols
         6, 7, and 8 and bits 6:0 of Symbol 9. Receivers must calculate this
         and compare it to the received Parity bit. Received TS1s are only
         valid if the Parity bits match.

10 - 13  For Gen1 or Gen2, TS1 identifier (4Ah) encoded as D10.2
         For Gen3, TS1 identifier (4Ah)

14 - 15  For Gen1 or Gen2, TS1 identifier (4Ah) encoded as D10.2
         For Gen3, TS1 identifier (4Ah), or a DC Balance Symbol.
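
Two of the fields in Table 14-1 lend themselves to a short sketch: the Data
Rate Identifier in Symbol 4 and the Gen3 Parity bit in Symbol 9. The code
below is a hedged illustration with hypothetical function names. It assumes
"even parity" means the Parity bit equals the XOR of the covered bits, so
that the total number of ones including the Parity bit is even.

```python
def decode_rate_id(symbol4):
    """Split the Data Rate Identifier byte into named flags."""
    return {
        "gen1_2_5": bool(symbol4 & 0x02),                     # bit 1
        "gen2_5_0": bool(symbol4 & 0x04),                     # bit 2
        "gen3_8_0": bool(symbol4 & 0x08),                     # bit 3
        "autonomous_or_deemphasis": bool(symbol4 & 0x40),     # bit 6
        "speed_change": bool(symbol4 & 0x80),                 # bit 7
    }


def rate_id_is_legal(symbol4):
    """Bit 1 must be set, and bit 2 must be set whenever bit 3 is set."""
    flags = decode_rate_id(symbol4)
    return flags["gen1_2_5"] and (flags["gen2_5_0"] or not flags["gen3_8_0"])


def gen3_ts1_parity(sym6, sym7, sym8, sym9):
    """Parity over all bits of Symbols 6-8 plus bits 6:0 of Symbol 9."""
    covered = (sym6 << 24) | (sym7 << 16) | (sym8 << 8) | (sym9 & 0x7F)
    return bin(covered).count("1") & 1


def gen3_ts1_parity_ok(sym6, sym7, sym8, sym9):
    """Compare the calculated parity with the received bit 7 of Symbol 9."""
    return (sym9 >> 7) == gen3_ts1_parity(sym6, sym7, sym8, sym9)


print(decode_rate_id(0x0E))                      # supports 2.5, 5.0, 8.0 GT/s
print(gen3_ts1_parity_ok(0x01, 0x00, 0x00, 0x80))
```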

The observant reader may wonder why EQ TS1s are shown in Symbol 6 for the
lower data rates since only 8.0 GT/s data rates use equalization. That's
because they're used to deliver EQ values for Lanes that support Gen3 but are
currently operating at a lower rate and want to change to 8.0 GT/s. For more
details regarding this and the Equalization process for Gen3 in general, see
"Link Equalization Overview" on page 577.

Table 14-2: Summary of TS2 Ordered Set Contents

Symbol
Number   Description

0        For Gen1 or Gen2, the COM (K28.5) Symbol.
         For Gen3, 2Dh indicates a TS2.

1        Link Number
         Ports that don't support Gen3: 0 - 255, PAD
         Downstream ports that support Gen3: 0 - 31, PAD
         Upstream ports that support Gen3: 0 - 255, PAD

2        Lane Number
         0 - 31, PAD

3        N_FTS
         Number of FTS Ordered Sets required by receiver to achieve L0 when
         exiting L0s: 0 - 255

4        Data Rate Identifier:
         Bit 0: Reserved.
         Bit 1: 2.5 GT/s supported (must be set to 1b)
         Bit 2: 5.0 GT/s supported (must be set if bit 3 is set)
         Bit 3: 8.0 GT/s supported
         Bits 5:4: Reserved
         Bit 6: Autonomous Change/Selectable De-emphasis/Link Upconfigure
         Capability. Used in Polling.Configuration, Configuration.Complete,
         and Recovery LTSSM states; reserved in all other states.
         Bit 7: Speed change. This can only be set to one in the
         Recovery.RcvrLock LTSSM state, and is reserved in all other states.

5        Training Control (0 = De-assert, 1 = Assert)
         Bit 0: Hot Reset
         Bit 1: Disable Link
         Bit 2: Loopback
         Bit 3: Disable Scrambling (for 2.5 or 5.0 GT/s; reserved for Gen3)
         Bits 7:4: Reserved, set to 0

6        For Gen1 or Gen2:
         TS2 identifier (45h) encoded as D5.2
         EQ TS2s encode this as:
           Bits 2:0: Receiver Preset Hint
           Bits 6:3: Transmitter Preset
           Bit 7: Equalization Command
         For Gen3:
         Bits 5:0: Reserved.
         Bit 6: Quiesce Guarantee. Defined for use in Recovery.RcvrCfg only;
         reserved in all other states.
         Bit 7: Request Equalization. Defined for use in Recovery.RcvrCfg
         only; reserved in all other states.

7 - 13   For Gen1 or Gen2, TS2 identifier (45h) encoded as D5.2
         For Gen3, TS2 identifier (45h)

14 - 15  For Gen1 or Gen2, TS2 identifier (45h) encoded as D5.2
         For Gen3, TS2 identifier (45h), or a DC Balance Symbol.
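
The N_FTS value carried in Symbol 3 of both tables translates directly into
an L0s exit time. The sketch below reproduces the arithmetic from the Symbol
3 description (200 FTSs at 2.5 GT/s take 3200 ns). It is limited to the
8b/10b rates, where a Symbol is 10 bits and an FTS is 4 Symbols; extending
the 10-bit Symbol time to 5.0 GT/s is an assumption made here, and the
function name l0s_exit_time_ns is hypothetical.

```python
def l0s_exit_time_ns(n_fts, rate_gt_per_s=2.5, extended_synch=False):
    """Time to send the FTSs needed to exit L0s at an 8b/10b data rate."""
    if rate_gt_per_s not in (2.5, 5.0):
        raise ValueError("sketch only covers the 8b/10b rates")
    fts_count = 4096 if extended_synch else n_fts  # Extended Synch forces 4096
    symbol_time_ns = 10 / rate_gt_per_s            # 10 bits per Symbol
    return fts_count * 4 * symbol_time_ns          # 4 Symbols per FTS


print(l0s_exit_time_ns(200))                       # the example from the text
print(l0s_exit_time_ns(200, extended_synch=True))
```

With Extended Synch set, the same Link would spend 4096 * 4 * 4 ns, a little
over 65 microseconds, sending FTSs at 2.5 GT/s, which illustrates why that
setting is meant for external monitoring tools rather than normal operation.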


Link Training and Status State Machine (LTSSM)

General
Figure 14-6 on page 519 illustrates the top-level states of the Link Training
and Status State Machine (LTSSM). Each state consists of substates. The first
LTSSM state entered after exiting Fundamental Reset (Cold or Warm Reset) or
Hot Reset is the Detect state.

The LTSSM consists of 11 top-level states: Detect, Polling, Configuration,
Recovery, L0, L0s, L1, L2, Hot Reset, Loopback, and Disabled. These can be
grouped into five categories:

1. Link Training states
2. Re-Training (Recovery) state
3. Software-driven Power Management states
4. Active-State Power Management (ASPM) states
5. Other states

When exiting from any type of Reset, the flow of the LTSSM follows the Link
Training states: Detect => Polling => Configuration => L0. In L0 state, normal
packet transmission/reception is in progress.

The Link Re-Training state, also called the Recovery state, is entered for a
variety of reasons, such as changing back from a low-power Link state, like
L1, or changing the Link bandwidth (through speed or width changes). In this
state, the Link repeats as much of the training process as needed to handle
the matter and returns to L0 (normal operation).

Power management software may also place a device into a low-power device
state (D1, D2, D3Hot or D3Cold) and that will force the Link into a lower
Power Management Link state (L1 or L2).

If there are no packets to send for a time, ASPM hardware may be allowed to
automatically transition the Link into low power ASPM states (L0s or ASPM
L1).

In addition, software can direct a Link to enter some other special states:
Disabled, Loopback, or Hot Reset. Here, these are collectively called the
Other states group.
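
The grouping just described can be captured as a small data structure. This
is only a sketch: the names Ltssm, RESET_FLOW, CATEGORIES, and categories_of
are hypothetical, L1 is listed under both power management groups because
the text mentions it in both, and L0 is left out of the groups on the
assumption that it is the destination of training rather than a member of a
category.

```python
from enum import Enum


class Ltssm(Enum):
    DETECT = "Detect"
    POLLING = "Polling"
    CONFIGURATION = "Configuration"
    RECOVERY = "Recovery"
    L0 = "L0"
    L0S = "L0s"
    L1 = "L1"
    L2 = "L2"
    HOT_RESET = "Hot Reset"
    LOOPBACK = "Loopback"
    DISABLED = "Disabled"


# Path taken after any type of Reset, as described in the text.
RESET_FLOW = [Ltssm.DETECT, Ltssm.POLLING, Ltssm.CONFIGURATION, Ltssm.L0]

CATEGORIES = {
    "Link Training": {Ltssm.DETECT, Ltssm.POLLING, Ltssm.CONFIGURATION},
    "Re-Training": {Ltssm.RECOVERY},
    "Software-driven Power Management": {Ltssm.L1, Ltssm.L2},
    "ASPM": {Ltssm.L0S, Ltssm.L1},
    "Other": {Ltssm.DISABLED, Ltssm.LOOPBACK, Ltssm.HOT_RESET},
}


def categories_of(state):
    """Names of every group that contains the given state, sorted."""
    return sorted(name for name, members in CATEGORIES.items() if state in members)


print(" => ".join(s.value for s in RESET_FLOW))
print(categories_of(Ltssm.L1))
```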


Figure 14-6: Link Training and Status State Machine (LTSSM)
[State diagram showing Detect, Polling, Configuration, L0, Recovery, L0s,
L1, L2, Disabled, External Loopback, and Hot Reset. Detect is the initial
state after any Reset or as directed by the Data Link Layer. Some
transitions are labeled "From Configuration or Recovery" and "From
Recovery". Legend: Training States, Re-Training State, Power Mgt States,
ASPM States, Other States.]

Overview of LTSSM States


Below is a brief description of the 11 high-level LTSSM states.

• Detect: The initial state after reset. In this state, a device
  electrically detects a Receiver is present at the far end of the Link.
  That's an unusual thing in the world of serial transports, but it's done
  to facilitate testing, as we'll see in the next state. Detect may also be
  entered from a number of other LTSSM states as described later.
• Polling: In this state, Transmitters begin to send TS1s and TS2s (at 2.5
  GT/s for backward compatibility) so that Receivers can use them to
  accomplish the following:
  - Achieve Bit Lock
  - Acquire Symbol Lock or Block Lock
  - Correct Lane polarity inversion, if needed
  - Learn available Lane data rates
  - If directed, initiate the Compliance test sequence: The way this works
    is that if a receiver was detected in the Detect state but no incoming
    signal is seen, it's understood to mean that the device has been
    connected to a test load. In that case, it should send the specified
    Compliance test pattern to facilitate testing. This allows test
    equipment to quickly verify that voltage, BER, timing, and other
    parameters are within tolerance.
• Configuration: Upstream and Downstream components now play specific roles
  as they continue to exchange TS1s and TS2s at 2.5 GT/s to accomplish the
  following:
  - Determine Link width
  - Assign Lane numbers
  - Optionally check for Lane reversal and correct it
  - De-skew Lane-to-Lane timing differences
  From this state, scrambling can be disabled, the Disabled and Loopback
  states can be entered, and the number of FTS Ordered Sets required to
  transition from the L0s state to the L0 state is recorded from the TS1s
  and TS2s.
• L0: This is the normal, fully active state of a Link during which TLPs,
  DLLPs and Ordered Sets can be exchanged. In this state, the Link could be
  running at higher speeds than 2.5 GT/s, but only after re-training
  (Recovery) the Link and going through a speed change procedure.
• Recovery: This state is entered when the Link needs re-training. This
  could be caused by errors in L0, or recovery from L1 back to L0, or
  recovery from L0s if the Link does not train properly using the FTS
  sequence. In Recovery, Bit Lock and Symbol/Block Lock are re-established
  in a manner similar to that used in the Polling state but it typically
  takes much less time.
• L0s: This ASPM state is designed to provide some power savings while
  affording a quick recovery time back to L0. It's entered when one
  Transmitter sends the EIOS while in the L0 state. Exit from L0s involves
  sending FTSs to quickly re-acquire Bit and Symbol/Block Lock.
• L1: This state provides greater power savings than L0s in exchange for a
  longer recovery time (see "Active State Power Management (ASPM)" on page
  735). Entry into L1 involves a negotiation between both Link partners to
  enter it together and can occur in one of two ways:
  - The first is autonomous with ASPM: hardware in an Upstream Port with no
    scheduled TLPs or DLLPs to transmit can automatically negotiate to put
    its Link into the L1 state. If the Downstream Port agrees, the Link
    enters L1. If not, the Upstream Port will enter L0s instead (if
    enabled).
  - The second is the result of power management software commanding a
    device to enter a low-power state (D1, D2, or D3Hot). As a result, the
    Upstream Port notifies the Downstream Port that they must enter L1, the
    Downstream Port acknowledges that, and they enter L1.


• L2: In this state the main power to the devices is turned off to achieve a
  greater power savings. Almost all of the logic is off, but a small amount
  of power is still available from the Vaux source to allow the device to
  indicate a wakeup event. An Upstream Port that supports this wakeup
  capability can send a very low frequency signal called the Beacon and a
  Downstream Port can forward it to the Root Complex to get system attention
  (see "Beacon Signaling" on page 483). Using the Beacon, or a sideband
  WAKE# signal, a device can trigger a system wakeup event to get main power
  restored. [An L3 Link power state is also defined, but it doesn't relate
  to the LTSSM states. The L3 state is the full-off condition in which Vaux
  power is not available and a wakeup event can't be signaled.]
• Loopback: This state is used for testing but exactly what a Receiver does
  in this mode (for example: how much of the logic participates) is left
  unspecified. The basic operation is simple enough: the device that will be
  the Loopback Master sends TS1 Ordered Sets that have the Loopback bit set
  in the Training Control field to the device that will be the Loopback
  Slave. When a device sees two consecutive TS1s with the Loopback bit set,
  it enters the Loopback state as the Loopback Slave and echoes back
  everything that comes in. The Master, recognizing that what it is sending
  is now being echoed, sends any pattern of Symbols that follow the 8b/10b
  encoding rules, and the Slave echoes them back exactly as they were sent,
  providing a round-trip verification of Link integrity.
• Disabled: This state allows a configured Link to be disabled. In this
  state, the Transmitter is in the Electrical Idle state while the Receiver
  is in the low impedance state. This might be necessary because the Link
  has become unreliable or due to a surprise removal of the device. Software
  commands a device to do this by setting the Disable bit in the Link
  Control register. The device then sends 16 TS1s with the Disable Link bit
  set in the TS1 Training Control field. Receivers are disabled when they
  receive those TS1s.
• Hot Reset: Software can reset a Link by setting the Secondary Bus Reset
  bit in the Bridge Control register. That causes the bridge's Downstream
  Port to send TS1s with the Hot Reset bit set in the TS1 Training Control
  field (see "Hot Reset (In-band Reset)" on page 837). When a Receiver sees
  two consecutive TS1s with the Hot Reset bit set, it must reset its device.

Introductions, Examples and State/Substates


The balance of this chapter covers each of the LTSSM states. Depending on the complexity of a given state, the discussion may include an introduction, general background, and/or examples that accompany the detailed discussion of the State/Substate. In some cases, the reader may choose to skip the detailed coverage and jump to introductory material. Each section is organized to facilitate these options.

Every device must perform initial link training at the base rate of 2.5 GT/s. Figure 14-7 highlights the states involved in the initial training sequence. Devices capable of operating at 5.0 or 8.0 GT/s must transition to the Recovery state to change the speed to the higher rate chosen.

Figure 14-7: States Involved in Initial Link Training at 2.5 Gb/s

[Figure: LTSSM state diagram. Detect is the initial state after any Reset or as directed by the Data Link Layer. The states shown are Detect, Polling, Configuration, Recovery, L0, L0s, L1, L2, Disabled, External Loopback and Hot Reset; the legend distinguishes Training States, the Re-Training State, Power Mgt States, ASPM States and Other States.]

Detect State

Introduction
Figure 14-8 represents the two substates and transitions associated with the Detect state. The actions associated with the Detect state are performed by each transmitter in the process of detecting the presence of a receiver at the opposite end of the link. Because there are only two substates and because they are fairly simple, we will move directly to the substate discussions.

Figure 14-8: Detect State Machine

[Figure: Detect state machine. Entry is from Reset, and also from Disabled, Loopback, L2, Polling, Configuration or Recovery. Detect.Quiet moves to Detect.Active when there is no Electrical Idle on the Link or after a 12 ms timeout. Detect.Active returns to Detect.Quiet when no Receiver is detected, and exits to Polling when a Receiver is detected.]

Detailed Detect Substate


Detect.Quiet
This substate is the initial state after any reset (except Function Level Reset) or power-up event and must be entered within 20 ms after Reset. This substate is also entered from other states if unable to move forward (see the states that may enter Detect.Quiet in Figure 14-8 on page 523). The properties of this substate are listed below:

• The Transmitter starts in Electrical Idle (but the DC common mode voltage doesn't have to be within the normally specified range).
• The intended data rate is set to 2.5 GT/s (Gen1). If it was set to a different rate when this substate was entered, the LTSSM must stay in this substate for 1 ms before changing the rate to Gen1.
• The Physical Layer's status bit (LinkUp = 0) informs the Data Link Layer that the Link is not operational. The LinkUp status bit is an internal state bit (not found in standard config space) and also indicates when the Physical Layer has completed Link Training (LinkUp = 1), thereby informing the Data Link Layer that it can begin its part of Link initialization, which is Flow Control initialization (for more on this, see "The FC Initialization Sequence" on page 223).
• Any previous equalization (Eq.) status is cleared by setting the four Link Status 2 register bits to zero: Eq. Phase 1 Successful, Eq. Phase 2 Successful, Eq. Phase 3 Successful, Eq. Complete.
• Variables:
  - Several variables are cleared to zero (directed_speed_change = 0b, upconfigure_capable = 0b, equalization_done_8GT_data_rate = 0b, idle_to_rlock_transitioned = 00h). The select_deemphasis variable setting depends on the port type: for an Upstream Port it's selected by hardware, while for a Downstream Port it takes the value of the Selectable De-emphasis field in the Link Control 2 register.
  - Since these variables were defined beginning with the 2.0 spec version, devices designed to earlier spec versions won't have them and will behave as if directed_speed_change and upconfigure_capable were set to 0b and idle_to_rlock_transitioned was set to FFh.

Exit to Detect.Active
The next substate is Detect.Active after a 12 ms timeout or when any Lane exits Electrical Idle.

Detect.Active
This substate is entered from Detect.Quiet. At this time the Transmitter tests whether a Receiver is connected on each Lane by setting a DC common mode voltage of any value in the legal range and then changing it. The detection logic measures the rate of change as the time it takes the line voltage to charge up and compares it to an expected time, such as how long it would take without a Receiver termination. If a Receiver is attached, the charge time will be much longer, making it easy to recognize. For more details on this process, see "Receiver Detection" on page 460. To simplify the discussions that follow, Lanes that detect a Receiver during this substate are referred to as "Detected Lanes".

Exit to Detect.Quiet
If no Lanes detect a Receiver, go back to Detect.Quiet. The loop between them is repeated every 12 ms, as long as no Receiver is detected.

Exit to Polling State
If a Receiver is detected on all Lanes, the next state will be Polling. The Lanes must now drive a DC common mode voltage within the 0 to 3.6 V VTX-CM-DC spec.

Special Case:
If some but not all Lanes of a device are connected to a Receiver (like a x4 device connected to a x2 device), then wait 12 ms and try it again. If the same Lanes detect a Receiver the second time, exit to the Polling state; otherwise go back to Detect.Quiet. If going to Polling, there are two possibilities for the Lanes that didn't see a Receiver:
1. If the Lanes can operate as a separate Link (see "Designing Devices with Links that can be Merged" on page 541), use another LTSSM and have those Lanes repeat the detect sequence.
2. If another LTSSM is not available, then the Lanes that don't detect a Receiver will not be part of a Link and must transition to Electrical Idle.
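
The Detect.Active exit rules, including the special case, can be summarized as a small decision function. The following is only an illustrative sketch, not spec text or real hardware logic: the names Next and detect_active_decision are hypothetical, and the Receiver detection result is assumed to arrive as a set of Lane numbers.

```python
from enum import Enum


class Next(Enum):
    """Possible outcomes of one Receiver detection pass (names are illustrative)."""
    DETECT_QUIET = "Detect.Quiet"
    RETRY_AFTER_12MS = "wait 12 ms, then test again"
    POLLING = "Polling"


def detect_active_decision(all_lanes, detected_now, detected_before=None):
    """Choose the next step after a Receiver detection pass.

    all_lanes       -- set of Lane numbers controlled by this LTSSM
    detected_now    -- set of Lanes that detected a Receiver on this pass
    detected_before -- result of the previous pass, or None on the first pass
    """
    if not detected_now:
        # No Lane sees a Receiver: loop back to Detect.Quiet.
        return Next.DETECT_QUIET
    if detected_now == all_lanes:
        # Every Lane sees a Receiver: proceed to Polling.
        return Next.POLLING
    # Special case: some but not all Lanes see a Receiver.
    if detected_before is None:
        return Next.RETRY_AFTER_12MS
    if detected_now == detected_before:
        # The same Lanes were detected twice: go on with those Lanes.
        return Next.POLLING
    return Next.DETECT_QUIET
```

For a x4 Port attached to a x2 device, the first pass would report two Lanes and ask for a retry, and a second pass reporting the same two Lanes would move on to Polling. What happens to the undetected Lanes (another LTSSM or Electrical Idle) is outside this sketch.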

Polling State

Introduction
To this point the link has been in the electrical idle state; during Polling, however, TS1s and TS2s are exchanged between the two connected devices. The primary purpose of this state is for the two devices to understand what each other is saying. In other words, they need to establish bit and symbol lock on each other's transmitted bit stream and resolve any polarity inversion issues. Once this has been accomplished, each device is successfully receiving the TS1 and TS2 ordered sets from its link partner. Figure 14-9 on page 525 shows the substates of the Polling state machine.

Figure 14-9: Polling State Machine

[Figure: Polling state machine. Entry is from Detect into Polling.Active (Bit/Symbol Lock), where 1024 TS1s are exchanged unless directed to Compliance. Polling.Active moves to Polling.Configuration (Polarity Inversion) when 8 TS1s or TS2s (or their complement) are received on ALL Lanes, or after a 24 ms timeout when ANY Lane has received 8 TS1s or TS2s and ALL Lanes have detected an exit from Electrical Idle. Polling.Active moves to Polling.Compliance when directed or when insufficient Lanes detect an Electrical Idle exit, and returns to Detect after 24 ms. Polling.Configuration exits to Configuration after 8 TS2s are received and 16 TS2s are transmitted, and returns to Detect after 48 ms.]

Detailed Polling Substates


Polling.Active
During Polling.Active
Transmitters send a minimum of 1024 consecutive TS1s on all detected Lanes once their common mode voltage has settled at the level specified in the Transmit Margin field. The two Link partners may exit the Detect state at different times, so the TS1 exchange is not synchronized. The time needed to send 1024 TS1s at Gen1 speed (2.5 GT/s) is 64 µs.

Some notes regarding this substate are:

• The PAD Symbol must be used in the Lane and Link Number fields of the TS1s.
• All data rates a device supports must be advertised, even if it doesn't intend to use them all.
• Receivers use the incoming TS1s to acquire Bit Lock (see "Achieving Bit Lock" on page 395) and then either Symbol Lock (see "Achieving Symbol Lock" on page 396) for the lower rates, or Block Alignment for 8.0 GT/s (see "Achieving Block Alignment" on page 438).
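
The time quoted for sending 1024 TS1s can be checked with simple arithmetic. The sketch below assumes only that a TS1 is 16 Symbols long and that each 8b/10b Symbol occupies 10 bit times on the wire; the function name is hypothetical. Worked exactly, the burst takes about 65.5 µs, so the 64 µs figure is best read as a rounded value.

```python
SYMBOLS_PER_TS1 = 16         # a TS1 Ordered Set is 16 Symbols (Symbols 0-15)
BITS_PER_SYMBOL_8B10B = 10   # each 8b/10b Symbol is 10 bits on the wire
GEN1_BIT_RATE = 2.5e9        # 2.5 GT/s, one bit per transfer per Lane


def ts1_burst_time_us(count, bit_rate=GEN1_BIT_RATE):
    """Time in microseconds to send `count` back-to-back TS1s on one Lane."""
    bits = count * SYMBOLS_PER_TS1 * BITS_PER_SYMBOL_8B10B
    return bits / bit_rate * 1e6


# 1024 * 16 * 10 = 163,840 bits, which takes about 65.5 microseconds at 2.5 GT/s
burst_us = ts1_burst_time_us(1024)
```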
Exit to Polling.Configuration
The next state is Polling.Configuration if, after sending at least 1024 TS1s, ALL detected Lanes receive 8 consecutive training sequences (or their complement, due to polarity inversion) that satisfy one of the following conditions:
• TS1s with Link and Lane set to PAD were received with the Compliance Receive bit cleared to 0b (bit 4 of Symbol 5).
• TS1s with Link and Lane set to PAD were received with the Loopback bit of Symbol 5 set to 1b.
• TS2s were received with Link and Lane set to PAD.

If the conditions above are not met, the transition can still be made after a 24 ms timeout if at least 1024 TS1s were sent after receiving a TS1, and ANY detected Lane received eight consecutive TS1 or TS2 Ordered Sets (or their complement) with the Lane and Link numbers set to PAD, and one of the following is true:

• TS1s with Link and Lane set to PAD were received with the Compliance Receive bit (bit 4 of Symbol 5) cleared to 0b.
• TS1s with Link and Lane set to PAD were received with the Loopback bit (bit 2 of Symbol 5) set to 1b.
• TS2s were received with Link and Lane set to PAD.

In this timeout case, at least a predetermined number of detected Lanes must also have detected an exit from Electrical Idle at least once since entering Polling.Active (this prevents one or more bad Transmitters or Receivers from holding up Link configuration). The exact set of predetermined Lanes is implementation specific now, which is a change from the 1.1 spec that needed to see an Electrical Idle exit on all detected Lanes.

Exit to Polling.Compliance
If the Enter Compliance bit in the Link Control 2 register is set to 1b, the next state is Polling.Compliance. If this bit was already set before entering Polling.Active, the change to Polling.Compliance must be immediate and no TS1s are sent in Polling.Active.
Otherwise, the change is made after a 24 ms timeout if either of the following is true:
• At least one Lane from the predetermined set has not seen an exit from Electrical Idle since entering Polling.Active (this indicates a passive test load, such as a resistor, on at least one Lane, which forces all Lanes into Polling.Compliance).
• Any detected Lane received 8 consecutive TS1s (or their complement) with Link and Lane numbers set to PAD, the Compliance Receive bit of Symbol 5 set to 1b and the Loopback bit cleared to 0b.

Exit to Detect State
If, after 24 ms, the conditions for going to Polling.Configuration or Polling.Compliance are not met, return to the Detect state.
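
The per-Ordered-Set tests used in the Polling.Active exit rules reduce to a few bit checks on Symbol 5 (the Training Control field). The sketch below is illustrative only: the function names and the string tags "TS1" and "TS2" are hypothetical, and the counting of 8 consecutive sets, the 1024-TS1 requirement and the timeouts are left out.

```python
COMPLIANCE_RECEIVE = 1 << 4  # bit 4 of Symbol 5 (Training Control)
LOOPBACK = 1 << 2            # bit 2 of Symbol 5 (Training Control)


def counts_toward_polling_configuration(ts_kind, link_is_pad, lane_is_pad, training_control):
    """True if one received Ordered Set satisfies a Polling.Configuration condition."""
    if not (link_is_pad and lane_is_pad):
        return False
    if ts_kind == "TS2":
        return True
    if ts_kind == "TS1":
        compliance_clear = not (training_control & COMPLIANCE_RECEIVE)
        loopback_set = bool(training_control & LOOPBACK)
        return compliance_clear or loopback_set
    return False


def counts_toward_polling_compliance(ts_kind, link_is_pad, lane_is_pad, training_control):
    """True if one received TS1 asks for Polling.Compliance (Compliance Receive set, Loopback clear)."""
    return (
        ts_kind == "TS1"
        and link_is_pad
        and lane_is_pad
        and bool(training_control & COMPLIANCE_RECEIVE)
        and not (training_control & LOOPBACK)
    )
```

Note that a TS1 with both the Compliance Receive and Loopback bits set still counts toward Polling.Configuration, since the Loopback condition alone is sufficient.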

Polling.Configuration
In this substate, a transmitter will stop sending TS1s and start sending TS2s, still with PAD set for the Link and Lane numbers. The purpose of the change to sending TS2s instead of TS1s is to advertise to the link partner that this device is ready to proceed to the next state in the state machine. It is a handshake mechanism to ensure that both devices on the link proceed through the LTSSM together. Neither device can proceed to the next state until both devices are ready. The way they advertise they are ready is by sending TS2 ordered sets. So once a device is both sending AND receiving TS2s, it knows it can proceed to the next state because it is ready and its link partner is ready too.

During Polling.Configuration
Transmitters send TS2s with Link and Lane numbers set to PAD on all detected Lanes, and they must advertise all the data rates they support, even those they don't intend to use. Also, each Lane's receiver must independently invert the polarity of its differential input pair if necessary. For an explanation of how this is done, see "Overview" on page 506. The Transmit Margin field must be reset to 000b.

Exit to Configuration State
After eight consecutive TS2s with Link and Lane set to PAD are received on any detected Lane, and at least 16 TS2s have been sent since receiving one TS2, exit to Configuration.

Exit to Detect State
Otherwise, exit to Detect after a 48 ms timeout.
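
The TS2 handshake can be pictured as two counters. The following sketch is an assumption-laden illustration, not spec text: the class and method names are hypothetical, it tracks a single Lane, and it latches the "8 consecutive TS2s received" condition once it has been met.

```python
class PollingConfigurationHandshake:
    """Tracks the exit condition from Polling.Configuration for one Lane."""

    RX_NEEDED = 8    # consecutive TS2s (Link and Lane = PAD) that must be received
    TX_NEEDED = 16   # TS2s that must be sent after receiving one TS2

    def __init__(self):
        self.consecutive_rx = 0
        self.rx_satisfied = False
        self.seen_ts2 = False
        self.tx_since_first_rx = 0

    def on_receive(self, is_pad_ts2):
        """Call for every received Ordered Set; is_pad_ts2 is True for a TS2 with PAD fields."""
        if is_pad_ts2:
            self.seen_ts2 = True
            self.consecutive_rx += 1
            if self.consecutive_rx >= self.RX_NEEDED:
                self.rx_satisfied = True
        else:
            self.consecutive_rx = 0

    def on_transmit_ts2(self):
        """Call for every TS2 sent; only those sent after the first received TS2 count."""
        if self.seen_ts2:
            self.tx_since_first_rx += 1

    def may_exit_to_configuration(self):
        return self.rx_satisfied and self.tx_since_first_rx >= self.TX_NEEDED
```

The 48 ms timeout back to Detect would sit outside this object, in whatever timer logic drives the state machine.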
Exit to Polling.Speed (Non-existent substate)
As a historical aside, the substates of Polling have changed since the 1.0 version of the spec was released. At that time it was thought that when other speeds became available it would make sense to change to the highest available rate as soon as possible in this state. However, the advent of higher rates coincided with the realization that it would be advantageous to be able to change speeds both higher and lower during runtime for power management reasons. Going through the Polling state involves clearing a number of Link values and that makes it an unattractive path for runtime use, so the rate change stage was moved out of this state into the Recovery state. See Figure 14-10 on page 528.

Figure 14-10: Polling State Machine with Legacy Speed Change

[Figure: the Polling state machine of Figure 14-9 with the legacy Polling.Speed substate (Electrical Idle, Change Speed) added, annotated "Speed change step was moved from this state to Recovery state".]

Today, the Link always trains to 2.5 GT/s after a reset, even if other speeds are available. If higher speeds are available once the LTSSM has reached L0, then it transitions to Recovery and attempts to change to the highest commonly supported or advertised rate. Supported speeds are reported in the exchanged TS1s and TS2s, so that either device can subsequently decide to initiate a speed change by transitioning to the Recovery state. The spec still lists the Polling.Speed substate but declares that it is now unreachable.

Polling.Compliance
This substate is only used for testing and causes a Transmitter to send specific patterns intended to create near-worst-case Inter-Symbol Interference (ISI) and cross-talk conditions to facilitate analysis of the Link. Two different patterns can be sent while in this substate, the Compliance Pattern and the Modified Compliance Pattern.

Compliance Pattern for 8b/10b. This pattern consists of 4 Symbols that are repeated sequentially: K28.5-, D21.5+, K28.5+ and D10.2-, where (-) means negative current running disparity or CRD and (+) means positive CRD (since the CRD is forced, it's permissible to have a disparity error at the beginning of the pattern). If the Link has multiple Lanes, then four Delay Symbols (shown as D, but really just additional K28.5 symbols) are injected on Lane 0, two before the next compliance pattern and two after the compliance pattern. Once the last Delay symbol has been sent on Lane 0, the four delay symbols are also sent on Lane 1 (again, two before the next compliance pattern and two after). This process continues until after the Delay symbols have propagated through Lane 7. Then they go back to starting on Lane 0 again, as can be seen in Table 14-3 on page 529 (the compliance pattern is shaded in grey). Every group of eight lanes behaves this way. Shifting the Delay Symbols will ensure interference between adjacent Lanes and provide better test conditions.

Table 14-3: Symbol Sequence 8b/10b Compliance Pattern

Symbol   Lane 0    Lane 1    Lane 2    ...   Lane 8
0        D         K28.5-    K28.5-          D
1        D         D21.5     D21.5           D
2        K28.5-    K28.5+    K28.5+          K28.5-
3        D21.5     D10.2     D10.2           D21.5
4        K28.5+    K28.5-    K28.5-          K28.5+
5        D10.2     D21.5     D21.5           D10.2
6        D         K28.5+    K28.5+          D
7        D         D10.2     D10.2           D
8        K28.5-    D         K28.5-          K28.5-
9        D21.5     D         D21.5           D21.5
10       K28.5+    K28.5-    K28.5+          K28.5+
...      ...       ...       ...             ...
16       K28.5-    K28.5-    D               K28.5-
17       D21.5     D21.5     D               D21.5
18       K28.5+    K28.5+    K28.5-          K28.5+
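
The rotation of the Delay Symbols across Lanes is easier to see when generated programmatically. The following sketch reproduces the layout of Table 14-3 under two assumptions that are not taken from the spec text: each Lane's turn is treated as an 8-Symbol slot (2 Delay, 4 pattern, 2 Delay), and the rotation is assumed to run over 8 slots even on Links narrower than 8 Lanes. The function name compliance_symbol is hypothetical.

```python
PATTERN = ["K28.5-", "D21.5", "K28.5+", "D10.2"]  # basic 4-Symbol compliance pattern
DELAY = "D"         # Delay Symbol (really an additional K28.5)
SLOT_SYMBOLS = 8    # 2 Delay + 4 pattern + 2 Delay
GROUP_LANES = 8     # the Delay rotation repeats for every group of eight Lanes


def compliance_symbol(lane, t):
    """Symbol sent on `lane` at Symbol time `t` (both zero-based)."""
    turn = (t // SLOT_SYMBOLS) % GROUP_LANES  # which Lane of each group is delayed now
    pos = t % SLOT_SYMBOLS
    if lane % GROUP_LANES == turn:
        # This Lane's turn: two Delay Symbols, one pattern, two Delay Symbols.
        if pos < 2 or pos >= 6:
            return DELAY
        return PATTERN[pos - 2]
    # All other Lanes simply keep repeating the 4-Symbol pattern.
    return PATTERN[pos % 4]


# Print the first 19 Symbol times for Lanes 0, 1, 2 and 8, as in Table 14-3.
for t in range(19):
    print(t, [compliance_symbol(lane, t) for lane in (0, 1, 2, 8)])
```

Because the delayed slot is exactly two patterns long, a Lane returns to the same pattern phase as its neighbors once its turn is over.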

Compliance Pattern for 128b/130b. This pattern consists of the following repeating sequence of 36 Blocks:

1. The first Block consists of the Sync Header 01b and contains the unscrambled payload of 64 ones followed by 64 zeros.
2. The second Block has Sync Header 01b and contains the unscrambled payload shown in Table 14-4 on page 530 (note that the pattern repeats after 8 Lanes, and that P means the 4-bit Tx preset being used, while ~P is the bit-wise inverse of that).
3. The third Block has Sync Header 01b and contains the unscrambled payload shown in Table 14-5 on page 531 (same notes as the second Block).
4. The fourth Block is an EIEOS Block.
5. 32 more Data Blocks, each containing 16 scrambled IDL Symbols (00h).

Table 14-4: Second Block of 128b/130b Compliance Pattern

Symbol   Lane 0   Lane 1   Lane 2   Lane 3   Lane 4   Lane 5   Lane 6   Lane 7
0        55h      FFh      FFh      FFh      55h      FFh      FFh      FFh
1        55h      FFh      FFh      FFh      55h      FFh      FFh      FFh
2        55h      00h      FFh      FFh      55h      FFh      FFh      FFh
3        55h      00h      FFh      C0h      55h      FFh      F0h      F0h
4        55h      00h      FFh      00h      55h      FFh      00h      00h
5        55h      00h      C0h      00h      55h      E0h      00h      00h
6        55h      00h      00h      00h      55h      00h      00h      00h
7        {P,~P}   {P,~P}   {P,~P}   {P,~P}   {P,~P}   {P,~P}   {P,~P}   {P,~P}
8        00h      1Eh      2Dh      3Ch      4Bh      5Ah      69h      78h
9        00h      55h      00h      00h      00h      55h      00h      F0h
10       00h      55h      00h      00h      00h      55h      00h      00h
11       00h      55h      00h      00h      00h      55h      00h      00h
12       00h      55h      0Fh      0Fh      00h      55h      07h      00h
13       00h      55h      FFh      FFh      00h      55h      FFh      00h
14       00h      55h      FFh      FFh      7Fh      55h      FFh      00h
15       00h      55h      FFh      FFh      FFh      55h      FFh      00h

Table 14-5: Third Block of 128b/130b Compliance Pattern

Symbol   Lane 0   Lane 1   Lane 2   Lane 3   Lane 4   Lane 5   Lane 6   Lane 7
0        FFh      FFh      55h      FFh      FFh      FFh      55h      FFh
1        FFh      FFh      55h      FFh      FFh      FFh      55h      FFh
2        FFh      FFh      55h      FFh      FFh      FFh      55h      FFh
3        F0h      F0h      55h      F0h      F0h      F0h      55h      F0h
4        00h      00h      55h      00h      00h      00h      55h      00h
5        00h      00h      55h      00h      00h      00h      55h      00h
6        00h      00h      55h      00h      00h      00h      55h      00h
7        {P,~P}   {P,~P}   {P,~P}   {P,~P}   {P,~P}   {P,~P}   {P,~P}   {P,~P}
8        00h      1Eh      2Dh      3Ch      4Bh      5Ah      69h      78h
9        00h      00h      00h      55h      00h      00h      00h      55h
10       00h      00h      00h      55h      00h      00h      00h      55h
11       00h      00h      00h      55h      00h      00h      00h      55h
12       FFh      0Fh      0Fh      55h      0Fh      0Fh      0Fh      55h
13       FFh      FFh      FFh      55h      FFh      FFh      FFh      55h
14       FFh      FFh      FFh      55h      FFh      FFh      FFh      55h
15       FFh      FFh      FFh      55h      FFh      FFh      FFh      55h

Modified Compliance Pattern for 8b/10b. The second compliance pattern adds an error status field that reports how many Receiver errors have been detected while in Polling.Compliance.

In 8b/10b mode, the original pattern is still used, but 2 Symbols are added to report the error status (2 are used instead of one to avoid interfering with the required disparity of the sequence) and 2 more K28.5 Symbols are added at the end, making the pattern 8 Symbols long altogether.

Table 14-6: Symbol Sequence of 8b/10b Modified Compliance Pattern

Symbol   Lane 0    Lane 1    Lane 2    ...   Lane 8
0        D         K28.5-    K28.5-          D
1        D         D21.5     D21.5           D
2        D         K28.5+    K28.5+          D
3        D         D10.2     D10.2           D
4        K28.5-    ERR       ERR             K28.5-
5        D21.5     ERR       ERR             D21.5
6        K28.5+    K28.5-    K28.5-          K28.5+
7        D10.2     K28.5+    K28.5+          D10.2
8        ERR       K28.5-    K28.5-          ERR
9        ERR       D21.5     D21.5           ERR
10       K28.5-    K28.5+    K28.5+          K28.5-
11       K28.5+    D10.2     D10.2           K28.5+
12       K28.7     ERR       ERR             K28.7
13       K28.7     ERR       ERR             K28.7
14       K28.7     K28.5-    K28.5-          K28.7
15       K28.7     K28.5+    K28.5+          K28.7
16       K28.5-    D         K28.5-          K28.5-

The encoded error status byte contains a Receiver Error Count in ERR[6:0] that reports the number of errors seen since Pattern Lock was asserted. The Pattern Lock indicator is ERR bit [7], and shows when the Receiver has locked to the incoming Modified Compliance Pattern. The delay sequence is also different for this pattern, and now adds four K28.5 Symbols (shown as D in the table) in a row at the beginning of the sequence and four K28.7 Symbols at the end of the 8-Symbol pattern, making a total of 16 Symbols that are sent before the Delay pattern shifts to the next Lane. This pattern is illustrated in Table 14-6 on page 532. It can be seen that the delay pattern shifts to Lane 1 after 16 Symbols. As before, the basic pattern (8 Symbols now) is highlighted in grey.
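
The layout of the error status byte described above (Pattern Lock in bit 7, Receiver Error Count in bits 6:0) can be expressed as a pair of helper functions. This is a minimal sketch with hypothetical function names; clamping the count at 127 reflects the saturating behavior the chapter describes for the Receiver Error Count.

```python
PATTERN_LOCK_BIT = 0x80   # ERR bit [7]
ERROR_COUNT_MASK = 0x7F   # ERR bits [6:0]


def encode_error_status(pattern_lock, error_count):
    """Pack Pattern Lock and a saturating 7-bit Receiver Error Count into one byte."""
    count = min(max(error_count, 0), ERROR_COUNT_MASK)  # saturate at 127, never wrap
    return (PATTERN_LOCK_BIT if pattern_lock else 0) | count


def decode_error_status(err):
    """Return (pattern_lock, error_count) from an error status byte."""
    return bool(err & PATTERN_LOCK_BIT), err & ERROR_COUNT_MASK
```

A Lane that has locked and seen 300 errors would therefore report FFh: lock set and the count pinned at 127.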

Modified Compliance Pattern for 128b/130b. This pattern consists of a repeating sequence of 65792 Blocks as listed here:

1. One EIEOS Block
2. 256 Data Blocks of 16 scrambled IDL Symbols (00h) each.
3. 255 sets of the following sequence:
   • One SOS
   • 256 Data Blocks of 16 scrambled IDL Symbols each.
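
The total of 65792 Blocks follows directly from the three list items above, and the arithmetic is a handy cross-check when building a pattern generator or checker. The function name below is hypothetical; the counts are those given in the list.

```python
def modified_compliance_block_count():
    """Blocks in one repetition of the 128b/130b Modified Compliance Pattern."""
    eieos_blocks = 1               # item 1: one EIEOS Block
    first_data_run = 256           # item 2: 256 Data Blocks
    repeated_sets = 255            # item 3: 255 sets of ...
    blocks_per_set = 1 + 256       # ... one SOS plus 256 Data Blocks
    return eieos_blocks + first_data_run + repeated_sets * blocks_per_set


# 1 + 256 + 255 * 257 = 65792
total_blocks = modified_compliance_block_count()
```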

Since the payload in the Data Blocks is all zeros, the output ends up being simply the output of the scrambler for that Lane. Recall that the scrambler doesn't advance with the Sync Header bits and is initialized by the EIEOS. Since the scrambler seed value depends on the Lane number, it's important that the Lane numbers be understood correctly. If Link training completed earlier but then software sent the LTSSM to this substate by setting the Enter Compliance bit in the Link Control 2 register, then the Lane numbers and polarity inversions that were assigned during training are used. If a Lane wasn't active during training, or if this substate was entered in any other way, then the Lane numbers will be the default numbers assigned by the Port. Finally, note that the Data Blocks in this pattern don't form a Data Stream and don't have to follow the requirements for that (such as sending any SDS Ordered Sets or EDS Tokens).

The thoughtful reader may be wondering about the absence in this sequence of the error status Symbols that are prominent in the 8b/10b sequence. As it turns out, for 128b/130b they're included inside the SOSs now. Recall that the last 2 bytes of the SOS are used to report the Receiver error count during Polling.Compliance (see "Ordered Set Example - SOS" on page 426 for more on this).

Entering Polling.Compliance:
As was the case when entering Polling.Active, the Transmit Margin field of the Link Control 2 register is used to set the Transmitter voltage range that will be in effect while in this substate.

The data rate and de-emphasis level are determined as described below. Since many of the choices about these settings depend on the Link Control 2 register fields, that register is shown in Figure 14-11 on page 536 for reference.

• If a Port only supports 2.5 GT/s, then that will be the data rate and the de-emphasis level will be -3.5 dB.
• Otherwise, if this substate was entered because 8 consecutive TS1s were received with the Compliance Receive bit set to 1b and the Loopback bit cleared to 0b (bits 4 and 2 of TS1 Symbol 5), then the rate will be the highest common value for any Lane. The select_deemphasis variable must be set to match the Selectable De-emphasis bit in TS1 Symbol 4. If the chosen rate is 8.0 GT/s, the select_preset variable on each Lane is taken from Symbol 6 of the consecutive TS1s. For this Gen3 rate, Lanes that didn't receive 8 consecutive TS1s with Transmitter Preset information can choose any value they support.
• Otherwise, if the Enter Compliance bit is set in the Link Control 2 register, the compliance pattern is transmitted at the data rate given by the Target Link Speed field. If the rate will be 5.0 GT/s, the select_deemphasis variable is set if the Compliance Preset/De-emphasis field equals 0001b. If the rate will be 8.0 GT/s, the select_preset variable of each Lane is cleared to 0b and the Transmitter must use the Compliance Preset/De-emphasis value, as long as it isn't a Reserved encoding.
• Finally, if none of the other cases are true, then the data rate, preset, and de-emphasis settings will cycle through a sequence based on the component's maximum supported speed and the number of times Polling.Compliance is entered this way. The sequence is given in Table 14-7 on page 535 and begins with Setting Number 1 the first time Polling.Compliance is entered. It advances through the list each time the substate is re-entered, and eventually repeats the pattern if it's re-entered more than 14 times. This provides a handy way to test all of a component's supported settings: transition to Polling.Compliance, test that setting, transition back to Polling.Active, then back to Polling.Compliance again to test the next setting. A method for a load board to cause these transitions is described in the spec, and consists of sending a 100 MHz, 350 mVpp signal for about 1 ms on one leg of a receiver's differential pair.

Table 14-7: Sequence of Compliance Tx Settings

Setting Number   Data Rate (GT/s)   De-emphasis (dB)   Tx Preset Encoding
1                2.5                -3.5               n/a
2                5.0                -3.5               n/a
3                5.0                -6.0               n/a
4                8.0                n/a                0000b
5                8.0                n/a                0001b
6                8.0                n/a                0010b
7                8.0                n/a                0011b
8                8.0                n/a                0100b
9                8.0                n/a                0101b
10               8.0                n/a                0110b
11               8.0                n/a                0111b
12               8.0                n/a                1000b
13               8.0                n/a                1001b
14               8.0                n/a                1010b
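
The cycling through Table 14-7 can be modeled as an index into a list. The sketch below is an illustration under a stated assumption: settings whose data rate exceeds the component's maximum supported speed are simply skipped, which is one reading of "based on the component's maximum supported speed" and should be confirmed against the spec. The names SETTINGS and compliance_setting are hypothetical.

```python
# (data rate in GT/s, de-emphasis in dB or None, Tx preset encoding or None)
SETTINGS = (
    [(2.5, -3.5, None), (5.0, -3.5, None), (5.0, -6.0, None)]
    + [(8.0, None, preset) for preset in range(0b0000, 0b1010 + 1)]
)


def compliance_setting(entry_count, max_rate=8.0):
    """Setting used on the Nth entry (1-based) to Polling.Compliance by this route.

    Assumption: settings above max_rate are skipped, and the list wraps around
    once every usable setting has been visited.
    """
    usable = [s for s in SETTINGS if s[0] <= max_rate]
    return usable[(entry_count - 1) % len(usable)]
```

For an 8.0 GT/s component the 15th entry lands back on Setting Number 1, matching the "repeats if re-entered more than 14 times" behavior described earlier.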

Figure 14-11: Link Control 2 Register

Bits     Field
15:12    Compliance Preset/De-emphasis
11       Compliance SOS
10       Enter Modified Compliance
9:7      Transmit Margin
6        Selectable De-emphasis
5        Hardware Autonomous Speed Disable
4        Enter Compliance
3:0      Target Link Speed
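
When reading the Link Control 2 register from software or a trace, the fields in Figure 14-11 can be unpacked with shifts and masks. This is a minimal sketch: the function name and dictionary keys are hypothetical, and only the bit positions shown in the figure are used.

```python
def decode_link_control_2(value):
    """Split a 16-bit Link Control 2 register value into its fields."""
    return {
        "target_link_speed": value & 0xF,                    # bits 3:0
        "enter_compliance": (value >> 4) & 0x1,              # bit 4
        "hw_autonomous_speed_disable": (value >> 5) & 0x1,   # bit 5
        "selectable_deemphasis": (value >> 6) & 0x1,         # bit 6
        "transmit_margin": (value >> 7) & 0x7,               # bits 9:7
        "enter_modified_compliance": (value >> 10) & 0x1,    # bit 10
        "compliance_sos": (value >> 11) & 0x1,               # bit 11
        "compliance_preset_deemphasis": (value >> 12) & 0xF, # bits 15:12
    }


# Example: 1413h has Enter Compliance and Enter Modified Compliance both set.
fields = decode_link_control_2(0x1413)
```

With both Enter Compliance and Enter Modified Compliance set, as in the example value, the Modified Compliance Pattern is the one selected by the rules in this section.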

If the data rate won't be 2.5 GT/s, then:

• If any TS1s were sent during Polling.Active, the Transmitter must send either one or two consecutive EIOSs before going into Electrical Idle.
• If no TS1s were sent in Polling.Active, the transmitter enters Electrical Idle without sending any EIOSs.
• The Electrical Idle period must be > 1 ms and < 2 ms. During this time, the data rate is changed to the new speed and stabilized. If the rate will be 5.0 GT/s, the de-emphasis level is given by the select_deemphasis variable (0b = -3.5 dB, 1b = -6.0 dB). If the rate will be 8.0 GT/s, then the select_preset variable gives the transmitter presets to use.
During Polling.Compliance:
Once the data rate and de-emphasis or preset values have been determined, the following rules will apply:

Compliance Pattern. If entry was not due to the Compliance Receive bit set and Loopback bit cleared in the TS Ordered Sets, and was not due to both the Enter Compliance and Enter Modified Compliance bits being set in the Link Control 2 register, then Transmitters send the compliance pattern on all detected Lanes.

Exit to Polling.Active

If any of these conditions are true:

a) Electrical Idle exit is detected at the Receiver of any detected Lane and the Enter Compliance bit is cleared (0b). The spec notes that the stipulation "any Lane" supports the Load Board usage model described earlier to allow the device to cycle through all the supported test cases.
b) The Enter Compliance bit has been cleared (0b) since Polling.Compliance was entered.
c) For an Upstream Port, the Enter Compliance bit is set (1b) and EIOS has been detected on any Lane. This condition clears the Enter Compliance bit (0b).

If the data rate was not 2.5 GT/s or the Enter Compliance bit was set during entry to Polling.Compliance, the Transmitter sends 8 consecutive EIOSs and goes to Electrical Idle before transitioning to Polling.Active. During the Electrical Idle time, which must last between 1 ms and 2 ms, the Port changes to 2.5 GT/s and stabilizes.

Sending multiple EIOSs helps ensure that the Link partner will detect at least one and exit Polling.Compliance when the Enter Compliance register bit was used for entry.

Modified Compliance Pattern. If Polling.Compliance was entered either because the received TS1s directed it (Compliance Receive bit set and Loopback bit cleared) or because both the Enter Compliance and Enter Modified Compliance bits were set in the Link Control 2 register, then send the Modified Compliance Pattern on all detected Lanes with the error status Symbol cleared to all zeroes.


If the rate is 2.5 or 5.0 GT/s, each Lane indicates a successful lock on the incoming pattern by looking for one instance of the Modified Compliance Pattern and then setting the Pattern Lock bit in the Modified Compliance Pattern that it sends back (bit 7 of the 8-bit error status Symbol).

• The error status Symbols cannot be used in the locking process because they have no defined meaning until the Link partner has itself locked.
• An instance of the pattern is defined to be the sequence of 4 Symbols described earlier: K28.5, D21.5, K28.5, and D10.2, or the complement of these Symbols (meaning the polarity is inverted).
• The device under test must set the Pattern Lock bit in the Modified Compliance Patterns it sends within 1 ms of receiving the Modified Compliance Pattern from the Link partner.
• Any Receiver errors on a Lane increment that Lane's error count by 1, and it saturates when the count reaches 127 (doesn't go higher or wrap around).

If the rate is 8.0 GT/s:

• The Error_Status field is set to 00h on entry to this substate.
• The device under test must set the Pattern Lock bit in the Modified Compliance Patterns it sends within 4 ms of receiving the Modified Compliance Pattern from the Link partner.
• Each Lane independently sets Pattern Lock when it achieves Block Alignment. After that, Symbols in Data Blocks are expected to be IDLs (00h) and any mismatched Symbols increment the count by 1. The Receiver Error Count saturates at 127, and is sent in the last 2 Symbols of the SOSs included in this pattern.
• The scrambling requirements are applied as usual to the Modified Compliance Pattern: the seed value is set per Lane, an EIEOS initializes the LFSR, and SOSs don't advance the LFSR.
• The spec notes that devices should wait long enough before acquiring Block alignment to ensure that their Receivers have stabilized and won't see any bit slips. It even mentions that devices might want to re-validate their Block alignment before setting the Pattern Lock bit.

Exit to Polling.Active

• If the Enter Compliance bit was set (1b) on entry to Polling.Compliance, and either the Enter Compliance bit has since been cleared (0b) or this is an Upstream Port that has received an EIOS on any Lane (which also causes its Enter Compliance bit to be cleared to 0b).


If the data rate was not 2.5 GT/s or the Enter Compliance bit was set during entry to Polling.Compliance, the Transmitter sends 8 consecutive EIOSs and goes to Electrical Idle before transitioning to Polling.Active. During the Electrical Idle time the Port changes to 2.5 GT/s and -3.5 dB de-emphasis, and this time must be between 1 ms and 2 ms. Sending multiple EIOSs helps ensure that the Link partner will detect at least one and exit Polling.Compliance when the Enter Compliance register bit was used for entry.

Exit to Detect State

• If the Enter Compliance bit in the Link Control 2 register is cleared (0b) and the device is directed to exit this substate.

Figure 14-12: Link Control 2 Register's "Enter Compliance" Bit

[Figure: Link Control 2 register, with the same field layout as Figure 14-11; the caption calls out the Enter Compliance bit (bit 4).]

Configuration State
Initially, the Configuration state performs Link and Lane Numbering at the 2.5 GT/s rate; however, provisions exist that allow the 5 GT/s and 8 GT/s devices to also enter the Configuration state from the Recovery state. The transition from Recovery to Configuration is done primarily for making dynamic changes in the link width of multi-lane devices. The dynamic changes are supported for the 5 GT/s and 8 GT/s devices only. Consequently, the detailed state transitions for these devices appear in the detailed Configuration Substate descriptions beginning on page 552.


Configuration State General


The main goal of this state is to discover how the Port has been connected and assign Lane numbers for it. For example, 8 Lanes may be available but only 2 are active, or perhaps the Lanes can be split into multiple Links, such as two x4 Links. Unlike the other states, Ports have defined roles that depend on whether they are facing upstream or downstream. For that reason, the description of these substates is grouped into the behavior for Downstream Lanes and for Upstream Lanes. The Downstream Port (port that transmits downstream) plays the "leader" role on this Link to walk through the rest of the states in the link initialization process. The Upstream Port (port that transmits upstream) plays the "follower" role. The leader, or Downstream Port, will specify the Link and Lane numbers to the Upstream Port, and the Upstream Port will simply reply with the same values it was told, unless there is a conflict, as we will see in this section. The Link and Lane numbers are reported in the fields of the TS1s exchanged during this time, as shown again in Figure 14-13 on page 540. These fields contain PAD symbols as a placeholder until actual values are assigned.

Figure 14-13: Link and Lane Number Encoding in TS1/TS2

Symbol   Field      Contents
0        COM        K28.5
1        Link #     0 - 255 = D0.0 - D31.7, PAD = K23.7
2        Lane #     0 - 31 = D0.0 - D31.0, PAD = K23.7
3        # FTS      # of FTSs required by Receiver for L0s recovery
4        Rate ID    Bit 1 must be set, indicates 2.5 GT/s support
5        Train Ctl
6 - 9    EQ Info    Equalization info when changing to 8.0 GT/s, else
                    TS1 or TS2 Identifier
10 - 15  TS ID      TS1 Identifier = D10.2, TS2 Identifier = D5.2
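
The Dx.y and Kx.y names used for the Link and Lane number symbols follow the usual 8b/10b naming convention: x is the low five bits of the byte and y is the high three bits. The short sketch below, with hypothetical helper names, shows how a Link or Lane number maps to its character name and what byte value the PAD and COM control characters carry.

```python
def d_code_name(byte):
    """Return the 8b/10b data character name (Dx.y) for an 8-bit value."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    # x = low five bits, y = high three bits
    return f"D{byte & 0x1F}.{byte >> 5}"


def code_value(x, y):
    """Return the 8-bit value behind a Dx.y or Kx.y name."""
    return (y << 5) | x


PAD = code_value(23, 7)  # K23.7, used as the placeholder
COM = code_value(28, 5)  # K28.5, first symbol of TS1/TS2 in 8b/10b mode

print(d_code_name(0), d_code_name(255))     # range of Link numbers
print(d_code_name(0x4A), d_code_name(0x45))  # TS1 and TS2 identifiers
print(hex(PAD), hex(COM))
```

Because PAD is a control (K) character, it can never be confused with a data character that carries an actual Link or Lane number.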


Designing Devices with Links that can be Merged


A designer chooses how many Lanes to implement on a given Link based on
performance and cost requirements. Narrow Links may optionally be able to
combine into a wider Link, and a wide Link can optionally be split into
multiple narrower Links. Figure 14-14 on page 541 shows a Switch with one
Upstream Port and four x2 Downstream Ports. In this example, they can also
be grouped into two x4 Links. As a reminder, the spec requires that every
Port must also support operating as a x1 Link.

As seen on the left side of the figure, the switch internally consists of
one upstream logical bridge and four downstream logical bridges. One bridge
is required for each Port, so supporting 4 Downstream Ports requires 4
downstream bridges. However, if the Ports are combined as shown on the right
side of the diagram, then some of the bridges simply go unused. During Link
Training, the LTSSM of each Downstream Port determines which of the
supported connection options is actually implemented.

Figure 14-14: Combining Lanes to Form Wider Links (Link Merging)

[Figure: two views of the same Switch, each with a x8 Upstream Port on
Virtual PCI Bridge 0. Left: four x2 Downstream Ports on Virtual PCI Bridges
1 through 4. Right: two x4 Downstream Ports on Virtual PCI Bridges 1 and 2.]


Configuration State: Training Examples


Introduction
In the Configuration state, the Link and Lane numbering process is initiated
by a Downstream Port, the "leader" (e.g., Root Port or Switch Downstream
Port). Endpoints and switch Upstream Ports don't initiate, but respond. They
are the "follower." Let's now consider some examples to make the concepts
easier to understand.

Link Configuration Example 1


The devices shown in Figure 14-15 on page 543 both support a single Link
that implements Link widths of x4, x2, or x1. The Lane number assignments
are fixed by the device internally and must be sequential starting from
zero. The physical Lane numbers are shown within the device box and the
reported, or logical, Lane numbers are reported by the TS Ordered Sets.
Usually, these will be the same, but not in every case.

Link Number Negotiation.
1. Since only one Link is possible in this example, the Downstream Port
   (the Port that transmits downstream) sends TS1s using the same Link
   Number, N, for all the Lanes and PAD for the Lane Numbers.
2. In this Configuration state, the Upstream Port starts out sending TS1s
   with PAD in the Link and Lane number fields, but upon receiving the
   TS1s from the Downstream Port with the non-PAD Link number, the
   Upstream Port responds with TS1s on all connected Lanes that reflect
   the same Link Number N and PAD for the Lane Number field. Based on
   this response, the Downstream LTSSM recognizes that four Lanes
   responded and used the same Link number as is being sent, so all 4
   Lanes will be configured as one Link. The Link Number itself is an
   implementation-specific value that isn't stored in any defined
   configuration register and isn't related to the Port Number or any
   other value.


Figure 14-15: Example 1 - Steps 1 and 2

Both Ports support one Link of x4, x2 or x1. Physical Lanes 0-3 of the
Downstream Port connect to physical Lanes 0-3 of the Upstream Port.

                                Lane 0   Lane 1   Lane 2   Lane 3
Step 1: Downstream Port TS1s
  Link #                        N        N        N        N
  Lane #                        PAD      PAD      PAD      PAD
Step 2: Upstream Port TS1s
  Link #                        N        N        N        N
  Lane #                        PAD      PAD      PAD      PAD

Lane Number Negotiation.

3. The Downstream Port now begins to send TS1s with the same Link
   Number but assigns Lane Numbers of 0, 1, 2 and 3 to the connected
   Lanes, as shown in Figure 14-16 on page 544.
4. In response to seeing non-PAD Lane numbers coming in, the Upstream
   Port will verify that the incoming Lane numbers match the Lane
   numbers they are received on. In this example, the Lanes of the
   Downstream and Upstream Ports are connected correctly. Because all
   the Lane numbers match, the Upstream Port advertises its Lane numbers
   in the TS1s it is sending as well. When the Downstream Port sees
   non-PAD Lane numbers in response, it compares the incoming numbers to
   the values it's sending. If they match, all is well but, if not, then
   other steps will need to be taken. If some but not all Lane numbers
   match, then the Link width may be adjusted accordingly. If the Lanes
   are reversed, then the optional Lane Reversal feature will be needed.
   Because it's optional, it's possible that the Lanes have been
   reversed but neither device is capable of correcting it. This would
   be a dramatic board design error because it is possible the Link
   cannot be configured for operation in this case.


Figure 14-16: Example 1 - Steps 3 and 4

Both Ports support one Link of x4, x2 or x1. Physical Lanes 0-3 of the
Downstream Port connect to physical Lanes 0-3 of the Upstream Port.

                                Lane 0   Lane 1   Lane 2   Lane 3
Step 3: Downstream Port TS1s
  Link #                        N        N        N        N
  Lane #                        0        1        2        3
Step 4: Upstream Port TS1s
  Link #                        N        N        N        N
  Lane #                        0        1        2        3

Confirming Link and Lane Numbers.

5. Since the transmitted and received Link and Lane numbers matched on
   all the Lanes, the Downstream Port indicates it is ready to conclude
   this negotiation and proceed to the next state, L0, by sending TS2
   Ordered Sets with the same Link and Lane numbers.
6. Upon receiving TS2s with the same Link and Lane numbers, the
   Upstream Port also indicates its readiness to leave the Configuration
   state and proceed to L0 by sending TS2s back. This is shown in Figure
   14-17 on page 545.
7. Once a Port receives at least 8 TS2s and transmits at least 16, it
   sends some logical idle data and then transitions to L0.
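
The follower's side of the steps above can be sketched as a small per-Lane function. This is a simplified model for illustration only: the function name and tuple layout are hypothetical, Ordered Sets are reduced to (type, Link #, Lane #), and details such as counting consecutive Ordered Sets, timeouts, and Lane Reversal are left out.

```python
PAD = "PAD"


def follower_reply(rx_type, rx_link, rx_lane, own_lane):
    """Per-Lane reply of the Upstream Port (follower) to one Ordered Set."""
    if rx_link == PAD:
        return ("TS1", PAD, PAD)            # nothing assigned yet
    if rx_lane == PAD:
        return ("TS1", rx_link, PAD)        # step 2: reflect the Link number
    if rx_lane != own_lane:
        return ("TS1", rx_link, own_lane)   # mismatch: advertise own number
    return (rx_type, rx_link, rx_lane)      # steps 4 and 6: echo TS1 or TS2


def run_example_1(width, link_number):
    """Replay the leader's three stages and collect the follower's replies."""
    lanes = list(range(width))
    leader_stages = [
        ("TS1", [PAD] * width),  # step 1: Link number only
        ("TS1", lanes),          # step 3: Lane numbers proposed
        ("TS2", lanes),          # step 5: confirmation
    ]
    return [
        [follower_reply(ts, link_number, proposed[i], i) for i in lanes]
        for ts, proposed in leader_stages
    ]


for stage in run_example_1(width=4, link_number=5):
    print(stage)
```

The Link number 5 is arbitrary, in keeping with the point made above that the value is implementation specific.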



Figure 14-17: Example 1 - Steps 5 and 6

Both Ports support one Link of x4, x2 or x1. Physical Lanes 0-3 of the
Downstream Port connect to physical Lanes 0-3 of the Upstream Port.

                                Lane 0   Lane 1   Lane 2   Lane 3
Step 5: Downstream Port TS2s
  Link #                        N        N        N        N
  Lane #                        0        1        2        3
Step 6: Upstream Port TS2s
  Link #                        N        N        N        N
  Lane #                        0        1        2        3

Link Configuration Example 2


Another example that should be covered is of a Device with 4 Downstream
Lanes that is capable of being configured as a single x4 Link or a
combination of two x2 Links or four x1 Links. So even a configuration of one
x2 Link and two x1 Links would be just fine. An example of this type of
Device can be seen in Figure 14-18 on page 546.

If all four Lanes have detected a receiver and made it to the Configuration
state, there are a number of connection possibilities:
- One x4 Link
- Two x2 Links
- One x2 Link and two x1 Links
- Four x1 Links

One example method defined in the spec to determine which of the
configurations are implemented is described below.


Link Number Negotiation.

1. In this example method, the Downstream Port begins by advertising a
   unique Link number on each Lane. Lane 0 advertises a Link number of
   N, Lane 1 advertises a Link number of N+1, etc. as shown in Figure
   14-18 on page 546. These Link numbers are just examples, and they do
   not have to be sequential. Also, it is important to remember that the
   Downstream Port does not know what it is connected to and it is this
   process where the Port is trying to determine the connections for
   each Lane.

Figure 14-18: Example 2 - Step 1

Downstream Port options: one Link (x4, x2 or x1), two Links (x2 or x1), or
four Links (x1). Its Lanes 0-1 connect to physical Lanes 0-1 of the left
Upstream Port; its Lanes 2-3 connect to physical Lanes 1-0 of the right
Upstream Port. Each Upstream Port supports one Link of x2 or x1.

                                Lane 0   Lane 1   Lane 2   Lane 3
Step 1: Downstream Port TS1s
  Link #                        N        N+1      N+2      N+3
  Lane #                        PAD      PAD      PAD      PAD
Upstream Port TS1s
  Link #                        PAD      PAD      PAD      PAD
  Lane #                        PAD      PAD      PAD      PAD

2. Upon receiving the returned TS1s, the Downstream Port recognizes two
   things: all four Lanes are working and they are connected to two
   different Upstream Ports. This means there will actually be two
   Downstream Ports. Each Downstream Port will have its own Lane 0 and
   Lane 1 as shown in Figure 14-20 on page 548.


Figure 14-19: Example 2 - Step 2

Downstream Port options: one Link (x4, x2 or x1), two Links (x2 or x1), or
four Links (x1). Its Lanes 0-1 connect to physical Lanes 0-1 of the left
Upstream Port; its Lanes 2-3 connect to physical Lanes 1-0 of the right
Upstream Port. Each Upstream Port supports one Link of x2 or x1.

                                Lane 0   Lane 1   Lane 2   Lane 3
Downstream Port TS1s
  Link #                        N        N+1      N+2      N+3
  Lane #                        0        PAD      PAD      PAD
Step 2: Upstream Port TS1s
  Link #                        N        N        N+2      N+2
  Lane #                        PAD      PAD      PAD      PAD

Lane Number Negotiation.

3. The process continues now for each Link independently but they'll
   take the same steps as before to determine the Lane numbers: the
   Downstream Ports will advertise their Lane numbers in the TS1s. It is
   also important to note that the Downstream Ports begin advertising
   the single returned Link number for all Lanes of the Link. The Link
   on the left is advertising a Link number of N for both Lanes and the
   Link on the right is advertising N+2.
4. In this example, the Lane numbers of the Link on the left match
   between the Downstream and Upstream Port. However, for the Link on
   the right, the Lane numbers of the Downstream Port are reversed from
   the connected Upstream Port. The Upstream Port realizes this and if
   it supports Lane Reversal, it will implement that internally and
   reply back with the same Lane numbers that were advertised by the
   Downstream Port, as shown in Figure 14-20. If the Upstream Port did
   not support Lane Reversal, it would have advertised its own Lane
   numbers in the returned TS1s and then the Downstream Port would have
   realized the issue and had a chance to implement Lane Reversal.
5. Lane Reversal can optionally be handled by either Port. If the
   Upstream Port detects this case and supports Lane Reversal, it simply
   makes the Lane assignment change internally and returns TS1s with the
   proper Lane numbers. As a result, the Downstream Port is unaware that
   there was ever an issue. If the Upstream Port is unable to handle
   Lane Reversal though, then the Downstream Port will see the incoming
   Lane numbers in reverse order. If it supports Lane Reversal, it will
   then correct the numbering and begin sending TS2s with the new Lane
   numbers.

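Figure 14-20: Example 2 - Steps 3, 4 and 5

Two Downstream Ports, each with its own Lanes 0-1. The left Link connects to
physical Lanes 0-1 of its Upstream Port; the right Link connects to physical
Lanes 1-0 of its Upstream Port, which applies Lane Reversal (Step 5).

                                Left Link          Right Link
                                Lane 0   Lane 1    Lane 0   Lane 1
Step 3: Downstream Port TS1s
  Link #                        N        N         N+2      N+2
  Lane #                        0        1         0        1
Step 4: Upstream Port TS1s
  Link #                        N        N         N+2      N+2
  Lane #                        0        1         0        1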

Confirming Link and Lane Numbers.

6. The Downstream Ports receive the TS1s with the Link and Lane numbers
   that match what was advertised so each Port, independently, starts
   sending TS2s as a notification that it is ready to proceed to the L0
   state with the negotiated settings.
7. The Upstream Ports receive the TS2s with no Link and Lane number
   changes and start transmitting TS2s in return with the same values.
8. Once each Port receives at least 8 TS2s and transmits at least 16
   TS2s, it sends some logical idle data and then transitions to L0.
   The Upstream Port of the Link on the right is implementing Lane
   Reversal internally.
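
The way the Downstream device in Example 2 discovers that it is facing two x2 Links can be sketched as a grouping of adjacent Lanes by the Link number each one returned. The function name is hypothetical and the model is deliberately simple: it only looks at returned Link numbers and ignores timing and the subsequent Lane numbering.

```python
from itertools import groupby

PAD = "PAD"


def group_lanes_by_link(returned_links):
    """Group adjacent Lanes that returned the same non-PAD Link number."""
    groups = []
    indexed = enumerate(returned_links)
    for link, run in groupby(indexed, key=lambda item: item[1]):
        lanes = [lane for lane, _ in run]
        if link != PAD:
            groups.append((link, lanes))
    return groups


N = 10  # arbitrary starting Link number for the illustration
# Step 1 advertised N, N+1, N+2, N+3; step 2 returned N, N, N+2, N+2.
print(group_lanes_by_link([N, N, N + 2, N + 2]))
```

Each resulting group then proceeds independently, with its own Lane numbering starting at 0, as described in steps 3 through 8.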

Link Configuration Example 3: Failed Lane


Finally, let's consider what happens if one of the Lanes isn't working
properly. Consider an example in which Lane 2 of the Upstream Port is not
functioning well as shown in Figure 14-21 on page 550. It's important to
note that the Lane isn't physically broken because if it were it wouldn't
have detected a Receiver and wouldn't be considered for inclusion in the
Link. However, even though the Lane is attached, either the Transmitter or
Receiver (or both) of Lane 2 on the Upstream Port is not getting the job
done.

In cases like this, it is likely that the link training process will take
considerably longer because most of the state transitions proceed to the
next state only when ALL Lanes are ready for it, OR when a subset of Lanes
is ready and a timeout condition has occurred.

The steps below indicate a way this situation could be handled when
transitioning through the substates of the Configuration state machine.

Link Number Negotiation.

1. Even though the Lane 2 Receiver on the Upstream Port is having
   issues, the Downstream Port follows the same process upon entering
   the Configuration state. The Downstream Port sends TS1s on all Lanes
   with the Link number N and with the Lane number set to PAD.
2. Lanes 0, 1 and 3 all received the TS1s with the non-PAD Link number,
   so those Lanes send TS1s back to the Downstream Port. However, Lane
   2 of the Upstream Port did not successfully receive the TS1s with the
   non-PAD Link number, so its Transmitter continues sending TS1s with
   PAD in the Link and Lane number fields as shown in Figure 14-21 on
   page 550.


Figure 14-21: Example 3 - Steps 1 and 2

Both Ports support one Link of x4, x2 or x1. Physical Lanes 0-3 of the
Downstream Port connect to physical Lanes 0-3 of the Upstream Port.

                                Lane 0   Lane 1   Lane 2   Lane 3
Step 1: Downstream Port TS1s
  Link #                        N        N        N        N
  Lane #                        PAD      PAD      PAD      PAD
Step 2: Upstream Port TS1s
  Link #                        N        N        PAD      N
  Lane #                        PAD      PAD      PAD      PAD

Lane Number Negotiation.
3. Once the Downstream Port has received the TS1s with the same Link
   number on Lanes 0, 1 and 3, it waits until the required timeout
   period hoping that Lane 2 will start working. When that doesn't
   happen, the Downstream Port realizes that it will only be able to
   train as a x2 Link. After accepting this fact, the Downstream Port
   will advertise its Lane numbers for Lanes 0 and 1, but Lanes 2 and 3
   go back to sending PAD in the Link and Lane number fields.
4. When the Upstream Port receives the TS1s on Lanes 0 and 1 with the
   advertised Lane numbers and it sees that Lane 3 has gone back to
   receiving PAD TS1s, it advertises its Lane number for Lanes 0 and 1
   but all the other Lanes start (or continue) sending TS1s with PAD set
   in both the Lane and Link number fields as shown in Figure 14-22 on
   page 551.


Figure 14-22: Example 3 - Steps 3 and 4

Both Ports support one Link of x4, x2 or x1. Physical Lanes 0-3 of the
Downstream Port connect to physical Lanes 0-3 of the Upstream Port.

                                Lane 0   Lane 1   Lane 2   Lane 3
Step 3: Downstream Port TS1s
  Link #                        N        N        PAD      PAD
  Lane #                        0        1        PAD      PAD
Step 4: Upstream Port TS1s
  Link #                        N        N        PAD      PAD
  Lane #                        0        1        PAD      PAD

Confirming Link and Lane Numbers.
5. Since the transmitted and received Link and Lane numbers matched on
   Lanes 0 and 1, the Downstream Port indicates it is ready to conclude
   this negotiation and proceed to the next state, L0, by sending TS2
   Ordered Sets with the same Link and Lane numbers on these Lanes.
   The other Lanes continue sending TS1s with PAD for both the Link and
   Lane numbers.
6. Upon receiving TS2s with the same Link and Lane numbers on Lanes 0
   and 1, the Upstream Port also indicates its readiness to leave the
   Configuration state and proceed to L0 by sending TS2s back on these
   Lanes. The other Lanes continue sending TS1s with PAD for both the
   Link and Lane numbers. This is shown in Figure 14-23 on page 552.


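Figure 14-23: Example 3 - Steps 5 and 6

Both Ports support one Link of x4, x2 or x1. Physical Lanes 0-3 of the
Downstream Port connect to physical Lanes 0-3 of the Upstream Port.

                                Lane 0   Lane 1   Lane 2   Lane 3
Step 5: Downstream Port
  Ordered Set                   TS2      TS2      TS1      TS1
  Link #                        N        N        PAD      PAD
  Lane #                        0        1        PAD      PAD
Step 6: Upstream Port
  Ordered Set                   TS2      TS2      TS1      TS1
  Link #                        N        N        PAD      PAD
  Lane #                        0        1        PAD      PAD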

Once a Port receives at least 8 TS2s and transmits at least 16, it sends
some logical idle data and those Lanes transition to L0. The other Lanes,
Lanes 2 and 3 in this example, transition to Electrical Idle until the next
time the link training process is initiated, at which point those Lanes
will attempt the training process like normal.
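
The width decision in this example can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the `supported_widths` parameter are hypothetical, Lanes must be contiguous starting at Lane 0, and Lane Reversal is not considered.

```python
def negotiated_width(lane_responded, supported_widths=(4, 2, 1)):
    """Largest supported width whose Lanes 0..width-1 all responded.

    lane_responded[i] is True if Lane i returned the advertised Link
    number. Returns 0 when not even a x1 Link on Lane 0 can be formed.
    """
    for width in sorted(supported_widths, reverse=True):
        if width <= len(lane_responded) and all(lane_responded[:width]):
            return width
    return 0


# Example 3: Lane 2 never returns the Link number, so Lane 3 is dropped too.
print(negotiated_width([True, True, False, True]))
```

Lane 3 is working in this example, but it cannot be part of the Link because a x4 Link would need Lane 2 and there is no x3 option, which is why both Lanes 2 and 3 end up in Electrical Idle.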

Detailed Configuration Substates


A detailed explanation of each substate is presented here to cover all the
substates of Configuration, as shown in Figure 14-24 on page 553. The
Configuration Substates should be easier to follow, given the Link Training
examples discussed previously.


Figure 14-24: Configuration State Machine

[State diagram. Entry from Polling or Recovery leads to
Config.Linkwidth.Start, followed in order by Config.Linkwidth.Accept,
Config.Lanenum.Wait, Config.Lanenum.Accept, Config.Complete and Config.Idle.
Other labeled exits: directed exit to Loopback, directed exit to Disable,
exit to Detect, and exit to Recovery. From Config.Idle, a 2 ms timeout leads
to Recovery if the maximum number of Recovery attempts hasn't been reached,
and to Detect if it has. After 8 Idle symbols are received and 16 are
transmitted, the Port exits to L0 (Full-On Power State), where packet
transmission/reception begins.]

Configuration.Linkwidth.Start
This substate is entered after either the normal completion of the Polling
state (as described in "Polling.Configuration" on page 527), or if the
Recovery state finds that Link or Lane numbers have changed since the last
time they were assigned and thus the recovery process can't finish normally
(as described in the "Recovery State" on page 571).

Downstream Lanes.

During Configuration.Linkwidth.Start
The Downstream Port is now the leader on this Link and sends TS1s with a
non-PAD Link number on all active Lanes (as long as LinkUp is not set and
upconfiguration of the Link width is not taking place). In the TS1s, the
Link number field is changed from PAD to a number while the Lane number
remains PAD. The only constraint on the value of the Link numbers in the
spec is that they must be unique for each possible Link if multiple Links
are supported. For example, a x8 Link would have the same Link number on all
8 Lanes, but if it could also be configured as two x4 Links, the two groups
of 4 Lanes would be assigned different Link numbers, such as 5 for one group
and 6 for the other. The values are local to the Link partners and there's
no need for software to track them or try to make them unique throughout the
system.

If the upconfigure_capable bit is set to 1b, these TS1s will also be sent on
any inactive Lanes that received two consecutive TS1s with Link and Lane
numbers set to PAD.

- When entering this substate from Polling, any Lane that detected a
  Receiver is considered active.
- When entering from Recovery, any Lane that was part of the Link after
  going through Configuration.Complete is considered an active Lane.
- All supported data rates must be advertised in the TS1s, even if the
  Port doesn't intend to use them.

Crosslinks. For cases where LinkUp = 0b and the optional crosslink
capability is supported, all Lanes that detected a Receiver must first send
16 to 32 TS1s with a non-PAD Link number and PAD Lane number. After that,
the port will evaluate what it is receiving to see if a crosslink is
present.

Upconfiguring the Link Width. If LinkUp = 1b and the LTSSM wants to
upconfigure the Link, TS1s with Link and Lane numbers set to PAD are sent on
the currently active Lanes, the inactive Lanes it intends to activate, and
the Lanes that have seen incoming TS1s. When the Lanes have received two
consecutive TS1s coming back, or after 1 ms, the Link number is assigned a
value in the TS1s being sent.

- If activating an inactive Lane, the Transmitter must wait for the Tx
  common mode voltage to settle before exiting Electrical Idle and sending
  TS1s.
- Link numbers must be the same for Lanes that will be grouped into a
  Link. The numbers can only be different for groups of Lanes that are
  capable of acting as a unique Link.

Exit to Configuration.Linkwidth.Accept
Any Lanes that previously received at least one TS1 with Link and Lane
numbers of PAD, and that now receive two consecutive TS1s with a non-PAD
Link number that matches a transmitted Link number while the Lane numbers
are still PAD, will exit to the Configuration.Linkwidth.Accept substate.
Exit to Configuration.Linkwidth.Start
If the first set of received TS1s for this substate have a non-PAD Link
number then it's understood that a crosslink is present and the Link
neighbor is also behaving as a Downstream Port. To handle this situation,
the Downstream Lanes are changed to Upstream Lanes and a random crosslink
timeout is chosen. The next substate will be the same
Configuration.Linkwidth.Start again but the Lanes will now behave as
Upstream Lanes.

This supports the optional behavior when both Link partners behave as
Downstream Ports. The solution for this situation is to change both to
Upstream Ports and assign each a random timeout that, when it expires,
changes it to a Downstream Port. Since the timeouts won't be the same,
eventually one Port is seen as Downstream while the other is seen as
Upstream and then the training can go forward. The timeout must be random so
that even if two of the same devices are connected any possible deadlock
will eventually be broken.
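
The role of the random timeout can be illustrated with a toy model. Everything here is an assumption made for the sketch: the function name, the Port labels "A" and "B", and the `max_timeout_us` range are hypothetical, and real hardware would of course re-enter the substate rather than draw numbers in a loop.

```python
import random


def resolve_crosslink(rng, max_timeout_us=1000):
    """Toy model: both Ports act as Upstream and draw a random timeout.

    The Port whose timeout expires first switches to Downstream and leads.
    If both draw the same value, neither gains the lead, so both draw again.
    Returns the leading Port and the pair of timeouts that decided it.
    """
    while True:
        t_a = rng.randrange(max_timeout_us)
        t_b = rng.randrange(max_timeout_us)
        if t_a != t_b:
            leader = "A" if t_a < t_b else "B"
            return leader, (t_a, t_b)


leader, timeouts = resolve_crosslink(random.Random(1))
print(leader, timeouts)
```

Two identical devices running identical logic would tie forever with a fixed timeout; independent random draws make repeated ties increasingly unlikely, which is the deadlock-breaking property described above.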

If crosslinks are supported, receiving a sequence of TS1s that first have a
Link number of PAD and later have a non-PAD Link number that matches the
transmitted Link number is valid only if the sequence wasn't interrupted by
a TS2.

Exit to Disable State
If the Port is instructed by a higher layer to send TS1s or TS2s with the
Disable Link bit asserted on all detected Lanes. Normally, the Downstream
Port will initiate this but, for the optional crosslink case, it could
become an Upstream Port instead and then Disabled will be the next state if
2 consecutive TS1s are received with the Disable Link bit set.

Exit to Loopback State
If the loopback-capable Transmitter is instructed by a higher layer to send
TS Ordered Sets with the Loopback bit asserted, or if Lanes that are sending
TS1s receive 2 consecutive TS1s with the Loopback bit set. Whichever Port
sends the TS1s with the bit set will become the Loopback master, while the
Port that receives them will become the Loopback slave.


Exit to Detect State
After a 24 ms timeout if none of the other conditions are true.

Upstream Lanes.
During Configuration.Linkwidth.Start
The Upstream Port is now the follower on this Link and goes back to sending
TS1 ordered sets with PAD set for the Link and Lane number fields. It will
continue to do this until it begins receiving TS1s with a non-PAD Link
number from the Downstream Port (leader).

The Upstream Port sends TS1s with Link and Lane values of PAD on a) all
active Lanes, b) the Lanes it wants to upconfigure and, c) if
upconfigure_capable is set to 1b, on each of the inactive Lanes that have
received two consecutive TS1s with Link and Lane numbers set to PAD while in
this substate.

- When entering this substate from Polling, any Lane that detected a
  Receiver is considered active.
- When entering from Recovery, any Lane that was part of the Link after
  going through Configuration.Complete is considered an active Lane. If
  the transition wasn't caused by an LTSSM timeout, the Transmitter must
  set the Autonomous Change bit (Symbol 4, bit 6) to 1b in the TS1s being
  sent in the Configuration state if it does, in fact, plan to change the
  Link width for autonomous reasons.
- All supported data rates must be advertised in the TS1s, even if the
  Port doesn't intend to use them.

Crosslinks. For cases where LinkUp = 0b and the optional crosslink
capability is supported, all Lanes that detected a Receiver must first send
16 to 32 TS1s with Link and Lane values of PAD. After that, the port will
evaluate what it is receiving to see if a crosslink is present.
Exit to Configuration.Linkwidth.Accept
If any Lanes receive two consecutive TS1s with a non-PAD Link number and PAD
Lane number, this port transitions to the Configuration.Linkwidth.Accept
substate where one of the received Link numbers is selected for those Lanes
and TS1s are sent back with that Link number and a PAD Lane number, on all
the Lanes that received TS1s with a non-PAD Link number. Any leftover Lanes
that detected a Receiver but no Link number must send TS1s with Link and
Lane numbers set to PAD.
- If upconfiguring the Link, the LTSSM waits until it receives two
  consecutive TS1s with a non-PAD Link number and PAD Lane number on
  either a) all the inactive Lanes it wants to activate, or b) on any Lane
  1 ms after entering this substate, whichever is earlier. After that, it
  sends TS1s with the selected Link number along with PAD Lane numbers.
- To avoid configuring a Link smaller than necessary, it's recommended
  that a multi-Lane Link that sees an error or loses Block Alignment on
  some Lanes delay this Receiver evaluation. For 8b/10b encoding, it
  should wait at least two more TS1s, while for 128b/130b encoding it
  should wait for at least 34 TS1s, but never more than 1 ms in any case.
- After activating an inactive Lane, the Transmitter must wait for the Tx
  common mode voltage to settle before exiting Electrical Idle and sending
  TS1s.
Exit to Configuration.Linkwidth.Start
After a crosslink timeout, send 16 to 32 TS2s with Link and Lane values of
PAD. The Upstream Lanes change to Downstream Lanes and the next substate
will be the same Configuration.Linkwidth.Start again but this time the Lanes
behave as Downstream Lanes. For the case of two Upstream Ports connected
together, this optional behavior allows one of them to eventually take the
lead as a Downstream Port.
Exit to Disable State
If either of the following is true:
- Any Lanes that are sending TS1s also receive TS1s with the Disable Link
  bit asserted.
- The optional crosslink is supported and either all Lanes that are
  sending and receiving TS1s receive the Disable Link bit in two
  consecutive TS1s, or else a crosslink Port is directed by a higher Layer
  to assert the Disable bit in its TS1s and TS2s on all Lanes that
  detected a Receiver.

Exit to Loopback State
If a loopback-capable Transmitter is directed by a higher Layer to send TS
Ordered Sets with the Loopback bit asserted or all Lanes that are sending
and receiving TS1s receive 2 consecutive TS1s with the Loopback bit set.
Whichever Port sends the TS1s with the bit set will become the Loopback
master, while the Port that receives them will become the Loopback slave.

Exit to Detect State
After a 24 ms timeout if none of the other conditions are true.


Configuration.Linkwidth.Accept
At this point, the Upstream Port is now sending back TS1 ordered sets on all
its Lanes with the same Link number. The Link number originated from the
Downstream Port, and the Upstream Port is simply reflecting that value back
on all its Lanes. Now the Downstream Port knows the Link width (number of
Lanes receiving the same Link number) and it must start advertising the Lane
numbers. So the leader (Downstream Port) continues sending TS1s, but now
with the actual Lane numbers designated instead of PAD. Also, all these TS1s
will have the same Link number. The detailed behavior for the Downstream and
Upstream Lanes is outlined below:

Downstream Lanes
During Configuration.Linkwidth.Accept
The Downstream Port will now initiate Lane numbers. If a Link can be formed
from at least one group of Lanes that all receive two consecutive TS1s and
all see the same Link number, then TS1s are sent that keep that same Link
number but now assign unique, non-PAD Lane numbers as well.

Exit to Configuration.Lanenum.Wait
The Downstream Port does not stay in the Configuration.Linkwidth.Accept
substate very long. Once it has received the necessary TS1s from the
Upstream Port indicating the Link width, it updates any internal state info
that is required, starts sending TS1s with non-PAD Lane numbers, as
indicated above, and immediately transitions to Configuration.Lanenum.Wait
to await Lane Number confirmation from the Upstream Port.

Upstream Lanes
During Configuration.Linkwidth.Accept
The Upstream Port transmits TS1s where one of the received Link numbers is
selected and sent back in the TS1s on all the Lanes that received TS1s with
a non-PAD Link number. Any leftover Lanes that detected a Receiver but no
Link number must send TS1s with Link and Lane numbers set to PAD.

Exit to Configuration.Lanenum.Wait
The Upstream Port must respond to the Lane numbers proposed to it by the
Link neighbor. If a Link can be formed using Lanes that sent a non-PAD Link
number on their TS1s and received two consecutive TS1s with the same Link
number and any non-PAD Lane number, then it should send TS1s that match the
same Lane number assignments, if possible, or are different if necessary
(such as with the optional Lane reversal).


Configuration.Lanenum.Wait

Prior to discussing the Configuration.Lanenum.Wait state, some background
information may be helpful. Lane numbers are assigned sequentially from zero
to the maximum number possible for a Link. For example, a x8 Link will be
assigned Lane numbers 0-7. Ports are required to support a Link as wide as
the number of Lanes they have and as small as one Lane. The Lanes will
always start with Lane 0 and must be both sequential and contiguous. For
example, if some Lanes on a x8 Port aren't working, it might optionally be
designed to configure a x4 Link and, if so, it would need to use Lanes 0-3.
As another example, if Lane 2 of a x8 Port is not working, it wouldn't be
possible to use Lanes 0, 1, 3, and 4 to form a x4 Link because the Lanes
wouldn't be contiguous. Any leftover Lanes must send TS1s with Link and Lane
set to PAD.

A common timing consideration is repeated many times in the spec for the
Configuration substates. Rather than repeat it for every case here, just be
aware that it applies in general to both Upstream and Downstream Ports:

To avoid configuring a Link smaller than necessary, it's recommended that a
multi-Lane Port delay the final link width evaluation if it sees an error or
loses Block Alignment on some Lanes. For 8b/10b, it should wait at least two
more TS1s, while for 128b/130b mode it should wait for at least 34 TS1s, but
never more than 1 ms in any case. The idea is that the Lanes might need
settling time after powering up or being reset.
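
To put the recommended waits in perspective, the arithmetic below estimates how long the minimum number of extra TS1s takes on one Lane. It assumes a TS1 is 16 symbols of 10 bits each under 8b/10b encoding and one 130-bit block under 128b/130b encoding; the function name is hypothetical and SKP Ordered Sets or other interleaved traffic are ignored.

```python
def ts1_time_ns(rate_gt_per_s, encoding):
    """Approximate time to send one TS1 on one Lane, in nanoseconds."""
    if encoding == "8b/10b":
        bits = 16 * 10   # 16 symbols, 10 bits each on the wire
    elif encoding == "128b/130b":
        bits = 130       # 2 sync bits plus 16 symbols of 8 bits
    else:
        raise ValueError("unknown encoding")
    # One bit per transfer, so GT/s is numerically Gbit/s and bits per ns.
    return bits / rate_gt_per_s


wait_gen1 = 2 * ts1_time_ns(2.5, "8b/10b")      # two more TS1s
wait_gen3 = 34 * ts1_time_ns(8.0, "128b/130b")  # 34 more TS1s
print(wait_gen1, wait_gen3)
```

Under these assumptions both minimum waits are well under a microsecond, far below the 1 ms upper bound, so the bound only matters when TS1s are not arriving cleanly.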

Exit to Detect State
After a 2 ms timeout if no Link can be configured (e.g., Lane 0 is not
working and Lane Reversal isn't available), or if all Lanes receive two
consecutive TS1s with PAD in both the Link and Lane numbers, the link must
exit to the Detect State.

Downstream Lanes

During Configuration.Lanenum.Wait
The Downstream Port will continue to transmit TS1s with the non-PAD Link and
Lane numbers until one of the exit conditions is met.

Exit to Configuration.Lanenum.Accept
If either of the cases listed below is true:

- If two consecutive TS1s have been received on all Lanes with Link and
  Lane numbers that match what is being transmitted on those Lanes.
- If any Lanes that detected a Receiver see two consecutive TS1s with a
  Lane number different from when the Lane first entered this substate and
  at least some Lanes see a non-PAD Link number. The spec points out that
  this allows the two Ports to settle on a mutually acceptable Link width.

Exit to Detect State
After a 2 ms timeout or if all Lanes receive two consecutive TS1s with Link
and Lane numbers set to PAD.

Upstream Lanes
During Configuration.Lanenum.Wait
The Upstream Port will continue to transmit TS1s with the non-PAD Link and
Lane numbers until one of the exit conditions is met.

Exit to Configuration.Lanenum.Accept
If either of the cases listed below is true:

- If any Lanes receive two consecutive TS2s.
- If any Lanes receive two consecutive TS1s with a Lane number different
  from when the Lane first entered this substate and at least some Lanes
  see a non-PAD Link number.

Note that Upstream Lanes are allowed to wait up to 1 ms before changing to
that substate, so as to prevent received errors or skew between Lanes from
affecting the final Link configuration.

Exit to Detect State
After a 2 ms timeout or if all Lanes receive two consecutive TS1s with Link
and Lane numbers set to PAD.

Configuration.Lanenum.Accept
Downstream Lanes
During Configuration.Lanenum.Accept
The Downstream Port has now received TS1s with non-PAD Link and Lane
numbers. It is at this point that the Downstream Port must decide if a Link
can be established with the Lane numbers returned by the Upstream Port. The
three possible state transitions are listed below.
Exit to Configuration.Complete
If two consecutive TS1s are received with the same non-PAD Link and Lane
numbers, and they match the Link and Lane numbers being transmitted in the
TS1s for all the Lanes, then the Upstream Port has agreed with the Link and
Lane numbers advertised by the Downstream Port and the next substate is
Configuration.Complete. Alternatively, if the Lane numbers in the received
TS1s are reversed from what the Downstream Port advertised and the
Downstream Port supports Lane Reversal, it can still proceed to
Configuration.Complete while using the reversed Lane numbers.

The spec points out that the Reversed Lane condition is strictly defined as
Lane 0 receiving TS1s with the highest Lane number (total number of Lanes
- 1) and the highest Lane number receiving TS1s with Lane number of zero.
One thing that can be understood from this is the answer to a question that
comes up in class sometimes: "Can the Lane numbers be mixed up, rather than
sequential?" The answer is no, they must be from 0 to n-1 or from n-1 to 0;
no other options are supported.
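
The rule that Lane numbers must run from 0 to n-1 or from n-1 to 0 can be expressed as a short check. The function name and the three result strings are hypothetical labels chosen for this sketch; the input is the list of Lane numbers received on physical Lanes 0 through n-1.

```python
def classify_lane_numbers(received):
    """Classify received Lane numbers, indexed by physical Lane."""
    n = len(received)
    if received == list(range(n)):
        return "match"        # 0 to n-1: numbering agrees
    if received == list(range(n - 1, -1, -1)):
        return "reversed"     # n-1 to 0: needs the optional Lane Reversal
    return "unsupported"      # any other ordering cannot form this Link


print(classify_lane_numbers([0, 1, 2, 3]))
print(classify_lane_numbers([3, 2, 1, 0]))
print(classify_lane_numbers([0, 2, 1, 3]))
```

A "reversed" result only leads to a working Link if at least one of the two Ports implements Lane Reversal.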

IftheConfigurationstatewasenteredfromtheRecoverystate,abandwidth
changemayhavebeenrequested.Ifso,statusbitswillbeupdatedtoreport
thenatureofwhathappened.Basically,thesystemneedstoreportwhether
this change was initiated because the Link wasnt working reliably or
becausehardwareissimplymanagingtheLinkpower.Thebitsareupdated
asfollows:

• If the bandwidth change was initiated by the Downstream Port because of a reliability problem, the Link Bandwidth Management Status bit is set to 1b.
• If the bandwidth change was not initiated by the Downstream Port but the Autonomous Change bit in two consecutive received TS1s is cleared to 0b, the Link Bandwidth Management Status bit is set to 1b.
• Otherwise the Link Autonomous Bandwidth Status bit is set to 1b.
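
The three rules above amount to a small decision function. The following is only a sketch of that logic; the function and parameter names are hypothetical, and the return value is simply the name of the status bit that would be set to 1b.

```python
def bandwidth_status_bit(initiated_by_dsp, for_reliability, rx_autonomous_change_bit):
    """Pick which Link Status bit reports a bandwidth change.

    initiated_by_dsp: the Downstream Port initiated the change.
    for_reliability: the Downstream Port did so because of a reliability problem.
    rx_autonomous_change_bit: Autonomous Change bit (0 or 1) seen in two
        consecutive received TS1s.
    """
    if initiated_by_dsp and for_reliability:
        return "Link Bandwidth Management Status"
    if not initiated_by_dsp and rx_autonomous_change_bit == 0:
        return "Link Bandwidth Management Status"
    return "Link Autonomous Bandwidth Status"


print(bandwidth_status_bit(True, True, 0))
print(bandwidth_status_bit(False, False, 1))
```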

Exit to Configuration.Lanenum.Wait
If a configured Link can be formed with some but not all of the Lanes that receive two consecutive TS1s with the same non-PAD Link and Lane numbers, those Lanes send TS1s with the same Link number and new Lane numbers. The object is to use a smaller group of Lanes to achieve a working Link.

The new Lane numbers must start with zero and increase sequentially to cover the Lanes that will be used. Any Lanes that don't receive TS1s can't be part of the group and will disrupt the Lane numbering. Any leftover Lanes must send TS1s with Link and Lane set to PAD. For example, if 8 Lanes are available, but Lane 2 doesn't see incoming TS1s, then the Link can't consist of a group that would need Lane 2. Consequently, the x8 and x4 options would not be available, and only a x1 or x2 Link is possible.
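
The x8 example above can be checked with a short sketch. It assumes, as the example does, that a Link of width w must use physical Lanes 0 through w-1 (no Lane Reversal) and that the candidate widths are x8, x4, x2 and x1; the function name is hypothetical.

```python
def usable_link_widths(lanes_with_ts1s, candidate_widths=(8, 4, 2, 1)):
    """Return the candidate widths whose Lanes 0..w-1 all received TS1s."""
    good = set(lanes_with_ts1s)
    return [w for w in candidate_widths
            if all(lane in good for lane in range(w))]


# Eight Lanes available, but Lane 2 sees no incoming TS1s.
print(usable_link_widths([0, 1, 3, 4, 5, 6, 7]))
```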


Exit to Detect State
If no Link can be configured, or if all Lanes receive two consecutive TS1s with PAD for Link and Lane numbers.

Upstream Lanes

During Configuration.Lanenum.Accept
The Upstream Port has now received either TS2s or TS1s with non-PAD Link and Lane numbers. It is at this point that the Upstream Port must decide if a Link can be established with the Lane numbers sent by the Downstream Port. The three possible state transitions are listed below.

Exit to Configuration.Complete
If two consecutive TS2s are received with the same non-PAD Link and Lane numbers, and they match the Link and Lane numbers being transmitted in the TS1s for those Lanes, all is well and the next substate will be Configuration.Complete.

Exit to Configuration.Lanenum.Wait
If a configured Link can be formed with a subset of Lanes that receive two consecutive TS1s with the same non-PAD Link and Lane numbers, those Lanes send TS1s with the same Link number and new Lane numbers. The object is to use a smaller group of Lanes to achieve a working Link. The next substate in this case will be Configuration.Lanenum.Wait.

As was the case for the Downstream Lanes, the new Lane numbers must start with zero and increase sequentially to cover the Lanes that will be used. Any Lanes that don't receive TS1s can't be part of the group and will disrupt the Lane numbering. Any leftover Lanes must send TS1s with Link and Lane set to PAD.

Exit to Detect State
If no Link can be configured, or if all Lanes receive two consecutive TS1s with PAD for Link and Lane numbers, then the next state will be Detect.

Configuration.Complete
This is the only substate of the Configuration state where TS2s are exchanged. As discussed before, the purpose of TS2s is a handshake, or confirmation between the two devices on the Link that they are ready to proceed to the next state. So this is the final confirmation of the Link and Lane numbers exchanged in the TS1s leading up to this point.


It should be noted that Devices are allowed to change their supported data rates and upconfigure capability when they enter this substate, but not while in it. This is because Devices record the capabilities of their Link partner from what is advertised in these TS2s, as will be described in this section.

Downstream Lanes
During Configuration.Complete
TS2s are sent using the Link and Lane numbers that match the received TS1s. The TS2s can have the Upconfigure Capability bit set if the Port supports a x1 Link using Lane 0 and is able to upconfigure the Link.

For 8b/10b encoding, Lane de-skewing must be completed when leaving this substate. Also, scrambling will be disabled if all configured Lanes see two consecutive TS2s with the Disable Scrambling bit set. The Port that sends these must also disable scrambling. Note that scrambling cannot be disabled when in 128b/130b mode because of the necessary contribution it makes to signal integrity.

The Downstream Port is transmitting TS2s and watching for TS2s coming back. For future reference, it records the N_FTS field from the incoming TS2s, which gives the number of FTSs that must be sent when exiting from the L0s state.

Exit to Configuration.Idle
The next state will be Configuration.Idle when all Lanes sending TS2s receive 8 consecutive TS2s with matching Link and Lane numbers (non-PAD), matching rate identifiers, and a matching Link Upconfigure Capability bit in all of them. At least 16 TS2s must also be sent after receiving one TS2.

If the device supports rates greater than 2.5 GT/s, it must record the rate identifier received on any configured Lane and this overrides any previously recorded value. The variable used to track speed changes in Recovery, changed_speed_recovery, is cleared to zero.

The variable upconfigure_capable is set to 1b if the device sends TS2s with Link Upconfigure Capability set to 1b and receives 8 consecutive TS2s with the same bit set. Otherwise it's cleared to zero.

Any Lanes that aren't configured as part of the Link are no longer associated with the LTSSM in progress and must either be:

• Associated with a new LTSSM, or
• Transitioned to Electrical Idle
  a) A special case arises if those Lanes had been configured as part of the Link through L0 previously and LinkUp has remained set at 1b since then. They must remain associated with the same LTSSM if the Link is upconfigure capable. For that case, it's also recommended that those Lanes leave their Receiver terminations on because they'll become part of the Link again if it is upconfigured. If the terminations aren't left on, they must be turned on from when the LTSSM enters the Recovery.RcvrCfg state all the way through Configuration.Complete. Lanes that weren't part of the Link before can't become part of it through this process, though.
  b) For the optional crosslink, Receiver terminations must be between ZRX-HIGH-IMP-DC-POS and ZRX-HIGH-IMP-DC-NEG.
  c) If the LTSSM goes back to Detect, these Lanes will once again be associated with it.
  d) No EIOS is needed before Lanes go to Electrical Idle, and the transition doesn't have to happen on Symbol or Ordered Set boundaries.

After a 2 ms timeout:

Exit to Configuration.Idle

Next state is Configuration.Idle if the idle_to_rlock_transitioned variable is less than FFh and the current data rate is 8.0 GT/s.

In this transition, the changed_speed_recovery variable is cleared to zero. Also, the upconfigure_capable variable may be updated, though it's not required to do so, if at least one Lane saw eight consecutive TS2s with matching Link and Lane numbers (non-PAD). If the transmitted and received Link Upconfigure Capability bits are 1b, set it to 1b, otherwise clear it to zero.

Lanes that aren't part of the configured Link aren't associated with the LTSSM in progress and have the same requirements as the non-timeout case listed above.

Exit to Detect State
Otherwise, the next state is Detect.

Upstream Lanes

During Configuration.Complete
TS2s are sent using the Link and Lane numbers that match the received TS2s. The TS2s can have the Upconfigure Capability bit set if the Port supports a x1 Link using Lane 0 and is able to upconfigure the Link.


For 8b/10b encoding, Lane de-skewing must be completed when leaving this substate. Also, scrambling will be disabled if all configured Lanes see two consecutive TS2s with the Disable Scrambling bit set. The Port that sends these must also disable scrambling. Note that scrambling cannot be disabled when in 128b/130b mode because of the necessary contribution it makes to signal integrity.

In this substate, the Upstream Port is receiving TS2s from the Downstream Port and, for future reference, should record the N_FTS field value (the number of FTSs that must be sent when exiting from the L0s state) from the incoming TS2s.

Exit to Configuration.Idle
The next state will be Configuration.Idle when all Lanes sending TS2s receive 8 consecutive TS2s with matching Link and Lane numbers (non-PAD), matching rate identifiers, and a matching Link Upconfigure Capability bit in all of them. At least 16 TS2s must also be sent after receiving one TS2.

If the device supports rates greater than 2.5 GT/s, it must record the rate identifier received on any configured Lane, overriding any previously recorded value. The variable used to track speed changes in Recovery, changed_speed_recovery, is cleared to zero.

The variable upconfigure_capable is set to 1b if the device sends TS2s with Link Upconfigure Capability set to 1b and receives 8 consecutive TS2s with the same bit set. Otherwise it's cleared to zero.

Any Lanes that aren't configured as part of the Link are no longer associated with the LTSSM in progress and must either be:

• Optionally associated with a new crosslink LTSSM (if this feature is supported), or
• Transitioned to Electrical Idle
  a) A special case arises if those Lanes had been configured as part of the Link through L0 previously and LinkUp has remained set at 1b since then. They must remain associated with the same LTSSM if the Link is upconfigure capable. For that case, it's also recommended that those Lanes leave their Receiver terminations on because they'll become part of the Link again if it is upconfigured. If they're not left on, they must be turned on from when the LTSSM enters the Recovery.RcvrCfg state all the way through Configuration.Complete. Lanes that weren't part of the Link before can't become part of it through this process, though.
  b) Receiver terminations must be between ZRX-HIGH-IMP-DC-POS and ZRX-HIGH-IMP-DC-NEG.
  c) If the LTSSM goes back to Detect, these Lanes will once again be associated with it.
  d) No EIOS is needed before Lanes go to Electrical Idle, and the transition doesn't have to happen on Symbol or Ordered Set boundaries.

After a 2 ms timeout:

Exit to Configuration.Idle

Next state is Configuration.Idle if the idle_to_rlock_transitioned variable is less than FFh and the current data rate is 8.0 GT/s.

In this transition, the changed_speed_recovery variable is cleared to zero. Also, the upconfigure_capable variable may be updated, though it's not required to do so, if at least one Lane saw eight consecutive TS2s with matching Link and Lane numbers (non-PAD). If the transmitted and received Link Upconfigure Capability bits are 1b, set it to 1b, otherwise clear it to zero.

Lanes that aren't part of the configured Link aren't associated with the LTSSM in progress and have the same requirements as the non-timeout case listed above.

Exit to Detect State
Otherwise, the next state is Detect.

Configuration.Idle
During Configuration.Idle
In this substate, the transmitter is sending Idle data and waiting for the minimum number of received Idle data so this Link can transition to L0. During this time, the Physical Layer reports to the upper layers that the Link is operational (LinkUp = 1b).

For 8b/10b encoding, the transmitter is sending Idle data on all configured Lanes. Idle data are just data zeros that get scrambled and encoded.

For 128b/130b encoding, the transmitter sends one SDS Ordered Set on all configured Lanes followed by Idle data Symbols. The first Idle Symbol on Lane 0 is the first Symbol of the Data Stream.


Exit to L0 State
If using 8b/10b encoding, the next state is L0 if 8 consecutive Idle data Symbol Times are received on all configured Lanes, and 16 Symbol Times of Idle data were sent after receiving one Idle Symbol.

If using 128b/130b, the next state is L0 if 8 consecutive Idle data Symbols are received on all configured Lanes, 16 Idles were sent after receiving one Idle Symbol, and this state wasn't entered by a timeout from Configuration.Complete.

• Lane-to-Lane de-skew must be completed before Data Stream processing begins.
• The Idle Symbols must be received in Data Blocks.
• If software set the Retrain Link bit in the Link Control register since the last transition to L0 from Recovery or Configuration, the Downstream Port must set the Link Bandwidth Management Status bit in the Link Status register to 1b to indicate that this change was not hardware-initiated (autonomous).
• The idle_to_rlock_transitioned variable is cleared to 00h on transition to L0.
After a 2 ms timeout:
Exit to Detailed Recovery Substates
If the idle_to_rlock_transitioned variable is less than FFh, the next state is Recovery (Recovery.RcvrLock). Then:

a) For 8.0 GT/s, increment idle_to_rlock_transitioned by 1.
b) For 2.5 or 5.0 GT/s, set idle_to_rlock_transitioned to FFh.
c) NOTE: This variable counts the number of times the LTSSM has transitioned from this state to the Recovery state because the sequence isn't working. The problem may be that equalization hasn't been properly adjusted or that the selected speed just isn't going to work, and the Recovery state will take steps to address these issues. This variable limits the number of these attempts so as to avoid an endless loop. If the Link still isn't working after 255 such attempts (when the count reaches FFh), go back to Detect and start over, hoping for a better result.
Exit to Detect State
Otherwise (meaning idle_to_rlock_transitioned = FFh), the next state is Detect.
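
The timeout handling in Configuration.Idle can be sketched as a small function. This is an illustrative model only: the function name, the tuple return value and the use of a float for the data rate are assumptions, not anything defined by the spec.

```python
def configuration_idle_timeout(idle_to_rlock_transitioned, rate_gts):
    """Model the 2 ms timeout in Configuration.Idle.

    Returns (next_state, updated idle_to_rlock_transitioned).
    """
    if idle_to_rlock_transitioned < 0xFF:
        if rate_gts == 8.0:
            updated = idle_to_rlock_transitioned + 1   # allow another retry
        else:
            updated = 0xFF                              # 2.5 or 5.0 GT/s: one retry only
        return "Recovery.RcvrLock", updated
    return "Detect", idle_to_rlock_transitioned


# Count how many times the Link may retry through Recovery at 8.0 GT/s
# before the LTSSM gives up and returns to Detect.
counter, retries = 0x00, 0
while True:
    state, counter = configuration_idle_timeout(counter, 8.0)
    if state == "Detect":
        break
    retries += 1
print(retries)
```

At 2.5 or 5.0 GT/s the same function permits a single pass through Recovery, because the counter jumps straight to FFh.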


L0 State
This is the normal, fully operational Link state, during which Logical Idle, TLPs and DLLPs are exchanged between Link neighbors. L0 is achieved immediately following the conclusion of the Link Training process. The Physical Layer also notifies the upper layers that the Link is ready for operation, by setting the LinkUp variable. In addition, the idle_to_rlock_transitioned variable is cleared to 00h.

Exit to Recovery State
The next state will be Recovery if a change in the Link speed or Link width is indicated, or if the Link partner initiates this by going to Recovery or Electrical Idle. Let's consider each of these three cases in a little more detail in the following discussion.

Speed Change
Two conditions are described in the spec that will cause an automatic change in speed.

The first is when rates higher than 2.5 GT/s are supported by both partners and the Link is active (Data Link Layer reports DL_Active), or when one partner requests a speed change in its TS Ordered Sets. For example, a Downstream Port will initiate a speed change if a higher rate was noted and software writes the Retrain Link bit after setting the Target Link Speed field (see Figure 14-26 on page 569) to a different rate than the current rate.

The second condition is when both partners support 8.0 GT/s and one of them wants to perform Tx Equalization. In both conditions the directed_speed_change variable will be set to 1b and the changed_speed_recovery variable will be cleared to 0b.

A Port will not attempt a speed change (the directed_speed_change variable won't be set) if a rate higher than 2.5 GT/s has never been seen as advertised by the other Port in the Configuration.Complete or Recovery.RcvrCfg substates.


Figure 14-25: Link Control Register

Bits     Field
15:12    RsvdP
11       Link Autonomous Bandwidth Interrupt Enable
10       Link Bandwidth Management Interrupt Enable
9        Hardware Autonomous Width Disable
8        Enable Clock Power Management
7        Extended Synch
6        Common Clock Configuration
5        Retrain Link
4        Link Disable
3        Read Completion Boundary Control
2        RsvdP
1:0      Active State PM Control

Figure 14-26: Link Control 2 Register

Bits     Field
15:12    Compliance Preset/De-emphasis
11       Compliance SOS
10       Enter Modified Compliance
9:7      Transmit Margin
6        Selectable De-emphasis
5        Hardware Autonomous Speed Disable
4        Enter Compliance
3:0      Target Link Speed
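
As an illustration of the field layout in Figure 14-26, the sketch below unpacks a 16-bit Link Control 2 value into its fields. The function and dictionary key names are hypothetical. The Target Link Speed encodings used here (0001b = 2.5 GT/s, 0010b = 5.0 GT/s, 0011b = 8.0 GT/s) are taken from the base spec rather than from the figure, so treat that mapping as an assumption of this example.

```python
# Assumed Target Link Speed encodings (from the base spec, not the figure).
RATE_BY_ENCODING = {0b0001: 2.5, 0b0010: 5.0, 0b0011: 8.0}


def decode_link_control_2(value):
    """Split a 16-bit Link Control 2 register value into named fields."""
    return {
        "target_link_speed": value & 0xF,                     # bits 3:0
        "enter_compliance": (value >> 4) & 0x1,               # bit 4
        "hw_autonomous_speed_disable": (value >> 5) & 0x1,    # bit 5
        "selectable_deemphasis": (value >> 6) & 0x1,          # bit 6
        "transmit_margin": (value >> 7) & 0x7,                # bits 9:7
        "enter_modified_compliance": (value >> 10) & 0x1,     # bit 10
        "compliance_sos": (value >> 11) & 0x1,                # bit 11
        "compliance_preset_deemphasis": (value >> 12) & 0xF,  # bits 15:12
    }


fields = decode_link_control_2(0x0043)
print(fields["target_link_speed"], RATE_BY_ENCODING.get(fields["target_link_speed"]))
print(fields["selectable_deemphasis"])
```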


Link Width Change

An upper layer would normally only direct a Link width reduction if upconfigure_capable has been set to 1b because otherwise the Link won't be able to go back to the original width. If the Hardware Autonomous Width Disable bit is set to 1b, a Port can only reduce the width in an effort to correct a reliability problem. An upper layer can only initiate an increase in Link width if the Link partner advertised that it was upconfigure capable and the Link is not already at its maximum width. Apart from these guidelines, the decision criteria for changing the Link width are not given in the spec and are therefore implementation specific.

Link Partner Initiated

The spec describes three possibilities for this case.

First, if Electrical Idle is detected or inferred (see Table 14-10 on page 596) on all Lanes without first receiving an EIOS on any Lane, the Port may choose to enter Recovery or stay in L0. If errors result from this condition, the Port may be directed to Recovery by means such as setting the Retrain Link bit.

The second case happens when TS1s or TS2s are received (or an EIEOS for 128b/130b) on any configured Lanes, indicating that the Link partner has already entered Recovery. Since both of these cases are initiated by the Link partner, the Transmitter is allowed to complete any TLP or DLLP currently in progress.

Finally, if an EIOS is received on any Lane, indicating a Link power management change, but the Receiver doesn't support L0s and hasn't been directed to L1 or L2, then going to Recovery is the only option.

Exit to L0s State
The next state will be L0s for a Transmitter that's been instructed to initiate it, or for a Receiver that sees an EIOS. Interestingly, the LTSSM states for the Transmitter and Receiver of the Port can be different now, because one can be in L0s while the other is still in L0.

• Transmitters go to L0s when directed, if they implement L0s, and send EIOS to initiate the change.
• Receivers go to L0s when an EIOS is seen on any Lane. However, if the Receiver doesn't implement L0s and hasn't been directed to L1 or L2, this will be seen as a problem and the next state will be Recovery instead.


Exit to L1 State
The next state will be L1 when one Link partner is directed to initiate this and sends one EIOS on all Lanes (two EIOSs if the speed is 5.0 GT/s) and receives an EIOS on any Lane. Note that both Link partners must have already agreed to enter L1 beforehand and that a Data Link Layer handshake is needed to ensure that both are ready. For more detail on how this works, see the section called "Introduction to Link Power Management" on page 733.

Exit to L2 State
The next state will be L2 when one Link partner is directed to initiate this and sends one EIOS on all Lanes (two EIOSs if the speed is 5.0 GT/s) and receives an EIOS on any Lane. Note that both Link partners must have already agreed to enter L2 beforehand and that a handshake is needed to ensure that both are ready. For more detail on how this works, see the section called "Introduction to Link Power Management" on page 733.

Recovery State
If everything works as expected, the Link trains to the L0 state without ever going into the Recovery state. But we've already discussed two reasons why it might not. First, if the correct Symbol pattern isn't seen in Configuration.Idle, the LTSSM goes to Recovery in an effort to correct signaling problems by, for example, adjusting equalization values. Secondly, once L0 is reached with a data rate of 2.5 GT/s and both devices support higher speeds, the LTSSM goes to Recovery and attempts to change the Link speed to the highest commonly supported/advertised speed. In this state, Bit Lock and either Symbol Lock or Block Alignment is re-acquired and the Link is de-skewed again. The Link and Lane Numbers should remain unchanged unless the Link width is being changed. In that case, the LTSSM passes through the Configuration state where Link width is re-negotiated.

NOTE: To simplify the discussion and avoid repeating the same text many times, the term "Lock" will be used here to mean the combination of Bit Lock and either Symbol Lock for 8b/10b encoding or Block Alignment for 128b/130b encoding. A Receiver must acquire this Lock to be able to recognize Symbols, Ordered Sets and Packets.


Reasons for Entering Recovery State

• Exiting the L1 state. Recovery is required because there is no fast training option (like sending FTS Ordered Sets) when exiting L1.
• Exiting L0s. If the Receiver fails to achieve Lock from the FTS Ordered Sets in the required time, the Link must transition to Recovery.
• From L0 if:
  - A higher data rate is available when initial training completes.
  - A Link speed or width change has been requested (for power management or because the current speed or width is unreliable).
  - Software sets the Retrain Link bit in the Link Control Register (see Figure 14-71 on page 644) in an effort to clear transmission problems.
  - An error condition such as a Replay Num Rollover event associated with the Ack/Nak protocol of the Data Link Layer automatically causes the Physical Layer logic to retrain the Link.
  - Receiver sees TS1s or TS2s on any configured Lane, meaning that the neighbor must have entered Recovery.
  - Receiver sees Electrical Idle on all configured Lanes but did not first receive the Electrical Idle Ordered Set.

Initiating the Recovery Process

Either Port can initiate Recovery by sending TS1s to its neighbor. When a Port sees incoming TS1s it knows that the other Port has entered Recovery, so it also goes into Recovery and returns TS1s. Both receivers first use the TS1s to re-acquire Lock (if necessary) and then proceed to the other substates as needed. This is shown in Figure 14-27 on page 573. A detailed description of what happens in the substates is provided in the sections that follow.


Figure 14-27: Recovery State Machine

[Figure: state diagram. Entry from L1, L0 or L0s into Recovery.RcvrLock (bit/symbol re-lock). Other substates: Recovery.RcvrCfg, Recovery.Speed, Recovery.Equalization and Recovery.Idle (send Idle data). Exits shown to L0, Configuration, Detect, Loopback, Disabled and Hot Reset.]

Detailed Recovery Substates

During Recovery.RcvrLock
Regardless of the speed, Transmitters send TS1s on all configured Lanes using the same Link and Lane numbers that were set in the Configuration state. If the purpose of entering the Recovery state was to change speeds, the speed_change bit in the Data Rate Identifier Symbol will be set to 1b in the TS1s from the initiating device and the internal variable directed_speed_change is set to 1b. This same variable will be set in the other device if the speed_change bit is set in the incoming TS1s. In addition, the successful_speed_negotiation variable is cleared to 0b on entry to this substate.

In this substate, an Upstream Port is allowed to specify the de-emphasis level the Downstream Port should use when operating at 5.0 GT/s. This is accomplished by setting the Selectable De-emphasis bit in its TS1s to the desired value. It's possible that bit errors on the Link will prevent this information from reaching the Downstream Port, so the Upstream Port is allowed to request the de-emphasis level again when going to the Recovery state for a speed change. If the Downstream Port plans to use the requested level, it must record the value of the Selectable De-emphasis bit while in this state.


A new transmitter voltage can also be applied upon entry to this state. The Transmit Margin field in the Link Control 2 register is sampled on entry to this substate and remains in effect until a new value is sampled on another entry to this substate from L0, L0s, or L1.

A Downstream Port that wants to change the rate to 8.0 GT/s and redo the equalization must send EQ TS1s with the speed_change bit set and advertising the 8.0 GT/s rate. If an Upstream Port receives 8 consecutive EQ TS1s or EQ TS2s with the speed_change bit set to 1b and the 8.0 GT/s rate supported, it is expected to advertise the 8.0 GT/s rate, too, unless it has concluded that there are reliability problems at that rate that can't be fixed with equalization. Note that a Port is allowed to change its advertised data rates when entering this state, but only those rates that can be supported reliably. And apart from the conditions described here, a device is not allowed to change its supported data rates in this substate or in Recovery.RcvrCfg or Recovery.Equalization.

Exit to Recovery.RcvrCfg
The next state will be Recovery.RcvrCfg if 8 consecutive TS1s or TS2s are received whose Link and Lane numbers match what is being sent and their speed_change bit is equal to the directed_speed_change variable and their EC field is 00b (if the current data rate is 8.0 GT/s).

• If the Extended Synch bit is set, a minimum of 1024 TS1s in a row must be sent before going to Recovery.RcvrCfg.
• If this substate was entered from Recovery.Equalization, the Upstream Port must compare the equalization coefficients or preset received by all Lanes against the final set of coefficients or preset that was accepted in Phase 2 of the equalization process. If they don't match, it sets the Request Equalization bit in the TS2s it sends.
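
The basic exit test above can be sketched as follows. This is a simplified model for one Lane under stated assumptions: the TrainingSet class and the function name are hypothetical, "8 consecutive" is approximated by examining the eight most recently received Ordered Sets, and the Extended Synch and Recovery.Equalization side conditions are not modeled.

```python
from dataclasses import dataclass


@dataclass
class TrainingSet:
    """Hypothetical view of the TS1/TS2 fields this check needs."""
    link: int
    lane: int
    speed_change: int
    ec: int = 0


def may_exit_to_rcvrcfg(received, sent_link, sent_lane,
                        directed_speed_change, rate_gts):
    """True if the last 8 received TS1s/TS2s satisfy the exit condition."""
    last8 = received[-8:]
    if len(last8) < 8:
        return False
    for ts in last8:
        if (ts.link, ts.lane) != (sent_link, sent_lane):
            return False
        if ts.speed_change != directed_speed_change:
            return False
        if rate_gts == 8.0 and ts.ec != 0b00:
            return False     # EC must be 00b when running at 8.0 GT/s
    return True


good = [TrainingSet(link=1, lane=0, speed_change=1)] * 8
print(may_exit_to_rcvrcfg(good, 1, 0, directed_speed_change=1, rate_gts=2.5))
```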

Exit to Recovery.Equalization
When the data rate is 8.0 GT/s, the Lanes must establish the proper equalization parameters to obtain good signal integrity. This section does not apply for lower speeds. Even when the Link is running at 8.0 GT/s, it does not go through the Recovery.Equalization substate every time Recovery is entered. Recovery.Equalization is only entered if one of these conditions is met:

• If the start_equalization_w_preset variable is set to 1b then:
  a) The Upstream Port registered preset values from the 8 consecutive TS2s it saw prior to changing to 8.0 GT/s. It must use the Transmitter presets and it may optionally use the Receiver presets it received.
  b) The Downstream Port must use the Transmitter presets defined in its Lane Equalization Control register as soon as it changes to 8.0 GT/s and it may optionally use the Receiver presets found there.
• Else (the variable is not set), Transmitters must use the coefficient settings they agreed to when the equalization process was last executed.
  a) An Upstream Port's next state will be Recovery.Equalization if 8 consecutive incoming TS1s have Link and Lane numbers that match those being sent and the speed_change bit is 0b, but the EC bits are non-zero, indicating that the Downstream Port wishes to redo some parts of the equalization process. The spec notes that a Downstream Port could do this under software or implementation-specific direction. As always, the time it takes to do this must not be allowed to cause transaction timeout errors, which really means the Downstream Port would need to ensure there were no transactions in flight before taking this step.
  b) A Downstream Port's next state will be Recovery.Equalization if directed, as long as this state wasn't entered from Configuration.Idle or Recovery.Idle. The spec points out that no more than two TS1s whose EC = 00b should be sent before sending TS1s with a non-zero EC value to request that equalization be redone.
Otherwise, after a 24 ms timeout:
Exit to Recovery.RcvrCfg
The next state will be Recovery.RcvrCfg if both:

• 8 consecutive TS1s or TS2s are received whose Link and Lane numbers match what is being sent and their speed_change bit is equal to 1b.
• And either the current data rate is already higher than 2.5 GT/s, or at least a higher rate is shown to be supported in the TS1s or TS2s.

Exit to Recovery.Speed
The next state will be Recovery.Speed if either of the two following conditions is met:

• If the current speed is higher than 2.5 GT/s but hasn't worked since entering Recovery (indicated by the changed_speed_recovery variable being cleared to 0b). The new rate after leaving Recovery.Speed will drop back to 2.5 GT/s.
• If the changed_speed_recovery variable is set to 1b, indicating that a higher rate than 2.5 GT/s is already working but the Link was unable to operate at a new negotiated rate. As a result, the operating speed will revert to what it was when Recovery was entered from L0 or L1.


Exit to Configuration State
Otherwise, the LTSSM will return to Configuration if a speed change is not requested (directed_speed_change variable = 0b and the speed_change bit in the TS1s and TS2s is 0b), or if the highest commonly supported data rate is 2.5 GT/s.

Exit to Detect State
Finally, if none of the other conditions are true, the next state will be Detect.

Speed Change Example

The spec includes an example of a speed change in the discussion of this substate. The scenario is two Link neighbors (device A and device B) that are coming out of reset, both of which support the 5.0 GT/s and 8.0 GT/s rates.

To begin with, the Link will automatically train to L0 using the Gen1 rate of 2.5 GT/s. (This behavior is likely to continue in future spec versions because it provides backward compatibility with older designs.)

In our example both devices support higher rates and this is indicated by the Rate Identifier field in their TS Ordered Sets during training. Both devices note that the other supports a higher rate and one of them (device A) will be the first to set its directed_speed_change variable to 1b. When that happens, it will go to Recovery.RcvrLock and send TS1s with the speed_change bit set. If the desired rate will be 8.0 GT/s and hasn't been used before, the devices will exchange EQ TS1s, instead of ordinary TS1s, to deliver the Tx equalizer presets to be used.

Device B sees incoming TS1s and also transitions to Recovery.RcvrLock. When it recognizes 8 TS1s in a row with the speed_change bit set, it responds by setting the speed_change bit in its own TS1s and goes to Recovery.Speed. Device A waits for that response and, when 8 TS1s in a row with the speed_change bit set have been seen, it goes to Recovery.RcvrCfg and then to Recovery.Speed. In that substate, the transmitters are put into Electrical Idle, the speed is changed to the highest commonly-supported rate, and the directed_speed_change variable is cleared.

After a timeout period, both devices transition back to Recovery.RcvrLock and the transmitters are re-activated using the new speed (8.0 GT/s in this case). They send TS1s again now, this time with the speed_change bit cleared to 0b. If the new speed works well, they transition to Recovery.RcvrCfg and back to L0. However, if device B has a problem, such as failure to achieve Bit Lock, it will time out in this substate and go back to Recovery.Speed. Device A may have already transitioned to Recovery.RcvrCfg by this time, but when it sees Electrical Idle now, indicating the neighbor has returned to Recovery.Speed, it will also go back to that state. Returning to Recovery.Speed causes both devices to revert to the speed in use when Recovery was entered, 2.5 GT/s in this case, and return to Recovery.RcvrLock.
In response to that development, device A might set directed_speed_change again and try the process a second time. If it failed again, device A might choose to remove the 8.0 GT/s rate from its advertised list and try the speed change again without it. Since the highest common rate is now 5.0 GT/s, if this attempt succeeds the rate will end up at 5.0 GT/s. If it doesn't work, device A might give up trying to use a higher rate. How and when a device chooses to change its advertised rates or give up trying to get a higher rate working is not given in the spec and will be implementation specific.
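
The rate selection in this example boils down to picking the highest rate that both partners advertise. The sketch below shows that step only; the function name and the use of Python sets of rates in GT/s are assumptions for illustration, and the policy for when a device drops a rate from its advertised list is deliberately left out because the spec leaves it implementation specific.

```python
def highest_common_rate(rates_a, rates_b):
    """Return the highest data rate (GT/s) advertised by both Link partners."""
    common = set(rates_a) & set(rates_b)
    # 2.5 GT/s is always supported, so the intersection is never empty
    # for valid inputs.
    return max(common)


device_a = {2.5, 5.0, 8.0}
device_b = {2.5, 5.0, 8.0}
print(highest_common_rate(device_a, device_b))

# Device A stops advertising 8.0 GT/s after repeated failures at that rate.
device_a.discard(8.0)
print(highest_common_rate(device_a, device_b))
```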

Link Equalization Overview

This section provides an overview of the Equalization Process and prepares the reader to understand the detailed substate machine behaviors if they are of interest.

Using a higher Link speed results in more signal distortion than lower data rates. To compensate for this and minimize the effort and cost for system designers, the 3.0 spec adds a requirement for Transmitter Equalization. Unlike the fixed de-emphasis used for the lower rates, which is really a simple form of Transmitter equalization itself, the new method uses an active handshake process to match the Transmitters to the actual signaling environment. During this process, each Receiver Lane evaluates the quality of the incoming signal and suggests Tx equalization parameters that the Link partner should use to meet the signal quality requirements.

The Link Equalization procedure executes after the first change to the 8.0 GT/s data rate. The spec strongly recommends that the equalization process be initiated autonomously (automatically in hardware) but doesn't require it. If a component chooses not to use the autonomous mechanism then a software-based mechanism must be used. If either Port is unable to achieve the necessary signal quality through this process, the LTSSM will conclude that the rate is not working and will go back to Recovery.Speed to request a lower speed.

The process involves up to four phases, as described in the text that follows. Once the speed has been changed to 8.0 GT/s, the current equalization phase in use is indicated by the EC (Equalization Control) field in the TS1s being sent, as shown in Figure 14-28.


Figure 14-28: EC Field in TS1s and TS2s for 8.0 GT/s

[Figure: TS1/TS2 Ordered Set layout, Symbols 0 through 15 (Symbol 1 Link #, Symbol 2 Lane #, Symbol 3 # FTS, Symbol 4 Rate ID, Symbol 5 Train Ctl, Symbols 6-9 EQ Info, Symbols 10-15 TS ID), with the EQ Info Symbols expanded, fields listed from bit 7 down to bit 0:
Symbol 6: Use Preset, Reset EIEOS Interval Count, Tx Preset, EC
Symbol 7: Rsvd; FS value when EC = 01b, otherwise Pre-Cursor Coefficient
Symbol 8: Rsvd; LF value when EC = 01b, otherwise Cursor Coefficient
Symbol 9: P, RCV, Post-Cursor Coefficient]

Phase 0
When the Downstream Port is ready to change from a lower rate to the 8.0 GT/s rate, it enters the Recovery.RcvrCfg substate and sends Tx Presets and Rx Hints to the Upstream Port using EQ TS2s as described in "TS1 and TS2 Ordered Sets" on page 510. (Note that this phase is skipped if the Link is already running at 8.0 GT/s.) The Downstream Port (DSP) sends Tx Preset values based on the contents of its Equalization Control register shown in Figure 14-29 on page 579. One thing this highlights is that there can be different equalization values for each Lane. The Downstream Port will use the DSP values for its own Transmitter and optionally for its Receiver, and send the USP values to the Upstream Port for it to use when going to the higher speed.


Figure 14-29: Equalization Control Registers

[Figure: The Link Control 3 register and the Lane Error Status register are
followed by the Equalization Control registers. Each 32-bit register holds two
16-bit Lane Control fields: the even-numbered Lane in bits 15:0 and the
odd-numbered Lane in bits 31:16, from Lane 0 up to Lane n.]

Control Register Contents

Bits     Field
15       R
14:12    USP Rx Hint
11:8     USP Tx Preset
7        R
6:4      DSP Rx Hint
3:0      DSP Tx Preset

USP = Upstream Port, DSP = Downstream Port
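
As a rough illustration of the layout in Figure 14-29, the sketch below unpacks one 16-bit Lane Control field into its four values. The function and type names are hypothetical and chosen only for this example; the bit positions are taken from the figure.

```python
from typing import NamedTuple


class LaneEqControl(NamedTuple):
    """Fields of one 16-bit Lane Control entry (names are illustrative)."""
    dsp_tx_preset: int  # bits 3:0
    dsp_rx_hint: int    # bits 6:4
    usp_tx_preset: int  # bits 11:8
    usp_rx_hint: int    # bits 14:12


def decode_lane_eq_control(value: int) -> LaneEqControl:
    """Split a 16-bit Lane Control value into its preset and hint fields."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("Lane Control field is 16 bits wide")
    return LaneEqControl(
        dsp_tx_preset=value & 0xF,
        dsp_rx_hint=(value >> 4) & 0x7,
        usp_tx_preset=(value >> 8) & 0xF,
        usp_rx_hint=(value >> 12) & 0x7,
    )


def decode_lane_pair(dword: int) -> tuple:
    """Each 32-bit register holds two Lanes: even Lane low, odd Lane high."""
    return (decode_lane_eq_control(dword & 0xFFFF),
            decode_lane_eq_control((dword >> 16) & 0xFFFF))
```

For example, the value 0x3724 would decode to DSP Tx Preset 4, DSP Rx Hint 2, USP Tx Preset 7, and USP Rx Hint 3.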

Table 14-8: Tx Preset Encodings

Encoding          De-emphasis (dB)               Preshoot (dB)
0000b             -6                             0
0001b             -3.5                           0
0010b             -4.5                           0
0011b             -2.5                           0
0100b             0                              0
0101b             0                              2
0110b             0                              2.5
0111b             -6                             3.5
1000b             -3.5                           3.5
1001b             0                              3.5
1010b             Depends on FS and LF values    Depends on FS and LF values
1011b to 1111b    Reserved                       Reserved

Table 14-9: Rx Preset Hint Encodings

Encoding    Rx Preset Hint
000b        -6 dB
001b        -7 dB
010b        -8 dB
011b        -9 dB
100b        -10 dB
101b        -11 dB
110b        -12 dB
111b        Reserved

Once the rate does change, the Downstream Port begins in Phase 1 and sends
TS1s with EC = 01b. It then waits for the Upstream Port to respond with the
same EC value.

Meanwhile, the Upstream Port starts in Phase 0, as illustrated in Figure 14-30
on page 581, and sends TS1s that echo the preset values it received earlier from
the EQ TS1s and EQ TS2s. It will use those requested Tx presets if they're
supported, and will optionally use the Rx Hints. The USP is allowed to wait
500 ns before evaluating the incoming signal but, once it's able to recognize
two TS1s in a row, it's ready for the next step. This means the signal quality
meets the minimum BER of 10^-4 (i.e., a Bit Error Ratio of less than one error
in 10,000 bits). Subsequently the USP sets EC = 01b in its TS1s, thereby moving
to Phase 1 and handing control of the next step to the DSP.

Figure 14-30: Equalization Process: Starting Point

[Figure: The Downstream Port of a Root Port sends EC = 01b while the Upstream
Port of an Endpoint sends EC = 00b.]

Phase 1
The DSP performs the same actions as the USP and achieves a BER of 10^-4 by
detecting back-to-back TS1s. During this time, the DSP communicates its Tx
presets and FS (Full Swing), LF (Low Frequency), and Post-cursor coefficient
values as shown in Figure 14-32 on page 584. The spec gives some additional
rules that must be satisfied for a set of requested coefficients, which are:

1. |C_{-1}| <= Floor(FS/4) (Note: Floor means round down to the integer value)
2. |C_{-1}| + C_{0} + |C_{+1}| = FS
3. C_{0} - |C_{-1}| - |C_{+1}| >= LF


FS represents the maximum voltage, and LF defines the minimum voltage as
LF/FS. These inform the receiver about the number of possible values and allow
the coefficients to be communicated as integer values but understood as
fractional values.

As an example, assume we're using the coefficients defined for the P7 preset
setting. The FS value acts as a reference and can be any number up to 63 but,
for ease of calculation, let's say it's given as 30. In the case of P7, C_{-1}
is -0.1, so the value communicated to represent C_{-1} in the TS1s would be 3,
since 3/30 = 0.1 and the value is always considered negative. C_{+1} is -0.2, so
it would be communicated as 6, since 6/30 = 0.2 and it is also always negative.
C_{0} is 0.7, so that will be sent as 21, since 21/30 = 0.7. Finally, the LF
value represents the smallest possible ratio, and for P7 that is 0.4 times the
max value. Consequently, LF will be communicated as 12, since 12/30 = 0.4.

Armed with this information, let's check the three rules to see whether they
are satisfied for the P7 case:

1. 3 <= Floor(30/4). This works out to be 3 <= 7 and is true.
2. 3 + 21 + 6 = 30. This one is true.
3. 21 - 3 - 6 >= 12. This one is also true, so all three checks are satisfied
   for P7.
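
The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the three rules as stated above; the function names are hypothetical, and the pre-cursor and post-cursor arguments are passed as magnitudes because those coefficients are always treated as negative.

```python
def to_integer_coefficients(c_pre: float, c0: float, c_post: float, fs: int):
    """Convert fractional coefficients to the integer magnitudes sent in TS1s."""
    return round(abs(c_pre) * fs), round(c0 * fs), round(abs(c_post) * fs)


def check_coefficient_rules(pre: int, cursor: int, post: int, fs: int, lf: int):
    """Evaluate the three coefficient rules; pre and post are magnitudes."""
    return {
        "rule1": pre <= fs // 4,               # |C-1| <= Floor(FS/4)
        "rule2": pre + cursor + post == fs,    # |C-1| + C0 + |C+1| = FS
        "rule3": cursor - pre - post >= lf,    # C0 - |C-1| - |C+1| >= LF
    }


# P7 example from the text: FS = 30, LF = 12.
pre, cursor, post = to_integer_coefficients(-0.1, 0.7, -0.2, fs=30)
p7_result = check_coefficient_rules(pre, cursor, post, fs=30, lf=12)
```

With these inputs the integer values come out as 3, 21, and 6, and all three rules evaluate as satisfied.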

Once the Downstream Port is satisfied that the Link is working well enough to
move forward (it recognizes incoming TS1s with EC = 01b), then this phase is
complete and it initiates a change to Phase 2 by setting its EC = 10b, as
illustrated in Figure 14-31 on page 583, and hands control of the next step back
to the USP. When the USP responds with EC = 10b, both Ports go to Phase 2. As a
happy alternative, the Downstream Port may conclude that the signal quality is
already good enough at this point and no further adjustments are necessary. In
that case, it sets its EC = 00b to exit the equalization process.


Figure 14-31: Equalization Process: Initiating Phase 2

[Figure: The Downstream Port of a Root Port sends EC = 10b while the Upstream
Port of an Endpoint is still sending EC = 01b.]

Phase 2
The signal quality has been good enough to recognize TS1s, but not good
enough for runtime operation. Once both Ports are in Phase 2, the Upstream
Port is allowed to request Tx settings for the Downstream Port and then
evaluate how well they work, repeating the process until it arrives at optimal
settings for the current environment. To make a request, it changes the value of
the equalization information it sends in its TS1s. As shown in Figure 14-32 on
page 584, there are several values of interest:

• Tx Preset: The Tx presets are a coarse-grained adjustment to the Transmitter
  settings that are intended to get it into the right ballpark for the current
  signaling environment. The Upstream Port sets this value, and sets the Use
  Preset indicator (bit 7 of Symbol 6) to tell the Downstream Port's
  Transmitter to use it. If the Use Preset bit is not set, then it's understood
  that the presets should stay as they are and that the coefficient values
  should be changed instead. The Tx coefficients are considered fine-grained
  adjustments.


Figure 14-32: Equalization Coefficients Exchanged

[Figure: TS1 Symbols 6 through 9 at 8.0 GT/s. Symbol 6 carries Use Preset, Tx
Preset, Reset EIEOS Interval Count, and EC. Symbol 7 carries the FS value when
EC = 01b, otherwise the Pre-Cursor Coefficient. Symbol 8 carries the LF value
when EC = 01b, otherwise the Cursor Coefficient. Symbol 9 carries P, RCV, and
the Post-Cursor Coefficient.]

• Coefficients: Since the spec requires a 3-tap Tx equalizer, three coefficient
  values are defined that can be pictured as voltage adjustments to a signal
  pulse that compensate for the distortion it will experience going through
  the transmission medium, as shown in Figure 14-33 on page 585. This is
  covered in more detail in the Physical Layer Electrical section titled
  "Solution for 8.0 GT/s Transmitter Equalization" on page 474.
  - Pre-Cursor Coefficient: a multiplier applied to the signal prior to the
    sample point that can boost or reduce the signal depending on the need.
  - Cursor Coefficient: the sample point multiplier; always positive.
  - Post-Cursor Coefficient: a multiplier applied to the signal after the
    sample point that can boost or reduce the signal depending on the need.

Once the signal meets the quality standard needed, the Upstream Port
indicates that it's ready to move to the next phase by changing EC to 11b.
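
To make the exchanged fields more concrete, here is a minimal sketch that decodes the equalization information from TS1 Symbols 6 through 9. The text confirms only that Use Preset is bit 7 of Symbol 6 and that FS can be as large as 63; the remaining bit positions (EC in bits 1:0, Reset EIEOS Interval Count in bit 2, Tx Preset in bits 6:3, six-bit coefficient fields, Reject Coefficient Values in bit 6 of Symbol 9) are assumptions based on reading the figure, and the function name is hypothetical.

```python
def decode_eq_symbols(sym6: int, sym7: int, sym8: int, sym9: int) -> dict:
    """Decode the EQ fields of a TS1 (assumed bit layout, see lead-in)."""
    ec = sym6 & 0x3
    fields = {
        "ec": ec,
        "reset_eieos_interval_count": bool(sym6 & 0x04),
        "tx_preset": (sym6 >> 3) & 0xF,
        "use_preset": bool(sym6 & 0x80),
        "post_cursor": sym9 & 0x3F,
        "reject_coefficient_values": bool(sym9 & 0x40),
    }
    if ec == 0b01:
        # In Phase 1, Symbols 7 and 8 advertise FS and LF.
        fields["fs"] = sym7 & 0x3F
        fields["lf"] = sym8 & 0x3F
    else:
        # In the other phases they carry the pre-cursor and cursor values.
        fields["pre_cursor"] = sym7 & 0x3F
        fields["cursor"] = sym8 & 0x3F
    return fields


# Phase 1 advertisement for the P7 example (FS = 30, LF = 12, post-cursor 6).
phase1 = decode_eq_symbols(0x39, 30, 12, 6)
# Phase 2 coefficient request (Use Preset clear) for the same values.
phase2 = decode_eq_symbols(0x02, 3, 21, 6)
```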



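Figure 14-33: 3-Tap Transmitter Equalization

[Figure: Two voltage-versus-time plots with the time axis marked in UI. The
first shows the unmodified signal pulse at the cursor. The second shows the
equalized signal, with a pre-cursor reduction before the cursor and a
post-cursor reduction after it.]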

Figure 14-34: Equalization Process: Adjustments During Phase 2

[Figure: A Root Port and an Endpoint, with the labels "Evaluate resulting Rx
signal" and "Propose new Tx EQ values".]


Phase 3
The Downstream Port responds by sending EC = 11b and can now do the same
signal evaluation process for the Upstream Port's Transmitter. It sends TS1s
that request a new setting the same way: if the Use Preset bit is set, new
presets are defined, otherwise new coefficients are being given. This is sent
continuously for 1 µs or until the request has been evaluated for its result,
whichever is later. That evaluation must wait 500 ns plus the round-trip time
through the outgoing logic and back in to the receive logic. Different
equalization settings can be tested until one is found that achieves the desired
signal quality. At that point the Downstream Port exits the equalization process
by setting EC = 00b.

Figure 14-35: Equalization Process: Adjustments During Phase 3

[Figure: A Root Port and an Endpoint, with the labels "Propose new Tx EQ
values" and "Evaluate resulting Rx signal".]

Equalization Notes
The specification mentions other items associated with the equalization
process, as described below:

• All Lanes must participate in the process, even those that may only become
  active later after an upconfigure event.
• The algorithm used by a component to evaluate the incoming signal and
  determine the equalization values that its Link partner should use is not
  given in the spec and is implementation specific.


• Equalization changes can be requested for any number of Lanes and the
  Lanes can use different values.
• At the end of the fine-tuning steps (Phase 2 for Upstream Ports and Phase 3
  for Downstream Ports), each component is responsible for ensuring that the
  Transmitter settings cause it to meet the spec requirements.
• Components must evaluate requests to adjust their Transmitter settings and
  act on them. If valid values are given they must use them and reflect those
  values in the TS1s they send.
• A request to adjust coefficients may be rejected if the values are not
  compliant with the rules. The requested values will still be reflected in the
  TS1s sent back but the Reject Coefficient Values bit will be set.
• Components must store the equalization values that they settled on
  through this process for future use at 8.0 GT/s. The spec is not explicit on
  this, but the author's opinion is that these values would survive a change in
  speed to a lower rate and then back to the 8.0 GT/s rate. That makes sense
  because it could potentially take a long time to repeat the EQ process and
  the resulting values would be the same, provided the electrical environment
  hasn't changed.
• Components are allowed to fine-tune their Receivers at any time, as long as
  it doesn't cause the Link to become unreliable or go to Recovery.

Detailed Equalization Substates


This section covers detailed descriptions of the state machine behaviors during
Link Equalization.

Recovery.Equalization
This substate is used to execute the Link Equalization Procedure for 8.0 GT/s
and higher rates. The lower rates don't use equalization and the LTSSM won't
enter this substate when they're in effect. Since this is a new and complex
topic for PCIe, a description of the overall equalization procedure from a
high-level view is presented after the state machine details in the section
called "Link Equalization Overview" on page 577. First though, let's step
through the substates to see the mechanics of the process.

Downstream Lanes
The Downstream Port starts in Phase 1 of the equalization process. To begin
this process, there are several bits that need to be reset. In the Link Status 2
register (Figure 14-36 on page 588), the following bits are cleared when
entering this substate:


• Equalization Phase 1 Successful
• Equalization Phase 2 Successful
• Equalization Phase 3 Successful
• Link Equalization Request
• Equalization Complete

The Perform Equalization bit of the Link Control 3 register is also cleared to
0b, as is the internal variable start_equalization_w_preset. The
equalization_done_8GT_data_rate variable is set to 1b.

Figure 14-36: Link Status 2 Register

Bits     Field
15:6     RsvdZ
5        Link Equalization Request
4        Equalization Phase 3 Successful
3        Equalization Phase 2 Successful
2        Equalization Phase 1 Successful
1        Equalization Complete
0        Current De-emphasis Level

Figure 14-37: Link Control 3 Register

Bits     Field
31:2     RsvdP
1        Link Equalization Request Interrupt Enable
0        Perform Equalization
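
As a small illustration of the bookkeeping described above, the sketch below models the equalization-related bits of the Link Status 2 register using the bit positions shown in Figure 14-36. The constant and function names are hypothetical; real hardware performs these updates internally, so this is only a model of the behavior.

```python
# Bit positions as listed in Figure 14-36 (names are illustrative).
CURRENT_DEEMPHASIS_LEVEL = 1 << 0
EQ_COMPLETE = 1 << 1
EQ_PHASE1_SUCCESSFUL = 1 << 2
EQ_PHASE2_SUCCESSFUL = 1 << 3
EQ_PHASE3_SUCCESSFUL = 1 << 4
LINK_EQ_REQUEST = 1 << 5

# The five bits cleared on entry to Recovery.Equalization.
CLEARED_ON_ENTRY = (EQ_COMPLETE | EQ_PHASE1_SUCCESSFUL | EQ_PHASE2_SUCCESSFUL
                    | EQ_PHASE3_SUCCESSFUL | LINK_EQ_REQUEST)

FIELD_NAMES = {
    CURRENT_DEEMPHASIS_LEVEL: "Current De-emphasis Level",
    EQ_COMPLETE: "Equalization Complete",
    EQ_PHASE1_SUCCESSFUL: "Equalization Phase 1 Successful",
    EQ_PHASE2_SUCCESSFUL: "Equalization Phase 2 Successful",
    EQ_PHASE3_SUCCESSFUL: "Equalization Phase 3 Successful",
    LINK_EQ_REQUEST: "Link Equalization Request",
}


def on_equalization_entry(link_status2: int) -> int:
    """Clear the equalization status bits, leaving other bits untouched."""
    return link_status2 & ~CLEARED_ON_ENTRY & 0xFFFF


def describe(link_status2: int) -> list:
    """List the names of the bits that are set, lowest bit first."""
    return [name for bit, name in FIELD_NAMES.items() if link_status2 & bit]
```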


Phase 1 Downstream. During this phase, the Downstream Port sends TS1s with
EC = 01b while using the Preset values from the Lane Equalization Control
register and with the FS, LF, and Post-cursor Coefficient fields that
correspond to the Tx Preset field. It's allowed to wait 500 ns before
evaluating incoming TS1s if it needs time to stabilize its Receiver logic.

Exit to Phase 2 Downstream

The Downstream Port will transition to Phase 2 if it wants to continue
with the equalization process and when all configured Lanes receive
two consecutive TS1s with EC = 01b. At this point, the Port will set the
Equalization Phase 1 Successful status bit to 1b and store the received
TS1 LF and FS values for use in Phase 3 (if the Downstream Port plans
to adjust the Upstream Port's Tx coefficients).

Exit to "Detailed Recovery Substates"

If the Downstream Port doesn't want to use Phases 2 and 3, it sets the
status bits to 1b (Eq. Phase 1 Successful, Eq. Phase 2 Successful, Eq.
Phase 3 Successful, and Eq. Complete). One reason to do this would be
because it can already see that the signal characteristics are good
enough and the rest of the phases aren't needed.

Exit to Recovery.Speed

If the consecutive TS1s are not seen after a 24 ms timeout, the next state
is Recovery.Speed. The successful_speed_negotiation flag is cleared to
0b, and the Equalization Complete status bit is set to 1b.

Phase 2 Downstream. During this phase, the Downstream Port sends TS1s with
EC = 10b and coefficient settings independently assigned on each Lane
according to the following:

• If two consecutive TS1s are received with EC = 10b (Upstream Port
  has entered Phase 2) either for the first time, or with different preset
  or coefficient values than the last time, and if the values requested are
  legal and supported, then change the Tx settings to use them within
  500 ns of the end of the second TS1 requesting them. Also, reflect the
  values in the TS1s being sent back to the Upstream Port and clear the
  Reject Coefficient Values bit to 0b. Note that the change must not
  cause illegal voltages or parameters at the Transmitter for more than
  1 ns.
  a) If the requested preset or coefficients are illegal or not supported,
     don't change the Tx settings but reflect the received values in the
     TS1s being sent and set the Reject Coefficient Values bit to 1b
     (see Figure 14-38 on page 590).
• If the two consecutive TS1s aren't seen, keep the current Tx preset and
  coefficient values.

Exit to Phase 3 Downstream

When the Upstream Port is satisfied with the changes, it begins to send TS1s
with EC = 11b, indicating a desire to change to Phase 3. When two consecutive
TS1s like this are received, set the Eq. Phase 2 Successful status bit to 1b
and change to Phase 3.

Exit to Recovery.Speed

If, after 32 ms, the transition to Phase 3 has not happened, the Port should
clear the successful_speed_negotiation flag, set the Equalization Complete
status bit and exit to the Recovery.Speed substate.

Figure 14-38: TS1s Rejecting Coefficient Values

[Figure: TS1 Symbols 6 through 9 at 8.0 GT/s, as in Figure 14-32, highlighting
the RCV (Reject Coefficient Values) bit carried in Symbol 9 alongside P and the
Post-Cursor Coefficient.]


Phase 3 Downstream. During this phase, the Downstream Port sends TS1s with
EC = 11b and begins the process of evaluating Upstream Tx settings
independently for each Lane.

In the transmitted TS1s, the Downstream Port can either request a new preset
by setting the Use Preset bit to 1b and the Tx Preset field to the desired
value, or it can request new coefficients by clearing the Use Preset bit to 0b
and setting the Pre-cursor, Cursor, and Post-Cursor Coefficient fields to the
desired values. Either request must be made continuously for at least 1 µs or
until the evaluation has completed. If new preset or coefficient settings are
going to be presented, they must be sent on all Lanes at the same time.
However, a given Lane isn't required to request new settings if it wants to
keep the ones it has.

The Downstream Port must wait long enough to ensure the Upstream
Transmitter has had a chance to implement the requested changes (500 ns
plus the round-trip delay for the logic), then obtain Block Alignment and
evaluate the incoming TS1s. It's not expected that anything useful will be
coming from the Upstream Port during the waiting period, and it may not
even be legal. That's why obtaining Block Alignment after that time is a
requirement.

If two consecutive TS1s are seen that match the same preset or coefficient
values that are being requested and don't have the Reject Coefficient Values
bit set, then the requested setting was accepted and can be evaluated. If the
values match but the Reject Coefficient Values bit is set to 1b, then the
requested values have been rejected by the Upstream Port and are not being
used. For this case, the spec recommends that the Downstream Port try
again with different values, but it's not required to do so and may choose to
simply exit this phase.

The total time spent on a preset or coefficient request, from the time the
request is sent until the completion of its evaluation, must be less than 2 ms.
An exception is available for designs that need more time for the final stage
of optimization, but the total time in this phase cannot exceed 24 ms and the
exception can only be taken twice. If the Receiver doesn't recognize any
incoming TS1s, it may assume that the requested setting doesn't work for
that Lane.

Exit to "Detailed Recovery Substates"
The next state will be Recovery.RcvrLock when all configured Lanes
have their optimal settings. When that happens, the Equalization Phase
3 Successful and Equalization Complete status bits will be set to 1b.


Exit to Recovery.Speed
Otherwise, after a 24 ms timeout (with a tolerance of -0 or +2 ms), the
next state will be Recovery.Speed, and the successful_speed_negotiation
flag is cleared to 0b while the Equalization Complete status bit is set to 1b.

Upstream Lanes
The Upstream Port starts in Phase 0 of the equalization process and must
reset several internal bits. In the Link Status 2 register (Figure 14-36 on
page 588), the following bits are cleared when entering this substate:

• Equalization Phase 1 Successful
• Equalization Phase 2 Successful
• Equalization Phase 3 Successful
• Link Equalization Request
• Equalization Complete

The Perform Equalization bit of the Link Control 3 register is also cleared to
0b, as is the internal variable start_equalization_w_preset. The
equalization_done_8GT_data_rate variable is set to 1b.

Phase 0 Upstream. During this phase, the Upstream Port sends TS1s with
EC = 00b while using the Tx Preset values that were delivered in the EQ
TS2s before entering this state. The equalization information fields in the
TS1s being sent must show the preset value and also the Pre-cursor, Cursor,
and Post-cursor coefficient fields that correspond to that preset. Note that if
a Lane received a reserved or unsupported Tx Preset value in the EQ TS2s,
or no EQ TS2s at all, then the Tx Preset field and coefficient values are
chosen by a device-specific method for that Lane.

Exit to Phase 1 Upstream

When all configured Lanes receive two consecutive TS1s with EC = 01b,
indicating that they can recognize the TS1s from the Downstream Port
(which always starts with this value), then the next phase is Phase 1.

The equalization values LF and FS that are received in the TS1s must be
stored and used during Phase 2 if the Upstream Port plans to adjust the
Downstream Port's Tx coefficients.

The Upstream Port may wait 500 ns after entering Phase 0 before evaluating
the incoming TS1s to give time for its Receiver logic to stabilize.


Exit to Recovery.Speed

If incoming TS1s are not recognized within a 12 ms timeout, the LTSSM
will transition to Recovery.Speed, clear the successful_speed_negotiation
flag and set the Equalization Complete status bit.

Phase 1 Upstream. During this phase, the Upstream Port sends TS1s with
EC = 01b while using the Transmitter settings that were determined in
Phase 0. These TS1s contain the FS, LF, and Post-cursor Coefficient values
that are currently being used.

Exit to Phase 2 Upstream

If all configured Lanes receive two consecutive TS1s with EC = 10b,
indicating that the Downstream Port wants to go to Phase 2, then the
next phase will be Phase 2, and this Port will set the Equalization Phase
1 Successful status bit.

Exit to "Detailed Recovery Substates"

If all configured Lanes receive two consecutive TS1s with EC = 00b, it
means that the Downstream Port has decided that the equalization
process is already complete and it wants to skip the remaining phases. In
this case, the next state will be Recovery.RcvrLock, and the Equalization
Phase 1 Successful and Equalization Complete status bits are set to 1b.

Exit to Recovery.Speed

Otherwise, after a 12 ms timeout, the LTSSM will transition to
Recovery.Speed, clear the successful_speed_negotiation flag and set the
Equalization Complete status bit.

Phase 2 Upstream. During this phase, the Upstream Port sends TS1s with
EC = 10b and begins the process of finding optimal Tx values for the
Downstream Port. Recall that the settings are independently determined for
each Lane. The process is as follows:

In the transmitted TS1s, the Upstream Port can either request a new preset
by putting a legal value in the Transmitter Preset field of the TS1s being sent
and setting the Use Preset bit to 1b to tell the Downstream Port to begin
using it, or request new coefficients by putting legal values in those fields
and clearing the Use Preset bit to 0b so the Downstream Port will load them
instead of the preset field. Once the request is made it must be repeated for
at least 1 µs or until the evaluation is complete. If new preset or coefficient
settings are going to be presented, they must be sent on all Lanes at the
same time. However, a given Lane isn't required to request new settings if it
wants to keep the ones it has.

The Upstream Port must wait long enough to ensure the Downstream
Transmitter has had a chance to implement the requested changes (500 ns
plus the round-trip delay for the logic), then obtain Block Alignment and
evaluate the incoming TS1s. It's not expected that anything useful will be
coming from the Downstream Port during the waiting period, and it may
not even be legal. That's why obtaining Block Alignment after that time is a
requirement.

When TS1s are received that contain the same equalization fields as are
being sent and the Reject Coefficient Values bit is not set (0b), then the
setting has been accepted and can now be evaluated. If the equalization fields
match but the Reject Coefficient Values bit is set (1b), then the setting has
been rejected. In that case the spec recommends that the Upstream Port
request a different equalization setting, but this is not required.

The total time spent on a preset or coefficient request, from the time the
request is sent until the completion of its evaluation, must be less than 2 ms.
An exception is available for designs that need more time for the final stage
of optimization, but the total time in this phase cannot exceed 24 ms and the
exception can only be taken twice. If the Receiver doesn't recognize any
incoming TS1s, it may assume that the requested setting doesn't work for
that Lane.

Exit to Phase 3 Upstream

The next phase is Phase 3 if all configured Lanes have their optimal
settings. When that happens, the Equalization Phase 2 Successful status bit
will be set to 1b.

Exit to Recovery.Speed

Otherwise, after a 24 ms timeout (with a tolerance of -0 or +2 ms), the
next state will be Recovery.Speed, and the successful_speed_negotiation
flag is cleared to 0b while the Equalization Complete status bit is set to 1b.

Phase 3 Upstream. During this phase, the Upstream Port sends TS1s with
EC = 11b and responds to the requested Tx values from the Downstream
Port.


If two consecutive TS1s aren't seen, keep the current Tx preset and
coefficient values. However, if two consecutive TS1s are received with
EC = 11b (Downstream Port has entered Phase 3) either for the first time, or
with different preset or coefficient values than the last time, and if the values
requested are legal and supported, then change the Tx settings to use them
within 500 ns of the end of the second TS1 requesting them. The requested
values must be reflected in the TS1s being sent back to the Downstream Port
and the Reject Coefficient Values bit must be cleared to 0b. Note that the
change must not cause illegal voltages or parameters at the Transmitter for
more than 1 ns.

• If the requested preset or coefficients are illegal or not supported,
  don't change the Tx settings but reflect the received values in the TS1s
  being sent and set the Reject Coefficient Values bit to 1b (see Figure
  14-38 on page 590).

Exit to "Detailed Recovery Substates"

When the Downstream Port is satisfied with the changes, it begins to
send TS1s with EC = 00b, indicating a desire to finish the equalization
process. When two consecutive TS1s like this are received, set the
Equalization Phase 3 Successful and Equalization Complete status bits
to 1b.

Exit to Recovery.Speed

If the above criteria are not met within a 32 ms timeout, the next state
will be Recovery.Speed. The successful_speed_negotiation flag will be
cleared to 0b and the Equalization Complete status bit will be set.
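
The per-phase timeouts described in this section can be collected into one lookup, which makes the common failure path easier to see. The sketch below is a simplified model with hypothetical names: it ignores the -0/+2 ms tolerance mentioned for some phases and represents only the timeout exit, not the successful exits.

```python
# Timeout in milliseconds for each (port, phase), as given in the text above.
EQ_PHASE_TIMEOUT_MS = {
    ("downstream", 1): 24,
    ("downstream", 2): 32,
    ("downstream", 3): 24,
    ("upstream", 0): 12,
    ("upstream", 1): 12,
    ("upstream", 2): 24,
    ("upstream", 3): 32,
}


def check_phase_timeout(port: str, phase: int, elapsed_ms: float):
    """Return the failure actions once the phase timeout expires, else None."""
    timeout = EQ_PHASE_TIMEOUT_MS[(port, phase)]
    if elapsed_ms < timeout:
        return None
    # Every phase fails the same way: go to Recovery.Speed, record that the
    # speed negotiation failed, and mark equalization as complete.
    return {
        "next_state": "Recovery.Speed",
        "successful_speed_negotiation": 0,
        "equalization_complete": 1,
    }
```

Note that the Downstream Port has no Phase 0 entry because it starts in Phase 1.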

Recovery.Speed
When entering this substate, a device must enter Electrical Idle on its
Transmitter and wait for its Receiver to enter Electrical Idle. After that, it
must remain there for at least 800 ns if the speed change succeeded
(successful_speed_negotiation = 1b) or for at least 6 µs if the speed change
was not successful (successful_speed_negotiation = 0b), but not longer than
an additional 1 ms.

An EIOS must be sent prior to entering this substate if the current rate is 2.5
GT/s or 8.0 GT/s, and two must be sent if the current rate is 5.0 GT/s. An
Electrical Idle condition exists on a Lane when these EIOSs have been seen
or when it is otherwise detected or inferred (as described in "Electrical Idle"
on page 736).


The operating frequency is only allowed to change after the Receiver Lanes
have entered Electrical Idle. If the Link is already operating at the highest
commonly-supported rate, the rate won't be changed even though this
substate is executed.

If the negotiated rate is 5.0 GT/s, the de-emphasis level must be selected
based on the setting of the select_deemphasis variable: if the variable is 0b,
apply -6 dB de-emphasis, but if the variable is 1b, apply -3.5 dB de-emphasis
instead.
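
The entry rules for Recovery.Speed reduce to three small lookups, sketched below. The function names and the use of strings for the data rates are choices made only for this illustration.

```python
def min_electrical_idle_ns(successful_speed_negotiation: bool) -> int:
    """Minimum time to stay in Electrical Idle: 800 ns or 6 us (6000 ns)."""
    return 800 if successful_speed_negotiation else 6000


def eios_count_before_entry(current_rate: str) -> int:
    """Number of EIOS to send before entering Recovery.Speed."""
    if current_rate == "5.0":
        return 2
    if current_rate in ("2.5", "8.0"):
        return 1
    raise ValueError(f"unsupported rate: {current_rate}")


def deemphasis_db_at_5gt(select_deemphasis: int) -> float:
    """De-emphasis applied at 5.0 GT/s based on select_deemphasis."""
    return -3.5 if select_deemphasis else -6.0
```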

Curiously, the DC common mode voltage does not have to be maintained
within spec limits during this substate.

If this substate is entered after a successful speed negotiation
(successful_speed_negotiation = 1b), Electrical Idle can be inferred as
shown in Table 14-10 on page 596. The spec points out that this covers the
case in which both Link partners have recognized incoming TS1s and TS2s,
so their absence can be interpreted as an entry to Electrical Idle.

If this substate is entered after an unsuccessful speed negotiation
(successful_speed_negotiation = 0b), Electrical Idle can be inferred if an
Electrical Idle exit has not been detected at least once on any configured
Lane in the specified time. This is intended to cover the case when at least
one side of the Link is not able to recognize TS Ordered Sets, and so the lack
of an exit from Electrical Idle over a longer interval can be treated as an
entry to Electrical Idle.

Table 14-10: Conditions for Inferring Electrical Idle

State                         2.5 GT/s                      5.0 GT/s                      8.0 GT/s

L0                            Absence of Flow Control       Absence of Flow Control       Absence of Flow Control
                              Update DLLP or SOS in a       Update DLLP or SOS in a       Update DLLP or SOS in a
                              128 µs window                 128 µs window                 128 µs window

Recovery.RcvrCfg              Absence of a TS1 or TS2 in    Absence of a TS1 or TS2 in    Absence of a TS1 or TS2 in
                              a 1280 UI interval            a 1280 UI interval            a 4 ms window

Recovery.Speed when           Absence of a TS1 or TS2 in    Absence of a TS1 or TS2 in    Absence of a TS1 or TS2 in
successful_speed_negotiation  a 1280 UI interval            a 1280 UI interval            a 4680 UI interval
= 1b

Recovery.Speed when           Absence of an Electrical      Absence of an Electrical      Absence of an Electrical
successful_speed_negotiation  Idle exit in a 2000 UI        Idle exit in a 16000 UI       Idle exit in a 16000 UI
= 0b                          interval                      interval                      interval

Loopback.Active (as a slave)  Absence of an Electrical      N/A                           N/A
                              Idle exit in a 128 µs window

The directed_speed_change variable will be cleared to 0b and the new data
rate must be visible in the Current Link Speed field of the Link Status
register, shown in Figure 14-39.

If the speed was changed because of a Link bandwidth change:

• If successful_speed_negotiation is set to 1b and the Autonomous
  Change bit in the 8 consecutive TS2s is set to 1b, or the speed change
  was initiated by the Downstream Port for autonomous reasons (not a
  reliability problem and not caused by software setting the Link
  Retrain bit), then the Link Autonomous Bandwidth Status bit in the
  Link Status register is set to 1b.
• Otherwise, the Link Bandwidth Management Status bit is set to 1b.

Figure 14-39: Link Status Register

Bits     Field
15       Link Autonomous Bandwidth Status
14       Link Bandwidth Management Status
13       Data Link Layer Link Active
12       Slot Clock Configuration
11       Link Training
10       Undefined
9:4      Negotiated Link Width
3:0      Current Link Speed


Exit to "Detailed Recovery Substates"
Once the timeout has expired, the next state will be Recovery.RcvrLock.

If this substate was entered from Recovery.RcvrCfg and the speed change
was successful, the new data rate is changed on all the configured Lanes to
the highest commonly-supported rate and the changed_speed_recovery
variable is set to 1b.

If this substate was entered for a second time since entering Recovery from
L0 or L1 (indicated by changed_speed_recovery = 1b), the new data rate
will be the rate that was in use when the LTSSM entered Recovery, and the
changed_speed_recovery variable is cleared to 0b.

Otherwise, the new data rate will revert to 2.5 GT/s and the
changed_speed_recovery variable remains cleared to 0b. The spec notes
that this represents the case when the rate in L0 was greater than 2.5 GT/s
but one Link partner couldn't operate at that rate and timed out in
Recovery.RcvrLock the first time through.
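
The three cases above can be read as a simple decision chain. The following sketch assumes the cases are evaluated in the order the text lists them; the function and parameter names are hypothetical and are not taken from the spec.

```python
def select_new_data_rate(entered_from_rcvrcfg: bool,
                         speed_change_successful: bool,
                         changed_speed_recovery: bool,
                         highest_common_rate: float,
                         rate_at_recovery_entry: float):
    """Return (new data rate in GT/s, new changed_speed_recovery value)."""
    if entered_from_rcvrcfg and speed_change_successful:
        # Case 1: move every configured Lane to the highest common rate.
        return highest_common_rate, True
    if changed_speed_recovery:
        # Case 2: second pass through Recovery.Speed, so go back to the
        # rate that was in use when Recovery was entered.
        return rate_at_recovery_entry, False
    # Case 3: fall back to the base rate.
    return 2.5, False
```

For instance, a successful first pass on a Link where both partners support 8.0 GT/s yields (8.0, True), while a second pass that started from 5.0 GT/s yields (5.0, False).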

Exit to Detect State
If none of the conditions for exiting to Recovery.RcvrLock are met, the next
state will be Detect, although the spec points out that this shouldn't be
possible under normal conditions. It would mean that the Link neighbors can
no longer communicate at all.

Recovery.RcvrCfg
This state can only be entered from Recovery.RcvrLock after receiving at least 8
TS1 or TS2 ordered sets with the same Link and Lane numbers that had been
negotiated previously. This means that bit and symbol or block lock have been
established and now the Port must determine if there are any other items that
need to be addressed in the Recovery state. If the purpose of entering Recovery
was simply to re-establish bit and symbol lock after leaving a link power
management state, then it is likely that TS2s will be exchanged here and the
LTSSM will progress on to Recovery.Idle. If, however, there was another reason
for entering the Recovery state (e.g., speed change or link width change), then
that will be determined in this substate and the appropriate state transition
will occur.

During this substate, the Transmitter sends TS2s on all configured Lanes with
the same Link and Lane Numbers configured earlier. If the
directed_speed_change variable is set to 1b, then the speed_change bit in the
TS2s must also be set. The N_FTS value in the TS2s should reflect the number
needed at the current rate. The start_equalization_w_preset variable is cleared
to 0b when entering this substate.


If the speed has been changed, a different N_FTS number may now be seen in
the TS2s. That value must be used for exiting future L0s low-power Link states.
For 8b/10b encoding, Lane-to-Lane de-skew must be completed before leaving
this substate. Devices must note the advertised rate identifier in incoming TS2s
and use this to override any previously recorded values. When using 128b/130b
encoding, devices must make a note of the value of the Request Equalization bit
for future reference.

Notes about this substate: The variable successful_speed_negotiation is set to
1b. The data rates advertised in the TS2s with the speed_change bit set are
noted at this point for future reference, as is the Autonomous Change bit for
possible logging in the Link Status register during Recovery.Speed. The rate
that will be selected in Recovery.Speed will be the highest commonly-supported
rate. Interestingly, the change to Recovery.Speed will take place for this case
even if the Link is already operating at the highest supported rate, although in
that case the rate won't actually change.

If the speed is going to change to 8.0 GT/s, a Downstream Port will need to send
EQ TS2s (bit 7 of Symbol 6 is set to 1b to indicate an EQ training sequence).
This case would be recognized if 8.0 GT/s is mutually supported and 8
consecutive TS1s or TS2s have been seen on any configured Lane with the
speed_change bit set, or if the equalization_done_8GT_data_rate variable is 0b,
or if directed. An Upstream Port can set the Request Equalization bit if the
current data rate is 8.0 GT/s and there was a problem with the equalization
process. Either Port can request that equalization be done again by setting both
the Request Equalization and Quiesce Guarantee bits to 1b.

Upstream Ports set their select_deemphasis variable based on the Selectable
De-emphasis bit in the received TS2s. And, if the TS2s were EQ TS2s, they set
the start_equalization_w_preset variable to 1b and update their Lane
Equalization Control register with the new information (i.e., update the
Upstream Port Transmitter Preset and Receiver Preset Hint fields in the
register). Any configured Lanes that don't receive EQ TS2s will choose their
preset values for 8.0 GT/s operation in a design-specific manner. Downstream
Ports must set their start_equalization_w_preset variable to 1b if the
equalization_done_8GT_data_rate variable is cleared to 0b or if directed.

Finally, if 128b/130b encoding is in use, devices must make a note of the
Request Equalization bit. If set, both it and the Quiesce Guarantee bit must be
stored for future reference.


Exit to Recovery.Idle
The next state will be Recovery.Idle if two conditions are true:

• Eight consecutive TS2s are received on any configured Lane with Link and Lane numbers and rate identifiers that match those being sent and either:
   a) The speed_change bit in the TS2s is cleared to 0b, or
   b) No rate higher than 2.5 GT/s is commonly supported.
• Sixteen TS2s have been sent after receiving one and they haven't been interrupted by any intervening EIEOS. The changed_speed_recovery and directed_speed_change variables are both cleared to 0b on entry to this substate.

Exit to Recovery.Speed
The LTSSM will go to Recovery.Speed if ALL three conditions listed below are true:

• Eight consecutive TS2s are received on any configured Lane with the speed_change bit set, identical rate identifiers, identical values in Symbol 6, and:
   a) The TS2s were standard 8b/10b TS2s, or
   b) The TS2s were EQ TS2s, or
   c) 1 ms has expired since receiving eight EQ TS2s on any configured Lane.
• Both Link partners support rates higher than 2.5 GT/s, or the rate is already higher than 2.5 GT/s.
• For 8b/10b encoding, at least 32 TS2s were sent with the speed_change bit set to 1b without any intervening EIEOS after receiving one TS2 with the speed_change bit set to 1b in the same configured Lane. For 128b/130b encoding, at least 128 TS2s are sent with the speed_change bit set to 1b after receiving one TS2 with the speed_change bit set to 1b in the same configured Lane.

A transition to Recovery.Speed can also occur if the rate has changed to a mutually negotiated rate since entering Recovery from L0 or L1 (changed_speed_recovery = 1b) and any configured Lanes have either seen EIOS or detected/inferred Electrical Idle and haven't seen TS2s since entering this substate. This means a higher rate was attempted but the Link partner indicates that it isn't working for some reason. The new rate will return to whatever it was when Recovery was entered from L0 or L1.

The final case that can cause a transition to Recovery.Speed is if the rate has not changed to a mutually negotiated rate since entering Recovery from L0 or L1 (changed_speed_recovery = 0b), and the current rate is already higher than 2.5 GT/s, and any configured Lanes have either seen EIOS or detected/inferred Electrical Idle and haven't seen TS2s since entering this substate. In this case, the understanding is that the current rate isn't working and the solution is to drop back down, so the new rate will become 2.5 GT/s.
Exit to Configuration State
The next state will be Configuration if 8 consecutive TS1s are received on any configured Lane with Link or Lane numbers that don't match those being sent and either the speed_change bit is cleared to 0b, or no rate higher than 2.5 GT/s is commonly supported.
The variables changed_speed_recovery and directed_speed_change are cleared to 0b when the LTSSM transitions to Configuration. If the N_FTS value has changed since last time, the new value must be used for L0s going forward.
Exit to Detect State
After 48 ms without resolving to one of the previously defined state transitions, the next state will be Detect if the data rate is 2.5 GT/s or 5.0 GT/s.
If the rate is 8.0 GT/s there is another possibility because the number of attempts may not have been exceeded yet. That is indicated by the idle_to_rlock_transitioned variable, and if it's less than FFh when the rate is 8.0 GT/s, the new state will be Recovery.Idle. If that transition is made, the variables changed_speed_recovery and directed_speed_change will be cleared to 0b. However, once idle_to_rlock_transitioned reaches FFh, and the 48 ms timeout is seen, the next state will be Detect.

Recovery.Idle
As the name implies, Transmitters will usually send Idles in this substate as a preparation for changing to the fully operational L0 state. For 8b/10b mode, Idle data is normally sent on all the Lanes, while for 128b/130b an SDS is sent to start a Data Stream and then Idle data Symbols are sent on all the Lanes.

Exit to L0 State
The next state is L0 if either of the following cases is true. In either case, if the Retrain Link bit has been written to 1b since the last transition to L0 from Recovery or Configuration, the Downstream Port will set the Link Bandwidth Management Status bit to 1b (see Figure 14-39 on page 597).

• 8b/10b encoding is in use and 8 consecutive Symbol Times of Idle data have been received and 16 Idle data Symbols have been sent since the first one was received.
• 128b/130b encoding is in use, 8 consecutive Symbol Times of Idle data have been received and 16 Idle data Symbols have been sent since the first one was received, and this state wasn't entered from Recovery.RcvrCfg. Note that Idle data Symbols must be contained in Data Blocks, Lane-to-Lane Deskew must be completed before Data Stream processing starts, and the idle_to_rlock_transitioned variable is cleared to 00h on transition to L0.
Exit to Configuration State
The next state is Configuration if either:

• A Port is instructed by a higher layer to optionally reconfigure the Link, such as to change the Link width.
• Any configured Lane sees two consecutive incoming TS1s with Lane numbers set to PAD (a Port that transitions to Configuration to change the Link will send PAD Lane numbers on all Lanes). The spec recommends that the LTSSM use this transition when changing the Link width to reduce the time it will take.

Exit to Disable State
The next state is Disabled if either:

• A Downstream or optional crosslink Port is instructed by a higher layer to set the Disable Link bit in its TS1s or TS2s.
• Any configured Lane of an Upstream or optional crosslink Port sees the Disable Link bit set in two consecutive incoming TS1s.

Exit to Hot Reset State
The next state is Hot Reset if either:

• A Downstream or optional crosslink Port is instructed by a higher layer to set the Hot Reset bit in its TS1s or TS2s.
• Any configured Lane of an Upstream or optional crosslink Port sees the Hot Reset bit set in two consecutive incoming TS1s.

Exit to Loopback State
The next state is Loopback if either:

• A Transmitter is known to be Loopback Master capable (design specific; the spec does not provide a means to verify this) and instructed by a higher layer to set the Loopback bit in its TS1s or TS2s.
• Any configured Lane of an Upstream or optional crosslink Port sees the Loopback bit set in two consecutive incoming TS1s. The receiving device then becomes the Loopback slave.


Exit to Detect State
Otherwise, after a 2 ms timeout, the next state will be Detect unless the idle_to_rlock_transitioned variable is less than FFh, in which case the next state will be Recovery.RcvrLock. For the transition to Recovery.RcvrLock, if the data rate is 8.0 GT/s the idle_to_rlock_transitioned variable is incremented by 1, while for 2.5 or 5.0 GT/s it will be set to FFh.
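
The timeout rule above can be sketched as a small decision function. This is only an illustrative model of the behavior described in the text, not spec pseudocode; the function name, the string state names, and the tuple return value are assumptions made for the example.

```python
IDLE_TO_RLOCK_MAX = 0xFF  # once the counter reaches FFh, no further retries


def recovery_idle_timeout(rate_gts, idle_to_rlock_transitioned):
    """Model the 2 ms Recovery.Idle timeout (hypothetical helper).

    Returns (next_state, updated idle_to_rlock_transitioned).
    """
    if idle_to_rlock_transitioned >= IDLE_TO_RLOCK_MAX:
        # Retry budget exhausted: give up and start over from Detect.
        return "Detect", idle_to_rlock_transitioned
    if rate_gts == 8.0:
        # At 8.0 GT/s each retry through Recovery.RcvrLock is counted.
        return "Recovery.RcvrLock", idle_to_rlock_transitioned + 1
    # At 2.5 or 5.0 GT/s only one retry is allowed, so the counter saturates.
    return "Recovery.RcvrLock", IDLE_TO_RLOCK_MAX


print(recovery_idle_timeout(8.0, 0))
print(recovery_idle_timeout(2.5, 0))
```

Modeled this way, an 8.0 GT/s Link may retry many times before falling back to Detect, while a 2.5 or 5.0 GT/s Link gets a single retry because its counter jumps straight to FFh.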

L0s State
This is the low power Link state that has the shortest exit latency back to L0. Devices manage entry and exit from this state automatically under hardware control without any software involvement. Each direction of a Link can enter and exit the L0s state independently of the other.

L0s Transmitter State Machine

The L0s state has different substates for the Transmitter and the Receiver. The Transmitter substates will be described first. As shown in Figure 14-40 on page 603, the transmitter state machine associated with the L0s state is a simple one.

Figure 14-40: L0s Tx State Machine
[State diagram: Entry from L0 -> Tx_L0s.Entry (Tx sends EIOS) -> after TTX-IDLE-MIN = 20 ns -> Tx_L0s.Idle (Tx Electrical Idle) -> when directed -> Tx_L0s.FTS (Transmitter sends FTSs on all Lanes) -> Transmitter sends SOS or EIEOS -> Exit to L0]

Tx_L0s.Entry.
A Transmitter enters L0s when directed by an upper layer. The spec gives no decision criteria for this, but intuitively it would occur based on an inactivity timeout: no TLPs or DLLPs being sent for a given time. To enter L0s, the Transmitter sends one EIOS (two EIOSs for the 5.0 GT/s rate) and enters Electrical Idle. The Transmitter is not turned off, however, and must maintain the DC common mode voltage within the spec range.

Exit to Tx_L0s.Idle

The next state will be Tx_L0s.Idle after the TTX-IDLE-MIN timeout (20 ns). This time is intended to ensure that the Transmitter has established the Electrical Idle condition.

Tx_L0s.Idle.
In this substate, the transmitter continues the Electrical Idle state until directed to leave. Because this direction of the Link is in Electrical Idle, there will be a power savings benefit, which is the entire purpose of the L0s state.

Exit to Tx_L0s.FTS

The next state will be Tx_L0s.FTS when directed, such as when the Port needs to resume packet transmission. The LTSSM will be instructed in a design-specific manner to exit this state.

Tx_L0s.FTS.
In this substate, the Transmitter will start sending FTS ordered sets to retrain the Receiver of the Link Partner. The number of FTSs sent is the N_FTS value advertised by the Link Partner in its TS Ordered Sets during the last training sequence that led to L0. The spec notes that if a Receiver times out while trying to do this, it may choose to increase the N_FTS value it advertises during the Recovery state.

If the Extended Synch bit is set (see Figure 14-71 on page 644), the transmitter must send 4096 FTSs instead of the N_FTS number. This extends the time available to synchronize external test and analysis logic, which may not be able to recover Bit Lock as quickly as the embedded logic can.

For all data rates, no SOSs can be sent prior to sending any FTSs. However, for the 5.0 GT/s rate, 4 to 8 EIE Symbols must be sent prior to sending the FTSs. For 128b/130b, an EIEOS must be sent prior to the FTSs.


Exit to L0 State

The Transmitter will transition to the L0 state once all the FTSs have been sent and:

a) For 8b/10b encoding, one SOS is sent on all configured Lanes, although none are sent before or during the FTSs.
b) For 128b/130b encoding, one EIEOS is sent followed by an SDS and a Data Stream.
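
To pull the Tx_L0s.FTS rules together, the following minimal sketch lists what a Transmitter sends on the way from Tx_L0s.FTS back to L0. The function name and the (item, count) tuple format are hypothetical, the minimum of 4 EIE Symbols is chosen arbitrarily from the allowed range of 4 to 8, and the sketch covers only the items named in this section (for instance, it leaves out any EIEOS inserted periodically among the FTSs in 128b/130b mode).

```python
EXTENDED_SYNCH_FTS = 4096  # FTS count used when the Extended Synch bit is set


def tx_l0s_exit_sequence(rate_gts, n_fts, extended_synch=False):
    """Return the ordered list of (item, count) sent when exiting L0s.

    n_fts is the N_FTS value advertised by the Link partner.
    """
    fts_count = EXTENDED_SYNCH_FTS if extended_synch else n_fts
    sequence = []
    if rate_gts == 8.0:
        # 128b/130b: EIEOS before the FTSs, then EIEOS + SDS to start the Data Stream.
        sequence.append(("EIEOS", 1))
        sequence.append(("FTS", fts_count))
        sequence.append(("EIEOS", 1))
        sequence.append(("SDS", 1))
    else:
        if rate_gts == 5.0:
            # 5.0 GT/s requires 4 to 8 EIE Symbols first; the minimum is used here.
            sequence.append(("EIE symbols", 4))
        # 8b/10b: no SOS before or during the FTSs, exactly one afterwards.
        sequence.append(("FTS", fts_count))
        sequence.append(("SOS", 1))
    return sequence


for rate in (2.5, 5.0, 8.0):
    print(rate, tx_l0s_exit_sequence(rate, n_fts=32))
```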

L0s Receiver State Machine

Figure 14-41 on page 605 shows the Receiver L0s state machine. A Receiver is required to implement L0s support if the ASPM Support field in the Link Capability register shows it to be supported, and is allowed to implement it even if that support is not indicated.

Figure 14-41: L0s Receiver State Machine
[State diagram: Entry from L0 -> Rx_L0s.Entry (Rx detects EIOS) -> after TTX-IDLE-MIN = 20 ns -> Rx_L0s.Idle (Rx Electrical Idle) -> exit from Electrical Idle -> Rx_L0s.FTS (FTSs received). From Rx_L0s.FTS: Tx sends SOS or EIEOS -> Exit to L0; N_FTS timeout -> Exit to Recovery]

Rx_L0s.Entry.
This substate is entered when a Receiver receives an EIOS, provided it supports L0s and hasn't been directed to L1 or L2.

Exit to Rx_L0s.Idle

The next state will be Rx_L0s.Idle after the TTX-IDLE-MIN timeout (20 ns).

Rx_L0s.Idle.
The Receiver is now in Electrical Idle mode and is just waiting to see an exit from Electrical Idle.

As an aside regarding Electrical Idle, the early versions of the spec expected that Electrical Idle would be based on a squelch-detect circuit measuring a voltage threshold. Later, as speeds increased, detecting such small voltage differences became increasingly difficult. Consequently, more recent spec versions allow Electrical Idle to be inferred by observing Link behavior, rather than actually measuring the voltage. However, if the voltage level isn't used to detect entry into Electrical Idle, then it also can't be used to detect an exit from it. To handle that problem, a new Ordered Set was introduced called the EIEOS (Electrical Idle Exit Ordered Set). The EIEOS consists of alternating bytes of all zeros and all ones and creates the effect of a low-frequency clock on the Lanes. Once a Receiver has entered Electrical Idle it can watch for this pattern on the signal to inform it that the Link is exiting from Electrical Idle.

Exit to Rx_L0s.FTS

The next state will be Rx_L0s.FTS after the Receiver detects an exit from Electrical Idle.

Rx_L0s.FTS.
In this substate, the Receiver has noticed an exit from Electrical Idle and is now trying to re-establish Bit and Symbol or Block lock on the incoming bit stream (which really consists of FTS ordered sets).

Exit to L0 State

The next state will be L0 if an SOS is received in 8b/10b encoding or an SDS in 128b/130b encoding on all configured Lanes. The Receiver must be able to accept valid data immediately after that, and Lane-to-Lane deskew must be completed before leaving this state.


Exit to Recovery State
Otherwise the next state will be Recovery after the N_FTS timeout. If so, the Transmitter must also go to Recovery, although it's allowed to finish any TLP or DLLP that was in progress. If the timeout occurs, the spec recommends that the N_FTS value be increased to reduce the likelihood of it happening again. The N_FTS timeout is defined as follows:

For 8b/10b, the minimum timeout is given as 40 * [N_FTS + 3] * UI, while the maximum allowed is twice that time. Since 10 bits (UI represents one bit time) are needed per Symbol, this works out to (4 * N_FTS + 12) Symbols. The extra 12 Symbols are explained as 6 for a max-sized SOS + 4 for the possible extra FTS + 2 more for Symbol margin. In summary, then, the minimum time is the time it should take to send the requested number of FTSs plus 12 Symbols, while the maximum time is twice as much as that.

If the extended synch bit is set, the min time = 2048 FTSs and the max time = 4096 FTSs. The actual timeout value a Receiver will use must also take into account the 4 to 8 EIE Symbols for speeds other than 2.5 GT/s.

For 128b/130b, the timeout value is given as a minimum of 130 * [N_FTS + 5 + 12 + Floor(N_FTS/32)] * UI and a max of twice that time. The value 130 * UI means 130 bit times, which represents one Block, so if we remove those two values we can say we're looking at [N_FTS + 5 + 12 + Floor(N_FTS/32)] Blocks. The value [5 + Floor(N_FTS/32)] represents the EIEOSs that will need to be sent during this time. One EIEOS will be sent after every 32 FTSs, so Floor(N_FTS/32) gives that number. The other 5 are accounted for by the first EIEOS, the last EIEOS, the SDS, the periodic EIEOS and an additional EIEOS in case the Transmitter chooses to send two EIEOS followed by an SDS when N_FTS is divisible by 32. Finally, the value of 12 represents the number of SOSs that will be sent if the extended synch bit is set. When that bit is set, the timeout will use N_FTS = 4096.
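
As a worked example of the arithmetic above, the sketch below computes the minimum and maximum N_FTS timeout in UI for both encodings. The function name and the string encoding labels are assumptions for illustration. The 8b/10b Extended Synch case relies on an FTS Ordered Set being 4 Symbols long, which is implied by the (4 * N_FTS) term, and the adjustment for the 4 to 8 EIE Symbols is deliberately left out.

```python
UI_PER_SYMBOL = 10          # 8b/10b: 10 bit times per Symbol
SYMBOLS_PER_FTS_8B10B = 4   # an 8b/10b FTS Ordered Set is 4 Symbols
UI_PER_BLOCK = 130          # 128b/130b: 130 bit times per Block


def n_fts_timeout_ui(encoding, n_fts, extended_synch=False):
    """Return (min_ui, max_ui) for the Receiver's N_FTS timeout."""
    if encoding == "8b/10b":
        if extended_synch:
            # Bounds are given directly as 2048 and 4096 FTSs.
            ui_per_fts = SYMBOLS_PER_FTS_8B10B * UI_PER_SYMBOL
            return 2048 * ui_per_fts, 4096 * ui_per_fts
        minimum = 40 * (n_fts + 3)
    elif encoding == "128b/130b":
        if extended_synch:
            # With Extended Synch set, the formula uses N_FTS = 4096.
            n_fts = 4096
        minimum = UI_PER_BLOCK * (n_fts + 5 + 12 + n_fts // 32)
    else:
        raise ValueError(f"unknown encoding: {encoding}")
    return minimum, 2 * minimum


# N_FTS = 10 at 8b/10b: 40 * 13 = 520 UI minimum, i.e. 4 * 10 + 12 = 52 Symbols.
print(n_fts_timeout_ui("8b/10b", 10))
# N_FTS = 64 at 128b/130b: 64 + 5 + 12 + 2 = 83 Blocks minimum.
print(n_fts_timeout_ui("128b/130b", 64))
```

Dividing the 8b/10b result by 10 recovers the Symbol count and dividing the 128b/130b result by 130 recovers the Block count, which makes it easy to cross-check against the breakdown given in the text.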

L1 State
This Link power state trades a longer exit latency for more aggressive power management compared to the L0s state. L1 is an option for ASPM, like L0s, meaning devices can enter and exit this state automatically under hardware control without any software involvement. However, unlike L0s, software is also able to direct an Upstream Port to initiate a change to L1, and it does so by writing the device power state to a lower level (D1, D2, or D3). The L1 state is also different from L0s in that it affects both directions of the Link.


Figure 14-42: L1 State Machine
[State diagram: Entry from L0 (directed, and EIOS Tx & Rx) -> L1.Entry (Tx in Electrical Idle) -> after TTX-IDLE-MIN = 20 ns -> L1.Idle (Electrical Idle; remain in Electrical Idle) -> Tx directed or Rx sees Electrical Idle Exit -> Exit to Recovery]

Since going to Electrical Idle can indicate a desire by the Link partner to enter L0s, L1 or L2, differentiating which should be the next state is handled by having both partners agree beforehand when they're going to enter L1. A handshake informs them that the partner is ready and it's therefore safe to proceed. For more detail on how this works, see the section called "Introduction to Link Power Management" on page 733. Figure 14-42 on page 608 shows the L1 state machine, which is described in the following sections.

L1.Entry
In order for an Upstream Port to enter this state, it must send a request to enter L1 to its Link Partner and receive acknowledgement that it is OK to put the Link into L1. (The reason for requesting to go into L1 may be because of ASPM or because of software involvement.) Once the L1 request acknowledge is received, the Upstream Port enters the L1.Entry substate.

In order for a Downstream Port to enter this state, it must receive an L1 enter request from the Upstream Port and send a positive response to that request. Then the Downstream Port waits to receive an Electrical Idle Ordered Set (EIOS) and have its receive lanes drop to Electrical Idle. It is at this point that the Downstream Port enters the L1.Entry substate.


During L1.Entry
All configured Transmitters send an EIOS and enter Electrical Idle while maintaining the proper DC common mode voltage.

Exit to L1.Idle
The next state will be L1.Idle after the TTX-IDLE-MIN timeout (20 ns). This time is intended to ensure that the Transmitter has established the Electrical Idle condition.

L1.Idle
During this substate, Transmitters remain in Electrical Idle.

For rates other than 2.5 GT/s the LTSSM must remain in this substate for at least 40 ns. In the spec, this delay is said to account for the delay in the logic levels to arm the Electrical Idle detection circuitry in case the Link enters L1 and immediately exits.

Exit to Recovery State
The next state will be Recovery when a Transmitter is directed to exit L1 or when any Receiver detects an exit from Electrical Idle. Reasons for leaving L1 include the need to deliver a DLLP or TLP, or a desire to change the Link width or speed. If a speed change is desired, a Port is allowed to set the directed_speed_change variable to 1b and must clear the changed_speed_recovery variable to 0b. Optionally, the Port may exit L1 and then initiate the speed change later by setting directed_speed_change to 1b and entering Recovery from L0 instead.

L2 State
This is a deeper power state with a longer exit latency than L1. Power Management software directs an Upstream Port to initiate entry into L2 (both directions of the Link go to L2) when its device is placed in the D3cold power state and the appropriate Link handshakes have been completed.

Main power will be shut off by the system once it learns that everything is ready. When power is removed, the Link power state will become either L2 or L3, depending on whether a secondary power source called VAUX (auxiliary voltage) is available. If VAUX is present, the Link enters L2; if not, it enters L3.

The motivation for L2 is to use the small power available from VAUX to inform the system when an event has occurred for which the Link needs to have power restored. There are two standard ways a device can inform the system of such an event. One is a sideband signal called the WAKE# pin, and the other is an in-band signal called a Beacon. The L2 state isn't needed for WAKE#, but is required if the optional Beacon will be used. The spec explicitly states that devices operating at 5.0 or 8.0 GT/s don't need to support Beacon, so it would seem that this is legacy support and only interesting for devices operating at 2.5 GT/s. For more detail on Link wakeup options, refer to "Waking Non-Communicating Links" on page 772.

If supported, the Beacon is a low-frequency (30 KHz to 500 MHz) in-band signal that an Upstream Port supporting wakeup capability must be able to send on at least Lane 0 and a Downstream Port must be able to receive. Intermediate devices like Switches that receive a Beacon on a Downstream Port must forward it to their Upstream Port. The ultimate destination for the Beacon is the Root Complex, because that's where the system power control logic is expected to reside.

A Transmitter going to Electrical Idle could indicate a desire to enter any of the low-power Link states (L0s, L1 or L2), so a means of differentiating them is needed. For L2, this is handled by having the Link partners agree beforehand that they're going to enter L2 by using a handshake sequence to ensure that they're both ready. For more detail on how this works, see the section called "Introduction to Link Power Management" on page 733. Figure 14-43 on page 611 shows the L2 entry and exit state machine, which is described in the following text.


Figure 14-43: L2 State Machine
[State diagram: Entry from L0 (directed, and EIOS both sent and received) -> L2.Idle (Electrical Idle, no DC CMV; Rx termination enabled, Rx looking for Electrical Idle Exit). L2.Idle -> L2.TransmitWake (Upstream Tx sends Beacon) when Upstream Port directed to send Beacon, or Downstream Port detects Beacon. L2.TransmitWake -> Exit to Detect when Upstream Rx detects Electrical Idle Exit. L2.Idle -> Exit to Detect when Root Port detects Beacon, or Upstream Port sees Electrical Idle Exit]

L2.Idle
To enter this substate, all the necessary handshake processes must have already taken place between both ports on the Link and the ports must have sent and received the required EIOS.

All configured Transmitters must remain in the Electrical Idle state for at least the TTX-IDLE-MIN timeout (20 ns). However, since the main power will now be shut off, they aren't required to maintain the DC common mode voltage within the spec range. Receivers won't start looking for the Electrical Idle exit condition until after the 20 ns timeout expires. All Receiver terminations must remain enabled in the low impedance condition.

Exit to L2.TransmitWake
The next state will be L2.TransmitWake if the Upstream Port is instructed to send a Beacon (the Beacon is always and only directed upstream to the Root Complex).

If the Downstream Port of a Switch detects a Beacon, it must direct the Upstream Port of the Switch to exit to L2.TransmitWake and begin sending a Beacon.

Exit to Detect State
Once main power is returned, the next state will be Detect.

If this Port has main power, but it detects an exit from Electrical Idle on any predetermined Lanes, meaning those that could be negotiated to be Lane 0 (multi-Lane Links must have at least two predetermined Lanes), the next state will be Detect. When this happens to a Switch Upstream Port, the Switch must also transition its Downstream Ports to Detect.

L2.TransmitWake
During this substate, the Transmitter will send the Beacon on at least Lane 0. Note that this state only applies to Upstream Ports because only they can send a Beacon.

Exit to Detect State
The next state will be Detect if an Electrical Idle exit is detected on any Receiver of an Upstream Port. Of course, power must have already been restored to the devices in order for the neighbor to exit from Electrical Idle.

Hot Reset State

A Port enters the Hot Reset state either because it is a Bridge and software programmed its configuration space to propagate a Hot Reset downstream, as explained in "Hot Reset (In-band Reset)" on page 837, or because the Port received two consecutive TS1s with the Hot Reset bit asserted.

During Hot Reset
A Port transmits TS1s with the Hot Reset bit set continuously but doesn't change the configured Link and Lane Numbers.

If the Upstream Port of a Switch enters the Hot Reset state, all configured Downstream Ports must transition to Hot Reset as soon as possible.

Exit to Detect State
In the Bridge where Hot Reset was originated, once software clears the configuration space bit that initiated the Hot Reset, the Bridge Port enters Detect. However, the Port must remain in the Hot Reset state for a minimum of 2 ms.

For Ports where Hot Reset was entered because of receiving two consecutive TS1s with the Hot Reset bit asserted, the Port remains in this state as long as it continues to receive this type of TS1. Once the Port stops receiving TS1s with the Hot Reset bit asserted, it will transition to the Detect state. However, the Port must remain in the Hot Reset state for a minimum of 2 ms.

Disable State
A Disabled Link is Electrically Idle and does not have to maintain the DC common mode voltage. Software initiates this by setting the Link Disable bit (see Figure 14-71 on page 644) in the Link Control register of a device and the device then sends TS1s with the Link Disable bit asserted.

During Disable
All Lanes transmit 16 to 32 TS1s with the Disable Link bit asserted, send an EIOS (two consecutive EIOSs for the 5.0 GT/s case) and then transition to Electrical Idle. The DC common-mode voltage does not need to be within spec.

If an EIOS (two consecutive EIOSs for the 5.0 GT/s case) was sent and an EIOS was also received on any configured Lane, then LinkUp = 0b (False) and the Lanes are considered to be disabled.

Exit to Detect State
For Upstream Ports, the next state will be Detect when Electrical Idle is detected at the Receiver or if no EIOS has been received within a 2 ms timeout.

For Downstream Ports, the next state will also be Detect, but not until the Link Disable bit has been cleared to 0b by software.

Loopback State
The Loopback state is a test and debug feature that isn't used during normal operation. A device acting as a Loopback master can put the Link partner into the Loopback slave mode by sending TS1s with the Loopback bit asserted. This can be done in-circuit, allowing the possibility of using the Loopback state to perform a BIST (Built-In Self Test) on the Link.

Once in this state, the Loopback master sends valid Symbols to the Loopback slave, which then echoes them back. The Loopback slave continues to perform clock tolerance compensation, so the master must continue to insert SOSs at the correct intervals. To perform clock tolerance compensation, the Loopback slave may have to add SKP Symbols to, or delete them from, the SOS it echoes to the Loopback master.

The Loopback state is exited when the Loopback master transmits an EIOS and the receiver detects Electrical Idle. The Loopback state machine is shown in Figure 14-44 on page 614 and described in the following text.

Figure 14-44: Loopback State Machine
[State diagram: Entry from Configuration or Recovery -> Loopback.Entry (Master sends TS1s w/ Loopback bit set) -> Master receives identical TS1s; Slave has entered Loopback -> Loopback.Active (Master sends valid Symbols; Slave required to retransmit exactly) -> Master: directed; Slave: directed or 4 EIOS seen -> Loopback.Exit (Slave: enter Electrical Idle for 2 ms; Master: Tx EIOSs and enter Electrical Idle for 2 ms) -> Exit to Detect. Loopback.Entry -> Loopback.Exit on a timeout of less than 100 ms]

Loopback.Entry
The typical behavior for this substate is for the Loopback Master to send TS1s with the Loopback bit set until it starts seeing those TS1s being returned. Once the Loopback Master sees TS1s being returned with the Loopback bit asserted, it knows that its Link Partner is now behaving as the Loopback Slave and is simply repeating everything it receives.

While in this substate, the Link is not considered to be active (LinkUp = 0b). Also, the Link and Lane numbers used in TS1s and TS2s are ignored by the Receiver. The spec makes an interesting observation regarding the use of Lane numbers with 128b/130b encoding. As it turns out, each Lane uses a different seed value for its scrambler (see "Scrambling" on page 430). Consequently, if the Lane numbers haven't been negotiated before going into the Loopback mode, it's possible that the Link partners could have different Lane assignments and would therefore be unable to recognize incoming Symbols. This can be avoided by waiting until the Lane numbers have been negotiated before directing the master to go to the Loopback state, or by directing the master to set the Compliance Receive bit during Loopback.Entry, or by some other method.

Loopback Master:
In this substate, the Loopback Master will continuously send TS1s with the Loopback bit set. The master may also assert the Compliance Receive bit in the TS1s to help with testing when one or both Ports are having trouble obtaining bit lock, Symbol lock, or Block alignment after a rate change. If the bit is set it must not be cleared while in this state.

If this substate was entered from Configuration.Linkwidth.Start, check to see whether the speed in use is the highest mutually supported rate for both Link partners. If not:

• Change to the highest common speed. Send 16 TS1s with the Loopback bit set followed by an EIOS (two EIOSs if the current speed is 5.0 GT/s), and then go to Electrical Idle for 1 ms. During the idle time, change the speed to the highest commonly supported rate.
• If the highest common rate is 5.0 GT/s, the slave's Tx de-emphasis is controlled by the master setting its Selectable De-emphasis bit in the TS1s to the desired value (1b = -3.5 dB, 0b = -6 dB).
• For data rates of 5.0 GT/s and higher, the master's Transmitter can choose any de-emphasis settings it wants, regardless of the settings it sent to the slave.
• Potential problem: if Loopback is entered after the Link has already trained to L0 and LinkUp = 1b, it's possible for one Port to enter Loopback from Recovery and the partner to enter from Configuration. If that happened, the latter Port might try to change the speed while the Port entering from Recovery does not, resulting in a situation where the results are undefined. The spec states that the test setup must avoid conflicting cases like this.

Exit to Loopback.Active

The next state will be Loopback.Active after either 2 ms, if the Compliance Receive bit is set in the outgoing TS1s, or two consecutive TS1s are received on a design-specific number of Lanes with the Loopback bit set and the Compliance Receive bit was not set in the outgoing TS1s.

Note that if the speed was changed, the master must ensure that enough TS1s have been sent for the slave to be able to acquire Symbol lock or Block alignment before going to the Loopback.Active state.


Exit to Loopback.Exit

If neither of the conditions to enter Loopback.Active is met, the next state will be Loopback.Exit after a design-specific timeout of less than 100 ms.

Loopback Slave:

This substate is entered by receiving two consecutive TS1s with the Loopback bit asserted.

If this substate was entered from Configuration.Linkwidth.Start, check to see whether the speed in use is the highest one that is mutually supported by both Link partners. If not:

• Change to the highest common speed. Send one EIOS (two EIOSs if the current speed is 5.0 GT/s), and then go to Electrical Idle for 2 ms. During the idle time, change the speed to the highest commonly supported rate.
• If the highest common rate is 5.0 GT/s, set the Transmitter's de-emphasis according to the Selectable De-emphasis bit in the received TS1s (1b = -3.5 dB, 0b = -6 dB).
• If the highest common rate is 8.0 GT/s and:
   a) EQ TS1s directed the slave to this state, use the Tx Preset settings they specified.
   b) Normal TS1s directed the slave to this state, the slave is allowed to use its default transmitter settings.

Exit to Loopback.Active

The next state will be Loopback.Active if the Compliance Receive bit was set in the incoming TS1s that directed the slave to this state. The slave doesn't need to wait for particular boundaries to send looped-back data, and is allowed to truncate any Ordered Set in progress.

Otherwise, the slave sends TS1s with Link and Lane numbers set to PAD and the next state will be Loopback.Active if:

• The rate is 2.5 or 5.0 GT/s and Symbol lock is acquired on all Lanes.
• The rate is 8.0 GT/s and two consecutive TS1s are seen on all active Lanes. Equalization is handled by evaluating and applying the values given in the TS1s, as long as they're supported and the EC value is appropriate for the direction of the Port (10b for Downstream Ports, and 11b for Upstream Ports). Optionally, the Port can accept either of the EC values for this case. If the settings are applied, they must take effect within 500 ns of receiving them and must not cause the Transmitter to violate any electrical specs for more than 1 ns. A significant difference compared to the process in Recovery.Equalization is that the new settings are not echoed in the TS1s being sent by the slave.

For 8b/10b, the slave must only transition to looped-back data on a Symbol boundary, but is allowed to truncate any Ordered Set in progress. For 128b/130b, no boundary is specified for when the looped-back data can be sent, and it is still allowed to truncate any Ordered Set in progress.

Loopback.Active
During this substate, the Loopback Master sends valid encoded data and should not send EIOS until it's ready to exit Loopback. The Loopback Slave echoes the received information without modification (even if the encoding is determined to be invalid), with the possible exception of inverting the polarity as determined in the Polling state. The slave also continues to perform clock tolerance compensation. That means SKPs must be added or removed as needed, but the Lanes aren't required to all send the same number.

Exit to Loopback.Exit
The next state will be Loopback.Exit for the loopback master if directed.

The next state will be Loopback.Exit for the loopback slave if either of two conditions is true:

• The slave is directed to exit or four consecutive EIOSs are seen on any Lane.
• Optionally, if the current speed is 2.5 GT/s and an EIOS is received or Electrical Idle is detected or inferred on any Lane. Electrical Idle may be inferred if any configured Lane has not detected an exit from Electrical Idle for 128 µs.

The slave must be able to detect an Electrical Idle on any Lane within 1 ms of EIOS being received. Between the time EIOS is received and Electrical Idle is actually detected, the Loopback Slave may receive a bit stream that is undefined by the encoding scheme, and it may loop that back to the transmitter.
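
The slave's exit conditions can be expressed as a simple predicate. The sketch below is illustrative only: the function name and its parameters are hypothetical, and the optional 2.5 GT/s rule is exposed as a flag because the text says a design may or may not implement it.

```python
def loopback_slave_should_exit(rate_gts, directed, consecutive_eios,
                               electrical_idle=False, use_optional_rule=True):
    """Decide whether a Loopback slave leaves Loopback.Active.

    consecutive_eios: number of consecutive EIOSs seen on any Lane.
    electrical_idle:  True if Electrical Idle is detected or inferred on any Lane.
    """
    # Mandatory condition: directed to exit, or four consecutive EIOSs seen.
    if directed or consecutive_eios >= 4:
        return True
    # Optional condition, only at 2.5 GT/s: a single EIOS, or Electrical Idle
    # detected or inferred, is enough.
    if use_optional_rule and rate_gts == 2.5:
        return consecutive_eios >= 1 or electrical_idle
    return False


print(loopback_slave_should_exit(8.0, directed=False, consecutive_eios=4))
print(loopback_slave_should_exit(2.5, directed=False, consecutive_eios=1))
```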


Loopback.Exit
During this substate, the Loopback Master sends an EIOS for Ports that support only 2.5 GT/s and eight consecutive EIOSs for Ports that support rates higher than 2.5 GT/s (optionally sending 8 for the Ports that only support 2.5 GT/s, too), and then enters Electrical Idle on all Lanes for 2 ms.

The Loopback Master must transition to Electrical Idle within TTX-IDLE-SET-TO-IDLE after sending the last EIOS. Note that the EIOS marks the end of the master's transmit and compare operations. Any data received by the master after any EIOS is received is undefined and should be ignored.

The loopback slave must enter Electrical Idle on all Lanes for 2 ms but must echo back all Symbols received prior to detecting Electrical Idle to ensure that the master sees the arrival of the EIOS as the end of the logical send and compare operation.

Exit to Detect State
The next state will be Detect once the required EIOSs have been exchanged
and the Lanes have been in Electrical Idle for 2 ms.

Dynamic Bandwidth Changes


Higher data rates and wider Links for PCIe offer higher performance than
previous generations but use more power, too. Consequently, the 2.0 spec
writers chose to include another pair of power management mechanisms that
allow the hardware to adjust the Link speed and width on the fly. These
allow the Link to use the highest speed and widest possible Link when
performance is needed, or to drop down to a lower speed or narrower Link
width or both to reduce power. There are two clear advantages to this method
compared to changing the Link or Device power state.

First, the Link is always able to communicate regardless of the changes,
with a relatively short interruption in service to make the change. Second,
the power saving can be greater. For example, a x16 Link would almost
certainly use less power operating as an active x1 Link than as a x16 Link
in L0s.

In addition to power conservation, bandwidth reductions can also be used to
resolve reliability problems. For example, it may be that a high-speed Link
produces unacceptable reliability, in which case either Link component is
allowed to remove the offending speed from the list of supported speeds that
it advertises. How a component makes that reliability determination is not
specified. Interestingly, components are also permitted to go into the
Recovery state and advertise a different set of supported speeds without
requesting a speed change in the process.

Changing the Link speed or Link width requires the Link to be retrained.
When the Link is in the L0 state and the speed needs to be changed, the
LTSSM of the port desiring the speed change starts transmitting TS1s to its
neighbor. Doing so results in the two involved ports' LTSSMs going through
the Recovery state, where the Link speed is changed, and then back to L0.

Similarly, the port that desires to change the Link width starts
transmitting TS1s to its neighbor. Doing so results in the two involved
ports' LTSSMs going through the Recovery state and then the Configuration
state, where the Link width is changed. The LTSSM finally returns to L0 with
the new Link width established.

Because the LTSSM is involved in dynamic Link bandwidth management, it makes
sense to discuss its two aspects, dynamic Link speed change and dynamic Link
width change, in the following sections. Let's consider these two options
separately, starting with Link speed changes.

Dynamic Link Speed Changes


By way of review, the LTSSM states are illustrated in Figure 14-45 on page
620 to make it easier to recall the flow of states. Although the Gen1
specification indicated that a speed change was to be performed in the
Polling state, the subsequent Gen2 spec moved this function to the Recovery
state.


Figure 14-45: LTSSM Overview
(States shown: Detect, Polling, Configuration, Recovery, L0, L0s, L1, L2)

During the Polling state, TS1s are exchanged between Link neighbors, and
these contain several kinds of information as shown in Figure 14-46 on page
621. The most interesting part for us here is byte number 4, the Rate
Identifier. Bits 1, 2 and 3 indicate which data rates are available, and the
spec points out that 2.5 GT/s must always be supported, while 5.0 GT/s must
also be supported if 8.0 GT/s is supported.

The meaning of bit 6 depends on whether the Port is facing upstream or
downstream and also on what LTSSM state the Port is in. However, for the
speed change case the options are reduced because it's only meaningful
coming from the Upstream Port and just indicates whether or not the speed
change is an autonomous event. "Autonomous" means that the Port is
requesting this change for its own hardware-specific reasons and not because
of a reliability issue. Bit 7 is used by the Upstream Port to request a
speed change. These values are very similar in the TS2s, although bit 6 has
another meaning now related to autonomous Link width changes that we'll
discuss later.
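
To make the bit assignments concrete, here is a minimal Python sketch that
decodes a Rate Identifier byte and checks the two support rules mentioned
above. The bit positions come from the Rate Identifier bit list; the function
names and the sample value are hypothetical and chosen only for illustration.

```python
# Bit positions within the Rate Identifier (Symbol 4 of a TS1/TS2).
SPEED_BITS = {1: 2.5, 2: 5.0, 3: 8.0}   # bit number -> data rate in GT/s
AUTONOMOUS_CHANGE_BIT = 6
SPEED_CHANGE_BIT = 7


def decode_rate_id(rate_id):
    """Split a Rate Identifier byte into its fields (hypothetical helper)."""
    speeds = [rate for bit, rate in SPEED_BITS.items()
              if rate_id & (1 << bit)]
    return {
        "speeds": speeds,
        "autonomous_change": bool(rate_id & (1 << AUTONOMOUS_CHANGE_BIT)),
        "speed_change": bool(rate_id & (1 << SPEED_CHANGE_BIT)),
    }


def rate_id_is_legal(rate_id):
    """Check the two support rules quoted in the text."""
    speeds = decode_rate_id(rate_id)["speeds"]
    if 2.5 not in speeds:                      # 2.5 GT/s is mandatory
        return False
    if 8.0 in speeds and 5.0 not in speeds:    # 8.0 GT/s requires 5.0 GT/s
        return False
    return True


sample = 0b11000110                            # bits 7, 6, 2 and 1 set
print(decode_rate_id(sample), rate_id_is_legal(sample))
```

The sample byte describes an Upstream Port that supports 2.5 and 5.0 GT/s and
is requesting an autonomous speed change.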


Figure 14-46: TS1 Contents

Symbol    Contents
0         COM
1         Link #
2         Lane #
3         # FTS
4         Rate ID
5         Train Ctl
6-15      TS ID

Rate Identifier (Symbol 4)
Bit 0     Reserved, = 0
Bit 1     Indicates 2.5 GT/s support
Bit 2     Indicates 5.0 GT/s support
Bit 3     Indicates 8.0 GT/s support
Bits 4:5  Reserved, = 0
Bit 6     Autonomous Change / Selectable De-emphasis
Bit 7     Speed Change

Figure 14-47: TS2 Contents

Symbol    Contents
0         COM
1         Link #
2         Lane #
3         # FTS
4         Rate ID
5         Train Ctl
6-15      TS ID

Rate Identifier (Symbol 4)
Bit 0     Reserved, = 0
Bit 1     Indicates 2.5 GT/s support
Bit 2     Indicates 5.0 GT/s support
Bit 3     Indicates 8.0 GT/s support
Bits 4:5  Reserved, = 0
Bit 6     Autonomous Change / Link Upconfigure Capability /
          Selectable De-emphasis
Bit 7     Speed Change


Upstream Port Initiates Speed Change


A speed change must be initiated by the Upstream Port (Port facing
upstream), and is accomplished by transitioning to the Recovery state. The
substates of the Recovery state are shown in Figure 14-48 on page 622, and
the part of interest for this discussion is highlighted by the oval. The
discussion that follows here is a relatively high-level overview of the
entire speed change process and doesn't get into the details of the LTSSM
operation. To learn more about that, refer to the discussion called
"Recovery State" on page 571.

Figure 14-48: Recovery Sub-States
(Substates shown: Recovery.RcvrLock (bit/symbol re-lock),
Recovery.Equalization, Recovery.Speed, Recovery.RcvrCfg, and Recovery.Idle
(send idle data). Entry is from L0, L0s, or L1. Exits lead to L0,
Configuration, Loopback, Disabled, Hot Reset, and Detect.)

Speed Change Example


To illustrate the process, consider the speed change example shown in Figure
14-49 on page 623. Note that the Equalization substate has been removed in
this example to make the diagrams simpler and easier to follow. The example
shows a change from 2.5 GT/s to 5.0 GT/s, and so the Equalization substate
is not used anyway. A change to 8.0 GT/s would go through the same process
but would just add a trip through the Equalization substate at the end of
the process. To learn more about the Equalization process, refer to
"Recovery.Equalization" on page 587.

The Endpoint in this example, which can only have an Upstream Port, is shown
connected to a Root Complex, which can only have Downstream Ports. Only the
Upstream Port can initiate the speed change process, and it does so because
its Directed Speed Change flag was set earlier based on some
hardware-specific conditions. To start the sequence, it changes its LTSSM to
the Recovery state, enters the Recovery.RcvrLock substate and sends TS1s
with the Speed Change bit set and listing the speeds that it will support,
as shown in Figure 14-49 on page 623. When the Downstream Port sees the
incoming TS1s, it also changes to the Recovery state and begins sending TS1s
back. Since the Speed Change bit was set in the incoming TS1s, that will set
the Directed Speed Change flag in the Root Port, and the outgoing TS1s will
also have that bit set. The speed that the Link will attempt to use will be
the highest commonly-supported speed so, if a Device wants to use a lower
speed, it would simply not list the higher speeds as being supported at this
time.
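
The "highest commonly-supported speed" rule amounts to intersecting the two
advertised speed lists and taking the maximum. The Python sketch below shows
that idea only; the function name and the example speed sets are hypothetical.

```python
def negotiated_speed(local_speeds, neighbor_speeds):
    """Return the highest data rate (GT/s) advertised by both Ports.

    Hypothetical helper: each argument is the set of rates a Port lists in
    the Rate ID field of the TS1s and TS2s it sends.
    """
    common = set(local_speeds) & set(neighbor_speeds)
    if not common:
        raise ValueError("no commonly-supported speed")
    return max(common)


# The Endpoint could run at 8.0 GT/s but chooses not to list it this time.
endpoint = {2.5, 5.0}
root_port = {2.5, 5.0, 8.0}
print(negotiated_speed(endpoint, root_port))
```

Leaving 8.0 GT/s out of the Endpoint's list is enough to hold the Link at
5.0 GT/s, with no separate request mechanism needed.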

Figure 14-49: Speed Change - Initiated
(Diagram: the Root Complex and the PCIe Endpoint, both in
Recovery.RcvrLock, exchange TS1s with Speed_Change = 1. Link Speed =
2.5 GT/s.)

When the Upstream Port detects the TS1s coming back, its state machine
changes to the Recovery.RcvrCfg substate and it begins to send TS2s that
still have the Speed Change bit set, as illustrated in Figure 14-50 on page
624. These TS2s will now also have the Autonomous Change bit set if this
change was not caused by a reliability problem on the Link. When the
Downstream Port sees incoming TS2s, it also changes to the Recovery.RcvrCfg
substate and returns TS2s with the Speed Change bit set. However, the
Autonomous Change bit is reserved in the TS2s for Downstream Ports during
Recovery.

Figure 14-50: Speed Change - Part 2
(Diagram: the Root Complex and the PCIe Endpoint, both in Recovery.RcvrCfg
with Directed Speed Change = 1, exchange TS2s with Speed_Change = 1 and
Autonomous Change = 1. Link Speed = 2.5 GT/s.)

Once each Port has seen 8 consecutive TS2s with the Speed Change bit set,
they know that the next step will be to go to the Recovery.Speed substate,
as shown in Figure 14-51 on page 625. At this point, the Downstream Port
needs to register the setting of the Autonomous Change bit in the incoming
TS2s. To support this, some extra fields have been added to the PCIe
Capability registers.

The status bits for Link bandwidth changes are found in the Link Status
register, shown in Figure 14-52 on page 625. Status changes can also be used
to generate an interrupt to notify software of these events if the device is
capable and has been enabled to do so. This capability is reported by the
Link Bandwidth Notification Capable bit, shown in Figure 14-53 on page 626,
and enabled by the Interrupt Enable bits in the Link Control register, as
shown in Figure 14-54 on page 626. Note that there are two cases: autonomous
and bandwidth management. Autonomous means the change was not caused by a
reliability problem, while bandwidth management means it was.

Figure 14-51: Speed Change - Part 3
(Diagram: both Ports, with Directed Speed Change = 0, finish their TS2s,
send EIOS and head for Recovery.Speed. Link Speed = 2.5 GT/s. The
Autonomous Change = 1 setting in the received TS2s is recorded in the Root
Complex configuration space as Link Autonomous Bandwidth Status bit = 1.)

Figure 14-52: Bandwidth Change Status Bits


Figure 14-53: Bandwidth Notification Capability

Figure 14-54: Bandwidth Change Notification Bits


Once the Recovery.Speed substate is reached, the Link is placed into the
Electrical Idle condition in both directions and the speed is changed
internally. The speed chosen will be the highest commonly-supported speed
reported in the Rate ID field of the TS1s and TS2s. In this example, that
turns out to be 5.0 GT/s, and so the change is made to that speed. After a
timeout period, the Link neighbors both transition back to
Recovery.RcvrLock and exit Electrical Idle by sending TS1s again, as shown
in Figure 14-55 on page 627. When the Upstream Port sees them coming back,
it transitions to Recovery.RcvrCfg and begins sending TS2s, much like
before. This time, though, the Speed Change bit is not set. Eventually TS2s
are seen coming back from the Downstream Port that also don't have the
Speed Change bit set, and at that point the state machines transition to
Recovery.Idle on their way back to L0.

If a speed change fails for some reason, a component is not allowed to try
that speed or a higher one for at least 200 ms after returning to L0 or
until the Link neighbor advertises support for a higher speed, whichever
comes first.

Figure 14-55: Speed Change - Finish
(Diagram: at the new Link Speed = 5.0 GT/s, both Ports exchange TS1s and
then TS2s with Speed_Change = 0, pass through Recovery.RcvrLock and
Recovery.RcvrCfg with Directed Speed Change = 0, and exit to L0.)

Software Control of Speed Changes


Software is unable to control when hardware makes decisions about changing
the speed but can limit or disable this capability. Limiting it is
accomplished by setting the Target Link Speed value in the Link Control 2
Register shown in Figure 14-56 on page 628. This acts as the upper bound on
the speeds available to the Upstream Port, which will try to maintain that
value or the highest speed supported by both Link neighbors, whichever is
lower. Software can also force a particular speed to be used by setting the
Target Link Speed in the Upstream component and then setting the Retrain
Link bit in the Link Control register, shown in Figure 14-57 on page 629. As
mentioned earlier, software is notified of any hardware-based Link speed or
width changes by the Link Bandwidth Notification Mechanism. Finally, the
speed change mechanism can be disabled by setting the Hardware Autonomous
Speed Disable bit.
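
As a rough illustration of these three software controls, the Python sketch
below computes the new 16-bit register values software would write. It makes
several assumptions that should be checked against the register figures and
the spec: Retrain Link is taken as bit 5 of Link Control, Target Link Speed as
bits 3:0 of Link Control 2 using the same encodings as Max Link Speed, and
Hardware Autonomous Speed Disable as bit 5 of Link Control 2. The helper names
are hypothetical, and no real configuration-space access is performed.

```python
# Assumed field positions (verify against the register figures and spec).
RETRAIN_LINK = 1 << 5                     # Link Control, bit 5
TARGET_LINK_SPEED_MASK = 0x000F           # Link Control 2, bits 3:0
HW_AUTONOMOUS_SPEED_DISABLE = 1 << 5      # Link Control 2, bit 5

# Assumed to match the Max Link Speed encodings (0001b = 2.5 GT/s, ...).
SPEED_ENCODING = {2.5: 0b0001, 5.0: 0b0010, 8.0: 0b0011}


def set_target_speed(link_control2, speed_gts):
    """Return Link Control 2 with Target Link Speed replaced."""
    cleared = link_control2 & ~TARGET_LINK_SPEED_MASK & 0xFFFF
    return cleared | SPEED_ENCODING[speed_gts]


def request_retrain(link_control):
    """Return Link Control with the Retrain Link bit set."""
    return link_control | RETRAIN_LINK


def disable_autonomous_speed(link_control2):
    """Return Link Control 2 with Hardware Autonomous Speed Disable set."""
    return link_control2 | HW_AUTONOMOUS_SPEED_DISABLE


# Force 5.0 GT/s: write Target Link Speed first, then request retraining.
new_lc2 = set_target_speed(0x0003, 5.0)
new_lc = request_retrain(0x0040)
print(hex(new_lc2), hex(new_lc))
```

The ordering matters in practice: the target speed is written before Retrain
Link so that the retraining uses the new value.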

Figure 14-56: Link Control 2 Register


Figure 14-57: Link Control Register

Dynamic Link Width Changes


The same basic operation for changing the Link speed can also be used to
change the Link width, although the sequence is a little more complicated
because more LTSSM steps are involved. One thing that's important for
software to note before enabling Link width changes is whether the Link
neighbor supports recovering from a narrow Link back to a wide Link (called
Upconfiguring the Link). Devices report this ability in bit 6 of the Rate ID
field of the TS2s they send during training, as shown in Figure 14-58 on
page 630. If a component doesn't support this, that would mean that changing
to a narrower Link width would be a one-way event and would only be suitable
for the case of a reliability problem on the Link.


Figure 14-58: TS2 Contents

Symbol    Contents
0         COM
1         Link #
2         Lane #
3         # FTS
4         Rate ID
5         Train Ctl
6-15      TS ID

Rate Identifier (Symbol 4)
Bit 0     Reserved, = 0
Bit 1     Indicates 2.5 GT/s support
Bit 2     Indicates 5.0 GT/s support
Bit 3     Indicates 8.0 GT/s support
Bits 4:5  Reserved, = 0
Bit 6     Autonomous Change / Link Upconfigure Capability /
          Selectable De-emphasis
Bit 7     Speed Change

Link Width Change Example


Consider the example in Figure 14-59 on page 631 of a Root Port connected to
an Endpoint (Gigabit Ethernet Device). Only the Upstream Port will initiate
this change, and it begins by going to the Recovery state as before. This
time, though, the Speed Change bit is not set. To sort out what the new Link
width will be, the Upstream Port will need to tell the Downstream Port to
transition from the Recovery state to the Configuration state before going
back to L0, as shown in Figure 14-60 on page 631. There are several
substates in the Configuration state, and a simplified version of them is
shown in Figure 14-61 on page 632. We'll go through the sequence to be clear
on how the steps work.


Figure 14-59: Link Width Change Example
(Diagram: a Root Complex connected to a Gigabit Ethernet Device by a Link
with Lanes 0 through 3.)

Figure 14-60: Link Width Change LTSSM Sequence
(States shown: Detect, Polling, Configuration, Recovery, L0, L0s, L1, L2)


Figure 14-61: Simplified Configuration Substates

Entry from Polling or Recovery
  1. Config.Linkwidth.Start
  2. Config.Linkwidth.Accept
  3. Config.Lanenum.Wait
  4. Config.Lanenum.Accept
  5. Config.Complete
  6. Config.Idle
Exit to L0

As before, the Upstream Port initiates this process by going to Recovery and
sending TS1s. These don't have the Speed Change bit set, as highlighted in
the example shown in Figure 14-59 on page 631, where an Ethernet Device
initiates this process on its Upstream Port. In response, the Downstream
Port sends TS1s back, also with the Speed Change bit cleared. Link and Lane
numbers are still shown as being unchanged from the last time the Link was
trained. Referring back to Figure 14-48 on page 622, the next state is
Recovery.RcvrCfg, during which the Link partners exchange TS2s.


Figure 14-62: Link Width Change - Start
(Diagram: on each of Lanes 0 through 3, the Root Complex and the Gigabit
Ethernet Device exchange TS1s carrying Link:0 and that Lane's own number,
Lane:0 through Lane:3, with Speed Change = 0.)

Since a speed change is not requested, the next state is Recovery.Idle. In
that state the Ports normally send the logical idle symbols (all zeros), and
the Downstream Port does so, as shown in Figure 14-63 on page 634. However,
the Upstream Port was directed to change the Link width, so it doesn't send
the expected Idle symbols. Instead, it sends TS1s with PAD for both the Link
and Lane numbers. The Downstream Port recognizes that a previously
configured Lane now has a Lane number of PAD, and that causes it to
transition to the first Configuration substate: Config.Linkwidth.Start.


Figure 14-63: Link Width Change - Recovery.Idle
(Diagram: on Lanes 0 through 3 the Root Complex sends Idle Data while the
Gigabit Ethernet Device sends TS1s with Link:PAD, Lane:PAD and Speed
Change = 0.)

The Downstream Port now initiates the next step by sending TS1s that have
the originally negotiated Link number but PAD on all the Lane numbers, as
illustrated in Figure 14-64 on page 635. The Upstream Port responds with
matching TS1s on the Lanes it wants to have active, but with PAD for both
Link and Lane numbers on the Lanes it wishes to have inactive. When the
Downstream Port sees this response, it transitions to the
Config.Linkwidth.Accept substate. Note that the Autonomous Change bit is set
for these TS1s.


Figure 14-64: Marking Active Lanes
(Diagram: the Root Complex sends TS1s with Link:0, Lane:PAD on Lanes 0
through 3. The Gigabit Ethernet Device answers with Link:0, Lane:PAD on
Lane 0, whose desired state is Active, and with Link:PAD, Lane:PAD on Lanes
1 through 3, whose desired state is Inactive. Autonomous Change = 1.)

The Root Port responds by changing its TS1s to show Lane numbers that are
appropriate for the active Lanes, but PAD for the Link and Lane numbers of
all the Lanes that were seen to be inactive. The Upstream Port responds with
the same TS1s, as shown in Figure 14-65 on page 636, and the state changes
to Config.Lanenum.Accept. At this point, the Root Port updates the status
bit to show that an autonomous change was detected and changes to the
Config.Complete substate.


Figure 14-65: Response to Lane Number Changes
(Diagram: on Lane 0, desired state Active, both Ports send TS1s with
Link:0, Lane:0. On Lanes 1 through 3, desired state Inactive, both Ports
send TS1s with Link:PAD, Lane:PAD. Autonomous Change = 1.)

In the next step, the Root Port begins to send TS2s on the active Lanes and
puts the inactive Lanes into Electrical Idle. Recall that the TS2s report
whether a component is upconfigure-capable, and in this example both Link
partners support this capability. The Endpoint sends back the same thing:
TS2s on active Lanes and Electrical Idle on inactive Lanes. Seeing that, the
Root Port's state machine changes to Config.Idle and it begins to send
Logical Idle on the active Lanes. The Endpoint responds with the same thing
and the Link state changes back to L0. The Link is now ready for normal
operation, albeit with a reduced bandwidth for power conservation.
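
The Link and Lane number fields carried on each Lane during this walkthrough
can be tabulated with a short Python model. It is a simplified sketch, not an
LTSSM implementation: the step labels, the tuple layout, and the assumption
that the surviving Lanes are numbered from Lane 0 upward are all illustrative
choices rather than anything defined by the spec.

```python
PAD = "PAD"
IDLE = "Logical Idle"
EIDLE = "Electrical Idle"


def width_change_steps(link, total_lanes, new_width):
    """List what the Downstream and Upstream Ports send on each Lane.

    Returns (label, downstream, upstream) tuples, where each of the last two
    maps a Lane number to either a string or an (ordered set, link, lane)
    tuple. Assumes the active Lanes are 0 .. new_width - 1.
    """
    lanes = range(total_lanes)
    active = set(range(new_width))

    def per_lane(active_item, inactive_item):
        # active_item is a function of the Lane number.
        return {n: (active_item(n) if n in active else inactive_item)
                for n in lanes}

    pad_ts1 = ("TS1", PAD, PAD)
    marked = per_lane(lambda n: ("TS1", link, PAD), pad_ts1)
    numbered = per_lane(lambda n: ("TS1", link, n), pad_ts1)
    final = per_lane(lambda n: ("TS2", link, n), EIDLE)
    return [
        ("Recovery.Idle",
         {n: IDLE for n in lanes}, {n: pad_ts1 for n in lanes}),
        ("Mark active Lanes",
         {n: ("TS1", link, PAD) for n in lanes}, marked),
        ("Assign Lane numbers", numbered, numbered),
        ("Finish", final, final),
    ]


for label, down, up in width_change_steps(link=0, total_lanes=4, new_width=1):
    print(label)
    for lane in sorted(down):
        print(f"  Lane {lane}: down={down[lane]}  up={up[lane]}")
```

Running it for a Link dropping from four Lanes to one reproduces the pattern
in the example: only Lane 0 keeps real Link and Lane numbers, and Lanes 1
through 3 end in Electrical Idle.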


Figure 14-66: Link Width Change - Finish
(Diagram: on Lane 0, desired state Active, both Ports send TS2s with
Link:0, Lane:0 and Upconfigure Capability = 1. Lanes 1 through 3, desired
state Inactive, are in Electrical Idle.)

As was the case for dynamic speed changes, software can't initiate Link
width changes, but it can disable this mechanism by setting the Hardware
Autonomous Width Disable bit in the Link Control register shown in Figure
14-67 on page 638. Unlike the speed change case, no software mechanism was
defined to allow setting a particular Link width.


Figure 14-67: Link Control Register

Related Configuration Registers


Many of the configuration registers that are relevant to Link Initialization
and Training have been shown when their contents were described earlier, but
it seems good to summarize them here.

Link Capabilities Register


The Link Capabilities Register is pictured in Figure 14-68 on page 639, and
each bit field is described in the subsections that follow.


Figure 14-68: Link Capabilities Register

Bits     Field
31:24    Port Number
23       RsvdP
22       ASPM Optionality Compliance
21       Link Bandwidth Notification Capability
20       Data Link Layer Link Active Reporting Capable
19       Surprise Down Error Reporting Capable
18       Clock Power Management
17:15    L1 Exit Latency
14:12    L0s Exit Latency
11:10    Active State Link PM Support
9:4      Maximum Link Width
3:0      Max Link Speed

Max Link Speed [3:0]


This indicates the maximum Link speed for this port, and is given as a
pointer to a bit location in the Link Capabilities 2 register Supported Link
Speeds Vector that corresponds to the max Link speed. Defined encodings are:

- 0001b: Supported Link Speeds Vector field bit 0
- 0010b: Supported Link Speeds Vector field bit 1
- 0011b: Supported Link Speeds Vector field bit 2
- 0100b: Supported Link Speeds Vector field bit 3
- 0101b: Supported Link Speeds Vector field bit 4
- 0110b: Supported Link Speeds Vector field bit 5
- 0111b: Supported Link Speeds Vector field bit 6

All other encodings are reserved. Multi-function devices sharing an Upstream
Port must report the same value in this field in all Functions. This
register is Read-Only.


Maximum Link Width[9:4]


This field indicates the maximum width of the PCI Express Link. The values
that are defined are:

- 000000b: Reserved
- 000001b: x1
- 000010b: x2
- 000100b: x4
- 001000b: x8
- 001100b: x12
- 010000b: x16
- 100000b: x32

All other encodings are reserved. Multi-function devices sharing an Upstream
Port must report the same value in this field in all Functions. This
register is Read-Only.
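
Because each defined encoding is simply the Lane count in binary, decoding
this field takes a shift, a mask, and a check against the defined set. The
Python sketch below assumes a 32-bit Link Capabilities value has already been
read by some other means; the function name is hypothetical.

```python
# Each defined encoding equals the Lane count (000001b = x1, 001100b = x12).
VALID_WIDTHS = {1, 2, 4, 8, 12, 16, 32}


def decode_max_link_width(link_capabilities):
    """Extract Maximum Link Width (bits 9:4) from a 32-bit register value."""
    field = (link_capabilities >> 4) & 0x3F
    if field not in VALID_WIDTHS:
        raise ValueError(f"reserved width encoding {field:06b}b")
    return field


print(decode_max_link_width(0x00000103))   # bits 9:4 = 010000b
```

Reserved encodings raise an error rather than being reported as a width.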

Link Capabilities 2 Register


The Link Capabilities 2 Register is pictured in Figure 14-69 on page 640 and
shows the Supported Link Speeds Vector to which the Max Link Speed field in
the Link Capabilities register points. The values for this field are:

- Bit 0 = 2.5 GT/s
- Bit 1 = 5.0 GT/s
- Bit 2 = 8.0 GT/s
- Bits 6:3 = RsvdP (reserved and preserved)
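
The pointer relationship between the two registers can be sketched as follows.
This Python example assumes the Supported Link Speeds Vector occupies register
bits 7:1 of Link Capabilities 2, as the register layout suggests, so vector
bit n is register bit n + 1. The function name and sample values are
hypothetical.

```python
VECTOR_BIT_TO_GTS = {0: 2.5, 1: 5.0, 2: 8.0}    # vector bits 6:3 are RsvdP


def max_link_speed(link_capabilities, link_capabilities2):
    """Resolve Max Link Speed through the Supported Link Speeds Vector."""
    encoding = link_capabilities & 0xF            # Max Link Speed, bits 3:0
    if not 1 <= encoding <= 7:
        raise ValueError("reserved Max Link Speed encoding")
    vector_bit = encoding - 1                     # 0001b -> vector bit 0
    if vector_bit not in VECTOR_BIT_TO_GTS:
        raise ValueError("encoding points at a reserved vector bit")
    vector = (link_capabilities2 >> 1) & 0x7F     # vector in register bits 7:1
    if not vector & (1 << vector_bit):
        raise ValueError("encoding points at a speed that is not supported")
    return VECTOR_BIT_TO_GTS[vector_bit]


# Max Link Speed = 0011b; the vector advertises 2.5, 5.0 and 8.0 GT/s.
print(max_link_speed(0x00000003, 0x0000000E))
```

Checking that the pointed-to vector bit is actually set is an extra
consistency check added here for illustration.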

Figure 14-69: Link Capabilities 2 Register

Bits     Field
31:9     RsvdP
8        Crosslink Supported
7:1      Supported Link Speeds Vector
0        RsvdP


Link Status Register


The Link Status Register is pictured in Figure 14-39 on page 597.

Current Link Speed[3:0]:


This read-only field indicates the current Link speed. The speed will always
be 2.5 GT/s when the Link first trains to L0. After that, if a higher
commonly-supported speed is available, the LTSSM will go to Recovery and
attempt to change to that speed. The values in this field are the same as
the Max Link Speed encodings shown in the Link Capabilities register:

- 0001b: Supported Link Speeds Vector field bit 0
- 0010b: Supported Link Speeds Vector field bit 1
- 0011b: Supported Link Speeds Vector field bit 2
- 0100b: Supported Link Speeds Vector field bit 3
- 0101b: Supported Link Speeds Vector field bit 4
- 0110b: Supported Link Speeds Vector field bit 5
- 0111b: Supported Link Speeds Vector field bit 6

All other encodings are reserved.

Note that the value of this field is undefined when the Link is not up
(LinkUp = 0b).

Negotiated Link Width[9:4]


This field indicates the result of Link width negotiation. There are seven
possible widths, with the following defined encodings:

- 000001b for x1
- 000010b for x2
- 000100b for x4
- 001000b for x8
- 001100b for x12
- 010000b for x16
- 100000b for x32

All other encodings are reserved. Note that the value of this field is
undefined when the Link is not up (LinkUp = 0b).


Undefined[10]
Currently undefined, this bit was previously set by hardware in earlier spec
versions when a Link Training Error had occurred. It was cleared when the
LTSSM successfully entered L0. The spec states that software can write any
value to this bit but must ignore any value read from it.

Link Training[11]
This read-only bit indicates that the LTSSM is in the process of training.
Technically, it means the LTSSM is either in the Configuration or Recovery
state, or that the Retrain Link bit has been written to 1b but Link training
has not yet begun. This bit is cleared by hardware when the LTSSM exits the
Configuration or Recovery state. Since this must be visible to software
while Link Training is in progress, it only has meaning for Ports that are
facing downstream. Consequently, this bit is not applicable and reserved for
Endpoints, bridge Upstream Ports and Switch Upstream Ports. For them, this
bit must be hardwired to 0b.
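
Pulling these fields out of a raw 16-bit Link Status value is a matter of
shifts and masks. The Python sketch below uses the bit positions from the
Link Status Register layout (speed in bits 3:0, width in bits 9:4, Link
Training at bit 11, and the two bandwidth status bits at 14 and 15). The
function name, dictionary keys, and sample value are hypothetical, and bit 10
is deliberately ignored because software must ignore any value read from it.

```python
def decode_link_status(link_status):
    """Break a 16-bit Link Status value into named fields."""
    return {
        # Encoding only; map it through the Supported Link Speeds Vector.
        "current_link_speed": link_status & 0xF,
        # Defined encodings equal the Lane count (for example 000100b = x4).
        "negotiated_link_width": (link_status >> 4) & 0x3F,
        "link_training": bool(link_status & (1 << 11)),
        "slot_clock_configuration": bool(link_status & (1 << 12)),
        "data_link_layer_link_active": bool(link_status & (1 << 13)),
        # Set when a change was caused by a reliability problem.
        "link_bandwidth_management_status": bool(link_status & (1 << 14)),
        # Set when a change was autonomous (not reliability related).
        "link_autonomous_bandwidth_status": bool(link_status & (1 << 15)),
    }


status = decode_link_status(0x8012)
print(status["current_link_speed"], status["negotiated_link_width"],
      status["link_autonomous_bandwidth_status"])
```

The sample value describes a x1 Link at speed encoding 0010b after an
autonomous bandwidth change, which is the end state of the width change
example earlier in the chapter.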

Figure 14-70: Link Status Register

Bits     Field
15       Link Autonomous Bandwidth Status
14       Link Bandwidth Management Status
13       Data Link Layer Link Active
12       Slot Clock Configuration
11       Link Training
10       Undefined
9:4      Negotiated Link Width
3:0      Current Link Speed

Link Control Register


The Link Control Register is pictured in Figure 14-71 on page 644, and there
are three fields in it that are interesting for us here.


Link Disable
When set to one, the Link is disabled. Intuitively, this bit isn't
applicable and is reserved for Endpoints, bridge Upstream Ports, and Switch
Upstream Ports because it must be accessible by software even when the Link
is disabled. When this bit is written, any read immediately reflects the
value written, regardless of the Link state. After clearing this bit,
software must be careful to honor the timing requirements regarding the
first Configuration Read after a Conventional Reset (see "Reset Exit" on
page 846).

Retrain Link
This bit allows software to initiate Link retraining whenever it is deemed
necessary, as for error recovery. The bit is not applicable to and is
reserved for Endpoint devices and Upstream Ports of Bridges and Switches.
When set to 1b, this directs the LTSSM to the Recovery state before the
completion of the Configuration write Request is returned.

Extended Synch
As it affects training, this bit is used to greatly extend the time spent in
two situations, for the purpose of assisting slower external test or
analysis hardware to synchronize with the Link before it resumes normal
communication. One of these is when exiting L0s, where setting this bit
forces the transmission of 4096 FTSs prior to entering L0. The other case is
in the Recovery state prior to entering Recovery.RcvrCfg, where it forces
the transmission of 1024 TS1s.


Figure 14-71: Link Control Register

Bits     Field
15:12    RsvdP
11       Link Autonomous Bandwidth Interrupt Enable
10       Link Bandwidth Management Interrupt Enable
9        Hardware Autonomous Width Disable
8        Enable Clock Power Management
7        Extended Synch
6        Common Clock Configuration
5        Retrain Link
4        Link Disable
3        Read Completion Boundary Control
2        RsvdP
1:0      Active State PM Control

Part Five: Additional System Topics

15 Error Detection and Handling
The Previous Chapter
The previous chapter describes the operation of the Link Training and Status
State Machine (LTSSM) of the Physical Layer. The initialization process of
the Link is described from Power-On or Reset until the Link reaches the
fully operational L0 state, during which normal packet traffic occurs. In
addition, the Link power management states L0s, L1, L2, and L3 are discussed
along with the state transitions. The Recovery state, during which bit lock,
symbol lock or block lock are re-established, is described. Link speed and
width change for Link bandwidth management is also discussed.

This Chapter
Although care is always taken to minimize errors, they can't be eliminated,
so detecting and reporting them is an important consideration. This chapter
discusses error types that occur in a PCIe Port or Link, how they are
detected and reported, and options for handling them. Since PCIe is designed
to be backward compatible with PCI error reporting, a review of the PCI
approach to error handling is included as background information. Then we
focus on PCIe error handling of correctable, non-fatal and fatal errors.

The Next Chapter


The next chapter provides an overall context for the discussion of system
power management and a detailed description of PCIe power management, which
is compatible with the PCI Bus PM Interface Spec and the Advanced
Configuration and Power Interface (ACPI). PCIe defines extensions to the
PCI-PM spec that focus primarily on Link Power and event management.


Background
Software backward compatibility with PCI is an important feature of PCIe,
and that's accomplished by retaining the PCI configuration registers that
were already in place. PCI verified the correct parity on each transmission
phase of the bus to check for errors. Detected errors were recorded in the
Status register and could optionally be reported with either of two
side-band signals: PERR# (Parity Error) for a potentially recoverable parity
fault during data transmission, and SERR# (System Error) for a more serious
problem that was usually not recoverable. These two types can be categorized
as follows:

- Ordinary data parity errors - reported via PERR#
- Data parity errors during multi-task transactions (special cycles) -
  reported via SERR#
- Address and command parity errors - reported via SERR#
- Other types of errors (device-specific) - reported via SERR#

How the errors should be handled was outside the scope of the PCI spec and
might include hardware support or device-specific software. As an example, a
data parity error on a read from memory might be recovered in hardware by
detecting the condition and simply repeating the Request. That would be a
safe step if the memory contents weren't changed by the failed operation.

As shown in Figure 15-1 on page 649, both error pins were typically
connected to the chipset and used to signal the CPU in a consumer PC. These
machines were very cost sensitive, so they didn't usually have the budget
for much in the way of error handling. Consequently, the resulting error
reporting signal chosen was the NMI (Non-Maskable Interrupt) signal from the
chipset to the processor that indicated significant system trouble requiring
immediate attention. Most consumer PCs didn't include an error handler for
this condition, so the system would simply be stopped to avoid corruption
and the BSOD (Blue Screen Of Death) would inform the operator. An example of
an SERR# condition would be an address parity mismatch seen during the
command phase of a transaction. This is a potentially destructive case
because the wrong target might respond. If that happened and SERR# reported
it, recovery would be difficult and would probably require significant
software overhead. (To learn more about PCI error handling, refer to
MindShare's book PCI System Architecture.)

PCI-X uses the same two error reporting signals but defines specific error
handling requirements depending on whether device-specific error handling
software is present. If such a handler is not present, then all parity errors are
reported with SERR#.


Chapter 15: Error Detection and Handling

Figure 15-1: PCI Error Handling

[Figure: block diagram of a PCI-based PC (Processor, North Bridge, South
Bridge, 33 MHz PCI bus with slots and devices). PERR# and SERR# feed the
Error Logic, and NMI is driven to the Processor.]

PCI-X 2.0 uses source-synchronous clocking to achieve faster data rates (up to
4 GB/s). This bus targeted high-end enterprise systems because it was generally
too expensive for consumer machines. Since these high-performance systems
also require high availability, the spec writers chose to improve the error
handling by adding Error-Correcting Code (ECC) support. ECC allows more robust
error detection and enables correction of single-bit errors on the fly. ECC is very
helpful in minimizing the impact of transmission errors. (To learn more about
PCI-X error handling, see MindShare's book PCI-X System Architecture.)

PCIe maintains backward compatibility with these legacy mechanisms by using
the error status bits in the legacy configuration registers to record error events
in PCIe that are analogous to those of PCI. That lets legacy software see PCIe
error events in terms that it understands, and allows it to operate with PCIe
hardware. See "PCI-Compatible Error Reporting Mechanisms" on page 674 for
the details of these registers.


PCIe Error Definitions


The spec uses four general terms regarding errors, defined here:

1. Error Detection: the process of determining that an error exists. Errors are
   discovered by an agent as a result of a local problem, such as receiving a
   bad packet, or because it received a packet signaling an error from another
   device (like a poisoned packet).
2. Error Logging: setting the appropriate bits in the architected registers
   based on the error detected as an aid for error-handling software.
3. Error Reporting: notifying the system that an error condition exists. This
   can take the form of an error Message being delivered to the Root Complex,
   assuming the device is enabled to send error messages. The Root, in turn,
   can send an interrupt to the system when it receives an error Message.
4. Error Signaling: the process of one agent notifying another of an error
   condition by sending an error Message, or sending a Completion with a UR
   (Unsupported Request) or CA (Completer Abort) status, or poisoning a TLP
   (also known as error forwarding).

PCIe Error Reporting


Two error reporting levels are defined for PCIe. The first is a Baseline capability
required for all devices. This includes support for legacy error reporting as well
as basic support for reporting PCIe errors. The second is an optional Advanced
Error Reporting Capability that adds a new set of configuration registers and
tracks many more details about which errors have occurred, how serious they
are and, in some cases, can even record information about the packet that caused
the error.

Baseline Error Reporting


Two sets of configuration registers are required in all devices in support of
Baseline error reporting. These are described in detail in "Baseline Error
Detection and Handling" on page 674 and are summarized here:

• PCI-compatible Registers: these are the same registers used by PCI and
  provide backward compatibility for existing PCI-compatible software. To
  make this work, PCIe errors are mapped to PCI-compatible errors, making
  them visible to the legacy software.
• PCI Express Capability Registers: these registers will only be useful to
  newer software that is aware of PCIe, but they provide more error
  information specifically for PCIe software.

Advanced Error Reporting (AER)


This optional error reporting mechanism includes a new and dedicated set of
configuration registers that give error handling software more information to
work with in diagnosing and recovering from problems. The AER registers are
mapped into the extended configuration space and provide much more
information about the nature of any errors. See "Advanced Error Reporting (AER)"
on page 685 for a detailed description of these registers.

Error Classes
Errors fall into two general categories based on whether hardware is able to fix
the problem or not: Correctable and Uncorrectable. The Uncorrectable category
is further subdivided based on whether software can fix the problem: Non-fatal
and Fatal.

• Correctable errors: automatically handled by hardware
• Uncorrectable errors
  - Non-fatal: handled by device-specific software; Link is still operational
    and recovery without data loss may be possible
  - Fatal: handled by system software; Link or Device is not working
    properly and recovery without data loss is unlikely

Based on these classes, error handling software can be partitioned into separate
handlers to perform the actions required. Such actions might range from simply
monitoring the frequency of Correctable errors to resetting the entire system in
the event of a Fatal error. Regardless of the type of error, software may arrange
for the system to be notified of all errors to allow tracking and logging them.
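
The partitioning of handlers by error class can be sketched in a few lines of Python. This is only an illustration: the names Severity and handle_error and the action strings are hypothetical, and real handlers would be part of the OS or a device driver.

```python
from enum import Enum

class Severity(Enum):
    CORRECTABLE = "correctable"
    NONFATAL = "non-fatal"
    FATAL = "fatal"

def handle_error(severity, stats):
    """Count the error and return the (hypothetical) action for its class."""
    stats[severity] = stats.get(severity, 0) + 1   # track every error for trends
    if severity is Severity.CORRECTABLE:
        return "log only"                 # hardware has already fixed it
    if severity is Severity.NONFATAL:
        return "notify device driver"     # Link still works; driver may recover
    return "reset link or device"         # Fatal: system software steps in

stats = {}
actions = [handle_error(s, stats) for s in
           (Severity.CORRECTABLE, Severity.CORRECTABLE, Severity.FATAL)]
print(actions)
```

Keeping a counter per class is what lets software watch the frequency of Correctable errors even though no recovery action is needed for them.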

Correctable Errors
Correctable errors are, by definition, automatically corrected in hardware. They
may impact performance by adding latency and consuming bandwidth, but if
all goes well, recovery is automatic and fast because it doesn't depend on
software intervention, and no information is lost in the process. These errors aren't
required to be reported to software, but doing so could allow software to track
error trends that might indicate that some devices are showing signs of
imminent failure.

Uncorrectable Errors
Errors that can't be automatically corrected in hardware are called
Uncorrectable, and these are either Non-fatal or Fatal in severity.

Non-fatal Uncorrectable Errors


Non-fatal errors indicate that information has been lost but the cause was likely
something other than the integrity of a Link or Device. A packet failed
somewhere, but the Link continues to function correctly and other packets are
unaffected. Since the Link is still working, recovery of the lost information may be
possible, but will depend on implementation-specific software to handle it. An
example of this error type would be a Completion timeout, in which a Request
was sent but no Completion was returned within the allowed time. Somewhere
there was an issue, but it could be something as simple as a random bit error
within a Switch that caused the Completion to be routed incorrectly. An attempt
at recovery for this case could be as simple as reissuing the Request.

Fatal Uncorrectable Errors


Fatal errors indicate that a Link or Device has had an operational failure,
causing data loss that is unlikely to be recovered. For these cases, resetting at least
the failed Link or Device will probably be the first step in any recovery process
because it's clearly not operational for some reason. The spec also invites
implementation-specific approaches, in which software may attempt to limit the
effects of the failure, but it doesn't define any particular actions that should be
taken. An example of this type of error would be a receiver buffer overflow, in
which case information has been lost because flow control tracking counters
have gotten out of sync with each other. Since there's no mechanism to fix this, a
reset of this Link will usually be required.

PCIe Error Checking Mechanisms


The scope of PCIe error checking focuses on errors associated with the Link and
packet delivery, as shown in Figure 15-2 on page 653. Errors that don't pertain
to Link transmission are not reported through PCIe error handling mechanisms
and would need proprietary methods to report them, such as device-specific
interrupts. Each layer of the interface includes error checking capabilities, and
these are summarized in the sections that follow.

Figure 15-2: Scope of PCI Express Error Checking and Reporting

[Figure: PCIe Device A and PCIe Device B, each shown as a Device Core above a
PCIe Core (Hardware/Software Interface, Transaction Layer, Data Link Layer,
Physical Layer with RX and TX), joined by a Link. The label "Scope of PCIe
Error Reporting" marks the PCIe Cores and the Link.]

CRC
Before diving into error handling as it relates to the layers, it will help to first
discuss the concept of CRC (Cyclic Redundancy Check) because it's an integral
part of PCIe error checking. The transmitter calculates a CRC code based on the
contents of the packet and adds it to the packet for transmission. The
CRC name is derived from the fact that this check code (calculated from the
packet to check for errors) is redundant (adds no information to the packet), and
is derived from cyclic codes. Although a CRC doesn't supply enough
information to do automatic error correction the way ECC (Error-Correcting Code) can,
it does provide robust error detection. CRCs are also commonly used in serial
transports because they're good at detecting a string of incorrect bits.
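
The generate-and-check idea can be tried out with the CRC-32 routine in Python's standard zlib module. This is a sketch only: zlib's CRC-32 is a stand-in and is not claimed to be bit-identical to the PCIe LCRC or ECRC calculation, and the helper names append_crc and check_crc are hypothetical.

```python
import zlib

def append_crc(payload: bytes) -> bytes:
    """Transmitter side: compute the check code and add it to the packet."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_crc(packet: bytes) -> bool:
    """Receiver side: recompute the code and compare with the one received."""
    payload, received = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received

packet = append_crc(b"example TLP bytes")
corrupted = bytearray(packet)
corrupted[3] ^= 0x1C          # flip a short burst of three adjacent bits

print(check_crc(packet))            # intact packet
print(check_crc(bytes(corrupted)))  # packet with the burst error
```

The receiver learns that something is wrong but not which bits changed, which is the difference from ECC noted above.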


CRCs have two different usage cases in PCIe. One is the mandatory LCRC (Link
CRC) generated and checked in the Data Link Layer for every TLP that goes
across a Link. It's intended to detect transmission errors on the Link.

The second is the optional ECRC (End-to-end CRC) that's generated in the
Transaction Layer of the sender and checked in the Transaction Layer of the
ultimate target of the packet. This is intended to detect errors that might otherwise
be silent, such as when a TLP passes through an intermediate agent like a
Switch, as shown in Figure 15-3 on page 654. In this illustration, the packet
arrived safely on the downstream port of the Switch but while it was being
stored or processed within the Switch a bit error occurred. The LCRC only
protects TLPs while on the Link. Once the Data Link Layer of the Ingress Port
checks the LCRC, it removes it from the packet because a new LCRC will be
calculated (which will include the new Sequence Number) at the Egress Port. This
means that the packet is unprotected while inside the Switch. This is the
purpose of having an ECRC. It is calculated at the originating device and is not
removed or recalculated by intermediate devices. So if the target device is
checking the ECRC and sees a mismatch, then there must have been an error
somewhere along the way even though no LCRC error was seen. Note that
using the ECRC requires the presence of the optional Advanced Error
Reporting registers, since they contain the bits to enable this functionality.
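
The difference between the two CRCs can be modeled with a small simulation. It is a simplified sketch: zlib's CRC-32 stands in for both codes, Sequence Numbers are left out, and the function names (originate, transmit, receive, ecrc_ok) are hypothetical.

```python
import zlib

def crc(data: bytes) -> bytes:
    return zlib.crc32(data).to_bytes(4, "big")

def originate(header_and_data: bytes) -> bytes:
    """Transaction Layer of the sender: append the ECRC once."""
    return header_and_data + crc(header_and_data)

def transmit(tlp: bytes) -> bytes:
    """Data Link Layer of a transmitter: append an LCRC for this Link only."""
    return tlp + crc(tlp)

def receive(frame: bytes) -> bytes:
    """Data Link Layer of a receiver: check the LCRC, then strip it."""
    tlp, lcrc = frame[:-4], frame[-4:]
    if crc(tlp) != lcrc:
        raise ValueError("LCRC error on this Link")
    return tlp

def ecrc_ok(tlp: bytes) -> bool:
    """Transaction Layer of the ultimate target: check the ECRC."""
    return crc(tlp[:-4]) == tlp[-4:]

tlp = originate(b"header+payload")
inside_switch = bytearray(receive(transmit(tlp)))    # first Link: LCRC passes
inside_switch[2] ^= 0x01                             # bit error inside the Switch
at_target = receive(transmit(bytes(inside_switch)))  # second Link: LCRC passes
print(ecrc_ok(at_target))                            # only the ECRC notices
```

Both Links report clean transfers because each LCRC is computed over whatever the transmitting port was handed, including the already-damaged bits.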

Figure 15-3: ECRC Usage Example

[Figure: a Root Complex, a Switch, and a PCIe Endpoint. An internal bit error
occurs inside the Switch, while no external (LCRC) transmission errors occur.]


Error Checks by Layer


Different aspects of an incoming packet are checked in the different layers at the
Receiver. Some error checking is listed as optional. For those cases, if the error
occurs but the designer has chosen not to implement that form of checking, it
will not be detected.

Physical Layer Errors


A packet arriving at the Receiver arrives at the Physical Layer first. There are a
few things that must be checked at this level and others that may optionally be
checked. Link training also takes place at this layer, and a variety of problems
may arise during that process, but those and other details of the Physical Layer
are covered in Chapter 14, entitled "Link Initialization & Training," on page 505.
In summary, though, Physical Layer errors, also called Receiver Errors or Link
Errors, include the following cases:

• When using 8b/10b, checking for decode violations (checking required)
• Framing violations (optional for 8b/10b, required for 128b/130b)
• Elastic buffer errors (checking optional)
• Loss of symbol lock or Lane deskew (checking optional)

If a TLP was in progress when a Receiver Error was detected, it is discarded. To
resolve the error, the Data Link Layer is signaled to send a NAK if one isn't
already pending.

Data Link Layer Errors


After the Physical Layer, incoming packets go next into the Data Link Layer,
where they are checked for several possible problems. The details of these
conditions can be found in Chapter 10, entitled "Ack/Nak Protocol," on page 317. In
summary, the errors are:

• LCRC failure for TLPs
• Sequence Number violation for TLPs
• 16-bit CRC failure for DLLPs
• Link Layer Protocol errors

As with the Physical Layer, if a TLP was in progress when an error is seen, the
TLP is discarded and a NAK is scheduled if one isn't already pending.

There are some Data Link Layer errors to watch for at the transmitter, too,
including REPLAY_TIMER expiring and the REPLAY_NUM counter rolling
over. A timeout is handled by replaying the contents of the Replay Buffer and
incrementing the REPLAY_NUM counter. The timer and counter are reset
whenever an ACK or NAK arrives at the transmitter that indicates forward
progress has been made (meaning it results in clearing one or more TLPs from
the Replay Buffer). But if an Ack or Nak isn't received quickly enough, the
timeout condition is seen, which will result in a replay.
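
The transmitter-side bookkeeping can be pictured as a small state object. This is a rough sketch under stated assumptions: the class and method names are hypothetical, the 2-bit width of REPLAY_NUM is taken from the PCIe spec rather than from this chapter, and real hardware does more on a rollover than count it.

```python
class ReplayState:
    REPLAY_NUM_BITS = 2            # assumed counter width (from the PCIe spec)

    def __init__(self):
        self.replay_num = 0
        self.replay_buffer = []    # TLPs sent but not yet acknowledged
        self.rollovers = 0

    def send(self, tlp):
        self.replay_buffer.append(tlp)

    def on_timeout(self):
        """REPLAY_TIMER expired: bump REPLAY_NUM and replay the buffer."""
        self.replay_num = (self.replay_num + 1) % (1 << self.REPLAY_NUM_BITS)
        if self.replay_num == 0:
            self.rollovers += 1    # a rollover is itself an error condition
        return list(self.replay_buffer)

    def on_ack_or_nak(self, purged):
        """An Ack or Nak that clears TLPs from the buffer is forward progress."""
        if purged > 0:
            del self.replay_buffer[:purged]
            self.replay_num = 0    # the counter is reset on forward progress

link = ReplayState()
link.send("TLP 0")
link.send("TLP 1")
for _ in range(3):
    replayed = link.on_timeout()
print(link.replay_num, link.rollovers, replayed)
```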

Transaction Layer Errors


Lastly, if incoming TLPs pass all the checks at the Physical and Data Link
Layers, they will finally reach the Transaction Layer, where they are checked for:

• ECRC failure (checking optional)
• Malformed TLP (error in packet format)
• Flow Control Protocol violation
• Unsupported Requests
• Data Corruption (poisoned packet)
• Completer Abort (checking optional)
• Receiver Overflow (checking optional)

As with the Data Link Layer, there are some error checks at the transmitter
Transaction Layer, too, such as:

• Completion Timeouts
• Unexpected Completion (Completion does not match pending Request)

Error Pollution
A problem can arise if a device sees several problems for the same transaction.
This could result in several errors getting reported (referred to as "Error
Pollution"). To avoid this, reported errors are limited to only the most significant
one. For example, if a TLP has a Receiver Error at the Physical Layer, it would
certainly be found to have errors at the Data Link Layer and Transaction Layers,
too, but reporting them all would just add confusion. What is most relevant is
reporting the first error that was seen. Consequently, if an error is seen in the
Physical Layer, there's no reason to forward the packet to the higher layers.
Similarly, if an error is seen in the Data Link Layer, then the packet won't be
forwarded to the Transaction Layer. Offending packets at one level are not
forwarded to the next level but are dropped.

Still, multiple errors may be seen for the same packet at the Transaction Layer.
Only the most significant one should be reported in the order of priority as
defined by the spec. Transaction Layer error priority from highest to lowest is:


• Uncorrectable Internal Error
• Receiver Buffer Overflow
• Flow Control Protocol Error
• ECRC Check Failed
• Malformed TLP
• AtomicOp Egress Blocked
• TLP Prefix Blocked
• ACS (Access Control Services) Violation
• MC (Multicast) Blocked TLP
• UR (Unsupported Request), CA (Completer Abort), or Unexpected
  Completion
• Poisoned TLP Received

As an example, a TLP might experience an ECRC fault caused by a corrupted
header. Since something was corrupted within the packet, it might also be seen
as Malformed or possibly as an Unsupported Request. The ECRC fault is the
highest priority, since it means that the header contents may have been
corrupted, and due to this, there is no point in reporting errors that depend on
those contents.
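
Picking the single error to report amounts to a walk down the priority list. The sketch below encodes the order given above; the names PRIORITY and error_to_report are hypothetical, and UR, CA and Unexpected Completion are kept on one level because the list groups them together.

```python
# Transaction Layer error priority, highest first (order as listed above).
PRIORITY = (
    ("Uncorrectable Internal Error",),
    ("Receiver Buffer Overflow",),
    ("Flow Control Protocol Error",),
    ("ECRC Check Failed",),
    ("Malformed TLP",),
    ("AtomicOp Egress Blocked",),
    ("TLP Prefix Blocked",),
    ("ACS Violation",),
    ("MC Blocked TLP",),
    ("Unsupported Request", "Completer Abort", "Unexpected Completion"),
    ("Poisoned TLP Received",),
)

def error_to_report(detected):
    """Return the most significant error among those detected, or None."""
    for level in PRIORITY:
        for name in level:
            if name in detected:
                return name
    return None

# The corrupted-header example: three symptoms, one report.
seen = {"Malformed TLP", "Unsupported Request", "ECRC Check Failed"}
print(error_to_report(seen))
```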

Sources of PCI Express Errors


Rather than consider all of the error conditions individually, it will be helpful to
group them into common areas.

ECRC Generation and Checking


As mentioned earlier, ECRC generation and checking requires the optional
Advanced Error Reporting configuration register structure to be present, as
shown in Figure 15-4 on page 658. Configuration software checks for this
capability register to determine whether ECRCs are supported in a Function. If it is,
a write to the Error Capability and Control register can be used to enable it.
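
The discovery-and-enable sequence might look like the following sketch, which operates on an in-memory copy of a Function's 4 KB configuration space. The offsets and bit positions are assumptions taken from the PCIe spec's AER layout rather than from this chapter (extended capabilities starting at 100h, AER capability ID 0001h, control register at offset 18h with ECRC bits 5 through 8), and all helper names are hypothetical.

```python
import struct

EXT_CAP_START = 0x100        # assumed start of extended capability list
AER_CAP_ID = 0x0001          # assumed AER extended capability ID
AER_CAP_CONTROL = 0x18       # assumed offset of capability/control register
ECRC_GEN_CAPABLE = 1 << 5
ECRC_GEN_ENABLE = 1 << 6
ECRC_CHK_CAPABLE = 1 << 7
ECRC_CHK_ENABLE = 1 << 8

def read32(cfg, offset):
    return struct.unpack_from("<I", cfg, offset)[0]

def write32(cfg, offset, value):
    struct.pack_into("<I", cfg, offset, value)

def find_ext_capability(cfg, cap_id):
    """Walk the extended capability list; return the offset or None."""
    offset, seen = EXT_CAP_START, set()
    while offset and offset not in seen and offset + 4 <= len(cfg):
        seen.add(offset)
        header = read32(cfg, offset)
        if header in (0, 0xFFFFFFFF):
            return None                  # empty list or no device
        if header & 0xFFFF == cap_id:
            return offset
        offset = header >> 20            # next-capability pointer
    return None

def enable_ecrc(cfg):
    """Enable ECRC generation/checking where the Function is capable."""
    aer = find_ext_capability(cfg, AER_CAP_ID)
    if aer is None:
        return False                     # no AER structure, so no ECRC
    reg = read32(cfg, aer + AER_CAP_CONTROL)
    if reg & ECRC_GEN_CAPABLE:
        reg |= ECRC_GEN_ENABLE
    if reg & ECRC_CHK_CAPABLE:
        reg |= ECRC_CHK_ENABLE
    write32(cfg, aer + AER_CAP_CONTROL, reg)
    return bool(reg & (ECRC_GEN_ENABLE | ECRC_CHK_ENABLE))

# Made-up configuration space: one unrelated capability, then AER at 148h.
cfg = bytearray(4096)
write32(cfg, 0x100, (0x148 << 20) | (1 << 16) | 0x0003)
write32(cfg, 0x148, (1 << 16) | AER_CAP_ID)
write32(cfg, 0x148 + AER_CAP_CONTROL, ECRC_GEN_CAPABLE | ECRC_CHK_CAPABLE)
print(enable_ecrc(cfg))
```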


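Figure 15-4: Location of Error-Related Configuration Registers

[Figure: configuration space map. The PCI-compatible space (bytes 0d to 255d)
holds the header (bytes 0d to 63d, including the Status and Command registers
and CapPtr) and the required PCIe Capability Block. The PCIe extended
capability space (up to byte 4095d) holds the optional Advanced Error
Reporting Capability Structure and other capability structures.]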

A device enabled to generate ECRCs originates a TLP (Request or Completion),
computes the 32-bit ECRC based on the header and data portions of the packet
and adds it to the end of the packet. The ECRC is called "end-to-end" because
the intent is that it will be generated at the TLP's origin and never stripped off or
regenerated by any intermediate device along its path. Switches in the path
between the originating and receiving devices are allowed to check and report
ECRC errors but aren't required to do so. Whether or not there is an error, a
Switch must still forward the packet unaltered so that the ultimate target device
can evaluate the ECRC and take appropriate steps. If a Switch is acting as the
originator or recipient of the TLP it can participate like an ordinary device in
ECRC generation and checking. For more on the topic of how a Switch is
allowed to report such errors, see "Advisory Non-Fatal Errors" on page 670.


TLP Digest
If the optional ECRC capability is enabled, a special bit called TD (TLP Digest) is
set in the header to indicate that it's present at the end of the packet (the ECRC
is also called the Digest). The TD bit in the packet header is shown in Figure 15-5
on page 659. The spec emphasizes that this bit must be treated with special
care when forwarding a TLP because if it's missing but the ECRC is present, or
vice versa, then the packet will be considered Malformed.

Figure 15-5: TLP Digest Bit in a Completion Header

[Figure: TLP header layout. Byte 0: Fmt, Type. Byte 1: R, TC, R, Attr, R, TH.
Byte 2: TD, EP, Attr, AT, Length (upper bits). Byte 3: Length (lower bits).
Bytes 4-7, 8-11, and 12-15 vary with the Type field.]

Variant Bits Not Included in ECRC Mechanism


The ECRC is calculated based on the contents of the header and data. Since
these are not expected to change, the result should be the same when the check
is performed at the receiver. However, it turns out that two header bits can
legally change while the packet is in flight: bit 0 of the Type field, and the EP bit.
Bit 0 of the Type field can change in Configuration Requests for the simple
reason that the Request will be Type 1 until it has reached its destination bus, and
then it will become Type 0. That involves changing bit 0 of the Type field. The
EP bit can also be legally changed by intermediate devices if they detect a data
error. For example, if a Switch forwards a TLP but it suffers an internal error of
some kind that corrupts the data, setting the EP bit as it goes out the Egress Port
is one way to report the error (known as error forwarding or data poisoning).
Since these two bits can change while the packet is in flight they are called
"variant bits" and cannot be used in the generation or checking of ECRC.
Instead, their values are always assumed to be 1b for ECRC generation and
checking instead of using the actual values. That way the ECRC doesn't depend
on them and will be correctly evaluated.
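
Forcing the variant bits to 1b before the calculation can be shown with a short sketch. Assumptions: zlib's CRC-32 stands in for the real ECRC algorithm, the bit positions follow the header layout of Figure 15-5 (Type bit 0 is bit 0 of byte 0, EP is bit 6 of byte 2), and the example TLP bytes are made up.

```python
import zlib

TYPE_BIT0 = 0x01   # byte 0, bit 0: distinguishes Type 1 from Type 0 config
EP_BIT = 0x40      # byte 2, bit 6: Error/Poisoned

def ecrc(tlp: bytes) -> int:
    """Compute a stand-in ECRC with both variant bits forced to 1b."""
    masked = bytearray(tlp)
    masked[0] |= TYPE_BIT0
    masked[2] |= EP_BIT
    return zlib.crc32(bytes(masked))

# A made-up 3DW header plus one DW of data; 0x45 = config write, Type 1.
type1 = bytes([0x45, 0x00, 0x00, 0x01]) + bytes(8) + bytes([0xDE, 0xAD, 0xBE, 0xEF])

type0 = bytearray(type1)
type0[0] &= ~TYPE_BIT0 & 0xFF      # converted to Type 0 at the destination bus
poisoned = bytearray(type1)
poisoned[2] |= EP_BIT              # a Switch poisoned the data in flight
damaged = bytearray(type1)
damaged[5] ^= 0x08                 # any other bit change must be caught

print(ecrc(type1) == ecrc(bytes(type0)))     # variant bit: same ECRC
print(ecrc(type1) == ecrc(bytes(poisoned)))  # variant bit: same ECRC
print(ecrc(type1) == ecrc(bytes(damaged)))   # real corruption: different
```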


The actions taken when an ECRC error is detected are beyond the scope of the
spec, but the possible choices will depend on whether the error is found in a
Request or a Completion.

• ECRC in Request: Completers that detect an ECRC error must set the
  ECRC error status bit. They may also choose not to return a Completion for
  this Request, resulting in a Completion timeout at the Requester, whose
  software might then choose to reschedule the Request.
• ECRC in Completion: Requesters that detect an ECRC error must set the
  ECRC error status bit. Besides the standard error reporting mechanism,
  they may also choose to report the error to their device driver with a
  Function-specific interrupt. As before, the software might decide to reschedule
  the failed Request.

In either case, an Uncorrectable Non-fatal error Message may be sent to the
system. If so, the device driver would probably be accessed to check the status bits
in the Uncorrectable Error Status Register and learn the nature of the error. If
possible, the failed Request may be rescheduled, but other steps might be needed.

Data Poisoning
Data poisoning, also called Error Forwarding, provides an optional way for a
device to indicate that the data associated with a TLP is corrupted. In these
cases, the EP (Error Poisoned) bit in the packet header is set to indicate the error.
The EP bit is shown in Figure 15-6 on page 660.

Figure 15-6: The Error/Poisoned Bit in a Completion Header

[Figure: TLP header layout. Byte 0: Fmt, Type. Byte 1: R, TC, R, Attr, R, TH.
Byte 2: TD, EP, Attr, AT, Length (upper bits). Byte 3: Length (lower bits).
Bytes 4-7, 8-11, and 12-15 vary with the Type field.]


Anytime data is transferred, such as in write Requests or Completions with
data, corruption of that data could happen which needs to be reported to the
target device. In each of these cases, the packet can be forwarded to the recipient
but marked as having bad data by the EP bit in the header. The thoughtful
reader may wonder why one might want to send data that is already known to
be bad. As it happens, there are some cases where it's useful:

1. If a Request results in a Completion returned with data, but that data
   encountered an error as it was gathered from the target (like a parity or
   ECC failure in memory), then what is the best way to report it? One
   approach would be not to send the Completion at all but, if the error isn't
   reported in some other way, the system only sees a Completion timeout at
   the Requester. That response isn't very helpful because any number of
   problems might result in that outcome.
   If, on the other hand, the Completion is delivered with the poisoned bit set,
   then at least the Requester can see that the round-trip path to the Completer
   must have been working correctly. Therefore, the problem must have
   occurred internally to the Completer or else in a Switch that was in the path.
   What steps will be taken will be implementation specific, but more is
   known about what must have gone wrong than if the Completion simply
   timed out.
2. It can be used to report an intermediate problem. If a data payload is
   corrupted while passing through a Switch, the packet can still be forwarded
   with the EP bit set to indicate the problem.
3. It may be that the target device can accept the data with errors. As an
   example, an audio output device needs to receive a timely data stream to work
   well. If incoming data has an error, the consequences are small (glitch in the
   audio output) and the time to recover would be long enough to cause a
   noticeable delay, so it can be better to take it as is rather than attempting
   recovery of the data.
4. A target device might have a means of correcting the data. The data might
   be directly recoverable, or the target might have a means of re-creating
   parts of it, or have some other means of working around the problem.
The spec states that data poisoning applies only to the data payload associated
with a packet (such as Memory, Configuration, or I/O writes and Completions)
and never to the contents of the TLP header. Consequently, a receiver's behavior
is undefined if it sees a poisoned packet (EP=1) with no payload (like a poisoned
memory read). Poisoning can only be done at the Transaction Layer of a device;
the Data Link Layer does not examine or affect the contents of the TLP header.

Error forwarding support is stated to be optional for transmitters, and the
absence of such a statement for receivers implies that it's not optional for them.
If a transmitter supports it, it's enabled with the Parity Error Response bit in the
legacy Command register. That's because a Poisoned packet is roughly
analogous to a parity error in PCI, since that's how PCI reports bad data. Receipt of a
poisoned packet may be reported to the system with an error Message if
enabled and, if the optional Advanced Error Reporting registers are present,
will also set the Poisoned TLP status bit.

As one might expect, poisoned writes to control locations are not allowed to
modify the contents in the target. Examples given in the spec are Configuration
writes, IO or memory writes to control registers, and AtomicOps. Switches that
receive poisoned packets must forward them unchanged to the destination port
although, if they've been enabled to do so, they must report this packet as an
error to help software determine where the error happened. Completers that
receive a poisoned non-posted Request are expected to return a Completion
with a status of UR (Unsupported Request).
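
The receiver-side rules in this paragraph can be collected into a small decision function. This is a sketch only: the function name, its parameters and the action strings are hypothetical, and what a Completer does with poisoned data for a non-control target is left out because it is implementation specific.

```python
def on_poisoned_tlp(agent, posted, control_target, reporting_enabled):
    """List the actions taken on receipt of a TLP with EP=1.

    agent: "switch" or "completer"
    posted: True for posted Requests (e.g. memory writes)
    control_target: True if the write targets a control location
    reporting_enabled: True if the device is enabled to report errors
    """
    actions = []
    if agent == "switch":
        actions.append("forward unchanged")
    else:
        if control_target:
            actions.append("do not modify target")
        if not posted:
            actions.append("return Completion with UR")
    if reporting_enabled:
        actions.append("report error")
    return actions

# A Switch enabled for reporting, and a Completer hit by a poisoned
# configuration write (non-posted, control location).
print(on_poisoned_tlp("switch", True, False, True))
print(on_poisoned_tlp("completer", False, True, False))
```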

Split Transaction Errors


A variety of failures can occur during a split transaction associated with
non-posted requests. PCIe defines a status field within the Completion header that
allows the Completer to report some errors back to the Requester. Figure 15-7
on page 662 illustrates the location of this field in a completion header and
Table 15-1 on page 663 gives the possible values. As the table shows, only four
encodings are defined, two of which represent error conditions.

Figure 15-7: Completion Status Field within the Completion Header

[Figure: Completion header layout. Bytes 0-3: Fmt (0x0), Type (01010), R, TC,
R, Attr, R, TH, TD, EP, Attr, AT (00), Length. Bytes 4-7: Completer ID,
Compl. Status, BCM, Byte Count. Bytes 8-11: Requester ID, Tag, R, Lower
Address.]


Table 15-1: Completion Code and Description

Status Code      Completion Status Definition

000b             Successful Completion (SC)
001b             Unsupported Request (UR) error
010b             Configuration Request Retry Status (CRS)
011b             Reserved
100b             Completer Abort (CA) error
101b - 111b      Reserved

Unsupported Request (UR) Status


If a receiver doesn't support a Request, it returns a Completion with UR status.
The spec defines a number of conditions that could result in a UR status. Some
examples are:

• Request type not supported (example: IO Request to native Endpoint or
  MRdLk to native Endpoint)
• Message with unsupported or undefined message code
• Request does not reference address space mapped to the device
• Request address isn't mapped within a Switch Port's address range
• Poisoned write Request (EP=1) targets an I/O or Memory-mapped control
  space in the Completer. Such Requests must not be allowed to modify the
  location and are instead discarded by the Completer and reported with a
  Completion having a UR status.
• A downstream Root or Switch Port receives a configuration Request
  targeting a device on its Secondary Bus that doesn't exist (e.g. a device with a
  non-zero device number, unless ARI is enabled). The Port must terminate
  the Request and return a Completion with UR status because the
  downstream Device number is required to be zero (unless ARI, Alternative
  Routing-ID Interpretation, is enabled).
• Type 1 configuration Request is received at an Endpoint.
• Completion using a reserved Completion Status field encoding must be
  interpreted as UR.
• A function in the D1, D2, or D3hot power management state receives a
  Request other than a configuration Request or Message.
• A TLP without the No Snoop bit set in its header is routed to a port that has
  the Reject Snoop Transactions bit set in its VC Resource Capability register.
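
How a Requester might decode the status field, including the rule that reserved encodings are treated as UR, is sketched below. Assumptions: the field sits in the top three bits of header byte 6 (per the Figure 15-7 layout), the CA encoding is 100b as in the PCIe Base Specification, and the names used are hypothetical.

```python
# Completion Status encodings (CA assumed to be 100b per the PCIe spec).
STATUS_NAMES = {0b000: "SC", 0b001: "UR", 0b010: "CRS", 0b100: "CA"}

def completion_status(header: bytes) -> str:
    """Decode the Completion Status field of a Completion header."""
    code = (header[6] >> 5) & 0x7          # byte 6, bits 7:5
    # Reserved encodings must be interpreted as UR.
    return STATUS_NAMES.get(code, "UR")

def make_header(status_code: int) -> bytes:
    """Build a made-up 12-byte Completion header carrying the given status."""
    header = bytearray(12)
    header[0] = 0x0A                       # Completion without data
    header[6] = (status_code & 0x7) << 5
    return bytes(header)

for code in (0b000, 0b001, 0b100, 0b111):
    print(format(code, "03b"), completion_status(make_header(code)))
```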


Completer Abort (CA) Status


Several circumstances can occur that could result in a Completer returning this
CA status to the Requester. Some examples are:

• Completer receives a Request that it cannot complete without violating its
  programming rules. For example, some Functions may be designed to only
  allow accesses to some registers in a complete and aligned manner (e.g. a
  4-byte register may require a 4-byte aligned access). Any attempt to access
  one of these registers in a partial or misaligned fashion (e.g. reading only
  two bytes of a 4-byte register) would fail. Such restrictions are not violations
  of the spec, but rather legal constraints associated with the programming
  interface for this Function. Access to such a Function is based on the
  expectation that the device driver understands how to access its Function.
• Completer receives a Request that it cannot process because of some
  permanent error condition in the device. For example, a wireless LAN card that
  won't accept new packets because it can't transmit or receive over its radio
  until an approved antenna is attached.
• Completer receives a Request for which it detects an ACS (Access Control
  Services) error. An example of this would be a Root Port that implements
  the ACS registers and has ACS Translation Blocking enabled. If a memory
  Request is seen on that Port with anything other than the default value in
  the AT field, it will be an ACS violation.
• PCIe-to-PCI Bridge may receive a Request that targets the PCI bus. PCI
  allows the target device to signal a target abort if it can't complete the
  Request due to some permanent condition or violation of the Function's
  programming rules. In response, the bridge would return a Completion
  with CA status.

A Completer that aborts a Request may report the error to the Root with a
Non-fatal Error Message and, if the Request requires a Completion, the status would
be CA.

Unexpected Completion
When a Requester receives a Completion, it uses the transaction descriptor
(Requester ID and Tag) to match it with an earlier Request. In rare
circumstances, the transaction descriptor may not match any previous Request. This
might happen because the Completion was misrouted on its journey back to
the intended Requester. An Advisory Non-fatal Error Message can be sent by
the device that receives the unexpected Completion, but it's expected that the
correct Requester will eventually time out and take the appropriate action, so
that error Message would be a low priority.
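
The matching step can be modeled with a table of outstanding Requests keyed by Tag. This is a minimal sketch: the Requester class, its methods and the example IDs are hypothetical.

```python
class Requester:
    def __init__(self, requester_id):
        self.requester_id = requester_id
        self.outstanding = {}              # Tag -> description of the Request

    def issue(self, tag, description):
        self.outstanding[tag] = description

    def on_completion(self, requester_id, tag):
        """Match a Completion by Requester ID and Tag."""
        if requester_id != self.requester_id or tag not in self.outstanding:
            return "unexpected completion"  # may be reported, low priority
        return self.outstanding.pop(tag)

dev = Requester(requester_id=0x0100)
dev.issue(tag=7, description="read 64 bytes")

print(dev.on_completion(0x0100, 7))   # matches the pending Request
print(dev.on_completion(0x0100, 7))   # same Tag again: nothing pending now
print(dev.on_completion(0x0200, 3))   # misrouted: belongs to another device
```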


Completion Timeout
For the case of a pending Request that never receives the Completion it's
expecting, the spec defines a Completion timeout mechanism. The spec clearly intends
this to detect when a Completion has no reasonable chance of returning; it
should be longer than any normal expected latencies.

The Completion timeout timer must be implemented by all devices that initiate
Requests that expect Completions, except for devices that only initiate
configuration transactions. Note also that every Request waiting for Completions is
timed independently, and so there must be a way to track time for each
outstanding transaction. The 1.x and 2.0 versions of the spec defined the
permissible range of the timeout value as follows:

• It is strongly recommended that a device not time out earlier than 10 ms after
  sending a Request; however, if the device requires greater granularity a
  timeout can occur as early as 50 µs.
• Devices must time out no later than 50 ms.

Beginning with the 2.1 spec revision, the Device Control Register 2 was added
to the PCI Express Capability Block to allow software visibility and control of
the timeout values, as shown in Figure 15-8 on page 665.

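Figure 15-8: Device Control Register 2

[Figure: Completion Timeout Value encodings in Device Control Register 2.
The high-order bits select the range.]

Encoding    Range      Timeout
0000b       Default    50 µs to 50 ms
0001b       A          50 µs to 100 µs
0010b       A          1 ms to 10 ms
0101b       B          16 ms to 55 ms
0110b       B          65 ms to 210 ms
1001b       C          260 ms to 900 ms
1010b       C          1 s to 3.5 s
1101b       D          4 s to 13 s
1110b       D          17 s to 64 s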

If Requests need multiple Completions to return the requested data, a single
Completion won't stop the timer. Instead, the timer continues to run until all the
data has been returned regardless of how many Completions are needed. If
only part of the data has been returned when the timeout occurs, the Requester
may discard or keep that data.
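
Per-Request timing with multiple Completions can be sketched as follows. The PendingRead class and its fields are hypothetical, times are plain integers in microseconds supplied by the caller, and the 50 ms limit is just the default upper bound mentioned above.

```python
class PendingRead:
    """Tracks one outstanding Request; each Request has its own timer."""

    def __init__(self, bytes_requested, start_us, timeout_us=50_000):
        self.remaining = bytes_requested
        self.start_us = start_us
        self.timeout_us = timeout_us

    def on_completion(self, nbytes):
        """Record returned data; True only once everything has arrived."""
        self.remaining -= nbytes
        return self.remaining == 0

    def timed_out(self, now_us):
        """The timer keeps running until all of the data has been returned."""
        return self.remaining > 0 and now_us - self.start_us >= self.timeout_us

read = PendingRead(bytes_requested=256, start_us=0)
done = read.on_completion(128)          # first of two Completions, at 10 ms
print(done, read.timed_out(10_000))     # not finished, timer still running
print(read.timed_out(60_000))           # second Completion never arrived
```

A partial Completion does not restart or stop the timer in this model, which matches the rule that only the return of all data ends the wait.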


Link Flow Control Related Errors


Prior to forwarding the packet to the Data Link Layer for transmission, the
Transaction Layer must check Flow Control (FC) credits to ensure that the
receive buffers of the Link neighbor have sufficient room to hold it. Flow
Control violations may occur, and they are considered uncorrectable. Protocol
violations related to Flow Control can be detected by and associated with the port
receiving the Flow Control information. Some examples are given here:

• Link partner fails to advertise at least the minimum number of FC credits
  defined by the spec during FC initialization for any Virtual Channel.
• Link partner advertises more than the allowed maximum number of FC
  credits (up to 2047 unused credits for data payload and 127 unused credits
  for headers).
• Receipt of FC updates containing non-zero values in credit fields that were
  initially advertised as infinite.
• A receive buffer overflow, resulting in lost data. This check is optional but a
  detected violation is considered to be a Fatal error.
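
Two of these checks lend themselves to a compact sketch: the maximum-credit limit and the rule for credits advertised as infinite. The function name, parameters and message strings are hypothetical; the limits of 2047 and 127 come from the list above.

```python
MAX_DATA_CREDITS = 2047      # unused credits allowed for data payload
MAX_HEADER_CREDITS = 127     # unused credits allowed for headers

def fc_violations(kind, unused_credits, initially_infinite=False):
    """Return a list of Flow Control protocol problems for one credit type.

    kind: "data" or "header"
    unused_credits: credit value carried by the Link partner's FC packet
    initially_infinite: True if this credit type was first advertised as
        infinite during FC initialization
    """
    limit = MAX_DATA_CREDITS if kind == "data" else MAX_HEADER_CREDITS
    problems = []
    if initially_infinite:
        if unused_credits != 0:
            problems.append("non-zero update for infinite credits")
    elif unused_credits > limit:
        problems.append("more than the maximum credits advertised")
    return problems

print(fc_violations("header", 127))
print(fc_violations("header", 128))
print(fc_violations("data", 16, initially_infinite=True))
```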

Malformed TLP
TLPs arriving in the Transaction Layer are checked for violations of the packet
formatting rules. A violation in the packet format is considered a Fatal error
because it means the transmitter has made a grievous mistake in protocol, such
as failing to properly maintain its counters, and the result is that it's no longer
performing as expected. Some examples of a packet being considered
malformed (badly formed) include the following:

DatapayloadexceedsMaxpayloadsize.
Datalengthdoesnotmatchlengthspecifiedintheheader.
Memorystartaddressandlengthcombinetocauseatransactiontocrossa
naturallyaligned4KBboundary.
TLPDigest(TDfield)indicationdoesntcorrespondwithpacketsize(ECRC
isunexpectedlymissingorpresent).
ByteEnableviolation.
UndefinedTypefieldvalues.
CompletionthatviolatestheReadCompletionBoundary(RCB)value.
CompletionwithstatusofConfigurationRequestRetryStatusinresponse
toaRequestotherthanaconfigurationRequest.
TrafficClassfieldcontainsavaluenotassignedtoanenabledVirtualChan
nel(thisisalsoknownasTCFiltering).


• I/O and Configuration Request violations (checking optional). Examples:
  - TC field, Attr[1:0], and the AT field must all be zero, while the Length
    field must have a value of one.
• Interrupt emulation messages sent downstream (checking optional).
• TLP received with a TLP Prefix error:
  - TLP Prefix but no TLP Header
  - End-to-End TLP Prefixes preceding Local Prefixes
  - Local TLP Prefix type not supported
  - More than 4 End-to-End TLP Prefixes
  - More End-to-End TLP Prefixes than are supported
• Transaction type requiring use of TC0 has a different TC value:
  - I/O Read or Write Requests and corresponding Completions
  - Configuration Read or Write Requests and corresponding Completions
  - Error Messages
  - INTx messages
  - Power Management messages
  - Unlock messages
  - Slot Power messages
  - LTR messages
  - OBFF messages
• AtomicOp operand doesn't match an architected value.
• AtomicOp address isn't naturally aligned with operand size.
• Routing is incorrect for transaction type (e.g., transactions requiring
  routing to Root Complex detected moving away from Root Complex).
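
As a rough illustration of three of the simpler checks listed above (payload size, length mismatch, and 4KB boundary crossing), consider the following Python sketch for a memory write. The function name and parameters are hypothetical, and byte enables are ignored, so a real checker would be more involved.

```python
def malformed_reasons(addr, length_dw, payload_dw, max_payload_bytes):
    """Return the reasons a memory write TLP would be considered malformed.

    addr              -- start byte address of the request
    length_dw         -- Length field from the header, in dwords
    payload_dw        -- number of payload dwords actually present
    max_payload_bytes -- the programmed Max Payload Size, in bytes
    """
    reasons = []
    if payload_dw * 4 > max_payload_bytes:
        reasons.append("payload exceeds Max Payload Size")
    if payload_dw != length_dw:
        reasons.append("payload does not match Length field")
    length_bytes = length_dw * 4
    if length_bytes > 0:
        first_page = addr // 4096
        last_page = (addr + length_bytes - 1) // 4096
        if first_page != last_page:
            reasons.append("crosses 4KB boundary")
    return reasons
```

For instance, a 16-byte write starting at address FF8h ends at 1007h, so it spans two naturally aligned 4KB regions and is flagged.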

Internal Errors
The Problem
The first versions of the PCIe spec did not include a mechanism for reporting
errors within a device that were unrelated to transactions on the interface
itself. For Endpoints this wasn't really a problem because they have a
vendor-specific device driver associated with them that can detect and report
internal errors. However, Switches are considered system resources that are
managed by the OS, and typically don't have software to help with internal
error detection. In high-end systems, the ability to contain errors is
important, so Switch vendors created proprietary means of handling internal
errors. Unfortunately, since different vendor solutions were incompatible with
each other, the end result was that they were seldom used.


The Solution
To alleviate this situation, a standardized internal error reporting option was
added with the 2.1 spec version. The definition of what constitutes an internal
error is beyond the scope of the spec, but they can be reported as either
Corrected or Uncorrectable Internal Errors.

A Corrected Internal Error means an error was masked or worked around by
the hardware with no loss of information or improper behavior. An example
would be an ECC error on an internal memory location that was corrected
automatically. On the other hand, an Uncorrectable Internal Error means
improper operation has resulted with potential data loss, such as a parity
error on an internal memory location. Reporting internal errors is optional
and, if it is used, the AER (Advanced Error Reporting) registers must be
present to support it.

How Errors are Reported

Introduction
PCI Express includes three methods of reporting errors, as shown below. The
first two, Completions and poisoned packets, were covered earlier, so our next
topic will be the error Messages.

• Completions: Completion Status reports errors back to the Requester
• Poisoned Packet: reports bad data in a TLP to the receiver
• Error Message: reports errors to the host (software)

Error Messages
PCIe eliminated the sideband signals from PCI and replaced them with Error
Messages. These Messages provide information that could not be conveyed
with the PERR# and SERR# signals, such as identifying the detecting Function
and indicating the severity of the error. Figure 15-9 illustrates the Error
Message format. Note that they're routed to the Root Complex for handling. The
Message Code defines the type of Message being signaled. Not surprisingly, the
spec defines three types of error Messages, as shown in Table 15-2.


Table 15-2: Error Message Codes and Description

Message Code   Name           Description

30h            ERR_COR        Device detected a correctable error. This is
                              automatically corrected by hardware and doesn't
                              require software attention. However, it can be
                              helpful to report them anyway so software can
                              watch for trends like an increasing number of
                              correctable errors.

31h            ERR_NONFATAL   Indicates an uncorrectable Non-Fatal error. No
                              hardware correction mechanism was available but
                              the Link is still working reliably. Software
                              attention will be required to resolve the
                              problem.

33h            ERR_FATAL      Indicates an uncorrectable Fatal error. No
                              hardware correction mechanism was available and
                              Link operation has failed in some important
                              respect. Software attention will be required and
                              a reset of at least one device will probably be
                              required to resolve this issue.

Figure 15-9: Error Message Format

Byte 0    Fmt = 001b, Type = 10000b (Route to Root Complex), followed by the
          R, TC, R, Attr, R, TH, TD, EP, Attr, AT and Length fields
Byte 4    Requester ID, Tag, Message Code (30h, 31h or 33h)
Byte 8    Reserved for Error Messages
Byte 12   Reserved for Error Messages

Message Codes: 30h = ERR_COR, 31h = ERR_NONFATAL, 33h = ERR_FATAL
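
To make the header layout concrete, the following Python sketch assembles the 16-byte header of an Error Message as described in the figure. The function name is hypothetical, and every field not called out in the figure (TC, attributes, Length) is simply left at zero as an assumption of this sketch.

```python
import struct

ERR_COR, ERR_NONFATAL, ERR_FATAL = 0x30, 0x31, 0x33


def build_error_message(requester_id, message_code, tag=0):
    """Return the four header dwords of an Error Message as 16 bytes."""
    if message_code not in (ERR_COR, ERR_NONFATAL, ERR_FATAL):
        raise ValueError("not an Error Message code")
    # Byte 0: Fmt = 001b in bits 7:5, Type = 10000b (route to Root) in 4:0.
    byte0 = (0b001 << 5) | 0b10000
    dw0 = bytes([byte0, 0, 0, 0])
    # Bytes 4-7: Requester ID (16 bits), Tag, Message Code.
    dw1 = struct.pack(">HBB", requester_id, tag, message_code)
    # Bytes 8-15 are reserved for Error Messages.
    return dw0 + dw1 + bytes(8)
```

The Requester ID identifies the Function that detected the error, which is what lets the Root record the source of the message.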


Advisory Non-Fatal Errors


Since we've just seen that both types of Uncorrectable errors will need
software attention, it sounds counter-intuitive to say that there are cases
where it's preferable that a device not report Non-Fatal errors it detects,
but there are. These cases are predominantly based on the role of the
detecting agent (Requester, Completer, or Intermediate device) and the type of
error. The problem is that multiple devices might report an error caused by
the same event and, on some platforms, sending one of the Non-Fatal Error
Messages (ERR_NONFATAL) can prevent software from properly handling the
error. For example, if an Endpoint reports an error, its device driver will be
called to service the situation. However, if a Switch reports an error first
for the same transaction, system software might be called to investigate and
might not understand what the driver was trying to accomplish or what would
be the optimal response.

That example illustrates that some detecting agents aren't the best ones to
determine the ultimate disposition of the error and shouldn't send an
uncorrectable message. Instead, such an agent can signal an advisory
notification to software with ERR_COR. This avoids confusion about the source
of the uncorrectable error but still gives software a little more information
about what happened. Eventually, the appropriate detecting agent will send the
ERR_NONFATAL message whenever it sees the error. Beginning with the 1.1 spec
revision, a new field was added in the PCI Express Device Capabilities
register to indicate support for this capability, as shown in Figure 15-10 on
page 670. This bit must be set for every agent that is compliant with the 1.1
spec or later.
Figure 15-10: Device Capabilities Register

Bits     Field
31:29    RsvdP
28       Function-Level Reset Capability
27:26    Captured Slot Power Limit Scale
25:18    Captured Slot Power Limit Value
17:16    RsvdP
15       Role-Based Error Reporting
14:12    Undefined
11:9     Endpoint L1 Acceptable Latency
8:6      Endpoint L0s Acceptable Latency
5        Extended Tag Field Supported
4:3      Phantom Functions Supported
2:0      Max Payload Size Supported
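
As a small illustration, software could test the Role-Based Error Reporting bit in a Device Capabilities value like this. The function name is hypothetical, and the Max Payload Size decoding assumes the usual encoding in which 000b means 128 bytes and each step doubles the size.

```python
ROLE_BASED_ERROR_REPORTING = 1 << 15  # bit 15 of Device Capabilities


def decode_device_capabilities(devcap):
    """Pull two fields out of a 32-bit Device Capabilities value."""
    return {
        # Assumed encoding: 000b = 128 bytes, 001b = 256 bytes, and so on.
        "max_payload_bytes": 128 << (devcap & 0x7),
        "role_based_error_reporting": bool(devcap & ROLE_BASED_ERROR_REPORTING),
    }
```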


In spite of the reasons just described, software might want to stop operation
as soon as some advisory errors are seen by an intermediate device. Since
newer devices will always perform role-based error reporting, an override
mechanism is needed. To handle this case, software can escalate the severity
of the advisory errors from Non-Fatal to Fatal in the AER (Advanced Error
Reporting) registers. Since there is no advisory fatal case, the error will now
be reported as a Fatal Error (ERR_FATAL), if enabled, regardless of the role of
the device.

Advisory Non-Fatal Cases


The spec lists five situations for which an advisory message (ERR_COR) is
preferred over an ERR_NONFATAL message. In each of these cases, the detecting
agent will handle the error as an Advisory Non-Fatal Error. This means that a
Non-Fatal condition will be handled by sending an ERR_COR, assuming the
agent has AER registers and has enabled ERR_COR. If it doesn't have AER
registers or ERR_COR was not enabled, it sends no Error Message. The five
cases are as follows:

1. Completer sent a Completion with UR or CA Status. The expectation in this
   case is that the Requester will have a mechanism to handle the error when it
   sees the offending Completion and will be the best agent to send whatever
   Error Messages are needed. An ERR_NONFATAL message from the Completer
   would just be confusing, so it must be handled as Advisory Non-Fatal
   (ERR_COR).
   Curiously, there is no PCIe mechanism for the Requester to report that it
   received a Completion with this status. Instead, a design-specific method
   like an interrupt will be needed to get device driver attention. An
   important example of this happens when the Root Complex receives a
   Completion with UR or CA status in response to a Configuration Read
   Request. On some platforms the response is to return all 1s to software for
   this case, to support backward compatibility with PCI enumeration
   (configuration probing) software.
2. Intermediate device detected an error. This case comes up in systems that
   employ Switches because a detecting agent may not be the final destination
   for a TLP. As an example of this, consider Figure 15-11 on page 672,
   showing a poisoned packet delivered through an intermediate Switch. The TLP
   is seen as a Non-Fatal error by the Switch but it can only signal an
   ERR_COR message instead (as long as it's enabled to do so).
   To explore this concept a little more, why wouldn't we want the Switch to
   report ERR_NONFATAL? One reason is seen by looking at error tracking in
   the AER registers. Figure 15-12 on page 672 shows the AER registers that
   track the Source ID (BDF of the sending device) of Error Messages coming
   into a Root Port and we can see that there's only one space available for
   uncorrectable errors. If multiple uncorrectable errors are seen, that fact
   will be noted but only the first source ID will be saved since it is
   considered to be the probable cause of subsequent errors. It's important,
   therefore, that uncorrectable errors come from the most appropriate device
   to report them. It's worth noting that it's still helpful for intermediate
   devices to report ERR_COR, because it allows software to determine where
   the error was first detected.

Figure 15-11: Role-Based Error Reporting Example

[Diagram: a CPU and Root Complex above a PCIe Switch, with PCIe Endpoints and a
Legacy Endpoint attached. A Poisoned Packet passes through the Switch, and the
Switch signals ERR_COR.]

Figure 15-12: Advanced Source ID Register

Error Source Identification Register of the AER Capability Structure

Bits     Field
31:16    ERR_FATAL/NONFATAL Source ID (ROS)
15:0     ERR_COR Source ID (ROS)

ROS: Read-Only and Sticky


   As another example, 1.0a devices that have the UR Reporting Enable bit
   cleared but don't have the Role-Based Error Reporting capability are unable
   to report any error Messages when a UR error is detected (for posted or
   non-posted Requests). In contrast, a 1.1-compliant or later Completer that
   has the SERR# Enable bit set will send an ERR_NONFATAL or ERR_FATAL
   message for bad posted Requests, even if the Unsupported Request Reporting
   Enable bit is clear, so as to avoid silent data corruption. But it won't
   send an error Message for non-posted Requests received, so as to support
   the PCI-compatible configuration method of probing with configuration
   reads.
   It's recommended that software keep the UR Reporting Enable bit clear for
   devices that are not capable of Role-Based Error Reporting, but set it for
   those that are. That way, UR errors are reported on bad posted requests,
   but not for bad non-posted requests like configuration probing
   transactions, and backward compatibility with older software is
   maintained.
   The spec also mentions that poisoned TLPs sent to the Root will be handled
   in the same way if the Root is acting as an intermediate agent, but there is
   one exception: If the Root doesn't support Error Forwarding, it will be
   unable to communicate the poisoned error with the TLP and must report
   this as a Non-Fatal error instead.
3. Destination device received a poisoned TLP. Normally, Endpoints would
   report the Non-Fatal error in this case, but there's an exception to this
   rule: If the ultimate destination device is able to handle the poisoned
   data in a way that allows for continued operation, it must treat this case
   as an Advisory Non-Fatal Error instead.
   An example of this behavior might be an audio device that receives
   streaming data that has been poisoned. In this situation, the data may be
   accepted even though it's known to be corrupted because pausing the audio
   flow long enough to get software attention and take remedial action would
   be a worse alternative than allowing a glitch in the sound output.
4. Requester experienced a Completion Timeout. This is a similar case to the
   previous one; if the Requester has a means of continuing operation in spite
   of the problem then it must treat this as an Advisory Non-Fatal Error. A
   simple workaround for the Requester in this case would be to send the
   request again and hope for better results this time. Clearly, this would
   only make sense if the previous request did not cause any side effects, but
   Requesters are permitted to do this as often as they like (although the
   spec says the number of retries must be finite).
5. Unexpected completion received. This must be handled as an Advisory
   Non-Fatal Error. The reason is that it was probably caused by a misrouted
   Completion and the original Requester will eventually report a Completion
   timeout. To allow that other Requester to attempt a retry of the failed
   request, it's important that the one that sees the Unexpected Completion
   not send a Non-Fatal message.
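
The decision described in this section can be summarized as a small Python sketch. The function and parameter names are hypothetical, and the model covers only the rules stated above: advisory cases use ERR_COR when AER is present and ERR_COR is enabled, and software escalation to Fatal overrides the role-based behavior.

```python
def message_for_nonfatal_error(advisory_case, has_aer, err_cor_enabled,
                               nonfatal_enabled, escalated_to_fatal=False,
                               fatal_enabled=False):
    """Return the Error Message sent for a Non-Fatal condition, or None."""
    if escalated_to_fatal:
        # There is no advisory Fatal case, so the role no longer matters.
        return "ERR_FATAL" if fatal_enabled else None
    if advisory_case:
        # One of the five advisory situations: report with ERR_COR only.
        return "ERR_COR" if (has_aer and err_cor_enabled) else None
    return "ERR_NONFATAL" if nonfatal_enabled else None
```

For example, a Switch forwarding a poisoned TLP would call this with advisory_case=True and would send ERR_COR at most.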

Baseline Error Detection and Handling


This section defines the required support for detecting and reporting PCI
Express errors. Compliant devices must include:

• PCI-Compatible support: required to honor PCI-compatible error control
  and status fields for older software that has no awareness of PCI Express.
• PCI Express Error reporting: uses standard PCIe structures for error
  control and status, which can be used by newer software that does have
  knowledge of PCI Express.

PCI-Compatible Error Reporting Mechanisms


General
PCI Express errors are mapped into the original PCI configuration register bits
for backward compatibility, allowing error status and control to be accessible
to PCI-compliant software. To understand the features available from the
PCI-compatible point of view, consider the error-related bits of the Command
and Status registers located within the Configuration header. Some of the field
definitions have been modified to reflect the related PCIe error conditions and
reporting mechanisms. The PCI Express errors tracked by the PCI-compatible
registers are:

• Transaction Poisoning/Error Forwarding (synonymous with data parity error
  in PCI)
• Completer Abort (CA) detected by a Completer (synonymous with Target
  Abort in PCI)
• Unsupported Request (UR) detected by a Completer (synonymous with
  Master Abort in PCI)

As mentioned earlier, the PCI mechanism for reporting errors is the assertion
of PERR# (data parity errors) and SERR# (unrecoverable errors). The PCI
Express mechanisms for reporting these events are the Completion Status
values in Completions and Error Messages to the Root.


Legacy Command and Status Registers


Figure 15-13 on page 675 illustrates the Command register and the location of
the error-related fields. These bits are set to enable baseline error reporting
under control of PCI-compatible software. Table 15-3 defines the specific
effects of each bit.

Figure 15-13: Command Register in Configuration Header

Bits     Field
15:11    Reserved
10       Interrupt Disable
9        Fast Back-to-back Enable*
8        SERR# Enable
7        Stepping Control*
6        Parity Error Response
5        VGA Palette Snoop Enable*
4        Mem Write & Invalidate Enable*
3        Special Cycles*
2        Bus Master Enable
1        Memory Space Enable
0        IO Space Enable

* Not used in PCIe, these must be set to zero

Table 15-3: Error-Related Fields in Command Register

Name                    Description

SERR# Enable            Setting this bit enables sending ERR_FATAL and
                        ERR_NONFATAL error messages to the Root Complex.
                        These are considered roughly analogous to asserting
                        the System Error (SERR#) signal in PCI.
                        For Type 1 headers (bridges), this bit controls the
                        forwarding of ERR_FATAL and ERR_NONFATAL error
                        messages from the secondary interface to the primary
                        interface.
                        This field has no effect over ERR_COR messages.

Parity Error Response   Setting this bit enables logging of poisoned TLPs in
                        the Master Data Parity Error bit in the Status
                        register.
                        Poisoned packets indicate bad data and are roughly
                        analogous to a PCI parity error.

Figure 15-14 on page 676 illustrates the Configuration Status register and the
location of the error-related bit fields. Table 15-4 on page 677 defines the
circumstances under which each bit is set and the actions taken by the device
when error reporting is enabled.

Figure 15-14: Status Register in Configuration Header

Bits     Field
15       Detected Parity Error
14       Signalled System Error
13       Received Master Abort
12       Received Target Abort
11       Signalled Target Abort
10:9     DEVSEL Timing*
8        Master Data Parity Error
7        Fast Back-to-back Capable*
6        Reserved
5        66 MHz Capable*
4        Capabilities List**
3        Interrupt Status
2:0      Reserved

* Not used in PCIe, these must be set to zero
** Must be set to one because some capability registers are required


Table 15-4: Error-Related Fields in Status Register

Error-Related Bit          Description

Detected Parity Error      Set by the port that receives a poisoned TLP. This
                           status bit is updated regardless of the state of
                           the Parity Error Response bit.

Signalled System Error     Set by a port that has reported an Uncorrectable
                           Error with ERR_FATAL or ERR_NONFATAL and the
                           SERR# Enable bit in the Command register was set.

Received Master Abort      Set by a Requester that receives a Completion with
                           status of UR (Unsupported Request). This is
                           considered analogous to a PCI master abort because
                           the target did not claim the transaction.

Received Target Abort      Set by a Requester that receives a Completion with
                           status of CA (Completer Abort). This is analogous
                           to a PCI target abort in that the target has had a
                           programming violation or internal error condition.

Signalled Target Abort     Set by the Completer that handled a request (either
                           posted or non-posted) as a Completer Abort. If it
                           was a non-posted request, then a Completion with a
                           Completion Status of CA is sent.

Master Data Parity Error   For Type 0 headers (e.g., Endpoints), this bit is
                           set if the Parity Error Response bit in the Command
                           register is set AND it either initiates a poisoned
                           request OR receives a poisoned completion.
                           For Type 1 headers (e.g., Switches and Root Ports),
                           this bit is set if the Parity Error Response bit in
                           the Command register is set AND it either initiates
                           a poisoned request heading upstream OR receives a
                           poisoned completion heading downstream.
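
As an illustration of how PCI-compatible software might interpret these bits, the Python sketch below lists the error-related bits that are set in a 16-bit Status register value. The function name is hypothetical; the bit positions are those of the standard configuration header Status register.

```python
STATUS_ERROR_BITS = {
    8: "Master Data Parity Error",
    11: "Signalled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signalled System Error",
    15: "Detected Parity Error",
}


def decode_status_errors(status):
    """Return the names of the error-related bits set in a Status value."""
    return [name for bit, name in sorted(STATUS_ERROR_BITS.items())
            if status & (1 << bit)]
```

A value of A000h, for example, has bits 15 and 13 set, so it decodes to Received Master Abort and Detected Parity Error.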

Baseline Error Handling


The Baseline capability requires the use of the PCI Express Capability
structure. These registers include error detection and handling fields that
provide finer granularity regarding the nature of an error and whether to
report it or not than what is possible with just PCI-compatible error handling.


Figure 15-15 on page 678 illustrates the PCI Express Capability structure. Some
of these registers provide support for:

• Enabling/disabling error reporting (Error Message Generation)
• Providing error status
• Providing link training status and initiating link retraining

Figure 15-15: PCI Express Capability Structure

Dword   Bits 31:16                          Bits 15:0
DW0     PCI Express Capabilities Register   Next Cap Pointer, PCI Express Cap ID
DW1     Device Capabilities Register
DW2     Device Status                       Device Control
DW3     Link Capabilities
DW4     Link Status                         Link Control
DW5     Slot Capabilities
DW6     Slot Status                         Slot Control
DW7     Root Capability                     Root Control
DW8     Root Status
DW9     Device Capabilities 2
DW10    Device Status 2                     Device Control 2
DW11    Link Capabilities 2
DW12    Link Status 2                       Link Control 2
DW13    Slot Capabilities 2
DW14    Slot Status 2                       Slot Control 2

The figure brackets groups of registers with the labels All Ports, Devices with
Links, Ports with Slots, Root Ports, and Root Complex Event Collector. DW9
through DW14 are marked "Gen2 and later devices only".

Enabling/Disabling Error Reporting


The Device Control registers allow software to enable generation of three
different Error Messages for four error events, and Device Status registers
allow it to see which error has been detected. The four error cases are:

• Correctable Errors
• Non-Fatal Errors
• Fatal Errors
• Unsupported Request Errors

Note that the only specific error identified here is the Unsupported Request.
Although an Unsupported Request is technically a subset of Non-Fatal errors,
and, when reported, is even signaled with an ERR_NONFATAL message, it has
its own enable and status bits. That's because during system enumeration
Unsupported Requests are going to happen (whenever an attempt is made to
read config space from a Function that doesn't actually exist in the system)
but they must not be reported as errors. The enumeration software may have
very limited error handling capability and if it was required to stop and
service an error it might fail. Therefore, the software doesn't want error
messages generated for the UR case during that time, but does want to know
about any other Non-Fatal errors that may be detected. (See the section titled
"Discovering the Presence or Absence of a Function" on page 105 for more
details on Unsupported Requests during enumeration.)

Table 15-5 on page 679 lists each error type and its associated error
classification.

Table 15-5: Default Classification of Errors

Classification & Severity   Name of Error                             Layer Detected

Correctable                 Receiver Error                            Physical
Correctable                 Bad TLP                                   Link
Correctable                 Bad DLLP                                  Link
Correctable                 Replay Number Rollover                    Link
Correctable                 Replay Timer Timeout                      Link
Correctable                 Advisory Non-Fatal Error                  Transaction
Correctable                 Corrected Internal Error
Correctable                 Header Log Overflow                       Transaction
Uncorrectable Non-Fatal     Poisoned TLP Received                     Transaction
Uncorrectable Non-Fatal     ECRC Check Failed                         Transaction
Uncorrectable Non-Fatal     Unsupported Request                       Transaction
Uncorrectable Non-Fatal     Completion Timeout                        Transaction
Uncorrectable Non-Fatal     Completer Abort                           Transaction
Uncorrectable Non-Fatal     Unexpected Completion                     Transaction
Uncorrectable Non-Fatal     ACS Violation                             Transaction
Uncorrectable Non-Fatal     MC Blocked TLP                            Transaction
Uncorrectable Non-Fatal     AtomicOps Egress Blocked                  Transaction
Uncorrectable Non-Fatal     TLP Prefix Blocked                        Transaction
Uncorrectable Fatal         Uncorrectable Internal Error (optional)
Uncorrectable Fatal         Surprise Down (optional)                  Link
Uncorrectable Fatal         Receiver Overflow (optional)              Transaction
Uncorrectable Fatal         DLL Protocol Error                        Link
Uncorrectable Fatal         Receiver Overflow                         Transaction
Uncorrectable Fatal         Flow Control Protocol Error               Transaction
Uncorrectable Fatal         Malformed TLP                             Transaction

Device Control Register. Setting bits in the Device Control Register, shown in
Figure 15-16 on page 681, enables sending the corresponding Error Messages to
report errors. Unsupported Request errors are specified as Non-Fatal errors and
are reported via a Non-Fatal Error Message, but only when the UR Reporting
Enable bit is set.

In order for a Function to actually send an error message, either the
corresponding enable bit in the Device Control register needs to be set or, for
Fatal and Non-Fatal errors, the SERR# Enable bit should be set. For
Uncorrectable Errors, if either the SERR# Enable bit in the Command Register is
set OR the corresponding enable bit in the Device Control register is set, the
appropriate error message will be sent (ERR_FATAL or ERR_NONFATAL).

For Correctable Errors, a Function will only send the ERR_COR message if the
Correctable Error Reporting Enable bit in the Device Control register is set.
There is no control to enable ERR_COR messages from the PCI-Compatible
mechanisms, which makes sense because in PCI, there was no concept of
correctable errors.
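
The enable logic just described reduces to a few bit tests. The Python sketch below is a simplified model with hypothetical names; it deliberately leaves out the separate Unsupported Request enable and the AER mask and severity registers.

```python
SERR_ENABLE = 1 << 8        # Command register, bit 8
CORRECTABLE_ENABLE = 1 << 0  # Device Control, bit 0
NONFATAL_ENABLE = 1 << 1     # Device Control, bit 1
FATAL_ENABLE = 1 << 2        # Device Control, bit 2


def error_message_to_send(kind, command, devctl):
    """Return the Error Message a Function sends for an error, or None.

    kind is "correctable", "nonfatal" or "fatal".
    """
    if kind == "correctable":
        # Only the Device Control enable applies; SERR# Enable has no effect.
        return "ERR_COR" if devctl & CORRECTABLE_ENABLE else None
    serr = bool(command & SERR_ENABLE)
    if kind == "nonfatal":
        return "ERR_NONFATAL" if serr or devctl & NONFATAL_ENABLE else None
    if kind == "fatal":
        return "ERR_FATAL" if serr or devctl & FATAL_ENABLE else None
    raise ValueError(f"unknown error kind: {kind}")
```

Note how SERR# Enable alone is enough to send either uncorrectable message, which is what lets PCI-compatible software turn on reporting without knowing about the PCI Express Capability structure.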

Figure 15-16: Device Control Register Fields Related to Error Handling

Bits     Field
15       Bridge Config. Retry Enable / Initiate Function-Level Reset
14:12    Max Read Request Size
11       Enable No Snoop
10       Aux Power PM Enable
9        Phantom Functions Enable
8        Extended Tag Field Enable
7:5      Max Payload Size
4        Enable Relaxed Ordering
3        Unsupported Request Reporting Enable
2        Fatal Error Reporting Enable
1        Non-Fatal Error Reporting Enable
0        Correctable Error Reporting Enable

Device Status Register. An error status bit is set in the Device Status
register, shown in Figure 15-17 on page 682, any time an error associated with
its classification is detected, regardless of the setting of the error
reporting enable bits in the Device Control Register. Because Unsupported
Request errors are considered Non-Fatal Errors, when these errors occur both
the Non-Fatal Error Detected status bit and the Unsupported Request Detected
status bit will be set. Like several other status bits, these are Sticky (their
values are not cleared by a reset event so they'll be available for diagnosing
problems even if a reset was needed to get the Link working well enough to
read the status).


Figure 15-17: Device Status Register Bit Fields Related to Error Handling

Bits     Field
15:6     RsvdZ
5        Transactions Pending
4        Aux Power Detected
3        Unsupported Request Detected
2        Fatal Error Detected
1        Non-Fatal Error Detected
0        Correctable Error Detected

Root's Response to Error Message


When an Error Message is received by the Root, the action it takes is
determined in part by the settings in the Root Control Register. Figure 15-18
depicts this register and highlights the three fields that specify whether a
received Error Message should be reported as a System Error. In some x86-based
systems, it's likely that an NMI (Non-Maskable Interrupt) will be signaled if
the error is enabled to trigger a System Error.

Other options for reporting Error Messages are not configurable via standard
registers. The most likely scenario is that an interrupt will be signaled to the
processor that will call an Error Handler, which may log the error and attempt
to clear the problem.


Figure 15-18: Root Control Register

Bits     Field
15:5     RsvdP
4        CRS Software Visibility Enable
3        PME Interrupt Enable
2        System Error on Fatal Error Enable
1        System Error on Non-Fatal Error Enable
0        System Error on Correctable Error Enable
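
A minimal Python sketch of how the three System Error enables in the Root Control register gate a received Error Message follows. The function name is hypothetical, and what the platform does once a System Error is signaled (an NMI, for instance) is outside the sketch.

```python
SYSTEM_ERROR_ENABLE = {
    "ERR_COR": 1 << 0,       # System Error on Correctable Error Enable
    "ERR_NONFATAL": 1 << 1,  # System Error on Non-Fatal Error Enable
    "ERR_FATAL": 1 << 2,     # System Error on Fatal Error Enable
}


def reports_system_error(message, root_control):
    """True if a received Error Message should be reported as a System Error."""
    return bool(root_control & SYSTEM_ERROR_ENABLE[message])
```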

Link Errors
Link failures are typically detected in the Physical Layer and communicated to
the Data Link Layer. For a downstream device, if the link has incurred a Fatal
error and is not operating correctly, it can't report the error to the host.
For these cases, the error must be reported by the upstream device. If software
can isolate errors to a given link, one step in handling an uncorrectable error
(or to prevent future uncorrectable errors) is to retrain the Link. The Link
Control Register includes a bit that allows software to force the Link to
retrain, as shown in Figure 15-19 on page 684. If that solves the problem,
operation resumes with little downtime.


Figure 15-19: Link Control Register, Force Link Retraining

Bits     Field
15:12    RsvdP
11       Link Autonomous Bandwidth Interrupt Enable
10       Link Bandwidth Management Interrupt Enable
9        Hardware Autonomous Width Disable
8        Enable Clock Power Management
7        Extended Synch
6        Common Clock Configuration
5        Retrain Link
4        Link Disable
3        Read Completion Boundary Control
2        RsvdP
1:0      Active State PM Control

Having once requested retraining, software can poll the Link Training bit in the
Link Status Register to see when training has completed. Figure 15-20
highlights this status bit. When this bit is 1b, the Link is still in the
retraining process (or has yet to start retraining). Hardware will clear this
bit once the Physical Layer reports the Link as active, meaning the training
process has completed successfully.
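
The retrain-and-poll sequence can be sketched as follows. This is a toy model, not driver code: read16 and write16 stand in for whatever configuration access routines the platform provides, FakePort is a made-up stand-in for a real port, and the timeout values are arbitrary. The register offsets (10h and 12h from the start of the PCI Express Capability structure) follow the dword layout of the capability structure.

```python
import time

LINK_CONTROL_OFFSET = 0x10   # low half of DW4
LINK_STATUS_OFFSET = 0x12    # high half of DW4
RETRAIN_LINK = 1 << 5        # Link Control, bit 5
LINK_TRAINING = 1 << 11      # Link Status, bit 11


def retrain_link(read16, write16, cap_base, timeout_s=1.0, poll_s=0.001):
    """Request retraining, then poll Link Training until it clears."""
    ctrl = read16(cap_base + LINK_CONTROL_OFFSET)
    write16(cap_base + LINK_CONTROL_OFFSET, ctrl | RETRAIN_LINK)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not read16(cap_base + LINK_STATUS_OFFSET) & LINK_TRAINING:
            return True   # training completed
        time.sleep(poll_s)
    return False          # still training when the timeout expired


class FakePort:
    """Toy config space: training 'finishes' after a few status reads."""

    def __init__(self, cap_base, polls_until_done=3):
        self.cap_base = cap_base
        self.polls_until_done = polls_until_done
        self.link_control = 0
        self.remaining = 0

    def read16(self, offset):
        if offset == self.cap_base + LINK_CONTROL_OFFSET:
            return self.link_control
        if offset == self.cap_base + LINK_STATUS_OFFSET and self.remaining > 0:
            self.remaining -= 1
            return LINK_TRAINING
        return 0

    def write16(self, offset, value):
        if offset == self.cap_base + LINK_CONTROL_OFFSET:
            if value & RETRAIN_LINK:
                self.remaining = self.polls_until_done
            # Retrain Link is modeled as always reading back as 0.
            self.link_control = value & ~RETRAIN_LINK
```

Usage would look like `port = FakePort(0x40)` followed by `retrain_link(port.read16, port.write16, 0x40)`. The timeout matters in practice because a Link that never finishes training would otherwise hang the caller.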


Figure 15-20: Link Training Status in the Link Status Register

Bits     Field
15       Link Autonomous Bandwidth Status
14       Link Bandwidth Management Status
13       Data Link Layer Link Active
12       Slot Clock Configuration
11       Link Training
10       Undefined
9:4      Negotiated Link Width
3:0      Current Link Speed

Advanced Error Reporting (AER)


The Advanced Error Reporting Structure illustrated in Figure 15-21 on page 686
allows for much more sophisticated error handling. These registers provide
several additional features:

• Better granularity in logging the actual type of error that occurred
• Control to specify the severity of each uncorrectable error type
• Support for logging the header of packets that had errors
• Standardizing control for the Root to report received Error Messages with
  an interrupt
• Identifying the source of the error in the PCIe topology
• Ability to mask reporting individual types of errors


Figure 15-21: Advanced Error Capability Structure

Offset   Register
00h      PCIe Extended Capability Register
04h      Uncorrectable Error Status Register
08h      Uncorrectable Error Mask Register
0Ch      Uncorrectable Error Severity Register
10h      Correctable Error Status Register
14h      Correctable Error Mask Register
18h      Advanced Error Capability and Control Register
1Ch      Header Log Register
2Ch      Root Error Command
30h      Root Error Status
34h      Uncorr. Error Source ID (upper half), Corr. Error Source ID (lower half)
38h      TLP Prefix Log Register

The registers at 2Ch through 34h apply to Root Ports and Root Complex Event
Collectors. The TLP Prefix Log Register applies to Functions that support TLP
Prefixes.

Advanced Error Capability and Control


Let's begin our discussion of AER by looking at the Advanced Error Capability
and Control register. End-to-End CRC (ECRC) generation and checking
requires AER, and this register, shown in Figure 15-22 on page 687, reports
whether this device supports it. If so, configuration software can enable (and
force) its use by setting the appropriate bits.

The five low-order bits of this register contain the First Error Pointer, set by
hardware when the Uncorrectable Error status bits are updated. There are 32
status bits and the First Error Pointer indicates which of the unmasked
Uncorrectable Errors was detected first, meaning which status bit was set when
all the other status bits were still 0. The first error is the most interesting
because the others may have been caused by the first one.

Figure 15-22: The Advanced Error Capability and Control Register

Bits     Field
31:12    RsvdP
11       TLP Prefix Log Present (ROS)
10       Multiple Header Recording Enable (RWS)
9        Multiple Header Recording Capable (RO)
8        ECRC Check Enable (RWS)
7        ECRC Check Capable (RO)
6        ECRC Generation Enable (RWS)
5        ECRC Generation Capable (RO)
4:0      First Error Pointer (ROS)

Beginning with the 2.1 spec revision, this capability was enhanced to allow
tracking multiple errors. For that reason, if multiple error status bits have
been set and cleared, the meaning really becomes more like an "Oldest Error
Pointer" instead. The pointer is updated by hardware when the corresponding
status bit is cleared by software, at which time it points to whichever error
was detected next (see Figure 15-25 on page 691 for the list of uncorrectable
errors). Interestingly, the next error may be the same one again if that error
had been detected multiple times, with the result that the updated pointer
still indicates the same value.

Since multiple errors can be recorded in the Uncorrectable Status register, it
would be very helpful to store multiple headers, too. Hardware must be
designed to log at least one header, but is allowed to support more. If it does,
the Multiple Header Recording Capable bit will be set and the Multiple Header
Recording Enable bit can be used to enable storing more than one. Whenever
the First Error Pointer indicates a status bit position that is not set or is
not implemented, it means there are no more uncorrectable errors to service.
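
The rule for interpreting the First Error Pointer can be sketched in Python. The function name and the implemented_mask parameter are hypothetical; the mask stands for whichever Uncorrectable Error Status bits a given device actually implements.

```python
def first_uncorrectable_error(cap_ctrl, uncorr_status,
                              implemented_mask=0xFFFFFFFF):
    """Return the status bit number named by the First Error Pointer.

    Returns None when the pointer names a bit that is not set or not
    implemented, meaning there are no more uncorrectable errors to service.
    """
    pointer = cap_ctrl & 0x1F   # five low-order bits of the register
    bit = 1 << pointer
    if (uncorr_status & bit) and (implemented_mask & bit):
        return pointer
    return None
```

An error handler would typically service the error this returns, clear its status bit, and then call the function again with freshly read register values until it returns None.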


The last bit in this register, TLP Prefix Log Present, indicates whether the TLP
Prefix Log registers contain valid information for the uncorrectable error
indicated by the First Error Pointer.

The fields in this register and the other AER registers have various
characteristics, which are abbreviated as follows:

• RO: Read Only, set by hardware
• ROS: Read Only and Sticky (see the next section on sticky bits)
• RsvdP: Reserved and Preserved. These bits must not be used for any
  purpose, but software must be careful to maintain whatever values they
  contain.
• RsvdZ: Reserved and Zero. Bits that must not be used for any purpose and
  must always be written to zeros.
• RWS: Readable, Writeable and Sticky
• RW1CS: Readable, Write 1 to Clear, and Sticky

Handling Sticky Bits


Several AER register fields employ sticky bits, which means that a reset won't
clear their contents. All other register fields are forced to default values on
a reset, but these are not. This is a good idea because a Link may encounter a
failure that can't be cleared without a reset. If the problem is in the
downstream device of the failed Link, its register contents are unavailable
until the Link is working again, which the reset will accomplish. But if the
registers were cleared by the reset then the information is lost. To solve this
problem, sticky bits keep error status information available through a reset.
Specifically, sticky bits will survive an FLR (Function-Level Reset), a Hot
Reset, and a Warm Reset because power is available to keep them active. They
may even survive a Cold Reset if a secondary power source like Vaux is
available to keep them active when the main power is shut off.

Advanced Correctable Error Handling


Advanced Error Reporting provides the ability to record which specific correctable errors have been detected. These errors can be used to initiate a Correctable Error Message to the host system. Although system operation continues normally, reporting correctable errors can be useful because it allows system software to see which components are having trouble and to predict whether they may fail completely in the future.


Advanced Correctable Error Status


Correctable errors will automatically set the corresponding bit in the Advanced Correctable Error Status register, shown in Figure 15-23 on page 689, regardless of whether the error is reported with an Error Message. These bits are cleared by software writing a 1 to the bit position, hence the designation RW1CS.

Figure 15-23: Advanced Correctable Error Status Register

Bits     Field
31:16    RsvdZ
15       Header Log Overflow Status
14       Corrected Internal Error Status
13       Advisory Non-Fatal Error Status
12       Replay Timer Timeout Status
11:9     RsvdZ
8        REPLAY_NUM Rollover Status
7        Bad DLLP Status
6        Bad TLP Status
5:1      RsvdZ
0        Receiver Error Status
Note: all bits designated RW1CS

• Receiver Error (optional): Physical Layer detected an error in the incoming packet. The packet is discarded at the Physical Layer, any buffer space allocated to it is released, and the Link Layer is informed that a receive error occurred.
• Bad TLP: Data Link Layer detected a packet with a bad LCRC, an out-of-sequence Sequence Number or an incorrectly nullified packet. In each case, the Link Layer discards the packet and reports a Nak DLLP to the transmitter, triggering a TLP replay.
• Bad DLLP: Data Link Layer noticed an incoming DLLP had a 16-bit CRC failure so the packet is dropped. A subsequent DLLP of the same type is expected to make up for the information it contained.
• REPLAY_NUM Rollover: At the Data Link Layer, a set of TLPs have been sent without success (no Ack) four times in a row and this counter has rolled over back to zero. Hardware will automatically retrain the link in an attempt to clear the failure condition, then start the sequence again by replaying the contents of the Replay Buffer.
• Replay Timer Timeout: At the Data Link Layer, transmitted TLPs have not received an acknowledgement (Ack or Nak) within the timeout period. Hardware automatically replays all unacknowledged TLPs, meaning all packets in the Replay Buffer.
• Advisory Non-Fatal Error: Detection of these cases (see "Advisory Non-Fatal Errors" on page 670) is logged in the corresponding Uncorrectable Error Status register and as a correctable error here. It may also generate a Correctable Error Message, if enabled.
• Corrected Internal Error (optional): An error internal to the device was detected, but it was corrected or worked around without causing improper behavior.
• Header Log Overflow (optional): The maximum number of headers that can be stored in the header log has been reached. The number is just one if the Multiple Header Recording Enable bit is not set in the Advanced Error Capability and Control register.
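
If you read this register in software, turning the raw value into names is a simple table lookup. The following Python sketch uses the bit positions from Figure 15-23; the dictionary and function names are our own choices for illustration.

```python
# Bit positions of the Advanced Correctable Error Status register.
CORRECTABLE_ERROR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable_status(value):
    """List the names of all correctable errors flagged in a status value."""
    return [name for bit, name in CORRECTABLE_ERROR_BITS.items()
            if value & (1 << bit)]


# A hypothetical reading with bits 0 and 13 set.
print(decode_correctable_status(0x00002001))
# ['Receiver Error', 'Advisory Non-Fatal Error']
```

Because the bits are RW1CS, writing the same value back to the register would clear exactly the errors that were just reported.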

Advanced Correctable Error Masking


Correctable Error reporting is controlled collectively by the Correctable Error Reporting Enable bit in the Device Control register, but also individually by the Correctable Mask register, illustrated in Figure 15-24. The default state of the mask bits is cleared, meaning an ERR_COR message can be delivered when any correctable errors are detected if they've been enabled (meaning the Correctable Error Reporting Enable bit is set). However, software may choose to set bits in this mask register to prevent a message from being sent when those specific errors are detected.

Figure 15-24: Advanced Correctable Error Mask Register

Bits     Field
31:16    RsvdP
15       Header Log Overflow Mask
14       Corrected Internal Error Mask
13       Advisory Non-Fatal Error Mask
12       Replay Timer Timeout Mask
11:9     RsvdP
8        REPLAY_NUM Rollover Mask
7        Bad DLLP Mask
6        Bad TLP Mask
5:1      RsvdP
0        Receiver Error Mask
Note: all bits designated RWS
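
The two levels of control over correctable error reporting can be summarized in a few lines of Python. This is only a sketch under simplifying assumptions: the function name is hypothetical, and special cases such as Unsupported Request reporting are ignored.

```python
def sends_err_cor(error_bit, correctable_mask, reporting_enable):
    """Decide whether a detected correctable error produces an ERR_COR message.

    error_bit        : bit position of the error in the Correctable Error Status register
    correctable_mask : value of the Advanced Correctable Error Mask register
    reporting_enable : Correctable Error Reporting Enable from Device Control
    The status bit itself is set regardless of the result.
    """
    masked = bool(correctable_mask & (1 << error_bit))
    return reporting_enable and not masked


mask = 1 << 13                           # software masked Advisory Non-Fatal Error
print(sends_err_cor(6, mask, True))      # True:  Bad TLP is reported
print(sends_err_cor(13, mask, True))     # False: masked individually
print(sends_err_cor(6, mask, False))     # False: reporting disabled collectively
```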


Advanced Uncorrectable Error Handling


For uncorrectable errors, AER provides the ability to track which specific error has occurred, control whether it should be considered Fatal or Non-Fatal, and choose whether it will result in an Uncorrectable Error Message being sent to the Root.

Advanced Uncorrectable Error Status


When an uncorrectable error occurs, the corresponding bit in this register is automatically set by hardware (see Figure 15-25 on page 691) regardless of whether the error will be reported to the Root. If multiple errors occur, hardware will set the corresponding bit for each error and will record which one was first in the First Error Pointer field of the Advanced Error Capability and Control register. It may even happen that multiple instances of the same error are detected before the first one can be serviced. Hardware that is compliant with the 2.1 spec revision or later will be able to keep track of a design-specific number of those cases.

Figure 15-25: Advanced Uncorrectable Error Status Register

Bits     Field
31:26    RsvdZ
25       TLP Prefix Blocked Error Status
24       AtomicOp Egress Blocked Status
23       MC Blocked TLP Status
22       Uncorrectable Internal Error Status
21       ACS Violation Status
20       Unsupported Request Error Status
19       ECRC Error Status
18       Malformed TLP Status
17       Receiver Overflow Status
16       Unexpected Completion Status
15       Completer Abort Status
14       Completion Timeout Status
13       Flow Control Protocol Error Status
12       Poisoned TLP Status
11:6     RsvdZ
5        Surprise Down Error Status
4        Data Link Protocol Error Status
3:1      RsvdZ
0        Undefined
Note: all bits designated RW1CS

The following list describes each of the register bits from right to left:
• Undefined: Previously, this first bit represented a link training failure at the Physical Layer, but that meaning was removed with the 1.1 revision of the spec. Software must now ignore any value in this bit but may write any value to it. This information was no longer needed because bit 5, Surprise Down Error, now includes the same information in a broader meaning: the Link is not communicating at the Physical Layer.
• Data Link Protocol Errors: Caused by Data Link Layer protocol errors including the Ack/Nak retry mechanism. For example, a transmitter receives an Ack or Nak whose sequence number doesn't correspond to an unacknowledged TLP or to the ACKD_SEQ number.
• Surprise Down: If the Physical Layer reports LinkUp = 0b (Link is no longer communicating) unexpectedly, this will be seen as an error unless it was an allowed exception. For example, if the Link Disable bit has already been set, then it's expected that LinkUp will be cleared and this condition won't be an error. This bit is only valid for Downstream Ports, which makes sense because it won't be possible to read status from an Upstream Port if the Link isn't working.
• Poisoned TLP: A TLP was seen that had the EP bit set.
• Flow Control Protocol Error (optional): Errors associated with failures of the Flow Control mechanism. Example: receiver reports more than 2047 data credits.
• Completion Timeout: A Completion is not received within the required amount of time after a non-posted request was sent.
• Completer Abort (optional): Completer cannot fulfill a Request due to problems with the Request or failure of the Completer.
• Unexpected Completion: Requester receives a Completion that doesn't match any Requests that are awaiting a Completion.
• Receiver Overflow (optional): More TLPs have arrived than the Receive Buffer had room to accept, resulting in an overflow error.
• Malformed TLP: Caused by errors associated with a received TLP header (see "Malformed TLP" on page 666).
• ECRC Error (optional): Caused by an ECRC check failure at the Receiver.
• Unsupported Request Error: Completer does not support the Request. Request is correctly formed and had no other errors, but cannot be fulfilled by the Completer, perhaps because it's an invalid command for this device.
• ACS Violation: Access control error was seen in a received posted or non-posted request.
• Uncorrectable Internal Error: An internal error detected in the device could not be corrected or worked around by the hardware itself.
• MC Blocked TLP: A TLP designated for Multi-Cast routing was blocked. For example, an Egress Port can be programmed to block any MC hits that arrive with untranslated addresses (see "Routing Multicast TLPs" on page 896).
• AtomicOp Egress Blocked: Egress Ports of routing elements can be programmed to block AtomicOps from being forwarded to agents that shouldn't see them (see "AtomicOps" on page 897).
• TLP Prefix Blocked Error: Egress Ports of routing elements can be programmed not to forward TLPs containing End-to-End TLP Prefixes. If they then see one, they'll drop the TLP and report this error. For more on this, see "TPH (TLP Processing Hints)" on page 899.
Recall that the First Error Pointer in the Capability and Control Register indicates which unmasked uncorrectable error was the first to arrive since the pointer was last updated. Error handling software can read the pointer to find out which error to investigate first. As an example, if the pointer value is 18d, that means bit position 18 in the Uncorrectable Status register was first, which is a Malformed TLP. Once that error has been serviced, software writes a one to bit 18 in the status register to clear that event, which updates the First Error Pointer to the next-oldest error still pending.
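
The lookup described here can be sketched in Python as follows. The bit names come from Figure 15-25, and the First Error Pointer occupies the low five bits of the Advanced Error Capability and Control register. The function and dictionary names are illustrative assumptions.

```python
# Bit positions of the Advanced Uncorrectable Error Status register.
UNCORRECTABLE_ERROR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked Error",
}


def first_error(status, capability_and_control):
    """Return (bit, name) for the error the First Error Pointer indicates.

    Returns None when the pointer refers to a bit that is not set or not
    defined, meaning there is nothing left to service.
    """
    pointer = capability_and_control & 0x1F  # First Error Pointer, bits 4:0
    if pointer in UNCORRECTABLE_ERROR_BITS and status & (1 << pointer):
        return pointer, UNCORRECTABLE_ERROR_BITS[pointer]
    return None


# Status with bits 18 and 12 set, pointer value 12h (18 decimal).
print(first_error(0x00041000, 0x00000012))  # (18, 'Malformed TLP')
print(first_error(0x00000000, 0x00000012))  # None
```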

Selecting Uncorrectable Error Severity

Software can select whether or not uncorrectable errors should be considered Fatal in this register, allowing errors to be treated differently for different applications. For example, a Poisoned TLP will be a Non-Fatal condition by default, and is treated as an Advisory Non-Fatal error in some cases, as discussed earlier. But software can escalate it to Fatal by setting its severity bit to one and then it will no longer be an advisory case. The default severity values are illustrated in the individual bit fields of Figure 15-26 on page 694 (1 = Fatal, 0 = Non-Fatal). If they are enabled and not masked, those errors selected as Non-Fatal will cause an ERR_NONFATAL message to be sent to the Root Complex, and those selected as Fatal will cause an ERR_FATAL message.


Figure 15-26: Advanced Uncorrectable Error Severity Register

Bits     Default   Field
31:26    -         RsvdP
25       0         TLP Prefix Blocked Error Severity
24       0         AtomicOp Egress Blocked Severity
23       0         MC Blocked TLP Severity
22       1         Uncorrectable Internal Error Severity
21       0         ACS Violation Severity
20       0         Unsupported Request Error Severity
19       0         ECRC Error Severity
18       1         Malformed TLP Severity
17       1         Receiver Overflow Severity
16       0         Unexpected Completion Severity
15       0         Completer Abort Severity
14       0         Completion Timeout Severity
13       1         Flow Control Protocol Error Severity
12       0         Poisoned TLP Severity
11:6     -         RsvdP
5        1         Surprise Down Error Severity
4        1         Data Link Protocol Error Severity
3:1      -         RsvdP
0        x         Undefined
Note: all bits designated RWS
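
To see how severity, masking and the reporting enables combine, consider this simplified Python sketch. The default severity constant is built from the default values in Figure 15-26 (bit 0 is undefined and left at zero here). The function name is hypothetical, and the sketch deliberately leaves out the SERR# Enable bit, the Advisory Non-Fatal cases and Unsupported Request reporting.

```python
# Default severity: bits 22, 18, 17, 13, 5 and 4 are Fatal (1).
DEFAULT_SEVERITY = 0x00462030


def uncorrectable_message(error_bit, severity, mask, fatal_enable, nonfatal_enable):
    """Pick the message (if any) sent for an uncorrectable error.

    error_bit : bit position in the Uncorrectable Error Status register
    severity  : Advanced Uncorrectable Error Severity register value
    mask      : Advanced Uncorrectable Error Mask register value
    """
    if mask & (1 << error_bit):
        return None  # masked: status bit is still set, but no message
    if severity & (1 << error_bit):
        return "ERR_FATAL" if fatal_enable else None
    return "ERR_NONFATAL" if nonfatal_enable else None


POISONED_TLP = 12
print(uncorrectable_message(POISONED_TLP, DEFAULT_SEVERITY, 0, True, True))
# ERR_NONFATAL

escalated = DEFAULT_SEVERITY | (1 << POISONED_TLP)  # software escalates to Fatal
print(uncorrectable_message(POISONED_TLP, escalated, 0, True, True))
# ERR_FATAL
```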

Uncorrectable Error Masking


Software can mask out individual errors so they won't cause an error message to be sent by using the Advanced Uncorrectable Error Mask register, shown in Figure 15-27 on page 694. The default condition is to allow Error Messages for each type of error (all mask bits are cleared).

Figure 15-27: Advanced Uncorrectable Error Mask Register

Bits     Field
31:26    RsvdP
25       TLP Prefix Blocked Error Mask
24       AtomicOp Egress Blocked Mask
23       MC Blocked TLP Mask
22       Uncorrectable Internal Error Mask
21       ACS Violation Mask
20       Unsupported Request Error Mask
19       ECRC Error Mask
18       Malformed TLP Mask
17       Receiver Overflow Mask
16       Unexpected Completion Mask
15       Completer Abort Mask
14       Completion Timeout Mask
13       Flow Control Protocol Error Mask
12       Poisoned TLP Mask
11:6     RsvdP
5        Surprise Down Error Mask
4        Data Link Protocol Error Mask
3:1      RsvdP
0        Undefined
Note: all bits designated RWS


Header Logging

A 4DW portion of the Advanced Error Reporting structure is used for storing the header of a received TLP that incurs an unmasked, uncorrectable error. Since header logging is only useful when a TLP has been received with a problem that wasn't seen by the Physical or Data Link Layers, the number of possibilities is limited, as shown in Table 15-6 on page 695. As mentioned earlier, when the optional AER capability is implemented, hardware is required to be able to log at least one header, though it may support logging more.

When the First Error Pointer is valid, the header log contains the header for the corresponding error if it was caused by an incoming TLP. Updating the Uncorrectable Error Status register will cause the Header Log registers to also update to the next value in sequence, meaning the next uncorrectable error that was detected. Since the hardware can only track a limited number of headers, it's important that software service uncorrectable errors quickly enough to avoid running out of header space. If the header log capacity is reached, that's a correctable error in itself (Header Log Overflow). This could happen if the number of supported log registers is exceeded or if the Multiple Header Recording Enable bit is not set and the First Error Pointer is already valid when a new uncorrectable error is detected.

Table 15-6: Errors That Can Use Header Log Registers

Name of Error             Default Classification
Poisoned TLP Received     Uncorrectable, Non-Fatal
ECRC Check Failed         Uncorrectable, Non-Fatal
Unsupported Request       Uncorrectable, Non-Fatal
Completer Abort           Uncorrectable, Non-Fatal
Unexpected Completion     Uncorrectable, Non-Fatal
ACS Violation             Uncorrectable, Non-Fatal
Malformed TLP             Uncorrectable, Fatal


Root Complex Error Tracking and Reporting


The Root Complex is the target of all error Messages from devices in a PCIe topology. Errors received by the Root update status registers and may be reported to the host system if enabled to do so.

Root Complex Error Status Registers


When the Root receives an error Message, it sets status bits within the Root Error Status register (Figure 15-28 on page 697). This register indicates the type of error received and whether multiple errors of the same type have been received. Note that an error detected in the Root Port itself will set these status bits, too, as if the port had sent itself an error message. The status bits are:

• ERR_COR Received
• Multiple ERR_COR Received: received an ERR_COR message, or detected an unmasked Root Port correctable error with the ERR_COR Received bit already set.
• ERR_FATAL/NONFATAL Received
• Multiple ERR_FATAL/NONFATAL Received: received an ERR_FATAL or ERR_NONFATAL message or detected an unmasked Root Port uncorrectable error with the ERR_FATAL/NONFATAL Received bit already set.

It's possible for a system to implement separate software error handlers for Correctable, Non-Fatal, and Fatal errors, so this register includes bits to differentiate whether Uncorrectable errors were Fatal or Non-Fatal:

• If the first Uncorrectable Error Message received is Fatal, the "First Uncorrectable Fatal" bit is also set along with the "Fatal Error Message Received" bit.
• If the first Uncorrectable Error Message received is Non-Fatal, the "Non-fatal Error Message Received" bit is set. (If a subsequent Uncorrectable Error is Fatal, the "Fatal Error Message Received" bit will be set, but because the "First Uncorrectable Fatal" remains cleared, software knows that the first Uncorrectable Error was Non-Fatal).


Figure 15-28: Root Error Status Register

Bits     Attribute   Field
31:27    RO          Advanced Error Interrupt Message Number
26:7     RsvdZ       Reserved
6        RW1CS       Fatal Error Messages Received
5        RW1CS       Non-Fatal Error Messages Received
4        RW1CS       First Uncorrectable Fatal
3        RW1CS       Multiple ERR_FATAL/NONFATAL Received
2        RW1CS       ERR_FATAL/NONFATAL Received
1        RW1CS       Multiple ERR_COR Received
0        RW1CS       ERR_COR Received

Finally, an interrupt may have been enabled (in the Root Error Command register) to be sent to the host system as a result of detecting one of these events. To support that, the 5-bit Interrupt Message Number in this register supplies the MSI or MSI-X vector number to be used, and there are 32 possibilities. For MSI, the number is the offset from the base data pattern. For MSI-X, it represents the table entry to be used, and must be one of the first 32 even if the agent supports more than 32. This read-only value is set by hardware and must be automatically updated if the number of MSI messages assigned to the device changes.
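
A short Python sketch shows how an error handler might pull these fields apart. The bit positions follow Figure 15-28; the names of the dictionary and function are our own, chosen for illustration.

```python
# Flag bits of the Root Error Status register (bits 6:0).
ROOT_ERROR_STATUS_BITS = {
    0: "ERR_COR Received",
    1: "Multiple ERR_COR Received",
    2: "ERR_FATAL/NONFATAL Received",
    3: "Multiple ERR_FATAL/NONFATAL Received",
    4: "First Uncorrectable Fatal",
    5: "Non-Fatal Error Messages Received",
    6: "Fatal Error Messages Received",
}


def decode_root_error_status(value):
    """Return (list of flag names, Advanced Error Interrupt Message Number)."""
    flags = [name for bit, name in ROOT_ERROR_STATUS_BITS.items()
             if value & (1 << bit)]
    message_number = (value >> 27) & 0x1F  # bits 31:27
    return flags, message_number


flags, vector = decode_root_error_status(0x0800007C)
for name in flags:
    print(name)
print("Interrupt Message Number:", vector)  # 1
```

For the sample value 0800_007Ch, the decoded flags show that no ERR_COR has been received, that both Fatal and Non-Fatal messages have arrived, and that the first uncorrectable message was Fatal.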

Advanced Source ID Register


Software error handlers may need to read and clear status registers in the device that detected and reported the error. To facilitate this, the error Messages contain the ID (Bus:Dev:Func) of the first device reporting that error type. The Source ID register captures that ID from the Message for an incoming ERR_FATAL/NONFATAL message if the ERR_FATAL/NONFATAL Received bit isn't already set (meaning this is the first one). Similarly, the Source ID of the first received ERR_COR message is captured, too, as shown in Figure 15-29 on page 698.


Figure 15-29: Advanced Source ID Register

Bits     Attribute   Field
31:16    ROS         ERR_FATAL/NONFATAL Source ID
15:0     ROS         ERR_COR Source ID
ROS: Read-Only and Sticky
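
Splitting the Source ID register into its two Bus:Dev:Func values is straightforward, as this Python sketch shows. It assumes the conventional 8-bit Bus, 5-bit Device, 3-bit Function layout of an ID (a Function using ARI would interpret the low 8 bits differently), and the function names are hypothetical.

```python
def decode_bdf(source_id):
    """Split a 16-bit ID into (bus, device, function)."""
    bus = (source_id >> 8) & 0xFF
    device = (source_id >> 3) & 0x1F
    function = source_id & 0x7
    return bus, device, function


def decode_source_id(value):
    """Decode both halves of the 32-bit Source ID register."""
    return {
        "uncorrectable": decode_bdf((value >> 16) & 0xFFFF),  # ERR_FATAL/NONFATAL
        "correctable": decode_bdf(value & 0xFFFF),            # ERR_COR
    }


print(decode_source_id(0x05000000))
# {'uncorrectable': (5, 0, 0), 'correctable': (0, 0, 0)}
```

Remember that each half is only meaningful when the matching "Received" bit is set in the Root Error Status register.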

Root Error Command Register


The Root Complex has separate enable bits for each of the three error categories to control whether that error type will generate an interrupt to call an error handler, as shown in Figure 15-30 on page 698. The interrupt that is generated will be either an MSI or MSI-X, as discussed in "Root Complex Error Status Registers" on page 696. Once the interrupt is received, the called error handler would probably first read the Root Complex status registers to determine the nature of the error, and then go down to the source BDF of the error to read the standard status registers, as well as possibly device-specific registers, to determine what occurred and how it should be handled.

Figure 15-30: Advanced Root Error Command Register

Bits     Field
31:3     RsvdP
2        Fatal Error Reporting Enable
1        Non-Fatal Error Reporting Enable
0        Correctable Error Reporting Enable
Note: all bits designated RW

Summary of Error Logging and Reporting


The spec includes the flow chart in Figure 15-31 on page 699 that shows the actions taken by a Function when an error is detected. The part inside the dashed line highlights the items that are added when the optional AER capability structure is present.


Figure 15-31: Flow Chart of Error Handling Within a Function

[Flow chart not reproduced here. It starts at "Error Detected" and branches on the error type. The uncorrectable branch determines severity using the Uncorrectable Error Severity register, checks for the Advisory Non-Fatal case, sets the Fatal/NonFatal Error Detected bit (and, for a UR, the Unsupported Request Detected bit) in the Device Status register, sets the corresponding bit in the Uncorrectable Error Status register, checks the Uncorrectable Error Mask register, records the prefix and header as appropriate, and then sends ERR_FATAL or ERR_NONFATAL depending on severity if SERR or the matching error reporting enable is set. The correctable branch sets the Correctable Error Detected bit (and the Unsupported Request Detected bit for a UR) in the Device Status register, sets the corresponding bit in the Correctable Error Status register, checks the Correctable Error Mask register, handles the Advisory Non-Fatal logging steps, and sends ERR_COR if Correctable Reporting is enabled. The status register, mask register and prefix/header logging steps are the ones that apply to Advanced Error Reporting only.]

Example Flow of Software Error Investigation


Now that we know all the mechanisms defined in PCIe for detecting, logging and reporting errors, it is worthwhile to look at how software would find and use this information to determine how to handle a reported error.


This example is going to assume that both the originating Function and the Root Port upstream of it support AER. Without AER support, the standardized registers for error logging are very limited.

The system used for this example is shown in Figure 15-32 on page 701. The Root Port has a BDF of 0:28:0 and was enabled to generate an interrupt when it receives either an ERR_FATAL or ERR_NONFATAL message. We are going to follow the steps that error handling software would take to determine what errors have occurred, where they occurred and which packets they were detected in.

The error handling software has been called because of an interrupt from Root Port 0:28:0. The steps below are just an example, but illustrate the process of error handling software gathering error information.

1. Software knows it was Root Port 0:28:0 that called the error handler based on the interrupt vector used. Since MSI or MSI-X interrupts are used to report errors, each Root Port will have its own unique set of interrupt vectors.
2. The error handler reads the Root Error Status register of the AER structure on 0:28:0 to determine what types of error messages have been received by the Root Port. The value in that register is 0800_007Ch, which indicates that this Root Port has not received any ERR_COR messages, but has received both ERR_FATAL and ERR_NONFATAL messages, and the first uncorrectable error message that it received was an ERR_FATAL.
3. The next step is to determine which BDF beneath this Root Port sent the first uncorrectable error. Software then reads the Source ID register of the Root Port and finds the value 0500_0000h, which indicates that the source BDF of the first uncorrectable error was 5:0:0.
4. Now software knows that the first uncorrectable error received by Root Port 0:28:0 was a Fatal error that originated from BDF 5:0:0. With this information, software then goes and reads the Uncorrectable Error Status register on BDF 5:0:0 to see which specific uncorrectable errors have occurred on that BDF. The value returned from that read is 0004_1000h, which means that this BDF has detected at least one Malformed TLP and at least one Poisoned TLP. But what the error handler really cares about is which one occurred first, because that's the one that should be handled first.
5. To determine which of the multiple uncorrectable errors occurred first, software then reads the Advanced Error Capability and Control register of 5:0:0 and finds the value 0000_0012h, which has a First Error Pointer value of 12h, meaning that the first uncorrectable error was a Malformed TLP (bit 18d) and not the Poisoned TLP (bit 12d).


Figure 15-32: Error Investigation Example System

[Diagram not reproduced here. It shows the CPU and System Memory (DRAM) attached to the Root Complex, Root Port 0:28:0 (P2P), a Switch with Upstream Port 2:0:0 (P2P) and Downstream Ports 3:0:0 and 3:5:0 (P2P), and PCIe Endpoints 4:0:0 and 5:0:0 below the Switch. The contents of the AER Capability Structure shown for the Root Port and for each Endpoint are listed below.]

Register                                 Root Port 0:28:0   Endpoint 4:0:0   Endpoint 5:0:0
Extended Capability Header               0001_0001h         1401_0001h       1401_0001h
Uncorrectable Error Status               0000_0000h         0010_8000h       0004_1000h
Uncorrectable Error Mask                 0006_2011h         0000_0000h       0000_0000h
Uncorrectable Error Severity             0000_2000h         0016_2011h       0006_2011h
Correctable Error Status                 0000_2000h         0000_0040h       0000_0001h
Correctable Error Mask                   0000_0006h         0000_2000h       0000_2000h
Advanced Error Capability and Control    0000_0000h         0000_000Fh       0000_0012h
Header Log - 1st DW                      0000_0000h         0000_0080h       6000_8080h
Header Log - 2nd DW                      0000_0000h         0A00_0CFFh       0000_04FFh
Header Log - 3rd DW                      0000_0000h         FB80_1000h       FB80_1000h
Header Log - 4th DW                      0000_0000h         0000_0000h       0000_0001h
Root Error Command                       0000_0006h         -                -
Root Error Status                        0800_007Ch         -                -
Error Source ID                          0500_0000h         -                -


6. Now that the error handler knows that the first uncorrectable error at 5:0:0 was a Malformed TLP, it can check the Header Log register to see the header of the packet that was malformed, since this is one of the errors where a header is recorded. In reading the Header Log register it finds these four doublewords:
   6000_8080h - 1st DW
   0000_04FFh - 2nd DW
   FB80_1000h - 3rd DW
   0000_0001h - 4th DW
7. The evaluation of those 4 DWs identifies the malformed packet as: Memory Write, 4DW header, TC=0, TD=1, EP=0, Attr=0, AT=0, Length=80h (128 DWs or 512 bytes), Requester ID=0:0:0, Tag=4, Byte Enables=FFh, Address=1_FB80_1000h.
   The header of the packet all looks correct and every field uses valid encodings, so software must dig a little deeper to discover why this was treated as a Malformed TLP. In this example, let's assume that after further inspection of config space on 5:0:0, software discovers that the Max Payload Size enabled for this Function is 256 bytes, but this packet contained 512 bytes. This is a condition that will be treated as a Malformed TLP by the target device, in this case 5:0:0.
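
The header evaluation in the last two steps can be sketched in Python. This example decodes only the first two logged doublewords, which carry the fields needed for the payload size check, and leaves the address doublewords alone. The function names are hypothetical, the 2-bit Attr field reflects the original header layout (later spec revisions add a third Attr bit elsewhere in the header), and the 256-byte Max Payload Size is the value assumed in the example above.

```python
def decode_request_header(dw0, dw1):
    """Decode the first two doublewords of a logged request header."""
    return {
        "fmt": (dw0 >> 29) & 0x7,          # 011b = 4DW header with data
        "type": (dw0 >> 24) & 0x1F,        # 00000b = Memory request
        "tc": (dw0 >> 20) & 0x7,
        "td": (dw0 >> 15) & 0x1,           # TLP Digest (ECRC) present
        "ep": (dw0 >> 14) & 0x1,           # poisoned data
        "attr": (dw0 >> 12) & 0x3,
        "at": (dw0 >> 10) & 0x3,
        "length_dw": (dw0 & 0x3FF) or 1024,  # a Length of 0 encodes 1024 DWs
        "requester_id": (dw1 >> 16) & 0xFFFF,
        "tag": (dw1 >> 8) & 0xFF,
        "last_be": (dw1 >> 4) & 0xF,
        "first_be": dw1 & 0xF,
    }


def payload_exceeds_mps(header, max_payload_bytes):
    """True if a TLP with data carries more bytes than Max Payload Size allows."""
    has_data = bool(header["fmt"] & 0b010)
    return has_data and header["length_dw"] * 4 > max_payload_bytes


header = decode_request_header(0x60008080, 0x000004FF)
print(header["length_dw"] * 4)           # 512 bytes of payload
print(payload_exceeds_mps(header, 256))  # True: treated as a Malformed TLP
```

A check like this is the kind of "digging deeper" the handler has to do, because every individual field in the header is legal on its own.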

If you would like to verify your knowledge of this error investigation process, go ahead and evaluate what the first uncorrectable error detected on 4:0:0 was.

If you're feeling adventurous and would like to check out this type of info on a real system, say your desktop or laptop, you can do so by downloading the MindShare Arbor software (www.mindshare.com/arbor). You can run this on an x86-based machine and it will scan your system and display every visible PCI-compatible device with its configuration space decoded for easy interpretation.


16 Power Management

The Previous Chapter

The previous chapter discusses error types that occur in a PCIe Port or Link, how they are detected, reported, and options for handling them. Since PCIe is designed to be backward compatible with PCI error reporting, a review of the PCI approach to error handling is included as background information. Then we focus on PCIe error handling of correctable, non-fatal and fatal errors.

This Chapter

This chapter provides an overall context for the discussion of system power management and a detailed description of PCIe power management, which is compatible with the PCI Bus PM Interface Spec and the Advanced Configuration and Power Interface (ACPI). PCIe defines extensions to the PCI-PM spec that focus primarily on Link Power and event management. An overview of the OnNow Initiative, ACPI, and the involvement of the Windows OS is also provided.

The Next Chapter


The next chapter details the different ways that PCIe Functions can generate interrupts. The old PCI model used pins for this, but sideband signals are undesirable in a serial model so support for the in-band MSI (Message Signaled Interrupts) mechanism was made mandatory. The PCI INTx# pin operation can still be emulated in support of a legacy system using PCIe INTx messages. Both the PCI legacy INTx# method and the newer versions of MSI/MSI-X are described.


Introduction

PCI Express power management (PM) defines four major areas of support:

• PCI-Compatible PM. PCIe power management is hardware and software compatible with the PCI-PM and ACPI specs. This support requires that all Functions include the PCI Power Management Capability registers, allowing software to transition a Function between PM states under software control through the use of Configuration requests. This was modified in the 2.1 spec revision with the addition of Dynamic Power Allocation (DPA), another set of registers that added several sub-states to the D0 power state to give software a finer-grained PM mechanism.
• Native PCIe Extensions. These define autonomous, hardware-based Active State Power Management (ASPM) for the Link, as well as mechanisms for waking the system, a Message transaction to report Power Management Events (PME), and a method for calculating and reporting the low-power-to-active-state latency.
• Bandwidth Management. The 2.1 spec revision added the ability for hardware to automatically change either the Link width or Link data rate or both to improve power consumption. This allows high performance when needed and keeps power usage low when lower performance is acceptable. Even though Bandwidth Management is considered a Power Management topic, we describe this capability in the section "Dynamic Bandwidth Changes" on page 618 in the "Link Initialization & Training" chapter because it involves the LTSSM.
• Event Timing Optimization. Peripheral devices that initiate bus master events or interrupts without regard to the system power state cause other system components to stay in high power states to service them, resulting in higher power consumption than would be necessary. This shortcoming was corrected in the 2.1 spec by adding two new mechanisms: Optimized Buffer Flush and Fill (OBFF), which lets the system inform peripherals about the current system power state, and Latency Tolerance Reporting (LTR), which allows devices to report the service delay they can tolerate at the moment.

This chapter is segmented into several major sections:

1. The first part is a primer on power management in general and covers the role of system software in controlling power management features. This discussion only considers the Windows Operating System perspective since it's the most common one for PCs, and other OSs are not described.
2. The second section, "Function Power Management" on page 713, discusses the method for putting Functions into their low-power device states using the PCI-PM capability registers. Note that some of the register definitions are modified or unused by PCIe Functions.
3. "Active State Power Management (ASPM)" on page 735 describes the hardware-based autonomous Link power management. Software determines which level of ASPM to enable for the environment, possibly by reading the recovery latency values that will be incurred for that Function, but after that the timing of the power transitions is controlled by hardware. Software doesn't control the transitions and is unable to see which power state the Link is in.
4. "Software Initiated Link Power Management" on page 760 discusses the Link power management that is forced when software changes the power state of a device.
5. "Link Wake Protocol and PME Generation" on page 768 describes how Devices may request that software return them to the active state so they can service an event. When power has been removed from a Device, auxiliary power must be present if it is to monitor events and signal a Wakeup to the system to get power restored and reactivate the Link.
6. Finally, event-timing features are described, including OBFF and LTR.

Power Management Primer


The PCI Bus PM Interface spec describes the power management registers required for PCIe. These permit the OS to manage the power environment of a Function directly. Rather than dive into a detailed description, let's start by describing where this capability fits in the overall context of the system.

Basics of PCI PM

This section provides an overview of how a Windows OS interacts with other major software and hardware elements to manage the power usage of individual devices and the system as a whole. Table 16-1 on page 706 introduces the major elements involved in this process and provides a very basic description of how they relate to each other. It should be noted that neither the PCI Power Management spec nor the ACPI spec dictate the PM policies that the OS uses. They do, however, define the registers (and some data structures) that are used to control the power usage of a Function.


Table 16-1: Major Software/Hardware Elements Involved In PC PM

Element  Responsibility

OS  Directs overall system power management by sending requests to the ACPI Driver, device driver, and the PCI Express Bus Driver. Applications that are power-conservation-aware interact with the OS to accomplish device power management.

ACPI Driver  Manages configuration, power management, and thermal control of embedded system devices that don't adhere to an industry-standard spec. Examples of this include chipset-specific registers, system-board-specific registers to control power planes, etc. The PM registers within PCIe Functions (embedded or otherwise) are defined by the PCI PM spec and are therefore not managed by the ACPI driver, but rather by the PCI Express Bus Driver (see entry in this table).

Device Driver
• The Class driver can work with any device that falls within the Class of devices that it was written to control. The fact that it's not written for a specific vendor means that it doesn't have bit-level knowledge of the device's interface. When it needs to issue a command to or check the status of the device, it issues a request to the Miniport driver supplied by the vendor of the specific device.
• The device driver also doesn't understand device characteristics that are peculiar to a specific bus implementation of that device type. As an example, it won't understand a PCIe Function's configuration register set. The PCI Express Bus Driver is the one to communicate with those registers.
• When it receives requests from the OS to control the power state of a PCIe device, it passes the request to the PCI Express Bus Driver.
• When a request to power down its device is received from the OS, the device driver saves the contents of its associated Function's device-specific registers (in other words, a context save) and then passes the request to the PCI Express Bus Driver to change the power state of the device.
• Conversely, when a request to repower the device is received, the device driver passes the request to the PCI Express Bus Driver to change the power state of the device. After the PCI Express Bus Driver has repowered the device, the device driver then restores the context to the Function's device-specific registers.

Miniport Driver  Supplied by the vendor of a device, it receives requests from the Class driver and converts them into the proper series of accesses to the device's register set.


PCI Express Bus Driver  This driver is generic to all PCI Express-compliant devices. It manages their power states and configuration registers, but does not have knowledge of a Function's device-specific register set (that knowledge is possessed by the Miniport Driver that the device driver uses to communicate with the device's register set). It receives requests from the device driver to change the state of the device's power management logic. For example:
• When a request to power down the device is received, this driver is responsible for saving the context of the Function's PCI Express configuration registers. It then disables the ability of the device to act as a Requester or respond as a target and writes to the Function's PM registers to change its state.
• Conversely, when the device must be repowered, the PCI Express Bus Driver writes to the PCI Express Function's PM registers to change its state and then restores the Function's configuration registers to their original state.

PCI Express PM registers within each Function's configuration space  The location, format and usage of these registers is defined by the PCIe spec. The PCI Express Bus Driver understands this spec and therefore is the entity responsible for accessing a Function's PM registers when requested to do so by the Function's device driver.

System Board power plane and bus clock control logic  The implementation and control of this logic is typically system-board-design-specific and is therefore controlled by the ACPI Driver (under OS direction).

ACPI Spec Defines Overall PM


The ACPI (Advanced Configuration and Power Interface) spec was first written
several years ago as a joint effort by several companies to provide industry
standards for OSPM (OS-level Power Management) in compute platforms. Power
management at that time was being handled in proprietary ways on different
platforms and that made it difficult for vendors to coordinate their efforts. In
addition, platform-specific code wasn't always fully compatible with OS
operations or aware of all the system conditions or policy considerations. ACPI
helped in these areas by defining system power states, hardware registers and
software interactions to accomplish OS-based power management. A detailed
description of ACPI is beyond the scope of this book, but an introduction to the
concepts and terminology will be helpful.


System PM States
Table 16-2 on page 708 defines the possible states of the overall system with
reference to power consumption. The "Working", "Sleep", and "Soft Off" states
are defined in the OnNow Design Initiative documents.

Table 16-2: System PM States as Defined by the OnNow Design Initiative

Power State  Description

Working (G0/S0)  The system is fully operational.

Sleeping (G1)  The system appears to be off and power consumption has been reduced. The amount of time it takes to return to the "Working" state grows with the selected level of power conservation.
• S1: caches flushed, CPU halted.
• S2: same as S1 except that now the CPU is powered off. Not commonly used because it's not much better than S3.
• S3 (also called "Suspend to RAM" or "Standby"): This is the same as S2 except that the system context is saved in memory and more of the system is shut down. When the system wakes up, the CPU begins the full boot process but finds flags set in the CMOS memory that direct it to reload the context from RAM instead, and thus program execution can be resumed very quickly.
• S4 (also called "Suspend to Disk" or "Hibernate"): Similar to S3, except that now the system copies the system context to disk, and then removes power from the system, including main memory. This gives better power savings but the restart time will be longer because the context must be restored from the disk before resuming program execution.

Soft Off (G2/S5)  The system appears to be off and power consumption is minimal. It requires a full reboot to return to the "Working" state because the contents of memory have been lost, but there is still some power available to do the wakeup, such as by pressing the Power button on the system.

Mechanical Off (G3)  The system has been disconnected from all power sources and no power is available.


Device PM States
ACPI also defines the PM states at the device level, which are listed in
Table 16-3 on page 709. Table 16-4 on page 710 presents the same information in
a slightly different form. The registers that support these device states must be
implemented for PCIe devices.

Table 16-3: OnNow Definition of Device-Level PM States

State  Description

D0  Mandatory. Device is fully operational and uses full power from the system. The 2.1 spec revision added another set of registers to support 32 substates under D0, referred to as Dynamic Power Allocation registers.

D1  Optional. Low-power state in which device context may or may not be lost. No definition for this state is given, but it would represent a lower power state than D0 and a higher one than D2.

D2  Optional. Presumably a lower power state than D1 that attains greater power savings, but would incur a longer recovery delay and may cause the Device to lose some context.

D3  Mandatory. Device is prepared for loss of power and context may be lost whether the power actually goes off or not. Recovery time will be longer than for D2, but power can be removed from the device gracefully in this state.

Definition of Device Context


General. During normal operation, the operational state of a Device is constantly changing. A device driver may write or read its registers, or a local processor on the Device may execute code that affects its interaction with the system. The state of the device at a given instant in time includes:

• The contents of its configuration registers.
• The state of its local memory and IO registers.
• If it contains a processor, then the current program pointer and contents of its other registers would be included.

This state information is referred to as the device context. Some or all of this may be lost if the Device PM state is changed to a more aggressive level. If the context information is not maintained, the Device won't operate correctly when it returns to the D0 (fully operational) state.


PME Context. If the OS enables a modem to wake the system for an incoming call and then powers down the system, the Device wakeup context will need to be retained locally during that time. The chipset retains enough power to allow it to monitor for these events. To support this feature, a PCIe modem must implement configuration registers including:

• PME Message capability.
• PME enable/disable control bit.
• PME status bit indicating whether the device has sent a PME message.
• One or more device-specific control bits that selectively enable or disable various device-specific events that can cause the device to send a PME message.
• Corresponding device-specific status bits that indicate why the device issued a PME message.

Device-Class-Specific PM Specs
Default Device Class Spec. As mentioned earlier, ACPI gives four possible device power states (D0 through D3). It also defines the minimum PM states that all device types must implement, as listed in Table 16-4 on page 710.

Table 16-4: Default Device Class PM States

State  Description

D0  Device is on, is running at full power, and is fully operational.

D1  This optional state is only defined as being lower power than D0. It is not commonly used.

D2  This optional state is only defined as being lower power than D1. It is not commonly used.

D3  Device consumes the minimum possible power and main power may be turned off. The only requirement is that, while power is still on, the device must be able to service a configuration command to re-enter D0. Power can be removed from the device in this state, and the device will experience a hardware reset when power is restored.


Device-Class-Specific PM Specs. Above and beyond the power states mandated by the Default Device Class Spec, certain device classes may require the intermediate power states (D1 and/or D2) or exhibit certain common characteristics in a particular power state.

The rules associated with a particular device class are found in the Device Class Power Management Specs available on Microsoft's Hardware Developers website. For example, Device Class Power Management Specs exist for the following classes:

• Audio
• Communications
• Display
• Input
• Network
• PC Card
• Storage

Power Management Policy Owner


A Device's PM policy owner is defined as the software module that makes decisions regarding the PM state of a device. In a Windows environment, the policy owner is the class-specific driver associated with devices of that class.

PCI Express Power Management vs. ACPI


PCI Express Bus Driver Accesses PM Registers
As indicated in Table 16-1 on page 706 and Figure 16-1 on page 712, the PCI Express Bus Driver understands the location, format and usage of the PM configuration registers. It's called when the OS needs to change the power state of a PCIe device or determine its status and capabilities. Other examples include:

• The IEEE 1394 Bus Driver, which understands how to use the PM registers defined in the 1394 Power Management spec.
• The USB Bus Driver, which understands how to use the PM registers defined in the USB Power Management spec.


ACPI Driver Controls Non-Standard Embedded Devices


There are devices embedded on the system board whose register sets do not
adhere to any particular industry-standard spec. At boot time, the BIOS reports
these devices to the OS via the ACPI tables, also referred to as the namespace.
When the OS needs to communicate with any of these devices, it calls the ACPI
Driver, which executes a handler called a Control Method associated with the
device. The handler is also found in the ACPI tables and is written by the
platform designer using a special interpretive language called ACPI Source
Language, or ASL. The ASL code is then compiled into ACPI Machine Language, or
AML. Note that AML is not a processor-specific machine language. It's a
tokenized (i.e., compressed) version of the ASL source code. An ACPI Driver
incorporates an AML token interpreter that allows it to execute a Control Method.

Figure 16-1: Relationship of OS, Device Drivers, Bus Driver, PCI Express Registers, and ACPI

[Figure: block diagram. The Microsoft OS sits above the Windows Device Driver and the ACPI Driver (written by Microsoft to the ACPI spec); the interfaces between these layers are defined by Microsoft. The Device Driver calls the PCIe Bus Driver (written by Microsoft to the OS, PCIe, and PCI PM specs), which accesses each PCIe Function's Configuration Registers (register set defined by the PCIe spec) and PM Registers (register set defined by the PCI PM spec and extensions for PCIe). The ACPI Driver executes AML Control Methods (written by the system board designer to the ACPI and chip-specific specs) that access non-standard embedded system board devices (register set defined by the chip designer).]


Function Power Management


PCI Express Functions are required to support power management, and several
registers and related bit fields must be implemented as discussed below.

The PM Capability Register Set


The PCI PM spec defines the Power Management Capability configuration registers. These registers were optional for PCI, but required for PCIe, and are located in the PCI-compatible configuration space with a Capability ID of 01h. Software can perform the following sequence to locate these registers:

1. Bit 4 of the Function's Configuration Status register should be set, indicating that the Capabilities Pointer in the first byte of dword 13d of the Function's configuration Header is valid. Reading the Capabilities Pointer register gives the offset to the first of the Function's linked list of capability registers.
2. If the least significant byte of the dword at that offset contains Capability ID 01h (see Figure 16-2 on page 713), this is the PM register set. The byte immediately following the Capability ID byte is the Pointer to Next Capability field that gives the offset in configuration space of the next Capability (if there is one). A non-zero value is a valid pointer, while a value of 00h indicates the end of the linked list. A description of all the PM registers can be found in "Detailed Description of PCI-PM Registers" on page 724.
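
The two steps above can be sketched in a few lines of Python. This is only an illustration operating on a byte image of a Function's configuration space: the function names and the demo image are made up for this example, and the code assumes the standard Header layout (Status register at offset 06h, Capabilities Pointer at offset 34h, which is the first byte of dword 13). Masking the low two bits of each pointer reflects the dword alignment of capability structures.

```python
from typing import Optional

STATUS_OFFSET = 0x06            # Status register in the configuration Header
STATUS_CAP_LIST = 1 << 4        # bit 4: Capabilities Pointer is valid
CAP_POINTER_OFFSET = 0x34       # first byte of dword 13 (13 * 4 = 34h)
CAP_ID_PM = 0x01                # Capability ID of the PM register set


def find_pm_capability(cfg: bytes) -> Optional[int]:
    """Return the offset of the PM capability in cfg, or None if absent."""
    status = int.from_bytes(cfg[STATUS_OFFSET:STATUS_OFFSET + 2], "little")
    if not status & STATUS_CAP_LIST:
        return None                           # no capability list implemented
    offset = cfg[CAP_POINTER_OFFSET] & 0xFC   # capabilities are dword aligned
    seen = set()
    while offset and offset not in seen:      # 00h ends the list; guard loops
        seen.add(offset)
        if cfg[offset] == CAP_ID_PM:          # least significant byte is the ID
            return offset
        offset = cfg[offset + 1] & 0xFC       # Pointer to Next Capability
    return None


def build_demo_config() -> bytearray:
    """Build a made-up 256-byte config space holding two capabilities."""
    cfg = bytearray(256)
    cfg[STATUS_OFFSET] = STATUS_CAP_LIST        # low byte of Status register
    cfg[CAP_POINTER_OFFSET] = 0x40              # list starts at 40h
    cfg[0x40:0x42] = bytes([0x10, 0x50])        # some other capability -> 50h
    cfg[0x50:0x52] = bytes([CAP_ID_PM, 0x00])   # PM capability, end of list
    return cfg


demo_cfg = build_demo_config()
pm_offset = find_pm_capability(bytes(demo_cfg))
print(hex(pm_offset) if pm_offset is not None else "PM capability not found")
```

A real driver would read these bytes through the platform's configuration access mechanism rather than from a buffer; the walk itself is the same.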

Figure 16-2: PCI Power Management Capability Register Set

1st Dword   [31:16]  Power Management Capabilities (PMC)
            [15:8]   Pointer to Next Capability
            [7:0]    Capability ID (01h)
2nd Dword   [31:24]  Data Register
            [23:16]  Bridge Support Extensions (PMCSR_BSE)
            [15:0]   Control/Status Register (PMCSR)

Device PM States
Each PCI Express Function must support the full-on D0 state and the full-off D3
state, while D1 and D2 are optional. The sections that follow describe the
possible PM states.


D0 State - Full On
Mandatory. In this state, no power conservation is in effect and the device is fully operational. All PCIe Functions must support the D0 state and there are technically two substates: D0 Uninitialized and D0 Active. ASPM hardware control can change the Link power while the Device is in this state. Table 16-5 on page 714 summarizes the PM policies in the D0 state.

D0 Uninitialized. A Function enters D0 Uninitialized after a Fundamental Reset or, in some cases, when software transitions it from D3hot to D0. Usually, the registers are returned to their default state. In this state, the Function exhibits the following characteristics:

• It only responds to configuration transactions.
• Its Command register enable bits are all returned to their default states, meaning it cannot initiate transactions or act as the target of memory or IO transactions.

D0 Active. Once the Function has been configured and enabled by software, it is in the D0 Active state and is fully operational.

Table 16-5: D0 Power Management Policies

Link PM State | Function PM State | Registers or State that must be valid | Power | Actions permitted to Function | Actions permitted by Function
L0 | D0 uninitialized | PME context** | <10 W | PCI Express config transactions. | None
L0, L0s (required)*, L1 (optional)* | D0 active | all | full | Any PCI Express transaction. | Any transaction, interrupt, or PME.**
L2/L3 | D0 active | N/A***

*Active State Power Management
**If PME supported in this state.
***This combination of Bus/Function PM states not allowed.

Dynamic Power Allocation (DPA)


Optional. The 2.1 revision of the base spec added another optional capability
that defines 32 more substates for D0 and describes their characteristics. This
was intended to facilitate negotiation regarding power management between a


device driver, OS, and an executing application, partly because some Functions
don't have device drivers that handle PM well. One advantage of this model is
that the Device technically still remains in the D0 state and may therefore be
able to continue operating in a reduced capacity instead of going offline as
would be caused by a change to the D1 or lower state.

DPA registers only apply when the Device power state is D0 and aren't
applicable in states D1-D3. Up to 32 substates can be defined, and they must be
contiguously numbered from zero to the maximum value. Substate 0 is the initial
default value and represents the maximum power the Function is capable of
consuming. Software is not required to transition between substates in
sequential order or even wait until a previous transition is completed before
requesting another change in the substate. Consequently, when a Function has
completed a substate change it must check the configured substate and, if they
don't match, it must begin changing to the configured value. The registers to
support DPA, illustrated in Figure 16-3 on page 715, are found in the Enhanced
configuration space.

Figure 16-3: Dynamic Power Allocation Registers

Offset              Register (bits 31:0)
000h                PCIe Enhanced Capability Header
004h                DPA Capability Register
008h                DPA Latency Indicator Register
00Ch                DPA Control Register (upper half), DPA Status Register (lower half)
010h up to 02Ch     DPA Power Allocation Array (sized by number of substates)

The DPA Capability register, shown in Figure 16-4 on page 716, contains several
interesting values associated with the substates. The Substate_Max number
indicates how many substates are described, and the numbers must increment
contiguously from zero to that value. Two Transition Latency Values are given
and each substate will be associated with one or the other by the Latency
Indicator register, which contains one bit for each possible substate; if that
bit is set, Transition Latency Value 1 is used, otherwise Value 0 is used. The
latency value gives the maximum time required to transition into that substate
from any other

substate. The latencies are multiplied by the Transition Latency Units to give
the time in milliseconds. Similarly, the Power Allocation Scale value gives the
multiplier for the power used in each substate, expressed in watts. For each
defined substate, a 32-bit field in the DPA Power Allocation Array describes the
power used for that state. The first one of these is located at offset 010h, and
the rest are implemented in subsequent dwords.

Figure 16-4: DPA Capability Register

Bits     Field
31:24    Transition Latency Value 1 (Xlcy1)
23:16    Transition Latency Value 0 (Xlcy0)
15:14    RsvdZ
13:12    Power Allocation Scale (PAS)
11:10    RsvdZ
9:8      Transition Latency Unit (Tlunit)
7:5      RsvdZ
4:0      Substate_Max

All fields not reserved are read-only.
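
As a rough illustration of how software might interpret these fields, the Python sketch below splits a DPA Capability register value into its fields and computes the worst-case transition latency for one substate. The bit positions follow Figure 16-4. The class and function names are invented for this example, and the Tlunit encoding table (1 ms, 10 ms, 100 ms) is an assumption recalled from the base spec rather than something stated in this chapter, so it should be checked against the spec before being relied on.

```python
from dataclasses import dataclass

# ASSUMPTION: Tlunit encodings as recalled from the PCIe base spec
# (00b = 1 ms, 01b = 10 ms, 10b = 100 ms, 11b reserved). Verify before use.
TLUNIT_MS = {0b00: 1, 0b01: 10, 0b10: 100}


@dataclass
class DpaCapability:
    substate_max: int   # bits 4:0
    tlunit: int         # bits 9:8
    pas: int            # bits 13:12 (Power Allocation Scale, left encoded)
    xlcy0: int          # bits 23:16
    xlcy1: int          # bits 31:24


def decode_dpa_capability(reg: int) -> DpaCapability:
    """Extract the non-reserved fields of the DPA Capability register."""
    return DpaCapability(
        substate_max=reg & 0x1F,
        tlunit=(reg >> 8) & 0x3,
        pas=(reg >> 12) & 0x3,
        xlcy0=(reg >> 16) & 0xFF,
        xlcy1=(reg >> 24) & 0xFF,
    )


def transition_latency_ms(cap: DpaCapability, latency_indicator: int,
                          substate: int) -> int:
    """Maximum time to enter `substate` from any other substate."""
    if not 0 <= substate <= cap.substate_max:
        raise ValueError("substate is not implemented by this Function")
    # One Latency Indicator bit per substate selects Value 1 or Value 0.
    use_value1 = (latency_indicator >> substate) & 1
    value = cap.xlcy1 if use_value1 else cap.xlcy0
    return value * TLUNIT_MS[cap.tlunit]


# Made-up register contents: Xlcy1=20, Xlcy0=5, PAS=01b, Tlunit=01b, max=3.
demo_cap = decode_dpa_capability((20 << 24) | (5 << 16) | (1 << 12) | (1 << 8) | 3)
demo_indicator = 0b0100   # only substate 2 uses Transition Latency Value 1
print(transition_latency_ms(demo_cap, demo_indicator, 0))
print(transition_latency_ms(demo_cap, demo_indicator, 2))
```

The Power Allocation Scale field is decoded but deliberately left unscaled here, since this chapter does not list its multiplier encodings.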

The low-order five bits of the DPA Control register are written by software to
set a new substate, and the current substate can be read from the Status
register, as shown in Figure 16-5 on page 716. Notice that bit 8 of the Status
register indicates whether the use of DPA substates has been enabled, but it's
labeled as RW1C (Read, Write 1 to Clear), meaning software can clear this bit
but can't set it. DPA is enabled by default after a reset, and software would
need to disable it by writing a one to this bit if it did not intend to use DPA.

Figure 16-5: DPA Status Register

Bits     Field
15:9     RsvdZ
8        Substate Control Enabled (RW1C)
7:5      RsvdZ
4:0      Substate Status (RO)
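
The interaction between the Control and Status registers can be modeled with a small Python class. This is a toy model written for this discussion, not a description of any real hardware: the class and method names are hypothetical, and it captures only two behaviors described above, namely that a Function re-checks the configured substate when a transition completes, and that bit 8 of the Status register can be cleared (by writing a one) but never set by software.

```python
from typing import Optional

SUBSTATE_MASK = 0x1F               # low-order five bits of Control/Status
SUBSTATE_CONTROL_ENABLED = 1 << 8  # Status register bit 8, RW1C


class DpaFunctionModel:
    """Toy model of a Function's DPA Control and Status registers."""

    def __init__(self) -> None:
        self.configured = 0                   # substate requested by software
        self.current = 0                      # substate reported in Status
        self.in_flight: Optional[int] = None  # transition now in progress
        self.enabled = True                   # DPA is enabled after reset

    def write_control(self, value: int) -> None:
        """Software may request a new substate at any time, in any order."""
        self.configured = value & SUBSTATE_MASK
        if self.in_flight is None and self.configured != self.current:
            self.in_flight = self.configured

    def complete_transition(self) -> None:
        """Hardware finishes a transition, then re-checks the request."""
        if self.in_flight is None:
            return
        self.current = self.in_flight
        self.in_flight = None
        if self.current != self.configured:
            self.in_flight = self.configured   # start changing again

    def read_status(self) -> int:
        enabled_bit = SUBSTATE_CONTROL_ENABLED if self.enabled else 0
        return enabled_bit | self.current

    def write_status(self, value: int) -> None:
        """RW1C: writing one clears the enable bit; writing zero does nothing."""
        if value & SUBSTATE_CONTROL_ENABLED:
            self.enabled = False


fn = DpaFunctionModel()
fn.write_control(3)        # request substate 3
fn.write_control(1)        # change our mind before the first one completes
fn.complete_transition()   # Function arrives at 3, sees 1 is configured
status_mid = fn.read_status()
fn.complete_transition()   # Function arrives at 1
status_end = fn.read_status()
print(hex(status_mid), hex(status_end))
```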

D1 State - Light Sleep
Optional. Before going into this state, software must ensure that all outstanding
non-posted Requests have received their associated Completions. This can be
achieved by polling the Transactions Pending bit in the Device Status register of


the PCI Express Capability block; when the bit is cleared to zero, it's safe to proceed. In this light power conservation state the Function won't initiate Requests except PME Messages, if enabled. Other characteristics of the D1 state include:

• Link is forced to the L1 power state when the Device goes into the D1 state.
• Configuration and Message Requests are accepted in this state, but all other Requests must be handled as Unsupported Requests and all completions may optionally be handled as Unexpected Completions.
• If an error is caused by an incoming Request and reporting it is enabled, an Error Message may be sent while in this state. If a different type of error occurs (such as a Completion timeout), the message won't be sent until the Device is returned to the D0 state.
• The Function may reactivate the Link and send a PME message, if supported and enabled in this state, to notify software that the Function has experienced an event requiring that power be restored.
• The Function may or may not lose its context in this state. If it does and the device supports PME, it must at least maintain its PME context (see "PME Context" on page 710) while in this state.
• The Function must be returned to the D0 Active PM state in order to be fully operational.

Table 16-6 lists the PM policies while in the D1 state.

Table 16-6: D1 Power Management Policies

Link PM State | Function PM State | Registers or State that must be valid | Power | Actions permitted to Function | Actions permitted by Function
L1 | D1 | Device-class-specific registers and PME context.** | ≤ D0 uninitialized | Config Requests and Messages. Link transitions back to L0 to service the request. | PME Messages.** Though not typically permitted, they would require the Link to transition back to L0.
L2/L3 | D1 | N/A*

*This combination of Bus/Function PM states not allowed.
**If PME supported in this state.
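
The rule for incoming Requests in D1 (which the text repeats for D2) can be expressed as a tiny decision function. The Python sketch below is a simplified model only; the enum and function names are invented here, and it deliberately ignores incoming Completions, which the text says may optionally be flagged as Unexpected Completions.

```python
from enum import Enum, auto


class RequestType(Enum):
    CONFIG = auto()
    MESSAGE = auto()
    MEMORY = auto()
    IO = auto()


class Disposition(Enum):
    ACCEPT = auto()
    UNSUPPORTED_REQUEST = auto()


def handle_request_in_d1_or_d2(request: RequestType) -> Disposition:
    """How a Function in D1 or D2 treats an incoming Request."""
    # Only Configuration and Message Requests are serviced in these states.
    if request in (RequestType.CONFIG, RequestType.MESSAGE):
        return Disposition.ACCEPT
    # Everything else must be handled as an Unsupported Request.
    return Disposition.UNSUPPORTED_REQUEST


for kind in RequestType:
    print(kind.name, "->", handle_request_in_d1_or_d2(kind).name)
```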

D2 State - Deep Sleep
Optional. Before going into this state, software must ensure that all outstanding
non-posted Requests have received their associated Completions. This can be
achieved by polling the Transactions Pending bit in the Device Status register of


the PCI Express Capability block; when the bit is cleared to zero, it's safe to proceed. This power state provides deeper power conservation than D1 but less than the D3hot state. As in D1, the Function won't initiate Requests (except a PME Message) or act as the target of Requests other than configuration. Software must still be able to access the Function's configuration registers in this state.

Other characteristics of the D2 state include:
• Before going into this state, software must ensure that all outstanding non-posted Requests have received their associated Completions. This can be achieved by polling the Transactions Pending bit in the Device Status register of the PCIe Capability block. It could happen that the Completions will never be returned and, in that case, software should wait long enough to ensure they never will be returned.
• Link state must transition to L1 when the Device transitions to the D2 state.
• Configuration and Message Requests are accepted in this state, but all other Requests must be handled as Unsupported Requests and all completions may optionally be handled as Unexpected Completions.
• If an error is caused by an incoming Request and reporting it is enabled, an Error Message may be sent while in this state. If a different type of error occurs (such as a Completion timeout), the message won't be sent until the Device is returned to the D0 state.
• Function may send a PME message, if supported and enabled, to notify software that it needs power restored to handle an event.
• The Function may or may not lose its context in this state. If it does and the device supports PME messages, it must at least maintain its PME context for this purpose.
• The Function must return to the D0 Active state to be fully operational.

Table 16-7 on page 719 illustrates the PM policies while in the D2 state.
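
The polling step that precedes a move to D1, D2, or D3hot might look like the following Python sketch. It assumes a caller-supplied `read_device_status` function (hypothetical) that returns the 16-bit Device Status register from the PCI Express Capability block, and it uses bit 5 of that register as the Transactions Pending bit. The timeout stands in for the text's advice to "wait long enough" when Completions may never arrive; the default values are arbitrary placeholders, not figures from any spec.

```python
import time
from typing import Callable

DEVICE_STATUS_TRANSACTIONS_PENDING = 1 << 5   # Device Status register, bit 5


def wait_for_no_pending(read_device_status: Callable[[], int],
                        timeout_s: float = 1.0,
                        poll_interval_s: float = 0.001,
                        sleep: Callable[[float], None] = time.sleep,
                        clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until Transactions Pending clears.

    Returns True if the bit cleared, False if the timeout expired first.
    """
    deadline = clock() + timeout_s
    while True:
        if not read_device_status() & DEVICE_STATUS_TRANSACTIONS_PENDING:
            return True                # all Completions have been received
        if clock() >= deadline:
            return False               # caller decides how to proceed
        sleep(poll_interval_s)


# Demo with a fake register that reports "pending" for the first three reads.
fake_reads = iter([0x20, 0x20, 0x20, 0x00])
cleared = wait_for_no_pending(lambda: next(fake_reads), sleep=lambda _s: None)
print("safe to change power state" if cleared else "timed out")
```

Passing `sleep` and `clock` as parameters is just a convenience so the loop can be exercised without real delays.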


Table 16-7: D2 Power Management Policies

Link PM State | Function PM State | Registers and/or State that must be valid | Power | Actions permitted to Function | Actions permitted by Function
L1 | D2 | Device-class-specific registers and PME context.* | ≤ next higher supported PM state or D0 uninitialized. | Config Requests and transactions permitted by device class (typically none). This requires the Link to transition back to L0. | PME Messages.* Though not typically permitted, they would require the Link to transition back to L0.
L2/L3 | D2 | N/A**

*If PME supported in this state.
**This combination of Bus/Function PM states not allowed.

D3 - Full Off
Mandatory. All Functions must support the D3 state. This is the deepest state and power conservation is maximized. When software writes this power state to the Device, it goes to the D3hot state, meaning power is still applied. Removing power (Vcc) from the Device puts it into the D3cold state and the Link into L2, if a secondary power source (Vaux) is available, or L3 if it's not.

D3Hot State. (Mandatory.) Software puts a Function into D3hot by writing the appropriate value into the PowerState field of its Power Mgt Control and Status Register (PMCSR). In this state, the Function can only initiate PME or PME_TO_ACK Messages, and can only respond to configuration Requests or the PME_Turn_Off Message. Software must be able to access the Function's configuration registers while the device is in the D3hot state, if only to be able to change the state back to D0. Other characteristics of D3hot include:

• Before going into this state, software must ensure that all outstanding non-posted Requests have received their associated Completions. This can be achieved by polling the Transactions Pending bit in the Device Status register of the PCIe Capability block. It could happen that the Completions will never be returned and, in that case, software should wait long enough to ensure they never will be returned.
• The Link is forced to the L1 state when the Function changes to D3hot.


• The Function is allowed to send a PME message to notify PM software of its need to be returned to the fully active state (assuming it supports generation of PM events in the D3hot state and has been enabled to do so).
• Function context may be lost when going to this state and if the power is turned off the spec assumes all context will be lost. On the other hand, if the power never goes off before software initiates a return to D0 the context could be maintained. In earlier spec versions that wasn't possible; changing from D3hot to D0 involved a soft reset and all the registers were re-initialized. However, the 1.2 revision of the PCI PM spec added a new capability bit called "No Soft Reset" to indicate that the Function would not do a soft reset in that case. To be able to generate PME messages in the D3hot state, a Device must maintain its PME context (see "PME Context" on page 710).
• The Function exits from the D3hot state under two circumstances:
  • If Vcc is removed from the device, it transitions from D3hot to D3cold.
  • Software can write to the PowerState field of the Function's PMCSR register to change its PM state to D0. When programmed to exit D3hot and return to D0, the Function returns to the D0 Uninitialized PM state. A reset may or may not be required.

Table 16-8 on page 721 lists the PM policies while in the D3hot state.


Table 16-8: D3hot Power Management Policies

Bus PM State | Function PM State | Registers and/or State that must be valid | Power | Actions permitted to Function | Actions permitted by Function
L1 | D3hot | PME context.** | ≤ next higher supported PM state or D0 uninitialized. | PCI Express config transactions and PME_Turn_Off broadcast message*** (these can only occur after the Link transitions back to its L0 state). | PME message**, PME_TO_ACK message***, PM_Enter_L23 DLLP*** (these can occur only after the Link returns to L0).
L2/L3 Ready | D3hot | L2/L3 Ready entered following the PME_Turn_Off handshake sequence, which prepares a device for power removal***
L2/L3 | D3hot | N/A*

*This combination of Bus/Function PM states not allowed.
**If PME supported in this state.
***See "L2/L3 Ready Handshake Sequence" on page 764 for details regarding the sequence.

D3Cold State. Mandatory. Every PCI Express Function enters the D3cold PM state upon removal of power (Vcc) from the Function. When power is restored, the device must be reset or generate an internal reset, taking it from D3cold to D0 Uninitialized. A Function capable of generating a PME must maintain PME context while in this state and when transitioning to the D0 state. Since power was removed to arrive at this state, the Function must have an auxiliary power source available if it is to maintain the PME context. Then, when the device goes to D0 Uninitialized, it can generate a PME message to inform the system of a wakeup event, if it's capable and enabled to do so. For more on auxiliary power, refer to "Auxiliary Power" on page 775.

Table 16-9 on page 722 illustrates the PM policies while in the D3cold state.


Table 16-9: D3cold Power Management Policies

Bus PM State | Function PM State | Registers and/or State that must be valid | Power | Actions permitted to Function | Actions permitted by Function
L2 | D3cold | PME context* | AUX Power | Bus reset only | Signal Beacon or WAKE#**
L3 | D3cold | None | None | Bus reset only |

*If PME supported in this state.
**The method used to signal a wake to restore clock and power depends on the form factor.

Function PM State Transitions


Figure 16-6 illustrates the PM state transitions for a PCIe Function. Table 16-10
on page 723 provides a description of each transition. Table 16-11 on page 724
illustrates the transitions from one state to another from both a hardware and a
software perspective.

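Figure 16-6: PCIe Function D-State Transitions

[Figure: state diagram. Power On Reset leads to D0 Uninitialized, which leads to D0 Active. D0 Active connects to D1, D2, and D3hot. D3hot leads to D3cold when Vcc is removed.]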


Table 16-10: Description of Function State Transitions

From State | To State | Description
D0 Uninitialized | D0 Active | Function has been completely configured and enabled by its driver.
D0 Active | D1 | Software writes the PMCSR PowerState to D1.
D0 Active | D2 | Software writes the PMCSR PowerState to D2.
D0 Active | D3hot | Software writes the PMCSR PowerState to D3hot.
D1 | D0 Active | Software writes the PMCSR PowerState to D0.
D1 | D2 | Software writes the PMCSR PowerState to D2.
D1 | D3hot | Software writes the PMCSR PowerState to D3hot.
D2 | D0 Active | Software writes the PMCSR PowerState to D0.
D2 | D3hot | Software writes the PMCSR PowerState to D3hot.
D3hot | D3cold | Power is removed from the Function.
D3hot | D0 Uninitialized | Software writes the PMCSR PowerState to D0.
D3cold | D0 Uninitialized | Power is restored to the Function.
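
The transitions in Table 16-10 form a small directed graph, which software can encode directly to sanity-check a planned sequence of power-state changes. The Python sketch below is one way to do that; the dictionary and helper names are made up for this example.

```python
from typing import Dict, Sequence, Set

# Allowed transitions, transcribed from Table 16-10.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "D0 Uninitialized": {"D0 Active"},
    "D0 Active": {"D1", "D2", "D3hot"},
    "D1": {"D0 Active", "D2", "D3hot"},
    "D2": {"D0 Active", "D3hot"},
    "D3hot": {"D3cold", "D0 Uninitialized"},
    "D3cold": {"D0 Uninitialized"},
}


def is_allowed(from_state: str, to_state: str) -> bool:
    """True if the table lists a direct transition between the two states."""
    return to_state in ALLOWED_TRANSITIONS.get(from_state, set())


def is_valid_sequence(states: Sequence[str]) -> bool:
    """True if every consecutive pair in the sequence is an allowed transition."""
    return all(is_allowed(a, b) for a, b in zip(states, states[1:]))


print(is_allowed("D2", "D1"))
print(is_valid_sequence(["D0 Active", "D1", "D3hot", "D0 Uninitialized", "D0 Active"]))
```

Note that the table offers no direct path from a deeper state back to a shallower sleep state (for example D2 to D1); the Function has to pass through D0 first.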


Table 16-11: Function State Transition Delays

Initial State | Next State | Minimum software-guaranteed delays
D0 | D1 | 0
D0 or D1 | D2 | 200 µs from new state setting to first access (including config accesses).
D0, D1, or D2 | D3hot | 10 ms from new state setting to first access.
D1 | D0 | 0
D2 | D0 | 200 µs from new state setting to first access.
D3hot | D0 | 10 ms from new state setting to first access.
D3cold | D0 |
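
A driver-style helper that honors these delays could be sketched as follows in Python. The delay values come straight from Table 16-11. The `read_pmcsr`, `write_pmcsr`, and `sleep_us` callables are hypothetical stand-ins for whatever register-access and delay primitives a platform provides. The sketch assumes the PowerState field occupies bits 1:0 of PMCSR with the usual encoding (00b = D0 through 11b = D3hot), and it masks the write-one-to-clear PME_Status bit (bit 15) so the read-modify-write does not clear it by accident.

```python
from typing import Callable, Dict, Tuple

POWER_STATE_FIELD: Dict[str, int] = {"D0": 0b00, "D1": 0b01, "D2": 0b10, "D3hot": 0b11}
PMCSR_POWER_STATE_MASK = 0x3       # PowerState field, bits 1:0
PMCSR_PME_STATUS = 1 << 15         # write-one-to-clear; never write it back

# Minimum delay in microseconds before the first access after the write.
MIN_DELAY_US: Dict[Tuple[str, str], int] = {
    ("D0", "D1"): 0,
    ("D0", "D2"): 200, ("D1", "D2"): 200,
    ("D0", "D3hot"): 10_000, ("D1", "D3hot"): 10_000, ("D2", "D3hot"): 10_000,
    ("D1", "D0"): 0,
    ("D2", "D0"): 200,
    ("D3hot", "D0"): 10_000,
}


def min_delay_us(initial: str, target: str) -> int:
    """Look up the delay; raises KeyError for transitions the table omits."""
    return MIN_DELAY_US[(initial, target)]


def change_power_state(read_pmcsr: Callable[[], int],
                       write_pmcsr: Callable[[int], None],
                       initial: str, target: str,
                       sleep_us: Callable[[int], None]) -> int:
    """Write the new PowerState, then wait the required time."""
    delay = min_delay_us(initial, target)
    value = read_pmcsr() & ~(PMCSR_POWER_STATE_MASK | PMCSR_PME_STATUS)
    write_pmcsr(value | POWER_STATE_FIELD[target])
    sleep_us(delay)                # no accesses to the Function before this ends
    return delay


# Demo against a fake register with only PME_En (bit 8) set.
fake = {"pmcsr": 0x0100}
waits = []
change_power_state(lambda: fake["pmcsr"],
                   lambda v: fake.update(pmcsr=v),
                   "D0", "D3hot", waits.append)
print(hex(fake["pmcsr"]), waits)
```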

Detailed Description of PCI-PM Registers


The PCI Bus PM Interface spec defines the PM registers (see Figure 16-7) that are
implemented in PCIe Functions. Configuration software can use them to determine
a Function's PM capabilities and control its properties.

Figure 16-7: PCI Function's PM Registers

1st Dword   [31:16]  Power Management Capabilities (PMC)
            [15:8]   Pointer to Next Capability
            [7:0]    Capability ID (01h)
2nd Dword   [31:24]  Data Register
            [23:16]  Bridge Support Extensions (PMCSR_BSE)
            [15:0]   Control/Status Register (PMCSR)

PM Capabilities (PMC) Register


The fields of this 16-bit read-only register are described in Table 16-12.


Table 16-12: The PMC Register Bit Assignments

Bit(s)  Description

31:27  PME_Support field. Indicates in which PM states the Function is capable of sending a PME message. A zero in a bit indicates PME notification is not supported in the respective PM state.
       Bit   Corresponds to PM State
       27    D0
       28    D1
       29    D2
       30    D3hot
       31    D3cold (Function requires aux power for PME logic and Wake signaling via beacon or WAKE# pin)
       Systems that support wake from D3cold must also support aux power and must use it to signal the wakeup.
       Bits 31, 30, and 27 must be set to 1b for virtual PCI-PCI Bridges implemented within Root and Switch Ports. This is required for ports that forward PME Messages.

26  D2_Support bit. 1 = Function supports the D2 PM state.

25  D1_Support bit. 1 = Function supports the D1 PM state.


24:22  Aux_Current field. For a Function that supports generation of the PME message from the D3cold state, this field reports the current demand made upon the 3.3 Vaux power source (see "Auxiliary Power" on page 775) by the Function's logic that retains the PME context information. This information is used by software to determine how many Functions can simultaneously be enabled for PME generation (based on the total amount of current each draws from the system 3.3 Vaux power source and the power sourcing capability of the power source).

       • If the Function does not support PME notification from within the D3cold PM state, this field is not implemented and always returns zero when read. Alternatively, a new feature defined by PCI Express permits devices that do not support PMEs to report the amount of Aux current they draw when enabled by the Aux Power PM Enable bit within the Device Control register.
       • If the Function implements the Data register (see "Data Register" on page 731), this field always returns zeros when read. The Data register then takes precedence over this field in reporting the 3.3 Vaux current requirements for the Function.
       • If the Function supports PME notification from the D3cold state and does not implement the Data register, then the Aux_Current field reports the 3.3 Vaux current requirements for the Function. It is encoded as follows:

       Bit 24  23  22    Max Current Required
           1   1   1     375 mA
           1   1   0     320 mA
           1   0   1     270 mA
           1   0   0     220 mA
           0   1   1     160 mA
           0   1   0     100 mA
           0   0   1     55 mA
           0   0   0     0 mA


21  Device-Specific Initialization (DSI) bit. A one in this bit indicates that immediately after entry into the D0 Uninitialized state, the Function requires additional configuration above and beyond setup of its PCI configuration Header registers before the Class driver can use the Function. Microsoft OSs do not use this bit. Rather, the determination and initialization is made by the Class driver.

20  Reserved.

19  PME Clock bit. Does not apply to PCI Express. Must be hardwired to 0.

18:16  Version field. This field indicates the version of the PCI Bus PM Interface spec that the Function complies with.

       Bit 18  17  16    Complies with Spec Version
           0   0   1     1.0
           0   1   0     1.1 (required by PCI Express)

PM Control and Status Register (PMCSR)


This register, required for all PCI Express Devices, serves several purposes as described below. Table 16-13 on page 728 provides a description of the PMCSR bit fields.
• If the Function implements PME capability, a PME Enable bit permits software to enable or disable the Function's ability to assert the PME message or WAKE# signal, and a Status bit reflects whether or not a PME has occurred.
• If the optional Data register is implemented (see "Data Register" on page 731), two fields are used to permit software to select which information can be read through the Data register, and provide the scaling multiplier for the Data register value.
• The register's PowerState field can be read to determine the current PM state of the Function and written to place the Function into a new PM state.


Table 16-13: PM Control/Status Register (PMCSR) Bit Assignments

Bit(s) | Value at Reset | Read/Write | Description

31:24 | all zeros | Read Only | See "Data Register" on page 731.

23 | zero | Read Only | Not used in PCI Express

22 | zero | Read Only | Not used in PCI Express

21:16 | all zeros | Read Only | Reserved

15 | See Description. | Read, Write one to clear, Sticky (RW1CS) | PME_Status bit. Optional: only implemented if the Function supports PME notification, otherwise zero.
This bit reflects whether the Function has experienced a PME (even if the PME_En bit in this register has disabled the Function's ability to send a PME message). If set to one, the Function has experienced a PME. Software clears this bit by writing a one to it.
After reset, this bit is zero if the Function doesn't support PME in D3cold. If the Function does support PME in D3cold, this bit is indeterminate at initial OS boot time but after that reflects whether the Function has experienced a PME.
If the Function supports PME from D3cold, the state of this bit must persist even if power is lost or the Function is reset (a sticky bit). This implies that an auxiliary power source keeps this logic active during these conditions (see "Auxiliary Power" on page 775).


14:13 | Device specific | Read Only | Data_Scale field. Optional. If the Function does not implement the Data register, this field is hardwired to return zeros.
If the Data register is implemented, the Data_Scale field is mandatory and must be a read-only value representing the multiplier for it. The value and interpretation of the Data_Scale field depends on the data item selected to be viewed through the Data register by the Data_Select field.

12:9 | 0000b | Read/Write | Data_Select field. Optional. If the Function does not implement the Data register, this field is hardwired to return zeros.
If the Data register is implemented, Data_Select is a mandatory read/write field. The value placed in this register selects the data to be viewed in the Data register. That value must then be multiplied by the value read from the Data_Scale field.


8 | See Description. | Read/Write | PME_En bit. Optional.
1 = enable Function's ability to send PME messages when an event occurs.
0 = disable.
If the Function does not support the generation of PMEs from any power state, this bit always returns zero when read.
After reset, this bit is zero if the Function doesn't support PME from D3cold. If the Function supports PME from D3cold:
• this bit is indeterminate at initial OS boot time.
• otherwise, it enables or disables whether the Function can send a PME message in case a PME occurs.
If the Function supports PME from D3cold, the state of this bit must persist while the Function remains in the D3cold state and during the transition from D3cold to the D0 Uninitialized state. This implies that the PME logic must use an aux power source to power this logic during these conditions.

7:2 | all zeros | Read Only | Reserved

1:0 | 00b | Read/Write | PowerState field. Mandatory. Software uses this field to read the current PM state of the Function or write a new PM state. If software selects a PM state not supported by the Function, the write completes normally but the data is discarded and no state change occurs.

Bits 1 0 | PM State
0 0 | D0
0 1 | D1
1 0 | D2
1 1 | D3hot
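The PMCSR fields above can be illustrated with a short decoding sketch. It assumes the 16-bit PMCSR value has already been read from configuration space; the helper names are hypothetical. Because PME_Status is write-one-to-clear, the sketch masks that bit out before a read-modify-write so that changing the power state does not clear a pending PME indication by accident.

```python
POWER_STATES = {0b00: "D0", 0b01: "D1", 0b10: "D2", 0b11: "D3hot"}
PME_STATUS = 1 << 15      # bit 15, RW1CS: writing a one clears it
PME_EN = 1 << 8           # bit 8
POWER_STATE_MASK = 0b11   # bits 1:0


def decode_pmcsr(value: int) -> dict:
    """Decode the PowerState, PME_En and PME_Status fields of a PMCSR value."""
    return {
        "power_state": POWER_STATES[value & POWER_STATE_MASK],
        "pme_en": bool(value & PME_EN),
        "pme_status": bool(value & PME_STATUS),
    }


def pmcsr_with_power_state(current: int, new_state: int) -> int:
    """Return the value to write back in order to request a new PM state.

    PME_Status is written as zero so the sticky bit is left untouched.
    """
    value = current & 0xFFFF & ~PME_STATUS
    return (value & ~POWER_STATE_MASK) | (new_state & POWER_STATE_MASK)
```

Note that, per the table, a write selecting an unsupported state completes normally but has no effect, so software would re-read the PowerState field to confirm the change.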


Data Register
Optional, read-only. Refer to Figure 16-8 on page 732. The Data register is an 8-bit, read-only register that provides software with the following information:

• Power consumed in the selected PM state; useful in power budgeting.
• Power dissipated in the selected PM state; useful in managing the thermal environment.
• Any type of data could be reported through this register, but the PCI PM spec only defines power consumption and power dissipation information for it.

If the Data register is implemented, the Data_Select and Data_Scale fields of the PMCSR register must also be implemented, and the Aux_Current field of the PMC register must not be implemented.

Determining Presence of the Data Register. Software can perform the following procedure to check for the presence of the Data register:

1. Write a value of 0000b into the Data_Select field of the PMCSR register.
2. Read from either the Data register or the Data_Scale field of the PMCSR register. A non-zero value indicates that the Data register as well as the Data_Scale and Data_Select fields of the PMCSR register are implemented. If a value of zero is read, go to step 3.
3. If the current value of the Data_Select field is a value other than 1111b, go to step 4. If the current value of the Data_Select field is 1111b, all possible Data register values have been scanned and returned zero, indicating that neither the Data register nor the Data_Scale and Data_Select fields of the PMCSR register are implemented.
4. Increment the content of the Data_Select field and go back to step 2. Since the Data_Select field is only 4 bits, a complete scan requires testing 16 possible select values and looking to see if any non-zero values are seen for the data and scale registers.
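The scan can be sketched as a loop over the 16 Data_Select values. This is only an illustration under stated assumptions: the three accessor callbacks and the FakeFunction class are hypothetical stand-ins for real configuration reads and writes, and the sketch checks both the Data register and Data_Scale where the procedure allows either.

```python
DATA_SELECT_SHIFT = 9    # PMCSR bits 12:9
DATA_SCALE_SHIFT = 13    # PMCSR bits 14:13
PME_STATUS = 1 << 15     # write-one-to-clear, so never write it back as one


def data_register_present(read_pmcsr, write_pmcsr, read_data) -> bool:
    """Scan all 16 Data_Select values; any non-zero Data or Data_Scale means present."""
    for select in range(16):
        pmcsr = read_pmcsr() & ~PME_STATUS & 0xFFFF
        pmcsr = (pmcsr & ~(0xF << DATA_SELECT_SHIFT)) | (select << DATA_SELECT_SHIFT)
        write_pmcsr(pmcsr)
        scale = (read_pmcsr() >> DATA_SCALE_SHIFT) & 0b11
        if read_data() != 0 or scale != 0:
            return True
    return False


class FakeFunction:
    """Hypothetical stand-in for a Function; reports data only for select 03h."""

    def __init__(self, implemented: bool):
        self.implemented = implemented
        self.pmcsr = 0

    def _select(self) -> int:
        return (self.pmcsr >> DATA_SELECT_SHIFT) & 0xF

    def read_pmcsr(self) -> int:
        scale = 0b01 if self.implemented and self._select() == 3 else 0
        return (self.pmcsr & ~(0b11 << DATA_SCALE_SHIFT)) | (scale << DATA_SCALE_SHIFT)

    def write_pmcsr(self, value: int) -> None:
        # Data_Select is hardwired to zero when the Data register is absent.
        self.pmcsr = value if self.implemented else value & ~(0xF << DATA_SELECT_SHIFT)

    def read_data(self) -> int:
        return 5 if self.implemented and self._select() == 3 else 0
```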

Operation of the Data Register. The information returned is typically a static copy of the Function's worst-case power consumption and power dissipation characteristics in the various PM states (as listed in the Device's data sheet). To use the Data register, the programmer uses the following sequence:

1. Write a value into the Data_Select field (see Table 16-14 on page 733) of the PMCSR register to select the data item to be viewed through the Data register.
2. Read the data value from the Data register and the Data_Scale field of the PMCSR register.
3. Multiply the value by the scaling factor.
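As a worked illustration of step 3, the sketch below applies the Data_Scale multipliers listed in Table 16-14 (0.1, 0.01, 0.001). To keep the arithmetic exact it works in milliwatts rather than Watts, which is a choice made here for the example; the function name is hypothetical.

```python
# Data_Scale multipliers expressed in milliwatts per count:
# 01b = x0.1 W, 10b = x0.01 W, 11b = x0.001 W; 00b means "unknown".
MILLIWATTS_PER_COUNT = {0b01: 100, 0b10: 10, 0b11: 1}


def data_register_milliwatts(data: int, data_scale: int):
    """Scale an 8-bit Data register value; returns None when the scale is unknown."""
    per_count = MILLIWATTS_PER_COUNT.get(data_scale & 0b11)
    if per_count is None:
        return None
    return (data & 0xFF) * per_count
```

For example, a Data value of 25 with Data_Scale = 01b corresponds to 2.5 W.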

Multi-Function Devices. In a multi-function PCI Express device, each Function must supply its own power information. The power information for the logic common to all the Functions is reported through Function zero's Data register (see Data Select Value = 8 in Table 16-14 on page 733).

Virtual PCI-to-PCI Bridge Power Data. The spec doesn't specify Data field use in PCI-to-PCI bridge Functions in a Root Complex or Switch. But, to maintain PCI-PM compatibility, bridges must report the power they consume. Software could read the virtual PPB Data registers at each port of a switch to determine the power consumed by the switch in each power state.

Figure 16-8: PM Registers

1st Dword: Power Management Capabilities (PMC), bits 31:16 | Pointer to Next Capability, bits 15:8 | Capability ID (01h), bits 7:0
2nd Dword: Data Register | Bridge Support Extensions (PMCSR_BSE) | Control/Status Register (PMCSR)


Table 16-14: Data Register Interpretation

Data Select Value | Data Reported in Data Register

00h | Power consumed in D0
01h | Power consumed in D1
02h | Power consumed in D2
03h | Power consumed in D3
04h | Power dissipated in D0
05h | Power dissipated in D1
06h | Power dissipated in D2
07h | Power dissipated in D3
08h | In a multi-function PCI device, Function 0 indicates power consumed by logic common to all Functions in the package.
09h-0Fh | Reserved for future use of Function 0 in a multi-function device.
08h-0Fh | Reserved in single-function devices and Functions other than Function 0 in a multi-function device. (Data Scale: Reserved; Units/Accuracy: TBD)

Interpretation of Data Scale Field in PMCSR for the defined values: 00b = unknown, 01b = multiply by 0.1, 10b = multiply by 0.01, 11b = multiply by 0.001.
Units/Accuracy for the defined values: Watts.

Introduction to Link Power Management


We've just seen how software can put Devices into one of several device power states; now let's consider how PCIe also manages Link power. Device power and Link power are related to each other, as shown in Table 16-15 on page 734. Note also the relationship between downstream and upstream devices, which can be summarized by saying that an upstream Device or Link cannot be in a more aggressive power-conserving state than the one below it. The reason is to facilitate timely delivery of packets from the Endpoints, whose traffic would be delayed if upstream devices were in a lower power state. Each relationship is described below:

D0: Device is fully powered and typically in the L0 Link state. Some power conservation is available without leaving this state by using DPA substates (see "Dynamic Power Allocation (DPA)" on page 714), and by using the hardware-based Link power management (see "Active State Power Management (ASPM)" on page 735 for more details).

D1 & D2: When software changes the device state to D1 or D2, the Link must automatically transition to the L1 state. Since both Link partners are involved in this operation there is a handshake mechanism to ensure that things are done in an orderly fashion.

D3hot: When software places a device into the D3 state, the Link automatically transitions to L1 just as it does when going to the D1 and D2 states. Software may now choose to remove the reference clock and power, putting the device into D3cold. But, before doing that, it's expected that the system will initiate a handshake process to prepare the Links by putting them into the L2/L3 Ready state.

D3cold: In this state, main power and the reference clock have been turned off. However, auxiliary power (VAUX) may be available, allowing the device to signal a wakeup event to the system. If it is, the Link state will be in L2. If main power is removed but VAUX is not available, the Link will be in L3. Table 16-16 on page 735 provides additional information regarding the Link power states.

Table 16-15: Relationship Between Device and Link Power States

Downstream Component D-State | Permissible Upstream Component D-State | Permissible Interconnect State

D0 | D0 | L0, L0s & L1 (optional)

D1 | D0-D1 | L1

D2 | D0-D2 | L1

D3hot | D0-D3hot | L1, L2/L3 Ready

D3cold | D0-D3cold | L2 (AUX Pwr), L3
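The relationships in Table 16-15 can be captured as a pair of lookups. This is a minimal sketch for illustration only; the names are hypothetical, and the check simply encodes the rule that an upstream component may not be in a deeper D-state than the component below it.

```python
# D-states ordered from shallowest to deepest power saving.
D_STATES = ["D0", "D1", "D2", "D3hot", "D3cold"]

# Permissible interconnect (Link) states per downstream component D-state.
PERMISSIBLE_LINK_STATES = {
    "D0": {"L0", "L0s", "L1"},
    "D1": {"L1"},
    "D2": {"L1"},
    "D3hot": {"L1", "L2/L3 Ready"},
    "D3cold": {"L2", "L3"},
}


def upstream_d_state_ok(downstream: str, upstream: str) -> bool:
    """True if the upstream component is not in a deeper state than the downstream one."""
    return D_STATES.index(upstream) <= D_STATES.index(downstream)


def link_state_ok(downstream: str, link: str) -> bool:
    """True if the Link state is permitted for the downstream component's D-state."""
    return link in PERMISSIBLE_LINK_STATES[downstream]
```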


Table 16-16: Link Power State Characteristics

Active
Software Ref. Main
State Description State PLL Vaux
Directed? Clocks Power
LinkPM

L0 FullyActive Yes(D0) On On On On On/Off

L0s Standby No Yes On On On On/Off


(D0)

L1 LowPower Yes* Yes(option) On On On/Off On/Off


Standby (D1D3hot) (D0)

L2/L3 Stagingfor Yes No On On On/Off On/Off


Ready power PME_Turn_Off
handshake
removal

L2 LowPower Yes** No Off Off Off On


Sleep

L3 Off N/A N/A Off Off Off Off


(ZeroPower)

* The L1 state is entered either due to PM software placing a device into the D1, D2, or D3 states or under hardware control with ASPM.
** The spec describes the L2 state as being software directed. The other L-states in the table are listed as software directed because software initiates the transition into these states. For example, when software initiates a device power state change to D1, D2, or D3, devices must respond by entering the L1 state. Software then causes the transition to the L2/L3 Ready state by initiating a PME_Turn_Off message. Finally, software initiates the removal of power from a device after the device has transitioned to the L2/L3 Ready state. Because Vaux power is available in L2, a wakeup event can be signaled to notify software.

Active State Power Management (ASPM)


ASPM is a hardware-based Link power conservation mechanism that only applies while the device is in the D0 device power state. Transitions into and out of ASPM states are initiated by hardware based on implementation-specific criteria; software can't control or observe this operation. It can only enable or disable it using configuration register bits (see Figure 16-15 on page 744).


Two low-power states are defined for ASPM:

1. L0s (standby state): This state provides substantial power savings but still allows quick entry and exit latencies. The main way this is done is by putting the Transmitter into the Electrical Idle condition. Support for this state was required for all PCIe devices in the earlier spec versions, but in the 3.0 spec it became optional.
2. L1 ASPM: The goal for L1 is to achieve greater power conservation than L0s for situations where longer entry and exit latencies are acceptable. For example, in this state both Transmitters go into Electrical Idle at the same time. Support for this state continues to be optional in the 3.0 spec as it was in the earlier specs.

Electrical Idle
Since putting a Transmitter into Electrical Idle is a central part of ASPM, it will help to discuss how doing so works. When a Transmitter's differential signals (TxD+ and TxD-) go into the Electrical Idle condition, the Transmitter stops signaling and instead holds its voltage very close to the common-mode voltage with a differential voltage of 0V. Signal transitions consume power, so stopping them on the Link gives power savings while still allowing a fairly quick resumption of normal Link activity, during which the Link is said to be in the L0 state. Depending on the degree of power savings, the Link is either in the L0s or L1 state. During this time, the transmitter may choose to remain in the low-impedance state or change to high impedance by turning off its termination logic to save more power. In addition to L0s and L1, Electrical Idle will also be in effect when the Link has been disabled.

Transmitter Entry to Electrical Idle


Transmitters that wish to enter the Electrical Idle condition must first inform the Link partner so the lack of further signaling won't be misinterpreted as an error. They do that by sending the EIOS (Electrical Idle Ordered Set) and then quickly ceasing transmission and tri-stating the Link output drivers. What the EIOS looks like depends on the encoding method in use, as described in the following sections. Once the last EIOS has been sent, the Transmitter must enter Electrical Idle within 8ns and remain in that mode for at least 20ns, regardless of the data rate. The differential peak voltage allowed during Electrical Idle must be between 0 and 20mV peak, again regardless of the data rate, to reduce the chance of the Receiver misinterpreting noise on the line as a valid signal. (See Table 13-3 on page 489 for more on these timing and voltage parameters.)


Gen1/Gen2 Mode Encoding. For Gen1/Gen2 mode, the EIOS takes the form shown in Figure 16-9 on page 737. All four Symbols must be sent, but the Receiver only needs to see two IDL control characters to recognize this condition.

Figure 16-9: Gen1/Gen2 Mode EIOS Pattern

Encoding
COM K28.5
IDL K28.3
IDL K28.3
IDL K28.3

Gen3 Mode Encoding. For Gen3 mode, the EIOS is an Ordered Set block that consists of an Ordered Set Sync Header (01b) followed by 16 bytes that are all 66h, as shown in Figure 16-10 on page 737. Curiously, a Transmitter is not required to finish the block if it will go directly to Electrical Idle but is allowed to stop after Symbol 13 (anywhere in Symbol 14 or 15). The reason is to allow for the case where an internal clock doesn't line up with the Symbol boundaries due to 128b/130b encoding. This truncation won't cause a problem at the Receiver because it only needs to see Symbols 0-3 of the EIOS to recognize it.

Figure 16-10: Gen3 Mode EIOS Pattern

EIOS
Sync Header 01
Byte 0 01100110
1 01100110
2 01100110
3 01100110
4 01100110

13 01100110
14 01100110
15 01100110
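The Gen3 EIOS pattern and the Receiver's recognition rule (Symbols 0-3 are sufficient) can be modeled in a few lines. This is only a byte-level sketch with hypothetical function names; it ignores scrambling, lane striping and bit ordering on the wire.

```python
ORDERED_SET_SYNC_HEADER = 0b01   # Sync Header value used for Ordered Set blocks
EIOS_SYMBOL = 0x66               # every Symbol of a Gen3 EIOS


def gen3_eios_symbols() -> list:
    """Return the 16 Symbols of a complete Gen3 EIOS block."""
    return [EIOS_SYMBOL] * 16


def looks_like_gen3_eios(sync_header: int, symbols) -> bool:
    """Recognize an EIOS from its Sync Header and Symbols 0-3 only.

    A block truncated after Symbol 13 is therefore still recognized.
    """
    first_four = list(symbols[:4])
    return (sync_header == ORDERED_SET_SYNC_HEADER
            and len(first_four) == 4
            and all(s == EIOS_SYMBOL for s in first_four))
```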


Transmitter Exit from Electrical Idle


When a Transmitter is instructed to exit from Electrical Idle, the steps it takes depend on the data rate in use (see below). However, it must resume transmission within less than 8ns by sending FTSs or TS1/TS2s, causing a transition back to the L0 full-on state.

Gen1 Mode. For 2.5 GT/s, the process is simple: it begins using valid differential signals to send the TS1s or FTSs that will serve to inform the Receiver about the change. The Receiver detects the voltage as being above the squelch threshold and begins to evaluate the incoming signal.

Gen2 Mode. When using 5.0 GT/s, the signals are changing so quickly that they don't have time to reach the higher voltage levels. That makes it more difficult to quickly detect when the voltages have changed back to the operational values. To make this easier, the EIEOS (Electrical Idle Exit Ordered Set) was defined to provide a lower-frequency sequence. The EIEOS for 8b/10b encoding, shown in Figure 16-11 on page 739, uses repeated K28.7 control characters to appear as a repeating string of 5 ones followed by 5 zeros. This gives the low-frequency signal that allows the higher signal voltages that are more readily seen. In fact, the spec states that this pattern guarantees that the Receiver will properly detect an exit from Electrical Idle, something that scrambled data cannot do. The EIEOS is to be sent under the following conditions:

• Before the first TS1 after entering the Configuration.Linkwidth.Start or Recovery.RcvrLock state.
• After every 32 TS1s or TS2s are sent in Configuration.Linkwidth.Start, Recovery.RcvrLock, or Recovery.RcvrCfg states. The TS1/TS2 count is reset to zero whenever an EIEOS is sent or the first TS2 is received in the Recovery.RcvrCfg state.


Figure 16-11: Gen1/Gen2 Mode EIEOS Symbol Pattern

EIEOS
Symbol 0 K28.5
1 K28.7
2 K28.7
3 K28.7
4 K28.7

13 K28.7
14 K28.7
15 D10.2

Gen3 Mode. An EIEOS is needed for the 8 GT/s rate too, and for the same reason as for 5.0 GT/s. Now, though, the Ordered Set takes the form of a block, as shown in Figure 16-12 on page 740. As before, it gives a low-frequency pattern in alternating bytes of 00h and FFh, which appears as a repeating string of 8 zeros followed by 8 ones.

In addition, EIEOS is sent so as to allow a receiver during the LTSSM Recovery state to establish Block Lock, after which the Link transitions to the L0 state. See the section "Block Alignment" on page 411 and "Achieving Block Alignment" on page 438.

In Gen3 mode, EIEOS is to be sent:

• Before the first TS1 after entering the Configuration.Linkwidth.Start or Recovery.RcvrLock state.
• Immediately after an EDS Framing Token when a Data Stream is ending if an EIOS is not being sent and the LTSSM is not entering Recovery.RcvrLock.
• After every 32 TS1s/TS2s whenever TS1s or TS2s are sent. The count is reset to zero when:
  - an EIEOS is sent
  - the first TS2 is received while in either the Recovery.RcvrCfg or Configuration.Complete LTSSM state
  - a Downstream Port in Phase 2 of the Equalization sequence, or an Upstream Port in Phase 3, receives two TS1s with the Reset EIEOS Interval Count bit set.


• After every 2^16 (65,536) TS1s during the Equalization sequence, if the Reset EIEOS Interval Count bit has prevented it from being sent. The spec states that designs are allowed to satisfy this requirement by sending an EIEOS within 2 TS1s of the scrambling LFSR matching its seed value.
• As part of an FTS sequence, Compliance Pattern, or Modified Compliance pattern.

Figure 16-12: 128b/130b EIEOS Block

EIEOS
Sync Header 01
Byte 0 00000000
1 11111111
2 00000000
3 11111111
4 00000000

13 11111111
14 00000000
15 11111111
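To see why this block produces a low-frequency pattern, the following sketch builds the 16 Symbols shown in the figure and measures the lengths of the runs of identical bits. The function names are hypothetical, and the model is byte-level only (no Sync Header, no lane striping).

```python
from itertools import groupby


def gen3_eieos_symbols() -> list:
    """Return the 16 Symbols of a Gen3 EIEOS: 00h and FFh alternating."""
    return [0x00 if i % 2 == 0 else 0xFF for i in range(16)]


def bit_run_lengths(symbols) -> list:
    """Lengths of consecutive runs of identical bits across the Symbols."""
    bits = "".join(f"{s:08b}" for s in symbols)
    return [len(list(group)) for _, group in groupby(bits)]
```

Every run is 8 bits long, so the signal toggles once per Symbol rather than up to once per bit.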

Receiver Entry to Electrical Idle


When a Transmitter enters Electrical Idle, the Link partner's Receiver responds based on the data rate, as described in the following sections. Receipt of an EIOS informs the Receiver that this is going to happen, preparing it to detect when it actually does happen. When the Receiver detects this condition it de-gates the error logic to prevent reporting errors caused by unreliable activity on the Link and arms its Electrical Idle Exit detector so it will be ready to resume normal activity when the Transmitter begins to send data again. There are two Electrical Idle detection options:

Detecting Electrical Idle Voltage. Once an EIOS has been received, the expectation is that the Transmitter will cease transmission very quickly. In the 1.x spec versions Receivers detect this by observing that the incoming voltage has dropped below the threshold of a valid signal. This isn't too difficult at 2.5 GT/s but it requires a squelch-detect circuit that consumes space and power.


Inferring Electrical Idle. However, at higher frequencies the signal becomes increasingly attenuated, making it difficult for squelch-detect logic to distinguish the levels. This is especially true for 8.0 GT/s, where it's expected that the Receiver may need to perform equalization internally to recover a good signal. To alleviate these detection problems, the 2.0 spec introduced the concept of allowing a Receiver to infer when the Link has gone to the Electrical Idle condition rather than testing the voltage level. In this model, the absence of expected events is used to indicate that the Link is not signaling and can therefore be assumed to be in Electrical Idle, as listed in Table 16-17. By way of explanation, Flow Control Updates should arrive regularly while the Link is in L0, and SOSs are expected with certain timing, too. For simplicity, a Receiver is allowed to check for one or the other or both of these conditions. During Link training the TS1s and TS2s should arrive regularly, so their absence can also be taken to mean that the Link is Idle. For the last two rows of the table, though, it's possible that no Symbols have been received at all, and that will also be understood to mean the Link is Idle. Since Electrical Idle takes place for the overall Link and not for Lanes independently, there's no need for each Lane to measure these times. Instead, an LTSSM can just use one timer in common for all the Lanes on that Link.

Table 16-17: Electrical Idle Inference Conditions

State | 2.5 GT/s | 5.0 GT/s | 8.0 GT/s

L0 | Absence of an FC Update or SOS in a 128 µs window (all data rates)

Recovery.RcvrCfg | Absence of a TS1 or TS2 in a 1280 UI interval | Absence of a TS1 or TS2 in a 1280 UI interval | Absence of a TS1 or TS2 in a 4 ms window

Recovery.Speed (successful_speed_negotiation = 1b) | Absence of a TS1 or TS2 in a 1280 UI interval | Absence of a TS1 or TS2 in a 1280 UI interval | Absence of a TS1 or TS2 in a 4680 UI interval

Recovery.Speed (successful_speed_negotiation = 0b) | Absence of an exit from Electrical Idle in a 2000 UI interval | Absence of an exit from Electrical Idle in a 16000 UI interval | Absence of an exit from Electrical Idle in a 16000 UI interval

Loopback.Active (as a slave) | Absence of an exit from Electrical Idle in a 128 µs window | N/A | N/A
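Because several of these windows are given in Unit Intervals, their duration depends on the data rate. The sketch below converts the table entries to picoseconds, assuming a UI of 400 ps at 2.5 GT/s, 200 ps at 5.0 GT/s and 125 ps at 8.0 GT/s (the reciprocal of each rate); the dictionary keys and function name are hypothetical.

```python
# Unit Interval per data rate, in picoseconds (1 / data rate).
UI_PS = {"2.5": 400, "5.0": 200, "8.0": 125}
_RATES = ("2.5", "5.0", "8.0")

# Inference windows from Table 16-17 as (count, unit); None means N/A.
WINDOWS = {
    "L0": {r: (128, "us") for r in _RATES},
    "Recovery.RcvrCfg": {"2.5": (1280, "UI"), "5.0": (1280, "UI"), "8.0": (4, "ms")},
    "Recovery.Speed (success=1)": {"2.5": (1280, "UI"), "5.0": (1280, "UI"), "8.0": (4680, "UI")},
    "Recovery.Speed (success=0)": {"2.5": (2000, "UI"), "5.0": (16000, "UI"), "8.0": (16000, "UI")},
    "Loopback.Active (slave)": {"2.5": (128, "us"), "5.0": None, "8.0": None},
}


def inference_window_ps(state: str, rate: str):
    """Return the inference window in picoseconds, or None where the table says N/A."""
    entry = WINDOWS[state][rate]
    if entry is None:
        return None
    count, unit = entry
    ps_per_unit = {"UI": UI_PS[rate], "us": 1_000_000, "ms": 1_000_000_000}[unit]
    return count * ps_per_unit
```

For instance, 1280 UI is 512 ns at 2.5 GT/s but only 256 ns at 5.0 GT/s.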


How the EIOS is recognized at the Receiver also depends on the encoding scheme. For Gen1/Gen2 mode, a receiver recognizes an EIOS when it sees two of the three IDL Symbols. For Gen3 mode, it's recognized when Symbols 0-3 of the incoming block match the EIOS pattern.

Receiver Exit from Electrical Idle


Receivers detect a voltage difference to signify a resumption of normal signaling. An exit from Electrical Idle will be detected when the differential peak-to-peak voltage exceeds the Electrical Idle Detect threshold, which is allowed to be set between 65 and 175 mV for all data rates.
At 2.5 GT/s nothing more is needed, but at higher rates Receivers don't have to rely on this detection circuit except when receiving EIEOS during certain LTSSM states or during the four EIE Symbols that precede transmission of an FTS sequence at 5.0 GT/s. The number and timing of EIEOSs to facilitate detection of Electrical Idle exit depends on the Link state. For more on this, see "Active State Power Management (ASPM)" on page 735.
In Electrical Idle, the Receiver's PLL loses clock synchronization. When the Transmitter exits Electrical Idle, it sends FTSs to exit from L0s, or TS1/TS2s to exit from all other Link states. Doing so supplies the needed transition density for the CDR logic to resynchronize the receiver PLL and achieve Bit Lock and Symbol Lock or Block Alignment.
Figure 16-13 illustrates the Link state transitions and highlights the transitions between L0, L0s, and L1. Note that there is no direct path from L0s to L1, so the Link must be returned to the L0 state before changing between them.

Figure 16-13: ASPM Link State Transitions

(State diagram showing the states L0, L0s, L1, Recovery, L2/L3 Ready, L2, L3, and LDn.)


The Link Capability register specifies a device's support for Active State Power Management. Figure 16-14 illustrates the ASPM Support field within this register. In earlier spec versions, not all 4 options were available, but the 2.1 spec filled in all of them. Note that bit 22 indicates whether all the options are available.

Figure 16-14: ASPM Support

(Link Capabilities Register, showing the Port Number field, the ASPM Optionality Compliance bit, and the Active State PM Support field.)

Active State PM Support encoding:
00b | No ASPM Support
01b | L0s Supported
10b | L1 Supported
11b | L0s & L1 Supported

Software can enable and disable ASPM via the Active State PM Control field of the Link Control Register as illustrated in Figure 16-15 on page 744. The possible settings are listed in Table 16-18 on page 743. Note: The spec recommends that ASPM be disabled for all components in a path used for Isochronous transactions if the additional latencies associated with ASPM exceed the limits of the isochronous transactions.

Table 16-18: Active State Power Management Control Field Definition

Setting | Description

00b | L0s and L1 ASPM disabled

01b | L0s enabled and L1 disabled

10b | L1 enabled and L0s disabled

11b | Both L0s and L1 enabled
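The support and control encodings line up bit for bit (bit 0 for L0s, bit 1 for L1), which makes it easy to derive a control setting that never enables an unsupported state. The sketch below is an illustration only: the helper names are hypothetical, and the field positions (Active State PM Support in Link Capabilities bits 11:10, Active State PM Control in Link Control bits 1:0) are taken from the PCIe register layout rather than from the text of this section.

```python
ASPM_L0S = 0b01
ASPM_L1 = 0b10


def aspm_support(link_capabilities: int) -> int:
    """Extract the Active State PM Support field (assumed at bits 11:10)."""
    return (link_capabilities >> 10) & 0b11


def aspm_control_setting(support: int, want_l0s: bool, want_l1: bool) -> int:
    """Build a control setting, dropping any state the device does not support."""
    wanted = (ASPM_L0S if want_l0s else 0) | (ASPM_L1 if want_l1 else 0)
    return wanted & support


def with_aspm_control(link_control: int, setting: int) -> int:
    """Return a Link Control value with bits 1:0 replaced by the setting."""
    return (link_control & 0xFFFF & ~0b11) | (setting & 0b11)
```

A real driver would typically also consider what the Link partner supports and the exit latencies involved before enabling either state.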

Figure 16-15: Active State PM Control Field

(Link Control Register, bits 15 down to 0: RsvdP, Link Autonomous Bandwidth Interrupt Enable, Link Bandwidth Management Interrupt Enable, Hardware Autonomous Width Disable, Enable Clock Power Management, Extended Synch, Common Clock Configuration, Retrain Link, Link Disable, Read Completion Boundary Control, RsvdP, Active State PM Control.)

L0s State
L0s is a Link power state that can only be entered under hardware control and is applied to a single direction of the Link. For example, a large volume of traffic in conventional PC-based systems results from Functions sending data to main system memory. As a result, the upstream lanes carry heavy traffic while the downstream lanes may carry very little. These downstream lanes can enter the L0s state to conserve power during stretches of idle bus time.


Entry into L0s


A Transmitter initiates a change from L0 to L0s after detecting a period of idle time that is implementation specific.

Entry into L0s. Entry is managed for a single direction of the Link based on detecting a period of Link idle time. Ports are required to enter L0s after detecting idle time of no greater than 7 µs.

Idle is defined differently for Endpoints and Switches. The reason for this is a desire to minimize recovery time as Link recovery time propagates through Switches. For example, if a Switch upstream port was in a low-power state and now sees activity, it means that a TLP is probably on its way down to the Switch. Where will the packet need to be routed? It will go to one of the downstream ports, but rather than wait to receive the packet and determine which port will be the target before starting to wake it up, the lowest-latency approach would be to wake all the downstream ports so that the one that turns out to be the target will be ready as quickly as possible.

Basic rules regarding idle time:
• Endpoint Port or Root Port:
  - No TLPs are pending transmission, or a lack of Flow Control credits is temporarily blocking them.
  - No DLLPs are pending transmission.
• Upstream Switch Port:
  - The receive lanes of all downstream ports are already in L0s.
  - No TLPs are pending transmission, or a lack of Flow Control credits is temporarily blocking them.
  - No DLLPs are pending transmission.
• Downstream Switch Port:
  - The Switch's Upstream Port's Receive Lanes are in L0s.
  - No TLPs are pending transmission, or a lack of Flow Control credits is temporarily blocking them.
  - No DLLPs are pending transmission.
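The idle rules above reduce to a simple predicate. The following is a hedged sketch, not a description of any real implementation: the PortStatus fields and function name are hypothetical, and the Switch-specific conditions are passed in as a single flag.

```python
from dataclasses import dataclass


@dataclass
class PortStatus:
    tlps_pending: bool            # at least one TLP is waiting to be sent
    tlps_blocked_by_fc: bool      # pending TLPs are stalled for lack of FC credits
    dllps_pending: bool           # at least one DLLP is waiting to be sent


def transmitter_idle(status: PortStatus, switch_condition_met: bool = True) -> bool:
    """Evaluate the L0s idle rules for one port.

    switch_condition_met covers the extra Switch rule: for an upstream Switch
    port, the receive lanes of all downstream ports are in L0s; for a downstream
    Switch port, the upstream port's receive lanes are in L0s. Endpoints and
    Root Ports leave it at True.
    """
    tlp_idle = (not status.tlps_pending) or status.tlps_blocked_by_fc
    return switch_condition_met and tlp_idle and not status.dllps_pending
```

In hardware this condition would feed a timer so that L0s entry happens within the 7 µs limit once the port has been continuously idle.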

The Transaction and Data Link Layers are unaware of whether the Physical Layer transmitter has entered L0s, but the idle conditions that trigger a transition to L0s must be continuously reported from the Transaction and Link layers to the Physical Layer so it can make timely choices about this. Note that a port must always tolerate L0s on its receiver, even if software has disabled ASPM. This allows a device at the other end of the Link that is enabled for ASPM to still transition one side of the Link to the L0s state.


Flow Control Credits Must be Delivered. One situation that qualifies as idle time is a pending TLP that is blocked due to insufficient FC credits. When flow control credits are received that allow delivery of the pending TLP, the transmitting port must initiate a return to L0. Also, if the receive buffer associated with the transmitter in L0s makes additional flow control credits available, the transmitter must return to L0 and deliver the FC_Update DLLP to the neighbor.

Transmitter Initiates Entry to L0s. When sufficient idle time has been observed by a Transmitter, it forces a transition from L0 to L0s by sending an electrical idle ordered set (EIOS) to the receiver and stopping transmission. The transmitter and receiver are now in their electrical idle states and have reduced power consumption. Synchronization between the transmitter and receiver has been lost and retraining will be required for recovery. The spec requires that the PLL logic in the receiver must remain active (powered) to allow quick recovery from L0s back to L0.

Exit from L0s State


If the transmitter detects that the idle condition is no longer true, it must initiate the exit from L0s to L0. The spec encourages designers to monitor events that give an early indication that an L0s exit is imminent and start the recovery process to speed up the transition back to L0. For example, if the Receiver of the port receives a non-posted Request, the Transmitter knows that it will soon be asked to send a Completion in response. Consequently, the Transmitter can go ahead and start the exit process so the Link state is L0 by the time it is asked to deliver the Completion.

Transmitter Initiates L0s Exit. To exit L0s, the Transmitter sends one or more Fast Training Sequence (FTS) Ordered Sets. The number of these required by the Link partner's Receiver was communicated earlier during Link training (N_FTS field in the TS1s and TS2s used in training). After sending the requested number of FTSs, one SOS is delivered. The receiver should be able to establish bit lock and symbol lock or Block lock, and should be ready to resume normal operation.

Actions Taken by Switches that Receive L0s Exit. A switch that receives an L0s to L0 transition sequence on one port may also need to initiate an L0s exit on other ports. Two specific cases are considered:

• Switch Downstream Port Receives L0s to L0 transition. The switch must signal an L0s to L0 transition on its upstream port if it is currently in the L0s state because the packet coming up from the Endpoint or downstream switch will most likely need to go upstream to the Root Complex.
• Switch Upstream Port Receives L0s to L0 transition. The switch must signal an L0s to L0 transition on all downstream ports currently in the L0s state because it doesn't want to wait until the packet arrives to begin waking the target path.

Switch ports that were put into L1 by a software change to the device power state remain unaffected by L0s to L0 transitions. However, once the upstream Link has completed the transition to L0, a subsequent transaction may target this port, causing a transition from L1 to L0.

L1 ASPM State
The optional L1 ASPM state provides deeper power savings than L0s, but has a greater recovery latency. This state results in both directions of the Link going into the L1 state and results in Link and Transaction layer deactivation within each device.
Entry into this state is requested by an upstream port, such as from an Endpoint or the upstream port of a switch (upstream ports are shaded as shown in Figure 16-16). The downstream port responds to this request and either agrees to go into L1 or rejects the request through a negotiation process with the downstream component. Exiting L1 ASPM can be initiated by either the downstream or upstream port.

Figure 16-16: Only Upstream Ports Initiate L1 ASPM


Downstream Component Decides to Enter L1 ASPM


The spec does not precisely define all conditions under which an Endpoint or upstream port of a switch decides to attempt entry into the L1 ASPM state but does suggest that one case might be when both sides of the Link have been in L0s for a preset amount of time. The requirements given include:

• ASPM L1 entry is supported and enabled
• Device-specific requirements for entering L1 have been satisfied
• No TLPs are pending transmission
• No DLLPs are pending transmission
• If the downstream component is a switch, then all of the switch's downstream ports must be in the L1 or higher power conservation state before the upstream port can initiate L1 entry.

Negotiation Required to Enter L1 ASPM


Because of the longer latency required to recover from L1 ASPM, a negotiation process is employed to ensure that the port at the other end of the Link is enabled for L1 ASPM and is prepared to enter it. The negotiation involves sending several packets:

• PM_Active_State_Request_L1: DLLP issued by the downstream component (from its upstream port) to start the negotiation process.
• PM_Request_Ack: DLLP returned by the upstream component (from its downstream port) when all of its requirements to enter L1 ASPM have been satisfied.
• PM_Active_State_Nak: message TLP returned by the upstream component when it is unable to enter the L1 ASPM state.

The upstream component may or may not accept the transition to the L1 ASPM state. The following scenarios describe a variety of circumstances that result in both conditions.

Scenario 1: Both Ports Ready to Enter L1 ASPM State


Figure 16-17 on page 750 summarizes the sequence of events that must occur to
enable transition to the L1 ASPM state. This scenario assumes that all
transactions have completed in both directions and no new transaction requirements
emerge during the negotiation.

Downstream Component Requests L1 State. If the downstream component
wishes to transition to the L1 state, it can send the request to enter
L1 after the following steps have completed:


1. TLP scheduling is blocked at the Transaction Layer.
2. The Link Layer has received acknowledgement for the last TLP it had
   previously sent and the replay buffer is empty.
3. Sufficient flow control credits are available to allow transmission of the
   largest possible packet for any FC type. This ensures that the
   component can issue a TLP immediately upon exiting the L1 state.

The downstream component then delivers the PM_Active_State_Request_L1
to notify the upstream component of the request to enter the L1 state.
This is sent repeatedly until the upstream component responds with either
a PM_Request_Ack DLLP or a PM_Active_State_Nak message.

Upstream Component Response to L1 ASPM Request. Downstream
ports (i.e., ports of an upstream component that face downward)
must accept a request to enter a low power L1 state if all of the following
conditions are true:

• The Port supports ASPM L1 entry and is enabled to do so
• No TLP is scheduled for transmission
• No Ack or Nak DLLP is scheduled for transmission

Upstream Component Acknowledges Request to Enter L1. The
upstream component sends a PM_Request_Ack to notify the downstream
component of its agreement to enter the L1 ASPM state after it:

1. Blocks scheduling of any new TLPs.
2. Receives acknowledgement for the last TLP previously sent (meaning its
   replay buffer is empty).
3. Ensures enough flow control credits are available to send the largest
   possible packet for any FC type so that it can issue a TLP immediately
   after exiting the L1 state.

The upstream component then sends PM_Request_Ack continuously until
it detects the EIOS on its receive lanes, indicating that the downstream
device has entered Electrical Idle.

Downstream Component Sees Acknowledgement. When the downstream
component sees the PM_Request_Ack, it stops sending the
PM_Active_State_Request_L1, disables DLLP and TLP transmission, sends
the EIOS and places its transmit lanes into Electrical Idle.

Upstream Component Receives Electrical Idle. When the upstream
component receives the EIOS, it stops sending the PM_Request_Ack DLLP,
disables DLLP and TLP transmission, sends EIOS and places its own
transmit lanes into Electrical Idle.
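
The Scenario 1 handshake can be sketched as a small tick-based simulation. This is an illustrative model only: the state names (`REQUESTING`, `WAITING`, `ACKING`, `IDLE`), the `link_delay` parameter, and the function itself are assumptions, and the preparation steps (blocking TLPs, emptying the replay buffer, checking credits) are taken as already complete on both sides.

```python
def simulate_l1_entry(link_delay=2):
    """Tick-based sketch of the Scenario 1 handshake.

    A packet sent at tick t is seen by the other side at tick t + link_delay.
    Returns a list of (tick, sender, packet) in transmission order.
    """
    if link_delay < 1:
        raise ValueError("link_delay must be at least 1 tick")

    in_flight = []                       # (arrival_tick, receiver, packet)
    log = []
    down, up = "REQUESTING", "WAITING"   # hypothetical state names

    def send(tick, sender, packet):
        receiver = "up" if sender == "down" else "down"
        in_flight.append((tick + link_delay, receiver, packet))
        log.append((tick, sender, packet))

    tick = 0
    while not (down == "IDLE" and up == "IDLE"):
        arrivals = [(r, p) for (t, r, p) in in_flight if t == tick]
        for receiver, packet in arrivals:
            if receiver == "up" and packet == "PM_Active_State_Request_L1" and up == "WAITING":
                up = "ACKING"            # upstream agrees and starts acknowledging
            elif receiver == "down" and packet == "PM_Request_Ack" and down == "REQUESTING":
                down = "IDLE"            # downstream stops requesting, goes idle
                send(tick, "down", "EIOS")
            elif receiver == "up" and packet == "EIOS" and up == "ACKING":
                up = "IDLE"              # upstream stops acknowledging, goes idle
                send(tick, "up", "EIOS")
        # Both DLLPs are repeated every tick until the expected response arrives.
        if down == "REQUESTING":
            send(tick, "down", "PM_Active_State_Request_L1")
        if up == "ACKING":
            send(tick, "up", "PM_Request_Ack")
        tick += 1
    return log
```

The resulting log shows the request repeated until the acknowledgement arrives, the acknowledgement repeated until the EIOS arrives, and the downstream component entering Electrical Idle before the upstream component does.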

Figure 16-17: Negotiation Sequence Required to Enter L1 Active State PM
[Figure: protocol layers of the upstream component (top) and downstream
component (bottom). Steps 1-4: the downstream component blocks new TLP
scheduling, receives the ACK for its last TLP (Retry Buffer empty), confirms
FC credits for a maximum-sized transaction, and sends
PM_Active_State_Request_L1 continuously. Steps 5-9: the upstream component
receives the request, performs the same preparation, and sends PM_Request_Ack
continuously. Steps 10-13: the downstream component receives the
acknowledgement, disables TLP and DLLP transmission, sends the Electrical Idle
ordered set and goes to Electrical Idle; the upstream component receives it,
disables transmission, and places its transmit lanes into Electrical Idle.]

Scenario 2: Upstream Component Transmits TLP Just Prior to Receiving L1 Request

This scenario presumes that the upstream component has just been instructed
by its core logic to send a TLP downstream before it receives the request to enter
L1 from the downstream device. Several negotiation rules define the actions to
ensure that this situation is managed correctly.


TLP Must Be Accepted by Downstream Component. Note that after
the downstream device sends the PM_Active_State_Request_L1 DLLP it must wait
for a response from the upstream component. While waiting, the
downstream component must be able to accept TLPs and DLLPs from the
upstream device. Although it won't send any TLPs, it must be able to send
DLLPs as needed, such as ACKs for incoming TLPs. In this case, two
possibilities exist:
• An ACK is returned to verify successful receipt of the TLP.
• A NAK is returned if a TLP transmission error is detected. The resulting
  retry of the TLP is allowed during the L1 negotiation.
Upstream Component Receives Request to Enter L1. The spec
requires that the upstream component immediately accept or reject the
request to enter the L1 state. However, it further states that prior to sending
a PM_Request_Ack it must:
1. Block scheduling of new TLPs.
2. Wait for acknowledgement of the last TLP previously sent, if necessary,
   and retry TLPs that receive a NAK, unless a Link Acknowledgement
   timeout condition occurs.

Once all outstanding TLPs have been acknowledged, and all other
conditions are satisfied, the upstream device must return a PM_Request_Ack
DLLP.

Scenario 3: Downstream Component Receives TLP During Negotiation

During the negotiation sequence the downstream device may be instructed to
send a new TLP upstream. However, a device that begins the L1 ASPM
negotiation process must block new TLP scheduling. This prevents a race condition
between going into L1 and sending a new TLP that would prevent entry into L1.
Consequently, once the downstream device has scheduled delivery of the
PM_Active_State_Request_L1 it must complete the transition to L1 if a
PM_Request_Ack is received. Sending a new TLP will have to wait until L1 has
been entered, after which the device can initiate a transition from L1 back to
L0 to send the TLP.

Scenario 4: Upstream Component Receives TLP During Negotiation

If the upstream component needs to send a TLP or DLLP after sending the
PM_Request_Ack, it must first complete the transition to L1. It can then initiate
a change from L1 to L0 to send the packet.


Scenario 5: Upstream Component Rejects L1 Request


Figure 16-18 on page 752 summarizes the negotiation sequence when the
upstream component rejects the request to enter the L1 ASPM state. The
negotiation begins normally as the downstream component requests L1. However, the
upstream device returns a PM_Active_State_Nak TLP to reject the request. The
reasons for rejecting the request to enter L1 include:

• L1 ASPM not supported or software has not enabled this feature
• One or more TLPs are scheduled for transfer across the Link
• ACK or NAK DLLPs are scheduled for transfer

Once the rejection message has been sent, the upstream component can
continue sending TLPs and DLLPs as needed. The rejection tells the downstream
component that L1 is not an option at present, and so it must transition to L0s
instead, if possible.
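
The accept-or-reject decision and the downstream component's reaction can be sketched as two small functions. The function names and boolean parameters are hypothetical; they simply restate the rejection reasons listed above.

```python
def upstream_response(l1_supported_and_enabled, tlps_scheduled, ack_nak_scheduled):
    """Pick the upstream component's reply to PM_Active_State_Request_L1."""
    if l1_supported_and_enabled and not tlps_scheduled and not ack_nak_scheduled:
        return "PM_Request_Ack"          # DLLP: agree to enter L1
    return "PM_Active_State_Nak"         # message TLP: reject the request


def downstream_next_link_state(response, l0s_conditions_met):
    """Link state the downstream component heads for after the reply."""
    if response == "PM_Request_Ack":
        return "L1"
    # Rejected: fall back to L0s if its conditions are met, otherwise stay in L0.
    return "L0s" if l0s_conditions_met else "L0"
```

For example, an upstream port with a TLP already scheduled answers with `PM_Active_State_Nak`, and a downstream component whose L0s conditions are met then moves its transmit side to L0s.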

Figure 16-18: Negotiation Sequence Resulting in Rejection to Enter L1 ASPM State
[Figure: protocol layers of the upstream component (top) and downstream
component (bottom). Steps 1-4: the downstream component blocks TLP scheduling
at the Transaction Layer, receives the ACK for its last TLP (Retry Buffer
empty), confirms FC credits for a maximum-sized transaction, and sends
PM_Active_State_Request_L1 continuously until a response is received. Steps
5-6: the upstream component receives the request and sends a
PM_Active_State_Nak TLP. Steps 7-8: the downstream component receives the
PM_Active_State_Nak and transitions its transmit Link to L0s, assuming all
conditions are met.]


Exit from L1 ASPM State


Either component can initiate the transition from L1 back to L0 when it needs to
use the Link. The procedure is the same in either case and doesn't involve any
negotiation. When switches are involved in exiting from L1 the spec requires
that other switch ports in the ASPM low power states must also transition to the
L0 state if they are in the possible path of the packet that will be sent. These
issues are discussed in subsequent sections.

L1 ASPM Exit Signaling. The spec states that exit from L1 is invoked by
exiting electrical idle, which begins by sending TS1s. The receiving port
responds by sending TS1s back to the originating device and the Physical
Layer follows its LTSSM protocol to complete the Recovery state and return
the Link to L0. Refer to "Recovery State" on page 571 for details.

Switch Receives L1 Exit from Downstream Component. As pictured
in Figure 16-19, the Switch must respond to L1 exit on the
downstream port by returning TS1s and, within 1 µs (measured from the L1
exit being signaled on the downstream port), it must also exit L1 on its
upstream Link if it was in that state.


Figure 16-19: Switch Behavior When Downstream Component Signals L1 Exit
[Figure: topology of Root Complex, Switch F, Switch C, and Endpoints A, B, D,
and E. Sequence shown: 1. EP B signals L1 exit to Switch C. 2. Switch C signals
L1 exit to EP B. 3. Within 1 µs of step 2, Switch C signals L1 exit to
Switch F. 4. Switch F signals L1 exit to Switch C. 5. Within 1 µs of step 4,
Switch F signals L1 exit to RC. 6. RC signals L1 exit to Switch F.]

Presumably the reason the downstream component is transitioning back to
L0 is because it's preparing to send a TLP upstream. Since L1 exit latencies
are relatively long, a switch must not wait until its Downstream Port Link
has fully exited to L0 before initiating an L1 exit transition on its Upstream
Port Link. This prevents accumulated latencies that would otherwise
result if all L1 to L0 transitions occurred in a sequential fashion.

Switch Receives L1 Exit from Upstream Component. In this case,
the switch must respond with TS1s back upstream, and within 1 µs it must
also send TS1s to all downstream ports that are in the L1 ASPM state to
return them to L0. As in the previous example, the goal is to minimize the
overall exit latency of returning to the L0 state for every Link in the path
from the initiator to the target of the transaction. Figure 16-20 on page 755
summarizes these requirements. The Link between Switch F and Endpoint
(EP) E is in the L1 state because software put EP E into the D1 state, which
caused the Link to transition to L1. Only Links in the L1 ASPM state are
transitioned to L0 as a result of the Root Complex (RC) initiating the exit
from L1 ASPM.

Figure 16-20: Switch Behavior When Upstream Component Signals L1 Exit
[Figure: same topology as Figure 16-19. Sequence shown: 1. RC signals L1 exit
to Switch F. 2. Switch F signals L1 exit to RC. 3. Within 1 µs of step 2,
Switch F signals L1 exit to EP D and Switch C. 4a. Switch C signals L1 exit to
Switch F. 4b. EP D signals L1 exit to Switch F. 5. Within 1 µs of step 4a,
Switch C signals L1 exit to EP B. 6. EP B signals L1 exit to Switch C. The
Links to EP E (D1) and EP A (D3) are in L1 because of their device power
states and are not awakened.]


ASPM Exit Latency


PCI Express provides mechanisms to ensure that the ASPM exit latencies for
L0s and L1 don't exceed the requirements of the devices. All devices report their
L0s and L1 exit latencies, and Endpoints also report the total acceptable latency
they can tolerate when performing accesses to and from the Root
Complex. This acceptable latency is based on the data buffer size within the device.
If the chain of devices that reside between the Endpoint and target device has
a total latency that exceeds the acceptable latency reported by the Endpoint,
software can disable ASPM for that Endpoint.

The exit latencies reported by a device will change depending on whether the
devices on each end of a Link share a common reference clock or not.
Consequently, the Link Status register includes a bit called Slot Clock that specifies
whether the component uses an external reference clock provided by the
platform, or an independent reference clock (perhaps generated internally).
Software checks these bits in devices at both ends of each Link to determine
whether they both use the platform clock and thus share a common clock. If so,
software sets the Common Clock bit in both devices to report this.
Figure 16-21 on page 757 illustrates the registers and related bit fields
involved in managing the ASPM exit latency.

Reporting a Valid ASPM Exit Latency


Because the clock configuration affects the exit latency that a device will
experience, devices must report the source of their reference clock via the Slot Clock
status bit within the Link Status register. This bit is initialized by the component
to report the source of its reference clock. If this bit is set to 1, the component
uses the platform-generated reference clock, and if it's cleared (0) an
independent clock is used.

If system firmware or software determines that both components on the Link
use the platform clock then the reference clocks within both devices will be in
phase. This results in shorter exit latencies from L0s and L1, and is reported in
the Common Clock field of the Link Control register. Components must then
update their reported exit latencies to reflect the correct value. Note that if the
clocks are not common then the default values will be correct and no further
action is required.

L0s Exit Latency Update. Exit latency for L0s is reported in the Link
Capability register based on the default assumption that a common clock
implementation does not exist. L0s exit latency is also reported in the TS1s
used during Link training as the number of FTS Ordered Sets (N_FTS)
required to exit L0s. If software then detects a common clock
implementation, it sets the Common Clock field and writes to the Retrain Link bit
in the Link Control register to force Link training to repeat. During retraining
new N_FTS values are reported in the TS1s and in the L0s Latency field of the
Link Capability register.

L1 Exit Latency Update. Following Link retraining, new values will also
be reported in the L1 Latency field.
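
The software procedure for detecting a common clock can be sketched as follows. This is a simplified model under stated assumptions: each port is a hypothetical dict holding register values rather than real configuration space, the bit positions (Slot Clock Configuration at Link Status bit 12, Common Clock Configuration at Link Control bit 6, Retrain Link at Link Control bit 5) should be checked against the spec, and retraining is assumed to be requested through the upstream component's port.

```python
SLOT_CLOCK_CONFIG = 1 << 12    # Link Status: port uses the platform reference clock
COMMON_CLOCK_CONFIG = 1 << 6   # Link Control: both ends share a reference clock
RETRAIN_LINK = 1 << 5          # Link Control: request Link retraining


def configure_common_clock(upstream_component_port, downstream_component_port):
    """Set Common Clock on both ends and request retraining if both ends
    report the platform clock. Returns True if retraining was requested.

    Each argument is a hypothetical dict with 'link_status' and
    'link_control' integer values standing in for the real registers.
    """
    ports = (upstream_component_port, downstream_component_port)
    if not all(p["link_status"] & SLOT_CLOCK_CONFIG for p in ports):
        return False               # default (non-common) latencies remain valid
    for p in ports:
        p["link_control"] |= COMMON_CLOCK_CONFIG
    # Assumption: retraining is requested on the upstream component's port.
    upstream_component_port["link_control"] |= RETRAIN_LINK
    return True
```

After retraining, software would reread the L0s and L1 exit latency fields in the Link Capability register, since the values reported before retraining assumed independent clocks.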

Figure 16-21: Config. Registers for ASPM Exit Latency Management and Reporting

Calculating Latency from Endpoint to Root Complex


Figure 16-22 on page 759 illustrates an Endpoint whose transactions must
traverse two switches to reach the Root Complex. Presuming that all Links in the
path are in the L1 state, let's take the example that Endpoint B needs to send a
packet to main memory.

1. First, it begins the wake sequence by initiating a TS1 ordered set on its Link
   at time T. The L1 exit latency for EP B is a maximum of 8 µs, but Switch C
   has a maximum exit latency of 16 µs. Therefore, the exit latency for this Link
   is 16 µs.
2. Within 1 µs of detecting the L1 exit on Link B/C, Switch C signals L1 exit on
   Link C/F at T+1 µs.
3. Link C/F completes its exit from L1 in 16 µs, at T+17 µs.
4. Switch F signals an exit from L1 to the Root Complex within 1 µs of
   detecting L1 exit from Switch C (T+2 µs).
5. Link F/RC completes exit from L1 in 8 µs, completing at T+10 µs.
6. Total latency to transition path to target back to L0 = T+17 µs.
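
The arithmetic in the steps above can be reproduced with a short calculation. This is a sketch under the same assumptions as the example: each Link's exit latency is the larger of the two values reported by the ports at its ends, and each switch starts the exit on its upstream Link 1 µs after detecting the exit on its downstream Link. The function name and argument layout are illustrative.

```python
def total_l1_exit_latency_us(links, switch_forward_delay_us=1):
    """Worst-case time (in µs after T) for the whole path to return to L0.

    links: list of (lower_port_latency_us, upper_port_latency_us) tuples,
    ordered from the initiating Endpoint toward the Root Complex.
    """
    finish_times = []
    start = 0
    for lower, upper in links:
        link_latency = max(lower, upper)         # slower end dominates the Link
        finish_times.append(start + link_latency)
        start += switch_forward_delay_us         # next Link starts 1 µs later
    return max(finish_times)


# Links B/C, C/F, and F/RC from the example: EP B = 8, Switch C = 16,
# Switch F = 8, Root Complex = 8.
example_path = [(8, 16), (16, 8), (8, 8)]
print(total_l1_exit_latency_us(example_path))    # prints 17
```

The Links finish at T+16, T+17, and T+10 µs respectively, so the overlap of exits means the total is set by Link C/F rather than by the sum of all three latencies.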


Figure 16-22: Example of Total L1 Latency
[Figure: topology of Root Complex, Switch F, Switch C, and Endpoints A, B, D,
and E, with all Links in L1. Reported L1 latencies: EP B 8 µs, Switch C 16 µs,
Switch F 8 µs, RC 8 µs. Timeline: Link B/C starts L1 exit at T and takes 16 µs
(done at T+16); Link C/F starts at T+1 and takes 16 µs (done at T+17);
Link F/RC starts at T+2 and takes 8 µs (done at T+10).]


Software Initiated Link Power Management


When software initiates configuration writes to change the power state for
power conservation, devices must respond by transitioning their Link to the
correspondinglowpowerstate.

D1/D2/D3Hot and the L1 State


The spec requires that when all Functions within a device have been placed into
any of the low power states (D1, D2, or D3hot), the device must initiate a
transition to the L1 state as shown in Figure 16-23. A device returns to L0 as a result of
software initiating a configuration access to the device or a device-initiated
event.

Figure 16-23: Devices Transition to L1 When Software Changes their Power Level from D0
[Figure: Link state diagram showing L0, L0s, L1, L2/L3 Ready, L2, and L3.]

Upon receiving a configuration write to the Power State field of the PMCSR
register, a device initiates the change from L0 to L1 by sending a PM_Enter_L1
DLLP to the upstream component.

Entering the L1 State


The procedure to place the Link into an L1 state is illustrated in Figure 16-24 on
page 762. The steps in the figure are described in greater detail in the following
list:

1. Once a device recognizes that all its Functions are in a low power state
   (D2 in this example), it must prepare to transition the Link into L1. This
   begins with blocking new TLPs from being scheduled.


2. A TLP from the downstream Endpoint may not have been acknowledged
   prior to receiving the request to enter D2. The device must not
   respond to a request to change the Link power until all outstanding TLPs
   have been acknowledged. In other words, the Replay Buffer must be empty
   before proceeding to the L1 state.
3. Because of the long latencies involved in returning the Link to its active
   state, a device must be able to send a maximum-sized TLP immediately
   upon return to the active state. Since a lack of Flow Control credits could
   block this, the Endpoint must have sufficient credits to permit transmission
   of the biggest packet supported for each Flow Control type before entering
   L1.
4. When the requirements listed above have been met, the Endpoint sends a
   PM_Enter_L1 DLLP to the upstream device. This instructs the upstream
   component to put the Link into L1. The PM_Enter_L1 is repeated until a
   PM_Request_Ack DLLP is received from the upstream device.
5. When the upstream component receives PM_Enter_L1, it begins its
   preparation by performing steps 6, 7, and 8. This is the same preparation as
   performed by the downstream component prior to signaling the L1 transition.
6. All new TLP scheduling is blocked.
7. In the event that a previous TLP has not yet been acknowledged, the
   upstream device will wait until all transactions in the Replay Buffer have
   been acknowledged.
8. Sufficient Flow Control credits must be accumulated to ensure that the
   largest TLP can be transmitted for each Flow Control type.
9. The upstream component sends a PM_Request_Ack DLLP to confirm that
   it's ready to enter the L1 state. This DLLP is repeated until an Electrical Idle
   ordered set is received, indicating that it's been accepted.
10. When the downstream component receives the acknowledgement, it sends
    an EIOS and places its transmit lanes into electrical idle (transmitter is in
    Hi-Z state).
11. The upstream component recognizes the EIOS and places its transmit lanes
    into electrical idle. The Link has now entered the L1 state.


Figure 16-24: Procedure Used to Transition a Link from the L0 to L1 State
[Figure: protocol layers of the upstream component (top) and downstream
component (bottom). Steps 1-4: the downstream component blocks new TLP
scheduling, receives the ACK for its last TLP (Retry Buffer empty), confirms
FC credits for a maximum-sized transaction, and sends PM_Enter_L1 continuously
until PM_Request_Ack is received. Steps 5-9: the upstream component receives
PM_Enter_L1, performs the same preparation, and sends PM_Request_Ack
continuously until an Electrical Idle ordered set is received. Steps 10-13:
the downstream component disables TLP and DLLP transmission, sends the
Electrical Idle ordered set, and goes to Electrical Idle; the upstream
component receives it, disables transmission, and places its transmit lanes
into Electrical Idle.]

Exiting the L1 State


An exit from the L1 state can be initiated by either the upstream or downstream
component, as discussed below. This section also summarizes the signaling
protocol used to exit L1.

Upstream Component Initiates. Software may need to use a device
which is currently in a low power state, and that means the Power
Management software must issue a configuration write to change its power state
back to D0. When the configuration Request is ready to be sent from the
upstream component (a Root Port or downstream Switch Port) the port will
exit the electrical idle state and initiate retraining to return the Link to the
L0 state. Once the Link is active, the configuration write can be delivered to
the device to transition it back to D0, at which point it's ready for normal
use.

Downstream Component Initiates L1 to L0 Transition. In the L1
state the reference clock and power are still applied to devices on the Link.
That allows a downstream device to be designed to monitor external events
and trigger a Power Management Event (PME) when one occurs. In
conventional PCI, this is reported by a sideband PME# signal, and system board
logic usually uses it to generate an interrupt that informs the CPU of the
need to bring the device back to full operation. PCIe eliminates the
sideband signal and instead sends an in-band message to report the PME (see
"The PME Message" on page 769 for details).

The L1 Exit Protocol. In the L1 state both directions of the Link are in the
electrical idle state. A device signals an exit from L1 by changing from
electrical idle and sending TS1s. When the Link neighbor detects the exit from
electrical idle it sends TS1s back. This sequence triggers both devices to
enter the Recovery state and, when that has completed its operation, both
devices will have returned to the L0 state.

L2/L3 Ready: Removing Power from the Link

Once software has placed all Functions within a Device into the D3hot state,
power can be safely removed from the device. A typical application for this
would be to place all devices in the system into D3 and then remove power
from them all to achieve the lowest power consumption. However, the spec
does not give details of the actual mechanism that would be used to remove
clock and power or require that a particular sequence be followed, allowing for
a variety of implementations.

The state transitions to prepare devices for power removal involve the
preliminary steps of entering L1 and then returning to L0 before arriving at the L2/L3
Ready state as illustrated in Figure 16-25 on page 764.


Figure 16-25: Link State Transitions Associated with Preparing Devices
for Removal of the Reference Clock and Power

L2/L3 Ready Handshake Sequence


The spec does require a handshake sequence when transitioning to the L2/L3
Ready state. This ensures that all devices are ready for reference clock and
power removal, and also that in-band PME messages being sent to the Root
Complex won't accidentally be lost when power is removed.

Consider the following example of the handshake sequence required for
removing the reference clock and power from PCIe devices in the fabric. This example
assumes a system-wide power down is being initiated, but the sequence can
also apply to individual devices. The steps are summarized below and shown
in Figure 16-26 on page 766. The overall sequence is represented in two parts
labeled A and B. The Link state transitions involved in the complete sequence
include:

• L0 -> L1 (when software places a device into D3)
• L1 -> L0 (when software initiates a PME_Turn_Off message)
• L0 -> L2/L3 Ready (resulting from the completion of the PME_Turn_Off
  handshake sequence, which culminates in a PM_Enter_L23 DLLP being
  sent by the device and the Link going to electrical idle)

The following steps detail the sequence illustrated in Figure 16-26 on page 766.

1. Power Management software first places all Functions in the PCIe fabric
   into their D3 state.
2. All devices transition their Links to the L1 state when they enter D3.
3. Power Management software initiates a PME_Turn_Off TLP message,
   which is broadcast from all Root Complex ports to all devices. This prevents
   PME Messages from being lost in case they were in progress upstream
   when power was removed. Note that delivery of this TLP causes each Link
   to transition back to L0 so it can be forwarded downstream.
4. All devices must receive and acknowledge the PME_Turn_Off message by
   returning a PME_TO_ACK TLP message while in the D3 state.
5. Switches collect the PME_TO_ACK messages from all of their enabled
   downstream ports and forward just one aggregated PME_TO_ACK
   message upstream toward the Root Complex. That's because these messages
   have the routing attribute set as "Gather and Route to the Root".
6. After sending the PME_TO_ACK, each device, once it is ready to have the
   reference clock and power removed, sends a PM_Enter_L23 DLLP repeatedly
   until a PM_Request_Ack DLLP is returned. The Links that enter the L2/L3
   Ready state last are those attached to the device originating the
   PME_Turn_Off message (the Root Complex in this example).
7. The reference clock and power can finally be removed when all Links have
   transitioned to the L2/L3 Ready state, but not sooner than 100 ns after that.
   If auxiliary power (VAUX) is supplied to the devices, the Link transitions
   to L2. If no AUX power is available the Links will be in the L3 state.
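
The "gather" behavior in step 5 can be sketched with a small class. This is an illustration only: the `SwitchAckGatherer` name, its port identifiers, and the return convention are hypothetical, and the model ignores timeouts and disabled ports.

```python
class SwitchAckGatherer:
    """Tracks PME_TO_ACK messages arriving on a switch's downstream ports."""

    def __init__(self, enabled_downstream_ports):
        self.waiting_on = set(enabled_downstream_ports)
        self.forwarded = False

    def receive_ack(self, port):
        """Record an ack from one port.

        Returns True exactly once: when the last outstanding port has
        acknowledged and the single aggregated PME_TO_ACK should be
        forwarded upstream.
        """
        self.waiting_on.discard(port)
        if not self.waiting_on and not self.forwarded:
            self.forwarded = True
            return True
        return False
```

For a switch with two enabled downstream ports, the first `receive_ack` call returns False and the second returns True, so only one message travels upstream regardless of how many devices sit below the switch.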


Figure 16-26: Negotiation for Entering L2/L3 Ready State
[Figure, part A (PME_Turn_Off and PME_TO_ACK messages), all devices in D3:
1. Software has previously placed all Functions into the D3 state and all have
   transitioned their Links to L1 as required.
2. Software generates a PME_Turn_Off broadcast message to temporarily disable
   PME Messages.
3. As the PME_Turn_Off message reaches the downstream root port and the
   downstream ports of each switch, an L1 to L0 transition must occur to
   transmit the message.
4. Each device receives the PME_Turn_Off message and sends a PME_TO_ACK
   message.
5. Switches wait until all downstream ports have sent their ACK message. They
   then return a single aggregate message upstream.
Part B (PM_Enter_L23 DLLP), Links moving from L0 to L2/L3 Ready:
6. After each downstream component has sent the PME_TO_ACK, it sends the
   PM_Enter_L23 DLLP repeatedly until it receives a PM_Request_Ack. This causes
   the downstream device to issue an electrical idle ordered set, after which
   it enters idle. The upstream device detects electrical idle and also enters
   idle. The Link is now in the L2/L3 Ready state.
7. Switches wait until all downstream ports have transitioned to the L2/L3
   Ready state before sending the PM_Enter_L23 DLLP upstream.
8. When all Links attached to the device that originated the PME_Turn_Off have
   entered the L2/L3 Ready state, the reference clock and power can be
   removed, but no sooner than 100 ns after observing L2/L3 Ready on all
   Links.]


Exiting the L2/L3 Ready State: Clock and Power Removed

As illustrated in the state diagram in Figure 16-27, a device exits the L2/L3
Ready state when power is removed and has only two choices. When VAUX is
available the transition is to L2, otherwise the transition is to L3.

Link state transitions are normally controlled by the LTSSM in the Physical
Layer. However, transitions to L2 and L3 result from main power being
removed and the LTSSM is not operational then. Consequently, the spec refers
to L2 and L3 as pseudo-states defined for explaining the resulting condition of a
device when power is removed.

Figure 16-27: State Transitions from L2/L3 Ready When Power is Removed

The L2 State
Some devices are designed to monitor external events and initiate a wakeup
sequence to restore power to handle them. Since main power is removed, these
devices will need a power source like VAUX to be able to monitor the events and
to signal a wakeup.

The L3 State
In this state the device has no power and therefore no means of communication.
Recovery from this state requires the system to restore power and the reference
clock. That causes devices to experience a fundamental reset, after which they'll
need to be initialized by software to return to normal operation.


Link Wake Protocol and PME Generation


The wake protocol provides a method for an Endpoint to reactivate the
upstream Link and request that software return it to D0 so it can perform
required operations. PCIe PM is designed to be compatible with PCI PM
software, although the methods are different.

Rather than using a sideband signal, PCIe devices use an in-band PME message
to notify PM software of the need to return the device to D0. The ability to
generate PME messages may optionally be supported in any of the low power
states. Recall that a device reports which PM states it supports for PME message
delivery.

PME messages can only be delivered when the Link state is L0. The latency
involved in reactivating the Link is based on a device's PM and Link state, but
can include the following:

1. Link is in non-communicating (L2) state: when a Link is in the L2 state it
   cannot communicate because the reference clock and main power have
   been removed. No PME message can be sent until clock and power are
   restored, a Fundamental Reset is asserted, and the Link is retrained. These
   events will be triggered when a device signals a wakeup. This may result in
   all Links being reawakened in the path between the device needing to
   communicate and the Root Complex.
2. Link is in communicating (L1) state: when a Link is in the L1 state, clock
   and main power are still active; thus, a device simply exits the L1 state, goes
   to the Recovery state to retrain the Link, and returns the Link to L0. Once
   the Link is in L0 the PME message is delivered. Note that devices never
   send a PME message while in the L2/L3 Ready state because entry into that
   state only occurs after PME notification has been turned off, in preparation
   for clock and power to be removed. (See "L2/L3 Ready Handshake
   Sequence" on page 764.)
3. PME is delivered (L0): if the Link is in the L0 state, the device transfers
   the PME message to the Root Complex, notifying Power Management
   software that the device has observed an event that requires the device be
   placed back into its D0 state. Note that the message contains the Requester
   ID (Bus#, Device#, and Function#) of the device. This quickly informs
   software which device needs service.


The PME Message


The PME message is delivered by devices that support PME notification. The
message format is illustrated in Figure 16-28 on page 769. The message may be
initiated by a device in a low power state (D1, D2, D3hot, or D3cold) and is sent
immediately upon return of the Link to L0.

Figure 16-28: PME Message Format
[Figure: a PCIe Endpoint sends a PME Message Request TLP through a Switch to
the Root Complex and CPU (Route to Root Complex). TLP framing: Framing (STP),
Sequence Number, Header, Digest, LCRC, Framing (End). Header layout:
Byte 0:  Fmt (0x1), Type (1 0000), TC, Attr, TH, TD, EP, AT, Length
Byte 4:  Requester ID, Tag, Message Code (0001 1000)
Byte 8:  Reserved
Byte 12: Reserved]


The PME message is a Transaction Layer Packet that has the following
characteristics:

• TC and VC are zero (no QoS applies)
• Routed implicitly to the Root Complex
• Handled as a Posted Transaction
• Relaxed Ordering is not permitted, forcing all transactions in the fabric
  between the signaling device and the Root Complex to be delivered to the
  Root Complex ahead of the PME message
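
The header fields listed above can be sketched in code. The following is a minimal illustration, not a definitive implementation: it packs a 4DW PME message header from the values shown in Figure 16-28 and recovers the Requester ID as software would. The function names are hypothetical, and the Fmt value 001b (Message without data) is an assumption, since the figure prints it as 0x1.

```python
import struct

FMT_MSG_NO_DATA = 0b001           # assumed Fmt for a Message without data
TYPE_MSG_TO_ROOT = 0b10000        # Message, routed implicitly to the Root Complex
PM_PME_MESSAGE_CODE = 0b00011000  # Message Code shown in the figure


def build_pme_header(bus, device, function, tag=0):
    """Return the 16-byte PME message header (hypothetical helper)."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus/device/function out of range")
    requester_id = (bus << 8) | (device << 3) | function
    # DW0: Fmt/Type in the first byte; TC, Attr (no Relaxed Ordering) and Length are zero.
    dw0 = ((FMT_MSG_NO_DATA << 5) | TYPE_MSG_TO_ROOT) << 24
    # DW1: Requester ID, Tag, Message Code.
    dw1 = (requester_id << 16) | ((tag & 0xFF) << 8) | PM_PME_MESSAGE_CODE
    # DW2 and DW3 are reserved for this message.
    return struct.pack(">IIII", dw0, dw1, 0, 0)


def requester_of(header):
    """Extract (bus, device, function) from a PME message header."""
    (dw1,) = struct.unpack(">I", header[4:8])
    rid = dw1 >> 16
    return (rid >> 8, (rid >> 3) & 0x1F, rid & 0x7)
```

For example, `requester_of(build_pme_header(3, 0, 1))` yields the Bus, Device, and Function numbers that tell software which device needs service.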

The PME Sequence


Devices may support PME in any of the low-power states as specified in the PM
Capabilities register. This register also specifies the amount of VAUX current
used by the device if it supports wakeup in the D3cold state. The basic sequence
of events associated with sending a PME to software is specified below and
presumes that the device and system are enabled to generate PME and the Link has
already been transitioned to the L0 state:

1. The device issues the PME message on its upstream port.
2. PME messages are implicitly routed to the Root Complex. Switches in the
path transition their upstream ports to L0 if necessary and forward the
packet upstream.
3. A root port receives the PME and forwards it to the Power Management
Controller.
4. The controller informs power management software, typically with an
interrupt. Software uses the Requester ID in the message to read and clear
the PME_Status bit in the PMCSR and return the device to the D0 state.
Depending on the degree of power conservation, the PCI Express driver
may also need to restore the device's configuration registers.
5. PM Software may also call the device driver in the event that device context
was lost as a result of being placed in a low-power state. If so, device
software restores information within the device.
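
Step 4 of the sequence above can be sketched as follows. This is a simplified model under stated assumptions: the PMCSR bit positions (PowerState in bits 1:0, PME_En in bit 8, PME_Status in bit 15 with write-1-to-clear behavior) follow the PCI-PM register definition, while `FakeFunction` and `service_pme` are hypothetical names standing in for real configuration accesses.

```python
PMCSR_POWER_STATE_MASK = 0x0003  # 00b = D0 ... 11b = D3hot
PMCSR_PME_EN = 1 << 8
PMCSR_PME_STATUS = 1 << 15       # write 1 to clear


class FakeFunction:
    """Hypothetical stand-in for one Function's PMCSR register."""

    def __init__(self, pmcsr):
        self.pmcsr = pmcsr

    def read_pmcsr(self):
        return self.pmcsr

    def write_pmcsr(self, value):
        # PME_Status keeps its value unless a 1 is written to it.
        status = self.pmcsr & PMCSR_PME_STATUS
        if value & PMCSR_PME_STATUS:
            status = 0
        self.pmcsr = (value & ~PMCSR_PME_STATUS & 0xFFFF) | status


def service_pme(functions, requester_id):
    """Read and clear PME_Status, and return the Function to D0."""
    fn = functions[requester_id]
    value = fn.read_pmcsr()
    if not value & PMCSR_PME_STATUS:
        return False  # this Function did not signal the PME
    # Keep PME_En and other bits, select D0, and write 1 to clear PME_Status.
    fn.write_pmcsr((value & ~PMCSR_POWER_STATE_MASK) | PMCSR_PME_STATUS)
    return True
```

A real driver would follow this with any needed restoration of configuration registers and device context, as described in steps 4 and 5.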

PME Message Back Pressure Deadlock Avoidance


Background
The Root Complex typically stores the PME messages it receives in a queue, and
calls PM software to handle each one. A PME is held in this queue until PM
software reads the PME_Status bit from the requesting device's PMCSR register.
Once the configuration read transaction completes, this PME message can be
removed from the internal queue.

The Problem
Deadlock can occur if the following scenario develops:

1. Incoming PME Messages have filled the PME message queue, but other
PME messages have been issued by devices downstream of the same root port.
2. PM software initiates a configuration read request from the Root to read
PME_Status from the oldest PME requester.
3. The corresponding split completion must push all previously posted PME
messages ahead of it based on transaction ordering rules.
4. The Root Complex cannot accept a new PME message because the queue is
full, so the path is temporarily blocked. But that also means that the read
completion can't reach the Root Complex to clear the older entry in the
queue.
5. No progress can be made and deadlock occurs.

The Solution
The problem is avoided if the Root Complex always accepts new PME
messages, even when they would overflow the queue. In this case, the Root simply
discards the later PME messages. To prevent a discarded PME message from
being lost permanently, a device that sends a PME message is required to
measure a timeout interval, called the PME Service Timeout. If the device's
PME_Status bit is not cleared within 100 ms (+50%/-5%), it assumes its message
must have been lost and it reissues the message.
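
The interplay of the discard rule and the PME Service Timeout can be illustrated with a small simulation. This is only a sketch: the class names, the queue depth, and the millisecond tick model are assumptions made for the example, and the timeout tolerance is ignored.

```python
from collections import deque

PME_SERVICE_TIMEOUT_MS = 100  # nominal value; the +50%/-5% tolerance is not modeled


class RootPmeQueue:
    """Hypothetical Root Complex queue that never back-pressures the Link."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = deque()
        self.dropped = 0

    def accept(self, requester_id):
        # Always take the message off the Link; discard it if the queue is full.
        if len(self.entries) >= self.depth:
            self.dropped += 1
            return
        self.entries.append(requester_id)

    def service_oldest(self):
        return self.entries.popleft() if self.entries else None


class PmeDevice:
    """Hypothetical device that reissues its PME after the service timeout."""

    def __init__(self, requester_id):
        self.requester_id = requester_id
        self.pme_status = False
        self.sent_at_ms = None
        self.messages_sent = 0

    def _send(self, root, now_ms):
        root.accept(self.requester_id)
        self.sent_at_ms = now_ms
        self.messages_sent += 1

    def signal_pme(self, root, now_ms):
        self.pme_status = True
        self._send(root, now_ms)

    def tick(self, root, now_ms):
        # PME_Status still set after the timeout: assume the message was lost.
        if self.pme_status and now_ms - self.sent_at_ms >= PME_SERVICE_TIMEOUT_MS:
            self._send(root, now_ms)

    def clear_pme_status(self):
        self.pme_status = False
```

With a one-entry queue and two devices signaling at the same time, the second message is discarded, and the second device delivers it again once its timeout expires and the queue has drained.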

The PME Context


Devices that generate PME must continue to power portions of the device that
are used for detecting, signaling, and handling PME events, referred to
collectively as the PME context. Devices that support PME in the D3cold state use
auxiliary power to maintain the PME context when the main power is removed.
Items that are typically part of the PME context include:

• PME_Status bit (required): set when a device sends a PME message and
  cleared by PM software. Devices that support PME in the D3cold state must
  implement the PME_Status bit as "sticky," meaning that the value survives
  a fundamental reset.
• PME_Enable bit (required): this bit must remain set to continue enabling
  a Function's ability to generate PME messages and signal wakeup. Devices
  that support PME in the D3cold state must implement PME_Enable as
  "sticky," meaning that the value survives a fundamental reset.
• Device-specific status information: for example, a device might preserve
  event status information in cases where several different types of events can
  trigger a PME.
• Application-specific information: for example, modems that initiate
  wakeup would preserve Caller ID information if supported.

Waking Non-Communicating Links


When a device that supports PME in the D3cold state needs to send a PME
message, it must first transition the Link to L0. This is sometimes referred to as a
wakeup. PCI Express defines two methods of triggering the wakeup of
non-communicating Links:

• Beacon: an in-band indicator driven by AUX power
• WAKE# Signal: a sideband signal driven by AUX power

In both cases, PM software must be notified to restore main power and the
reference clock. This also causes a fundamental reset that forces a device into the
D0 uninitialized state. Once the Link transitions to L0, the device sends the PME
message. Since a reset is required to re-activate the Link, devices must maintain
PME context across the reset sequence described above.

Beacon
This signaling mechanism is designed to operate on AUX power and doesn't
require much power. The beacon is simply a way of notifying the upstream
component that software should be notified of the wakeup request. When
switches receive a beacon on a downstream port, they in turn signal beacon on
their upstream port. Ultimately, the beacon reaches the root complex, where it
generates an interrupt that calls PM software.

Some form factors require beacon support for waking the system while others
don't. The spec requires compliance with the form-factor specs, and doesn't
require beacon support for devices if their form factor doesn't. However, for
"universal" components designed for use in a variety of form factors, beacon
support is required. See "Beacon Signaling" on page 483 for details.


WAKE#
PCI Express provides a sideband signal called WAKE# as an alternative to the
beacon that can be routed directly to the Root or to other system logic to notify
PM software. In spite of the desire to minimize the pin count of a Link, the
motivation for adding this extra pin is easy to understand. The reason is that a
component must consume auxiliary power to be able to recognize a beacon on a
downstream port and then forward it to an upstream port. In a battery-powered
system auxiliary power is jealously guarded because it drains the battery even
when the system isn't doing any work. The preferred solution in that case
would be to bypass as many components as possible when delivering the
wakeup notification, and the WAKE# pin serves that purpose very well. On the
other hand, if power is not a concern then the WAKE# pin might be considered
less desirable.

A hybrid implementation may also be used. In this case, WAKE# is sent to a
switch, which in turn sends the beacon on its upstream port. The options are
illustrated in Figure 16-29 on page 774, A and B. Note that when asserted, the
WAKE# signal remains low until the PME_Status bit is cleared by software.

This signal must be implemented by ATX or ATX-based connectors and cards as
well as by the mini-card form factor. No requirement is specified for embedded
devices to use the WAKE# signal.


Figure 16-29: WAKE# Signal Implementations

[Figure: two copies (A and B) of a topology with a Root Complex, Switches (F)
and (D), and PCIe Endpoints (C) and (E), all in PM State D3 with their Links in
the L2 State. In A, WAKE# is routed from the card slots to the Root Complex. In
B, WAKE# is routed from the card slots to a switch, and beacon signaling is used
from the switch to the Root Complex.]


Auxiliary Power
Devices that support PME in the D3cold state must support the wakeup
sequence and are allowed by the PCI-PM spec to consume the maximum
auxiliary current of 375 mA (otherwise only 20 mA). The amount of current they need
is reported in the Aux_Current field of the PM Capability registers. Auxiliary
power is enabled when the PME_Enable bit is set within the PMCSR register.

PCI Express extends the use of auxiliary power beyond the limitations given by
PCI-PM. Now, any Device may consume the maximum auxiliary current if
enabled by setting the Aux Power PM Enable bit of the Device Control register,
illustrated in Figure 16-30 on page 775. This gives devices the opportunity to
support other things like SMBus while in a low-power state. As in PCI-PM, the
amount of current consumed by a device is reported in the Aux_Current field in
the PMC register.

Figure 16-30: Auxiliary Current Enable for Devices Not Supporting PMEs

Device Control Register

Bit(s)   Field
15       Bridge Config. Retry Enable / Initiate Function-Level Reset
14:12    Max Read Request Size
11       Enable No Snoop
10       Aux Power PM Enable
9        Phantom Functions Enable
8        Extended Tag Field Enable
7:5      Max Payload Size
4        Enable Relaxed Ordering
3        Unsupported Request Reporting Enable
2        Fatal Error Reporting Enable
1        Non-Fatal Error Reporting Enable
0        Correctable Error Reporting Enable
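
The auxiliary power rules above can be sketched as a small calculation. This is an illustration under stated assumptions, not a definitive implementation: the Aux_Current encoding (PMC bits 8:6) is taken from the PCI-PM spec rather than from this text, PME_Enable is assumed to be PMCSR bit 8, and the function names are hypothetical.

```python
# Aux_Current field encoding in mA (assumed from the PCI-PM spec, PMC bits 8:6).
AUX_CURRENT_MA = {0: 0, 1: 55, 2: 100, 3: 160, 4: 220, 5: 270, 6: 320, 7: 375}

PMCSR_PME_EN = 1 << 8                # PME_Enable in the PMCSR
DEVCTL_AUX_POWER_PM_ENABLE = 1 << 10  # Aux Power PM Enable in Device Control
DEFAULT_AUX_LIMIT_MA = 20            # limit when neither enable bit is set


def aux_current_requested_ma(pmc):
    """Decode the Aux_Current field of the PMC register."""
    return AUX_CURRENT_MA[(pmc >> 6) & 0x7]


def aux_current_allowed_ma(pmc, pmcsr, devctl):
    """Current the Function may draw from AUX power under the rules above."""
    requested = aux_current_requested_ma(pmc)
    if (pmcsr & PMCSR_PME_EN) or (devctl & DEVCTL_AUX_POWER_PM_ENABLE):
        return requested
    return min(requested, DEFAULT_AUX_LIMIT_MA)
```

The second enable path is the PCI Express extension: a device that never generates PMEs can still be granted its reported auxiliary current through the Device Control register.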


Improving PM Efficiency

Background
As processors and other system components acquire better power management
mechanisms, peripherals like PCIe components start to appear as a bigger
contributor to power consumption in PC systems. Earlier generations of PCIe
allowed some software and hardware power management, but coordinating
PM decisions with the system was not a high priority and consequently
software visibility and control was limited.

One problem that can arise from this lack of coordination happens when the
system goes into a sleep state but the devices remain operational. Such devices
can initiate interrupts or DMA traffic that would require the system to wake up
to handle them, even though they were low-priority events, and thus defeat the
goal of power conservation.

It can also happen that the system is unaware of how long the devices can
afford to wait from the time they request system service (like a memory read)
until they get a response. Without that information, software is often forced to
assume that the response time must always be minimal and therefore power
management policies can't afford enough time to do much. However, if the
system was aware of time windows when a fast response was not needed, it could
be more aggressive with power management and stay in a low-power state for a
longer time without risking performance problems. The 2.1 spec revision added
two new features to address these problems.

OBFF (Optimized Buffer Flush and Fill)


The first of these mechanisms is Optimized Buffer Flush and Fill, which
provides a mechanism for Endpoints to be made aware of the system power state
and therefore the best times to do data transfers to and from the system.

The Problem
The problem with bus-master-capable devices is that if they're not aware of the
system power status, they may initiate transactions at times when it would be
better to wait. The diagram in Figure 16-31 on page 777 illustrates the problem
in simple terms: there are many components initiating events and as a result,
the times without activity when the system is idle and can go to sleep are few
and short-lived. In contrast, Figure 16-32 on page 777 illustrates an
improvement in which the same events are grouped and serviced together so that the
times when the system is idle enough to go to sleep are both more frequent and
of longer duration. Clearly, this would result in better power conservation and
fortunately, it's not difficult to implement. PCIe components simply need to
understand what they should do based on the system power state, and they'll
need a way to learn what that state currently is.

Figure 16-31: Poor System Idle Time

[Figure: timeline of System Events and of Endpoint A, Endpoint B, and
Endpoint C Events against Time, leaving two System Idle Windows.]

Figure 16-32: Improved System Idle Time

[Figure: the same timeline of System Events and Endpoint A, B, and C Events
against Time, with three System Idle Windows.]

LTR could also be used to inform system software of acceptable latency for
the endpoints between accesses, suggesting a limit on this idle time.


The Solution
OBFF is an optional hint that a system can use to inform components about
optimal time windows for traffic. It's just a hint, though, so bus-master-capable
devices can still initiate traffic whenever they like. Of course, power
consumption will be negatively affected if they do, so overriding the OBFF hints should
be avoided as much as possible. The information is communicated in one of two
ways: by sending messages to the Endpoints or by toggling the WAKE# pin. If
both options are available, using the pin is strongly recommended because it
avoids the counter-productive step of using excess power, possibly across
several Links, to inform a component about the current system power state. In fact,
the OBFF message should only be used if the WAKE# pin is not available.
Figure 16-33 on page 778 gives an example showing a mix of both
communication types. Using the pin is required if it's available, but in this example it's not
an option between the two switches. To work around this problem, the upper
switch can translate the state received on the WAKE# pin into a message going
downstream. It should perhaps be noted here that switches are strongly
encouraged to forward all OBFF indications downstream but not required to do so. It
may be necessary, especially when using messages, to discard or collapse some
indications and that is permitted.

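Figure 16-33: OBFF Signaling Example

[Figure: a Root Complex above an upper Switch and Endpoints, and a lower
Switch with its own Endpoints. WAKE# is used at the upper level, an OBFF
Message is sent from the upper Switch to the lower Switch, and WAKE# is used
again at the lower Switch.]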


Using the WAKE# Pin. This pin, previously only used to inform the
system that a component needed to have power restored, is given an extra
meaning as the simplest and lowest-power option for communicating
system power status to PCIe components. It's optional, and the protocol is
fairly simple: the WAKE# pin toggles to communicate the system state. As
seen in Figure 16-34 on page 779, there are several transitions but only three
states, which are described below:

1. CPU Active: system awake; all transactions OK. This is every
component's initial state.
2. OBFF: system memory path available; transfers to and from memory
are OK, but other transactions should wait for a higher power state.
3. Idle: wait for a higher state before initiating.

Figure 16-34: WAKE# Pin OBFF Signaling

Transition Event               OBFF Message Code
Idle to OBFF                   OBFF
Idle to CPU Active             CPU Active
OBFF or CPU Active to Idle     Idle
OBFF to CPU Active             CPU Active
CPU Active to OBFF             OBFF

When the CPU Active or OBFF state is indicated, it's recommended that the
platform not return to the Idle state for at least 10 μs so as to give
components enough time to deliver the packets they may have been queuing up
while in the previous Idle state. However, since that timing isn't required,
it's also recommended that Endpoints not assume they'll have a certain
amount of time in a CPU Active or OBFF window. Along the same lines, the
platform is allowed to indicate that it's going to Idle before it actually does
so as to give components advance notice that it's time to finish. The case this
early notice is specifically designed to avoid is having an Endpoint start a
transfer just as the platform goes to Idle, causing an immediate exit from
the Idle state. The spec strongly recommends that this should be the only
reason for an early indication of the Idle state and also that this advance
notice time should be as short as possible.

Interestingly, the WAKE# pin can still be used for its original purpose of
allowing a component to wake the system, and it's no surprise that this
might confuse other components that are monitoring that pin for OBFF
information. That could result in sub-optimal behavior in power or
performance, but this is considered a recoverable situation so no steps were taken
to guard against it. To cover all of these cases, any time the signal is unclear
the default state will be CPU Active.

Using the OBFF Message. As mentioned earlier, OBFF information can
be communicated using a message, although it's recommended that this only
be used if the WAKE# pin is not available. These messages only flow
downstream from the Root. The message contents are shown in Figure 16-35 on
page 781, including the Routing type 100b (point-to-point) and an OBFF
Code that gives the following values (all other codes are reserved):

1. 1111b: CPU Active
2. 0001b: OBFF
3. 0000b: Idle

If a reserved code is received, components must treat it as CPU Active. If
a Port receives an OBFF message but doesn't support OBFF or hasn't
enabled it yet, it must treat it as an Unsupported Request (Completion
status UR).
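
The OBFF Code handling described above can be sketched as a small decoder. The names `ObffState`, `decode_obff_code`, and `should_initiate` are hypothetical; the sketch only encodes the three defined codes, the rule that reserved codes are treated as CPU Active, and the traffic guidance given for each state.

```python
from enum import Enum


class ObffState(Enum):
    IDLE = 0b0000
    OBFF = 0b0001
    CPU_ACTIVE = 0b1111


def decode_obff_code(code):
    """Map the 4-bit OBFF Code to a state; reserved codes mean CPU Active."""
    try:
        return ObffState(code & 0xF)
    except ValueError:
        return ObffState.CPU_ACTIVE


def should_initiate(state, targets_memory):
    """Hint only: True when the indicated state favors starting a transfer now."""
    if state is ObffState.CPU_ACTIVE:
        return True            # all transactions OK
    if state is ObffState.OBFF:
        return targets_memory  # only transfers to and from memory
    return False               # Idle: wait for a higher state
```

Because OBFF is a hint, a device with urgent traffic may still ignore a False result, at the cost of waking the platform.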


Figure 16-35: OBFF Message Contents

Byte 0    Fmt = 001, Type = 10100 (Point-to-Point), TC, TH, TD, EP, Attr, AT,
          Length
Byte 4    Requester ID, Tag, Message Code = 0001 0010
Byte 8    Reserved for Error Messages
Byte 12   Reserved for Error Messages, OBFF Code

OBFF Code:
0000b = Idle
0001b = OBFF
1111b = CPU Active

Support for OBFF is indicated via the Device Capability 2 register (Figure
16-36 on page 782), and enabled using the Device Control 2 register (Figure
16-37 on page 783). Note that both the pin and message options may be
available. However, the pin method is preferred because it is the
lower-power option.

Note that there are two variations for enabling a component to forward
OBFF messages, and the difference between them has to do with handling a
targeted Link that's not in L0. In Variation A, the message will only be sent
if the Link is in L0. If it's not, the message is simply dropped to avoid the
cost of waking the Link. This is preferred for Downstream Ports when the
Device below it is not expected to have time-critical communication
requirements and can indicate its need for non-urgent attention by simply
returning the Link to L0. For Variation B, the message will always be
forwarded and the Link will be returned to L0. This variation is preferred
when the downstream Device can benefit from timely notification of the
platform state.


Figure 16-36: OBFF Support Indication

Device Capability 2 Register

Bit(s)   Field
31:24    RsvdP
23:22    Max End-End TLP Prefixes
21       End-End TLP Prefix Supported
20       Extended Fmt Field Supported
19:18    OBFF Support
17:14    RsvdP
13:12    TPH Completer Supported
11       LTR Mechanism Supported
10       No RO-enabled PR-PR Passing
9        128-bit CAS Completer Supported
8        64-bit AtomicOp Completer Supported
7        32-bit AtomicOp Completer Supported
6        AtomicOp Routing Supported
5        ARI Forwarding Supported
4        Completion Timeout Disable Supported
3:0      Completion Timeout Ranges Supported

OBFF Support:
00 Not supported
01 Message only
10 WAKE# only
11 Both

When using WAKE#, enabling any Root Port to assert it is considered a
global enable unless there are multiple WAKE# signals, in which case only
those associated with that Port are affected. When using the OBFF message,
enabling a Root Port only enables the messages on that Port. The
expectation in the spec is that all Root Ports would normally be enabled if any of
them are, so as to ensure that the whole platform was enabled. However,
selectively enabling some Ports and not others is permitted.

When enabling Ports for OBFF, the spec recommends that all Upstream
Ports be enabled before Downstream Ports, and Root Ports be enabled last
of all. For unpopulated hot-plug slots this isn't possible. For that case
enabling OBFF using the WAKE# pin to the slot is permitted, but it's
recommended that the Downstream Port above the slot not be enabled to deliver
OBFF messages.

Figure 16-37: OBFF Enable Register

Device Control 2 Register

Bit(s)   Field
15       End-End TLP Prefix Blocking
14:13    OBFF Enable
12:11    RsvdP
10       LTR Mechanism Enable
9        IDO Completion Enable
8        IDO Request Enable
7        AtomicOp Egress Blocking
6        AtomicOp Requester Enable
5        ARI Forwarding Enable
4        Completion Timeout Disable
3:0      Completion Timeout Value

OBFF Enable:
00 Disabled
01 Enabled with Message signaling Variation A
10 Enabled with Message signaling Variation B
11 Enabled using WAKE# signaling

Finally, let's refer back to the earlier example in Figure 16-33 on page 778 to
consider what these registers might look like for that case. The Downstream
Port of the switch that connects to the lower switch will have a value for
OBFF Support of 01b (Message Only), while its Upstream Port might have a
value of 11b (Both). These values might be hard-coded into the device or
hardware-initialized in some other fashion to make them visible to software
after a reset. The Downstream Port would need to have an OBFF Enable
value of 01b or 10b (Enabled with Message variation A or B) so it could
deliver an OBFF message. The Upstream Port would expect to have an
OBFF Enable value of 11b (Enabled with WAKE# signaling). The spec points
out that when a switch is configured to use the different methods when
going from one Port to another, it's required to make the translation and
forward the indications.
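
The register values in this example can be checked with a short decoder. This is a sketch under the assumption that OBFF Support occupies bits 19:18 of Device Capability 2 and OBFF Enable occupies bits 14:13 of Device Control 2, as the bit numbering in the two figures suggests; the function names are hypothetical.

```python
OBFF_SUPPORT = {
    0b00: "Not supported",
    0b01: "Message only",
    0b10: "WAKE# only",
    0b11: "Both",
}
OBFF_ENABLE = {
    0b00: "Disabled",
    0b01: "Message signaling Variation A",
    0b10: "Message signaling Variation B",
    0b11: "WAKE# signaling",
}


def obff_support(devcap2):
    """Decode OBFF Support from Device Capability 2 (bits 19:18 assumed)."""
    return OBFF_SUPPORT[(devcap2 >> 18) & 0x3]


def obff_enable(devctl2):
    """Decode OBFF Enable from Device Control 2 (bits 14:13 assumed)."""
    return OBFF_ENABLE[(devctl2 >> 13) & 0x3]


def enable_matches_support(devcap2, devctl2):
    """Check that the enabled signaling method is one the Port supports."""
    support = (devcap2 >> 18) & 0x3
    enable = (devctl2 >> 13) & 0x3
    if enable == 0b00:
        return True
    if enable == 0b11:                # WAKE# signaling needs the pin
        return bool(support & 0b10)
    return bool(support & 0b01)       # Variation A or B needs message support
```

For the switch in the example, the Downstream Port would decode as "Message only" with Variation A or B enabled, and the Upstream Port as "Both" with WAKE# signaling enabled.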


LTR (Latency Tolerance Reporting)


The second new feature added to improve PM efficiency is called Latency
Tolerance Reporting (LTR). This optional capability allows devices to report the
delay they can tolerate when requesting service from the platform so that PM
policies for platform resources like main memory can take that into
consideration. If software supports it, this provides good performance for devices when
they need it and lower power for the system when they don't need a fast
response. One simple way of using this information would be to allow the
system to postpone waking up to service a request as long as the latency tolerance
was still met.

The meaning of "latency tolerance" is not made explicitly clear in the spec, but
some things are mentioned that might play into it. For example, the latency
tolerance may affect acceptable performance or it may impact whether the
component will function properly at all. Clearly, such a distinction would make a big
difference in designing a PM policy. Similarly, the device may use buffering or
other techniques to compensate for latency sensitivity and knowledge of that
would be useful for software.

LTR Registers
The LTR capability in a device is discovered using a new bit in the PCIe Device
Capability 2 Register, as shown in Figure 16-38 on page 785, and enabled in the
Device Control 2 Register, illustrated in Figure 16-39 on page 785. The spec
prescribes a sequence for enabling LTR, too: devices closest to the Root must be
enabled first, working down to the Endpoints. An Endpoint must not be
enabled unless its associated Root Port and all intermediate switches also
support LTR and have been enabled to service it. It's permissible for some
Endpoints to support LTR while others do not. If a Root Port or switch Downstream
Port receives an LTR message but doesn't support it or hasn't been enabled yet,
the message must be treated as an Unsupported Request. It's recommended that
Endpoints send an LTR message shortly after being enabled to do so. It's
strongly recommended that Endpoints not send more than two LTR messages
within any 500 μs period unless required by the spec. However, if they do,
Downstream Ports must properly handle them and not generate an error based
on that.


Figure 16-38: LTR Capability Status

[Figure: Device Capability 2 Register, with the same layout as Figure 16-36;
LTR Mechanism Supported is bit 11.]

Figure 16-39: LTR Enable

[Figure: Device Control 2 Register, with the same layout as Figure 16-37;
LTR Mechanism Enable is bit 10.]

The target for LTR information is the Root Complex. Participating downstream
devices all report their values but the Port just uses the smallest value that was
reported as the latency limit for all devices accessed through that Port. The Root
is not required to honor requested service latencies but is strongly encouraged
to do so.


LTR Messages
The LTR message itself has the format shown in Figure 16-40 on page 788,
where it can be seen that the Routing type is 100b (point-to-point) and the LTR
message code is 0001 0000b. Two latency values are reported, one for Requests
that must be snooped and another for Requests that will not be snooped and
therefore should complete more quickly. As seen in the diagram, the format for
both is the same and includes the following fields:

• Latency Value and Scale combine to give a value in the range from 1 ns to
  about 34 seconds. Setting these fields to all zeros indicates that any delay
  will affect the device and thus the best possible service is requested. The
  meaning of the latency is defined as follows:
  - For Read Requests, it's the delay from sending the END symbol in the
    Request TLP until receiving the STP symbol in the first Completion TLP
    for that Request.
  - For Write Requests, it relates to Flow Control back-pressure. If a write
    has been issued but the next write can't proceed due to a lack of Flow
    Control credits, the latency is the time from the last symbol of that write
    (END) until the first symbol of the DLLP that gives more credits (SDP).
    In other words, this represents the time within which the Root Port
    should be able to accept the next write.
• Requirement can be set for none, or one, or both to indicate whether that
  latency value is required. If a device doesn't implement one of these traffic
  types or has no service requirements for it, then this bit must be cleared for
  the associated field. If a device has reported requirements but has since
  been directed into a device power state lower than D0, or if its LTR Enable
  bit has been cleared, the device must send another LTR message reporting
  that these latencies are no longer required.
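
The Latency Value and Scale arithmetic can be worked through in code. The sketch below assumes the 16-bit field layout shown in Figure 16-40 (Requirement in bit 15, Latency Scale in bits 12:10, Latency Value in bits 9:0) and scale multipliers of 1, 32, 1K, 32K, 1M, and 32M ns; the function names, and the choice to round down when encoding, are illustrative assumptions.

```python
# Multiplier in ns for Latency Scale codes 000b through 101b.
LTR_SCALE_NS = [32 ** s for s in range(6)]
LTR_MAX_VALUE = 0x3FF        # 10-bit Latency Value
LTR_REQUIREMENT = 1 << 15


def decode_ltr_field(field):
    """Return (required, latency_ns) for one 16-bit Snoop/No-Snoop field."""
    scale = (field >> 10) & 0x7
    if scale >= len(LTR_SCALE_NS):
        raise ValueError("Latency Scale code not permitted")
    value = field & LTR_MAX_VALUE
    return bool(field & LTR_REQUIREMENT), value * LTR_SCALE_NS[scale]


def encode_ltr_field(latency_ns, required=True):
    """Encode a latency with the finest scale that fits in 10 bits.

    The value is rounded down so the encoded tolerance never exceeds
    what the device can actually accept.
    """
    if latency_ns < 0:
        raise ValueError("latency must be non-negative")
    for scale, unit in enumerate(LTR_SCALE_NS):
        value = latency_ns // unit
        if value <= LTR_MAX_VALUE:
            return (LTR_REQUIREMENT if required else 0) | (scale << 10) | value
    raise ValueError("latency too large to encode")
```

The largest encodable latency is 1023 x 32M ns = 34,326,183,936 ns, which is the "about 34 seconds" upper bound mentioned above.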

Guidelines Regarding LTR Use


Endpoints have a few guidelines regarding the use of LTR:
1. It's recommended that they send an updated LTR message every time their
service requirements change, and the spec spends some time going over
examples of this. The bottom line here is that devices need to take all the
delays into account when making a change to the service requirements.
That accounting includes time for the reference clock to be restored if it was
turned off, for the Link to be brought back to L0, for the LTR message to be
delivered, and for the platform to prepare to handle the new requirement.
2. If the latency tolerance is being reduced, it's recommended that the LTR
message be sent far enough ahead of the first associated Request to ensure
that the platform is ready.
3. If the latency tolerance is being increased, then the LTR message to report
that should immediately follow the final Request that used the previous
latency value.
4. To achieve the best overall platform power efficiency, it's recommended that
Endpoints buffer Requests as much as they can and then send them in
bursts that are as long as the Endpoint can support.

Multi-Function Devices (MFDs) have a few rules of their own. For example,
they must send a "conglomerated" LTR message as follows:
1. Reported latency values must reflect the lowest values associated with any
Function. The snoop and no-snoop latencies could be associated with
different Functions, but if none of them have a requirement for snoop or
no-snoop traffic, then the requirement bit for that type must not be set.
2. MFDs must send a new LTR message upstream if any of the Functions
changes its values in a way that affects the conglomerated value.

Switches have a similar set of rules related to LTR. Basically, they collect the
messages from Downstream Ports that have been enabled to use LTR and send
a "conglomerated" message upstream according to the following rules:
1. If the Switch supports LTR, it must support it on all of its Ports.
2. The Upstream Port is allowed to send LTR messages only when the LTR
Enable bit is set or shortly after software has cleared it so it can report that
any previous requirements are no longer in effect.
3. The conglomerated LTR value is based on the lowest value reported by any
participating Downstream Port. If the Requirement bit is clear, or an invalid
value is reported, the latency is considered effectively infinite.
4. If any Downstream Port reports that an LTR value is required, the
Requirement bit will be set for that type in the LTR message forwarded upstream.
5. The LTR values reported upstream must take into account the latency of the
Switch itself. If the Switch latency changes based on its operational mode, it
must not be allowed to exceed 20% of the minimum value reported on all
Downstream Ports. The value reported on the Upstream Port is the
minimum reported value on all the Downstream Ports minus the Switch's own
latency, although the value can't be less than zero.
6. If a Downstream Port goes to DL_Down status, previous latencies for that
Port must be treated as invalid. If that changes the conglomerated values
upstream then a new message must be sent to report that.
7. If a Downstream Port's LTR Enable bit is cleared, any latencies associated
with that Port must be considered invalid, which may also result in a new
LTR message being sent upstream.
8. If any Downstream Ports receive new LTR values that would change the
conglomerated value, the Switch must send a new LTR message upstream
to report that.


Finally, the Root Complex also has a few rules related to LTR:
1. The RC is allowed to delay processing of a device Request as long as it
satisfies the service requirements. One application of this might be to buffer up
several Requests from an Endpoint and service them all in a batch.
2. If the latency requirements are updated while a series of Requests is in
progress, the new values must be comprehended by the RC prior to
servicing the next Request, and within less time than the previously reported
latency requirements.

Figure 16-40: LTR Message Format

Byte 0    Fmt = 001, Type = 10100 (Point-to-Point), TC = 000, TD, EP,
          Attr = 00, AT = 00, Length (Reserved)
Byte 4    Requester ID, Tag, Message Code = 0001 0000
Byte 8    Reserved
Byte 12   No-Snoop Latency, Snoop Latency

Format of each latency field:

Bit(s)   Field
15       Requirement
14:13    Rsv
12:10    Latency Scale
9:0      Latency Value

Scale:
000 - x 1 ns      001 - x 32 ns
010 - x 1K ns     011 - x 32K ns
100 - x 1M ns     101 - x 32M ns
110 - x not permitted


LTR Example
To illustrate the concepts discussed so far, consider the example topology
shown in Figure 16-41 on page 789. Here, the Endpoint on the lower left has
delivered an LTR message to the Switch reporting a Snoop Latency requirement
of 1200 ns. At this point, none of the other Endpoints connected to the Switch
has reported an LTR value, so that becomes the conglomerated value to be
reported upstream. However, the Switch has an internal latency of 50 ns so that
must be subtracted from the value to be reported, resulting in the Upstream
Port sending an LTR message reporting 1150 ns to the Root Port.

Figure 16-41: LTR Example

[Figure: the lower-left Endpoint reports 1200 ns; the conglomerate value at the
Switch is 1200 ns and at the Root Port is 1150 ns.]

Next, the Legacy Endpoint delivers an LTR message with a large latency
requirement of 5000 ns, as shown in Figure 16-42 on page 790. Since this is larger
than the current conglomerate value for the Switch, no LTR message is sent for
this case.


Figure 16-42: LTR Change but no Update

[Figure: the Legacy Endpoint reports 5000 ns; the conglomerate value at the
Switch remains 1200 ns and at the Root Port remains 1150 ns.]

In the next stage, the middle Endpoint reports its LTR value as 700 ns. This is
smaller than the current conglomerate value, so the Switch calculates the new
value of 650 ns by subtracting its internal latency and forwards that upstream as
an LTR message. That makes the current latency requirement for that Root Port
650 ns, as seen in Figure 16-43 on page 791.

Finally, the Link to the middle Endpoint stops working for some reason as
shown in Figure 16-44 on page 791, and the Switch Port reports DL_Down.
Consequently, the LTR value for that Port must be considered invalid. Since its value
was being used as the current conglomerate value, the conglomerate will be
updated to the lowest value that is still valid, which is the 1200 ns reported by
the left-most Endpoint. The Switch will then subtract its internal latency and
report 1150 ns to the Root Port with a new LTR message.


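Figure 16-43: LTR Change with Update

[Figure: the middle Endpoint reports 700 ns; the conglomerate value at the
Switch becomes 700 ns and at the Root Port becomes 650 ns.]

Figure 16-44: LTR Link Down Case

[Figure: the Link to the middle Endpoint is down; the conglomerate value at
the Switch changes from 700 ns back to 1200 ns and at the Root Port becomes
1150 ns.]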
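
The four stages of this example can be replayed with a small model of the Switch rules. This is a sketch only: `SwitchLtr` and its methods are hypothetical names, a single latency type is tracked, and a Port with no valid requirement is treated as effectively infinite.

```python
import math


class SwitchLtr:
    """Hypothetical model of a Switch conglomerating one LTR latency type."""

    def __init__(self, internal_latency_ns):
        self.internal_latency_ns = internal_latency_ns
        self.port_values = {}            # Downstream Port -> valid latency in ns
        self.last_reported = math.inf    # nothing reported upstream yet

    def _conglomerate(self):
        lowest = min(self.port_values.values(), default=math.inf)
        # Subtract the Switch's own latency, but never go below zero.
        return max(lowest - self.internal_latency_ns, 0)

    def _update(self):
        """Return the value sent upstream, or None if no message is needed."""
        new_value = self._conglomerate()
        if new_value == self.last_reported:
            return None
        self.last_reported = new_value
        return new_value

    def report(self, port, latency_ns):
        """A Downstream Port received an LTR message with a required latency."""
        self.port_values[port] = latency_ns
        return self._update()

    def dl_down(self, port):
        """A Downstream Port went to DL_Down; its latency becomes invalid."""
        self.port_values.pop(port, None)
        return self._update()
```

With a 50 ns internal latency, reporting 1200 ns on the left Port sends 1150 ns upstream, reporting 5000 ns on the Legacy Endpoint's Port sends nothing, reporting 700 ns on the middle Port sends 650 ns, and DL_Down on the middle Port sends 1150 ns again.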


17 Interrupt Support
The Previous Chapter
The previous chapter provides an overall context for the discussion of system
power management and a detailed description of PCIe power management,
which is compatible with the PCI Bus PM Interface Spec and the Advanced
Configuration and Power Interface (ACPI) spec. PCIe defines extensions to the
PCI-PM spec that focus primarily on Link Power and event management. An
overview of the OnNow Initiative, ACPI, and the involvement of the Windows
OS is also provided.

This Chapter
This chapter describes the different ways that PCIe Functions can generate
interrupts. The old PCI model used pins for this, but sideband signals are
undesirable in a serial model, so support for the in-band MSI (Message
Signaled Interrupt) mechanism was made mandatory. The PCI INTx# pin operation
can still be emulated using PCIe INTx messages for software backward
compatibility reasons. Both the PCI legacy INTx# method and the newer
versions of MSI/MSI-X are described.

The Next Chapter
The next chapter describes three types of resets defined for PCIe: Fundamental
reset (consisting of cold and warm reset), hot reset, and function-level reset
(FLR). The use of a sideband reset PERST# signal to generate a system reset is
discussed, as is the in-band TS1-based Hot Reset.


Interrupt Support Background

General
The PCI architecture supported interrupts from peripheral devices as a means
of improving their performance and offloading the CPU from the need to poll
devices to determine when they require servicing. PCIe inherits this support
largely unchanged from PCI, allowing software backwards compatibility to
PCI. We provide a background to system interrupt handling in this chapter, but
the reader who wants more details on interrupts is encouraged to look into
these references:

• For PCI interrupt background, refer to the PCI spec rev 3.0 or to chapter 14
  of MindShare's textbook: PCI System Architecture (www.mindshare.com).
• To learn more about Local and IO APICs, refer to MindShare's textbook: x86
  Instruction Set Architecture.

Two Methods of Interrupt Delivery


PCI used sideband interrupt wires that were routed to a central interrupt
controller. This method worked well in simple, single-CPU systems, but had some
shortcomings that motivated moving to a newer method called MSI (Message
Signaled Interrupts) with an extension called MSI-X (eXtended).

Legacy PCI Interrupt Delivery. This original mechanism defined for the PCI
bus consists of up to four signals per device, or INTx# (INTA#, INTB#, INTC#,
and INTD#), as shown in Figure 17-1 on page 795. In this model, the pins are
shared by wire-ORing them together, and they'd eventually be connected to an
input on the 8259 PIC (Programmable Interrupt Controller). When a pin is
asserted, the PIC in turn asserts its interrupt request pin to the CPU as part
of a process described in "The Legacy Model" on page 796.

PCIe supports this PCI interrupt functionality for backward compatibility, but
a design goal for serial transports is to minimize the pin count. As a result,
the INTx# signals were not implemented as sideband pins. Instead, a Function
can generate an in-band interrupt message packet to indicate the assertion or
deassertion of a pin. These messages act as "virtual wires", and target the
interrupt controller in the system (typically in the Root Complex), as shown in
Figure 17-2 on page 796. This picture also illustrates how an older PCI device
using the pins can work in a PCIe system; the bridge translates the assertion
of a pin into an interrupt emulation message (INTx) going upstream to the Root
Complex. The expectation is that PCIe devices would not normally need to use
the INTx messages but, at the time of this writing, in practice they often do
because system software has not been updated to support MSI.

Figure 17-1: PCI Interrupt Delivery

[Figure labels: Bus Devices (INTA, INTB); PCI-to-PCI Bridge; IRQ inputs;
Slave Interrupt Controller; Master Interrupt Controller; Interrupt to CPU]

MSI Interrupt Delivery. MSI eliminates the need for sideband signals by
using memory writes to deliver the interrupt notification. The term "Message
Signaled Interrupt" can be confusing because its name includes the term
"Message", which is a type of TLP in PCIe, but an MSI interrupt is a Posted
Memory Write instead of a Message transaction. MSI memory writes are
distinguished from other memory writes only by the addresses they target, which
are typically reserved by the system for interrupt delivery (e.g., x86-based
systems traditionally reserve the address range FEEx_xxxxh for interrupt
delivery).

Figure 17-2 illustrates the delivery of interrupts from various types of PCIe
devices. All PCIe devices that generate interrupts are required to support MSI
(or MSI-X), but system software may not support it, in which case the INTx
messages would be used. Figure 17-2 also shows how a PCIe-to-PCI Bridge is
required to convert sideband interrupts from connected PCI devices to
PCIe-supported INTx messages.


Figure 17-2: Interrupt Delivery Options in PCIe System

[Figure labels: CPU; Root Complex with Interrupt Controller; Memory; PCIe
Switch; PCIe Endpoint and Legacy Endpoint, each delivering "MSI or INTx
Message"; Bridge to PCI or PCI-X, receiving INTx# from a PCI/PCI-X device and
delivering an INTx Message]

The Legacy Model

General
To illustrate the legacy interrupt delivery model, refer to Figure 17-3 on page
797 and consider the usual steps involved in interrupt delivery using the
legacy method of interrupt pins:

1. The device generates an interrupt by asserting its pin to the controller. In
   older systems this controller was typically an Intel 8259 PIC arrangement
   (a cascaded master and slave pair) that had 15 IRQ inputs and one INTR
   output. The PIC would then assert INTR to inform the CPU that one or more
   interrupts were pending.
2. Once the CPU detects the assertion of INTR and is ready to act on it, it
   must identify which interrupt actually needs service, and that is done by
   the CPU issuing a special command on the processor bus called an Interrupt
   Acknowledge.
3. This command is routed by the system to the PIC, which returns an 8-bit
   value called the Interrupt Vector to report the highest priority interrupt
   currently pending. A unique vector would have been programmed earlier by
   system software for each IRQ input.
4. The CPU then uses the vector as an offset into the Interrupt Table (an area
   set up by software to contain the start addresses of all the Interrupt
   Service Routines, ISRs), and fetches the ISR start address it finds at that
   location.
5. That address would point to the first instruction of the ISR that had been
   set up to handle this interrupt. This handler would be executed, servicing
   the interrupt and telling its device to deassert its INTx# line, and then
   would return control to the previously interrupted task.
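
The five steps can be mimicked with a small software model. The sketch below is
a deliberately simplified illustration: the class and function names are
hypothetical, priority is assumed to follow IRQ number (lowest first), and a
pending request is cleared at acknowledge time rather than when the device
deasserts its line.

```python
class Pic:
    """Toy interrupt controller; lower IRQ number = higher priority (assumed)."""

    def __init__(self, vector_for_irq):
        self.vector_for_irq = vector_for_irq  # programmed earlier by software
        self.pending = set()

    def assert_irq(self, irq):
        self.pending.add(irq)                 # step 1: device asserts its pin

    @property
    def intr(self):
        return bool(self.pending)             # INTR output toward the CPU

    def interrupt_acknowledge(self):
        irq = min(self.pending)               # highest-priority pending IRQ
        self.pending.discard(irq)             # simplification, see lead-in
        return self.vector_for_irq[irq]       # step 3: return the vector


def service_one_interrupt(pic, interrupt_table):
    """Model of the CPU side: acknowledge, look up the ISR, run it."""
    if not pic.intr:
        return None
    vector = pic.interrupt_acknowledge()      # steps 2 and 3
    isr = interrupt_table[vector]             # step 4: index the table
    return isr()                              # step 5: execute the ISR


pic = Pic({5: 0x25, 10: 0x2A})
table = {
    0x25: lambda: "ISR for vector 0x25",
    0x2A: lambda: "ISR for vector 0x2A",
}
pic.assert_irq(10)
pic.assert_irq(5)
first = service_one_interrupt(pic, table)
second = service_one_interrupt(pic, table)
print(first, "/", second)
```

With both IRQ 5 and IRQ 10 pending, this model services the vector for IRQ 5
first and the vector for IRQ 10 second.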

Figure 17-3: Legacy Interrupt Example

[Figure labels: CPU (INTR, Interrupt Acknowledge, Interrupt Vector); North
Bridge; Memory holding the Interrupt Table (ISR starting addresses) and the
Interrupt Service Routine (ISR); PCI Bus; South Bridge with Interrupt
Controller (PIC); Device driving INTA#; steps numbered 1 through 5]


Changes to Support Multiple Processors


This model works well for single-CPU systems, but has a limitation that makes
it sub-optimal in a multi-CPU system. The problem is that the INTR pin can
only be connected to one CPU. If multiple processors are present then only one
of them will see the interrupts and will have to service them all while the
other CPUs won't see any of them. To obtain the best performance, such systems
really need an even distribution of the system tasks across all the processors,
referred to as SMP (Symmetric Multi-Processing), but the pin model won't
support it.

To achieve better SMP, a new model was needed, and toward this end the PIC
was modified to become the IO APIC (Advanced Programmable Interrupt
Controller). The IO APIC was designed to have a separate small bus, called the
APIC Bus, over which it could deliver interrupt messages, as shown in Figure
17-4 on page 799. In this model, the message contained the interrupt vector
number, so there was no need for the CPU to send an Interrupt Acknowledge
down into the IO world to fetch it. The APIC Bus connected to a new internal
logic block within the processors called the Local APIC. The bus was shared
among all the agents and any of them could initiate messages on it but, for our
purposes, the interesting part is its use for interrupt delivery from
peripherals. Those interrupts could now be statically assigned by software to
be serviced by different CPUs, multiple CPUs or even dynamically assigned by
the IO APIC.


Figure 17-4: APIC Model for Interrupt Delivery

[Figure labels: two CPUs, each with a Local APIC; APIC bus; North Bridge;
Memory; PCI Bus; South Bridge with Interrupt Controller (IO APIC); Device
driving INTA#]

That model, known as the APIC model, was sufficient for several years but still
depended on sideband pins from the peripheral devices to work. Another
limitation of this model was the number of IRQs (interrupt request lines) into
the IO APIC. Without a very large number of IRQs, peripheral devices had to
share IRQs, which means added latency anytime that IRQ is asserted because
there could be multiple devices that could have asserted it and software must
evaluate all of them. This technique of linking multiple ISRs together was
often referred to as interrupt chaining. Eventually, because of this issue and
a couple other minor issues, another improvement came along.

Why not have the peripheral devices themselves send interrupt messages
directly to the Local APICs? All that is needed is a communications path, which
already exists in the form of the PCI bus and the processor bus. So the APIC
bus was eliminated and all interrupts were delivered to the Local APICs in the
form of memory writes, referred to as MSIs or Message Signaled Interrupts.
These MSIs were targeting a special address that the system understood to be an
interrupt message targeting the Local APICs. (This special address was
traditionally FEEx_xxxxh for x86-based systems.) Even the IO APIC was
programmed to send its interrupt notifications over the ordinary data bus using
memory writes (MSI). Now it simply sends an MSI memory write across the
data bus targeting the memory address of the desired processor's Local APIC,
and that has the effect of notifying the processor of the interrupt.

This model is known as the xAPIC model, and since it is not based on sideband
signals which go into an interrupt controller with a limited number of inputs,
the need to share interrupts is almost eliminated. More information can be
found about this model in "An MSI Solution" on page 827.

PCI added MSI support as an option years ago and PCIe made that capability a
requirement. A peripheral that can generate MSI transactions on its own opens
new options for handling interrupts, such as giving each Function the ability
to generate multiple unique interrupts instead of just one.

Legacy PCI Interrupt Delivery


This section provides more detail on legacy PCI interrupt delivery. Readers
familiar with PCI may wish to proceed to "Virtual INTx Signaling" on page 805
to learn more about how PCIe emulates this legacy model, or to "The MSI
Model" on page 812 to learn more about that method.

PCI devices that use interrupts have two options. They may use either:
• INTx# active-low level signals that can be shared and were defined in the
  original spec.
• Message Signaled Interrupts that were added as an option with the 2.2
  version of the spec. MSI needs no modification for use in a PCIe system.

Device INTx# Pins


A PCI device can implement up to 4 INTx# signals (INTA#, INTB#, INTC#, and
INTD#). More than one pin is available because PCI devices can support up to 8
functions, each of which is allowed to drive one (but only one) interrupt pin.
When PCI was developed, a typical system used a chipset that included the
15-input 8259 PIC, so that's how many IRQs (which map to interrupt vectors)
were available to the system. However, many of those were already used for
system purposes like the system timer, keyboard interrupt, mouse interrupt,
and so on. In addition, some pins were reserved for ISA cards that could still
be plugged into these older systems. Consequently, the PCI spec writers
considered that only four IRQs would reliably be available for their new bus,
and so the spec only supported four interrupt pins. However, as you probably
know, there are typically more than four PCI devices on a PCI bus and even a
single device could have more than four functions inside, each wanting its own
interrupt. These reasons are why the PCI interrupts were designed to be
level-sensitive and shareable. These signals could simply be wire-ORed together
to get down to a handful of resulting outputs, each one representing interrupt
requests. Since they are shared, when an interrupt is detected, the interrupt
handler software will need to go through the list of functions that are sharing
the same pin and test to see which ones need servicing.

Determining INTx# Pin Support


PCI functions indicate support for an INTx# signal in their configuration
headers. The read-only Interrupt Pin register illustrated in Figure 17-5
indicates whether an INTx# is supported by this function and, if so, which
interrupt pin it will assert when requesting an interrupt.

Figure 17-5: Interrupt Registers in PCI Configuration Header

DW   Byte 3        Byte 2        Byte 1           Byte 0
00   Device ID                   Vendor ID
01   Status Register             Command Register
02   Class Code                                   Revision ID
03   BIST          Header Type   Latency Timer    Cache Line Size
04   Base Address 0
05   Base Address 1
06   Base Address 2
07   Base Address 3
08   Base Address 4
09   Base Address 5
10   CardBus CIS Pointer
11   Subsystem ID                Subsystem Vendor ID
12   Expansion ROM Base Address
13   Reserved                                     Capabilities Pointer
14   Reserved
15   Max_Lat       Min_Gnt       Interrupt Pin    Interrupt Line

Interrupt Line (RW access):
00h = IRQ0
01h = IRQ1
02h = IRQ2
03h = IRQ3
04h = IRQ4
05h = IRQ5
 :
FEh = IRQ254
FFh = IRQ255

Interrupt Pin (RO access):
00h = No INTx# pin used
01h = INTA#
02h = INTB#
03h = INTC#
04h = INTD#
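
As a minimal sketch of how software might interpret these two registers, the
hypothetical helper below decodes the last dword of the header (DW 15) using
the encodings shown in Figure 17-5. How that dword is actually read from
config space is platform-specific and is not modeled here.

```python
INTERRUPT_PIN_NAMES = {
    0x00: None,      # no INTx# pin used
    0x01: "INTA#",
    0x02: "INTB#",
    0x03: "INTC#",
    0x04: "INTD#",
}


def decode_interrupt_dword(dw15):
    """Decode DW 15 of the header: byte 0 = Interrupt Line, byte 1 = Pin."""
    line = dw15 & 0xFF
    pin = (dw15 >> 8) & 0xFF
    if pin not in INTERRUPT_PIN_NAMES:
        raise ValueError(f"Interrupt Pin value {pin:#04x} is not defined")
    return {"interrupt_line": line, "interrupt_pin": INTERRUPT_PIN_NAMES[pin]}


info = decode_interrupt_dword(0x0000010B)
print(info)  # {'interrupt_line': 11, 'interrupt_pin': 'INTA#'}
```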


Interrupt Routing
The Interrupt Line register shown in Figure 17-5 on page 801 gives the next
information that a driver needs to know: the input pin of the PIC to which this
pin has been connected. The PIC is programmed by system software with a
unique vector number for each input pin (IRQ). The vector for the highest
priority interrupt asserted is reported to the processor, which then uses that
vector to index into a corresponding entry in the interrupt vector table. This
entry points to the interrupting device's interrupt service routine, which the
processor executes.

The platform designer assigns the routing of INTx# pins from devices. They can
be routed in a variety of ways, but ultimately each INTx# pin connects to an
input of the interrupt controller. Figure 17-6 on page 803 illustrates an
example in which several PCI device interrupts are connected to the interrupt
controller through a programmable router. All signals connected to a given
input of the programmable router will be directed to a specific input of the
interrupt controller. Functions whose interrupts are routed to a common
interrupt controller input will all have the same Interrupt Line number
assigned to them by platform software (typically firmware). In this example,
IRQ15 has three PCI INTx# inputs from different devices connected to it.
Consequently, the functions using these INTx# lines will share IRQ15 and will
therefore all cause the controller to send the same vector when queried. That
vector will have the three ISRs for the different Functions chained together.

Associating the INTx# Line to an IRQ Number


Based on system requirements, the router is programmed to connect its four
inputs to four available PIC inputs. Once this is done, the routing of the
INTx# pin associated with each function is known and the Interrupt Line number
is written by software into each Function. The value is ultimately read by the
Function's device driver so it will know which interrupt table entry it has
been assigned. That's the place where the starting address of its ISR will be
written, a process referred to as "hooking the interrupt". When this function
later generates an interrupt, the CPU will receive the vector number that
corresponds to the IRQ specified in the Interrupt Line register. The CPU uses
this vector to index into the interrupt vector table to fetch the entry point
of the interrupt service routine associated with the Function's device driver.


Figure 17-6: INTx Signal Routing is Platform Specific

[Figure labels: several PCI devices driving INTA# through INTD#;
Programmable Interrupt Router with Input 0# through Input 3#; ISA Slave 8259A
Interrupt Controller (IRQ8 through IRQ15, cascaded through IRQ2); ISA Master
8259A Interrupt Controller (IRQ0, IRQ1, IRQ3 through IRQ7); Interrupt to CPU]

INTx# Signaling
The INTx# lines are active-low signals implemented as open-drain with a
pull-up resistor provided on each line by the system. Multiple devices
connected to the same PCI interrupt request signal line can assert it
simultaneously without damage.

When a Function signals an interrupt it also sets the Interrupt Status bit
located in the Status register of the config header. This bit can be read by
system software to see if an interrupt is currently pending. (See Figure 17-8
on page 805.)

Interrupt Disable. The 2.3 PCI spec added an Interrupt Disable bit (Bit 10)
to the Command register of the config header. See Figure 17-7 on page 804. The
bit is cleared at reset, permitting INTx# signal generation, but software may
set it to prevent that. Note that the Interrupt Disable bit has no effect on
Message Signaled Interrupts (MSI). MSIs are enabled via the Message Control
register in the MSI Capability structure. Enabling MSI automatically has the
effect of disabling interrupt pins or emulation.

Interrupt Status. The PCI 2.3 spec added a read-only Interrupt Status bit to
the configuration status register (pictured in Figure 17-8 on page 805). A
function must set this status bit when an interrupt is pending. In addition, if
the Interrupt Disable bit in the Command register of the header is cleared
(i.e. interrupts enabled), then the function's INTx# signal is asserted when
this status bit is set. This bit is unaffected by the state of the Interrupt
Disable bit.
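
The relationship between these two bits can be written as a small predicate.
The function name below is hypothetical; the bit positions are those shown in
the Command and Status register figures (Interrupt Disable is Command bit 10,
Interrupt Status is Status bit 3).

```python
COMMAND_INTERRUPT_DISABLE = 1 << 10  # Command register, bit 10
STATUS_INTERRUPT_STATUS = 1 << 3     # Status register, bit 3


def intx_asserted(command, status):
    """Return True if the function's INTx# (or virtual wire) is asserted."""
    pending = bool(status & STATUS_INTERRUPT_STATUS)
    disabled = bool(command & COMMAND_INTERRUPT_DISABLE)
    # The status bit reports a pending interrupt regardless of the disable
    # bit; only the signal itself is gated by Interrupt Disable.
    return pending and not disabled


print(intx_asserted(command=0x0000, status=0x0008))  # True
print(intx_asserted(command=0x0400, status=0x0008))  # False
```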

Figure 17-7: Configuration Command Register - Interrupt Disable Field

Bit(s)  Field
15:11   Reserved
10      Interrupt Disable, was Reserved
9       Fast Back-to-Back Enable
8       SERR# Enable
7       Reserved, was Stepping Control
6       Parity Error Response
5       VGA Palette Snoop Enable
4       Memory Write and Invalidate Enable
3       Special Cycles
2       Bus Master
1       Memory Space
0       IO Space



Figure 17-8: Configuration Status Register - Interrupt Status Field

Bit(s)  Field
15      Detected Parity Error
14      Signalled System Error
13      Received Master-Abort
12      Received Target-Abort
11      Signalled Target-Abort
10:9    DEVSEL Timing
8       Master Data Parity Error
7       Fast Back-to-Back Capable
6       Reserved
5       66MHz-Capable
4       Capabilities List
3       Interrupt Status
2:0     Reserved

Virtual INTx Signaling


General
If circumstances make the use of MSI not possible in a PCIe topology, the INTx
signaling model would be used. Following are two examples of devices that
would need to be able to use INTx messages:

• PCIe-to-(PCI or PCI-X) bridges. Most PCI devices will use the INTx# pins
  because MSI support is optional for them. Since PCIe doesn't support
  sideband interrupt signaling, the in-band messages are used instead. The
  interrupt controller understands the message and delivers an interrupt
  request to the CPU which would include a pre-programmed vector number.

• Boot Devices. PC systems commonly use the legacy interrupt model during
  the boot sequence because MSI usually requires OS-level initialization.
  Generally, a minimum of three subsystems are needed for booting: an output
  to the operator such as video, an input from the operator which is typically
  the keyboard, and a device that can be used to fetch the OS, typically a hard
  drive. PCIe devices involved in initializing the system are called "boot
  devices". Boot devices will use legacy interrupt support until the OS and
  device drivers are loaded, after which it's preferable they use MSI.


Virtual INTx Wire Delivery


Figure 17-9 on page 806 illustrates a system with a PCIe Endpoint and a PCI
Express-to-PCI Bridge. If we assume software has not enabled MSI on the
Endpoint, it will deliver interrupt requests with INTx messages. In this
example, the bridge is propagating pin-based interrupts from connected PCI
devices with INTx messages. As can be seen, the bridge sends INTB messages to
signal the assertion and deassertion of its INTB# input from the PCI bus. The
PCIe Endpoint is shown signaling an INTA using emulation messages. Note that
INTx# signaling involves two messages:
• Assert_INTx messages indicate a high-to-low transition (from inactive to
  active) of the virtual INTx# signal.
• Deassert_INTx messages indicate a low-to-high transition.

When a Function delivers an Assert_INTx message, it also sets its Interrupt
Status bit in the Configuration Status register, just as it would if it
asserted the physical INTx# pin (see Figure 17-8 on page 805).

Figure 17-9: Example of INTx Messages to Virtualize INTA#-INTD# Signal
Transitions

[Figure labels: CPU; Root Complex with Interrupt Controller; Memory; Switch;
PCIe Endpoint sending Assert_INTA and Deassert_INTA; PCIe-PCI(X) Bridge with
INTA# through INTD# inputs from the PCI(X) bus, sending Assert_INTB and
Deassert_INTB]


INTx Message Format


Figure 17-10 on page 807 depicts the format of the INTx message header. The
interrupt controller is the ultimate destination of these messages; however,
the routing method employed is not "Route to the Root Complex", but is actually
"Local - Terminate at Receiver" as shown in Figure 17-10. There are two reasons
for this. The first is because each bridge (including Switch Ports and Root
Ports) along the upstream path may map the virtual interrupt wire to a
different virtual interrupt wire across the bridge (e.g., a Switch Port
receives Assert_INTA but maps it to Assert_INTB when propagating it upstream).
More info about this INTx mapping can be found in "INTx Mapping" on page 808.

The second reason for the local routing type of these messages is due to the
fact that we're emulating a pin-based signal. If a port receives an assert
interrupt message that maps to INTA on its primary side and it has already sent
an Assert_INTA message upstream because of a previous interrupt, then there is
no reason to send another one. INTA is already seen as asserted. More info
about this collapsing of INTx messages can be found in "INTx Collapsing" on
page 810.

Figure 17-10: INTx Message Format and Type

Byte 0:   Fmt (001) | Type (10100) | R | TC | R | Attr | R | TH | TD | EP |
          Attr | AT | Length
Byte 4:   Requester ID | Tag | Message Code
Byte 8:   Reserved for INTx Messages
Byte 12:  Reserved for INTx Messages

Type 10100 = Message, routed Local - Terminate at Receiver

Message Code:
20h = Assert_INTA
21h = Assert_INTB
22h = Assert_INTC
23h = Assert_INTD
24h = Deassert_INTA
25h = Deassert_INTB
26h = Deassert_INTC
27h = Deassert_INTD


Mapping and Collapsing INTx Messages


INTx Mapping
Switches must adhere to the INTx mapping defined by the PCI spec, shown in
Table 17-1 on page 809. This mapping defines the virtual connection that exists
when interrupts are routed across a PCI-to-PCI bridge. The mapping is based
on the INTx message type and the Device number from the Requester ID field
in the message.

Refer to Figure 17-11 on page 810 for this example. The assert interrupt
messages received on the two downstream switch ports are both INTA messages.
The virtual PCI-to-PCI bridge at each of the ingress ports will map both INTA
messages to INTA, meaning no change. This is because the Device number of
both originating Endpoint devices is zero (which is contained in the interrupt
message itself as part of the Requester ID, ReqID). Table 17-1 shows that
interrupt messages coming from Device 0 map to the same INTx message on the
other side of the bridge (i.e., internal to the Switch both INTA messages are
mapped to INTA). So each downstream port will propagate the interrupt
messages upstream without changing their virtual wire. However, the propagated
interrupt messages no longer have the ReqID of the original requester; they now
have the ReqID of the port that is propagating the interrupt message.

Next, the upstream Switch Port receives the propagated interrupt messages.
The INTA interrupt from port 2:1:0 is going to be mapped to an INTB message
when propagated upstream because the interrupt message indicates it came
from Device 1 (ReqID 2:1:0). The other interrupt being propagated by port 2:2:0
is going to be mapped to an INTC message when sent from the upstream
Switch Port to the Root Port. Refer to Table 17-1 to confirm these mappings.

The reason for this interrupt mapping is the same as it was for PCI: to avoid
as much as possible having multiple functions sharing the same INTx# pin. As
stated previously, single-function devices are required to use INTA if using
legacy interrupts. So if all the Functions downstream of a Root Port used INTA
and there was no mapping across bridges, they would all be routed to the same
IRQ, which means that anytime one of the Functions asserted INTA, all the
Functions would have to be checked. This would result in significant interrupt
servicing latencies for the Functions at the end of the list. This interrupt
mapping method is a crude attempt at distributing interrupts (especially INTA)
across all four INTx virtual wires because each INTx virtual wire can be
mapped to a separate IRQ at the interrupt controller.


Table 17-1: INTx Message Mapping Across Virtual PCI-to-PCI Bridges

Device Number of      INTx Message      INTx Message
Delivering INTx       Type at Input     Type at Output

0, 4, 8, 12 etc.      INTA              INTA
                      INTB              INTB
                      INTC              INTC
                      INTD              INTD

1, 5, 9, 13 etc.      INTA              INTB
                      INTB              INTC
                      INTC              INTD
                      INTD              INTA

2, 6, 10, 14 etc.     INTA              INTC
                      INTB              INTD
                      INTC              INTA
                      INTD              INTB

3, 7, 11, 15 etc.     INTA              INTD
                      INTB              INTA
                      INTC              INTB
                      INTD              INTC
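
The table follows a simple rotation: the output wire is the input wire advanced
by the Device number, modulo four. The sketch below expresses that rule; the
function names are hypothetical and chosen only for this illustration.

```python
WIRES = "ABCD"


def map_intx(device_number, wire):
    """Map an incoming INTx wire ('A'..'D') across one virtual bridge."""
    return WIRES[(WIRES.index(wire) + device_number) % 4]


def map_path(wire, device_numbers):
    """Apply the mapping at each bridge crossed on the way upstream."""
    for device_number in device_numbers:
        wire = map_intx(device_number, wire)
    return wire


# An Endpoint that is Device 0 on its Link sends INTA through a downstream
# port that is Device 1 on the Switch's internal bus, as in the example above.
print(map_path("A", [0, 1]))  # B
# The same, through a downstream port that is Device 2.
print(map_path("A", [0, 2]))  # C
```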


Figure 17-11: Example of INTx Mapping

[Figure labels: PCIe Endpoints 3:0:0 and 4:0:0 send Assert_INTA (ReqID 3:0:0)
and Assert_INTA (ReqID 4:0:0); downstream Switch Ports 2:1:0 and 2:2:0 map INTA
from Dev 0 to INTA and forward Assert_INTA (ReqID 2:1:0) and Assert_INTA
(ReqID 2:2:0); upstream Switch Port 1:0:0 maps INTA from Dev 1 to INTB and INTA
from Dev 2 to INTC, sending Assert_INTB (ReqID 1:0:0) and Assert_INTC
(ReqID 1:0:0) to the Interrupt Controller in the Root Complex]

INTx Collapsing
PCIe Switches must ensure that INTx messages are delivered upstream in the
correct fashion. Specifically, interrupt routing of legacy PCI implementations
must be handled such that software can determine which interrupts are routed
to which interrupt controller inputs. INTx# lines may be wire-ORed and be
routed to the same IRQ input on the interrupt controller, and when multiple
devices signal interrupts on the same line, only the first assertion is seen by
the interrupt controller. Similarly, when one of these devices deasserts its
INTx# line, the line remains asserted until the last one is turned off. These
same principles apply to PCIe INTx messages.

In some cases, however, two overlapping INTx messages may be mapped to the
same INTx message by a virtual PCI bridge at the egress port, requiring the
messages to be collapsed. Consider the following example illustrated in Figure
17-12 on page 811.


When the upstream Switch Port maps the interrupt messages for delivery on
the upstream link, both interrupts will be mapped as INTB (based on the device
numbers of the downstream Switch Ports). Note that because these two
overlapping messages are the same they must be collapsed.

Collapsing ensures that the interrupt controller will never receive two
consecutive Assert_INTx or Deassert_INTx messages for the shared interrupts.
This is equivalent to INTx signals being wire-ORed.

Figure 17-12: Switch Uses Bridge Mapping of INTx Messages

[Figure labels: PCIe Endpoints 3:0:0 and 4:0:0 each send Assert_INTA and
Deassert_INTA to downstream Switch Ports 2:1:0 and 2:5:0; upstream Switch Port
1:0:0 sends a single Assert_INTB (1:0:0) and a single Deassert_INTB (1:0:0) to
the Interrupt Controller in the Root Complex; the timing diagram marks the
overlapping Assert_INTA and Deassert_INTA messages that are blocked by 1:0:0]


INTx Delivery Rules


The rules associated with the delivery of INTx messages have some unique
characteristics:
• Assert_INTx and Deassert_INTx are only issued in the upstream direction.
• Switches that are collapsing interrupts will only issue INTx messages
  upstream when there is a change of the interrupt status.
• Devices on either side of a link must track the current state of INTA-INTD
  assertion.
• A Switch tracks the state of the four virtual wires for each of its
  downstream ports, and may present a collapsed set of virtual wires on its
  upstream port.
• The Root Complex must track the state of the four virtual wires (A-D) for
  each downstream port.
• INTx signaling may be disabled with the Interrupt Disable bit in the
  Command Register.
• If any INTx virtual wires are active and device interrupts are then
  disabled, a corresponding Deassert_INTx message must be sent.
• If a downstream Switch Port goes to DL_Down status, any active INTx virtual
  wires must be deasserted, and the upstream port updated accordingly
  (Deassert_INTx message required if that INTx was in active state).
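
A minimal sketch of the collapsing behavior described by these rules follows.
It is an illustrative model only: the class and method names are hypothetical,
port names are plain strings, and the ordering of the messages in the usage
example is chosen for illustration rather than taken from Figure 17-12.

```python
WIRES = "ABCD"


def map_intx(device_number, wire):
    """Bridge mapping: rotate the wire by the device number, modulo 4."""
    return WIRES[(WIRES.index(wire) + device_number) % 4]


class UpstreamPortCollapser:
    """Tracks which downstream ports assert each upstream virtual wire."""

    def __init__(self):
        self.asserted = {wire: set() for wire in WIRES}

    def on_message(self, port, device_number, kind, wire):
        """kind is 'Assert' or 'Deassert'; returns a message or None."""
        out = map_intx(device_number, wire)
        sources = self.asserted[out]
        was_active = bool(sources)
        if kind == "Assert":
            sources.add(port)
        else:
            sources.discard(port)
        is_active = bool(sources)
        if is_active == was_active:
            return None  # no change of state, so nothing is sent upstream
        return f"{'Assert' if is_active else 'Deassert'}_INT{out}"

    def on_dl_down(self, port):
        """Deassert every wire that only this port was holding active."""
        messages = []
        for wire, sources in self.asserted.items():
            if port in sources:
                sources.discard(port)
                if not sources:
                    messages.append(f"Deassert_INT{wire}")
        return messages


switch = UpstreamPortCollapser()
log = [
    switch.on_message("2:1:0", 1, "Assert", "A"),
    switch.on_message("2:5:0", 5, "Assert", "A"),
    switch.on_message("2:1:0", 1, "Deassert", "A"),
    switch.on_message("2:5:0", 5, "Deassert", "A"),
]
print(log)  # ['Assert_INTB', None, None, 'Deassert_INTB']
```

Only the first assertion and the last deassertion reach the upstream Link,
which is the wire-OR behavior the rules require.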

The MSI Model


A PCIe Function indicates MSI support via the MSI Capability registers. Each
Function must implement either the MSI Capability Structure or the MSI-X
(eXtended MSI, see "The MSI-X Model" on page 821) Capability Structure, or
both. The MSI Capability registers are set up by configuration software and
include:
• Target memory address
• Data Value to be written to that address
• The number of unique messages that can be encoded into the data

See "Memory Request Header Fields" on page 188 for a review of the Memory
Write Transaction Header. Note that MSIs always have a data payload of 1DW.

The MSI Capability Structure


The MSI Capability Structure resides in the PCI-compatible config space area
(first 256 bytes). There are four variations of the MSI Capability Structure
based on whether it supports 64-bit addressing or only 32-bit and whether it
supports per-vector masking or not. Native PCIe devices are required to support
64-bit addressing. All four variations of the MSI Capability Structure can be
found in Figure 17-13 on page 813.

Figure 17-13: MSI Capability Structure Variations

32-bit Address
DW0  Message Control [31:16] | Next Capability Pointer [15:8] | Capability ID (05h) [7:0]
DW1  Message Address [31:0]
DW2  Message Data

64-bit Address
DW0  Message Control [31:16] | Next Capability Pointer [15:8] | Capability ID (05h) [7:0]
DW1  Message Address [31:0]
DW2  Message Address [63:32]
DW3  Message Data

32-bit Address with Per-Vector Masking
DW0  Message Control [31:16] | Next Capability Pointer [15:8] | Capability ID (05h) [7:0]
DW1  Message Address [31:0]
DW2  Reserved | Message Data
DW3  Mask Bits
DW4  Pending Bits

64-bit Address with Per-Vector Masking
DW0  Message Control [31:16] | Next Capability Pointer [15:8] | Capability ID (05h) [7:0]
DW1  Message Address [31:0]
DW2  Message Address [63:32]
DW3  Reserved | Message Data
DW4  Mask Bits
DW5  Pending Bits
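
Because the register positions shift between the four variations, software has
to compute the offsets from the two capability flags. The helper below is a
hypothetical sketch of that calculation; offsets are in bytes from the start of
the capability structure and follow the dword layouts in Figure 17-13.

```python
def msi_register_offsets(addr64, per_vector_masking):
    """Return byte offsets of MSI registers within the capability structure."""
    offsets = {
        "capability_id": 0x00,
        "next_pointer": 0x01,
        "message_control": 0x02,
        "message_address": 0x04,
    }
    next_offset = 0x08
    if addr64:
        offsets["message_upper_address"] = next_offset
        next_offset += 4
    offsets["message_data"] = next_offset  # 16-bit register in this dword
    next_offset += 4
    if per_vector_masking:
        offsets["mask_bits"] = next_offset
        offsets["pending_bits"] = next_offset + 4
    return offsets


for addr64 in (False, True):
    for masking in (False, True):
        layout = msi_register_offsets(addr64, masking)
        print(addr64, masking, {k: hex(v) for k, v in layout.items()})
```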


Capability ID
A Capability ID value of 05h identifies the MSI capability and is a read-only
value.

Next Capability Pointer


The second byte of the register is a read-only value that gives the
dword-aligned offset from the top of config space to the next Capability
Structure in the linked list of structures, or else contains 00h to indicate
the end of the linked list.

Message Control Register


Figure 17-14 on page 814 and Table 17-2 on page 814 illustrate the layout and
usage of the Message Control register.

Figure 17-14: Message Control Register

Bit(s)  Field
15:9    Reserved
8       Per-vector Masking Capable
7       64-bit Address Capable
6:4     Multiple Message Enable
3:1     Multiple Message Capable
0       MSI Enable

Table 17-2: Format and Usage of Message Control Register

Bit(s)  Field Name        Description

0       MSI Enable        Read/Write. State after reset is 0, indicating that
                          the device's MSI capability is disabled.
                          0 = Function is disabled from using MSI. It must
                              use MSI-X or else INTx Messages.
                          1 = Function is enabled to use MSI to request
                              service and won't use MSI-X or INTx Messages.

3:1     Multiple Message  Read-Only. System software reads this field to
        Capable           determine how many messages (interrupt vectors)
                          the Function would like to use. The requested
                          number of messages is a power of two, therefore a
                          Function that would like three messages must
                          request that four messages be allocated to it.

                          Value   Number of Messages Requested
                          000b    1
                          001b    2
                          010b    4
                          011b    8
                          100b    16
                          101b    32
                          110b    Reserved
                          111b    Reserved

6:4     Multiple Message  Read/Write. After system software reads the
        Enable            Multiple Message Capable field (previous row in
                          this table) to see how many messages (interrupt
                          vectors) are requested by the Function, it programs
                          a 3-bit value in this field indicating the actual
                          number of messages allocated to the Function. The
                          number allocated can be equal to or less than the
                          number actually requested. The state of this field
                          after reset is 000b.

                          Value   Number of Messages Allocated
                          000b    1
                          001b    2
                          010b    4
                          011b    8
                          100b    16
                          101b    32
                          110b    Reserved
                          111b    Reserved

7       64-bit Address    Read-Only.
        Capable           0 = Function does not implement the upper 32 bits
                              of the Message Address register; only a 32-bit
                              address is possible.
                          1 = Function implements the upper 32 bits of the
                              Message Address register and is capable of
                              generating a 64-bit memory address.

8       Per-Vector        Read-Only.
        Masking Capable   0 = Function does not implement the Mask Bits
                              register or the Pending Bits register;
                              software does NOT have the ability to mask
                              individual interrupts with this capability
                              structure.
                          1 = Function implements both the Mask Bits
                              register and the Pending Bits register;
                              software does have the ability to mask
                              individual interrupts with this capability
                              structure.

15:9    Reserved          Read-Only. Always zero.
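The field layout in Table 17-2 can be decoded with a few shifts and masks. The following is a minimal sketch; the function name, the dictionary keys, and the sample value 0185h are illustrative assumptions, not anything defined by the spec.

```python
def _message_count(code: int) -> int:
    """Translate a 3-bit Multiple Message encoding into a vector count."""
    if code > 0b101:
        raise ValueError("encodings 110b and 111b are reserved")
    return 1 << code


def decode_msi_message_control(value: int) -> dict:
    """Split a 16-bit MSI Message Control value into its fields."""
    return {
        "msi_enable": bool(value & 0x1),                       # bit 0
        "messages_requested": _message_count((value >> 1) & 0x7),  # bits 3:1
        "messages_allocated": _message_count((value >> 4) & 0x7),  # bits 6:4
        "addr64_capable": bool(value & (1 << 7)),              # bit 7
        "per_vector_masking": bool(value & (1 << 8)),          # bit 8
    }


# Hypothetical register value: enabled, 4 requested, 1 allocated,
# 64-bit capable, per-vector masking capable.
fields = decode_msi_message_control(0x0185)
print(fields)
```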

Message Address Register


The lower two bits of the 32-bit Message Address register are zero and cannot
be changed, forcing the address assigned by software to be dword-aligned.
Typically, this would be the address of the Local APIC in the system CPU. In
x86-based systems (Intel-compatible), this address has traditionally been
FEEx_xxxxh, where the lower 20 bits indicate which Local APIC is being
targeted as well as some other information about the interrupt itself. It is
important to note that how the address is interpreted is platform specific and
is not dictated in the PCI or PCIe specs.

The register containing bits [63:32] of the Message Address is required for
native PCI Express devices but is optional for legacy endpoints. This register
is present if Bit 7 of the Message Control register is set. If so, it is a
read/write register used in conjunction with the Message Address [31:0]
register to enable a 64-bit memory address for interrupt delivery from this
Function.
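Because the upper address register and the mask/pending registers are optional, the offsets of the later registers shift depending on two Message Control bits. The sketch below computes the byte offsets relative to the start of the capability. The function and key names are assumptions for illustration; the layout follows the MSI capability structure as defined in the PCI spec.

```python
def msi_register_offsets(message_control: int) -> dict:
    """Byte offsets of MSI registers relative to the capability's start."""
    addr64 = bool(message_control & (1 << 7))    # 64-bit Address Capable
    masking = bool(message_control & (1 << 8))   # Per-vector Masking Capable

    offsets = {"message_control": 0x02, "message_address_lo": 0x04}
    next_off = 0x08
    if addr64:                                   # upper address is present
        offsets["message_address_hi"] = next_off
        next_off += 4
    offsets["message_data"] = next_off
    next_off += 4
    if masking:                                  # both registers come as a pair
        offsets["mask_bits"] = next_off
        offsets["pending_bits"] = next_off + 4
    return offsets


print(msi_register_offsets(0x0180))   # 64-bit capable, masking capable
print(msi_register_offsets(0x0000))   # 32-bit only, no masking
```

With both options present, the Pending Bits register lands at offset 14h, which is DW5 of the structure.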


Message Data Register


System software writes a base message data pattern into this 16-bit, read/write
register. When the Function generates an interrupt request, it writes a 32-bit
data value to the memory address specified in the Message Address register.
The upper 16 bits of this data are always set to zero, while the lower 16 bits
are supplied by the Message Data register.

If more than one message has been assigned to the Function, it modifies the
lower bits of the Message Data register value (the number of modifiable bits
depends on how many messages have been assigned to the Function by
configuration software) to form the appropriate value for the event it wishes
to report. As an example, refer to "Basics of Generating an MSI Interrupt
Request" on page 820.

Mask Bits Register and Pending Bits Register


If the Function supports per-vector masking (indicated in bit [8] of the
Message Control register), then these registers are present. The maximum
number of interrupt messages (interrupt vectors) that can be requested and
assigned to a Function using MSI is 32, so these two registers are 32 bits in
length, with each potential interrupt message having its own mask bit and
pending bit. If bit [0] of the Mask Bits register is set, then interrupt
message 0 is masked (this is the base vector from this Function). If bit [1]
is set, then interrupt message 1 is masked (this is the base vector + 1).

When an interrupt message is masked, the MSI for that vector cannot be sent.
Instead, the corresponding Pending Bit is set. This allows software to mask
individual interrupts from a Function and then periodically poll the Function
to see if there are any masked interrupts that are pending.

If software clears a mask bit and the corresponding pending bit is set, the
Function must send the MSI request at that time. Once the interrupt message
has been sent, the Function clears the pending bit.
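The mask and pending behavior described above can be captured in a small software model. This is only a sketch of the rules as stated in the text, not device code; the class name `MsiMaskModel` and its methods are invented for the example, and the `sent` list stands in for MSI writes going out on the Link.

```python
class MsiMaskModel:
    """Toy model of the MSI Mask Bits and Pending Bits registers."""

    def __init__(self):
        self.mask_bits = 0
        self.pending_bits = 0
        self.sent = []              # vectors for which an MSI was "sent"

    def raise_event(self, vector: int) -> None:
        """The Function wants to signal the given vector (0..31)."""
        if self.mask_bits & (1 << vector):
            self.pending_bits |= 1 << vector     # masked: remember it
        else:
            self.sent.append(vector)             # unmasked: send the MSI

    def write_mask(self, new_mask: int) -> None:
        """Software writes the Mask Bits register."""
        self.mask_bits = new_mask & 0xFFFFFFFF
        for vector in range(32):
            bit = 1 << vector
            if (self.pending_bits & bit) and not (self.mask_bits & bit):
                self.sent.append(vector)         # unmasked while pending
                self.pending_bits &= ~bit        # then clear the pending bit


model = MsiMaskModel()
model.write_mask(0b10)      # mask vector 1 only
model.raise_event(1)        # held back, pending bit 1 set
model.raise_event(0)        # delivered immediately
print(model.sent, bin(model.pending_bits))
model.write_mask(0)         # unmasking releases the pending interrupt
print(model.sent, bin(model.pending_bits))
```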

Basics of MSI Configuration


The following list specifies the steps taken by software to configure MSI
interrupts for a PCI Express device. Refer to Figure 17-15 on page 819.

1. At startup time, enumeration software scans the system for all PCI-
   compatible Functions (see "Single Root Enumeration Example" on page 109
   for a discussion of the enumeration process).
2. Once a Function is discovered, software reads the Capabilities List
   Pointer to find the location of the first capability structure in the
   linked list.
3. If the MSI Capability structure (Capability ID of 05h) is found in the
   list, software reads the device's Message Control register. The Multiple
   Message Capable field shows how many event-specific messages the device
   requests, and the 64-bit Address Capable bit shows whether it supports a
   64-bit message address or only 32-bit. Software then allocates a number of
   messages equal to or less than the number requested and writes that value
   into the Multiple Message Enable field. At a minimum, one message will be
   allocated to the device.
4. Software writes the base message data pattern into the device's Message
   Data register and writes a dword-aligned memory address to the device's
   Message Address register to serve as the destination address for MSI
   writes.
5. Finally, software sets the MSI Enable bit in the device's Message Control
   register, enabling it to generate MSI writes and disabling other interrupt
   delivery options.
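Steps 3 through 5 can be sketched against a byte array that stands in for a Function's configuration space. This is an illustration under stated assumptions: `configure_msi`, its parameters, the capability offset 50h, and the sample address and data values are all hypothetical, and a real device would protect its read-only bits in hardware, which a plain byte array does not.

```python
import struct


def configure_msi(cfg: bytearray, cap: int, address: int, data: int,
                  max_vectors: int) -> int:
    """Program the MSI capability at offset `cap`; return vectors granted."""
    ctrl = struct.unpack_from("<H", cfg, cap + 2)[0]

    # Step 3: grant a power of two no larger than the request or our budget.
    requested_code = (ctrl >> 1) & 0x7
    budget_code = max(max_vectors, 1).bit_length() - 1
    granted_code = min(requested_code, budget_code)
    ctrl = (ctrl & ~(0x7 << 4)) | (granted_code << 4)
    struct.pack_into("<H", cfg, cap + 2, ctrl)

    # Step 4: message address (dword aligned) and base data pattern.
    struct.pack_into("<I", cfg, cap + 4, address & 0xFFFFFFFC)
    data_offset = cap + 8
    if ctrl & (1 << 7):                          # 64-bit Address Capable
        struct.pack_into("<I", cfg, cap + 8, (address >> 32) & 0xFFFFFFFF)
        data_offset = cap + 12
    struct.pack_into("<H", cfg, data_offset, data & 0xFFFF)

    # Step 5: set MSI Enable last, once everything else is in place.
    struct.pack_into("<H", cfg, cap + 2, ctrl | 0x1)
    return 1 << granted_code


# Hypothetical device: requests 4 messages, 64-bit capable (control = 0084h).
cfg = bytearray(256)
cfg[0x50] = 0x05
struct.pack_into("<H", cfg, 0x52, 0x0084)
granted = configure_msi(cfg, 0x50, 0xFEEFF00C, 0x49A0, max_vectors=2)
print(granted, hex(struct.unpack_from("<H", cfg, 0x52)[0]))
```

Setting MSI Enable as the final write mirrors the ordering of the steps: the device should not be able to send an MSI until its address and data are valid.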


Figure 17-15: Device MSI Configuration Process

(Flowchart: scan the PCI bus(es) until a device is discovered. If the device
has a New Capabilities list and is MSI capable, determine the number of
messages requested and assign a number of messages to the device; write the
base data pattern into the Message Data register; assign a memory address to
the Message Address register; enable the device to use MSI with the MSI
Enable bit in the Message Control register.)


Basics of Generating an MSI Interrupt Request


Figure 17-16 on page 821 illustrates the contents of an MSI Memory Write
Transaction Header and Data field. Key points include:

• Format field must be 011b for native functions, indicating a 4DW header
  (64-bit address) with Data, but it may be 010b for Legacy Endpoints,
  indicating a 32-bit address.
• The Attribute bits for No Snoop and Relaxed Ordering must be zero.
• Length field must be 01h to indicate a data payload of 1 DW.
• First BE field must be 1111b, indicating valid data in all four bytes of
  the DW, even though the upper two bytes will always be zero for MSI.
• Last BE field must be 0000b, indicating a single DW transaction.
• Address fields within the header come directly from the address fields
  within the MSI Capability registers.
• Lower 16 bits of the Data payload are derived from the data field within
  the MSI Capability registers.
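To make the key points concrete, the sketch below assembles the header and payload bytes of an MSI Memory Write. It is a simplified illustration: the function name and parameters are assumptions, Traffic Class 0 is assumed, no ECRC is appended, and the payload is laid out least-significant byte first as a modeling choice for the example.

```python
import struct


def build_msi_tlp(requester_id: int, tag: int, address: int, data: int,
                  four_dw: bool = True) -> bytes:
    """Return header + payload bytes for an MSI Memory Write (sketch)."""
    fmt = 0b011 if four_dw else 0b010     # 4DW or 3DW header, with data
    byte0 = (fmt << 5) | 0b00000          # Fmt | Type (Memory Write)
    byte1 = 0x00                          # TC = 0 (assumed), TH = 0
    byte2 = 0x00                          # TD, EP, Attr (No Snoop/RO) all 0
    byte3 = 0x01                          # Length = 1 DW
    byte7 = (0b0000 << 4) | 0b1111        # Last DW BE | First DW BE
    header = struct.pack(">BBBBHBB", byte0, byte1, byte2, byte3,
                         requester_id, tag, byte7)
    if four_dw:
        header += struct.pack(">I", (address >> 32) & 0xFFFFFFFF)
    header += struct.pack(">I", address & 0xFFFFFFFC)   # low 2 bits are 00
    payload = struct.pack("<I", data & 0xFFFF)          # upper 16 bits zero
    return header + payload


native = build_msi_tlp(0x0100, 0, 0xFEEFF00C, 0x49A1)
legacy = build_msi_tlp(0x0100, 0, 0xFEEFF00C, 0x49A1, four_dw=False)
print(native.hex())
print(legacy.hex())
```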

Multiple Messages
If system software allocated more than one message to the Function, the
multiple values are created by modifying the lower bits of the assigned
Message Data value to send a different message for each device-specific event
type.

As an example, assume the following:

• Four messages have been allocated to a device.
• A data value of 49A0h has been assigned to the device's Message Data
  register.
• Memory address FEEF_F00Ch has been written into the device's Message
  Address register.

When one of the four events occurs, the device generates a request by
performing a dword write to memory address FEEF_F00Ch with a data value of
0000_49A0h, 0000_49A1h, 0000_49A2h, or 0000_49A3h. In other words, the lower
two bits of the data value are modified to specify which event occurred. If
this Function had been allocated 8 messages, then the lower three bits could
be modified. Also, the device always uses 0000h for the upper 2 bytes of its
message data value.
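The arithmetic in this example is a simple bit replacement. A minimal sketch follows; the function name and its argument checks are assumptions added for illustration.

```python
def msi_data_for_event(base_data: int, allocated: int, event: int) -> int:
    """Return the 16-bit message data the device sends for a given event."""
    if allocated not in (1, 2, 4, 8, 16, 32):
        raise ValueError("allocated must be a power of two from 1 to 32")
    if not 0 <= event < allocated:
        raise ValueError("event number exceeds the allocated messages")
    low_mask = allocated - 1                  # the bits the device may modify
    return ((base_data & 0xFFFF) & ~low_mask) | event


# Four messages allocated, base data 49A0h, as in the example above.
values = [msi_data_for_event(0x49A0, 4, event) for event in range(4)]
print([f"0000_{v:04X}h" for v in values])
```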



Figure 17-16: Format of Memory Write Transaction for Native-Device MSI Delivery

MSI (Memory Write) Transaction:
  Byte 0   Fmt = 011b, Type = 00000b, Attr = 00b, Length = 0000000001b
  Byte 4   Requester ID, Tag, Last DW BE = 0000b, First DW BE = 1111b
  Byte 8   MSI Message Address [63:32]
  Byte 12  MSI Message Address [31:0], low two bits 00b
  Byte 16  Data: 0000h in the upper half, MSI Message Data in the lower half

MSI Capability Structure:
  DW0  Message Control [31:16], Next Capability Pointer [15:8],
       Capability ID (05h) [7:0]
  DW1  Message Address [31:0]
  DW2  Message Address [63:32]
  DW3  Message Data

The MSI-X Model

General
The 3.0 revision of the PCI spec added support for MSI-X, which has its own
capability structure. MSI-X was motivated by a desire to alleviate three
shortcomings of MSI:

• 32 vectors per function are not enough for some applications.
• Having only one destination address makes static distribution of
  interrupts across multiple CPUs difficult. The most flexibility would be
  achieved if a unique address could be assigned for each vector.


• In several platforms, like x86-based systems, the vector number of the
  interrupt indicates its priority relative to other interrupts. With MSI, a
  single Function could be allocated multiple interrupts, but all the
  interrupt vectors would be contiguous, meaning similar priority. This is
  not a good solution if some interrupts from this Function should be high
  priority and others should be low priority. A better approach would be for
  software to designate a unique vector (message data value), that does not
  have to be contiguous, for each interrupt allocated to the Function.

Keeping those goals in mind, it's easy to understand the register changes that
were implemented to provide more vectors, with each vector being assigned a
target address and message data value.

MSI-X Capability Structure


As shown in Figure 17-17 on page 822, the Message Control register is quite
different from MSI. Interestingly, even though MSI-X can support up to 2048
vectors per Function versus the 32 for MSI, the number of configuration
registers for MSI-X is actually a little smaller than for MSI. That's because
the vector information isn't contained here. Instead, it's in a memory
location (MMIO) pointed to by the Table BIR (Base address Indicator Register),
as shown in Figure 17-18 on page 824.

Figure 17-17: MSI-X Capability Structure

  DW0  Message Control [31:16], Next Capability Pointer [15:8],
       Capability ID (11h) [7:0]
  DW1  MSI-X Table Offset, Table BIR
  DW2  Pending Bit Array (PBA) Offset, PBA BIR

  (BIR = BAR Index Register)

Message Control register:
  Bit 15      MSI-X Enable (RW)
  Bit 14      Function Mask (RW)
  Bits 13:11  Reserved
  Bits 10:0   Table Size in N-1 (RO)


Table 17-3: Format and Usage of MSI-X Message Control Register

Bit(s)  Field Name        Description

10:0    Table Size        Read-Only. This field indicates the number of
                          interrupt messages (vectors) that this Function
                          supports. The value here is interpreted in an N-1
                          fashion, so a value of 0 means 1 vector. A value of
                          7 means 8 vectors. Each vector has its own entry in
                          the MSI-X Table and its own bit in the Pending Bit
                          Array.

13:11   Reserved          Read-Only. Always zero.

14      Function Mask     Read/Write. This field provides system software an
                          easy way to mask all the interrupts from a
                          Function. If this bit is cleared, interrupts can
                          still be masked individually by setting the mask
                          bit within each vector's MSI-X table entry.

15      MSI-X Enable      Read/Write. State after reset is 0, indicating that
                          the device's MSI-X capability is disabled.
                          0 = Function is disabled from using MSI-X. It must
                              use MSI or INTx Messages.
                          1 = Function is enabled to use MSI-X to request
                              service and won't use MSI or INTx Messages.
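A short sketch of decoding this register, including the N-1 encoding of Table Size, is shown below. The function name, key names, and sample values are assumptions for the example.

```python
def decode_msix_message_control(value: int) -> dict:
    """Split a 16-bit MSI-X Message Control value into its fields."""
    return {
        "table_size": (value & 0x7FF) + 1,          # bits 10:0, N-1 encoded
        "function_mask": bool(value & (1 << 14)),   # bit 14
        "msix_enable": bool(value & (1 << 15)),     # bit 15
    }


print(decode_msix_message_control(0x8007))   # enabled, 8 vectors
print(decode_msix_message_control(0x07FF))   # disabled, 2048 vectors
```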


Figure 17-18: Location of MSI-X Table

(Diagram: the Type 0 configuration header, dwords 00 through 15, with its six
Base Address registers. In this example Table BIR = 2, so Base Address 2
selects the region of system memory space that holds the MSI-X Table, and the
MSI-X Table Offset locates the table within that region.)

MSI-X Table
The MSI-X Table itself is an array of vectors and addresses, as shown in
Figure 17-19 on page 825. Each entry represents one vector and contains four
Dwords. DW0 and DW1 supply a unique 64-bit address for that vector, while DW2
gives a unique 32-bit data pattern for it. DW3 only contains one bit at
present: a mask bit for that vector, allowing each vector to be independently
masked off as needed.
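Locating and reading one table entry can be sketched as follows. The helper names, the BAR base address F000_0000h, and the sample register values are hypothetical. The split of the Table Offset/BIR dword into a 3-bit BIR and an 8-byte-aligned offset comes from the PCI spec's definition of that register rather than from the text above.

```python
import struct

ENTRY_SIZE = 16   # four dwords per MSI-X Table entry


def decode_table_register(dw1: int):
    """Split the Table Offset/BIR dword into (BIR, offset)."""
    return dw1 & 0x7, dw1 & 0xFFFFFFF8


def entry_address(bar_bases: dict, dw1: int, index: int) -> int:
    """Memory address of table entry `index`."""
    bir, offset = decode_table_register(dw1)
    return bar_bases[bir] + offset + ENTRY_SIZE * index


def parse_entry(raw: bytes) -> dict:
    """Decode the 16 bytes of one table entry (DW0..DW3)."""
    lower, upper, data, control = struct.unpack("<IIII", raw)
    return {
        "address": (upper << 32) | lower,   # DW1:DW0
        "data": data,                       # DW2
        "masked": bool(control & 0x1),      # DW3 bit 0
    }


bar_bases = {2: 0xF0000000}                 # hypothetical BAR 2 assignment
print(hex(entry_address(bar_bases, 0x00002002, 3)))
print(parse_entry(struct.pack("<IIII", 0xFEE00000, 0, 0x41, 1)))
```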


Figure 17-19: MSI-X Table Entries

  DW3             DW2           DW1            DW0
  Vector Control  Message Data  Upper Address  Lower Address   Entry 0
  Vector Control  Message Data  Upper Address  Lower Address   Entry 1
  Vector Control  Message Data  Upper Address  Lower Address   Entry 2
  ...
  Vector Control  Message Data  Upper Address  Lower Address   Entry N-1

  In Vector Control, bit 0 is the vector Mask Bit (R/W).

Pending Bit Array


In much the same way, the Pending Bit Array is also located within a memory
address. It can use the same BIR value (same BAR) as the MSI-X Table with a
different offset, or it could use a different BAR altogether. The array, shown
in Figure 17-20, simply contains a bit for every vector that will be used. If
the event to trigger that interrupt occurs but its Mask Bit has been set, then
an MSI-X transaction will not be sent. Instead, the corresponding pending bit
is set. Later, if that vector is unmasked and the pending bit is still set,
the interrupt will be generated at that time.
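Since the array packs 64 pending bits per quadword, finding the bit for a given vector is a divide and a remainder. The helper names below are assumptions for illustration.

```python
def pba_location(vector: int):
    """Return (quadword index, bit position) of a vector's pending bit."""
    return vector // 64, vector % 64


def pba_size_qwords(table_size: int) -> int:
    """Number of quadwords needed to hold one bit per vector."""
    return (table_size + 63) // 64


def is_pending(pba_qwords, vector: int) -> bool:
    """Test a vector's pending bit in a list of 64-bit quadword values."""
    qword, bit = pba_location(vector)
    return bool(pba_qwords[qword] & (1 << bit))


print(pba_location(130))            # vector 130 lives in QW 2
print(pba_size_qwords(2048))        # largest possible table
print(is_pending([0, 0, 0b100], 130))
```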


Figure 17-20: Pending Bit Array

  DW1                DW0
  Pending Bits 0 - 63           QW 0
  Pending Bits 64 - 127         QW 1
  Pending Bits 128 - 191        QW 2
  ...
  Pending Bits                  QW (N-1)/64

Memory Synchronization When Interrupt Handler Entered

The Problem
There is a potential problem with any interrupt scheme when data is being
delivered. For example, if the device has previously sent data and wants to
report that with an interrupt, an unexpected delay on data delivery could
allow the interrupt to arrive too soon. That might happen in the bridge data
buffer shown in Figure 17-21 on page 827, and the result is a race condition.
The steps are similar to our earlier discussion (see "The Legacy Model" on
page 796):

1. The function writes a data block toward memory. The write completes on the
   local bus as a posted transaction, meaning that the sender has finished
   all it needed to do and the transaction is considered completed.
2. An interrupt is delivered to notify software that some requested data is
   now present in memory. However, the data has been delayed in the bridge
   for some reason.
3. The interrupt vector is fetched as before.
4. The ISR starting address is fetched and control is passed to it.
5. The ISR reads from the target memory buffer but the data payload still
   hasn't been delivered, so it fetches stale data, possibly causing an
   error.


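Figure 17-21: Memory Synchronization Problem

(Diagram: a Device on a PCI bus behind a Bridge that contains a write buffer;
the South Bridge with its Interrupt Controller (PIC) receiving INTA#; the
North Bridge; the CPU with its INTR input; and Memory holding the memory
buffer, the Interrupt Service Routine (ISR), and the Interrupt Table of ISR
starting addresses. The numbers 1 through 5 mark the steps in the list.)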

One Solution
One way to alleviate this problem takes advantage of PCI transaction ordering
rules. If the ISR first sends a read request to the device that initiated the
interrupt before it attempts to fetch the data, the resulting read completion
will follow the same path back to the CPU that any write data would have taken
from that device to get to memory. Transaction ordering rules guarantee that a
read result in a bridge cannot pass a posted write going in the same
direction, so the end result is that the data will get written into memory
before the read result will be allowed to reach the CPU. Therefore, if the ISR
waits for the read completion to arrive before proceeding, it can be sure that
any data will have been delivered to memory, and thus the race condition is
avoided. Since the read is basically being used as a data flush mechanism, it
isn't necessary for it to return any data. In that case the read can be zero
length and the data returned is discarded. For that reason, this type of read
is sometimes called a "dummy read."
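The flushing effect of the dummy read can be illustrated with a toy model in which the bridge's upstream path is a single FIFO, so a read completion can never overtake an earlier posted write. This is a deliberately simplified sketch: the class `BridgeModel` and its methods are invented for the example, and real bridges implement the full ordering table rather than one queue.

```python
from collections import deque


class BridgeModel:
    """Toy upstream path: strict FIFO, so nothing passes a posted write."""

    def __init__(self):
        self.upstream = deque()     # traffic heading toward CPU and memory
        self.memory = {}

    def device_posted_write(self, addr, value):
        self.upstream.append(("write", addr, value))

    def device_read_completion(self):
        self.upstream.append(("completion", None, None))

    def drain_until_completion(self) -> bool:
        """Deliver queued items in order until a completion reaches the CPU."""
        while self.upstream:
            kind, addr, value = self.upstream.popleft()
            if kind == "write":
                self.memory[addr] = value   # posted write lands in memory
            else:
                return True                 # completion arrives afterwards
        return False


bridge = BridgeModel()
bridge.device_posted_write(0x1000, "payload")   # data stuck in write buffer
stale = 0x1000 not in bridge.memory             # ISR would read stale data now
bridge.device_read_completion()                 # reply to the ISR's dummy read
bridge.drain_until_completion()                 # ISR waits for the completion
print(stale, bridge.memory[0x1000])
```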

An MSI Solution
MSI can simplify this process, although there are some requirements for it to
work (refer to Figure 17-22 on page 829). If the system allows the device to
generate its own MSI writes rather than going through an intermediary like an
IO APIC, then the following example can take place:

1. The device writes the payload data toward memory and it is absorbed by the
   write buffer in the bridge.
2. The device believes the data has been delivered and signals an interrupt
   to notify the CPU. In this case, an MSI is sent and uses the same path as
   the data. Since both data and MSI appear as memory writes to the bridge,
   the normal transaction ordering rules will keep them in the correct
   sequence.
3. The payload data is delivered to memory, freeing the path through the
   bridge for the MSI write.
4. The MSI write is delivered to the CPU Local APIC and the software now
   knows that the payload data is available.

Traffic Classes Must Match


An important point must be stressed here, however. Both the data and MSI must
use the same Traffic Class for this to work. Recall that packets that have
been assigned different TC values may end up being mapped into different
Virtual Channels, and that packets in different VCs have no ordering
relationship. If the data were mapped to VC0 and the MSI were mapped to VC1,
then the system would be unaware of any ordering relationship between them
and unable to enforce memory coherency automatically.

If giving both packets the same TC is not possible, the system would need to
use the "dummy read" method instead, and the TC of the read request would need
to match the TC of the data write packet. It should be clear that even if the
same TC is used for both, the use of the Relaxed Ordering bit must be avoided.
We're counting on the transaction ordering rules to achieve memory
synchronization, so they must not be relaxed.


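Figure 17-22: MSI Delivery

(Diagram: a Device on a PCI bus behind a Bridge that contains a write buffer;
the South Bridge with its Interrupt Controller (IO APIC); the North Bridge;
Memory; and two CPUs, each with a Local APIC. The numbers 1 through 4 mark
the steps in the list.)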

Interrupt Latency
The time from signaling an interrupt until software services the device is
referred to as the interrupt latency. In spite of its advantages, MSI, like
other interrupt delivery mechanisms, does not provide interrupt latency
guarantees.

MSI May Result In Errors


Because MSIs are delivered as Memory Write transactions, an error associated
with delivery of an MSI is treated the same as any other Memory Write error
condition. See "ECRC Generation and Checking" on page 657 for treatment of
ECRC errors, as one example. The concern, of course, is that if an error
results in the MSI packet being unrecognized, then no interrupt will be seen
by the processor. How this condition would be handled is outside the scope of
the PCIe spec.


Some MSI Rules and Recommendations


1. It is the intent of the spec that mutually exclusive messages will be
   assigned to Functions by system software and that each message will be
   converted to an exclusive interrupt on delivery to the processor.
2. More than one MSI capability register set per Function is prohibited.
3. A memory read from the address held in the Message Address register
   produces undefined results.
4. Reserved registers and bits are read-only and always return zero when
   read.
5. System software can modify Message Control register bits, but the device
   itself is prohibited from doing so. In other words, modifying the bits by
   a "back door" mechanism is not allowed.
6. At a minimum, a single message will be assigned to each device (assuming
   software supports and plans to use MSI in the system).
7. System software must not write to the upper half of the dword that
   contains the Message Data register.
8. If the device writes the same message multiple times, only one of those
   messages is guaranteed to be serviced. If all of them must be serviced,
   the device must not generate the same message again until the previous one
   has been serviced.
9. If a device has more than one message assigned, and it writes a series of
   different messages, it is guaranteed that all of them will be serviced.

Special Consideration for Base System Peripherals


Interrupts may also originate in embedded legacy hardware, such as an IO
Controller Hub or Super IO device. Some of the typical legacy devices required
in such systems include:

• Serial ports
• Parallel ports
• Keyboard and Mouse Controller
• System Timer
• IDE controllers

These devices typically require a specific IRQ line into a PIC or IO APIC,
which allows legacy software to interact with them correctly.

Using the INTx messages does not guarantee that the devices will receive the
IRQ assignment they require. The following example illustrates a system that
will support the proper legacy interrupt assignment.


Example Legacy System


Figure 17-23 on page 831 shows an older PCI Express system that includes an IO
Controller Hub (ICH) attached to the Root Complex via a proprietary Hub Link.
The IO APIC embedded within the ICH can generate an MSI when it receives an
interrupt request at its inputs. In such an implementation, software can
assign the legacy vector number to each input to ensure that the correct
legacy software will be called.

The advantage of this approach is that existing hardware can be used to
support the legacy requirements of a PCIe platform. This system also requires
that the MSI subsystem be configured for use during the boot sequence. The
example illustrated eliminates the need for INTx messages unless a PCIe
expansion device incorporates a PCI Express-to-PCI Bridge.

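Figure 17-23: PCI Express System with PCI-Based IO Controller Hub

(Block diagram: the Processor connects over the FSB to the Root Complex, which
has PCI Express Links, a PCI Express GFX port, and DDR SDRAM. A Hub Link joins
the Root Complex to the IO Controller Hub, which sends MSIs upstream. The ICH
contains an Interrupt Controller (APIC) and an Interrupt Router fed by INTA# -
INTD# and by Serial Interrupts, and it serves IDE (CD, HDD), USB 2.0, LPC
(Super IO with COM1/COM2, Timer, Boot ROM), AC97 Link (Modem and Audio
Codecs), and a 33 MHz PCI bus with Slots, IEEE 1394, and Ethernet.)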


18 System Reset
The Previous Chapter
The previous chapter describes the different ways that PCIe Functions can
generate interrupts. The old PCI model used pins for this, but sideband
signals are undesirable in a serial model, so support for the in-band MSI
(Message Signaled Interrupt) mechanism was made mandatory. The PCI INTx# pin
operation can still be emulated using PCIe INTx messages for software backward
compatibility reasons. Both the PCI legacy INTx# method and the newer versions
of MSI/MSI-X are described.

This Chapter
This chapter describes the four types of resets defined for PCIe: cold reset,
warm reset, hot reset, and function-level reset. The use of a sideband reset
PERST# signal to generate a system reset is discussed, and so is the in-band
TS1 used to generate a Hot Reset.

The Next Chapter


The next chapter describes the PCI Express hot plug model. A standard usage
model is also defined for all devices and form factors that support hot plug
capability. Power is an issue for hot plug cards, too, and when a new card is
added to a system during runtime, it's important to ensure that its power
needs don't exceed what the system can deliver. A mechanism was needed to
query and control the power requirements of a device, and Power Budgeting
provides this.

Two Categories of System Reset


The PCI Express spec describes four types of reset mechanisms. Three of these
were part of the earlier revisions of the PCIe spec and are collectively
referred to now as Conventional Resets, and two of them are called Fundamental
Resets. The fourth category and method, added with the 2.0 spec revision, is
called the Function Level Reset.


Conventional Reset

Fundamental Reset
A Fundamental Reset is handled in hardware and resets the entire device, re-
initializing every state machine and all the hardware logic, port states and
configuration registers. The exception to this rule is a group of some
configuration register fields that are identified as "sticky", meaning they
retain their contents unless all power is removed. This makes them very useful
for diagnosing problems that require a reset to get a Link working again,
because the error status survives the reset and is available to software
afterwards. If main power is removed but Vaux is available, that will also
maintain the sticky bits, but if both main power and Vaux are lost, the sticky
bits will be reset along with everything else.

A Fundamental Reset will occur on a system-wide reset, but it can also be done
for individual devices.

Two types of Fundamental Reset are defined:

• Cold Reset: The result when the main power is turned on for a device.
  Cycling the power will cause a cold reset.
• Warm Reset (optional): Triggered by a system-specific means without
  shutting off main power. For example, a change in the system power status
  might be used to initiate this. The mechanism for generating a Warm Reset
  is not defined by the spec, so the system designer will choose how this is
  done.

When a Fundamental Reset occurs:

• For positive voltages, receiver terminations are required to meet the
  ZRX-HIGH-IMP-DC-POS parameter. At 2.5 GT/s, this is no less than 10 KΩ. At
  the higher speeds it must be no less than 10 KΩ for voltages below 200 mV,
  and 20 KΩ for voltages above 200 mV. These are the values when the
  terminations are not powered.
• Similarly, for negative voltages the ZRX-HIGH-IMP-DC-NEG parameter applies,
  and the value is a minimum of 1 KΩ in every case.
• Transmitter terminations are required to meet the output impedance
  ZTX-DIFF-DC, from 80 to 120 Ω for Gen1 and a maximum of 120 Ω for Gen2 and
  Gen3, but may place the driver in a high-impedance state.
• The transmitter holds a DC common-mode voltage between 0 and 3.6 V.


When exiting from a Fundamental Reset:

• The receiver single-ended terminations must be present when receiver
  terminations are enabled so that Receiver Detect works properly (40-60 Ω
  for Gen1 and Gen2, and 50 Ω for Gen3). By the time Detect is entered, the
  common-mode impedance must be within the proper range of 50 Ω.
• The device must re-enable its receiver terminations ZRX-DIFF-DC of 100 Ω
  within 5 ms of Fundamental Reset exit, making it detectable by the
  neighbor's transmitter during training.
• The transmitter holds a DC common-mode voltage between 0 and 3.6 V.

Two methods of delivering a Fundamental Reset are defined. First, it can be
signaled with an auxiliary sideband signal called PERST# (PCI Express Reset).
Second, when PERST# is not provided to an add-in card or component, a
Fundamental Reset is generated autonomously by the component or add-in card
when the power is cycled.

PERST# Fundamental Reset Generation


A central resource device such as a chipset in the PCI Express system provides
this reset. For example, the IO Controller Hub (ICH) chip in Figure 18-1 on
page 836 may generate PERST# based on the status of the system power supply
"POWERGOOD" signal, since this indicates that the main power is turned on and
stable. If power is cycled off, POWERGOOD toggles and causes PERST# to assert
and deassert, resulting in a Cold Reset. The system may also provide a method
of toggling PERST# by some other means to accomplish a Warm Reset.

The PERST# signal feeds all PCI Express devices on the motherboard, including
the connectors and graphics controller. Devices may choose to use PERST# but
are not required to do so. PERST# also feeds the PCIe-to-PCI-X bridge shown in
the figure. Bridges always forward a reset on their primary (upstream) bus to
their secondary (downstream) bus, so the PCI-X bus sees RST# asserted.

Autonomous Reset Generation


A device must be designed to generate its own reset in hardware upon
application of main power. The spec doesn't describe how this would be done,
so a self-reset mechanism can be built into the device or added as external
logic. For example, an add-in card that detects Power-On may use that event to
generate a local reset to its device. The device must also generate an
autonomous reset if it detects its power going outside of the limits
specified.


Link Wakeup from L2 Low Power State


As an example of the need for an autonomous reset, a device whose main power
has been turned off as part of a power management policy may be able to
request a return to full power if it was designed to signal a wakeup. When
power is restored, the device must be reset. The power controller for the
system may assert the PERST# pin to the device, as shown in Figure 18-1 on
page 836, but if it doesn't, or if the device doesn't support PERST#, the
device must autonomously generate its own Fundamental Reset when it senses
main power re-applied.

Figure 18-1: PERST# Generation

(Block diagram: Processor, FSB, Root Complex, DDR SDRAM, PCI Express GFX, IO
Controller Hub (ICH), Switch, Add-In cards, SCSI, Gigabit Ethernet, IEEE 1394,
and a PCI Express-to-PCI-X bridge. Signals shown: POWERGOOD into the ICH,
PERST# to the PCI Express devices, and PRST# on the PCI and PCI-X buses.)


Hot Reset (In-band Reset)


A Hot Reset is propagated in-band from one link neighbor to another by sending
several TS1s (whose contents are shown in Figure 18-2) with bit 0 of symbol 5
asserted. These TS1s are sent on all Lanes, using the previously negotiated
Link and Lane numbers, for 2 ms. Once they have been sent, the Transmitter and
Receiver of the Hot Reset will both end up in the Detect LTSSM state (see "Hot
Reset State" on page 612).

Figure 18-2: TS1 Ordered-Set Showing the Hot Reset Bit

TS1 symbols:
  Symbol 0      COM        K28.5
  Symbol 1      Link #     D0.0-D31.7, K23.7 (0-255)
  Symbol 2      Lane #     D0.0-D31.0, K23.7 (0-31)
  Symbol 3      # FTS      # of FTS ordered sets required by receiver to
                           obtain bit and symbol lock
  Symbol 4      Rate ID
  Symbol 5      Train Ctl  Training Control (bits listed below)
  Symbols 6-15  TS ID      D10.2 for TS1 Identifier

Training Control:
  Bit 0     0 = De-assert Hot Reset            1 = Assert Hot Reset
  Bit 1     0 = De-assert Disable Link         1 = Assert Disable Link
  Bit 2     0 = De-assert Loopback             1 = Assert Loopback
  Bit 3     0 = De-assert Disable Scrambling   1 = Assert Disable Scrambling
  Bit 4     0 = De-assert Compliance Receive   1 = Assert Compliance Receive
  Bit 5:7   Reserved
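The Training Control bits in symbol 5 can be decoded with a small lookup, as in the sketch below. The function and the bit names used as dictionary values are assumptions for illustration; the byte passed in is taken to be the symbol's 8-bit value before encoding.

```python
TRAINING_CONTROL_BITS = {
    0: "hot_reset",
    1: "disable_link",
    2: "loopback",
    3: "disable_scrambling",
    4: "compliance_receive",
}


def decode_training_control(symbol5: int) -> set:
    """Return the names of the Training Control bits asserted in symbol 5."""
    return {name for bit, name in TRAINING_CONTROL_BITS.items()
            if symbol5 & (1 << bit)}


print(decode_training_control(0x01))          # Hot Reset asserted
print(sorted(decode_training_control(0x0A)))  # Disable Link + Disable Scrambling
```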

A hot reset is initiated in software by setting the Secondary Bus Reset bit in
a bridge's Bridge Control configuration register, as shown in Figure 18-5 on
page 840. Consequently, only devices containing bridges, like the Root Complex
or a Switch, can do this. A Switch that receives hot reset on its Upstream
Port must broadcast it to all of its Downstream Ports and reset itself. All
devices downstream of a switch that receive the hot reset will reset
themselves.

Response to Receiving Hot Reset


• The device's LTSSM goes through the Recovery and Hot Reset state, and then
  back to the Detect state, where it starts the Link Training process.
• All of the device's state machines, hardware logic, port states and
  configuration registers (except sticky registers) initialize to their
  default conditions.


Switches Generate Hot Reset on Downstream Ports


A Switch generates a hot reset on all of its Downstream Ports when:

• It receives a hot reset on its Upstream Port.
• For a Switch or Bridge Upstream Port, if the Data Link Layer reports a
  DL_Down state, the effect is very similar to a hot reset. This can happen
  when the Upstream Port has lost its connection with an upstream device due
  to an error that is not recoverable by the Physical Layer or Data Link
  Layer.
• Software sets the Secondary Bus Reset bit of the Bridge Control
  configuration register associated with the Upstream Port, as shown in
  Figure 18-3 on page 838.

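Figure 18-3: Switch Generates Hot Reset on One Downstream Port

(Block diagram: two Processors on the FSB, a Root Complex with PCI Express GFX
and DDR SDRAM, and Switches A, B, and C, with Switch B below Switch A. Other
devices shown include 10Gb Ethernet, Gb Ethernet, SCSI, an Add-In card, a PCI
Express-to-PCI bridge with PCI Slots, IEEE 1394, and Super IO with COM1/COM2.
The Secondary Bus Reset bit is set in one Downstream Port of Switch A, marked
1.)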

Bridges Forward Hot Reset to the Secondary Bus


If a bridge such as a PCI Express-to-PCI(-X) bridge detects a hot reset on its
Upstream Port, it must assert the PRST# signal on its secondary PCI(-X) bus,
as illustrated in Figure 18-4 on page 839.

Software Generation of Hot Reset


Software generates a Hot Reset on a specific port by writing a 1 followed by 0
to the Secondary Bus Reset bit in the Bridge Control register of the
associated port's configuration header (see Figure 18-5 on page 840). Consider
the example shown in Figure 18-3 on page 838. Software sets the Secondary Bus
Reset bit of Switch A's left Downstream Port, causing it to send TS1 Ordered
Sets with the Hot Reset bit set. Switch B receives this Hot Reset on its
Upstream Port and forwards it to all its Downstream Ports.

Figure 18-4: Switch Generates Hot Reset on All Downstream Ports

(Block diagram: the same topology as Figure 18-3. Here the Secondary Bus Reset
bit is set in Switch C, marked 1, and the PCI Express-to-PCI bridge below it
asserts PRST# on its PCI bus.)

If software sets the Secondary Bus Reset bit of a Switch's Upstream Port, then
the switch generates a hot reset on all of its Downstream Ports, as shown in
Figure 18-4 on page 839. Here, software sets the Secondary Bus Reset bit in
Switch C's Upstream Port, causing it to send TS1s with the Hot Reset bit set
on all its Downstream Ports. The PCIe-to-PCI bridge receives this Hot Reset
and forwards it on to the PCI bus by asserting PRST#.

Setting the Secondary Bus Reset bit causes a Port's LTSSM to transition to the
Recovery state (for more on the LTSSM, see "Overview of LTSSM States" on page
519) where it generates the TS1s with the Hot Reset bit set. The TS1s are
generated continuously for 2 ms and then the Port exits to the Detect state
where it is ready to start the Link training process.
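The "write 1, then write 0" sequence can be sketched as a read-modify-write pair on the Bridge Control register. The sketch assumes the standard Type 1 header layout, with Bridge Control at offset 3Eh and Secondary Bus Reset at bit 6. The accessor callbacks `read16` and `write16`, the 2 ms hold time, and the fake configuration space are assumptions made so the example can run without hardware.

```python
import struct
import time

BRIDGE_CONTROL_OFFSET = 0x3E        # upper half of dword 15 in a Type 1 header
SECONDARY_BUS_RESET = 1 << 6


def pulse_secondary_bus_reset(read16, write16, hold_seconds=0.002,
                              sleep=time.sleep) -> None:
    """Set, hold, and clear Secondary Bus Reset, preserving other bits."""
    ctrl = read16(BRIDGE_CONTROL_OFFSET)
    write16(BRIDGE_CONTROL_OFFSET, ctrl | SECONDARY_BUS_RESET)    # write 1
    sleep(hold_seconds)                                           # assumed hold
    write16(BRIDGE_CONTROL_OFFSET, ctrl & ~SECONDARY_BUS_RESET)   # write 0


# Fake configuration space standing in for a bridge port.
cfg = bytearray(64)
writes = []


def read16(offset):
    return struct.unpack_from("<H", cfg, offset)[0]


def write16(offset, value):
    struct.pack_into("<H", cfg, offset, value)
    writes.append(value)


pulse_secondary_bus_reset(read16, write16, sleep=lambda seconds: None)
print([hex(w) for w in writes])
```

Production code would also wait for the Link to retrain before touching devices below the port; that step is outside this sketch.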


The receiver of the Hot Reset TS1s (always downstream) will go to the Recovery
state, too. When it sees two consecutive TS1s with the Hot Reset bit set, it
goes to the Hot Reset state for a 2 ms timeout and then exits to Detect. Both
Upstream and Downstream Ports are initialized and end up in the Detect state,
ready to begin Link training. If the downstream device is also a Switch or
Bridge, it forwards the Hot Reset to its Downstream Ports as well, as shown in
Figure 18-3 on page 838.

Figure 18-5: Secondary Bus Reset Register to Generate Hot Reset

Type 1 configuration header (doubleword number in decimal, byte 3 on the left):

00 | Device ID | Vendor ID
01 | Status Register | Command Register
02 | Class Code | Revision ID
03 | BIST | Header Type | Latency Timer | Cache Line Size
04 | Base Address 0
05 | Base Address 1
06 | Secondary Latency Timer | Subordinate Bus Number | Secondary Bus Number | Primary Bus Number
07 | Secondary Status | I/O Limit | I/O Base
08 | Memory Limit | Memory Base
09 | Prefetchable Memory Limit | Prefetchable Memory Base
10 | Prefetchable Base Upper 32 Bits
11 | Prefetchable Limit Upper 32 Bits
12 | I/O Limit Upper 16 Bits | I/O Base Upper 16 Bits
13 | Reserved | Capability Pointer
14 | Expansion ROM Base Address
15 | Bridge Control | Interrupt Pin | Interrupt Line

Bridge Control register bits:

15:12 | Reserved
11 | Discard Timer SERR# Enable
10 | Discard Timer Status
9 | Secondary Discard Timeout
8 | Primary Discard Timeout
7 | Fast Back-to-Back Enable
6 | Secondary Bus Reset
5 | Master Abort Mode
3 | VGA Enable
2 | ISA Enable
1 | SERR# Enable
0 | Parity Error Response

Software Can Disable the Link


Software can also disable a Link, forcing it to go into Electrical Idle and remain
there until further notice. The reason for mentioning that at this point is that
disabling the Link also has the effect of causing a Hot Reset on downstream
components. Disabling is accomplished by setting the Link Disable bit in the
Link Control Register of the Downstream Port, shown in Figure 18-6 on page
841. That causes the Port to go to the Recovery LTSSM state and begin sending
TS1s with the Disable bit set. Since a Link can only be disabled and re-enabled
from its Downstream Port, this bit is reserved for Upstream Ports (such
as Endpoints or Switch Upstream Ports).


Figure 18-6: Link Control Register

Bit(s) | Field
15:12 | RsvdP
11 | Link Autonomous Bandwidth Interrupt Enable
10 | Link Bandwidth Management Interrupt Enable
9 | Hardware Autonomous Width Disable
8 | Enable Clock Power Management
7 | Extended Synch
6 | Common Clock Configuration
5 | Retrain Link
4 | Link Disable
3 | Read Completion Boundary Control
2 | RsvdP
1:0 | Active State PM Control

When the Upstream Port recognizes incoming TS1s with the Disabled bit set, its
Physical Layer signals LinkUp=0 (false) to the Link Layer and all the Lanes go to
Electrical Idle. After a 2 ms timeout, an Upstream Port will go to Detect, but a
Downstream Port will remain in the Disabled LTSSM state until directed to exit
from it (such as by clearing the Link Disable bit), so the Link will remain
disabled and will not attempt training until then.
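
A minimal sketch of how software might locate the Link Control register and
toggle Link Disable is shown below. It assumes the configuration space of the
Downstream Port is available as a Python bytearray; find_capability and
set_link_disable are hypothetical helper names. The offsets used (Capabilities
Pointer at 34h, PCI Express Capability ID 10h, Link Control at offset 10h
within that structure, Link Disable at bit 4) follow the standard register
layout rather than anything stated in this section.

```python
CAP_PTR = 0x34          # Capabilities Pointer in the configuration header
PCIE_CAP_ID = 0x10      # Capability ID of the PCI Express Capability structure
LINK_CONTROL = 0x10     # Link Control offset within that structure
LINK_DISABLE = 1 << 4   # Link Disable bit


def find_capability(cfg, cap_id):
    """Walk the capability list; return the structure's offset or None."""
    ptr = cfg[CAP_PTR] & 0xFC
    seen = set()                      # guards against a malformed, looping list
    while ptr and ptr not in seen:
        seen.add(ptr)
        if cfg[ptr] == cap_id:
            return ptr
        ptr = cfg[ptr + 1] & 0xFC     # next-capability pointer
    return None


def set_link_disable(cfg, disable):
    """Set or clear Link Disable; return the new Link Control value."""
    cap = find_capability(cfg, PCIE_CAP_ID)
    if cap is None:
        raise ValueError("no PCI Express Capability structure found")
    off = cap + LINK_CONTROL
    val = int.from_bytes(cfg[off:off + 2], "little")
    val = (val | LINK_DISABLE) if disable else (val & ~LINK_DISABLE)
    cfg[off:off + 2] = val.to_bytes(2, "little")
    return val
```

Calling set_link_disable(cfg, True) corresponds to disabling the Link, and
set_link_disable(cfg, False) to directing the Downstream Port out of the
Disabled state.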


Figure 18-7: TS1 Ordered Set Showing Disable Link Bit

Symbol | Field | Contents
0 | COM | K28.5
1 | Link # | D0.0-D31.7, K23.7 (0-255)
2 | Lane # | D0.0-D31.0, K23.7 (0-31)
3 | # FTS | # of FTS ordered sets required by receiver to obtain bit and symbol lock
4 | Rate ID |
5 | Train Ctl | Training Control (see below)
6-15 | TS ID | D10.2 for TS1 Identifier

TS1 Training Control:

Bit 0 | 0 = De-assert Hot Reset, 1 = Assert Hot Reset
Bit 1 | 0 = De-assert Disable Link, 1 = Assert Disable Link
Bit 2 | 0 = De-assert Loopback, 1 = Assert Loopback
Bit 3 | 0 = De-assert Disable Scrambling, 1 = Assert Disable Scrambling
Bit 4 | 0 = De-assert Compliance Receive, 1 = Assert Compliance Receive
Bit 5:7 | Reserved
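
As an illustration of the Training Control bit assignments in Figure 18-7, the
following sketch decodes symbol 5 of a TS1 into the names of the asserted bits.
The function name decode_training_control is hypothetical; the bit positions
are the ones listed in the figure, and reserved bits 5:7 are ignored.

```python
TRAINING_CONTROL_BITS = {
    0: "Hot Reset",
    1: "Disable Link",
    2: "Loopback",
    3: "Disable Scrambling",
    4: "Compliance Receive",
}


def decode_training_control(byte):
    """Return the names of the Training Control bits asserted in `byte`."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("Training Control is a single symbol (0-255)")
    return [name for bit, name in TRAINING_CONTROL_BITS.items()
            if byte & (1 << bit)]
```

For example, a Training Control value of 01h decodes to Hot Reset and 02h to
Disable Link, the two cases discussed in this chapter.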

Function Level Reset (FLR)


The FLR capability allows software to reset just one Function within a
multi-function device without affecting the Link that is shared by them all. Its
implementation is strongly recommended but isn't required, so software would need
to confirm its availability before attempting to use it by examining the Device
Capabilities register, as shown in Figure 18-8 on page 843. If the Function Level
Reset Capability bit is set, then an FLR can be initiated by simply setting the
Initiate Function Level Reset bit in the Device Control Register as shown in Figure
18-9 on page 843.

Figure 18-8: Function Level Reset Capability

Figure 18-9: Function Level Reset Initiate Bit

The spec mentions a few examples that motivate the addition of FLR:

1. It can happen that software controlling a Function encounters a problem
and is no longer operating correctly. Preventing data corruption necessitates
a reset of that Function, but if other Functions within that device are
still working properly it would be nice to be able to reset just the one having
trouble.
2. In a virtualized environment, where applications can migrate from one
piece of hardware to another, it's important that when an application is
moved off a Function, the Function doesn't retain any information
about what it was doing. This prevents information used by one application
that might be considered confidential from becoming visible to the new one
running on that Function. The simplest way to clean up after migrating the
previous application is simply to reset the Function.
3. When software is rebuilding a software stack for a Function, it is sometimes
necessary to first put the Function into an uninitialized state. As before,
avoiding a reset of all Functions sharing the Link is desirable.

Another feature doesn't appear in the list of cases in the spec but is still a
motivating factor in its own right. While a conventional reset will reinitialize
everything within the device, it does not require that all external activity, such as
traffic on a network interface, must cease right away. FLR adds this requirement
and is the only reset that does.

FLR resets the Function's internal state and registers, making it quiescent, but
doesn't affect any sticky bits, or hardware-initialized bits, or link-specific
registers like Captured Power, ASPM Control, Max_Payload_Size or Virtual Channel
registers. If an outstanding Assert INTx interrupt message was sent, a
corresponding Deassert INTx message must be sent, unless that interrupt was shared
by another Function internally that still has it asserted. All external activity for
that Function is required to cease when an FLR is received.

Time Allowed
A Function must complete an FLR within 100 ms. However, software may need
to delay initiating an FLR if there are any outstanding split completions that
haven't yet been returned (indicated by the fact that the Transactions Pending
bit remains set in the Device Status register). In that case, software must either
wait for them to finish before initiating the FLR, or wait 100 ms after FLR before
attempting to re-initialize the Function. If this isn't managed, a potential data
corruption problem arises: a Function may have split transactions outstanding
but a reset causes it to lose track of them. If they are returned later they could be
mistaken for responses to new requests that have been issued since the FLR. To
avoid this problem, the spec recommends that software should:

1. Coordinate with other software that might access the Function to ensure it
doesn't attempt access during the FLR.
2. Clear the entire Command register, thereby quiescing the Function.
3. Ensure that previously requested Completions have been returned by polling
the Transactions Pending bit in the Device Status register until it's
cleared or waiting long enough to be sure the Completions won't ever be
returned. How long would be long enough? If Completion Timeouts are
being used, wait for the timeout period before sending the FLR. If Completion
Timeouts are disabled, then wait at least 100 ms.
4. Initiate the FLR and wait 100 ms.
5. Set up the Function's configuration registers and enable it for normal
operation.

When the FLR has completed, regardless of the timing, the Transactions Pending
bit must be cleared.
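
Steps 2 through 4 of the recommended sequence can be sketched as follows. This
is an illustration under stated assumptions, not a reference implementation:
FakeFunction and function_level_reset are hypothetical names, the device is
simulated, and the polling budget of 100 iterations of 1 ms stands in for
"wait at least 100 ms". Register positions follow the standard PCI Express
Capability layout (Device Capabilities at offset 04h with the FLR capability at
bit 28, Device Control at 08h with Initiate FLR at bit 15, Device Status at 0Ah
with Transactions Pending at bit 5). Steps 1 and 5 are left to the caller.

```python
import time

COMMAND = 0x04                                 # Command register in the header
DEV_CAP, DEV_CTL, DEV_STA = 0x04, 0x08, 0x0A   # offsets within the PCIe capability
FLR_CAPABLE = 1 << 28
INITIATE_FLR = 1 << 15
TRANSACTIONS_PENDING = 1 << 5


class FakeFunction:
    """Simulated Function (hypothetical); real hardware reads Initiate FLR as 0."""

    def __init__(self, cap=0x40, flr_capable=True):
        self.regs = bytearray(256)
        self.log = []
        if flr_capable:
            self.regs[cap + DEV_CAP:cap + DEV_CAP + 4] = FLR_CAPABLE.to_bytes(4, "little")

    def read16(self, off):
        return int.from_bytes(self.regs[off:off + 2], "little")

    def read32(self, off):
        return int.from_bytes(self.regs[off:off + 4], "little")

    def write16(self, off, val):
        self.log.append((off, val))
        self.regs[off:off + 2] = val.to_bytes(2, "little")


def function_level_reset(dev, cap, sleep=time.sleep, poll_limit=100):
    """Run steps 2-4; return False if the Function does not advertise FLR."""
    if not dev.read32(cap + DEV_CAP) & FLR_CAPABLE:
        return False
    dev.write16(COMMAND, 0)                          # step 2: quiesce the Function
    for _ in range(poll_limit):                      # step 3: wait for Completions
        if not dev.read16(cap + DEV_STA) & TRANSACTIONS_PENDING:
            break
        sleep(0.001)
    ctl = dev.read16(cap + DEV_CTL)
    dev.write16(cap + DEV_CTL, ctl | INITIATE_FLR)   # step 4: initiate the FLR
    sleep(0.100)                                     # allow the FLR to complete
    return True
```

Passing the sleep function as a parameter keeps the sketch testable without
real delays.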

Behavior During FLR


The spec writers chose to describe the behavior of a Function reset in fairly
broad terms so as not to preclude any internal steps that designers might wish
to take. The following behaviors are listed in the spec:

• The Function must not appear to an external interface as though it was an
initialized adapter with an active host. The steps to ensure that all activity
on external interfaces is terminated will be design specific. An example
would be a network adapter that must not respond to requests that would
require an active host during this time.
• The Function must not retain any software-readable state that might
include secret information left behind by some previous use of the Function.
For example, any internal memory must be cleared or randomized.
• The Function must be configurable as normal by the next driver.
• The Function must return a completion for the configuration write that
caused the FLR and then initiate the FLR.

While an FLR is in progress:

• Any requests that arrive are allowed to be silently discarded without logging
them or signaling an error. Flow control credits must be updated to
maintain the link operation, though.
• Incoming completions can be treated as Unexpected Completions or
silently discarded without logging them or signaling an error.
• The FLR itself must be completed within the time described above, but further
initialization after that could take longer. If a configuration Request
comes in before initialization is completed, the Function must return a
completion with CRS (Configuration Retry Status) status. Once a completion is
returned with any other status, a CRS status will not be legal again until the
Function is reset again.

Reset Exit
After exiting the reset state, Link Training and Initialization must begin within
20 ms. Devices may exit the reset state at different times, since reset signaling is
asynchronous, but must begin training within this time.

To allow reset components to perform internal initialization, system software
must wait for at least 100 ms from the end of a reset before attempting to send
Configuration Requests to them. If software initiates a configuration request to
a device after the 100 ms wait time, but the device still hasn't finished its
self-initialization, it returns a Completion with status CRS. Since configuration
Requests can only be initiated by the CPU, the Completion will be returned to
the Root Complex. In response, the Root may re-issue the configuration Request
automatically or make the failure visible to software. The spec also states that
software should only use 100 ms wait periods if CRS Software Visibility has
been enabled, since long timeouts or processor stalls may otherwise result.

Devices are allowed a full 1.0 second (-0%/+50%) after a reset before they must
give a proper response to a configuration request. Consequently, the system
must be careful to wait that long before deciding that an unresponsive device is
broken. This value is inherited from PCI and the reason for this lengthy delay
may be that some devices implement configuration space as a local memory
that must be initialized before it can be seen correctly by configuration software.
Its initialization may involve copying the necessary information from a slow
serial EEPROM, and so it might take some time.
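
The wait-then-retry behavior can be sketched as a polling loop. The sketch
assumes CRS Software Visibility is enabled, in which case a Vendor ID read that
is answered with CRS is reported to software as the value 0001h; the function
name wait_until_ready, the callback read_vendor_id, and the 10 ms polling
interval are all hypothetical choices for illustration. The 100 ms initial wait
and 1.0 second limit come from the text above.

```python
import time

CRS_VENDOR_ID = 0x0001   # Vendor ID seen by software while the device returns CRS


def wait_until_ready(read_vendor_id, timeout_ms=1000, first_wait_ms=100,
                     poll_ms=10, sleep=time.sleep):
    """Poll the Vendor ID after a reset until the device stops returning CRS.

    Returns the Vendor ID that was finally read, or raises TimeoutError if the
    device is still returning CRS once timeout_ms has elapsed.
    """
    sleep(first_wait_ms / 1000)          # mandatory wait after the end of reset
    waited_ms = first_wait_ms
    while True:
        vid = read_vendor_id()
        if vid != CRS_VENDOR_ID:
            return vid                   # device gave a proper response
        if waited_ms >= timeout_ms:
            raise TimeoutError("device still returning CRS after %d ms" % waited_ms)
        sleep(poll_ms / 1000)
        waited_ms += poll_ms
```

Only after the timeout expires does the sketch treat the device as broken,
matching the caution in the paragraph above.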

19 Hot Plug and Power Budgeting
The Previous Chapter
The previous chapter describes three types of resets defined for PCIe: Fundamental
reset (consisting of cold and warm reset), hot reset, and function-level
reset (FLR). The use of a sideband reset PERST# signal to generate a system
reset is discussed, and so is the in-band TS1-based Hot Reset.

This Chapter
This chapter describes the PCI Express hot plug model. A standard usage
model is also defined for all devices and form factors that support hot plug
capability. Power is an issue for hot plug cards, too, and when a new card is
added to a system during runtime, it's important to ensure that its power needs
don't exceed what the system can deliver. A mechanism was needed to query
the power requirements of a device before giving it permission to operate.
Power budgeting registers provide that.

The Next Chapter


The next chapter describes the changes and new features that were added with
the 2.1 revision of the spec. Some of these topics, like the ones related to power
management, are described in earlier chapters, but for others there wasn't
another logical place for them. In the end, it seemed best to group them all
together in one chapter to ensure that they were all covered and to help clarify
what features are new.


Background
Some systems using PCIe require high availability or non-stop operation.
Online service suppliers require computer systems that experience downtimes
of just a few minutes a year or less. There are many aspects to building such
systems, but equipment reliability is clearly important. To facilitate these goals
PCIe supports the Hot Plug/Hot Swap solutions for add-in cards that provide
three important capabilities:

1. a method of replacing failed expansion cards without turning the system off
2. keeping the O/S and other services running during the repair
3. shutting down and restarting software associated with a failed device

Prior to the widespread acceptance of PCI, many proprietary Hot Plug solutions
were developed to support this type of removal and replacement of
expansion cards. The original PCI implementation did not support hot removal
and insertion of cards, but two standardized solutions for supporting this
capability in PCI have been developed. The first is the Hot Plug PCI Card used in PC
Server motherboard and expansion chassis implementations. The other is called
Hot Swap and is used in CompactPCI systems based on a passive PCI backplane
implementation.

In both solutions, control logic is used to electrically isolate the card logic from
the shared PCI bus. Power, reset, and clock are controlled to ensure an orderly
power down and power up of cards as they are removed and replaced, and status
and power LEDs inform the user when it's safe to change a card.

Extending hot plug support to PCI Express cards is an obvious step, and
designers have incorporated some Hot Plug features as "native" to PCIe. The
spec defines configuration registers, Hot Plug Messages, and procedures to
support Hot Plug solutions.

Hot Plug in the PCI Express Environment


PCIe Hot Plug is derived from the 1.0 revision of the Standard Hot Plug
Controller spec (SHPC 1.0) for PCI. The goals of PCI Express Hot Plug are to:

• Support the same "Standardized Usage Model" as defined by the Standard
Hot Plug Controller spec. This ensures that the PCI Express hot plug is
identical from the user perspective to existing implementations based on
the SHPC 1.0 spec.
• Support the same software model implemented by existing operating systems.
However, an OS using a SHPC 1.0 compliant driver won't work with
PCI Express Hot Plug controllers because they have a different programming
interface.

The registers necessary to support a Hot Plug Controller are integrated into
individual Root and Switch Ports. Under Hot Plug software control, these
controllers and the associated port interface must control the card interface signals
to ensure orderly power down and power up as cards are changed. To accomplish
that, they'll need to:

• Assert and deassert the PERST# signal to the PCI Express card connector.
• Remove or apply power to the card connector.
• Selectively turn on or off the Power and Attention Indicators associated
with a specific card connector to draw the user's attention to the connector
and indicate whether power is applied to the slot.
• Monitor slot events (e.g. card removal) and report them to software via
interrupts.

PCI Express Hot Plug (like PCI) is designed as a "no surprises" Hot Plug
methodology. In other words, the user is not normally allowed to install or remove a
PCI Express card without first notifying the system. Software then prepares
both the card and slot and finally indicates to the operator the status of the hot
plug process and that installation or removal may now be performed.

Surprise Removal Notification


Cards designed to the PCIe Card ElectroMechanical spec (CEM) implement
card presence detect pins (PRSNT1# and PRSNT2#) on the connector. These
pins are shorter than the others so that they break contact first (when the card is
removed from the slot). This can be used to give advance notice to software of
a "surprise" removal, allowing time to remove power before the signals break
contact.

Differences between PCI and PCIe Hot Plug


The elements needed to support hot plug are essentially the same in both PCI
and PCIe hot plug solutions. Figure 19-1 on page 850 shows the PCI hardware
and software elements required to support hot plug. PCI solutions implement a
single standardized hot plug controller on the system board that handles all the
hot plug slots on the bus. Isolation logic is needed in the PCI environment to
electrically disconnect a card from the shared bus prior to making changes to
avoid glitching the signals on an active bus.

PCIe uses point-to-point connections (see Figure 19-2 on page 851) that
eliminate the need for isolation logic but require a separate hot plug controller for
each Port to which a connector is attached. A standardized software interface
defined for each Root and Switch Port controls hot plug operations.

Figure 19-1: PCI Hot Plug Elements

Figure 19-2: PCI Express Hot Plug Elements

Elements Required to Support Hot Plug


As shown in Figure 19-2 on page 851 there are several parts involved in making
a hot plug environment work. For discussion, let's break these down into
software and hardware elements.

Software Elements
The following table describes the major software elements that support Hot
Plug capability.

Table 19-1: Introduction to Major Hot Plug Software Elements

Software Element | Supplied by | Description

User Interface | OS vendor | An OS-supplied utility that permits the user to request that a connector be powered off to remove a card or turned on to use a card that has just been installed.

Hot-Plug Service | OS vendor | A service that processes requests (referred to as Hot-Plug Primitives) issued by the OS. This includes requests to: provide slot identifiers; turn card power On or Off; turn Attention Indicator On or Off; read current power of slot (On or Off). The Hot-Plug Service interacts with the Hot-Plug System Driver to satisfy the requests. The interface (i.e., API) with the Hot-Plug System Driver is defined by the OS vendor.

Standardized Hot-Plug System Driver | System Board vendor or OS | Receives requests (Hot-Plug Primitives) from the Hot-Plug Service within the OS. Interacts with the hardware Hot-Plug Controllers to accomplish requests.

Device Driver | Adapter card vendor | Some Hot-Plug-specific capabilities must be incorporated in a Hot-Plug capable device driver. This includes: support for the Quiesce command; optional support of the Pause command; support for the Start command or optional Resume command.

A Hot-Plug capable system may use an OS that doesn't support Hot-Plug
capability. In that case, although the system BIOS would contain Hot-Plug-related
software, the Hot-Plug Service would not be present. Assuming that the user
doesn't attempt hot insertion or removal of a card, the system will operate as a
standard, non-Hot-Plug system:

• The system startup firmware must ensure that all Attention Indicators are
Off.
• The spec also states: "the Hot-Plug slots must be in a state that would be
appropriate for loading non-Hot-Plug system software."

Hardware Elements
Table 19-2 on page 853 lists the major hardware elements necessary to support
PCI Express Hot-Plug operation.

Table 19-2: Major Hot-Plug Hardware Elements

Hardware Element | Description

Hot-Plug Controller | Receives and processes commands issued by the Hot-Plug System Driver. One Controller is associated with each Root or Switch Port that supports hot plug operation. The PCIe spec defines a standard software interface for the Hot-Plug Controller.

Card Slot Power Switching Logic | Allows power to a slot to be turned on or off under program control. Controlled by the Hot-Plug Controller under the direction of the Hot-Plug System Driver.

Card Reset Logic | Hot-Plug Controller drives the PERST# signal to a specific slot as directed by the Hot-Plug System Driver.

Power Indicator | Indicates whether power is currently active on the connector. Controlled by the Hot-Plug logic associated with each port and directed by the Hot-Plug System Driver.

Attention Indicator | Draws operator attention to a connector that needs service. Controlled by the Hot-Plug logic and directed by the Hot-Plug System Driver.

Attention Button | Pressed by the operator to notify Hot-Plug software of a request to change a card.

Card Present Detect Pins | There are two of these: PRSNT1# is located at one end of the card slot and PRSNT2# at the opposite end. These pins are shorter than the others so that they disconnect first when a card is removed. The system board ties PRSNT1# to ground and connects PRSNT2# as an input to the Hot-Plug Controller with a pull-up resistor. Additional PRSNT2# pins are defined for wider connectors to support the insertion and recognition of shorter cards installed into longer connectors. The card itself shorts PRSNT1# to PRSNT2#, so that the PRSNT2# input is high if a card is not physically plugged in or low if it is.


Card Removal and Insertion Procedures


The descriptions of typical card removal and insertion that follow are intended
to be introductory in nature. It should be noted that the procedures described in
the following sections assume that the OS, rather than the Hot-Plug System
Driver, is responsible for configuring a newly-installed device. If the Hot-Plug
System Driver has this responsibility, the Hot-Plug Service will call the Hot-Plug
System Driver and instruct it to configure the newly-installed device.

On and Off States


A slot in the On state has the following characteristics:

• Power is applied to the slot.
• REFCLK is on.
• The link is active or in an Active State Power Management state.
• The PERST# signal is deasserted.

A slot in the Off state has the following characteristics:

• Power to the slot is turned off.
• REFCLK is off.
• The link is inactive. (The driver at the Root or Switch Port is in the Hi-Z state.)
• The PERST# signal is asserted.

Turning Slot Off


Steps required to turn off a slot that is currently in the On state:

1. Deactivate the link. This may involve issuing an EIOS to enter the Hi-Z state.
2. Assert the PERST# signal to the slot.
3. Turn off REFCLK to the slot.
4. Remove power from the slot.

Turning Slot On
Steps to turn on a slot that is currently in the Off state:

1. Apply power to the slot.
2. Turn on REFCLK to the slot.
3. Deassert the PERST# signal to the slot. The system must meet the setup and
hold timing requirements (specified in the PCI Express spec) relative to the
rising edge of PERST#.

Once power and clock have been restored and PERST# removed, the physical
layers at both ports will perform link training and initialization. When the link
is active, the devices will initialize VC0 (including flow control), making the
link ready to transfer TLPs.
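
The ordering of the two sequences can be captured in a small model. The Slot
class below is purely illustrative (its name, fields, and step labels are
hypothetical, and no timing requirements are modeled); it only records that
turning a slot on applies power, then REFCLK, then releases PERST#, and that
turning it off reverses that order after deactivating the link.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """Toy model of a hot plug slot; it starts in the Off state."""
    power: bool = False
    refclk: bool = False
    perst_asserted: bool = True
    link_active: bool = False
    trace: list = field(default_factory=list)

    def _step(self, label, **changes):
        for name, value in changes.items():
            setattr(self, name, value)
        self.trace.append(label)

    def turn_on(self):
        """Off -> On: power, then REFCLK, then release PERST#; the link then trains."""
        self._step("apply power", power=True)
        self._step("REFCLK on", refclk=True)
        self._step("deassert PERST#", perst_asserted=False)
        self._step("link trains", link_active=True)

    def turn_off(self):
        """On -> Off: deactivate the link, assert PERST#, stop REFCLK, remove power."""
        self._step("deactivate link", link_active=False)
        self._step("assert PERST#", perst_asserted=True)
        self._step("REFCLK off", refclk=False)
        self._step("remove power", power=False)
```

After turn_on() the model matches the On-state characteristics listed earlier,
and after turn_off() it matches the Off-state characteristics.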

Card Removal Procedure


When a card is to be removed, a number of steps are needed to prepare software
and hardware for safe removal of the card, and set the indicators for the card
being processed. The conditions of the indicators during normal operation are:

• Attention Indicator (Amber or Yellow): Off during normal operation.
• Power Indicator (Green): On during normal operation.

Software sends requests to the Hot Plug Controller using configuration writes
that target the Slot Control Registers implemented by Hot-Plug capable ports.
These control the power to the slot and the state of the indicators.
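
As a sketch of what such a configuration write might contain, the helper below
builds a new Slot Control value from the desired indicator and power states.
The function name slot_control and its parameters are hypothetical. The field
positions are not given in this section and are taken from the standard Slot
Control register definition (Attention Indicator Control in bits 7:6, Power
Indicator Control in bits 9:8, each encoded as 01b = On, 10b = Blink,
11b = Off, and Power Controller Control in bit 10, where 0 means power on);
they should be checked against the spec before use.

```python
INDICATOR = {"on": 0b01, "blink": 0b10, "off": 0b11}
ATTN_SHIFT = 6                    # Attention Indicator Control, bits 7:6
PWR_IND_SHIFT = 8                 # Power Indicator Control, bits 9:8
POWER_CONTROLLER_OFF = 1 << 10    # Power Controller Control: 1 = slot power off


def slot_control(value, attention=None, power_indicator=None, slot_power=None):
    """Return `value` with the requested Slot Control fields updated.

    Fields left as None keep their current setting.
    """
    if attention is not None:
        value = (value & ~(0b11 << ATTN_SHIFT)) | (INDICATOR[attention] << ATTN_SHIFT)
    if power_indicator is not None:
        value = (value & ~(0b11 << PWR_IND_SHIFT)) | (INDICATOR[power_indicator] << PWR_IND_SHIFT)
    if slot_power is not None:
        value = (value & ~POWER_CONTROLLER_OFF) | (0 if slot_power else POWER_CONTROLLER_OFF)
    return value & 0xFFFF
```

For instance, the normal operating condition (Attention Indicator off, Power
Indicator on, slot powered) is slot_control(0, "off", "on", True).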

The sequence of events is as follows:

1. The operator requests card removal by pressing the slot's attention button
or by using the system's user interface to select the Physical Slot number of
the card to be removed. If the button was used, the Hot-Plug Controller
detects this event and delivers an interrupt to the root complex. The
interrupt directs the Hot-Plug Service to call the Hot-Plug System Driver to read
slot status information and detect the Attention Button request.
2. Next, the Hot-Plug Service commands the Hot-Plug System Driver to blink
the slot's Power Indicator as visual feedback to the operator for 5 seconds. If
this was initiated by pressing the Attention button, the operator can press
the button a second time to cancel the request during this 5-second interval.
3. The Power Indicator continues to blink while the Hot Plug software
validates the request. If the card is currently in use for some critical system
operation, software may deny the request. In that case, it will issue a
command to the Hot Plug controller to turn the Power Indicator back ON. The
spec also recommends that software notify the operator, perhaps with a
message or by logging an entry indicating the reason the request was
denied.
4. If the request is validated, the Hot-Plug Service utility commands the card's
device driver to quiesce the device. That is, disable its ability to generate
new Requests and complete or terminate all outstanding Root or Switch
Port requests.
5. Software then issues a command to disable the card's Link via the Link
Control register in the Root or Switch Port to which the slot is attached.
6. Next, software commands the Hot Plug Controller to turn the slot off.
7. Following successful power down, software issues the Power Indicator Off
Request to turn off the power indicator so the operator knows the card may
be removed.
8. The operator releases the Manually operated Retention Latch (MRL), if there
is one, causing the Hot Plug Controller to remove all switched signals from
the slot (e.g., SMBus and JTAG signals). The card can now be removed.
9. The OS deallocates the memory space, IO space, interrupt line, etc. that had
been assigned to the device and makes these resources available for
assignment to other devices in the future.
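
The 5-second abort interval in step 2 can be illustrated with a small
classifier over Attention Button press times. This is only a sketch: the
function name request_outcome and the returned labels are hypothetical, and
treating a press at exactly 5 seconds as a cancellation is an assumption, since
the text does not say whether the boundary is inclusive.

```python
ABORT_WINDOW_S = 5.0   # abort interval after the first button press


def request_outcome(press_times):
    """Classify Attention Button presses (ascending timestamps, in seconds).

    Returns "idle" if the button was never pressed, "cancelled" if a second
    press falls inside the abort window, and "proceed" otherwise.
    """
    if not press_times:
        return "idle"
    first = press_times[0]
    if any(t - first <= ABORT_WINDOW_S for t in press_times[1:]):
        return "cancelled"
    return "proceed"
```

A request that reaches "proceed" would then continue with validation in step 3.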

Card Insertion Procedure


The procedure for installing a new card basically reverses the steps listed for
card removal. The following steps assume that the slot was left in the same state
that it was in immediately after a card was removed from the connector (in
other words, the Power Indicator is in the Off state, indicating the slot is ready
for card insertion).

The steps taken to insert and enable a card are as follows:

1. The operator installs the card and secures the MRL. If implemented, the
MRL sensor will signal the Hot-Plug Controller that the latch is closed,
causing switched auxiliary signals and Vaux to be connected to the slot.
2. Next, the operator notifies the Hot-Plug Service that the card has been
installed by pressing the Attention Button or using the Hot Plug Utility
program to select the slot.
3. If the button was pressed, it signals the Hot Plug controller of the event,
resulting in status register bits being set and causing a system interrupt to
be sent to the Root Complex. Subsequently, Hot Plug software reads slot
status from the port and recognizes the request.
4. The Hot-Plug Service issues a request to the Hot-Plug System Driver
commanding the Hot Plug Controller to blink the slot's Power Indicator to
inform the operator that the card must not be removed. The operator is
granted a 5-second abort interval, from the time that the indicator starts to
blink, to abort the request by pressing the button a second time.
5. The Power Indicator continues to blink while Hot Plug software validates
the request. Note that software may fail to validate the request (e.g., the
security policy settings may prohibit the slot being enabled). If the request
is not validated, software will issue a command to the Hot Plug controller
to turn the Power Indicator back OFF. The spec recommends that software
notify the operator via a message or by logging an entry indicating the
cause of the request denial.
6. The Hot-Plug Service issues a request to the Hot-Plug System Driver
commanding the Hot Plug Controller to turn the slot on.
7. Once power is applied, software issues a command to turn the Power
Indicator ON.
8. Once link training is complete, the OS commands the Platform
Configuration Routine to configure the card function(s) by assigning the necessary
resources.
9. The OS locates the appropriate driver(s) (using the Vendor ID and Device
ID, or the Class Code, or the Subsystem Vendor ID and Subsystem ID
configuration register values as search criteria) for the function(s) within the
PCI Express device and loads it (or them) into memory.
10. The OS then calls the driver's initialization code entry point, causing the
processor to execute the driver's initialization code. This code finishes the
setup of the device and then sets the appropriate bits in the device's PCI
configuration Command register to enable the device.

Standardized Usage Model

Background
Systems based on the original 1.0 version of the PCI Hot Plug spec implemented
hardware and software designs that varied widely because the spec did not
define standardized registers or user interfaces. Consequently, customers who
purchased Hot Plug capable systems from different vendors were confronted
with a wide variation in user interfaces that required retraining operators when
new systems were purchased. Furthermore, every board designer was required
to write software to manage their implementation-specific hot plug controller.
The 1.1 revision of the PCI Hot Plug Controller (HPC) spec defines:

• a standard user interface that eliminates retraining of operators
• a standard programming interface for the hot plug controller, which
permits a standardized hot plug driver to be incorporated into the operating
system. PCI Express implements registers not defined by the HPC spec,
hence the standard Hot Plug Controller driver implementations for PCI and
PCI Express are slightly different.

Standard User Interface


The user interface includes the following features:

• Attention Indicator: shows the attention state of the slot with an LED that
is on, off, or blinking. The spec defines the blinking frequency as 1 to 2 Hz
and 50% (+/- 5%) duty cycle. The state of this indicator is strictly under
software control.
• Power Indicator (called Slot State Indicator in PCI HP 1.1): shows the
power status of the slot and also can be on, off, or blinking (at 1 to 2 Hz and
50% (+/- 5%) duty cycle). This indicator is controlled by software; however,
the spec permits an exception in the event of a hardware power fault
condition.
• Manually Operated Retention Latch and Optional Sensor: secures the card
within the slot and notifies the system when the latch is released.
• Electromechanical Interlock (optional): locks the card or retention latch to
prevent the card from being removed while power is applied.
• Software User Interface: allows the operator to request a hot plug operation.
• Attention Button: allows the operator to manually request a hot plug
operation.
• Slot Numbering Identification: provides visual identification of the slot on
the board.
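
The blink timing quoted for both indicators translates into concrete LED
on-times. The short calculation below is a sketch that assumes the "+/- 5%"
tolerance is absolute, meaning a duty cycle anywhere from 45% to 55%; the
function name blink_on_time_range is hypothetical.

```python
def blink_on_time_range(freq_hz, duty=0.50, tol=0.05):
    """Return (min, max) LED on-time in seconds per blink period.

    Assumes the tolerance is absolute, i.e. duty cycle in [duty - tol, duty + tol].
    """
    if not 1.0 <= freq_hz <= 2.0:
        raise ValueError("blink frequency must be 1 to 2 Hz")
    period = 1.0 / freq_hz
    return period * (duty - tol), period * (duty + tol)
```

Under that assumption, a 1 Hz blink keeps the LED lit for roughly 0.45 to
0.55 seconds of each 1-second period, and a 2 Hz blink for roughly 0.225 to
0.275 seconds of each 0.5-second period.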

Attention Indicator
As mentioned in the previous section, the spec requires the system vendor to
include an Attention Indicator associated with each Hot Plug slot. This indicator
must be located in close proximity to the corresponding slot and is yellow or
amber in color. This Indicator draws the attention of the end user to the slot for
service. The spec makes a clear distinction between operational and validation
errors and does not permit the attention indicator to report validation errors.
Validation errors are problems detected and reported by software prior to
beginning the hot plug operation. The behavior of the Attention Indicator is
listed in Table 19-3 on page 860.

Table 19-3: Behavior and Meaning of the Slot Attention Indicator

Indicator Behavior | Attention State
Off | Normal: Normal Operation
On | Attention: Hot Plug Operation Failed due to an operational problem (e.g., problems with external cabling, add-in cards, software drivers, and power faults)
Blinking | Locate: Slot is being identified at operator's request

Power Indicator
The power indicator simply reflects the state of main power at the slot, and is
controlled by Hot Plug software. This indicator is green and is illuminated
when power to the slot is on.

The spec specifically prohibits Root or Switch Port hardware from changing the
power indicator state autonomously as a result of power fault or other events. A
single exception to this rule allows a platform to detect stuck-on power faults. A
stuck-on fault is simply a condition in which commands issued to remove slot
power are ineffective. If the system is designed to detect this condition the
system may override the Root or Switch Port's command to turn the power
indicator off and force it to remain on. This notifies the operator that the card should
not be removed from the slot. The spec further states that supporting stuck-on
faults is optional and, if handled via system software, "the platform vendor
must ensure that this optional feature of the Standard Usage Model is
addressed via other software, platform documentation, or by other means."

The behavior of the power indicator and the related power states are listed in
Table 19-4 on page 861. Note that Vaux remains on and switched signals are still
connected until the retention latch is released or when the card is removed as
detected by the PRSNT1# and PRSNT2# signals.


Table 19-4: Behavior and Meaning of the Power Indicator

Indicator Behavior    Power State

Off                   Power Off: it is safe to remove or insert a card. All
                      power has been removed as required for hot plug
                      operation. Vaux is only removed when the Manual
                      Retention Latch is released.

On                    Power On: removal or insertion of a card is not allowed.
                      Power is currently applied to the slot.

Blinking              Power Transition: card removal or insertion is not
                      allowed. This state notifies the operator that software
                      is currently removing or applying slot power in response
                      to a hot plug request.

Manually Operated Retention Latch and Sensor


The Manual Retention Latch (MRL) is required and holds PCI Express cards rigidly
in the slot. Each MRL can implement an optional sensor that notifies the Hot
Plug Controller that the latch has been closed or opened. The spec also allows a
single latch that can hold down multiple cards. Such implementations do not
support the MRL sensor.

An MRL Sensor is a switch, optical device, or other type of sensor that reports
whether the latch is closed or open. If an unexpected latch release is detected,
the port automatically disables the slot and notifies system software, although
changing the state of the Power or Attention indicators autonomously is not
allowed.

The switched signals and auxiliary power (Vaux) must be automatically removed
from the slot when the MRL Sensor indicates that the MRL is open, and they must
be restored to the slot when the MRL Sensor indicates that the latch is closed.
The switched signals are Vaux, SMBCLK, and SMBDAT.

The spec also describes an alternate method for removing Vaux and SMBus power
when an MRL sensor is not present. The PRSNT2# pin indicates whether a card is
physically installed in the slot and can be used to trigger the port to remove
the switched signals.


Electromechanical Interlock (optional)


The optional electromechanical card interlock mechanism provides a more
sophisticated method of ensuring that a card is not removed while power is
applied to the slot. The spec does not define the specific nature of the
interlock, but states that it can physically lock the add-in card or the MRL in
place.

The lock mechanism is controlled via software; however, there is no specific
programming interface defined for it. Instead, an interlock is controlled by the
same Port signal that enables main power to the slot.

Software User Interface


An operator may use a software interface to request card removal or insertion.
This interface is provided by system software, which also monitors slots and
reports status information to the operator. The spec states that the user
interface is implemented by the Operating System and consequently is beyond the
scope of the spec.

The operator must be able to initiate operations at each slot independent of
other slots. Consequently, the operator may initiate a hot plug operation on one
slot using the software user interface or attention button while a hot plug
operation on another slot is in process. This can be done regardless of which
interface the operator used to start the first Hot Plug operation.

Attention Button
The Attention Button is a momentary-contact push-button switch, located near the
corresponding Hot Plug slot or on a module. The operator presses this button to
initiate a hot plug operation for this slot (e.g., card removal or insertion).
Once the Attention Button is pressed, the Power Indicator starts to blink. From
the time the blinking begins, the operator has 5 seconds to abort the Hot Plug
operation by pressing the button a second time.
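
The abort window can be pictured as a small per-slot timer. The following is
only an illustrative sketch, not anything defined by the spec: the class name
AttentionButtonTracker, its methods, and the returned strings are hypothetical,
and timestamps are passed in explicitly so the logic can be followed by hand.

```python
from dataclasses import dataclass
from typing import Optional

ABORT_WINDOW_S = 5.0  # abort window described in the text above


@dataclass
class AttentionButtonTracker:
    """Hypothetical helper that tracks one slot's pending button request."""

    pending_since: Optional[float] = None

    def press(self, now: float) -> str:
        """Handle a button press at time `now` (seconds)."""
        if self.pending_since is None:
            # First press: software would start blinking the Power Indicator.
            self.pending_since = now
            return "pending"
        if now - self.pending_since < ABORT_WINDOW_S:
            # Second press inside the window cancels the request.
            self.pending_since = None
            return "aborted"
        # Window already elapsed; this sketch simply ignores the press.
        return "ignored"

    def poll(self, now: float) -> str:
        """Check whether the pending request should now proceed."""
        if self.pending_since is None:
            return "idle"
        if now - self.pending_since >= ABORT_WINDOW_S:
            self.pending_since = None
            return "proceed"
        return "pending"
```

Injecting the time instead of reading a clock is a design choice for this
sketch; a real Hot Plug driver would more likely use an OS timer.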

The spec recommends that if an operation initiated by an Attention Button fails,
the system software should notify the operator of the failure. For example, a
message explaining the nature of the failure can be reported or logged.

Slot Numbering Identification


Software and operators must be able to identify a physical slot based on its
slot number. Each hot-plug capable port must implement registers that software
uses to identify the physical slot number. The registers include a Physical Slot
number and a chassis number. The main chassis is always labeled chassis 0. The
chassis numbers for other chassis must be non-zero and are assigned via the
PCI-to-PCI bridge's Chassis Number register.

Standard Hot Plug Controller Signaling Interface


Figure 19-3 on page 864 presents a more detailed view of the logic within Switch
Ports, along with the signals routed between the slot and the Port. The
importance of the standardized Hot Plug Controller is the common software
interface that allows the device driver to be integrated into operating systems.

The PCIe spec, together with the Card ElectroMechanical (CEM) spec, defines the
slot signals and the support required for Hot Plug PCI Express. Following is a
list of required and optional port interface signals needed to support the
Standard Usage Model:

• PWRLED# (required): port output that controls the state of the Power
  Indicator.
• ATNLED# (required): port output that controls the state of the Attention
  Indicator.
• PWREN (required if reference clock is implemented): port output that controls
  main power to the slot.
• REFCLKEN# (required): port output that controls delivery of the reference
  clock to the slot.
• PERST# (required): port output that controls PERST# at the slot.
• PRSNT1# (required): grounded at the connector.
• PRSNT2# (required): port input, pulled up on the system board, that indicates
  the presence of a card in the slot.
• PWRFLT# (required): port input that notifies the Hot Plug controller of a
  power fault condition detected by external logic.
• AUXEN# (required if AUX power is implemented): port output that controls
  switched AUX signals and AUX power to the slot when the MRL is opened and
  closed. The MRL# signal is required when AUX power is present.
• MRL# (required if MRL Sensor is implemented): port input from the MRL sensor.
• BUTTON# (required if Attention Button is implemented): port input indicating
  that the operator has pressed the Attention Button.


Figure 19-3: Hot Plug Control Functions within a Switch

The Hot-Plug Controller Programming Interface


The standard programming interface to the Hot Plug Controller is provided via
the PCI Express Capability register block, shown in Figure 19-4 on page 865,
where the Hot Plug related registers are highlighted. Hot Plug features are
primarily found in the Slot Registers defined for Root and Switch Ports. The
Device Capability register is also used in some implementations as described
later in this chapter.

Figure 19-4: PCIe Capability Registers Used for Hot Plug

(Each row is one 32-bit dword; the upper bits are listed first.)

DW0     PCI Express Capabilities Register | Next Cap Pointer | PCI Express Cap ID
DW1     Device Capabilities Register
DW2     Device Status | Device Control
DW3     Link Capabilities
DW4     Link Status | Link Control
DW5     Slot Capabilities
DW6     Slot Status | Slot Control
DW7     Root Capability | Root Control
DW8     Root Status
DW9     Device Capabilities 2
DW10    Device Status 2 | Device Control 2
DW11    Link Capabilities 2
DW12    Link Status 2 | Link Control 2
DW13    Slot Capabilities 2
DW14    Slot Status 2 | Slot Control 2

Slot Capabilities
Figure 19-5 on page 866 illustrates the slot capability register and bit fields.
Hardware initializes all of these capability register fields to reflect the
features implemented by this port. This register applies to both card slots and
rack mount implementations, except for the indicators and attention button.
Software must read from the device capability register within the module to
determine if indicators and attention buttons are implemented. Table 19-5 on
page 866 lists and defines the slot capability fields.


Figure 19-5: Slot Capabilities Register

(Bit-field diagram. The fields, from bit 31 down to bit 0, are: Physical Slot
Number, No Command Completed Support, Electromechanical Interlock Present, Slot
Power Limit Scale, Slot Power Limit Value, Hot Plug Capable, Hot Plug Surprise,
Power Indicator Present, Attention Indicator Present, MRL Sensor Present, Power
Controller Present, Attention Button Present.)

Table 19-5: Slot Capability Register Fields and Descriptions

Bit(s)   Register Name and Description

0        Attention Button Present: indicates the presence of an attention
         button on the chassis adjacent to the slot.

1        Power Controller Present: indicates the presence of a power
         controller for this slot.

2        MRL Sensor Present: indicates the presence of an MRL Sensor on the
         slot.

3        Attention Indicator Present: indicates the presence of an attention
         indicator on the chassis adjacent to the slot.

4        Power Indicator Present: indicates the presence of a power indicator
         on the chassis adjacent to the slot.

5        Hot Plug Surprise: indicates that it's possible for the user to
         remove the card from the system without prior notification. This
         tells the OS to allow for such removal without affecting continued
         software operation.

6        Hot Plug Capable: indicates that this slot supports hot plug
         operation.

14:7     Slot Power Limit Value: specifies the maximum power that can be
         supplied by this slot. This limit value is multiplied by the scale
         specified in the next field.

16:15    Slot Power Limit Scale: specifies the scaling factor for the Slot
         Power Limit Value.

17       Electromechanical Interlock Present: indicates that this is
         implemented for this slot.

18       No Command Completed Support: indicates that this slot doesn't
         generate software notification when a command has been completed.
         Earlier versions sometimes took a long time to execute hot plug
         commands (for example, sometimes taking a second or more to
         communicate across an I2C bus to turn the power on or off), and
         generated an interrupt when they were finally done. When set, this
         bit means that this Port can accept writes to all fields in the Slot
         Control register without delay, so there's no need for the
         notification.

31:19    Physical Slot Number: indicates the physical slot number associated
         with this port. It must be hardware initialized to a number that is
         unique within the chassis. Note that software will need this number
         to relate the physical slot to the Logical Slot ID (Bus, Device, &
         Function number for this device).
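
As an illustration of how software might pull these fields out of a raw 32-bit
Slot Capabilities value, here is a minimal sketch. The bit positions follow the
table above; the function name decode_slot_capabilities and the dictionary keys
are hypothetical, and how the register value is actually read from
configuration space is platform specific and not shown.

```python
def decode_slot_capabilities(reg: int) -> dict:
    """Split a 32-bit Slot Capabilities value into its fields.

    Bit positions follow Table 19-5; the key names are illustrative only.
    """

    def bit(n: int) -> bool:
        return bool((reg >> n) & 1)

    return {
        "attention_button_present": bit(0),
        "power_controller_present": bit(1),
        "mrl_sensor_present": bit(2),
        "attention_indicator_present": bit(3),
        "power_indicator_present": bit(4),
        "hot_plug_surprise": bit(5),
        "hot_plug_capable": bit(6),
        "slot_power_limit_value": (reg >> 7) & 0xFF,    # bits 14:7
        "slot_power_limit_scale": (reg >> 15) & 0x3,    # bits 16:15
        "em_interlock_present": bit(17),
        "no_command_completed_support": bit(18),
        "physical_slot_number": (reg >> 19) & 0x1FFF,   # bits 31:19
    }


# Made-up example value: slot 3, limit value 25, hot plug capable,
# with attention button, power controller, and attention indicator present.
example = (3 << 19) | (25 << 7) | (1 << 6) | 0b01011
fields = decode_slot_capabilities(example)
```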

Slot Power Limit Control


The spec provides a method for software to limit the amount of power consumed by
a card installed in an expansion slot or backplane implementation. The registers
to support this feature are included in the Slot Capability register.


Slot Control
Software controls the Hot Plug events through the Slot Control register, shown
in Figure 19-6 on page 868. This register permits software to enable various Hot
Plug features and control hot plug operations. It's also used to enable
interrupt generation as well as enabling the sources of Hot Plug events that can
result in interrupt generation.

Figure 19-6: Slot Control Register

(Bit-field diagram. The fields, from bit 15 down to bit 0, are: RsvdP, Data Link
Layer State Changed Enable, Electromechanical Interlock Control, Power
Controller Control, Power Indicator Control, Attention Indicator Control, Hot
Plug Interrupt Enable, Command Completed Interrupt Enable, Presence Detect
Changed Enable, MRL Sensor Changed Enable, Power Fault Detected Enable,
Attention Button Pressed Enable.)


Table 19-6: Slot Control Register Fields and Descriptions

Bit(s)   Register Name and Description

0        Attention Button Pressed Enable. When set, this bit enables the
         generation of a hot plug interrupt (if enabled) or assertion of the
         Wake# message when the attention button is pressed.

1        Power Fault Detected Enable. When set, enables generation of a hot
         plug interrupt (if enabled) or Wake# message upon detection of a
         power fault.

2        MRL Sensor Changed Enable. When set, enables generation of a hot plug
         interrupt (if enabled) or Wake# message upon detection of an MRL
         sensor changed event.

3        Presence Detect Changed Enable. When set, this bit enables the
         generation of the hot plug interrupt or a Wake# message when the
         Presence Detect Changed bit in the Slot Status register is set.

4        Command Completed Interrupt Enable. When set, enables a Hot Plug
         interrupt to be generated that informs software that the hot plug
         controller is ready to receive the next command.

5        Hot Plug Interrupt Enable. When set, enables the generation of Hot
         Plug interrupts.

7:6      Attention Indicator Control. Writes to the field control the state of
         the attention indicator and reads return the current state, as
         follows:
           00b = Reserved
           01b = On
           10b = Blink
           11b = Off

9:8      Power Indicator Control. Writes to the field control the state of the
         power indicator and reads return the current state, as follows:
           00b = Reserved
           01b = On
           10b = Blink
           11b = Off

10       Power Controller Control. Writes to the field switch main power to
         the slot and reads return the current state: 0b = Power On,
         1b = Power Off.

11       Electromechanical Interlock Control. If the interlock is implemented,
         writing a 1b to this bit toggles its state, while writing a 0b has no
         effect. Reading this bit always returns a 0b.

12       Data Link Layer State Changed Enable. If the Data Link Layer Link
         Active Reporting capability is 1b, setting this bit enables software
         notification when the Data Link Layer Link Active bit changes. If the
         Data Link Layer Link Active Reporting capability is 0b, then this bit
         becomes read-only with a value of 0b.
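
The indicator and power fields lend themselves to small read-modify-write
helpers. The sketch below uses the encodings from the table above; the names
Indicator, set_indicators, and set_slot_power are hypothetical, and the helpers
only compute the new 16-bit register value. Performing the actual configuration
write is left to platform code.

```python
from enum import IntEnum
from typing import Optional


class Indicator(IntEnum):
    """Indicator Control encodings from Table 19-6 (00b is reserved)."""

    ON = 0b01
    BLINK = 0b10
    OFF = 0b11


ATTN_IND_SHIFT = 6          # Attention Indicator Control, bits 7:6
PWR_IND_SHIFT = 8           # Power Indicator Control, bits 9:8
PWR_CTRL_BIT = 1 << 10      # Power Controller Control: 0 = on, 1 = off


def set_indicators(slot_ctl: int,
                   attention: Optional[Indicator] = None,
                   power: Optional[Indicator] = None) -> int:
    """Return slot_ctl with the requested indicator fields replaced."""
    if attention is not None:
        slot_ctl = (slot_ctl & ~(0x3 << ATTN_IND_SHIFT)) | (attention << ATTN_IND_SHIFT)
    if power is not None:
        slot_ctl = (slot_ctl & ~(0x3 << PWR_IND_SHIFT)) | (power << PWR_IND_SHIFT)
    return slot_ctl & 0xFFFF


def set_slot_power(slot_ctl: int, on: bool) -> int:
    """Return slot_ctl with main slot power requested on or off."""
    if on:
        return slot_ctl & ~PWR_CTRL_BIT & 0xFFFF
    return (slot_ctl | PWR_CTRL_BIT) & 0xFFFF
```

For example, software preparing to power up a slot might call
set_indicators(ctl, power=Indicator.BLINK), then set_slot_power(ctl, True), and
finally set_indicators(ctl, power=Indicator.ON) once power is stable.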

Slot Status and Events Management


The Hot Plug Controller monitors a variety of events and reports these events to
the Hot Plug System Driver. Software can use the "detected" and "changed" bits
to determine which event has occurred, while the state bits identify the nature
of the change. The changed bits must be cleared by software in order to detect a
subsequent change. Note that whether these events get reported to the system
(via a system interrupt) is determined by the related enable bits in the Slot
Control Register.

Figure 19-7: Slot Status Register

(Bit-field diagram. The fields, from bit 15 down to bit 0, are: RsvdZ, Data Link
Layer State Changed, Electromechanical Interlock Status, Presence Detect State,
MRL Sensor State, Command Completed, Presence Detect Changed, MRL Sensor
Changed, Power Fault Detected, Attention Button Pressed.)


Table 19-7: Slot Status Register Fields and Descriptions

Bit        Register Name and Description
Location

0          Attention Button Pressed: if the button is implemented, this bit is
           set when the Attention Button is pressed.

1          Power Fault Detected: if a Power Controller that supports power
           fault detection is implemented, this bit is set when it detects a
           power fault at this slot. The spec notes that it's possible for a
           power fault to be detected at any time, regardless of the Power
           Control setting or whether the slot is occupied.

2          MRL Sensor Changed: if an MRL Sensor is implemented, this is set
           when an MRL Sensor state change is detected. If no sensor is
           present, this bit will always be zero.

3          Presence Detect Changed: set when a change has been detected in the
           Presence Detect State bit.

4          Command Completed: if the No Command Completed Support bit in the
           Slot Capabilities register is 0b, then this bit is set when a hot
           plug command has completed and the Hot Plug Controller is ready to
           accept another command. Technically, only this last meaning is
           guaranteed: the controller is ready to accept another command,
           regardless of whether the previous one has actually completed.

5          MRL Sensor State: indicates the current state of the MRL sensor, if
           implemented: 0b = MRL Closed, 1b = MRL Open.

6          Presence Detect State: this bit indicates the presence of a card in
           a slot and is required for all Downstream Ports that implement a
           slot. Its value is the logical OR of the Physical Layer's Detection
           logic and any other sideband detect mechanism implemented for the
           slot (such as PRSNT1# and PRSNT2#). The big difference between them
           is that the pins require no power to physically detect the card and
           can thus report on it without needing the power restored, while
           using the Physical Layer Detect logic does need power.

7          Electromechanical Interlock Status: if an Electromechanical
           Interlock is implemented, this bit indicates whether it is engaged
           (1b) or disengaged (0b).

8          Data Link Layer State Changed: this bit is set when the Data Link
           Layer Link Active bit in the Link Status register changes. In
           response to this event, software must read the Data Link Layer Link
           Active bit to determine whether the Link is active before sending
           configuration cycles to the hot plugged device.
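
A Hot Plug interrupt handler typically separates the event bits from the state
bits and then clears the events it has handled. The sketch below follows the bit
positions in the table above; the helper names are hypothetical. It also assumes
that the event bits are cleared by writing a 1 to them (write-1-to-clear), which
should be confirmed against the PCIe Base spec for the target device.

```python
# Event ("detected"/"changed") bits and state bits, per Table 19-7.
EVENT_BITS = {
    0: "attention_button_pressed",
    1: "power_fault_detected",
    2: "mrl_sensor_changed",
    3: "presence_detect_changed",
    4: "command_completed",
    8: "data_link_layer_state_changed",
}
STATE_BITS = {
    5: "mrl_open",
    6: "card_present",
    7: "interlock_engaged",
}


def pending_events(status: int) -> list:
    """Names of all event bits currently set, in ascending bit order."""
    return [name for b, name in EVENT_BITS.items() if (status >> b) & 1]


def current_state(status: int) -> dict:
    """Current values of the state bits."""
    return {name: bool((status >> b) & 1) for b, name in STATE_BITS.items()}


def clear_mask(status: int) -> int:
    """Value to write back to clear exactly the events that were observed.

    Assumption: event bits are write-1-to-clear; state bits are left as 0.
    """
    event_mask = sum(1 << b for b in EVENT_BITS)
    return status & event_mask
```

Writing back only the observed events, rather than a fixed all-ones mask, avoids
discarding an event that arrives between the read and the write.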

Add-in Card Capabilities


The Device Capability register, seen in Figure 19-8 on page 873, also has fields
relevant to add-in cards that record the power reported by the Hot Plug
Controller as being available to their slot. This information must be
communicated automatically with a Set_Slot_Power_Limit Message whenever either
of these takes place:

• A configuration write to the Slot Capabilities register changes the Slot
  Power Limit Value and Slot Power Limit Scale values.
• The Link transitions from non-DL_Up to DL_Up status (unless the Slot
  Capabilities register has not yet been initialized).

The message updates the Captured Slot Power Limit Value and Scale registers with
the values in the message, making this information readily available to its
device driver.
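
A device driver would typically convert the captured value and scale into watts.
The sketch below takes the field positions from the Device Capabilities register
figure (value in bits 25:18, scale in bits 27:26). The scale encoding used here
(00b = 1.0x, 01b = 0.1x, 10b = 0.01x, 11b = 0.001x) is not given in this
chapter; it is stated here as an assumption taken from the PCIe Base spec, and
later spec revisions also define special encodings for very large values that
this sketch ignores. The function name is hypothetical.

```python
def captured_slot_power_limit_watts(dev_cap: int) -> float:
    """Convert the Captured Slot Power Limit fields to watts.

    Assumes scale n means "divide the value by 10**n" (n = 0..3).
    """
    value = (dev_cap >> 18) & 0xFF   # Captured Slot Power Limit Value
    scale = (dev_cap >> 26) & 0x3    # Captured Slot Power Limit Scale
    return value / (10 ** scale)


# Made-up example: value 25 with scale 00b.
dev_cap_example = (0 << 26) | (25 << 18)
limit = captured_slot_power_limit_watts(dev_cap_example)
```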


Figure 19-8: Device Capabilities Register

(Bit-field diagram. The fields, from bit 31 down to bit 0, are: RsvdP,
Function-Level Reset Capability, Captured Slot Power Limit Scale, Captured Slot
Power Limit Value, RsvdP, Role-Based Error Reporting, Undefined, Endpoint L1
Acceptable Latency, Endpoint L0s Acceptable Latency, Extended Tag Field
Supported, Phantom Functions Supported, Max Payload Size Supported.)

Quiescing Card and Driver

General
Prior to removing a card from the system, two things must occur: the device
driver must stop accessing the card, and the card must stop initiating or
responding to new Requests. How this is accomplished is OS-specific, but the
following must take place:

• The OS must stop issuing new requests to the device's driver or instruct the
  driver to stop accepting new requests.
• The driver must terminate or complete all outstanding requests.
• The card must be disabled from generating interrupts or Requests.

When the OS commands the driver to quiesce itself and its device, the OS must
not expect the device to remain in the system (in other words, it could be
removed and not replaced with an identical card).


Pausing a Driver (Optional)


Optionally, an OS could implement a "Pause" capability to temporarily stop
driver activity in the expectation that the same card will be reinserted. If the
card is not reinstalled within a reasonable amount of time, however, the driver
must be quiesced and then removed from memory.

As an example, suppose the currently installed card is failing or is being
replaced with a later revision as an upgrade. If the operation is to appear
seamless from a software and operational perspective, the driver would have to
quiesce the device, save the current context (contents of registers, stack and
instruction pointer of the local microcontroller, etc.) and turn off the power
to the slot. The new card could then be installed and powered, and then, when
its context is restored, it could resume normal operation where it left off. Of
course, if the old card had failed, it may not be possible to simply resume
operation.

Quiescing a Driver That Controls Multiple Devices


If a driver controls multiple cards and it receives a command from the OS to
quiesce its activity with respect to a specific card, it must only quiesce its
activity with that card and the card itself.

Quiescing a Failed Card


If a card has failed, it may not be possible for the driver to complete requests
previously issued to the card. In this case, the driver must detect the error,
terminate the requests without completion, and attempt to reset the card.

The Primitives
This section discusses the hot-plug software elements and the information passed
between them. For a review of the software elements and their relationships to
each other, refer to Table 19-1 on page 852. Communication between the Hot-Plug
Service within the OS and the Hot-Plug System Driver is in the form of requests.
The spec doesn't define the exact format of these requests, but does define the
basic request types and their content. Each request type issued to the Hot-Plug
System Driver by the Hot-Plug Service is referred to as a primitive. They are
listed and described in Table 19-8 on page 875.


Table 19-8: The Primitives

Primitive:    Query Hot Plug System Driver
Parameters:   Input: None
              Return: Set of Logical Slot IDs for slots controlled by this
              driver.
Description:  Requests that the Hot-Plug System Driver return a set of Logical
              Slot IDs for the slots it controls.

Primitive:    Set Slot Status
Parameters:   Inputs:
                • Logical Slot ID
                • New slot state (on or off).
                • New Attention Indicator state.
                • New Power Indicator state.
              Return: Request completion status:
                • status change successful
                • fault: wrong frequency
                • fault: insufficient power
                • fault: insufficient configuration resources
                • fault: power fail
                • fault: general failure
Description:  This request is used to control the slots and the Attention
              Indicator associated with each slot. Good completion of a request
              is indicated by returning the Status Change Successful parameter.
              If a fault is incurred during an attempted status change, the
              Hot-Plug System Driver should return the appropriate fault
              message (see the return values listed under Parameters). Unless
              otherwise specified, the card should be left in the off state.

Primitive:    Query Slot Status
Parameters:   Input: Logical Slot ID
              Return:
                • Slot state (on or off)
                • Card power requirements.
Description:  This request returns the state of the indicated slot (if a card
              is present). The Hot-Plug System Driver must return the Slot
              Power status information.

Primitive:    Async Notice of Slot Status Change
Parameters:   Input: Logical Slot ID
              Return: none
Description:  This is the only primitive (defined by the spec) that is issued
              to the Hot-Plug Service by the Hot-Plug System Driver. It is sent
              when the Driver detects an unsolicited change in the state of a
              slot. Examples would be a run-time power fault or a card
              installed in a previously empty slot with no warning.
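
Since the spec defines only the request types and their content, not their
format, any concrete interface is an OS design decision. The sketch below shows
one way the four primitives could map onto a small in-memory class. Everything
in it is hypothetical: the class ToyHotPlugSystemDriver, the method names, the
single shared power budget, and the decision to ignore the indicator arguments
are illustrative choices, not part of the spec.

```python
from enum import Enum


class Completion(Enum):
    """Request completion status values listed for Set Slot Status."""

    SUCCESS = "status change successful"
    WRONG_FREQUENCY = "fault: wrong frequency"
    INSUFFICIENT_POWER = "fault: insufficient power"
    INSUFFICIENT_CONFIG_RESOURCES = "fault: insufficient configuration resources"
    POWER_FAIL = "fault: power fail"
    GENERAL_FAILURE = "fault: general failure"


class ToyHotPlugSystemDriver:
    """Hypothetical in-memory stand-in for a Hot-Plug System Driver."""

    def __init__(self, card_watts, budget_watts, notify):
        # card_watts maps Logical Slot ID -> watts the installed card needs.
        self.card_watts = dict(card_watts)
        self.budget_watts = budget_watts
        self.powered = {slot: False for slot in self.card_watts}
        self.notify = notify  # callback into the Hot-Plug Service

    def query_hot_plug_system_driver(self):
        return set(self.card_watts)

    def set_slot_status(self, slot_id, power_on, attention=None, power_indicator=None):
        # The indicator arguments are accepted but ignored in this toy model.
        if slot_id not in self.card_watts:
            return Completion.GENERAL_FAILURE
        if power_on and not self.powered[slot_id]:
            in_use = sum(w for s, w in self.card_watts.items() if self.powered[s])
            if in_use + self.card_watts[slot_id] > self.budget_watts:
                return Completion.INSUFFICIENT_POWER  # slot is left off
        self.powered[slot_id] = power_on
        return Completion.SUCCESS

    def query_slot_status(self, slot_id):
        return {"slot_on": self.powered[slot_id],
                "card_watts": self.card_watts[slot_id]}

    def async_notice_of_slot_status_change(self, slot_id):
        # The only primitive that flows from the driver to the service.
        self.notify(slot_id)
```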

Introduction to Power Budgeting


The primary goal of the PCI Express power budgeting capability is to allocate
power for PCI Express hot plug devices that are added to the system during
runtime. This ensures that the system can allocate the proper amount of power
and cooling for these devices.

The spec states that power budgeting capability is optional for PCI Express
devices implemented in a form factor which does not require hot plug, or that
are integrated on the system board. None of the form factor specs released at
the time of this writing required support for hot plug or the power budgeting
capability, but these change often.

System power budgeting is always required to support all system board devices
and add-in cards. The new capability provides mechanisms for managing the
budgeting process for a hot-plug card. Each form factor spec defines the min and
max power for a given expansion slot. For example, the CEM spec limits the power
an expansion card can consume prior to being fully enabled but, after it is
enabled, it can consume the maximum amount of power specified for the slot. In
the absence of the power budgeting capability registers, the system designer is
responsible for guaranteeing that power has been budgeted correctly and that
sufficient cooling is available to support any compliant card installed into the
connector.

The spec defines the configuration registers to support the power budgeting
process, but does not define the power budgeting methods and processes. The next
section describes the hardware and software elements that would be involved in
power budgeting, including the specified configuration registers.


The Power Budgeting Elements


Figure 19-10 illustrates the concept of Power Budgeting for hot plug cards. The
role of each element involved in the power budgeting, allocation, and reporting
process is listed and described below:

• System Firmware for Power Management (used during boot time).
• Power Budget Manager (used during run time).
• Expansion Ports (to which card slots are attached).
• Add-in Devices (Power Budget Capable).

System Firmware
Written by the platform designers for the specific system, this is responsible
for reporting system power information. The spec recommends the following power
information be reported to the PCI Express power budget manager, which allocates
and verifies power consumption and dissipation during runtime:

• Total system power available.
• Power allocated to system devices by firmware.
• Number and type of slots in the system.

Firmware may also allocate power to PCIe devices that support the power
budgeting capability register set, such as a hot-plug device used during boot
time. The Power Budgeting Capability register, shown in Figure 19-9 on page 878,
contains a System Allocated bit that is hardware initialized (usually by
firmware) to notify the power budget manager that power for this device has
already been included in the system power allocation. If so, the Power Budget
Manager still needs to read and save the power information for the hot-plug
devices that were allocated in case they are later removed during runtime.


Figure 19-9: Power Budget Registers

Offset   Register (bits 31..0)
00h      PCIe Extended Capability Header
04h      RsvdP | Data Select Register
08h      Data Register
0Ch      RsvdP | Power Budget Capability Register

The System Allocated bit is bit 0 of the Power Budget Capability Register.

The Power Budget Manager


This initializes when the OS installs and receives power-budget information from
system firmware, although the spec does not define the method for delivering
this information. This manager is responsible for allocating power for all PCI
Express devices including:

• PCI Express devices that have not already been allocated by the system
  (including embedded devices that support power budgeting).
• Hot-plugged devices installed at boot time.
• New devices added during runtime.

Expansion Ports
Figure 19-10 on page 880 illustrates a hot plug port that must have the Slot
Power Limit and Slot Power Scale fields within the Slot Capabilities register
implemented. The firmware or power budget manager must load these fields with a
value that represents the maximum amount of power supported by this Port. When
software writes to these fields the Port automatically delivers a
Set_Slot_Power_Limit message to the device. These fields are also written when
software configures a new card that has been added as a hot plug installation.

Spec requirements:

• Any Downstream Port that has a slot attached (the Slot Implemented bit in its
  PCIe Capabilities register is set) must implement the Slot Capabilities
  register.
• Software must initialize the Slot Power Limit Value and Scale fields of the
  Slot Capabilities register of the Downstream Port that is connected to an
  add-in slot.
• Upstream Ports must implement the Device Capabilities register.
• When a card is installed in a slot and software updates the power limit and
  scale values in the Downstream Port, that Port will automatically send the
  Set_Slot_Power_Limit message to the Upstream Port on the installed card.
• The recipient of the Message must use the data payload to limit its power
  usage for the entire card, unless the card will never exceed the lowest value
  specified in the corresponding electromechanical spec.
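
The interaction between the two ports can be modeled in a few lines. The sketch
below is a toy simulation of the behavior described above, not an
implementation of any real driver API: the classes DownstreamPort and
UpstreamPort and all of their methods are hypothetical. It sends the message
when software writes the limit fields and when the Link comes up with the fields
already initialized.

```python
class UpstreamPort:
    """Toy model of the Upstream Port on an add-in card."""

    def __init__(self):
        self.captured_value = 0
        self.captured_scale = 0

    def receive_set_slot_power_limit(self, value, scale):
        # The message payload updates the Captured Slot Power Limit fields.
        self.captured_value = value
        self.captured_scale = scale


class DownstreamPort:
    """Toy model of a Root or Switch Downstream Port with a slot."""

    def __init__(self):
        self.limit_value = None   # None means "not yet initialized"
        self.limit_scale = None
        self.device = None        # Upstream Port at the other end, if any
        self.messages_sent = 0

    def write_slot_power_limit(self, value, scale):
        """Software writes Slot Power Limit Value/Scale."""
        self.limit_value, self.limit_scale = value, scale
        self._send_if_possible()

    def link_became_active(self, device):
        """Link transitions to DL_Up with `device` attached."""
        self.device = device
        if self.limit_value is not None:
            self._send_if_possible()

    def _send_if_possible(self):
        if self.device is not None:
            self.device.receive_set_slot_power_limit(self.limit_value, self.limit_scale)
            self.messages_sent += 1
```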

Add-in Devices
Expansion cards that support the power budgeting capability must include the
Captured Slot Power Limit Value and Captured Slot Power Limit Scale fields
within the Device Capabilities register, and the Power Budgeting Capability
register set for reporting power-related information.

These devices must not consume more than the lowest power specified by the form
factor spec. Once power budgeting software allocates additional power via the
Set_Slot_Power_Limit message, the device can consume the power that has been
specified, but not until it has been configured and enabled.

Device Driver: The device's software driver is responsible for verifying that
sufficient power is available for proper device operation prior to enabling it.
If the power is lower than that required by the device, the device driver is
responsible for reporting this to a higher software authority.


Figure 19-10: Elements Involved in Power Budget

(Block diagram. Firmware Power Budgeting reports power budget information to the
Power Budget Manager in the Operating System, including the total system power
budget, the total power allocated to system board devices, and the total number
and type of slots. The PCIe Bus Driver configures Ports with power limit
information by writing the Slot Power Value and Slot Power Scale fields of the
Slot Capabilities register in the Root or Switch Port. The Root or Switch Port
then sends a power limit message to the add-in card, which records it in the
Captured Slot Power Limit Value and Scale fields of its Device Capabilities
register. The add-in card also implements the Power Budget Capability
registers.)


Slot Power Limit Control


Software is responsible for determining the maximum power that an expansion
device is allowed to consume. This allocation is based on the power partitioning
within the system, thermal capabilities, etc. Knowledge of the system's power
and thermal limits comes from system firmware. The firmware or power manager is
responsible for reporting the power limits to each expansion port.

Expansion Port Delivers Slot Power Limit


Software writes to the Slot Power Limit Value and Slot Power Limit Scale fields
of the Slot Capability register to specify the maximum power that can be
consumed by the device. Software is required to specify a power value that
reflects one of the maximum values defined by the spec. For example, revision
2.0 of the CEM spec defines power usage as listed in Table 19-9.

An interesting note about these values is that a standard-height x1 server card
is limited to 10W after a reset and is only allowed to use the full 25W after
it's been configured and enabled. Similarly, a x16 graphics card will be limited
to 25W until configured and enabled to use the full 75W.

Table 19-9: Maximum Power Consumption for System Board Expansion Slots

                    x1 Link              x4/x8 Link    x16 Link

Standard Height     10W (max desktop)    25W (max)     25W (max server)
                    25W (max server)                   75W (max graphics card)

Low Profile Card    10W (max)            25W (max)     25W (max)

In addition to the base CEM spec, two more specs have been defined for
higher-powered devices. First is the PCIe x16 Graphics 150W-ATX Spec 1.0, which
defines a video card that's able to draw 75W from the card connector and another
75W from a separate 2x3-pin ATX power connector. The second is the PCIe
225W/300W High Power CEM Spec 1.0, which extends this by adding another 2x3-pin
power connector to achieve 225W, or a 2x4-pin ATX connector that brings the
total to 300W.


When the Slot Power registers are written by power budget software, the
expansion port sends a Set_Slot_Power_Limit message to the expansion device.
This procedure is illustrated in Figure 19-11 on page 882.

Figure 19-11: Slot Power Limit Sequence

(Block diagram. The Root or Switch Port, containing the Hot-Plug Controller and
the Slot Capabilities register with its Slot Power Scale and Slot Power Value
fields, sends a power limit message to the add-in card, which records it in the
Captured Slot Power Limit Scale and Value fields of its Device Capabilities
register.)

1. When Hot Plug software is notified of a card insertion request, Power and
   Clock are restored to the slot.
2. Hot Plug software calls configuration and power budgeting software to
   configure and allocate power to the device.
3. Power budget software may interrogate the card to determine its power
   requirements and characteristics.
4. Power is then allocated based on the device's requirements and the system's
   capabilities.
5. Power management software writes to the Slot Power Scale and Slot Power Value
   fields within the expansion port.
6. Writes to these fields command the port to send the Set_Slot_Power_Limit
   message to convey the contents of the Slot Power fields.
7. The expansion device receives the message and updates its Captured Slot Power
   Limit Value and Scale fields.
8. These values limit the power that the expansion device can consume once it is
   enabled by its device driver.


Expansion Device Limits Power Consumption


The device driver reads the values from the Captured Slot Power Limit and Scale
fields to verify that the power available is sufficient to operate the device.
Several conditions may exist:

• Enough power is available to operate the device at full capability. In this
  case, the driver enables the device by writing to the configuration Command
  register, permitting the device to consume power up to the limit specified in
  the Power Limit fields.
• The power available is sufficient to operate the device but not at full
  capability. In this case, the driver is required to configure the device such
  that it consumes no more power than specified in the Power Limit fields.
• The power available is insufficient to operate the device. In this case, the
  driver must not enable the card and must report the inadequate power condition
  to the upper software layers, which should in turn inform the end user of the
  problem.
• The power available exceeds the maximum power specified by the form factor
  spec. This condition should not occur but, if it does, the device is not
  permitted to consume power beyond the maximum permitted by the form factor.
• The power available is less than the lowest value specified by the form factor
  spec. This is a violation of the spec, which states that the expansion port
  must not transmit a Set_Slot_Power_Limit Message that indicates a limit lower
  than the lowest value specified in the electromechanical spec for the slot's
  form factor.

Some expansion devices may consume less power than the lowest limit specified for their form factor. Such devices are permitted to discard the information delivered in the Set_Slot_Power_Limit Messages. When the Slot Power Limit Value and Scale fields are read, these devices return zeros.
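
The driver's check described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the bit positions are read from the Device Capabilities layout in Figure 19-11, the scale multipliers are assumed to follow the same 1.0x/0.1x/0.01x/0.001x encoding shown for the Data Scale field, and `full_w` and `min_w` are hypothetical device-specific requirements that a real driver would know about its own hardware.

```python
# Multipliers for the 2-bit scale field (assumed: same encoding as Data Scale).
SCALE = {0b00: 1.0, 0b01: 0.1, 0b10: 0.01, 0b11: 0.001}


def captured_slot_power_limit(dev_cap: int) -> float:
    """Return the captured slot power limit in watts."""
    value = (dev_cap >> 18) & 0xFF  # Captured Slot Power Limit Value, bits [25:18]
    scale = (dev_cap >> 26) & 0x3   # Captured Slot Power Limit Scale, bits [27:26]
    return value * SCALE[scale]


def driver_decision(limit_w: float, full_w: float, min_w: float) -> str:
    """Pick one of the driver outcomes described in the text.

    full_w and min_w are hypothetical, device-specific requirements.
    """
    if limit_w >= full_w:
        return "enable at full capability"
    if limit_w >= min_w:
        return "enable with reduced power configuration"
    return "do not enable; report inadequate power"
```

A device that is allowed to ignore Set_Slot_Power_Limit Messages would read back zeros here, so a driver for such a device would not apply this check.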

The Power Budget Capabilities Register Set


These registers permit power budgeting software to allocate power more effectively based on information provided by the device through its power budget data select and data register. This feature is similar to the data select and data fields within the power management capability registers. However, the power budget registers provide more detailed information to software to aid it in determining the effects of expansion cards that are added during runtime on the system power budget and cooling requirements. Through this capability, a device can report the power it consumes:

• from each power rail
• in various power management states
• in different operating conditions

These registers are not required for devices implemented on the system board or on expansion devices that do not support hot plug. Figure 19-12 on page 884 illustrates the power budget capabilities register set and shows the data select and data field that provide the method for accessing the power budget information.

The power budget information is maintained within a table that consists of one or more 32-bit entries. Each table entry contains power budget information for the different operating modes supported by the device. Each table entry is selected via the data select field, and the selected entry is then read from the data field. The index values start at zero and are implemented in sequential order. When a selected index returns all zeros in the data field, the end of the power budget table has been located. Figure 19-13 on page 885 illustrates the format and types of information available from the data field.

Figure 19-12: Power Budget Capability Registers

Offset   Register (bits 31:0)
00h      PCIe Extended Capability Header
04h      RsvdP | Data Select Register
08h      Data Register
0Ch      RsvdP | Power Budget Capability Register

Figure 19-13: Power Budget Data Field Format and Definition

This entire register is read-only.

Bits    Field          Definition
31:21   RsvdP          Reserved.
20:18   Power Rail     Power rail of operating condition described by this entry:
                       000b = 12V power, 001b = 3.3V power, 010b = 1.8V power,
                       111b = Thermal. All other encodings are reserved.
17:15   Type           Type of operating condition described by this entry:
                       000b = PME Aux, 001b = Auxiliary, 010b = Idle,
                       011b = Sustained, 111b = Maximum.
                       All other encodings are reserved.
14:13   PM State       PM state described by this entry:
                       00b = D0, 01b = D1, 10b = D2, 11b = D3.
                       D3-Cold = 11b and Aux or PME Aux in Type field.
                       D3-Hot = 11b + any other Type value.
12:10   PM Sub State   PM sub state of operating condition described by this entry:
                       000b = Default Sub State,
                       001b-111b = Device-specific Sub State.
9:8     Data Scale     00b = 1.0x, 01b = 0.1x, 10b = 0.01x, 11b = 0.001x.
7:0     Base Power     Specifies the base power (in watts) for the state indicated
                       by bits [20:10]. Base Power x Scale = actual power consumption.

How it works:
The power budgeting data for the function consists of a table of n entries starting with entry 0. Each entry is read by placing an index value in the Power Budgeting Data Select register and then reading the value returned in the Power Budgeting Data register. The end of table is indicated by a return value of all 0's in the Data register.
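
The table walk and field decode described in Figure 19-13 can be sketched as shown below. This is only an illustration under stated assumptions: `read_entry` is a hypothetical callback standing in for "write the index to the Data Select register, then read the Data register", and the 256-entry cap assumes an 8-bit Data Select field.

```python
from typing import Callable, Dict, List

POWER_RAIL = {0b000: "12V", 0b001: "3.3V", 0b010: "1.8V", 0b111: "Thermal"}
OP_TYPE = {0b000: "PME Aux", 0b001: "Auxiliary", 0b010: "Idle",
           0b011: "Sustained", 0b111: "Maximum"}
DATA_SCALE = {0b00: 1.0, 0b01: 0.1, 0b10: 0.01, 0b11: 0.001}


def decode_entry(data: int) -> Dict:
    """Split one 32-bit power budget entry into its fields."""
    base_power = data & 0xFF                 # bits [7:0]
    scale = DATA_SCALE[(data >> 8) & 0x3]    # bits [9:8]
    return {
        "watts": base_power * scale,         # Base Power x Scale
        "pm_sub_state": (data >> 10) & 0x7,  # bits [12:10]
        "pm_state": "D%d" % ((data >> 13) & 0x3),             # bits [14:13]
        "type": OP_TYPE.get((data >> 15) & 0x7, "Reserved"),  # bits [17:15]
        "rail": POWER_RAIL.get((data >> 18) & 0x7, "Reserved"),  # bits [20:18]
    }


def read_power_budget_table(read_entry: Callable[[int], int]) -> List[Dict]:
    """Read entries from index 0 upward until an all-zero entry is returned."""
    entries = []
    for index in range(256):  # assumed limit of an 8-bit Data Select field
        data = read_entry(index)
        if data == 0:         # all zeros marks the end of the table
            break
        entries.append(decode_entry(data))
    return entries
```

For instance, an entry with Base Power = 75, Data Scale = 01b, Type = 011b, and Power Rail = 000b would decode to roughly 7.5 watts of Sustained power on the 12V rail in D0.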


20 Updates for Spec Revision 2.1
Previous Chapter
The previous chapter describes the PCI Express hot plug model. A standard usage model is also defined for all devices and form factors that support hot plug capability. Power is an issue for hot plug cards, too, and when a new card is added to a system during runtime, it's important to ensure that its power needs don't exceed what the system can deliver. A mechanism was needed to query the power requirements of a device before giving it permission to operate. Power budgeting registers provide that.

This Chapter
This chapter describes the changes and new features that were added with the 2.1 revision of the spec. Some of these topics, like the ones related to power management, are described in other chapters, but for others there wasn't another logical place for them. In the end, it seemed best to group them all together in one chapter to ensure that they were all covered and to help clarify what features were new.

The Next Chapter


The next section is the book appendix which includes topics such as: Debugging PCI Express Traffic using LeCroy Tools, Markets & Applications of PCI Express Architecture, Implementing Intelligent Adapters and Multi-Host Systems with PCI Express Technology, Legacy Support for Locking and the book Glossary.

Changes for PCIe Spec Rev 2.1


The 2.1 revision of the spec for PCIe introduced several changes to enhance performance or improve operational characteristics. It did not add another data rate and that's why it was considered an incremental revision. The modifications can be grouped generally into four areas of improvement: System Redundancy, Performance, Power Management, and Configuration.


System Redundancy Improvement: Multi-casting


The Multicasting capability allows a Posted Write TLP to be routed to more than one destination at the same time, allowing for things like automatically making redundant copies of data or supporting multi-headed graphics. As shown in Figure 20-1 on page 888, a TLP sourced from one Endpoint can be routed to multiple destinations based solely on its address. In this example, data is sent to the video port for display while redundant copies of it are automatically routed to storage. There are other ways this activity could be supported, of course, but this is very efficient in terms of Link usage since it doesn't require a recipient to re-send the packet to secondary locations.

Figure 20-1: Multicast System Example

[Figure: System diagram showing SDRAM, GFX, the Root Complex, a Switch, Endpoints, a NIC, and SCSI controllers with attached Disks.]

This mechanism is only supported for posted, address-routed Requests, such as Memory Writes, that contain data to be delivered and an address that can be decoded to show which Ports should receive it. Non-posted Requests will not be treated as Multicast even if their addresses fall within the MultiCast address range. Those will be treated as unicast TLPs just as they normally would.

The setup for Multicast operation involves programming a new register block for each routing element and Function that will be involved, called the Multicast Capability structure. The contents of this block are shown in Figure 20-2 on page 889, where it can be seen that they define addresses and also MCGs (MultiCast Group numbers) that explain whether a Function should send or receive copies of an incoming TLP or whether a Port should forward them. Let's describe these registers next and discuss how they're used to create Multicast operations in a system.


Figure 20-2: Multicast Capability Registers

PCIe Enhanced Capability Header (offset 00h):

Bits    Field
31:20   Next Extended Capability Offset
19:16   Version (1h)
15:0    PCIe Extended Capability ID (0012h for Multicast)

Offset     Register (bits 31:0)
00h        PCIe Enhanced Capability Header
04h        Multicast Control | Multicast Capability
08h-0Ch    MC_Base_Address Register
10h-14h    MC_Receive Register (MCGs this Function is allowed to receive or forward)
18h-1Ch    MC_Block_All Register (MCGs this Function must not send or forward)
20h-24h    MC_Block_Untranslated Register (MCGs this Function must not send or forward if the address is untranslated)
28h-2Ch    MC_Overlay_BAR (Root Ports and Switch Ports)

Multicast Capability Registers


The Capability Header register at the top of the figure includes the Capability ID of 0012h, a 4-bit Version number, and a pointer to the next capability structure in the linked list of registers.

Multicast Capability
This register, shown in detail in Figure 20-3 on page 890, contains several fields. The MC_Max_Group value defines how many Multicast Groups this Function has been designed to support minus one, so that a value of zero means one group is supported. The Window Size Requested, which is only valid for Endpoints and reserved in Switches and Root Ports, represents the address size needed for this purpose as a power of two.

Figure 20-3: Multicast Capability Register

Bits    Field                              Description
15      MC_ECRC_Regeneration_Supported
14      RsvdP
13:8    MC_Window_Size_Requested           Exponent for MC window size in Endpoints; RsvdP in Switches and RC
7:6     RsvdP
5:0     MC_Max_Group                       Max number of MCGs supported minus 1

Lastly, bit 15 indicates whether this Function supports regenerating the ECRC value in a TLP if forwarding it involved making address changes to it. Refer to the section called "Overlay Example" on page 895 for more detail on this.
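
As a rough illustration of the field layout just described, the 16-bit Multicast Capability register could be unpacked as follows. The function name and the returned dictionary keys are hypothetical; only the bit positions come from Figure 20-3.

```python
def decode_mc_capability(reg: int) -> dict:
    """Unpack the 16-bit Multicast Capability register."""
    return {
        # MC_Max_Group holds the supported group count minus one, bits [5:0].
        "groups_supported": (reg & 0x3F) + 1,
        # MC_Window_Size_Requested is a power-of-two exponent, bits [13:8]
        # (meaningful for Endpoints only).
        "window_size_exponent": (reg >> 8) & 0x3F,
        # MC_ECRC_Regeneration_Supported, bit 15.
        "ecrc_regeneration_supported": bool((reg >> 15) & 0x1),
    }
```

A register value of 8A07h, for example, would decode to 8 supported groups, a requested window exponent of 10 (1KB), and ECRC regeneration supported.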

Multicast Control
This register, shown in Figure 20-4 on page 890, contains the MC_Num_Group field that is programmed with the number of Multicast Groups configured by software for use by this Function, minus one. The default value is zero, and the spec notes that programming a value here that is greater than the max value defined in the MC_Max_Group register will result in undefined behavior. The MC_Enable bit is used to enable the Multicast mechanism for this component.

Figure 20-4: Multicast Control Register

Bits    Field           Description
15      MC_Enable
14:6    RsvdP
5:0     MC_Num_Group    Number of MCGs configured minus 1


Multicast Base Address


The base address register, shown in Figure 20-5 on page 891, contains the 64-bit starting address of the Multicast Address range for this component. The MultiCast Index Position register indicates the bit position within the address where the MultiCast Group (MCG) number is to be found. When the address of an incoming TLP falls within the MultiCast address range starting at this Base Address, the logic will offset into the address itself by the number of bit locations given in the Index Position and interpret the next bits (up to 6 bits, allowing up to 64 groups) as the MCG number for that TLP. The MCG number, in turn, will indicate whether the Port should forward a copy of this TLP.

Figure 20-5: Multicast Base Address Register

            Bits 31:12                 Bits 11:6   Bits 5:0
Lower DW    MC_Base_Address [31:12]    RsvdP       MC_Index_Position
Upper DW    MC_Base_Address [63:32]

An example of locating the MCG within the address is shown in Figure 20-6 on page 892. Here the Index Position value is 24, so the MCG occupies the six address bits starting at bit 24 (bits 24 to 29). Interestingly, since the base address doesn't define the lower 12 bits of the address, the MC Index Position must be 12 or greater to be valid. If it's less than 12 and the MC_Enable bit is set, the component's behavior will be undefined.


Figure 20-6: Position of Multicast Group Number

[Figure: 4DW Memory Request header (Byte 0: Fmt, Type, TC, Attr, TH, TD, EP, Attr, AT, Length; Byte 4: Requester ID, Tag, Last DW BE, 1st DW BE; Byte 8: Address [63:32]; Byte 12: Address [31:2]) with the MCG field marked within Address [31:2] for MC_Index_Position = 24.]
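
The hit check and MCG extraction can be sketched as below. This is a simplified model, not the spec's wording: it assumes the multicast range is made up of (MC_Num_Group + 1) equal windows of 2^MC_Index_Position bytes starting at MC_Base_Address, and the function and parameter names are hypothetical.

```python
from typing import Optional


def multicast_group(addr: int, mc_base: int, index_pos: int,
                    num_group: int) -> Optional[int]:
    """Return the MCG for addr, or None if addr is not a multicast hit."""
    if index_pos < 12:
        # The text notes that values below 12 give undefined behavior.
        raise ValueError("MC_Index_Position must be 12 or greater")
    window = 1 << index_pos
    limit = mc_base + window * (num_group + 1)  # end of the configured range
    if not (mc_base <= addr < limit):
        return None                             # ordinary unicast address
    # Skip index_pos bits of offset, then take the next 6 bits as the MCG.
    return ((addr - mc_base) >> index_pos) & 0x3F
```

With a base of 8000_0000h, an Index Position of 24, and all 64 groups configured, the address AF00_1234h would fall in the window for MCG 47.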

MC Receive
This 64-bit register is a bit vector that indicates for which of the 64 MCGs this Function should accept a copy or this Port should forward a copy. If the MCG value is found to be 47, for example, and bit 47 is set in this register, then this Function should receive it or this Port should forward it.

MC Block All
This 64-bit register indicates which MCGs an Endpoint Function is blocked from sending and which a Switch or Root Port is blocked from forwarding. This can be programmed in a Switch or Root Port to prevent it from forwarding MultiCast TLPs to an Endpoint that doesn't understand them, for example. A blocked TLP is considered an error condition, and how the error is handled is described in the next section.

MC Block Untranslated
The meaning and use of this 64-bit register is almost identical to the Block All register except that it doesn't apply to TLPs whose AT header field shows them to be translated. This mechanism can be used to set up a Multicast window that is protected in that it can only receive translated addresses.

If a TLP is blocked because of the setting of either of these two blocking registers, it's handled as an MC Blocked TLP, meaning it gets dropped and the Port or Function logs and signals this as an error. Logging the error involves setting the Signaled Target Abort bit in its Status register or its Secondary Status register, as appropriate. That's barely enough information to be useful, though, so the spec highly recommends that Advanced Error Reporting (AER) registers be implemented in Functions with Multicast capability to facilitate isolating and diagnosing faults.

The spec notes that this register is required in all Functions that implement the MC Capability registers, but if an Endpoint Function doesn't implement the ATS (Address Translation Services) registers, the designer may choose to make these bits reserved.

Multicast Example
At this point, an example will help to illustrate how these registers can be used to set up a multicast environment. To set this up, let's first give the relevant registers some values:

• MC_Base_Address = 2GB (Starting address for the multicast range)
• MC_Max_Group = 7 (Meaning 8 windows are possible for this design)
• MC_Window_Size_Requested = 10 (Meaning 2^10 or 1KB size was requested by an Endpoint)
• MC_Index_Position = 12 (Meaning the actual size of each window is 2^12)
• MC_Num_Group = 5 (Meaning software only configured 6 of the available multicast windows).

Based on those register settings, the image in Figure 20-7 on page 894 illustrates the result. The multicast window range is shown starting at 2GB and ranging as high as 2GB + 8 * (the window size). However, only 6 are enabled by software, so the actual multicast address range is from 2GB to 2GB + 24KB. The windows are all the same size and correspond to the MCGs: MCG 0 is the first window, 1 is the next window, and so on.
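
The arithmetic in this example can be checked with a few lines of code. The helper name `mc_windows` is hypothetical; the register values are the ones listed above.

```python
GB = 1 << 30
KB = 1 << 10

MC_BASE_ADDRESS = 2 * GB
MC_INDEX_POSITION = 12  # each window is 2^12 bytes (4KB)
MC_NUM_GROUP = 5        # software configured 6 windows (value is count minus 1)


def mc_windows(base: int, index_pos: int, num_group: int):
    """Return (mcg, first_address, last_address) for each configured window."""
    size = 1 << index_pos
    return [(mcg, base + mcg * size, base + (mcg + 1) * size - 1)
            for mcg in range(num_group + 1)]


windows = mc_windows(MC_BASE_ADDRESS, MC_INDEX_POSITION, MC_NUM_GROUP)
range_end = MC_BASE_ADDRESS + (MC_NUM_GROUP + 1) * (1 << MC_INDEX_POSITION)

for mcg, first, last in windows:
    print("MCG %d: %#x - %#x" % (mcg, first, last))
print("Configured range size: %dKB" % ((range_end - MC_BASE_ADDRESS) // KB))
```

Six windows of 4KB each give the 24KB range quoted in the text, with MCG 0 starting at the 2GB base address.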


Figure 20-7: Multicast Address Example

[Figure: System Memory Map. MC Address Range = 2GB to 2GB + 2^12 * 6 = 2GB to 2GB + 24KB. MC Group 0 through MC Group 5 are stacked upward from MC_Base_Address (2GB) to 2GB + 24KB. Only 6 MC windows are configured for use. 8 MC windows are available in hardware, each at least 2^10 in size (technically, 2^12 is the minimum address granularity).]

MC Overlay BAR
This last set of registers is required for Switch and Root Ports that implement Multicasting, but they're not implemented in Endpoints. The motivation for this BAR is that it allows two special cases. First, a Port can forward TLPs downstream if they hit in a multicast window even if the Endpoint wasn't designed for multicasting. Second, a Port can forward multicast TLPs upstream to system memory. In both cases, this is accomplished by replacing part of the Request's address with an address that will be recognized by the target. Doing so allows a single BAR in a component to serve as a target for both unicast and multicast writes even if it wasn't designed with multicast capability.

As shown in Figure 20-8 on page 895, this register block consists of an address that will be overlaid onto the outgoing TLP, and a 6-bit Overlay Size indicator. The size referred to here is simply the number of bits from the original 64-bit address that will be retained, while all the others will be replaced by the Overlay BAR bits. The spec mistakenly refers to this in at least one place as the size in bytes, but in other places it's made clear that it is a bit number. Note that the overlay size value must be 6 or higher to enable the overlay operation. If the size is given as 5 or lower, no overlay will take place and the address is unchanged.


Figure 20-8: Multicast Overlay BAR

            Bits 31:6                 Bits 5:0
Lower DW    MC_Overlay_BAR [31:6]     MC_Overlay_Size
Upper DW    MC_Overlay_BAR [63:32]

Overlay Example
Now consider the case in which an address overlay is desired, as shown in Figure 20-9 on page 896. Here the address of a TLP to be forwarded, ABCD_BEEFh, falls within the defined multicast range (also referred to as a multicast hit) and the egress Port has been configured with valid values in the Overlay BAR.
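
The address substitution in this example can be sketched as follows. The overlay address FEED_0000h comes from Figure 20-9; the Overlay Size of 16 is an assumption chosen here because it matches the BAR range FEED_0000h to FEED_FFFFh shown in that figure, and the function name is hypothetical.

```python
MASK_64 = 0xFFFF_FFFF_FFFF_FFFF


def apply_overlay(addr: int, overlay_bar: int, overlay_size: int) -> int:
    """Return the egress address after applying the Multicast Overlay BAR."""
    if overlay_size < 6:
        return addr                       # overlay disabled: address unchanged
    keep_mask = (1 << overlay_size) - 1   # low bits retained from the original
    upper = overlay_bar & ~keep_mask & MASK_64  # remaining bits from the BAR
    return upper | (addr & keep_mask)


# Assumed Overlay Size of 16: keep the low 16 bits, replace the rest.
overlaid = apply_overlay(0xABCD_BEEF, 0xFEED_0000, 16)
print("%08X" % overlaid)
```

Under that assumption the low 16 bits (BEEFh) are retained and the upper bits are replaced, giving the overlaid address FEED_BEEFh shown in the figure.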

The overlay case creates the unusual situation with the ECRC value that was mentioned earlier in the description of the Multicast Capability register. If the TLP whose address is being changed by the overlay includes an ECRC, that value would be rendered incorrect by this change. Switches and Root Ports optionally support regenerating the ECRC based on the new address so that it still serves its purpose going forward. If the routing agent does not support it, the ECRC is simply dropped and the TD header bit is forced to zero to avoid any confusion.

A potential problem can arise with ECRC regeneration. If the incoming TLP already had an error but the ECRC value is regenerated because the address was modified, that would inadvertently hide the original error. To avoid that, the routing agent must verify the original ECRC first. If it finds an error, it must force a bad ECRC on the outgoing TLP by inverting the calculated ECRC value before appending it to ensure that the target will see it as an error condition.


Figure 20-9: Overlay Example

[Figure: System Memory Map. The Original Address ABCD_BEEFh lies in the Multicast Address Range; the Overlaid Address FEED_BEEFh lies in the PCIe BAR Range FEED_0000 to FEED_FFFF.]

Routing Multicast TLPs


When a Switch or Root Port detects an MC hit (address falls within the MC range), normal routing is suspended. The MCG is extracted from the address and is compared to the MC_Receive register of all the Ports to see which of them should forward a copy of this TLP. Ports whose corresponding Receive register bit is set will forward a copy of the TLP unless their corresponding MC Blocked register bit is also set. If no Ports forward the TLP and no Functions consume it, it is silently dropped. To prevent loops, a TLP is never forwarded back out on its ingress Port, with the possible exception of an ACS case.

Endpoints extract the MCG and compare it with their Receive register. If there's no match, the TLP is silently dropped. If the Endpoint doesn't support Multicasting, it will treat the TLP as having an ordinary address.
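
The per-Port decision described above might be modeled roughly as shown here. This is a simplified sketch that follows the description in this section rather than the spec's exact rules: it ignores the ACS exception, and the function name, parameters, and result strings are all hypothetical.

```python
def port_decision(mcg: int, mc_receive: int, mc_block_all: int,
                  mc_block_untranslated: int, translated: bool,
                  is_ingress_port: bool = False) -> str:
    """Return "forward", "skip", or "blocked" for one Port and one MC TLP."""
    bit = 1 << mcg                  # each 64-bit register has one bit per MCG
    if is_ingress_port:
        return "skip"               # never sent back out the ingress Port
    if not (mc_receive & bit):
        return "skip"               # this Port is not a member of the MCG
    if mc_block_all & bit:
        return "blocked"            # MC Blocked TLP: dropped and logged as an error
    if (mc_block_untranslated & bit) and not translated:
        return "blocked"            # only translated addresses are allowed
    return "forward"
```

If every Port returns "skip" and no Function consumes the TLP, the text above says it is silently dropped, whereas a "blocked" result is the error case handled as an MC Blocked TLP.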


Congestion Avoidance
The use of Multicasting will increase the amount of system traffic in proportion to the percentage of MC traffic, which leads to the risk of packet congestion. To avoid creating backpressure, MC targets should be designed to accept MC traffic "at speed", meaning with minimal delay. To avoid oversubscribing the Links, MC initiators should limit their packet injection rate. A system designer would be wise to choose components carefully to handle this, for example by using Switches and Root Ports whose buffers are big enough to handle the expected traffic, and Endpoints that are able to accept their incoming MC packets quickly enough to avoid trouble.

Performance Improvements
System performance is enhanced with the addition of four new features:
1. AtomicOps to replace the legacy transaction locking mechanism
2. TLP Processing Hints to allow software to suggest caching options
3. ID-Based Ordering to avoid unnecessary latency
4. Alternative Routing-ID Interpretation to increase the number of Functions available in a device.

AtomicOps
Processors that share resources or otherwise communicate with each other sometimes need uninterrupted, or "atomic", access to system resources to do things like testing and setting semaphores. On parallel processor buses this was accomplished by locking the bus with the assertion of a Lock pin until the originator completed the whole sequence (a read followed by a write), during which time other processors were not allowed to initiate transactions on the bus. PCI included a Locked pin to apply this same model on the PCI bus as on the processor bus, allowing this protocol to be used with peripheral devices.

This model worked but was slow on the shared processor bus and even worse when going onto the PCI bus. That's one reason why PCIe limited its use only to Legacy devices. However, the increasing use of shared processing in today's PCs, such as graphics co-processors and compute accelerators, has brought this issue back to the fore because the different compute engines need to be able to share an atomic protocol. The way this problem was resolved on PCIe was to introduce three new commands that can each do a series of things atomically within the target device rather than requiring a series of separate uninterruptable commands on the interface. These new commands, called AtomicOps, are:
1. FetchAdd (Fetch and Add): This Request contains an "add" value. It reads the target location, adds the "add" value to it, stores the result in the target location and returns the original value of the target location. This could be used in support of atomically updating statistics counters.
2. Swap (Unconditional Swap): This Request contains a "swap" value. It reads the target location, writes the "swap" value into it, and returns the original target value. This could be useful for atomically reading and clearing counters.
3. CAS (Compare and Swap): This Request contains both a "compare" value and a "swap" value. It reads the target location, compares it against the "compare" value and, if they're equal, writes in the "swap" value. Finally, it returns the original value of the target location. This can be useful as a "test and set" mechanism for managing semaphores.
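
The semantics of the three AtomicOps can be modeled in software as shown below. This is a behavioral sketch only: `AtomicTarget` is a hypothetical stand-in for a Completer's memory, each method is assumed to run without interruption, and wrapping the FetchAdd result at the operand width is an assumption of this sketch.

```python
class AtomicTarget:
    """Toy model of a Completer that services AtomicOp Requests."""

    def __init__(self, width_bits: int = 32):
        self.mask = (1 << width_bits) - 1  # operand width, e.g. 32 or 64 bits
        self.mem = {}                      # address -> current value

    def fetch_add(self, addr: int, add: int) -> int:
        """FetchAdd: store original + add, return the original value."""
        original = self.mem.get(addr, 0)
        self.mem[addr] = (original + add) & self.mask
        return original

    def swap(self, addr: int, swap: int) -> int:
        """Swap: store the swap value, return the original value."""
        original = self.mem.get(addr, 0)
        self.mem[addr] = swap & self.mask
        return original

    def cas(self, addr: int, compare: int, swap: int) -> int:
        """CAS: store swap only if the location equals compare."""
        original = self.mem.get(addr, 0)
        if original == compare:
            self.mem[addr] = swap & self.mask
        return original
```

A semaphore acquire could then be expressed as `target.cas(addr, 0, 1) == 0`: the Requester owns the semaphore only if the value returned shows that the location was free.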

Both Endpoints and Root Ports are optionally allowed to act as AtomicOp Requesters and Completers, which might seem unexpected because, in PCs at least, this kind of transaction is usually only initiated by the central processor. But modern systems can include an Endpoint acting as a co-processor, in which case it would need to be able to use AtomicOps to properly handle the protocol. All three commands support 32-bit and 64-bit operands, while CAS also supports 128-bit operands. The actual size in use will be given in the Length field in the header. Routing elements like Switch Ports and Root Ports with peer-to-peer access will need to support the AtomicOp routing capability to be able to recognize and route these Requests.

A question naturally arises as to how the system (Root Complex) will be instructed to generate these new commands in response to processor activity, since there may not be a directly analogous processor bus command. The spec suggests two approaches. First, the Root could be designed to recognize specific processor activity and interpret that to "export" a PCIe AtomicOp in response. Second, a register-based approach similar to the one used for legacy Configuration access could be used. In that case, one register might give the target address while another specified which command should be generated and the combination of the two would generate the Request.

AtomicOp Completers can be identified by the presence of the three new bits in the Device Capabilities 2 register, as shown in Figure 20-10 on page 899. Bit 6 of this register also identifies whether routing elements are capable of routing AtomicOps.


Legacy PCI does not comprehend AtomicOps, of course, and there is no straightforward way to translate them into PCI commands. For that reason, PCIe-to-PCI bridges do not support AtomicOps. If atomic access is needed on that bus it would have to be done with the legacy locked protocol and the spec states that Locked Transactions and AtomicOps can operate concurrently on the same platform.

Figure 20-10: Device Capabilities 2 Register

Bits    Field
31:24   RsvdP
23:22   Max End-End TLP Prefixes
21      End-End TLP Prefix Supported
20      Extended Fmt Field Supported
19:14   RsvdP
13:12   TPH Completer Supported
11      LTR Mechanism Supported
10      No RO-enabled PR-PR Passing
9       128-bit CAS Completer Supported
8       64-bit AtomicOp Completer Supported
7       32-bit AtomicOp Completer Supported
6       AtomicOp Routing Supported
5       ARI Forwarding Supported
4       Completion Timeout Disable Supported
3:0     Completion Timeout Ranges Supported
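
Software that wants to know whether a Function can complete or route AtomicOps could test the relevant bits roughly as follows. The bit positions (6 through 9) are read from the layout in Figure 20-10; the function name and dictionary keys are hypothetical.

```python
def atomicop_support(dev_cap2: int) -> dict:
    """Report the AtomicOp-related bits of a Device Capabilities 2 value."""
    return {
        "routing": bool(dev_cap2 & (1 << 6)),          # AtomicOp Routing Supported
        "completer_32bit": bool(dev_cap2 & (1 << 7)),  # 32-bit AtomicOp Completer
        "completer_64bit": bool(dev_cap2 & (1 << 8)),  # 64-bit AtomicOp Completer
        "cas_128bit": bool(dev_cap2 & (1 << 9)),       # 128-bit CAS Completer
    }
```

A Switch Port would be expected to report only the routing bit, while an Endpoint acting as a Completer would report one or more of the three completer bits.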

TPH (TLP Processing Hints)


Adding hints about how the system should handle TLPs targeting memory space can improve latency and traffic congestion. The spec describes this special handling basically as providing information about which of several possible cache locations in the system would be the optimal place for a temporary copy of a TLP. The spec makes note of the fact that, since the usage described for TPH relates to caching, it wouldn't usually make sense to use them with TLPs targeting Non-prefetchable Memory Space. If such usage was needed, it would be essential to somehow guarantee that caching such TLPs did not cause undesirable side effects.

TPH Examples

Device Write to Host Read. To help clarify the motivation for TPH, consider the example shown in Figure 20-11 on page 901. Here the Endpoint is writing data into memory for later use by the CPU. The sequence is as follows:
1. First, the Endpoint sends a memory write TLP containing an address that maps to the system memory. The packet gets routed to the Root Complex (RC).
2. The RC recognizes this as an access to a cacheable memory space and pauses its progress while it snoops the CPU cache. This may result in a write-back cycle from the CPU to update the system memory before the transaction can proceed, and this is shown as step 2a.
3. Once any write-backs have finished, the RC allows the write to update the system memory.
4. At some point, the Endpoint notifies the CPU about data delivery.
5. Finally, the CPU fetches the data from memory to complete the sequence.


Figure 20-11: TPH Example

[Figure: System diagram annotated with steps 2, 2a, 4, and 5 of the sequence.]

This sequence works but there's an opportunity for performance improvement by adding an intermediate cache in the system. To illustrate this, consider the example shown in Figure 20-12 on page 902. From the perspective of the Endpoint, the operation is the same but the RC knows to handle it differently. The steps now are as follows:
1. The Endpoint does the same memory write but this time TPH bits are included. The write is forwarded to the RC by the Switch as before.
2. The RC understands that this memory access must be snooped to the CPU as before. However, once the snoop has been handled, the RC is informed by the TPH bits to store this TLP in an intermediate cache rather than going to system memory.
3. The Endpoint notifies the CPU that the data item has been delivered.
4. The CPU reads from the specified address, but now the data is found in the intermediate cache and so the request does not go to system memory. This has the usual benefits we'd expect from a cache design: faster access time as well as reduced traffic for the system memory.


This is a simple Device Write to Host Read (DWHR) example to illustrate the concept but it wouldn't be hard to imagine a more complex system with a much larger topology in which there could be other caches placed in Switches or other locations to achieve the same benefits for other targets.

Figure 20-12: TPH Example with System Cache

[Figure: System diagram with an intermediate Cache, annotated with steps 1 through 4 of the sequence.]

Host Write to Device Read. To illustrate the concept going the other way (called Host Write to Device Read or HWDR), consider the example shown in Figure 20-13 on page 903. In this example, the CPU initiates a memory write whose address targets the PCIe Endpoint in step one. The packet contains TPH bits that tell the RC that it should be stored in an intermediate cache near the target, instead of the cache in the RC that was used in the previous example. In this case a cache built into the Switch serves the purpose. The TLP is then forwarded on to the target Endpoint in step two. This model is beneficial when the data is updated infrequently but read often by the Endpoint. That allows several memory reads that would normally go to system memory to be handled by the cache instead, offloading both the Link from the Switch to the RC and the path to memory.


Figure 20-13: TPH Usage for TLPs to Endpoint

[Figure: System diagram showing caches in the RC and in the Switch, annotated with step 2.]

Device to Device. One last example is illustrated in Figure 20-14 on page 904, where two Endpoints communicate with each other (called Device Read/Write to Device Read/Write or D*D*) through a shared memory location that is directed by TPH bits to an intermediate cache. In this case, both may update different locations that they need to handle as read-mostly, or one Endpoint may update data that the other needs to read several times. In both cases, using the intermediate cache improves system performance.


Figure 20-14: TPH Usage Between Endpoints

[Figure: System diagram showing two Endpoints sharing an intermediate Cache.]

TPH Header Bits


Several bits in the TLP header describe how the hints are used. First, as shown in the middle at the top of Figure 20-15 on page 905, the TH (TLP Hints) bit reports whether the optional TPH bits are in use for the TLP. When set, the PH (Processing Hint bits) indicate the next level of information.


Figure 20-15: TPH Header Bits

[Figure: 4DW Memory Request header (Byte 0: Fmt, Type, TC, Attr, TH, TD, EP, Attr, AT, Length; Byte 4: Requester ID, Tag, Last DW BE, 1st DW BE; Byte 8: Address [63:32]; Byte 12: Address [31:2], PH) with the TH bit in the first DW and the PH bits in the two LSBs of the last address DW.]

When the TH bit is set the PH bits, shown at the bottom right of Figure 20-15 on page 905, take the place of what were the two reserved LSBs in the address field. For a 32-bit address, these are byte 11 [1:0], while for the 64-bit address shown, they are byte 15 [1:0]. Their encoding is described in Table 20-1 on page 905. These hints are provided by the Requester based on knowledge of the data patterns in use, which is information that would be difficult for a Completer to deduce on its own.

Table 20-1: PH Encoding Table

PH[1:0]   Processing Hint                 Usage Model
00b       Bi-directional data structure   Indicates frequent read/write access by Host and device.
01b       Requester                       D*D* (device-to-device transfers). Indicates frequent read/write access by device. The asterisk means either device could be reading or writing.
10b       Target                          DWHR, HWDR (device-to-host or host-to-device transfers). Indicates frequent read/write access by Host.
11b       Target with Priority            Same as Target but with additional temporal re-use priority information. Indicates frequent read/write access by Host and high temporal locality for accessed data.
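
A receiver's interpretation of the TH and PH bits could be sketched as follows. The function name is hypothetical, and `last_address_dw` is assumed to be the final address DW of the header with the PH bits in its two LSBs, as described above.

```python
from typing import Optional

PH_ENCODING = {
    0b00: "Bi-directional data structure",
    0b01: "Requester",
    0b10: "Target",
    0b11: "Target with Priority",
}


def processing_hint(th_bit: int, last_address_dw: int) -> Optional[str]:
    """Return the Processing Hint name, or None if the TLP carries no hints."""
    if not th_bit:
        return None                  # the two LSBs are just reserved bits
    return PH_ENCODING[last_address_dw & 0x3]
```

When TH is clear, the two LSBs keep their original reserved meaning, so the sketch deliberately returns no hint rather than decoding them.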


The next level of information is the Steering Tag byte that provides system-specific information regarding the best place to cache this TLP. Interestingly, the location of this byte in the header varies depending on the Request type. For Posted Memory Writes the Tag field is repurposed to be the Steering Tag (no completion will be returned so the Tag isn't needed), while for Memory Reads the two Byte Enable fields are repurposed for it (byte enables are not needed for prefetchable reads). The meaning of the bits is implementation specific but they need to uniquely identify the location of the desired cache in the system.

Two formats for TPH are described in the spec and this level of hint information (TH + PH + 8-bit Steering Tag), called Baseline TPH, is the first and is required of all Requests that provide TPH. The second format uses TLP Prefixes to extend the Steering Tags (see "TLP Prefixes" on page 908 for more detail).

Steering Tags
These values are programmed by software into a table to be used during normal operation. The spec recommends that the table be located in the TPH Requester Capability structure, shown in Figure 20-16 on page 906, but it can alternatively be built into the MSI-X table instead. Only one or the other of these table locations can be used for a given Function. The location is given in the ST Table Location field [10:9] of the Requester Capability register, shown in Figure 20-17 on page 907. The encoding of these 2 bits is shown in Table 20-2 on page 907.

Figure 20-16: TPH Requester Capability Structure

DW     Register (bits 31:0)
DW0    PCI Express Capabilities Register | Next Cap Pointer | PCI Express Cap ID (17h)
DW1    TPH Requester Capability Register
DW2    TPH Requester Control Register
DW3    TPH ST Table (optional; sized by number of ST entries)


Figure 20-17: TPH Capability and Control Registers

TPH Requester Capability Register

Bits    Field
31:27   RsvdP
26:16   ST Table Size
15:11   RsvdP
10:9    ST Table Location
8       Extended TPH Requester Supported
7:3     RsvdP
2       Device-Specific Mode Supported
1       Interrupt Vector Mode Supported
0       No ST Mode Supported

TPH Requester Control Register

Bits    Field
31:10   RsvdP
9:8     TPH Requester Enable
7:3     RsvdP
2:0     ST Mode Select

Table 20-2: ST Table Location Encoding

Bits [10:9]    ST Table Location
00b            Not present
01b            Located in the Requester Capability structure
10b            Located in the MSI-X table
11b            Reserved


The Requester Capability register lists the number of entries in the ST Table in
bits [26:16]. Each table entry is 2 bytes wide, and the ST Table implemented in
the TPH Capability register set is shown in Figure 20-18 on page 908, where
entry zero is highlighted. The Requester Capability register also describes
which ST Modes are supported for the Requester with the 3 LSBs:
• No ST: uses zeros for ST bits. Selected in the TPH Requester Control
  register's ST Mode Select field when the value = 000b.
• Interrupt Vector: uses the interrupt vector number as the offset into the
  table, meaning the values are contained in the MSI-X table. (ST Mode Select
  value = 001b.)
• Device-Specific: uses a device-specific method to offset into the ST Table
  in the TPH Capability structure because the ST values are located there.
  This is the recommended implementation, although how a given Request is
  associated with a particular ST entry is outside the scope of the spec. (ST
  Mode Select value = 010b.)
• All other ST Mode Select encodings are reserved for future use.
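
As a rough illustration of how software might pick these fields apart, the
sketch below decodes a raw TPH Requester Capability value and the ST Mode
Select field of the Control register, using the bit positions listed above.
The function and dictionary names are hypothetical, and the ST Table Size is
returned as the raw field value without further interpretation.

```python
ST_TABLE_LOCATION = {
    0b00: "not present",
    0b01: "TPH Requester Capability structure",
    0b10: "MSI-X table",
    0b11: "reserved",
}

ST_MODE = {0b000: "No ST", 0b001: "Interrupt Vector", 0b010: "Device-Specific"}


def decode_tph_capability(reg: int) -> dict:
    """Split a 32-bit TPH Requester Capability register value into fields."""
    return {
        "no_st_mode": bool(reg & 0x1),                    # bit 0
        "interrupt_vector_mode": bool((reg >> 1) & 0x1),  # bit 1
        "device_specific_mode": bool((reg >> 2) & 0x1),   # bit 2
        "extended_tph_requester": bool((reg >> 8) & 0x1), # bit 8
        "st_table_location": ST_TABLE_LOCATION[(reg >> 9) & 0x3],  # bits [10:9]
        "st_table_size_field": (reg >> 16) & 0x7FF,       # bits [26:16], raw
    }


def st_mode_name(control_reg: int) -> str:
    """Name the ST mode selected in Control register bits [2:0]."""
    return ST_MODE.get(control_reg & 0x7, "reserved")


print(decode_tph_capability(0x00030205))
print(st_mode_name(0x00000102))
```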

Figure 20-18: TPH Capability ST Table

Bits [31:24]          Bits [23:16]          Bits [15:8]               Bits [7:0]
ST Upper Entry (1)    ST Lower Entry (1)    ST Upper Entry (0)        ST Lower Entry (0)
ST Upper Entry (3)    ST Lower Entry (3)    ST Upper Entry (2)        ST Lower Entry (2)
...
ST Upper Entry        ST Lower Entry        ST Upper Entry            ST Lower Entry
(Table Size)          (Table Size)          (Table Size - 1)          (Table Size - 1)

TLP Prefixes
The Steering Tag bits can be extended with the addition of optional TLP Prefixes
if needed. When one or more Prefixes are given with the TLP, the header reports
it by setting the most significant bit in the Format field, as shown in
Figure 20-19 on page 909.


Figure 20-19: TPH Prefix Indication

Byte 0:   Fmt (100b) | Type | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:   Requester ID | Tag | Last DW BE | 1st DW BE
Byte 8:   Address [63:32]
Byte 12:  Address [31:2] | PH

IDO (ID-based Ordering)


Transaction ordering rules are important for proper traffic flow, but there are
times when they're not necessary and latencies can be improved in those cases.
In particular, TLPs from different Requesters are very unlikely to have
dependencies between them, so this feature allows software to enable them to be
reordered for improved performance. The details of this operation are described
in the section called "ID Based Ordering (IDO)" on page 301.

ARI (Alternative Routing-ID Interpretation)


The motivation for this optional feature is to increase the number of Function
numbers available to Endpoints. Device numbers were useful in a shared-bus
architecture like PCI but are not usually needed in a point-to-point
architecture. Consequently, the spec writers chose to allow devices to interpret
the destination for ID-routed commands differently. This was accomplished by
defining the Device number to always be zero and then allowing the Function
number to use the 5 bits in the ID that were previously the Device number.
Effectively, the Device number goes away while the Function number grows to 8
bits. The target for a TLP that uses ARI will need to be enabled to recognize it
before software can use this feature, but Routing elements in the path to it
don't have to be aware of this. They're only looking at the bus number to
determine the routing.
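
The two interpretations of the same 16-bit ID can be compared with a short
sketch. The function name is made up for the example; the field widths follow
the description above (8-bit Bus, then either 5-bit Device plus 3-bit Function,
or a single 8-bit Function under ARI).

```python
def decode_routing_id(routing_id: int, ari: bool = False) -> dict:
    """Decode a 16-bit Bus/Device/Function ID, optionally using ARI rules."""
    bus = (routing_id >> 8) & 0xFF
    if ari:
        # ARI: Device is implied to be 0 and the Function number is 8 bits wide.
        return {"bus": bus, "device": 0, "function": routing_id & 0xFF}
    # Conventional: 5-bit Device number and 3-bit Function number.
    return {
        "bus": bus,
        "device": (routing_id >> 3) & 0x1F,
        "function": routing_id & 0x7,
    }


print(decode_routing_id(0x03A9))            # conventional interpretation
print(decode_routing_id(0x03A9, ari=True))  # ARI interpretation
```

Note that the Bus field is identical in both cases, which is why routing
elements that only examine the bus number are unaffected.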


Power Management Improvements


There are four additions that improve the system's ability to manage power
effectively, and they are listed here. All of these are covered in Chapter 16,
entitled "Power Management," on page 703.

DPA (Dynamic Power Allocation)


A new set of extended configuration registers defines up to 32 substates below
D0. This allows software to easily make changes to a device's power state
without incurring the latency penalty of going all the way to the D1 device
power state. To learn more on this, see "Dynamic Power Allocation (DPA)" on
page 714.

LTR (Latency Tolerance Reporting)


Allowing Endpoints to report the latencies they can tolerate in response to
their requests enables system software to make better choices regarding system
response time and sleep states. To learn more about this, see "LTR (Latency
Tolerance Reporting)" on page 784.

OBFF (Optimized Buffer Flush and Fill)


Similarly, allowing the system to report the preferred time slots during which
Endpoints should or should not initiate DMA or interrupt traffic helps
coordinate system sleep times and improve power management. For more on this,
see "OBFF (Optimized Buffer Flush and Fill)" on page 776.

ASPM Options
This change simply permits devices to support no ASPM Link power management if
they choose to do so. In the previous spec versions, support for L0s was
mandatory, but now it becomes optional.


Configuration Improvements
A few configuration registers were added to improve software visibility and
control of devices.

Internal Error Reporting


This is intended to provide a standardized way of reporting internal problems
for devices like switches that don't have a driver to handle that for them. It
also adds the capability to track multiple TLP headers when they result in
errors instead of just one as before. This topic is covered in the section on
errors called "Internal Errors" on page 667.

Resizable BARs
This new set of extended configuration registers allows devices that use a large
amount of local memory to report whether they can work with smaller amounts
and, if so, what sizes are acceptable. Software that knows to look for them can
find the new registers, shown in Figure 20-20 on page 912, and program them to
give the appropriate memory size for the platform based on the competing
requirements of system memory and other devices.

A few rules apply to the use of these registers:

1. To avoid confusion, a BAR size should only be changed when the Memory
   Enable bit has been cleared in the Command register.
2. The spec strongly recommends that Functions not advertise BARs that are
   bigger than they can effectively use.
3. To ensure optimal performance, software should allocate the biggest BAR
   size that will work for the system.


Figure 20-20: Resizable BAR Registers

PCIe Enhanced Capability Header fields:
[31:20]    Next Extended Capability Offset
[19:16]    Version (1h)
[15:0]     PCIe Extended Capability ID (0015h for Resizable BAR)

Offset     Register (bits 31..0)
000h       PCIe Enhanced Capability Header
004h       Resizable BAR Capability Register (0)
008h       Reserved | Resizable BAR Control Register (0)
...
n*8 + 4    Resizable BAR Capability Register (n)
n*8 + 8    Reserved | Resizable BAR Control Register (n)

(One Capability/Control register pair for each supported BAR.)

Capability Register
This register simply reports which BAR sizes will work for this Function. Bits 4
to 23 are used for this and the values are as shown here:

• Bit 4: 1MB BAR size will work for this Function
• Bit 5: 2MB
• Bit 6: 4MB
• ...
• Bit 23: 512GB will work for this Function

Figure 20-21: Resizable BAR Capability Register

Bits       Field
[31:24]    RsvdP
[23:4]     Supported BAR sizes (one bit per size)
[3:0]      RsvdP

Control Register
The BAR Index field in this register reports to which BAR this size refers (0 to
5 are possible). The Number of Resizable BARs field is only defined for Control
Register zero and is reserved for all the others. It tells how many of the six
possible BARs actually have an adjustable size. Finally, the BAR Size field is
programmed by software to specify the desired size of the BAR indicated by the
BAR Index field (0 = 1MB, 1 = 2MB, 2 = 4MB, ..., 19 = 512GB).
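
The relationship between the Capability register bits and the BAR Size encoding
can be sketched as follows. The helper names are hypothetical and the code is
only a sketch of the arithmetic described above: Capability bit n (4 to 23)
corresponds to a size of 2^(n-4) MB, and the BAR Size field holds log2 of the
size in MB.

```python
def supported_bar_sizes_mb(cap_reg: int) -> list:
    """List the BAR sizes (in MB) advertised in Capability bits [23:4]."""
    return [1 << (bit - 4) for bit in range(4, 24) if cap_reg & (1 << bit)]


def bar_size_encoding(size_mb: int) -> int:
    """Return the BAR Size field value (0 = 1MB ... 19 = 512GB) for a size."""
    if size_mb <= 0 or size_mb & (size_mb - 1):
        raise ValueError("size must be a power of two, in MB")
    encoding = size_mb.bit_length() - 1
    if encoding > 19:
        raise ValueError("largest encodable size is 512GB")
    return encoding


def largest_encoding_within(cap_reg: int, limit_mb: int):
    """Pick the biggest advertised size not exceeding limit_mb, or None."""
    fitting = [s for s in supported_bar_sizes_mb(cap_reg) if s <= limit_mb]
    return bar_size_encoding(max(fitting)) if fitting else None


cap = (1 << 4) | (1 << 8) | (1 << 12)      # advertises 1MB, 16MB and 256MB
print(supported_bar_sizes_mb(cap))
print(largest_encoding_within(cap, 128))
```

The last helper mirrors rule 3 above: choose the biggest size that still fits
the platform's constraint.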

Figure 20-22: Resizable BAR Control Register

Bits       Field
[31:13]    RsvdP
[12:8]     BAR Size (RW)
[7:5]      Number of Resizable BARs (RO)
[4:3]      RsvdP
[2:0]      BAR Index (RO)

Once the Resizable values have been programmed, enumeration software will be
able to work as it normally does: writing all F's to each BAR and reading it
back will report the size that was selected. Note that if the size value is
changed, the contents of the BAR will be lost and will need to be reprogrammed
if it was previously set up. Figure 20-23 on page 914 highlights the BAR
registers in the configuration header space for a Type 0 header.
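
The "write all F's and read back" step can be illustrated with a small
calculation. This sketch assumes a 32-bit memory BAR whose low 4 bits are
read-only attribute bits (the usual PCI convention); the function name is
hypothetical and no real configuration access is performed.

```python
def memory_bar_size(readback: int) -> int:
    """Size in bytes implied by a 32-bit memory BAR read back after writing all 1s."""
    address_bits = readback & 0xFFFFFFF0   # drop the low 4 attribute bits
    if address_bits == 0:
        return 0                           # BAR not implemented
    # The writable address bits read back as 1s; the size is the two's
    # complement of that mask, limited to 32 bits.
    return (~address_bits + 1) & 0xFFFFFFFF


print(hex(memory_bar_size(0xFFF00008)))    # upper 12 bits writable
print(memory_bar_size(0xFFFFF000))         # upper 20 bits writable
```

After a Resizable BAR has been programmed for a different size, the same
calculation on the new readback value would reflect the newly selected size.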


Figure 20-23: BARs in a Type 0 Configuration Header

DW    Contents (byte 3 ... byte 0)
00    Device ID | Vendor ID
01    Status Register | Command Register
02    Class Code | Revision ID
03    Header Type | Latency Timer | Cache Line Size
04    Base Address 0
05    Base Address 1
06    Base Address 2
07    Base Address 3
08    Base Address 4
09    Base Address 5
10    CardBus CIS Pointer
11    Subsystem ID | Subsystem Vendor ID
12    Expansion ROM Base Address
13    Reserved | Capabilities Pointer
14    Reserved
15    Max_Lat | Min_Gnt | Interrupt Pin | Interrupt Line

Simplified Ordering Table


This change simplifies the Transaction Ordering Table by reducing the number of
entries in the table. Essentially, it no longer distinguishes between
completions for reads or completions for non-posted writes. The motivation for
this was to reduce the number of cases that needed to be tested. For more on
this, see the section called "The Simplified Ordering Rules Table" on page 288.

Appendices

Appendix A:
Debugging PCIe Traffic
with LeCroy Tools

Christoper Webb, LeCroy Corporation

Overview
The transition of IO bus architecture from PCI to PCI Express had a large
impact on developers with respect to types of tools required for validation and
debug.

With parallel buses such as PCI, a waveform view of the signals shows enough
information for the developer to interpret the state of the bus. A user could
visually examine a waveform and mentally decode the type of transactions, how
much data is transferred, and even the content of that transfer.

Since PCI Express packet traffic is both encoded and scrambled, examining a
waveform view of the traffic provides very little information about the state of
the link. The speed of the link can be inferred from the width of the bit times,
and the width of the link can be inferred by the number of active lanes.
However, the user cannot visually interpret the symbol alignment, let alone the
packets themselves.

A new class of tools evolved to help developers visualize the state of their
now-serial links. These tools perform the de-serialization, decoding, and
de-scrambling for the users. At first glance this would seem to be enough for
the developer. But for PCI Express specifically, other complications such as
flow control credits, lane-to-lane skew, polarity inversion, and lane reversal
must also be comprehended by these tools as part of understanding PCIe protocol.

Both pre- and post-silicon debug share a common need for tools. In this appendix
chapter, we describe some of the product offerings available for debugging PCI
Express interconnects, both from a pre- and post-silicon perspective.


Pre-silicon Debugging

RTL Simulation Perspective


In RTL simulation, looking at a waveform view of an FPGA or an ASIC signal is
the most common way to debug. By showing internal state machine states,
monitoring IO as it moves through the device, or seeing the state of control
signals, a waveform view is quite powerful. But, as we discussed above, a PCI
Express link is not understandable when shown as a waveform. Additional
processing or decoding must be done to make sense of this new link. To augment
the simulation tools, a PCI Express Bus Monitor is typically added to address
this need.

PCI Express RTL Bus Monitor


A PCI Express Bus Monitor is a piece of code which users insert in their RTL
simulation to help monitor the state of their PCIe link. At minimum, this
monitor will output text-based log files with information about link state
changes and types of packet activity. More complex monitors will perform
real-time compliance checking. A number of vendors provide purchasable IP which
performs this exact function. The emphasis, however, is typically on
compliance. Less functionality is provided with respect to visualization of
things such as flow control credits, link utilization, or link training debug.

RTL vector export to PETracer Application


LeCroy has partnered with a number of the leading PCIe verification IP vendors
to create tools to further enhance the visualization and analysis of pre-silicon
PCIe traffic. This involves using the vendor's Bus Monitor to export raw symbol
traffic into the same PETracer application used by the dedicated PCIe Analyzer
hardware. SimPASS PE is LeCroy's solution to supporting this export.

More information about LeCroy's PETracer application and its features is given
in the section "Viewing Traffic Using the PETracer Application" on page 924.

Post-Silicon Debug

Oscilloscope
Use of an oscilloscope for debugging a PCIe link is typically focused on the
electrical validation of the link. The most common usage is examining an eye
pattern with a mask overlay for determining electrical compliance. A lesser
known compliance check is to examine the entry and exit of electrical idle state
to see if the link goes to the common mode voltage within the required time
periods after an electrical idle ordered set is transmitted. These are 2
examples of PCIe compliance checking which are best performed using an
oscilloscope such as shown in Figure 1 on page 920.

With the addition of dynamic link training for 8.0 GT/s operation, devices must
now train the transmitter emphasis during the Recovery.EQ LTSSM substate. The
goal is to set the transmitter EQ to provide the best signal eye to the
receiver. Monitoring this dynamic equalization process is another example where
the use of an oscilloscope is quite powerful. With a real-time oscilloscope, the
user can capture this process and see the impact on the waveform as transmitter
settings are changed. This allows the user to verify that the transmitter is
indeed acting on the coefficient change requests, but it also allows the user to
determine if the receiver has properly chosen the correct setting.

For logical debug of the link, the oscilloscope is most useful when the link is
x1 or x2, as you are limited by the number of channels the scope can acquire.
The first method of examining PCIe traffic is a waveform view. As with the RTL
waveform viewer, there is little to understand about the state of the link
without SW help to perform 8b/10b decoding and descrambling. Fortunately, more
advanced oscilloscopes have SW packages that perform these duties. For this to
work properly, the scope must have deep capture buffers and must see the SKIP
ordered sets so that it can decipher the byte alignment and synchronize the
descrambler LFSR.
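
To make the descrambling step more concrete, here is a minimal sketch of a
byte-wide descrambler for 8b/10b (2.5 and 5.0 GT/s) traffic. It assumes the
16-bit LFSR with polynomial X^16 + X^5 + X^4 + X^3 + 1 and a seed of FFFFh, and
it is deliberately simplified: a real implementation must also leave control
characters unscrambled, reset the LFSR on COM, and hold it during SKP symbols.
The class and method names are illustrative and do not reflect any tool's API.

```python
class Descrambler:
    """Simplified 8b/10b-era data descrambler (data bytes only)."""

    def __init__(self):
        self.lfsr = 0xFFFF               # assumed seed; reloaded on COM

    def reset(self):
        """Reload the seed, as would happen when a COM symbol is seen."""
        self.lfsr = 0xFFFF

    def _next_mask(self) -> int:
        # Mask bit i comes from LFSR bit 15-i, so bit-reverse the high byte.
        high = (self.lfsr >> 8) & 0xFF
        mask = int(f"{high:08b}"[::-1], 2)
        # Advance the LFSR by 8 serial clocks.
        for _ in range(8):
            feedback = (self.lfsr >> 15) & 1
            self.lfsr = (self.lfsr << 1) & 0xFFFF
            if feedback:
                self.lfsr ^= 0x0039      # taps for X^5 + X^4 + X^3 + 1
        return mask

    def apply(self, byte: int) -> int:
        """XOR one data byte with the next mask (scramble == descramble)."""
        return byte ^ self._next_mask()


tx, rx = Descrambler(), Descrambler()
scrambled = [tx.apply(b) for b in (0x12, 0x34, 0x56)]
print([hex(rx.apply(b)) for b in scrambled])
```

Because scrambling is a plain XOR, the same routine recovers the data only when
the receiver's LFSR is in step with the transmitter's, which is why the capture
must include a synchronization point.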
The LeCroy Oscilloscope can overlay PCIe symbols right onto the waveform for
enhanced visibility of the traffic. An additional text-based listing of the
packet symbols can be displayed on the screen as an additional method of
examining the waveform.


LeCroy recently announced a SW package called ProtoSync for their oscilloscope
line which allows the user to export the captured waveform into the PETracer
application. This is the same SW package that the protocol analyzer uses, which
includes a wide range of post-processing capabilities described below. The
PETracer software can run independently on the scope hardware, often on a second
monitor. This allows time-correlated comparison of the physical layer data
presented by the scope waveform alongside the logic layer presentation of data
presented by the PETracer software.

Capture of the 8.0 GT/s dynamic link equalization on the oscilloscope and
exporting this traffic to the PETracer application is a prime example where this
solution is most powerful. The user can navigate PETracer to the link training
packet where the TX coefficient change request has been sent, then identify
where this coefficient change was applied in the scope SW. The user can then
measure the time it takes for the coefficient change to be applied and compare
this to the timing required in the PCIe spec.

Figure A-1: LeCroy Oscilloscope with ProtoSync Software Option

Protocol Analyzer
A growing trend in debugging PCIe links is to use a dedicated protocol analysis
tool. What separates a protocol analyzer from a logic analyzer is that it is
built to support a specific protocol such as PCIe. From a hardware perspective,
a PCIe protocol analyzer is optimized for acquiring and storing PCIe traffic.
This starts from the dedicated PCIe interposer probes, continues to the cabling
choice, and carries through into the internal hardware components. For
recovering PCIe traffic, specialized clock and data recovery circuits are used
which can handle the electrical idle transitions, spread spectrum modulation, as
well as the run lengths found in 128b/130b encoding. Sophisticated equalization
circuits are used to recover the signal eye prior to de-serialization. Without
comprehending the complexities of PCIe recovery, the Analyzer hardware would not
be optimized for recovering complex traffic such as speed switching, dynamic
link widths, and low power states such as L0s.

In addition to choosing appropriate hardware components for recovering PCIe
traffic, a protocol analyzer includes logic circuitry which is PCIe specific.
This logic must infer the state of the PCIe link and follow it during various
LTSSM state changes. Once the link state is being properly followed, dedicated
packet inspection circuits perform data matching against incoming packets to
look for events programmed by the user. These matchers are used for filtering of
traffic as well as performing the trigger functionality needed for stopping the
traffic capture. A mixture of these traffic filters as well as deep trace
buffers (often 4GB to 8GB in size) allow the user to capture significantly
longer traffic scenarios than would be possible without a protocol analyzer.

Finally, the most important piece of a protocol analyzer is the software GUI. By
optimizing the traffic views, post-processing reports, and hardware controls
with a dedicated PCI Express software tool, a very comprehensive set of PCI
Express-specific analyses can be performed.

Logic Analyzer
Some logic analyzers offer PCIe-specific software packages. This software will
read the PCI Express capture from the logic analyzer hardware and perform some
amount of post-processing of this data. This analysis includes the basics such
as decoding and descrambling of the traffic. These SW tools do not perform many
of the rich post-processing features offered by dedicated protocol analyzer
software, however.

Using a Protocol Analyzer Probing Option


To record your PCIe traffic you must first find the best method for probing it.
PCIe started as an add-in card form factor for desktop PCs and servers, but has
since proliferated into a dizzying array of standard and non-standard form
factors. For the standard form factors, the best probe option is a dedicated
interposer.

An Interposer is a dedicated piece of hardware which includes probe circuitry
required for passing a copy of the PCIe traffic to the Analyzer hardware for
capture and analysis. These interposers are designed specifically for the
mechanical and electrical environments in which they are placed. The most common
interposer is a Slot Interposer such as shown in Figure 2 on page 922. This
interposer is used for probing standard CEM-compliant PCIe add-in cards.

Care should be taken when selecting an interposer as the probe circuitry varies
by vendor and by requirements imposed by the max PCIe link speed. For example, a
Gen3 slot interposer should contain probe circuitry which allows the dynamic
link training process to pass properly through the probe. The LeCroy Gen3 slot
interposer uses linear circuits to maintain the shape of the waveform as it
passes through the probe. This allows pre-emphasis of the transmitter to be
dynamically changed during link training while allowing the receiver to quantify
the impact of a new setting (either positive or negative impact).

Figure A-2: LeCroy PCI Express Slot Interposer x16

LeCroy also offers a family of other dedicated interposers for form factors such
as ExpressCard, XMC, Mini Card, Express Module, AMC, etc. Some of these
interposers are shown in Figure 3 on page 923. For a complete list of these
interposers please refer to the LeCroy website: www.lecroy.com as this list is
constantly growing.


Figure A-3: LeCroy XMC, AMC, and Mini Card Interposers

For debugging PCIe links which cannot benefit from a dedicated interposer, a
mid-bus probe shown in Figure 4 on page 923 is the next best option. A mid-bus
probe involves placement of an industry standard probe geometry on the PCB. Each
PCIe lane is routed to a pair of pads on the footprint which can be probed using
a mid-bus probe head. These probes use spring pins or C-clips for providing
solder-free mechanical attachment between the system under test and the protocol
analyzer.

Figure A-4: LeCroy PCI Express Gen3 Mid-Bus Probe


As a last resort, a flying lead probe shown in Figure 5 on page 924 may be used
to attach the protocol analyzer to the system under test. This involves
soldering a resistive tap circuit and connector pins to the PCIe traces. This
circuitry is typically soldered to the AC coupling caps of the PCIe link as they
are often the only place to access the traces. Once the probe circuitry is
soldered to the PCB, the analyzer probe can be connected and removed as needed.
This approach can be used on virtually any PCIe link; however, the robustness of
the connection is limited by the skill of the technician adding the probe.

Figure A-5: LeCroy PCI Express Gen2 Flying Lead Probe

Viewing Traffic Using the PETracer Application

CATC Trace Viewer


The primary way to view PCI Express traffic with the LeCroy PETracer application
is the CATC Trace view. This view takes each recorded packet and breaks it down
into different packet fields to highlight the important values contained in this
packet. A mixture of colors and text is used to visually categorize and explain
the purpose of each packet. Errors are highlighted in red such as shown in
Figure 6 on page 925. Warnings are highlighted in yellow, making it easy to
identify areas of traffic or fields in a packet which are non-compliant.


Figure A-6: TLP Packet with ECRC Error

In addition to decoding and visually breaking down each packet, a hierarchical
display allows logical grouping of related packets. For example, in "Link Level"
mode, TLP packets are grouped with their respective ACK packet. Each TLP is
identified as either implicitly or explicitly ACK'd or NAK'd. An example of an
ACK DLLP is shown in Figure 7 on page 925 along with the ACK'd TLP.

Figure A-7: "Link Level" Groups TLP Packets with their Link Layer Response

In "Split Level" mode shown in Figure 8 on page 926, the CATC Trace view
combines split transactions. For example, a single TLP read can be grouped with
1 or more completion TLPs to logically show large data transfers as a single
line in the trace. The amount of data, starting address, as well as performance
metrics are provided for each split-level transaction. This allows the user to
bypass the details of how large memory transactions are broken into multiple TLP
packets and rather focus on the contents of the data. If the user wishes to see
the details of the split transaction, the hierarchical display can show the link
level and/or packet level breakdown of all the packets which make up this split
transaction. This "drill-down" approach to traffic analysis allows the user to
start from a high-level view of what's happening on the bus and drill down only
in the areas of traffic which are interesting to the user.


Figure A-8: "Split Level" Groups Completions with Associated Non-Posted Request

The CATC Trace view also supports "Compact View" shown in Figure 9 on page 927.
In this view, packets which are sent repeatedly are collapsed into a single
cell. This is very useful for collapsing Training Sequences or Flow Control
Initialization packets. The software algorithms which perform this collapse are
smart enough to collapse any SKIP ordered sets as well. This creates a very
compact view of the link training process, allowing the user to examine changes
in the link training packets without scrolling through hundreds of packets.


Figure A-9: Compact View Collapses Related Packets for Easy Viewing of Link Training

LTSSM Graphs
To further enhance the "drill-down" traffic viewing approach, the PETracer
application includes an LTSSM graph view as shown in Figure 10 on page 928.
When this graph is invoked, the SW parses through the trace to find the link
training sections and infers the state of the Link Training and Status State
Machine (LTSSM). The result is a graph which breaks down the LTSSM state
transitions in a very high-level view. This graph allows the user to immediately
see if the link went into a recovery state. If so, the user can easily identify
which side of the link initiated the recovery, how many times it entered
recovery, and even if the link speed or link width decreased because of the
recovery.

The LTSSM graph is also an active link back into the trace file. For example, if
the user clicks on the entry to recovery, the trace file will be navigated to
that location in the trace file. This would allow the user to perhaps see if the
recovery was caused by repeated NAKs or for some other reason such as loss of
block alignment.


In short, when users are debugging issues related to link training, speed
change, or low power state transitions, the LTSSM is affected. By examining the
LTSSM graph, the user can easily identify whether these link state changes
occurred, where they occurred, and navigate directly to them for faster
analysis.

Figure A-10: LTSSM Graph Shows Link State Transitions Across the Trace

Flow Control Credit Tracking


Flow control credit tracking is particularly problematic in PCI Express. The
flow control update packets do not show the number of credits each endpoint
currently has; rather, they show how many credits in total have been granted
since initialization. This means that each endpoint must keep a running counter
of credits for each type. There are a number of scenarios where credits can be
lost, and if this occurs, the link will eventually be unable to transmit data
due to lack of credits. Such problems are very difficult to identify and debug.

The LeCroy PETracer application has a credit tracking SW tool shown in Figure 11
on page 929 to aid in this debug. If the trace contains FC-Init packets, it will
walk through the trace and show the amount of remaining credits per virtual
channel buffer type after each TLP and FC-Update.

FC-Init packets are sent once after link training. Because of this, the PETracer
application has the ability for the user to set initial credit values at some
point in the trace and the SW will calculate the relative credit values for the
remaining packets. Even if the initial credit values are set improperly by the
user, having the ability to see the relative credits is often enough to catch a
flow control issue.
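
The running-counter bookkeeping behind this kind of tracking can be sketched as
below. This is not how the PETracer tool is implemented; it is a minimal model
that assumes each FC-Update carries a cumulative, wrapping credit count and that
header credits use an 8-bit counter. All names are illustrative.

```python
class CreditCounter:
    """Track remaining credits for one credit type on one virtual channel."""

    def __init__(self, initial_credits: int, field_bits: int = 8):
        self.modulus = 1 << field_bits
        self.credit_limit = initial_credits % self.modulus  # from FC-Init
        self.credits_consumed = 0

    def on_tlp(self, credits_needed: int) -> None:
        """Account for a transmitted TLP that used some credits."""
        self.credits_consumed = (self.credits_consumed + credits_needed) % self.modulus

    def on_fc_update(self, advertised: int) -> None:
        """Record the cumulative count carried by an FC-Update."""
        self.credit_limit = advertised % self.modulus

    def remaining(self) -> int:
        """Credits still available, computed with wrap-around arithmetic."""
        return (self.credit_limit - self.credits_consumed) % self.modulus


counter = CreditCounter(initial_credits=4)
counter.on_tlp(3)
print(counter.remaining())
counter.on_fc_update(7)
print(counter.remaining())
```

Because only differences between the two counters matter, a wrong starting
value shifts every result by the same amount, which is consistent with the
observation that relative credits are often enough to spot a problem.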

Figure A-11: Flow Control Credit Tracking

Bit Tracer
Some debug situations are not solved by a "drill-down" approach to examining the
traffic. For example, if the link settings are incorrect, the recording is often
unreadable. What if a device is not properly scrambling the traffic, or the
10-bit symbols are sent in reverse order? For this scenario, a tool which
focuses on analysis between the waveform view of the scope and the CATC Trace
view is needed. This is where the BitTracer view shown in Figure 12 on page 930
is most powerful.

The BitTracer view allows the user to see raw traffic exactly as it was seen on
the link. The software allows the user to see the traffic as 10-bit symbols,
scrambled bytes, or unscrambled bytes. Invalid symbols and incorrect running
disparity are highlighted in red.


To further determine what may be wrong with the traffic, the BitTracer tool adds
a powerful list of post-processing features which can modify the traffic. For
example, post-capture, the user can invert the polarity of a given lane. Once
applied, the user can see if the 10-bit symbols are now represented properly in
the trace. If this cleans up the trace, it's an indication that the recording
settings for the Analyzer hardware need to be changed.

Figure A-12: BitTracer View of Gen2 Traffic

In addition, the lane ordering can be modified. This is useful for determining
if lane reversal is causing a bad capture. If the traffic has excessive
lane-to-lane skew, the BitTracer software allows the user to re-align the
traffic. For Gen3 traffic, this skew can be applied 1 bit at a time. This
essentially allows the user to fix the 130-bit block alignment post-capture.
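
Two of these post-capture corrections are simple enough to sketch directly. The
code below is an illustration only and is unrelated to the BitTracer
implementation: it models each lane as a list of 10-bit symbols, treats polarity
inversion as flipping every bit of each symbol, and treats lane reversal as
reversing the lane order. The function names are hypothetical.

```python
def invert_polarity(symbols):
    """Flip all 10 bits of every symbol captured on one lane."""
    return [symbol ^ 0x3FF for symbol in symbols]


def reverse_lanes(lanes):
    """Reorder captured lanes so that lane 0 becomes lane N-1 and so on."""
    return lanes[::-1]


lane0 = [0b0011111010, 0b1100000101]
print([f"{s:010b}" for s in invert_polarity(lane0)])
print(reverse_lanes([["L0"], ["L1"], ["L2"], ["L3"]]))
```

Applying either correction twice restores the original capture, so a user can
try a setting and undo it without losing data.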

After applying changes to the data, all or just a portion of the data can be
exported into the standard CATC Trace view for higher-level analysis. This
workflow is very powerful for debugging low-level issues during early bring-up.
Let's say, for example, the user's device trains the link properly, and then
suddenly applies polarity inversion to 1 lane. This is a clear violation of the
spec and will cause the link to retrain. If this traffic is captured with the
BitTracer tool, the user could easily identify this as the problem.
Additionally, the portion of the traffic before and after the inversion could be
exported into separate trace files and examined in the CATC Trace view.


Analysis overview
As you can see, different traffic views can be beneficial for debugging certain
failure conditions. LeCroy supports import of PCIe traffic from many sources
into its highly sophisticated PETracer software. Whether the source is RTL
simulation, an oscilloscope capture, or a dedicated protocol analyzer capture,
PETracer has a rich set of traffic views and reports which allow the user to
best understand the health and state of their PCIe link.

Traffic generation

Pre-Silicon
For stimulating a PCI Express endpoint in simulation, dedicated verification IP
can be purchased from a number of vendors. This IP will test for basic
functionality as well as perform a number of PCIe compliance checks. It is
certainly in the interest of the ASIC developer to find and fix these issues
before tape-out, and this is where the value of these tools comes from. If the
PCIe design is implemented in an FPGA where mask costs are not an issue, it may
be more cost-effective to perform these compliance checks in hardware with a
dedicated traffic generation tool such as the LeCroy PETrainer or LeCroy PTC
card.

Post-Silicon
Exerciser Card
To thoroughly test the PCIe compliance and overall robustness of a PCIe design
post-silicon, a dedicated Exerciser card such as the LeCroy PETrainer shown in
Figure 13 on page 932 is used. This card allows the user to generate a wide
range of compliant and non-compliant traffic. For example, if you place your
PCIe card in a standard motherboard, you may be limited in the size of the TLP
packets it will see. A dedicated Exerciser card can generate TLP packets across
the entire legal range of packet sizes.

Secondly, if you would like to test that a card issues a NAK in response to a
TLP with a bad LCRC, it would not be possible with the card connected to
compliant devices. They do not transmit bad packets. An Exerciser card can
create a TLP with a bad LCRC, improper header values, or end the TLP with an EDB
symbol.


If you would like to test that your card properly replays a packet when it
receives a NAK, this can be done with an Exerciser. Perhaps you would like to
issue 4 NAKs in a row to a certain TLP so that link recovery is initiated. This
behavior is all quite easy to program into the exerciser card.

The number of test cases and failure scenarios is limited only by the number of
scripts you write. Once written, these scripts can be reused for testing new
versions of your design. The Analyzer SW can record these sessions and use
scripting to determine if the response was correct. A number of LeCroy customers
have created large libraries of regression tests using these tools.

Figure A-13: LeCroy Gen3 PETrainer Exerciser Card

PTC card
The PCI-SIG has published a specific list of compliance tests which all
compliant devices must pass. The LeCroy Protocol Test Card (PTC) is the hardware
used to perform these tests at the PCI-SIG Compliance workshops. Users can
purchase a PTC card from LeCroy, shown in Figure 14 on page 933, to pre-test
their devices to ensure they will pass PCI-SIG compliance testing.

The LeCroy PTC is used to test root complex or endpoint devices at x1 link
widths. Link speeds can be either Gen1 or Gen2.


Figure A-14: LeCroy Gen2 Protocol Test Card (PTC)

Conclusion
Today, the PCIe developer has access to a wide range of tools to help debug
their PCIe design. Thanks to the wide adoption of the PCIe standard, many of
these tools are designed specifically for PCIe debug and include features which
address the challenges many PCIe devices face.

For more information about the LeCroy PCIe tool offerings, please visit the
LeCroy website www.lecroy.com


Appendix B: Markets & Applications for PCI Express

Akber Kazmi (Senior Director Marketing, PLX Technology, Inc.)

Introduction
Since its definition in the early 1990s, PCI has emerged as the most
successful interconnect technology ever used in computers. Originally
intended for personal computer systems, the PCI architecture has expanded
into virtually every computing platform category, including servers,
storage, communications, and a wide range of embedded control
applications. Most important, each advancement in PCI bus speed and width
provided backward compatibility.

As successful as the PCI architecture was, there was a limit to what could
be accomplished with a multi-drop, parallel, shared-bus interconnect
technology. A number of issues led to the definition of the PCI Express
(PCIe) architecture: clock skew, high pin count, trace routing
restrictions in printed circuit boards (PCB), bandwidth and latency
requirements, physical scalability, and the need to support Quality of
Service (QoS) within a system for a wide variety of applications.

PCIe was the natural successor to PCI, and was developed to provide the
advantages of a state-of-the-art, high-speed serial interconnect
technology with a packet-based layered architecture, but maintain backward
compatibility with the large PCI software infrastructure. The key goal was
to provide an optimized, universal interconnect solution for a wide
variety of future platforms, including desktop, server, workstation,
storage, communications, and embedded systems.

Since its introduction in 2001, PCIe has gone through three generations of
enhancements. In the first generation (Gen1), the signaling rate was set
at 2.5 GT/s; it was later raised to 5 GT/s (Gen2) and eventually 8 GT/s
(Gen3). The PCIe specification allows combining of 2, 4, 8, 12, 16 or 32
lanes into a single port. However, products available today do not support
12-lane and 32-lane wide ports. It is important to note that all PCIe Gen2
and Gen3 devices are required to be backward compatible with the speeds of
the previous generations.

The industry has launched and has fully embraced PCIe Gen3 products, while
at the same time the PCI Special Interest Group (PCI-SIG) is analyzing the
signaling rate (speed) for Gen4. The goal for PCIe Gen4 is to double the
speed of Gen3, to 16 GT/s.

PCIe switches are available in an array of sizes, ranging from 3 to 96
lanes and 3 to 24 ports, where each port could be one, two, four, eight or
16 lanes wide. A single Gen3 lane provides about 1 GB/s of bandwidth,
while a 16-lane port offers about 16 GB/s of bandwidth in each direction.
Additionally, PCIe switch vendors, such as PLX Technology, have added
features and enhancements to their products that are not part of the PCIe
specifications but enable them to differentiate their products and add
value for system designers. These features deliver ease of use, higher
performance, failover, error detection, error isolation, and field
upgradability.
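
The per-lane figures quoted above follow from the signaling rate and the
line encoding. As a rough check, the sketch below assumes 8b/10b encoding
for Gen1 and Gen2 and 128b/130b for Gen3, and it ignores packet and
protocol overhead. The table and function names are illustrative only.

```python
# Raw signaling rate (GT/s) and line-encoding efficiency per generation.
# Packet headers, DLLPs and flow-control overhead are ignored here.
GENERATIONS = {
    "Gen1": (2.5, 8 / 10),     # 8b/10b encoding
    "Gen2": (5.0, 8 / 10),     # 8b/10b encoding
    "Gen3": (8.0, 128 / 130),  # 128b/130b encoding
}


def link_bandwidth_gbytes(generation, lanes):
    """Approximate one-direction bandwidth of a link in GB/s."""
    rate_gt, efficiency = GENERATIONS[generation]
    return rate_gt * efficiency * lanes / 8  # 8 bits per byte


if __name__ == "__main__":
    for gen in GENERATIONS:
        for lanes in (1, 4, 16):
            print(f"{gen} x{lanes}: {link_bandwidth_gbytes(gen, lanes):.2f} GB/s")
```

Under these assumptions a Gen3 x1 link comes out just under 1 GB/s and a
x16 port just under 16 GB/s per direction, consistent with the rounded
figures in the text.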

On-chip features include non-transparent (NT) bridging, peer-to-peer
communication, Hot-Plug, direct memory access (DMA), and error
checking/recovery. Additionally, debug features such as packet generation,
receiver eye measurement, traffic monitoring, and error injection in live
traffic offer significant value to the designers, enabling early system
bring-up. Many of these features can also be used for run-time performance
improvements and monitoring.

Features included in the next generation of PCIe switches are:

• NT bridging: Allows two hosts/CPUs to be connected to the same PCIe
  switch while electrically and logically isolated. The NT bridging
  function is broadly used in systems requiring isolation of two active
  CPUs, or of two CPUs where one is active and the other is passive. The
  NT functionality allows the exchange of a heartbeat between the two host
  CPUs to enable failover if one of them fails.
• DMA: An on-chip DMA controller in a PCIe switch offers significant value
  to the designers, as it spares the CPU cycles otherwise spent moving
  data between peers and between the CPU and I/Os. The CPU's reduced
  effort in moving data boosts overall performance of the system, as the
  spared CPU cycles can be used to run applications rather than managing
  data I/O.
• Error Isolation: Users can program triggers for certain error events and
  the switch's response to them. The response can be programmed to ignore
  the error, trigger a host interrupt, bring down the port with errors, or
  bring down the entire switch.
• Packet Generation: Generally, it is difficult to generate traffic that
  saturates a PCIe port without the use of expensive packet generator
  equipment. PCIe switches now have the ability to saturate any PCIe port
  with desired traffic, such as transaction layer packets, to check the
  performance and robustness of the system.

PCI Express IO Virtualization Solutions


The PCIe technology was initially defined as a single-host interconnect
technology, but in the last few years new standards have been developed
that make PCIe suitable for multi-host systems as a switch fabric
technology for data centers and enterprise IT applications. The presence
of native PCIe interfaces (ports) on x86 CPUs and server platforms has
enabled designers to use PCIe as backplane and fabric technology for small
to mid-size server clusters.

In 2007, the PCI-SIG released the Single-Root I/O Virtualization (SR-IOV)
specification that enables sharing of a single physical resource such as a
network interface card or host bus adapter in a PCIe system among multiple
virtual machines running on one host. This is the simplest approach to
sharing resources or I/O devices among different applications or virtual
machines.

The PCI-SIG followed by completing, in 2008, work on its Multi-Root I/O
Virtualization (MR-IOV) specification that extends the use of PCIe
technology from a single-root domain to a multi-root domain. The MR-IOV
specification enables the use of a single I/O device by multiple hosts and
multiple system images simultaneously, as illustrated in Figure 01 on
page 938. This illustration shows a multi-host environment where an MR-IOV
capable NIC and HBA are shared across multiple servers or virtual machines
via an MR-IOV switch.

Figure 01: MR-IOV Switch Usage

In order to implement the MR-IOV specification, three components of the
system need to be developed: MR-IOV PCIe switches, endpoints, and
management software. All three of these components must be available
simultaneously and work seamlessly. Unfortunately, four years after the
specification was developed, there is not a single silicon vendor that has
an MR-IOV capable PCIe switch or endpoints. PCIe switch vendors are
offering solutions that provide capabilities defined for MR-IOV through
vendor-defined features and by utilizing available SR-IOV endpoints.

Multi-Root (MR) PCIe Switch Solution


PCIe switch vendors have created switches that implement multi-host
function through non-transparent bridging and multi-root (MR)
capabilities. These MR switches allow multiple hosts to be connected to a
single switching device, which can be partitioned under user control in
such a way that each host will be connected to a desired set of downstream
ports of the switch.

In the MR switches, one of the hosts acts as the master and assigns I/Os
to other host ports. Each host operates independently of other hosts and
controls downstream devices in its domain. Figure 02 on page 939
illustrates the internal architecture of an MR switch, in which particular
sets of downstream ports are associated to particular host ports under
management control.

Figure 02: MR-IOV Switch Internal Architecture

PCIe Beyond Chip-to-Chip Interconnect


In the early stages of PCIe deployment the technology was used as a
chip-to-chip interconnect, but now the broad availability of PCIe
interfaces on CPUs, chipsets and IOs, and the broad adoption of these
components, is pushing it beyond traditional applications. In a new
generation of applications, PCIe is used in system backplanes, switch
fabrics, cabling systems, storage/IO expansion, IO virtualization,
high-performance computing (HPC), and server clusters. Figure 03 on
page 940 illustrates the use of PCIe in a data center for a
high-performance compute application where servers in a rack are clustered
through a top-of-rack (TOR) PCIe switch fabric box. The TOR PCIe switch
can be connected to the network through Ethernet and to local storage and
compute resources through PCIe links.

PCIe connections outside the box depend on PCIe copper or optical cables
that the leaders in the industry are introducing at lower cost. The PCIe
TOR fabric is suitable for server/compute clustering and may replace
InfiniBand as the ecosystem for PCIe as a fabric grows.

Figure 03: PCIe in a Data Center for HPC Applications

SSD/Storage IO Expansion Boxes


Recently, the industry has converged towards PCIe as the unified
interconnect technology for enterprise storage and solid-state drive (SSD)
applications. The NVM HCI, an industry consortium, has released a
specification called NVM Express (NVMe) that uses PCIe to provide the
bandwidth needed for SSD applications. Additionally, a T10 committee has
embarked on defining the SCSI over PCIe (SOP) protocol to take advantage
of PCIe technology capabilities for high-performance storage applications.
Furthermore, the SATA consortium recently announced that it would use PCIe
as the interconnect for its next-generation SATA specification called SATA
Express (SATAe).

PCIe in SSD Modules for Servers


Traditionally, enterprise SSD modules have shipped with SAS, SATA and
Fibre Channel interfaces, but due to the above-mentioned developments, a
large majority of SSD controller, module and system suppliers have
introduced products with PCIe interfaces. Most SSD controllers reach a
ceiling in performance and capacity because of the heavy load of managing
flash. In high-performance applications, multiple SSD controllers (or
ASICs) are used and aggregated through a PCIe switch. Figure 04 on
page 941 shows a basic usage of a PCIe switch in an SSD add-in card that
applies to any card or module form factor.

Figure 04: PCIe Switch Application in an SSD Add-In Card

For large data center applications, the SSD add-in cards are installed in
server motherboards as shown in Figure 05 on page 941 and in IO expansion
boxes (Figure 6) aggregated through PCIe switches. In server motherboard
designs, PCIe switches are utilized to create more ports/slots that
accommodate additional SSD modules to support the application's needs.

Figure 05: Server Motherboard Use PCIe Switches

In addition to providing connectivity, PCIe switches can be used for
providing redundancy and failover through NT bridging and MR
functionality. The MR switches support 1+N failover capability, in which
one server/host communicates with N servers to check the heartbeat and
initiate a failover if one of them fails. One of the servers illustrated
in Figure 06 on page 942 can be used as backup for the others in a 1+N
failover scheme.

Figure 06: Server Failover in 1+N Failover Scheme

Conclusion
PCIe interconnect technology has become a serious contender for many
high-end applications beyond chip-to-chip interconnect and is expected to
be utilized in external I/O sharing, server clustering, I/O expansion and
TOR switching. The current 8 GT/s and next-generation (Gen4) 16 GT/s line
rates, the ability to aggregate multiple lanes in single high-bandwidth
ports, failover capabilities, embedded DMA for data transfers, and IO
sharing/virtualization provide capabilities that are at least equal to, if
not superior to, interfaces such as InfiniBand and Ethernet.

Appendix C: Implementing Intelligent Adapters and Multi-Host Systems With
PCI Express Technology

Jack Regula, Danny Chi, Tim Canepa (PLX Technology, Inc.)

Introduction
Intelligent adapters, host failover mechanisms and multiprocessor systems
are three usage models that are common today, and they are expected to
become more prevalent in next-generation systems as market requirements
evolve. Despite the fact that each of these was developed in response to
completely different market demands, all share the common requirement that
multiple processors co-exist within the system. This appendix outlines how
PCI Express can address these needs through non-transparent bridging.

Because of the widespread popularity of systems using intelligent
adapters, host failover and multi-host technologies, PCI Express silicon
vendors must provide a means to support them. This is actually a
relatively low-risk endeavor, given that PCI Express is
software-compatible with PCI, and PCI systems have long implemented
distributed processing. The most obvious approach, and the one that PLX
espouses, is to emulate the most popular implementation used in the PCI
space for PCI Express. This strategy allows system designers to use not
only a familiar implementation but one that is a proven methodology, and
one that can provide significant software reuse as they migrate from PCI
to PCI Express.

This paper outlines how multiprocessor PCI Express systems will be
implemented using industry standard practices established in the PCI
paradigm. We first, however, will define the different usage models, and
review the successful efforts in the PCI community to develop mechanisms
to accommodate these requirements. Finally, we will cover how PCI Express
systems will utilize non-transparent bridging to provide the functionality
needed for these types of systems.

Usage Models

Intelligent Adapters
Intelligent adapters are typically peripheral devices that use a local
processor to offload tasks from the host. Examples of intelligent adapters
include RAID controllers, modem cards, and content processing blades that
perform tasks such as security and flow processing. Generally, these tasks
are either computationally onerous or require significant I/O bandwidth if
performed by the host. By adding a local processor to the endpoint, system
designers can enjoy significant incremental performance. In the RAID
market, a significant number of products utilize local intelligence for
their I/O processing.

Another example of intelligent adapters is an e-commerce blade. Because
general-purpose host processors are not optimized for the exponential
mathematics necessary for SSL, utilizing a host processor to perform an
SSL handshake typically reduces system performance by over 90%.
Furthermore, one of the requirements for the SSL handshake operation is a
true random number generator. Many general-purpose processors do not have
this feature, so it is actually difficult to perform SSL handshakes
without dedicated hardware. Similar examples abound throughout the
intelligent adapter marketplace; in fact, this usage model is so prevalent
that for many applications it has become the de facto standard
implementation.

Host Failover
Host failover capabilities are designed into systems that require high
availability. High availability has become an increasingly important
requirement, especially in storage and communication platforms. The only
practical way to ensure that the overall system remains operational is to
provide redundancy for all components. Host failover systems typically
include a host-based system attached to several endpoints. In addition, a
backup host is attached to the system and is configured to monitor the
system status. When the primary host fails, the backup host processor must
not only recognize the failure, but then take steps to assume primary
control, remove the failed host to prevent additional disruptions,
reconstitute the system state, and continue the operation of the system
without losing any data.

Multiprocessor Systems
Multiprocessor systems provide greater processing bandwidth by allowing
multiple computational engines to simultaneously work on sections of a
complex problem. Unlike systems utilizing host failover, where the backup
processor is essentially idle, multiprocessor systems utilize all the
engines to boost computational throughput. This enables a system to reach
performance levels not possible by using only a single host processor.
Multiprocessor systems typically consist of two or more complete
subsystems that can pass data between themselves via a special
interconnect. A good example of a multi-host system is a blade server
chassis. Each blade is a complete subsystem, often replete with its own
CPU, Direct Attached Storage, and I/O.

The History of Multi-Processor Implementations Using PCI


To better understand the implementation proposed for PCI Express, one
needs to first understand the PCI implementation.

PCI was originally defined in 1992 for personal computers. Because of the
nature of PCs at that time, the protocol architects did not anticipate the
need for multiprocessors. Therefore, they designed the system assuming
that the host processor would enumerate the entire memory space.
Obviously, if another processor is added, the system operation would fail
as both processors would attempt to service the system requests.

Several methodologies were subsequently invented to accommodate the
requirement for multiprocessor capabilities using PCI. The most popular
implementation, and the one discussed in this paper for PCI Express, is
the use of non-transparent bridging between the processing subsystems to
isolate their memory spaces.[1]

Because the host does not know the system topology when it is first
powered up or reset, it must perform discovery to learn what devices are
present and then map them into the memory space. To support standard
discovery and configuration software, the PCI specification defines a
standard format for Control and Status Registers (CSRs) of compliant
devices. The standard PCI-to-PCI bridge CSR header, called a Type 1
header, includes primary, secondary and subordinate bus number registers
that, when written by the host, define the CSR addresses of devices on the
other side of the bridge. Bridges that employ a Type 1 CSR header are
called transparent bridges.

A Type 0 header is used for endpoints. A Type 0 CSR header includes base
address registers (BARs) used to request memory or I/O apertures from the
host. Both Type 1 and Type 0 headers include a class code register that
indicates what kind of bridge or endpoint is represented, with further
information available in a subclass field and in device ID and vendor ID
registers. The CSR header format and addressing rules allow the processor
to search all the branches of a PCI hierarchy, from the host bridge down
to each of its leaves, reading the class code registers of each device it
finds as it proceeds, and assigning bus numbers as appropriate as it
discovers PCI-to-PCI bridges along the way. At the completion of
discovery, the host knows which devices are present and the memory and I/O
space each device requires to function. These concepts are illustrated in
Figure C-01.
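
The depth-first search described above can be sketched in a few lines.
This is a simplified model rather than real configuration software: the
Device class, its field names and the example topology are assumptions
made for illustration, and the temporary subordinate bus values that real
enumeration code writes while a branch is still being explored are left
out.

```python
from dataclasses import dataclass, field
from typing import List, Optional

TYPE0_ENDPOINT = 0   # Type 0 header: endpoint, search stops here
TYPE1_BRIDGE = 1     # Type 1 header: transparent PCI-to-PCI bridge


@dataclass
class Device:
    name: str
    header_type: int
    children: List["Device"] = field(default_factory=list)
    primary: Optional[int] = None
    secondary: Optional[int] = None
    subordinate: Optional[int] = None


def enumerate_bus(devices, bus, found):
    """Depth-first discovery of one bus; returns the highest bus number used."""
    highest = bus
    for dev in devices:
        found.append((bus, dev.name))
        if dev.header_type == TYPE1_BRIDGE:
            dev.primary = bus
            dev.secondary = highest + 1
            # Descend, then record the last bus number found below this bridge.
            dev.subordinate = enumerate_bus(dev.children, dev.secondary, found)
            highest = dev.subordinate
        # A Type 0 header ends the search along this branch.
    return highest


def build_example():
    """Hypothetical three-port switch with one endpoint on each lower port."""
    return [Device("switch-upstream", TYPE1_BRIDGE, [
        Device("switch-down-0", TYPE1_BRIDGE, [Device("endpoint-a", TYPE0_ENDPOINT)]),
        Device("switch-down-1", TYPE1_BRIDGE, [Device("endpoint-b", TYPE0_ENDPOINT)]),
    ])]
```

Calling enumerate_bus(build_example(), 0, []) assigns buses 1 through 3
below the upstream port, so that port's subordinate bus number register
ends up covering every bus beneath it.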

1. Unless explicitly noted, the architectures for multiprocessor systems
using PCI and PCI Express are similar and may be used interchangeably.

Figure 01: Enumeration Using Transparent Bridges

Implementing Multi-host/Intelligent Adapters in PCI Express Base Systems
Up to this point, our discussions have been limited to one processor with
one memory space. As technology progressed, system designers began
developing endpoints with their own native processors built in. The
problem that this caused was that both the host processor and the
intelligent adapter would, upon power-up or reset, attempt to enumerate
the entire system, causing system conflict and ultimately a non-functional
system.[1]

1. While we are using an intelligent endpoint as the example, we should
note that a similar problem exists for multi-host systems.

To get around this, architects designed non-transparent bridges. A
non-transparent PCI-to-PCI Bridge, or PCI Express-to-PCI Express Bridge,
is a bridge that exposes a Type 0 CSR header on both sides and forwards
transactions from one side to the other with address translation, through
apertures created by the BARs of those CSR headers. Because it exposes a
Type 0 CSR header, the bridge appears to be an endpoint to discovery and
configuration software, eliminating potential discovery software
conflicts. Each BAR on each side of the bridge creates a tunnel or window
into the memory space on the other side of the bridge. To facilitate
communication between the processing domains on each side, the
non-transparent bridge also typically includes doorbell registers to send
interrupts from each side of the bridge to the other, and scratchpad
registers accessible from both sides.

A non-transparent bridge is functionally similar to a transparent bridge
in that both provide a path between two independent PCI buses (or PCI
Express links). The key difference is that when a non-transparent bridge
is used, devices on the downstream side of the bridge (relative to the
system host) are not visible from the upstream side. This allows an
intelligent controller on the downstream side to manage the devices in its
local domain, while at the same time making them appear as a single device
to the upstream controller. The path between the two buses allows the
devices on the downstream side to transfer data directly to the upstream
side of the bus without directly involving the intelligent controller in
the data movement. Thus transactions are forwarded across the bus
unfettered just as in a PCI-to-PCI Bridge, but the resources responsible
are hidden from the host, which sees a single device.

Because we now have two memory spaces, the PCI Express system needs to
translate addresses of transactions that cross from one memory space to
the other. This is accomplished via Translation and Limit Registers
associated with the BAR. See "Address Translation" on page 958 for a
detailed description; Figure C-02 on page 949 provides a conceptual
rendering of Direct Address Translation. Address translation can be done
by Direct Address Translation (essentially replacement of the data under a
mask), table lookup, or by adding an offset to an address. Figure C-03 on
page 950 shows Table Lookup Translation used to create multiple windows
spread across system memory space for packets originated in a local I/O
processor's domain, as well as Direct Address Translation used to create a
single window in the opposite direction.
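
The three translation styles named above can be illustrated with a short
sketch. The function and parameter names here are hypothetical and do not
correspond to any particular vendor's register set; the sketch assumes
power-of-two aperture sizes and naturally aligned bases.

```python
def direct_translate(addr, bar_base, size, translation_base):
    """Replace the address bits above the aperture size with the translation base."""
    assert size & (size - 1) == 0, "aperture size must be a power of two"
    assert bar_base % size == 0 and translation_base % size == 0
    assert bar_base <= addr < bar_base + size, "address misses the aperture"
    mask = size - 1                     # bits that pass through unchanged
    return translation_base | (addr & mask)


def offset_translate(addr, bar_base, target_base):
    """Add a fixed offset (target_base - bar_base) to the address."""
    return addr + (target_base - bar_base)


def lookup_translate(addr, bar_base, page_size, table):
    """Select a translated base per page from a table; keep the offset in the page."""
    index, offset = divmod(addr - bar_base, page_size)
    assert 0 <= index < len(table), "address misses the aperture"
    return table[index] + offset


if __name__ == "__main__":
    print(hex(direct_translate(0x8000_1234, 0x8000_0000, 0x1_0000, 0x2_4000_0000)))
    print(hex(lookup_translate(0x9000_1010, 0x9000_0000, 0x1000,
                               [0x1000_0000, 0x5000_0000])))
```

Direct translation yields one contiguous window on the far side, while the
lookup table lets each page of the aperture land in a different region,
which is how multiple windows are created.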

Figure 02: Direct Address Translation

Figure 03: Look-Up Table Translation Creates Multiple Windows

Example: Implementing Intelligent Adapters in a PCI Express Base System
Intelligent adapters will be pervasive in PCI Express systems, and will
likely be the most widely used example of systems with multiple
processors.

Figure C-04 on page 951 illustrates how PCI Express systems will implement
intelligent adapters. The system diagram consists of a system host, a root
complex (the PCI Express version of a Northbridge), a three-port switch,
an example endpoint, and an intelligent add-in card. Similar to the system
architecture, the add-in card contains a local host, a root complex, a
three-port switch, and an example endpoint. However, we should note two
significant differences: the intelligent add-in card contains an EEPROM,
and one port of the switch contains a back-to-back non-transparent bridge.

Figure 04: Intelligent Adapters in PCI and PCI Express Systems

Upon power-up, the system host will begin enumerating to determine the
topology. It will pass through the Root Complex and enter the first switch
(Switch A). Upon entering the topmost port, it will see a transparent
bridge, so it will know to continue to enumerate. The host will then poll
the leftmost port and, upon finding a Type 0 CSR header, will consider it
an endpoint and explore no deeper along that branch of the PCI hierarchy.
The host will then use the information in the endpoint's CSR header to
configure base and limit registers in bridges and BARs in endpoints to
complete the memory map for this branch of the system.

The host will then explore the rightmost port of Switch A and read the CSR
header registers associated with the top port of Switch B. Because this
port is a non-transparent bridge, the host finds a Type 0 CSR header. The
host processor therefore believes that this is an endpoint and explores no
deeper along that branch of the PCI hierarchy. The host reads the BARs of
the top port of Switch B to determine the memory requirements for windows
into the memory space on the other side of the bridge. The memory space
requirements can be preloaded from an EEPROM into the BAR Setup Registers
of Switch B's non-transparent port or can be configured by the processor
that is local to Switch B prior to allowing the system host to complete
discovery.
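
Reading a BAR to learn its memory requirement is conventionally done with
the PCI sizing handshake: write all ones, read back, and see which address
bits stuck. The sketch below models that for a 32-bit memory BAR. The
SimulatedMemoryBar32 class is a hypothetical stand-in for hardware, with
its size argument playing the role of the value preloaded into a setup
register; it is not a real device or driver API.

```python
class SimulatedMemoryBar32:
    """Hypothetical stand-in for a 32-bit memory BAR; not a real device API."""

    def __init__(self, size, prefetchable=False):
        assert size >= 16 and size & (size - 1) == 0, "size must be a power of two"
        self._address_mask = ~(size - 1) & 0xFFFFFFFF  # writable address bits
        self._flags = 0x8 if prefetchable else 0x0     # low bits are read-only
        self._address = 0

    def write(self, value):
        # Only address bits at or above the aperture size are implemented.
        self._address = value & self._address_mask

    def read(self):
        return self._address | self._flags


def size_bar(bar):
    """Return the aperture size in bytes requested by a 32-bit memory BAR."""
    saved = bar.read()
    bar.write(0xFFFFFFFF)             # writable bits read back as ones
    probe = bar.read() & 0xFFFFFFF0   # drop the type/prefetch flag bits
    bar.write(saved)                  # restore the original contents
    return (~probe & 0xFFFFFFFF) + 1


if __name__ == "__main__":
    print(hex(size_bar(SimulatedMemoryBar32(0x10_0000))))
```

Because the host only sees which bits are writable, the same handshake
works whether the aperture size came from an EEPROM load or from the local
processor.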

Similar to the host processor power-up sequence, the local host will also
begin enumerating its own system. Like the system host processor, it will
allocate memory for endpoints and continue to enumerate when it encounters
a transparent bridge. When the local host reaches the topmost port of
Switch B, it sees a non-transparent bridge with a Type 0 CSR header.
Accordingly, it reads the BARs of the CSR header to determine the memory
aperture requirements, then terminates discovery along this branch of its
PCI tree. Again, the memory aperture information can be supplied by an
EEPROM, or by the system host.

Communication between the two processor domains is achieved via a mailbox
system and doorbell interrupts. The doorbell facility allows each
processor to send interrupts to the other. The mailbox facility is a set
of dual-ported registers that are both readable and writable by both
processors. Shared memory-mapped mechanisms via the BARs may also be used
for inter-processor communication.
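
A toy model can make the mailbox and doorbell handshake concrete. The
class below is only a sketch: the register counts, the side names "A" and
"B", and the method names are assumptions, and a real bridge would expose
these registers through its BARs rather than through Python attributes.

```python
class NtBridgeMessaging:
    """Toy model of the scratchpad (mailbox) and doorbell registers."""

    def __init__(self, scratchpad_count=8):
        self.scratchpad = [0] * scratchpad_count   # readable/writable by both sides
        self.pending = {"A": 0, "B": 0}            # doorbell bits latched per side
        self.handlers = {}                         # side -> interrupt handler

    def on_interrupt(self, side, handler):
        self.handlers[side] = handler

    def ring_doorbell(self, sender, bit):
        """Set a doorbell bit on the opposite side and raise its interrupt."""
        target = "B" if sender == "A" else "A"
        self.pending[target] |= 1 << bit
        handler = self.handlers.get(target)
        if handler is not None:
            handler(self, self.pending[target])

    def clear_doorbell(self, side, bits):
        self.pending[side] &= ~bits


def demo():
    bridge = NtBridgeMessaging()
    received = []

    def side_b_isr(br, pending_bits):
        # Side B's interrupt handler: fetch the message, then acknowledge.
        received.append(br.scratchpad[0])
        br.clear_doorbell("B", pending_bits)

    bridge.on_interrupt("B", side_b_isr)
    bridge.scratchpad[0] = 0xCAFE      # side A posts a message
    bridge.ring_doorbell("A", bit=0)   # side A interrupts side B
    return received, bridge.pending["B"]
```

The ordering matters in this pattern: the sender fills the scratchpad
before ringing the doorbell, so the receiver always finds a complete
message when its interrupt fires.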

Example: Implementing Host Failover in a PCI Express System
Figure C-05 on page 953 illustrates how most PCI Express systems will
implement host failover. The primary host processor in this illustration
is on the left side of the diagram, with the backup host on the right side
of the diagram. Like most systems with which we are familiar, the host
processor connects to a root complex. In turn, the root complex routes its
traffic to the switch. In this example, the switch has two ports to
endpoints in addition to the upstream port for the primary host we have
just described. Furthermore, this system also has another processor, which
is connected to the switch via another root complex.

Figure 05: Host Failover in PCI and PCI Express Systems

The switch ports to both processors need to be configurable to behave
either as a transparent bridge or a non-transparent bridge. An EEPROM or
strap pins on the switch can be used to initially bootstrap this
configuration.

Under normal operation, upon power-up, the primary host begins to
enumerate the system. In our example, as the primary host processor begins
its discovery protocol through the fabric, it discovers the two endpoints,
and their memory requirements, by sizing their BARs. When it gets to the
upper right port, it finds a Type 0 CSR header. This signifies to the
primary host processor that it should not attempt discovery on the far
side of the associated switch port. As in the previous example, the BARs
associated with the non-transparent switch port may have been configured
by EEPROM load prior to discovery or might be configured by software
running on the local processor.

Again, similar to the previous example, the backup processor powers up and
begins to enumerate. In this example, the backup processor chipset
consists of the root complex and the backup processor only. It discovers
the non-transparent switch port and terminates its discovery there. It is
keyed by EEPROM-loaded Device ID and Vendor ID registers to load an
appropriate driver.

During the course of normal operation, the host processor performs all of
its normal duties as it actively manages the system. In addition, it will
send messages to the backup processor called heartbeat messages. Heartbeat
messages are indications of the continued good health of the originating
processor. A heartbeat message might be as simple as a doorbell interrupt
assertion, but typically would include some data to reduce the possibility
of a false positive. Checkpoint and journal messages are alternative
approaches to providing the backup processor with a starting point, should
it need to take over. In the journal methodology, the backup is provided
with a list or journal of completed transactions (in the
application-specific sense, not in the sense of bus transactions). In the
checkpoint methodology, the backup is periodically provided with a
complete system state from which it can restart if necessary. The
heartbeat's job is to provide the means by which the backup processor
verifies that the host processor is still operational. Typically this data
provides the latest activities and the state of all the peripherals.

If the backup processor fails to receive timely heartbeat messages, it
will begin assuming control. One of its first tasks is to demote the
primary port to prevent the failed processor from interacting with the
rest of the system. This is accomplished by reprogramming the CSRs of the
switch using a memory-mapped view of the switch's CSRs provided via a BAR
in the non-transparent port. To take over, the backup processor reverses
the transparent/non-transparent modes at both its port and the primary
processor's port and takes down the link to the primary processor. After
cleaning up any transactions left in the queues or left in an incomplete
state as a result of the host failure, the backup processor reconfigures
the system so that it can serve as the host. Finally, it uses the data in
the checkpoint or journal messages to restart the system.
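
The takeover sequence can be summarized as a small watchdog model. The
sketch below is illustrative only: the dictionary standing in for the
switch CSRs, the timeout value, and every name in it are assumptions, and
time is passed in explicitly so the behavior is deterministic.

```python
TRANSPARENT = "transparent"
NON_TRANSPARENT = "non-transparent"


class BackupHost:
    """Toy model of the backup processor's heartbeat watchdog."""

    def __init__(self, switch, timeout):
        self.switch = switch        # stand-in for the memory-mapped switch CSRs
        self.timeout = timeout      # longest tolerated gap between heartbeats
        self.last_heartbeat = 0.0
        self.checkpoint = None
        self.is_host = False

    def on_heartbeat(self, now, checkpoint=None):
        self.last_heartbeat = now
        if checkpoint is not None:
            self.checkpoint = checkpoint   # latest restart point

    def poll(self, now):
        """Call periodically; returns True when a takeover was performed."""
        if self.is_host or now - self.last_heartbeat <= self.timeout:
            return False
        self._take_over()
        return True

    def _take_over(self):
        ports = self.switch["ports"]
        # Reverse the transparent/non-transparent modes of the two host ports.
        ports["primary"]["mode"] = NON_TRANSPARENT
        ports["backup"]["mode"] = TRANSPARENT
        ports["primary"]["link_up"] = False   # isolate the failed host
        self.switch["pending"].clear()        # discard incomplete transactions
        self.is_host = True                   # restart would use self.checkpoint


def make_switch():
    """Hypothetical initial switch state: primary transparent, backup NT."""
    return {
        "ports": {
            "primary": {"mode": TRANSPARENT, "link_up": True},
            "backup": {"mode": NON_TRANSPARENT, "link_up": True},
        },
        "pending": ["read-42"],
    }
```

A caller would invoke on_heartbeat whenever a heartbeat message arrives
and poll on a timer; once poll returns True, the backup owns the fabric
and can restart from the stored checkpoint.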

Example: Implementing Dual Host in a PCI Express Base System
Figure C-06 on page 955 illustrates how PCI Express systems might
implement a dual host system.[1] In this example, the leftmost blocks are
a typical complete system, with the rightmost blocks being a separate
subsystem. As previously discussed, connecting the leftmost and rightmost
blocks is a set of non-transparent bridges.

Figure 06: Dual Host in a PCI and PCI Express System

Upon power-up, both processors will begin enumerating. As before, the
hosts will search out the endpoints by reading the CSR and then allocate
memory appropriately. When the hosts encounter the non-transparent bridge
port in each of their private switches, they will assume it is an endpoint
and, using the data in the EEPROM, allocate resources. Both systems will
use the doorbell and mailbox registers described above to communicate with
each other.

1. Back-to-back non-transparent (NT) ports are unnecessary but occur as a
result of the use of identical single board computers for both hosts. A
transparent backplane fabric would typically be interposed between the two
NT ports.
The dual-host system model may be extended to a fully redundant dual-star
system by using additional switches to dual-port the hosts and line cards
into a redundant fabric as shown in Figure C-07 on page 957. This is
particularly attractive to vendors who employ chassis-based systems for
their flexibility, scalability and reliability.

Two host cards are shown. Host A is the primary host of Fabric A and the
secondary host of Fabric B. Similarly, Host B is the primary host of
Fabric B and the secondary host of Fabric A.

Each host is connected to the fabric it serves via a transparent
bridge/switch port and to the fabric for which it provides only backup via
a non-transparent bridge/switch port. These non-transparent ports are used
for host-to-host communications and also support cross-domain peer-to-peer
transfers where address maps do not allow a more direct connection.

Figure 07: Dual-Star Fabric

Summary
Through non-transparent bridging, PCI Express Base offers vendors the
ability to integrate intelligent adapters and multi-host systems into
their next-generation designs. This appendix demonstrated how these
features will be deployed using de facto standard techniques adopted in
the PCI environment and showed how they would be utilized for various
applications. Because of this, we can expect this methodology to become
the industry standard in the PCI Express paradigm.

Address Translation
This section provides an in-depth description of how systems that use non-transparent bridges communicate using address translation. We provide details about the mechanism by which systems determine the size of the memory allocated and about how memory pointers are employed. Implementations using both Direct Address Translation and Lookup Table Based Address Translation are discussed. By carrying the same standardized architectural implementation of non-transparent bridging popularized in the PCI paradigm into the PCI Express environment, interconnect vendors can speed market adoption of PCI Express into markets requiring intelligent adapters, host failover and multi-host capabilities.

The transparent bridge uses base and limit registers in I/O space, non-prefetchable memory space, and prefetchable memory space to map transactions in the downstream direction across the bridge. All downstream devices are required to be mapped in contiguous address regions such that a single aperture in each space is sufficient. Upstream mapping is done via inverse decoding relative to the same registers. A transparent bridge does not translate the addresses of forwarded transactions/packets.
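
The base/limit decode described above can be sketched in a few lines. This is a minimal illustration rather than a register-accurate model: the class name, the function names, and the window values are hypothetical, and the limit is assumed to be an inclusive upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aperture:
    """One base/limit window of a transparent bridge (inclusive bounds assumed)."""
    base: int
    limit: int

    def contains(self, addr: int) -> bool:
        return self.base <= addr <= self.limit


def forward_downstream(aperture: Aperture, addr: int) -> bool:
    """Primary-side request: forwarded only if it falls inside the window."""
    return aperture.contains(addr)


def forward_upstream(aperture: Aperture, addr: int) -> bool:
    """Secondary-side request: inverse decode, forwarded only if it falls outside."""
    return not aperture.contains(addr)


# Hypothetical 16 MB non-prefetchable memory window.
mem = Aperture(base=0xE000_0000, limit=0xE0FF_FFFF)
inside, outside = 0xE012_3456, 0x8000_0000
print(forward_downstream(mem, inside), forward_upstream(mem, inside))
print(forward_downstream(mem, outside), forward_upstream(mem, outside))
```

In both directions the address itself is passed through unchanged, which is the property that distinguishes the transparent bridge from the non-transparent bridge discussed next.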
The non-transparent bridges use the standard set of BARs in their Type 0 CSR header to define apertures into the memory space on the other side of the bridge. There are two sets of BARs: one on the Primary side and one on the Secondary. BARs define resource apertures that allow the forwarding of transactions to the opposite (other side) interface.

For each bridge BAR there exists a set of associated control and setup registers, usually writable from the other side of the bridge. Each BAR has a setup register, which defines the size and type of its aperture, and an address translation register. Some BARs also have a limit register that can be used to restrict the aperture's size. These registers need to be programmed prior to allowing access from outside the local subsystem. This is typically done by software running on a local processor or by loading the registers from EEPROM.

In PCI Express, the Transaction ID fields of packets passing through these apertures are also translated to support Device ID routing. These Device IDs are used to route completions to non-posted requests and ID-routed messages.

The transparent bridge forwards CSR transactions in the downstream direction according to the secondary and subordinate bus number registers, converting Type 1 CSRs to Type 0 CSRs as required. The non-transparent bridge accepts only those CSR transactions addressed to it and returns an unsupported request response to all others.
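
The two CSR-handling behaviors can be contrasted with a short sketch. The function names and the returned strings are illustrative assumptions; the bus-number comparison follows the usual PCI convention that the secondary bus is directly behind the bridge and the subordinate bus is the highest-numbered bus behind it.

```python
def transparent_bridge_route(secondary: int, subordinate: int, target_bus: int) -> str:
    """What a transparent bridge does with a downstream Type 1 configuration request."""
    if target_bus == secondary:
        return "convert to Type 0"   # target sits directly on the secondary bus
    if secondary < target_bus <= subordinate:
        return "forward as Type 1"   # target is deeper in the hierarchy
    return "not forwarded"           # target bus is not behind this bridge


def non_transparent_bridge_route(addressed_to_bridge: bool) -> str:
    """A non-transparent bridge only accepts configuration requests aimed at itself."""
    return "accept" if addressed_to_bridge else "Unsupported Request"


# Hypothetical bridge with secondary bus 2 and subordinate bus 5.
for bus in (2, 4, 7):
    print(bus, transparent_bridge_route(2, 5, bus))
print(non_transparent_bridge_route(False))
```

Because the non-transparent bridge never forwards configuration requests, the devices behind it stay hidden from the enumeration software on the other side.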

Direct Address Translation


The addresses of all upstream and downstream transactions are translated (except BARs accessing CSRs). With the exception of the cases in the following two sections, addresses that are forwarded from one interface to the other are translated by adding a Base Address to their offset within the BAR that they landed in, as seen in Figure C-8 on page 959. The BAR Base Translation Registers are used to set up these base translations for the individual BARs.
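
As a worked illustration of this rule, the following sketch computes the translated address as the translation base plus the offset within the BAR. The function name, parameter names, and the example addresses are hypothetical.

```python
def direct_translate(addr: int, bar_base: int, bar_size: int, translation_base: int) -> int:
    """Translate an address that hit a BAR: new address = translation base + offset."""
    offset = addr - bar_base
    if not 0 <= offset < bar_size:
        raise ValueError("address does not fall inside this BAR")
    return translation_base + offset


# Hypothetical 1 MB BAR at 0x8000_0000 whose translation register holds 0x2000_0000.
translated = direct_translate(0x8000_1234, bar_base=0x8000_0000,
                              bar_size=0x10_0000, translation_base=0x2000_0000)
print(hex(translated))
```

The offset 0x1234 is preserved; only the base of the address changes as the transaction crosses the bridge.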

Figure C-8: Direct Address Translation

Lookup Table Based Address Translation


Following the de facto standard adopted by the PCI community, PCI Express should provide several BARs for the purposes of allocating resources. All BARs contain the memory allocation; however, in accordance with PCI industry conventions, BAR 0 contains the CSR information, BAR 1 contains I/O information, and BAR 2 and BAR 3 are utilized for Lookup Table Based Translation. BAR 4 and BAR 5 are utilized for Direct Address Translations.

On the secondary side, BAR 3 uses a special lookup table based address translation for transactions that fall inside its window, as seen in Figure C-9 on page 960. The lookup table provides more flexibility in mapping secondary bus local addresses to primary bus addresses. The location of the index field within the address is programmable to adjust aperture size.
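
A lookup table translation of this kind might be modeled as follows. The text does not give the table format, so this sketch assumes that each entry holds a translated page base and that a programmable shift (here called index_shift, a hypothetical name) selects which address bits form the index.

```python
from typing import Sequence


def lut_translate(addr: int, bar_base: int, index_shift: int, table: Sequence[int]) -> int:
    """Use the index field of the offset to pick a table entry, then keep the page offset."""
    offset = addr - bar_base
    window_size = len(table) << index_shift   # moving the index field resizes the aperture
    if not 0 <= offset < window_size:
        raise ValueError("address does not fall inside the lookup window")
    index = offset >> index_shift
    page_offset = offset & ((1 << index_shift) - 1)
    return table[index] + page_offset


# Hypothetical four-entry table; each entry is the translated base of one page.
table = [0x1000_0000, 0x4000_0000, 0x7FFF_0000, 0x2222_0000]

# Index field at bit 16: four 64 KB pages, 256 KB aperture.
print(hex(lut_translate(0x9002_0010, 0x9000_0000, 16, table)))
# Index field at bit 20: four 1 MB pages, 4 MB aperture.
print(hex(lut_translate(0x9002_0010, 0x9000_0000, 20, table)))
```

The same incoming address selects a different entry when the index field moves, which is how the programmable index location trades page size against aperture size. Unlike direct translation, neighboring pages of the window need not map to neighboring regions on the other side.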

Figure C-9: Lookup Table Based Translation

Downstream BAR Limit Registers


The two downstream BARs on the primary side (BAR2/3 and BAR4/5) also have Limit registers, programmable from the local side, to further restrict the size of the window they expose, as seen in Figure C-10 on page 961. BARs can only be assigned memory resources in power-of-two granularity. The limit registers provide a means to obtain better granularity by capping the size of the BAR within the power-of-two granularity. Only transactions below the Limit registers are forwarded to the secondary bus. Transactions above the limit are discarded or, on reads, return 0xFFFFFFFF or a master-abort equivalent packet.
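
The effect of a Limit register can be illustrated with a small decision function. This sketch assumes the limit is an absolute address and that a read above the limit returns all ones; the master-abort alternative mentioned in the text is not modeled. All names and values are hypothetical.

```python
def check_limit(addr: int, bar_base: int, bar_size: int, limit: int, is_read: bool) -> str:
    """Classify an access that targets a BAR capped by a Limit register."""
    if not bar_base <= addr < bar_base + bar_size:
        return "miss"                      # not claimed by this BAR at all
    if addr < limit:
        return "forward"                   # below the limit: crosses the bridge
    return "return 0xFFFFFFFF" if is_read else "discard"


# Hypothetical 1 MB BAR (power-of-two size) capped to expose only 768 KB.
BAR_BASE, BAR_SIZE, LIMIT = 0x8000_0000, 0x10_0000, 0x800C_0000
print(check_limit(0x8004_0000, BAR_BASE, BAR_SIZE, LIMIT, is_read=True))
print(check_limit(0x800D_0000, BAR_BASE, BAR_SIZE, LIMIT, is_read=True))
print(check_limit(0x800D_0000, BAR_BASE, BAR_SIZE, LIMIT, is_read=False))
```

The BAR still decodes the full 1 MB, since BAR sizes must be a power of two, but only the first 768 KB is actually exposed to the other side.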

Figure C-10: Use of Limit Register

Forwarding 64-bit Address Memory Transactions


Certain BARs can be configured to work in pairs to provide the base address and translation for transactions containing 64-bit addresses. Transactions that hit within these 64-bit BARs are forwarded using Direct Address Translation. As in the case of 32-bit transactions, when a memory transaction is forwarded from the primary to the secondary bus, the primary address can be mapped to another address in the secondary bus domain. The mapping is performed by substituting a new base address for the base of the original address.

A 64-bit BAR pair on the system side of the bridge is used to translate a window of 64-bit addresses in packets originated on the system side of the bridge down below 2^32 in local space.
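
The base substitution can be expressed with masks when the BAR is naturally aligned to its power-of-two size. The following sketch makes that alignment assumption explicit; the function name and all addresses are hypothetical.

```python
def substitute_base(addr64: int, bar_base: int, bar_size: int, local_base: int) -> int:
    """Swap the upper (base) bits of a 64-bit address for a local base below 2**32."""
    if bar_size & (bar_size - 1):
        raise ValueError("BAR size must be a power of two")
    mask = bar_size - 1
    if bar_base & mask or local_base & mask:
        raise ValueError("bases must be aligned to the BAR size")
    if addr64 & ~mask != bar_base:
        raise ValueError("address does not fall inside this BAR pair")
    local = local_base | (addr64 & mask)   # keep the offset bits, replace the base bits
    if local >= 1 << 32:
        raise ValueError("translated address must lie below 2**32")
    return local


# Hypothetical 256 MB window located at 16 GB in system space, mapped to 1 GB locally.
local = substitute_base(0x0000_0004_0000_2000, bar_base=0x0000_0004_0000_0000,
                        bar_size=0x1000_0000, local_base=0x4000_0000)
print(hex(local))
```

The result is the same as adding the offset to the new base, as in Direct Address Translation; the mask form simply makes the "substitute the base" wording of the text visible.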

Appendix D:
Locked Transactions

Introduction
Native PCI Express implementations do not support the old lock protocol. Support for Locked transaction sequences exists only to support legacy device software executing on the host processor that performs a locked RMW (read-modify-write) operation on a memory location in a legacy PCI device. This appendix describes the protocol defined by PCI Express for this legacy support of locked access sequences that target legacy devices. Failure to support lock may result in deadlocks.

Background
PCI Express supports atomic or uninterrupted transaction sequences (usually described as an atomic read-modify-write sequence) for legacy devices only. Native PCIe devices don't support this at all and will return a Completion with UR (Unsupported Request) status if they receive a locked Request.

Locked operations consist of the basic RMW sequence, that is:

1. One or more memory reads from the target location to obtain the value.
2. The modification of the data in a processor register.
3. One or more writes to write the modified value back to the target memory location.

This transaction sequence must be performed such that no other accesses are permitted to the target locations (or device) during the locked sequence. This requires blocking other transactions during the operation. This can potentially result in deadlocks and poor performance.

The devices required to support locked sequences are:

• The Root Complex.
• Any Switches in the path to a Legacy Device that may be the target of a locked transaction series.
• PCIe-to-PCI Bridge or PCIe-to-PCI-X Bridge.
• Any Legacy Device whose driver issues locked transactions to memory residing within the legacy device.

Locking in the PCI environment is achieved by the use of the LOCK# signal. The equivalent functionality in PCIe is accomplished by using a specific Request that emulates the LOCK# signal functionality.

The PCI Express Lock Protocol


The only source of lock supported by PCI Express is the system processor acting through the Root Complex. A locked operation is performed between a Root Port and the Legacy Endpoint. In most systems, the legacy device is typically a PCI Express-to-PCI or PCI Express-to-PCI-X bridge. Only one locked sequence at a time is supported for a given hierarchical path.

Locked transactions are constrained to use only Traffic Class 0 and Virtual Channel 0. Transactions with other TC values that map to a VC other than zero are permitted to traverse the fabric without regard to the locked operation, but transactions that map to VC0 are affected by the lock rules described here.

Lock Messages: The Virtual Lock Signal


PCI Express defines the following transactions that, together, act as a virtual wire and replace the LOCK# signal.

• Memory Read Lock Request (MRdLk): Originates a locked sequence. The first MRdLk transaction blocks other Requests in VC0 from reaching the target device. One or more of these locked read requests may be issued during the sequence.
• Memory Read Lock Completion with Data (CplDLk): Returns data and confirms that the path to the target is locked. A successful read Completion that returns data for the first Memory Read Lock request results in the path between the Root Complex and the target device being locked. That is, transactions traversing the same path from other ports are blocked from reaching either the root port or the target port. Transactions being routed in buffers for VC1-VC7 are unaffected by the lock.
• Memory Read Lock Completion without Data (CplLk): A Completion without a data payload indicates that the lock sequence cannot complete currently and the path remains unlocked.
• Unlock Message: An unlock message is issued by the Root Complex from the locked root port. This message unlocks the path between the root port and the target port.

The Lock Protocol Sequence: an Example


This section explains the PCI Express lock protocol by example. The example includes the following devices:

• The Root Complex that initiates the Locked transaction series on behalf of the host processor.
• A Switch in the path between the root port and targeted legacy endpoint.
• A PCI Express-to-PCI Bridge in the path to the target.
• The target PCI device whose Device Driver initiated the locked RMW.
• A PCI Express endpoint is included to describe Switch behavior during lock.

In this example, the locked operation completes normally. The steps that occur during the operation are described in the two sections that follow.

The Memory Read Lock Operation


Figure D-1 on page 967 illustrates the first step in the Locked transaction series (i.e., the initial memory read to obtain the semaphore):

1. The CPU initiates the locked sequence (a Locked Memory Read) as a result of a driver executing a locked RMW instruction that targets a PCI target.
2. The Root Port issues a Memory Read Lock Request from port 2. The Root Complex is always the source of a locked sequence.
3. The Switch receives the lock request on its upstream port and forwards the request to the target egress port (3). The switch, upon forwarding the request to the egress port, must block all requests from ports other than the ingress port (1) from being sent from the egress port.
4. A subsequent peer-to-peer transfer from the illustrated PCI Express endpoint to the PCI bus (switch port 2 to switch port 3) would be blocked until the lock is cleared. Note that the lock is not yet established in the other direction. Transactions from the PCI Express endpoint could be sent to the Root Complex.

5. The Memory Read Lock Request is sent from the Switch's egress port to the PCI Express-to-PCI Bridge. This bridge will implement PCI lock semantics (see the MindShare book entitled PCI System Architecture, Fourth Edition, for details regarding PCI lock).
6. The bridge performs the Memory Read transaction on the PCI bus with the PCI LOCK# signal asserted. The target memory device returns the requested semaphore data to the bridge.
7. Read data is returned to the Bridge and is delivered back to the Switch via a Memory Read Lock Completion with Data (CplDLk).
8. The switch uses ID routing to return the packet upstream towards the host processor. When the CplDLk packet is forwarded to the upstream port of the Switch, it establishes a lock in the upstream direction to prevent traffic from other ports from being routed upstream. The PCI Express endpoint is completely blocked from sending any transaction to the Switch ports via the path of the locked operation. Note that transfers between Switch ports not involved in the locked operation would be permitted (not shown in this example).
9. Upon detecting the CplDLk packet, the Root Complex knows that the lock has been established along the path between it and the target device, and the completion data is sent to the CPU.

Figure D-1: Lock Sequence Begins with Memory Read Lock Request

[Figure: CPU, Root Complex, Switch (upstream port 1, downstream ports 2 and 3), PCIe Endpoint on port 2, PCIe-to-PCI Bridge on port 3, and Target Device. Callouts 1 through 9 trace the MRdLk Request downstream and the CplDLk Completion upstream, matching the numbered steps in the text.]

Read Data Modified and Written to Target and Lock Completes

The device driver receives the semaphore value, alters it, and then initiates a memory write to update the semaphore within the memory of the legacy PCI device. Figure D-2 on page 969 illustrates the write sequence followed by the Root Complex's transmission of the Unlock message that releases the lock:

10. The Root Complex issues the Memory Write Request across the locked path to the target device.
11. The Switch forwards the transaction to the target egress port (3). The memory address of the Memory Write must be the same as the initial Memory Read request.
12. The bridge forwards the transaction to the PCI bus.
13. The target device receives the memory write data.
14. Once the Memory Write transaction is sent from the Root Complex, it sends an Unlock message to instruct the Switches and any PCI/PCI-X bridges in the locked path to release the lock. Note that the Root Complex presumes the operation has completed normally (because memory writes are posted and no Completion is returned to verify success).
15. The Switch receives the Unlock message, unlocks its ports and forwards the message to the egress port that was locked to notify any other Switches and/or bridges in the locked path that the lock must be cleared.
16. Upon detecting the Unlock message, the bridge must also release the lock on the PCI bus.

Figure D-2: Lock Completes with Memory Write Followed by Unlock Message

[Figure: CPU, Root Complex, Switch (upstream port 1, downstream ports 2 and 3), PCIe Endpoint, PCIe-to-PCI Bridge, and Target Device. Callouts 10 through 16 trace the Memory Write Request to the target device, followed by the Unlock message that releases the Switch ports and the bridge.]

Notification of an Unsuccessful Lock


A locked transaction series is aborted when the initial Memory Read Lock Request receives a Completion packet with no data (CplLk). This means that the locked sequence must terminate because no data was returned. This could result from an error associated with the memory read transaction, or perhaps the target device is busy and cannot respond at this time.

Summary of Locking Rules


Following is a list of ordering rules that apply to the Root Complex, Switches, and Bridges.

Rules Related To the Initiation and Propagation of Locked Transactions

• Locked Requests which are completed with a status other than Successful Completion do not establish lock.
• Regardless of the status of any of the Completions associated with a locked sequence, all locked sequences and attempted locked sequences must be terminated by the transmission of an Unlock Message.
• MRdLk, CplDLk and Unlock semantics are allowed only for the default Traffic Class (TC0).
• Only one locked transaction sequence attempt may be in progress at a given time within a single hierarchy domain.
• Any device which is not involved in the locked sequence must ignore the Unlock Message.

The initiation and propagation of a locked transaction sequence through the PCI Express fabric is performed as follows:

• A locked transaction sequence is started with a MRdLk Request:
  - Any successive reads associated with the locked transaction sequence must also use MRdLk Requests.
  - The Completions for any successful MRdLk Request use the CplDLk Completion type, or the CplLk Completion type for unsuccessful Requests.
• If any read associated with a locked sequence is completed unsuccessfully, the Requester must assume that the atomicity of the lock is no longer assured, and that the path between the Requester and Completer is no longer locked.
• All writes associated with a locked sequence must use MWr Requests.
• The Unlock Message is used to indicate the end of a locked sequence. A Switch propagates Unlock Messages through the locked Egress Port.
• Upon receiving an Unlock Message, a legacy Endpoint or Bridge must unlock itself if it is in a locked state. If it is not locked, or if the Receiver is a PCI Express Endpoint or Bridge which does not support lock, the Unlock Message is ignored and discarded.
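
Seen from the Requester, these rules produce a short and predictable packet sequence. The sketch below lists the packet types for a single-read, single-write locked RMW; the function name and the string labels are illustrative assumptions, and a real sequence may contain more than one read or write.

```python
from typing import Callable, List, Optional


def locked_rmw_trace(read_value: Optional[int], modify: Callable[[int], int]) -> List[str]:
    """Packet types seen for one locked RMW attempt.

    read_value is None when the MRdLk completes unsuccessfully (CplLk, no data).
    """
    trace = ["MRdLk"]                          # every locked sequence starts with MRdLk
    if read_value is None:
        trace.append("CplLk")                  # lock was not established
    else:
        trace.append("CplDLk")                 # data returned, path is now locked
        trace.append(f"MWr({modify(read_value):#x})")   # writes use ordinary MWr
    trace.append("Unlock")                     # sent after successful and failed attempts alike
    return trace


print(locked_rmw_trace(0x0, lambda v: v | 0x1))   # semaphore was free, driver sets bit 0
print(locked_rmw_trace(None, lambda v: v | 0x1))  # target could not complete the read
```

Note that the Unlock Message closes both traces, reflecting the rule that attempted sequences must be terminated as well as successful ones.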

Rules Related to Switches


Switches must distinguish transactions associated with locked sequences from other transactions to prevent other transactions from interfering with the lock and potentially causing deadlock. The following rules cover how this is done. Note that locked accesses are limited to TC0, which is always mapped to VC0.

• When a Switch propagates a MRdLk Request from an Ingress Port to the Egress Port, it must block all Requests which map to the default Virtual Channel (VC0) from being propagated to the Egress Port. If a subsequent MRdLk Request is received at this Ingress Port addressing a different Egress Port, the behavior of the Switch is undefined. Note that this sort of split-lock access is not supported by PCI Express and software must not cause such a locked access. System deadlock may result from such accesses.
• When the CplDLk for the first MRdLk Request is returned, if the Completion indicates a Successful Completion status, the Switch must block all Requests from all other Ports from being propagated to either of the Ports involved in the locked access, except for Requests which map to channels other than VC0 on the Egress Port.
• The two Ports involved in the locked sequence must remain blocked until the Switch receives the Unlock Message (at the Ingress Port which received the initial MRdLk Request).
  - The Unlock Message must be forwarded to the locked Egress Port.
  - The Unlock Message may be broadcast to all other Ports.
  - The Ingress Port is unblocked once the Unlock Message arrives, and the Egress Port(s) which were blocked are unblocked following the transmission of the Unlock Message out of the Egress Port(s). Ports that were not involved in the locked access are unaffected by the Unlock Message.
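
The Switch rules above amount to a small state machine. The following is a simplified sketch, assuming a single locked path and treating port numbers as plain integers; the class and method names are hypothetical, and the undefined split-lock case is not modeled.

```python
class SwitchLock:
    """Tracks one locked path through a Switch and answers forwarding questions."""

    def __init__(self) -> None:
        self.ingress = None        # port that received the first MRdLk
        self.egress = None         # port the MRdLk was forwarded to
        self.established = False   # True once a successful CplDLk has returned

    def on_mrdlk(self, ingress: int, egress: int) -> None:
        self.ingress, self.egress = ingress, egress

    def on_completion(self, successful: bool) -> None:
        if self.egress is not None and successful:
            self.established = True

    def on_unlock(self) -> None:
        self.ingress = self.egress = None
        self.established = False

    def may_forward(self, src: int, dst: int, vc: int = 0) -> bool:
        if self.egress is None or vc != 0:
            return True            # no lock in progress, or traffic outside VC0
        if src in (self.ingress, self.egress):
            return True            # traffic from the ports in the locked path
        if dst == self.egress:
            return False           # blocked as soon as the MRdLk is propagated
        if self.established and dst == self.ingress:
            return False           # blocked once the CplDLk establishes the lock
        return True


sw = SwitchLock()
sw.on_mrdlk(ingress=1, egress=3)
print(sw.may_forward(2, 3), sw.may_forward(2, 1))   # endpoint to bridge, endpoint to root
sw.on_completion(successful=True)
print(sw.may_forward(2, 1), sw.may_forward(2, 3, vc=1))
sw.on_unlock()
print(sw.may_forward(2, 3))
```

With the port numbering of the earlier example (1 upstream, 2 to the PCI Express endpoint, 3 to the bridge), the endpoint loses access to port 3 at the MRdLk and to port 1 only after the CplDLk, and regains both after the Unlock Message.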

Rules Related To PCI Express/PCI Bridges


The requirements for PCI Express/PCI Bridges are similar to those for Switches, except that, because these Bridges only use TC0 and VC0, all other traffic is blocked during the locked access. Requirements on the PCI bus side are described in the MindShare book, PCI System Architecture, Fourth Edition.

Rules Related To the Root Complex


A Root Complex is permitted to support locked transactions as a Requester. If locked transactions are supported, a Root Complex must follow the rules already described to perform a locked access. The mechanism(s) used by the Root Complex to interface to the host processor's FSB (Front Side Bus) are outside the scope of the spec.

Rules Related To Legacy Endpoints


Legacy Endpoints are permitted to support locked accesses, although their use is discouraged. If locked accesses are supported, legacy Endpoints must handle them as follows:

• The legacy Endpoint becomes locked when it transmits the first Completion for the first read request of the locked transaction series with a Successful Completion status:
  - If the completion status is not Successful Completion, the legacy Endpoint does not become locked.
  - Once locked, the legacy Endpoint must remain locked until it receives the Unlock Message.
• While locked, a legacy Endpoint must not issue any Requests using Traffic Classes which map to the default Virtual Channel (VC0). Note that this requirement applies to all possible sources of Requests within the Endpoint, in the case where there is more than one possible source of Requests. Requests may be issued using TCs which map to VCs other than VC0.
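
The legacy Endpoint's side of the protocol reduces to a single locked flag. This is a minimal sketch with hypothetical names; it models only the rules listed above and assumes the caller already knows which VC a Request maps to.

```python
class LegacyEndpointLock:
    """Locked state of a legacy Endpoint that supports locked accesses."""

    def __init__(self) -> None:
        self.locked = False

    def on_first_read_completion(self, successful: bool) -> None:
        # Only a Successful Completion for the first locked read sets the lock.
        if successful:
            self.locked = True

    def on_unlock(self) -> None:
        # The Unlock Message releases the lock; if not locked, it has no effect.
        self.locked = False

    def may_issue_request(self, vc: int) -> bool:
        # While locked, Requests that map to VC0 must be held back.
        return not (self.locked and vc == 0)


ep = LegacyEndpointLock()
ep.on_first_read_completion(successful=True)
print(ep.may_issue_request(vc=0), ep.may_issue_request(vc=1))
ep.on_unlock()
print(ep.may_issue_request(vc=0))
```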

Rules Related To PCI Express Endpoints


Native PCI Express Endpoints do not support lock. A PCI Express Endpoint must treat a MRdLk Request as an Unsupported Request.

Glossary

Term Definition
128b/130b Encoding    This isn't encoding in the same sense as 8b/10b. Instead, the transmitter sends information in Blocks that consist of 16 raw bytes in a row, preceded by a 2-bit Sync field that indicates whether the Block is to be considered as a Data Block or an Ordered Set Block. This scheme was introduced with Gen3, primarily to allow the Link bandwidth to double without doubling the clock rate. It provides better bandwidth utilization but sacrifices some benefits that 8b/10b provided for receivers.

8b/10b Encoding    Encoding scheme developed many years ago that's used in many serial transports today. It was designed to help receivers recover the clock and data from the incoming signal, but it also reduces available bandwidth at the receiver by 20%. This scheme is used with the earlier versions of PCIe: Gen1 and Gen2.

ACK/NAK Protocol    The Acknowledge/Negative Acknowledge mechanism by which the Data Link Layer reports whether TLPs have experienced any errors during transmission. If so, a NAK is returned to the sender to request a replay of the failed TLPs. If not, an ACK is returned to indicate that one or more TLPs have arrived safely.

ACPI    Advanced Configuration and Power Interface. Specifies the various system and device power states.

ACS    Access Control Services.

ARI    Alternative Routing-ID Interpretation; optional feature that allows Endpoints to have more Functions than the 8 allowed normally.

ASPM    Active State Power Management: When enabled, this allows hardware to make changes to the Link power state from L0 to L0s or L1 or both.

AtomicOps    Atomic Operations; three new Requests added with the 2.1 spec revision. These carry out multiple operations that are guaranteed to take place without interruption within the target device.

Bandwidth Management    Hardware-initiated changes to Link speed or width for the purpose of power conservation or reliability.

BAR    Base Address Register. Used by Functions to indicate the type and size of their local memory and IO space.

Beacon    Low-frequency in-band signal used by Devices whose main power has been shut off to signal that an event has occurred for which they need to have the power restored. This can be sent across the Link when the Link is in the L2 state.

BER    Bit Error Rate or Ratio; a measure of signal integrity based on the number of transmission bit errors seen within a time period.

Bit Lock    The process of acquiring the transmitter's precise clock frequency at the receiver. This is done in the CDR logic and is one of the first steps in Link Training.

Block    The 130-bit unit sent by a Gen3 transmitter, made up of a 2-bit Sync Field followed by a group of 16 bytes.

Block Lock    Finding the Block boundaries at the Receiver when using 128b/130b encoding so as to recognize incoming Blocks. The process involves three phases. First, search the incoming stream for an EIEOS (Electrical Idle Exit Ordered Set) and adjust the internal Block boundary to match it. Next, search for the SDS (Start Data Stream) Ordered Set. After that, the receiver is locked into the Block boundary.

Bridge    A Function that acts as the interface between two buses. Switches and the Root Complex will implement bridges on their Ports to enable packet routing, and a bridge can also be made to connect between different protocols, such as between PCIe and PCI.

Byte Striping    Spreading the output byte stream across all available Lanes. All available Lanes are used whenever sending bytes.

CC    Credits Consumed: Number of credits already used by the transmitter when calculating Flow Control.

CDR    Clock and Data Recovery logic used to recover the Transmitter clock from the incoming bit stream and then sample the bits to recognize patterns. For 8b/10b, that pattern, found in the COM, FTS, and EIEOS symbols, allows the logic to acquire Symbol Lock. For 128b/130b the EIEOS sequence is used to acquire Block Lock by going through the three phases of locking.

Character    Term used to describe the 8-bit values to be communicated between Link neighbors. For Gen1 and Gen2, these are a mix of ordinary data bytes (labeled as D characters) and special control values (labeled as K characters). For Gen3 there are no control characters because 8b/10b encoding is no longer used. In that case, the characters all appear as data bytes.

CL    Credit Limit: Flow Control credits seen as available from the transmitter's perspective. Checked to verify whether enough credits are available to send a TLP.

Control Character    These are special characters (labeled as K characters) used in 8b/10b encoding that facilitate Link training and Ordered Sets. They are not used in Gen3, where there is no distinction between characters.

Correctable Errors    Errors that are corrected automatically by hardware and don't require software attention.

CR    Credits Required: this is the sum of CC and PTLP.

CRC    Cyclic Redundancy Code; added to TLPs and DLLPs to allow verifying error-free transmission. The name means that the patterns are cyclic in nature and are redundant (they don't add any extra information). The codes don't contain enough information to permit automatic error correction, but provide robust error detection.

Cut-Through Mode    Mechanism by which a Switch allows a TLP to pass through, forwarded from an ingress Port to an egress Port without storing it first to check for errors. This involves a risk, since the TLP may be found to have errors after part of it has already been forwarded to the egress Port. In that case, the end of the packet is modified in the Data Link Layer to have an LCRC value that is inverted from what it should be, and also modified at the Physical Layer to have an End Bad (EDB) framing symbol for 8b/10b encoding or an EDB token for 128b/130b encoding. The combination tells the receiver that the packet has been nullified and should be discarded without sending an ACK/NAK response.

Data Characters    Characters (labeled as D characters) that represent ordinary data and are distinguished from control characters in 8b/10b. For Gen3, there is no distinction between characters.

Data Stream    The flow of data Blocks for Gen3 operation. The stream is entered by an SDS (Start of Data Stream Ordered Set) and exited with an EDS (End of Data Stream token). During a Data Stream, only data Blocks or the SOS are expected. When any other Ordered Sets are needed, the Data Stream must be exited and only re-entered when more data Blocks are ready to send. Starting a Data Stream is equivalent to entering the L0 Link state, since Ordered Sets are only sent while in other LTSSM states, like Recovery.

De-emphasis    The process of reducing the transmitter voltage for repeated bits in a stream. This has the effect of de-emphasizing the low-frequency components of the signal that are known to cause trouble in a transmission medium and thus improves the signal integrity at the receiver.

Digest    Another name for the ECRC (End-to-End CRC) value that can optionally be appended to a TLP when it's created in the Transaction Layer.

DLCMSM    Data Link Control and Management State Machine; manages the Link Layer training process (which is primarily Flow Control initialization).

DLLP    Data Link Layer Packet. These are created in the Data Link Layer and are forwarded to the Physical Layer but are not seen by the Transaction Layer.

DPA    Dynamic Power Allocation; a new set of configuration registers with the 2.1 spec revision that defines 32 power substates under the D0 device power state, making it easier for software to control device power options.

DSP (Downstream Port)    Port that faces downstream, like a Root Port or a Switch Downstream Port. This distinction is meaningful in the LTSSM because the Ports have assigned roles during some states.

ECRC    End-to-End CRC value, optionally appended to a TLP when it's created in the Transaction Layer. This enables a receiver to verify reliable packet transport from source to destination, regardless of how many Links were crossed to get there.

Egress Port    Port that has outgoing traffic.

Elastic Buffer    Part of the CDR logic, this buffer enables the receiver to compensate for the difference between the transmitter and receiver clocks.

EMI    Electro-Magnetic Interference: the emitted electrical noise from a system. For PCIe, both SSC and scrambling are used to attack it.

Endpoint    PCIe Function that is at the bottom of the PCI Inverted-Tree structure.

Enumeration    The process of system discovery in which software reads all of the expected configuration locations to learn which PCI-configurable Functions are visible and thus present in the system.

Equalization    The process of adjusting Tx and Rx values to compensate for actual or expected signal distortion through the transmission media. For Gen1 and Gen2, this takes the form of Tx De-emphasis. For Gen3, an active evaluation process is introduced to test the signaling environment and adjust the Tx settings accordingly, and optional Rx equalization is mentioned.

Flow Control    Mechanism by which transmitters avoid the risk of having packets rejected at a receiver due to lack of buffer space. The receiver sends periodic updates about available buffer space and the transmitter verifies that enough is available before attempting to send a packet.

FLR    Function-Level Reset

Framing Symbols    The start and end control characters used in 8b/10b encoding that indicate the boundaries of a TLP or DLLP.

Gen1    Generation 1, meaning designs created to be compliant with the 1.x version of the PCIe spec.

Gen1, Gen2, Gen3    Abbreviations for the revisions of the PCIe spec. Gen1 = rev 1.x, Gen2 = rev 2.x, and Gen3 = rev 3.0

Gen2    Generation 2, meaning designs created to be compliant with the 2.x version of the PCIe spec.

Gen3    Generation 3, meaning designs created to be compliant with the 3.x version of the PCIe spec.

IDO    ID-based Ordering; when enabled, this allows TLPs from different Requesters to be forwarded out of order with respect to each other. The goal is to improve latency and performance.

Implicit Routing    TLPs whose routing is understood without reference to an address or ID. Only Message requests have the option to use this type of routing.

Ingress Port    Port that has incoming traffic.

ISI    Inter-Symbol Interference; the effect on one bit time that is caused by the recent bits that preceded it.

Lane    The two differential pairs that allow a transmit and receive path of one bit between two Ports. A Link can consist of just one Lane or it may contain as many as 32 Lanes.

Lane-to-Lane Skew    Difference in arrival times for bits on different Lanes. Receivers are required to detect this and correct it internally.

Legacy Endpoint    An Endpoint that carries any of three legacy items forward: support for IO transactions, support for local 32-bit-only prefetchable memory space, or support for the locked transactions.

LFSR    Linear-Feedback Shift Register; creates a pseudo-random pattern used to facilitate scrambling.

Link    Interface between two Ports, made up of one or more Lanes.

LTR    Latency-Tolerance Reporting; mechanism that allows devices to report to the system how quickly they need to get service when they send a Request. Longer latencies afford more power management options to the system.

LTSSM    Link Training and Status State Machine; manages the training process for the Physical Layer.

Non-posted Request    A Request that expects to receive a Completion in response. For example, any read request would be non-posted.

Non-prefetchable Memory    Memory that exhibits side effects when read. For example, a status register that automatically self-clears when read. Such data is not safe to prefetch since, if the requester never requested the data and it was discarded, it would be lost to the system. This was an important distinction for PCI bridges, which had to guess about the data size on reads. If they knew it was safe to speculatively read ahead in the memory space, they could guess a larger number and achieve better efficiency. The distinction is much less interesting for PCIe, since the exact byte count for a transfer is included in the TLP, but maintaining it allows backward compatibility.

Nullified Packet    When a transmitter recognizes that a packet has an error and should not have been sent, the packet can be nullified, meaning it should be discarded and the receiver should behave as if it had never been sent. This problem can arise when using cut-through operation on a Switch.

OBFF  Optimized Buffer Flush and Fill; mechanism that allows the system to
tell devices about the best times to initiate traffic. If devices send
requests during optimal times and not during other times, system power
management will be improved.

Ordered Sets  Groups of Symbols sent as Physical Layer communication for Lane
management. These often consist of just control characters for 8b/10b
encoding. They are created in the Physical Layer of the sender and consumed
in the Physical Layer of the receiver without being visible to the other
layers at all.

PCI  Peripheral Component Interconnect. Designed to replace earlier bus
designs used in PCs, such as ISA.

PCI-X  PCI eXtended. Designed to correct the shortcomings of PCI and enable
higher speeds.

PME  Power Management Event; message from a device indicating that
power-related service is needed.

Poisoned TLP  Packet whose data payload was known to be bad when it was
created. Sending the packet with bad data can be helpful as an aid to
diagnosing the problem and determining a solution for it.

Polarity Inversion  The receiver's signal polarity is permitted to be
connected backwards to support cases when doing so would simplify board
layout. The receiver is required to detect this condition and internally
invert the signal to correct it during Link Training.

Port  Input/output interface to a PCIe Link.

Posted Request  A Request packet for which no completion is expected. There
are only two such requests defined by the spec: Memory Writes and Messages.
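
The posted/non-posted split can be expressed as a simple lookup. The sketch below is only an illustration: the request-type strings and the function name are hypothetical labels chosen for this example, not identifiers from the spec.

```python
# Illustrative lookup: which request types expect a Completion.
# The string labels and function name are hypothetical.

POSTED_REQUESTS = {"Memory Write", "Message"}
NON_POSTED_REQUESTS = {
    "Memory Read",
    "IO Read",
    "IO Write",
    "Configuration Read",
    "Configuration Write",
}


def expects_completion(request_type):
    """Return True for non-posted requests, False for posted ones."""
    if request_type in POSTED_REQUESTS:
        return False
    if request_type in NON_POSTED_REQUESTS:
        return True
    raise ValueError(f"unknown request type: {request_type!r}")
```

Note that IO and Configuration writes fall on the non-posted side even though they are writes; only Memory Writes and Messages complete without a response.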

Prefetchable Memory  Memory that has no side effects as a result of being
read. That property makes it safe to prefetch since, if it's discarded by the
intermediate buffer, it can always be read again later if needed. This was an
important distinction for PCI bridges, which had to guess about the data size
on reads. Prefetchable space allowed speculatively reading more data and gave
a chance for better efficiency. The distinction is much less interesting for
PCIe, since the exact byte count for a transfer is included in the TLP, but
maintaining it allows backward compatibility.

PTLP  Pending TLP; Flow Control credits needed to send the current TLP.

QoS  Quality of Service; the ability of the PCIe topology to assign different
priorities for different packets. This could just mean giving preference to
packets at arbitration points, but in more complex systems, it allows making
bandwidth and latency guarantees for packets.

Requester ID  The configuration address of the Requester for a transaction,
meaning the BDF (Bus, Device, and Function number) that corresponds to it.
This will be used by the Completer as the return address for the resulting
completion packet.
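
A BDF is conventionally carried as a 16-bit value with an 8-bit Bus number, a 5-bit Device number, and a 3-bit Function number. The sketch below packs and unpacks that layout; the helper names are hypothetical, and the code assumes the traditional (non-ARI) interpretation of the ID.

```python
# Pack and unpack a 16-bit Requester ID (Bus[15:8], Device[7:3], Function[2:0]).
# Helper names are hypothetical; ARI's alternative layout is not handled.

def pack_bdf(bus, device, function):
    """Combine Bus, Device, and Function numbers into a 16-bit ID."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus, device, or function out of range")
    return (bus << 8) | (device << 3) | function


def unpack_bdf(requester_id):
    """Split a 16-bit ID back into (bus, device, function)."""
    return (requester_id >> 8) & 0xFF, (requester_id >> 3) & 0x1F, requester_id & 0x7
```

For example, Bus 3, Device 0, Function 1 packs to 0301h, which is the value a Completer would copy into the completion so it can be routed back to the Requester.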

Root Complex  The components that act as the interface between the CPU cores
in the system and the PCIe topology. This can consist of one or more chips
and may be simple or complex. From the PCIe perspective, it serves as the
root of the inverted tree structure that backward compatibility with PCI
demands.

Run Length  The number of consecutive ones or zeros in a row. For 8b/10b
encoding the run length is limited to 5 bits. For 128b/130b, there is no
defined limit, but the modified scrambling scheme it uses is intended to
compensate for that.
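
To make the definition concrete, the following sketch measures the longest run in a bit pattern written as a string of '0' and '1' characters. The function name and the string representation are choices made for this illustration only.

```python
# Measure the longest run of identical bits in a pattern such as "0011111010".
from itertools import groupby


def max_run_length(bits):
    """Return the length of the longest run of consecutive equal characters."""
    return max((len(list(group)) for _, group in groupby(bits)), default=0)
```

A 10-bit pattern such as 0011111010 has a longest run of five ones, which sits right at the 8b/10b limit described above.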

Scrambling  The process of randomizing the output bit stream to avoid
repeated patterns on the Link and thus reduce EMI. Scrambling can be turned
off for Gen1 and Gen2 to allow specifying patterns on the Link, but it cannot
be turned off for Gen3 because it does other work at that speed and the Link
is not expected to be able to work reliably without it.

SOS  Skip Ordered Set; used to compensate for the slight frequency difference
between Tx and Rx.

SSC  Spread Spectrum Clocking. This is a method of reducing EMI in a system
by allowing the clock frequency to vary back and forth across an allowed
range. This spreads the emitted energy across a wider range of frequencies
and thus avoids the problem of having too much EMI energy concentrated in one
particular frequency.

Sticky Bits  Status bits whose value survives a reset. This characteristic is
useful for maintaining status information when errors are detected by a
Function downstream of a Link that is no longer operating correctly. The
failed Link must be reset to gain access to the downstream Functions, and the
error status information in its registers must survive that reset to be
available to software.

Switch  A device containing multiple Downstream Ports and one Upstream Port
that is able to route traffic between its Ports.

Symbol  Encoded unit sent across the Link. For 8b/10b these are the 10-bit
values that result from encoding, while for 128b/130b they're 8-bit values.

Symbol Lock  Finding the Symbol boundaries at the Receiver when using 8b/10b
encoding so as to recognize incoming Symbols and thus the contents of
packets.

Symbol time  The time it takes to send one symbol across the Link: 4 ns for
Gen1, 2 ns for Gen2, and 1 ns for Gen3.

TLP  Transaction Layer Packet. These are created in the Transaction Layer and
passed through the other layers.

Token  Identifier of the type of information being delivered during a Data
Stream when operating at Gen3 speed.

TPH  TLP Processing Hints; these help system routing agents make choices to
improve latency and traffic congestion.

UI  Unit Interval; the time it takes to send one bit across the Link: 0.4 ns
for Gen1, 0.2 ns for Gen2, and 0.125 ns for Gen3.
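
The UI and Symbol time figures in this glossary follow directly from the per-Lane bit rate and the Symbol size. The sketch below reproduces them, assuming rates of 2.5, 5.0, and 8.0 GT/s and Symbol sizes of 10 bits (8b/10b) for Gen1 and Gen2 and 8 bits (128b/130b) for Gen3. The table and function names are hypothetical.

```python
# Derive Unit Interval and Symbol time from the bit rate and Symbol size.
# Each entry is (rate in GT/s, bits per Symbol); names are illustrative.

GENERATIONS = {
    "Gen1": (2.5, 10),
    "Gen2": (5.0, 10),
    "Gen3": (8.0, 8),
}


def unit_interval_ns(gen):
    """Time to send one bit, in nanoseconds (1 / rate in Gbit/s)."""
    rate, _ = GENERATIONS[gen]
    return 1.0 / rate


def symbol_time_ns(gen):
    """Time to send one Symbol, in nanoseconds (bits per Symbol / rate)."""
    rate, bits_per_symbol = GENERATIONS[gen]
    return bits_per_symbol / rate
```

Gen3 reaches a 1 ns Symbol time not only through the higher bit rate but also because each Symbol is 8 bits on the wire rather than 10.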

Uncorrectable Errors  Errors that can't be corrected by hardware and thus
will ordinarily require software attention to resolve. These are divided into
Fatal errors, those that render further Link operation unreliable, and
Non-fatal errors, those that do not affect the Link operation in spite of the
problem that was detected.

USP  Upstream Port, meaning a Port that faces upstream, as for an Endpoint or
a Switch Upstream Port. This distinction is meaningful in the LTSSM because
the Ports have assigned roles during Configuration and Recovery.

Variables  A number of flags are used to communicate events and status
between hardware layers. These are specific to state transitions in the
hardware and are not usually visible to software. Some examples:
- LinkUp: Indication from the Physical Layer to the Data Link Layer that
  training has completed and the Physical Layer is now operational.
- idle_to_rlock_transitioned: This counter tracks the number of times the
  LTSSM has transitioned from Configuration.Idle to the Recovery.RcvrLock
  state. Any time the process of recognizing TS2s to leave Configuration
  doesn't work, the LTSSM transitions to Recovery to take appropriate steps.
  If it still doesn't work after 256 passes through Recovery (counter reaches
  FFh), then it goes back to Detect to start over. It may be that some Lanes
  are not working.

WAKE#  Sideband pin used to signal to the system that the power should be
restored. It's used instead of the Beacon in systems where power conservation
is an important consideration.



