Unified Scalable Graphics Processing Unit (GPU) optimized for Graphics and Compute
Multiple Engine Architecture with Multi-Task Capabilities
Compute Unit Architecture
Multi-Level R/W Cache Architecture
label0:
s_andn2_b64 exec,s0,exec //Do "else" (s0 & !exec)
s_cbranch_execz label1 //Branch if all lanes fail
v_sub_f32 r2,r1,r0 //result = b - a
v_mul_f32 r2,r2,r1 //result = result * b
label1:
s_mov_b64 exec,s0 //Restore exec mask
Branch is optional: use based on the number of instructions in the conditional section. Executed in the branch unit.
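The listing above manipulates the exec mask to run the "else" half of a conditional on only the lanes that skipped the "if" half. A minimal Python sketch of that behaviour follows. It assumes a 64-lane wavefront modelled with an integer bit mask; the function name run_else_branch and its parameters are illustrative only, not part of any AMD tool or ISA.

```python
# Sketch only: models a 64-lane wavefront with a Python int as the exec mask.
# run_else_branch and its argument names are hypothetical, for illustration.

LANES = 64
FULL_MASK = (1 << LANES) - 1


def run_else_branch(a, b, result, saved_exec, if_exec):
    """Emulate label0..label1: the 'else' half of a predicated if/else.

    a, b, result : per-lane lists of length 64 (r0, r1, r2 in the listing)
    saved_exec   : exec mask saved in s0 before the 'if'
    if_exec      : exec mask that was live while the 'if' half ran
    Returns the restored exec mask.
    """
    # s_andn2_b64 exec,s0,exec -> lanes active on entry that skipped the 'if'
    exec_mask = saved_exec & ~if_exec & FULL_MASK

    # s_cbranch_execz label1 -> skip the body when no lane is active
    if exec_mask != 0:
        for lane in range(LANES):
            if (exec_mask >> lane) & 1:
                tmp = b[lane] - a[lane]        # v_sub_f32 r2,r1,r0
                result[lane] = tmp * b[lane]   # v_mul_f32 r2,r2,r1

    # s_mov_b64 exec,s0 -> restore the mask saved before the branch
    return saved_exec
```

Lanes whose bit is clear in the computed mask keep their previous result, which is how each work-item can follow its own path without a per-lane branch.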
[Figure: Compute Unit block diagram. Instruction Fetch Arbitration feeds four SIMD PC & IB blocks; Instruction Arbitration dispatches to the Branch & MSG Unit (Msg Bus), Export/GDS Decode (Export Bus), Vector Memory Decode, Scalar Decode, Vector Decode and LDS Decode. Scalar Unit: 8 KB Registers, Integer ALU. SIMD0 to SIMD3: 64 KB Registers and an MP Vector ALU each. 64 KB LDS Memory. 16KB R/W data L1, backed by R/W L2.]
I-Cache 32 KB, 64B lines, 4 Way Assoc, Least Recently Used (LRU)
Peak Bandwidth per CU is 32 bytes/cycle (Optimally 8 instructions)
Special Instructions consumed in 1 cycle in the Instruction Buffers
s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio, s_setvskip, s_halt
16 Work Group Barrier resources supported per CU
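The instruction cache figures above imply a specific geometry, which the short calculation below works out. The cache size, line size, associativity and fetch bandwidth come from the slide; the 4-byte instruction word is an assumption inferred from "32 bytes/cycle (Optimally 8 instructions)".

```python
# Values quoted on the slide.
ICACHE_BYTES = 32 * 1024
LINE_BYTES = 64
WAYS = 4
FETCH_BYTES_PER_CYCLE = 32

# Assumption: "optimally 8 instructions" refers to 4-byte instruction words.
INSTR_BYTES = 4

cache_lines = ICACHE_BYTES // LINE_BYTES                  # total cache lines
cache_sets = cache_lines // WAYS                          # sets in a 4-way cache
instrs_per_cycle = FETCH_BYTES_PER_CYCLE // INSTR_BYTES   # peak fetch rate
instrs_per_line = LINE_BYTES // INSTR_BYTES               # instructions per line

print(cache_lines, cache_sets, instrs_per_cycle, instrs_per_line)
```

Longer instruction encodings would lower the instructions-per-cycle figure, which is presumably why the slide says "optimally".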
A Kernel freely mixes instruction types (Simplistic Programming Model, no weird rules)
Scalar/Scalar Memory, Vector, Vector Memory, Shared Memory, etc.
Use of predication & control flow gives any single work-item a unique execution path
Every clock cycle, waves on one SIMD are considered for instruction issue.
At most, one instruction from each category may be issued.
At most one instruction per wave may be issued.
Up to a maximum of 5 instructions can issue per cycle, not including internal instructions.
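The issue rules above can be sketched as a small arbiter. This is only an illustration under stated assumptions: the category names, the use of list order as priority, and the round-robin choice of SIMD are all hypothetical, since the slides do not specify them.

```python
# Sketch of the per-cycle issue rules. Category names, the priority order
# and the round-robin SIMD selection are assumptions for illustration.

NUM_SIMDS = 4
MAX_ISSUE_PER_CYCLE = 5


def simd_for_cycle(cycle):
    """Assumed round-robin: one SIMD's waves are considered each cycle."""
    return cycle % NUM_SIMDS


def select_issue(ready):
    """Pick instructions to issue from the SIMD under consideration.

    ready: list of (wave_id, category) pairs, one candidate instruction per
           entry, in assumed priority order.
    Returns the list of (wave_id, category) pairs issued this cycle.
    """
    issued = []
    used_categories = set()
    used_waves = set()
    for wave_id, category in ready:
        if len(issued) == MAX_ISSUE_PER_CYCLE:
            break                      # at most 5 instructions per cycle
        if category in used_categories:
            continue                   # at most one instruction per category
        if wave_id in used_waves:
            continue                   # at most one instruction per wave
        used_categories.add(category)
        used_waves.add(wave_id)
        issued.append((wave_id, category))
    return issued


ready = [(0, "vector_alu"), (1, "vector_alu"), (2, "scalar"),
         (3, "vector_mem"), (4, "lds"), (5, "export"), (6, "branch")]
print(select_issue(ready))
```

In this example wave 1 is skipped because wave 0 already took the vector ALU slot, and wave 6 is left out because five instructions have already been chosen.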
Integer ALU
2 Operand
Immediate Constant
1 Operand
Compare Ops
Program Flow Control
[Figure: vector unit detail. Vector Decode; 2 KB Registers; 64 KB Registers; 16 SP MP Vector ALU; 64 KB LDS Memory; 3 Operand.]
Encoding for Local Data Share (LDS) indexed ops, and Global Data Share (GDS) ops
Fully decoupled from ALU instructions
Instructions
Load, Store, Atomic (including fpmin, fpmax, fpcmpswp)
Ops with 2 source operands from shared memory, for accelerated reduction ops
Special lane swizzle ops without memory usage
GDS Global Wave Sync, Ordered Count Ops executed by export unit
[Figure: cache hierarchy. Compute Unit with 16KB Vector DL1; 16 KB Scalar DL1; 32 KB Instruction L1; Command Processors; an XBAR connecting them to 64-128KB R/W L2 per MC Channel.]
No more clauses
Control flow more directly programmed
Scalar engine
Lower latency wavefront issue from distributed sequencing of wavefronts
Improved performance in previously clause bound cases
Lower power handling of control flow as control is closer
Improved debug support
Added HW functionality to improve debug support
[Figure: Compute Unit block diagram, with a 16KB Scalar Read Only L1 and a 32KB Instruction L1 each shared by 4 CUs, and R/W L1 and L2.]
Media ops
Floating point atomics (min, max, cmpxchg)
label0:
s_andn2_b64 exec,s0,exec //Do else(s0 & !exec)
s_cbranch_execz label1 //Branch if all lanes fail
v_sub_f32 r2,r1,r0 //result = b - a
v_mul_f32 r2,r2,r1 //result = result * b
label1:
s_mov_b64 exec,s0 //Restore exec mask
[Figure: Scalable Graphics Engine with Work Distributor; CS Pipes 0 to n; Primitive Pipes 0 to n (HOS, Tessellate, Geometry); Pixel Pipes 0 to n (Scan Conversion, RB); Unified Shader Core; MC HUB & MEM; R/W L2.]
Asynchronous Compute Engine (ACE)
Command processor
Hardware Command Queue Fetcher
Can synchronize with other command queues and engines
Independent and concurrent grid/workgroup dispatch
Queue priority controls
Device resource allocation and control
Compute Generated Task Graph Processing
User task queues
HW scheduling of work and resources
Task queue context switching
Multiple Concurrent Contexts
Multiple ACE paths
Lower overhead dispatch and submission
Improved resource management and allocation
Out-of-order completion to free up resources as early as possible
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this
presentation are for informational purposes only and may be trademarks of their respective owners.
PC: Program Counter
EXEC [63:0]: Execute mask, one bit per vector lane
VGPR (0-255): 32 bit, General Purpose Vector Register
SGPR (0-103): 32 bit, General Purpose Scalar Register
LDS: Local data share space. Up to 32KB
Status Register: Misc status bits (EXECZ, VCCZ, etc)
VCC [63:0]: Vector Condition Code, one bit per vector lane
SCC: Scalar condition code resulting from scalar-ALU op.
VMCNT, LGKMCNT, EXPCNT: Dependency counters
ALLOC: SGPR, VGPR, LDS Base and size
M0: Memory Descriptor register (Use shared memory access)
TIME: 64 bit time
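The register state above, combined with the 64 KB of vector registers per SIMD shown in the Compute Unit diagram, bounds how many wavefronts can be resident at once. The calculation below is a rough sketch: it assumes a 64-lane wavefront (from EXEC [63:0]) and considers only the VGPR budget, ignoring any other hardware limits (SGPRs, LDS, wave slots) that the slides do not quantify. The function name is illustrative.

```python
# Sketch: wavefronts per SIMD as limited by vector registers alone.
# Other limits (SGPRs, LDS, hardware wave slots) are deliberately ignored.

SIMD_VGPR_BYTES = 64 * 1024   # 64 KB of registers per SIMD (from the diagram)
LANES_PER_WAVE = 64           # EXEC [63:0]: one bit per vector lane
VGPR_BYTES_PER_LANE = 4       # VGPRs are 32 bit

# One VGPR for one wavefront occupies 64 lanes x 4 bytes.
BYTES_PER_WAVE_VGPR = LANES_PER_WAVE * VGPR_BYTES_PER_LANE
VGPR_SLOTS_PER_SIMD = SIMD_VGPR_BYTES // BYTES_PER_WAVE_VGPR


def waves_by_vgpr_budget(vgprs_per_wave):
    """Wavefronts that fit in one SIMD's register file for a given VGPR count."""
    if not 1 <= vgprs_per_wave <= 256:
        raise ValueError("VGPR indices run from 0 to 255")
    return VGPR_SLOTS_PER_SIMD // vgprs_per_wave


print(VGPR_SLOTS_PER_SIMD, waves_by_vgpr_budget(64), waves_by_vgpr_budget(256))
```

A kernel that uses every addressable VGPR (0-255) therefore fills the register file of a SIMD with a single wavefront, which matches the 0-255 index range in the table.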
[Figure: Local Data Share (LDS), 64 KB, 32 banks (B0 to B31) with Integer Atomic Units, Conflict Detection & Scheduling, and a Read Data Cross Bar.]
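Because the LDS is split into 32 banks with conflict detection, the access pattern of a group of lanes decides how many passes a request needs. The sketch below estimates that. It assumes 4-byte-wide banks interleaved on consecutive dwords and that lanes reading the same dword do not conflict; neither detail is stated on the slide, and the function name is illustrative.

```python
from collections import defaultdict

NUM_BANKS = 32
BANK_WIDTH_BYTES = 4  # assumption: banks are one dword wide


def conflict_degree(byte_addresses):
    """Worst-case number of distinct dwords that map to a single bank.

    1 means conflict-free; N means the busiest bank must serve N different
    dwords. Lanes hitting the same dword are assumed not to conflict.
    """
    banks = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH_BYTES
        banks[word % NUM_BANKS].add(word)
    return max((len(words) for words in banks.values()), default=0)


# 32 lanes, consecutive dwords: every lane hits a different bank.
print(conflict_degree([4 * lane for lane in range(32)]))
# 32 lanes, 128-byte stride: every lane hits bank 0.
print(conflict_degree([128 * lane for lane in range(32)]))
```

Under these assumptions, padding an array so that its row stride is not a multiple of 128 bytes spreads the lanes across banks.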