Chapter 2
TMS320C6000 Architectural Overview
Learning Objectives
[Figure: block diagram showing External Memory, the Central Processing Unit and Peripherals]
$$Y = \sum_{n=1}^{N} a_n x_n = a_1 x_1 + a_2 x_2 + \dots + a_N x_N$$

Two basic operations are required for this algorithm:
(1) Multiplication
(2) Addition
Therefore two basic instructions are required.
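Multiply (MPY)

$$Y = \sum_{n=1}^{N} a_n x_n = a_1 x_1 + a_2 x_2 + \dots + a_N x_N$$

The multiplication of a1 by x1 is written in assembly as:

    MPY      a1, x1, Y

The multiplication is carried out by the multiplier unit, .M:

    MPY  .M  a1, x1, Y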
Addition (.L)

$$Y = \sum_{n=1}^{40} a_n x_n$$

The product is formed in the .M unit and added to the running total Y in the .L unit:

    MPY  .M  a1, x1, prod
    ADD  .L  Y, prod, Y
Register File - A

[Figure: Register File A, registers A0 to A15, each 32 bits wide, supplying operands to the .M and .L units]

    Register    Contents
    A0          a1
    A1          x1
    A3          prod
    A4          Y

Written with registers, the two instructions become:

    MPY  .M  A0, A1, A3
    ADD  .L  A4, A3, A4
Data loading

The operands have to be brought from data memory into the registers before MPY and ADD can use them. This is the job of the load unit, .D.

Load Unit .D

[Figure: Register File A (A0 to A15, 32 bits wide) with the .M, .L and .D units; the .D unit links the register file to Data Memory]
Load Instruction

[Figure: Data Memory, 16 bits wide, with byte addresses 00000000, 00000002, 00000004, 00000006, 00000008, ... FFFFFFFF]

The syntax of the load instruction is:

    LD  *Rn,Rm

where Rn is a register holding the address of the operand to be loaded, and Rm is the destination register.

Example:

[Figure: the same memory map filled with the data values 0xA, 0xB, 0xC, 0xD, 0x2, 0x1, 0x4, 0x3, 0x6, 0x5, 0x8, 0x7]
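eg.
    MVKL  a, A5
    MVKH  a, A5
    (a is a constant or label)

MVKL moves the lower 16 bits of a (al) into A5. MVKH moves the upper 16 bits of a (ah) into the upper half of A5 and leaves the lower half unchanged.

Always issue MVKL first, then MVKH.

Example 1
    MVKL  0x1234FABC, A5    ; A5 = 0xFFFFFABC (the 16-bit value is sign-extended)
    MVKH  0x1234FABC, A5    ; A5 = 0x1234FABC ; OK

Example 2 (the lower half of A5 holds 0x4321 beforehand)
    MVKH  0x1234FABC, A5    ; A5 = 0x12344321
    MVKL  0x1234FABC, A5    ; A5 = 0xFFFFFABC ; Wrong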
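The two examples can be checked with a small C model. This is only a sketch: `mvkl_model` and `mvkh_model` are hypothetical helper names, and the model assumes that MVKL sign-extends its 16-bit value while MVKH leaves the lower half of the register untouched.

```c
#include <stdint.h>

/* Hypothetical model of MVKL: the low 16 bits of the constant are
   sign-extended into the whole 32-bit register. */
uint32_t mvkl_model(uint32_t constant)
{
    uint32_t low = constant & 0xFFFFu;
    return (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
}

/* Hypothetical model of MVKH: the high 16 bits of the constant replace
   the upper half of the register; the lower half is kept. */
uint32_t mvkh_model(uint32_t reg, uint32_t constant)
{
    return (constant & 0xFFFF0000u) | (reg & 0x0000FFFFu);
}
```

Calling `mvkh_model(mvkl_model(0x1234FABC), 0x1234FABC)` reproduces Example 1. Applying the two in the opposite order reproduces the wrong result of Example 2, because the sign extension done by MVKL overwrites the upper half that MVKH had just set.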
[Figure: Register File A holding a, x, prod and Y, with the .M, .L and .D units and Data Memory]

    MVKL      pt1, A5
    MVKH      pt1, A5
    MVKL      pt2, A6
    MVKH      pt2, A6
    LDH   .D  *A5, A0
    LDH   .D  *A6, A1
    MPY   .M  A0, A1, A3
    ADD   .L  A4, A3, A4

pt1 and pt2 point to some locations in the data memory.
Creating a loop

So far we have implemented the SOP for one tap only, i.e. Y = a1 * x1. So let's create a loop so that we can implement the SOP for N taps.
The loop is built up in steps:

1. Put a label, loop, on the first instruction to be repeated.
2. Add a branch instruction, B, back to the label. The branch executes on the .S unit, as do MVKL and MVKH.
3. Create a loop counter: MVKL .S count, B0.
4. Decrement the loop counter on every pass: SUB .S B0, 1, B0.

[Figure: register file (a, x, prod, Y ... A15, 32 bits wide) with the .S, .M, .L and .D units and Data Memory]

            MVKL  .S  pt1, A5
            MVKH  .S  pt1, A5
            MVKL  .S  pt2, A6
            MVKH  .S  pt2, A6
            MVKL  .S  count, B0
    loop    LDH   .D  *A5, A0
            LDH   .D  *A6, A1
            MPY   .M  A0, A1, A3
            ADD   .L  A4, A3, A4
            SUB   .S  B0, 1, B0
            B     .S  loop
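The branch has to be conditional, so that the loop ends when the counter reaches zero.

Syntax:
    [condition]  Instruction  Label

e.g.
    [B1]   B  loop

The condition can be inverted:
    [!B0]  B  loop    ; branch if B0 = 0
    [B0]   B  loop    ; branch if B0 != 0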
            MVKL  .S1  pt1, A5
            MVKH  .S1  pt1, A5
            MVKL  .S1  pt2, A6
            MVKH  .S1  pt2, A6
            MVKL  .S2  count, B0
    loop    LDH   .D   *A5, A0
            LDH   .D   *A6, A1
            MPY   .M   A0, A1, A3
            ADD   .L   A4, A3, A4
            SUB   .S   B0, 1, B0
     [B0]   B     .S   loop
The branch instruction (B)

Case 1: B .S1 label
    Relative branch.
    Label limited to +/- 2^20 offset.

Case 2: B .S2 register
    Absolute branch.
    Operates on .S2 ONLY!

[Figure: instruction encoding for Case 2, with a 5-bit register code field]
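The +/- 2^20 limit of Case 1 can be expressed as a range check. The sketch below assumes the offset is held in a 21-bit signed field, so the reachable range is -2^20 to 2^20 - 1; `fits_relative_branch` is a hypothetical helper name, and the unit in which the offset is counted is left to the CPU reference guide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the offset fits a 21-bit signed field,
   i.e. lies in the range -2^20 .. 2^20 - 1 (assumed encoding). */
bool fits_relative_branch(int32_t offset)
{
    const int32_t limit = (int32_t)1 << 20;
    return offset >= -limit && offset < limit;
}
```

A label outside this range would need the absolute form of Case 2, with the target address loaded into a register first.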
With this code the pointers A5 and A6 never move, so every pass loads the same pair of values and the loop computes:

    a0*x0 + a0*x0 + a0*x0 + ... + a0*x0

To step through the coefficients and the data, the loop must modify the pointers A5 and A6.
Indexing Pointers

    Syntax        Description       Pointer Modified
    *R            Pointer           No
    *+R[disp]     + Pre-offset      No
    *-R[disp]     - Pre-offset      No
    *++R[disp]    Pre-increment     Yes
    *--R[disp]    Pre-decrement     Yes
    *R++[disp]    Post-increment    Yes
    *R--[disp]    Post-decrement    Yes
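The difference between the modes in the table can be sketched with ordinary C pointer arithmetic. The function names below are hypothetical, and the sketch assumes halfword (16-bit) accesses with disp counted in elements rather than bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* *+R[disp]: read at R + disp; the pointer itself is not modified. */
int16_t load_pre_offset(const int16_t *r, ptrdiff_t disp)
{
    return r[disp];
}

/* *++R[disp]: advance the pointer by disp first, then read. */
int16_t load_pre_increment(const int16_t **r, ptrdiff_t disp)
{
    *r += disp;
    return **r;
}

/* *R++[disp]: read first, then advance the pointer by disp. */
int16_t load_post_increment(const int16_t **r, ptrdiff_t disp)
{
    int16_t value = **r;
    *r += disp;
    return value;
}
```

The pre-decrement and post-decrement forms follow the same pattern with a negative step. Only the increment and decrement forms take the pointer by address, which mirrors the "Pointer Modified" column.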
Post-incrementing the pointers in the two loads moves them on to the next pair of values on every pass:

    loop    LDH   .D   *A5++, A0
            LDH   .D   *A6++, A1
            MPY   .M   A0, A1, A3
            ADD   .L   A4, A3, A4
            SUB   .S   B0, 1, B0
     [B0]   B     .S   loop

When the loop has finished, the result is stored to memory:

            STH   .D   A4, *A7
A4 is used as an accumulator,
so it needs to be reset to zero.
The pointer to the result, A7, is initialised from pt3 in the same way as A5 and A6:

            MVKL  .S1  pt1, A5
            MVKH  .S1  pt1, A5
            MVKL  .S1  pt2, A6
            MVKH  .S1  pt2, A6
            MVKL  .S1  pt3, A7
            MVKH  .S1  pt3, A7
            MVKL  .S2  count, B0
            ZERO  .L   A4
    loop    LDH   .D   *A5++, A0
            LDH   .D   *A6++, A1
            MPY   .M   A0, A1, A3
            ADD   .L   A4, A3, A4
            SUB   .S   B0, 1, B0
     [B0]   B     .S   loop
            STH   .D   A4, *A7
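For reference, the arithmetic that the assembly loop performs can be written in C. This is a minimal sketch, not a translation of the listing: `sum_of_products` is a hypothetical name, the inputs are assumed to be 16-bit values as loaded by LDH, the sum is assumed to fit in 32 bits, and a count of zero simply returns zero.

```c
#include <stdint.h>

/* Reference model of the sum of products. The comments name the
   assembly instruction that plays the same role in the listing. */
int32_t sum_of_products(const int16_t *a, const int16_t *x, uint32_t count)
{
    int32_t acc = 0;                                   /* ZERO  A4          */
    while (count != 0) {                               /* [B0]  B loop      */
        int32_t prod = (int32_t)*a++ * (int32_t)*x++;  /* LDH, LDH, MPY     */
        acc += prod;                                   /* ADD   A4, A3, A4  */
        count--;                                       /* SUB   B0, 1, B0   */
    }
    return acc;                       /* STH writes the low half to *A7 */
}
```

For a = {1, 2, 3} and x = {4, 5, 6} the function returns 1*4 + 2*5 + 3*6 = 32.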
[Figure: Register File A (A0 ... A15) with the .S1, .M1, .L1 and .D1 units, and Register File B (B0, B1, B2, B3, B4 ... B15) with the .S2, .M2, .L2 and .D2 units. All registers are 32 bits wide and both sides connect to Data Memory.]
TMS320C67x Data-Path

Register cross paths

[Figure: the .L1, .M1 and .S1 units take their <src> operands from register file A or, through the 1x cross path, from register file B. The .L2, .M2 and .S2 units use the 2x cross path in the same way. <dst> is on the unit's own side.]

eg:
            ADD  .L1x  A0,A1,B2
            MPY  .M1x  A0,B6,A9
            SUB  .S1x  A8,B2,A8
        ||  ADD  .L1x  A0,B0,A2

NOT VALID! The SUB and the parallel ADD both need the 1x cross path in the same execute packet, and only one cross path per side is available per execute packet. The first ADD is not valid either: its destination, B2, is not on the same side as .L1.

Data cross paths (T1, T2)

[Figure: the .D1 unit addressing memory through *A0 and the .D2 unit through *B0; loaded or stored data travels to or from A5 in register file A, or B5 in register file B (DA2 = T2)]

The address register is on the same side as the .D unit. The data register can be on either side:

            LDW.D1T1  *A0,A5
            STW.D1T1  A5,*A0
            LDW.D1T2  *A0,B5

Two accesses in parallel:

            LDW.D1T1  *A0,A5
        ||  LDW.D2T2  *B0,B5      ; neither crosses

            LDW.D1T2  *A0,B5
        ||  STW.D2T1  A5,*B0      ; both cross

            LDW.D1T2  *A0,B5
        ||  STW.D2T2  B6,*B0      ; Not Allowed! Both use T2

Parallel accesses: both cross or neither cross.

Conditionals don't use cross paths

The condition register can be on either side of the unit without using a cross path:

            ADD  .L1  A2,A0,A4
            LDW  .D2  *B0,B5

'C67x cross-path summary

Data
    Destination register on same side as unit.
    Source registers: up to one cross path per execute packet per side.
    Use x to indicate cross-path.
Address
    Pointer must be on same side as unit.
    Data can be transferred to/from either side.
    Parallel accesses: both cross or neither cross.
Conditionals don't use cross paths.
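The rule for parallel memory accesses can be written as a small check. This is a sketch under stated assumptions: the `mem_access` type and the function names are hypothetical, each access is described only by the side of its .D unit (1 or 2) and the data path it uses (1 for T1, 2 for T2), and the sketch covers the data-path rule only.

```c
#include <stdbool.h>

typedef struct {
    int unit_side;  /* 1 for .D1, 2 for .D2 */
    int data_path;  /* 1 for T1 (register file A), 2 for T2 (register file B) */
} mem_access;

/* An access crosses when its data goes to the other register file. */
static bool crosses(mem_access m)
{
    return m.unit_side != m.data_path;
}

/* Two parallel accesses need different .D units and different data
   paths, which means that both cross or neither crosses. */
bool parallel_accesses_ok(mem_access a, mem_access b)
{
    if (a.unit_side == b.unit_side) return false;  /* one access per .D unit */
    if (a.data_path == b.data_path) return false;  /* T1 and T2 used once each */
    return crosses(a) == crosses(b);
}
```

With this model, LDW.D1T1 with LDW.D2T2 and LDW.D1T2 with STW.D2T1 are accepted, while LDW.D1T2 with STW.D2T2 is rejected because both accesses use T2.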
$$y = \sum_{n=1}^{40} c_n x_n$$

            MVK   .S1  40, A2
    loop:   LDH   .D1  *A5++, A0
            LDH   .D1  *A6++, A1
            MPY   .M1  A0, A1, A3
            ADD   .L1  A3, A4, A4
            SUB   .L1  A2, 1, A2
     [A2]   B     .S1  loop
            STH   .D1  A4, *A7

Note: Assume that A4 was previously cleared and the pointers are initialised.

    Register    Contents
    A0          cn
    A1          xn
    A2          cnt
    A3          prd
    A4          sum
    A5          *c
    A6          *x
    A7          *y

[Figure: Register File A (A0 to A15 or A31) with the .S1, .M1, .L1 and .D1 units, and Register File B (B0 to B15 or B31) with the .S2, .M2, .L2 and .D2 units; registers are 32 bits wide]
[Figure: 'C6000 block diagram. The CPU contains Register Set A and Register Set B and the functional units .D1 .D2, .M1 .M2, .L1 .L2, .S1 .S2. Internal buses connect the CPU to On-chip Memory, External Memory and the Peripherals.]
Instructions by functional unit

.S Unit: ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO
.L Unit: ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO
.M Unit: MPY, MPYH, MPYLH, MPYHL, SMPY, SMPYH
.D Unit: ADD, ADDAB (B/H/W), LDB (B/H/W), MV, NEG, STB (B/H/W), SUB, SUBAB (B/H/W), ZERO
No Unit Used: NOP, IDLE

Instructions by category

Arithmetic: ABS, ADD, ADDA, ADDK, ADD2, MPY, MPYH, NEG, SMPY, SMPYH, SADD, SAT, SSUB, SUB, SUBA, SUBC, SUB2, ZERO
Logical: AND, CMPEQ, CMPGT, CMPLT, NOT, OR, SHL, SHR, SSHL, XOR
Bit Mgmt: CLR, EXT, LMBD, NORM, SET
Data Mgmt: LDB/H/W, MV, MVC, MVK, MVKL, MVKH, MVKLH, STB/H/W
Program Ctrl: B, IDLE, NOP

Note: Refer to the 'C6000 CPU Reference Guide for more details.
Instructions by functional unit, including the floating-point instructions

.S Unit: ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO, ABSSP, ABSDP, CMPGTSP, CMPEQSP, CMPLTSP, CMPGTDP, CMPEQDP, CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP
.L Unit: ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO, ADDSP, ADDDP, SUBSP, SUBDP, INTSP, INTDP, SPINT, DPINT, SPTRUNC, DPTRUNC, DPSP
.M Unit: MPY, MPYH, MPYLH, MPYHL, SMPY, SMPYH, MPYSP, MPYDP, MPYI, MPYID
.D Unit: ADD, ADDAB (B/H/W), LDB (B/H/W), LDDW, MV, NEG, STB (B/H/W), SUB, SUBAB (B/H/W), ZERO
No Unit Required: NOP, IDLE
'C64x additions, by functional unit

.S Unit
    Dual/Quad Arith: SADD2, SADDUS2, SADD4
    Data Pack/Un: PACK2, PACKH2, PACKLH2, PACKHL2, UNPKHU4, UNPKLU4, SWAP2, SPACK2, SPACKU4
    Bitwise Logical: ANDN
    Shifts & Merge: SHR2, SHRU2, SHLMB, SHRMB
    Compares: CMPEQ2, CMPEQ4, CMPGT2, CMPGT4
    Branches/PC: BDEC, BPOS, BNOP, ADDKPC

.D Unit
    Dual Arithmetic: ADD2, SUB2
    Bitwise Logical: AND, ANDN, OR, XOR
    Mem Access: LDDW, LDNW, LDNDW, STDW, STNW, STNDW
    Load Constant: MVK (5-bit)
    Address Calc.: ADDAD

.L Unit
    Dual/Quad Arith: ABS2, ADD2, ADD4, MAX, MIN, SUB2, SUB4, SUBABS4
    Bitwise Logical: ANDN
    Shift & Merge: SHLMB, SHRMB
    Load Constant: MVK (5-bit)
    Data Pack/Un: PACK2, PACKH2, PACKLH2, PACKHL2, PACKH4, PACKL4, UNPKHU4, UNPKLU4, SWAP2/4

.M Unit
    Average: AVG2, AVG4
    Shifts: ROTL, SSHVL, SSHVR
    Multiplies: MPYHI, MPYLI, MPYHIR, MPYLIR, MPY2, SMPY2, DOTP2, DOTPN2, DOTPRSU2, DOTPNRSU2, DOTPU4, DOTPSU4, GMPY4, XPND2/4
    Bit Operations: BITC4, BITR, DEAL, SHFL
    Move: MVD
TI C62x Compiler Performance Release 4.0: Execution Time in µs @ 300 MHz. Versus hand-coded assembly, based on cycle count.

    Algorithm                                   Used In                                   Asm Cycles   Asm Time (µs)   C Cycles (Rel 4.0)   C Time (µs)   % Efficiency vs Hand Coded
    -                                           For motion compensation of image data     348          1.16            402                  1.34          87%
    Codebook Search                             -                                         977          3.26            961                  3.20          100%
    Vector Max (40 element input vector)        Search Algorithms                         61           0.20            59                   0.20          100%
    -                                           -                                         238          0.79            280                  0.93          85%
    -                                           Search Algorithms                         1185         3.95            1318                 4.39          90%
    IIR Filter (16 coefficients)                Filter                                    43           0.14            38                   0.13          100%
    -                                           Filter                                    70           0.23            75                   0.25          93%
    MAC (Two 40 sample vectors)                 -                                         61           0.20            58                   0.19          100%
    Vector Sum (Two 44 sample vectors)          -                                         51           0.17            47                   0.16          100%
    MSE (MSE between two 256 element vectors)   -                                         279          0.93            274                  0.91          100%

Completely natural C code (non 'C6000 specific).
Code available at: http://www.ti.com/sc/c6000compiler
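The time and efficiency columns follow directly from the cycle counts. The sketch below assumes that time in µs is cycles divided by the clock in MHz, and that efficiency is the ratio of assembly cycles to C cycles, capped at 100% (the cap is an assumption, inferred from rows where the C code needs fewer cycles). The function names are hypothetical.

```c
/* Execution time in microseconds for a cycle count at a clock in MHz. */
double cycles_to_us(unsigned cycles, double clock_mhz)
{
    return cycles / clock_mhz;
}

/* Compiler efficiency versus hand-coded assembly, in percent.
   Assumption: values above 100 are reported as 100. */
double efficiency_percent(unsigned asm_cycles, unsigned c_cycles)
{
    double e = 100.0 * asm_cycles / c_cycles;
    return e > 100.0 ? 100.0 : e;
}
```

For the first row, 348 cycles at 300 MHz give 1.16 µs, and 348 / 402 gives about 87%.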
Cycle Count Performance: C62x versus C64x

    Algorithm                    C62x     C64x    Units                      Cycle Improvement C64:C62   720MHz C64x vs 300MHz C62x
    -                            1680     470     cycles/packet              3.5x                        8.4x
    -                            38.25    14*     cycles/output              2.7x                        6.5x
    -                            12.7     6.0     cycles/data                2.1x                        5x
    -                            0.77     0.33    cycles/output/filter tap   2.3x                        5.5x
    Correlation - 3x3 (8-bit)    4.5      1.28    cycles/pixel               3.5x                        8.4x
    -                            9.0      2.1     cycles/pixel               4.3x                        10.3x
    -                            0.953    0.126   cycles/pixel               7.6x                        18.2x

* Includes traceback
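The last column is consistent with the cycle improvement multiplied by the ratio of the two clock rates, 720 / 300 = 2.4. The sketch below makes that arithmetic explicit; the function name is hypothetical, and the relation is inferred from the figures rather than stated on the slide.

```c
/* Overall speedup when the cycle count improves by cycle_improvement
   and the clock rises from old_mhz to new_mhz. */
double speedup_with_clock(double cycle_improvement, double new_mhz, double old_mhz)
{
    return cycle_improvement * (new_mhz / old_mhz);
}
```

For example, a 3.5x cycle improvement at 720 MHz versus 300 MHz gives 3.5 * 2.4 = 8.4x.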
TMS320C6000 Memory

    Devices                   Internal
    C6701, C6205              P = 64 kB, D = 64 kB
    C6202                     P = 256 kB, D = 128 kB
    C6203                     P = 384 kB, D = 512 kB
    C6211, C6711, C6712       L1P = 4 kB, L1D = 4 kB, L2 = 64 kB
    C6713                     L1P = 4 kB, L1D = 4 kB, L2 = 256 kB
    C6411, DM642              L1P = 16 kB, L1D = 16 kB, L2 = 256 kB
    C6414, C6415, C6416       L1P = 16 kB, L1D = 16 kB, L2 = 1 MB

[Table residue: external memory per device on EMIFA and EMIFB. Entries: 52M Bytes (32-bits wide), 128M Bytes (32-bits wide), 256M Bytes (64-bits wide), 64M Bytes (16-bits wide), N/A]

[Table residue: Devices, Internal (L2), External. Devices C6211, C6711, C6713, C6712; internal sizes 64 kB and 256 kB; external 512M (32-bit wide) and 512M (16-bit wide)]

    Devices                   Internal (L2)   External
    C6414, C6415, C6416       1 MB            A: 1GB (64-bit), B: 256kB (16-bit)
    DM642                     256 kB          1GB (64-bit)
    C6411                     256 kB          256MB (32-bit)
Performance

Making use of Parallelism

$$y = \sum_{n=1}^{40} c_n x_n$$

[Figure: register file holding c, x, cnt, prod, y and the pointers *cp, *xp, *yp, with the .S1, .M1, .L1 and .D1 units]

            MVK   .S1  40, cnt
    loop:   LDH   .D1  *cp++, c
            LDH   .D1  *xp++, x
            MPY   .M1  c, x, prod
            ADD   .L1  y, prod, y
            SUB   .L1  cnt, 1, cnt
    [cnt]   B     .S1  loop
            STW   .D   y, *yp

Instructions joined by || execute in parallel:

    L3:         MPY   .M2  B7,A3,B4
    ||          MPYH  .M1  B7,A3,A5
    || [B0]     B     .S1  L3
    ||          LDW   .D1  *A4++,A3
    ||          LDW   .D2  *B6++,B7

       [B0]     B     .S1  L3
    ||          LDW   .D1  *A4++,A3
    ||          LDW   .D2  *B6++,B7

       [B0]     B     .S1  L3
    ||          LDW   .D1  *A4++,A3
    ||          LDW   .D2  *B6++,B7

       [B0]     B     .S1  L3
    ||          LDW   .D1  *A4++,A3
    ||          LDW   .D2  *B6++,B7
Sum-of-Products per iteration

[Figure: both register files (... A15, ... B15) with the .M1, .M2, .L1, .L2, .S1 and .S2 units and the Controller/Decoder]

    ;** --------------------------------------------------*
    LOOP: ; PIPED LOOP KERNEL
                LDDW    .D1    *A4++,A7:A6
    ||          LDDW    .D2    *B4++,B7:B6
    ||          MPYSP   .M1X   A6,B6,A5
    ||          MPYSP   .M2X   A7,B7,B5
    ||          ADDSP   .L1    A5,A8,A8
    ||          ADDSP   .L2    B5,B8,B8
    || [A1]     B       .S2    LOOP
    || [A1]     SUB     .S1    A1,1,A1
    ;** --------------------------------------------------*
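Can the 'C64x do better?

DOTP2 multiplies two pairs of 16-bit values and adds the two products in a single instruction:

    A5 = m1 : m0        (two 16-bit halves)
    B5 = n1 : n0
    DOTP2:  A6 = m1*n1 + m0*n0
    A7 = A7 + A6        (running sum)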
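The DOTP2 operation can be modelled in C. This is a sketch only: `dotp2_model` is a hypothetical name, both operands are assumed to hold two signed 16-bit values packed into one 32-bit word, and the single overflow case (all four halves equal to -32768) is ignored.

```c
#include <stdint.h>

/* Interpret a 16-bit field as a signed value. */
static int32_t half_to_signed(uint32_t half)
{
    return (half & 0x8000u) ? (int32_t)half - 0x10000 : (int32_t)half;
}

/* Model of DOTP2: src1 = m1:m0, src2 = n1:n0, result = m1*n1 + m0*n0. */
int32_t dotp2_model(uint32_t src1, uint32_t src2)
{
    int32_t m1 = half_to_signed(src1 >> 16);
    int32_t m0 = half_to_signed(src1 & 0xFFFFu);
    int32_t n1 = half_to_signed(src2 >> 16);
    int32_t n0 = half_to_signed(src2 & 0xFFFFu);
    return m1 * n1 + m0 * n0;
}
```

With m1:m0 = 3:4 and n1:n0 = 5:6 the model returns 3*5 + 4*6 = 39, which would then be added to the running sum.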
[Figure: non-pipelined schedule. Each iteration runs LDH || LDH, then MPY, then ADD, so 5 iterations take 5 x 3 = 15 cycles.]

[Figure: software-pipelined schedule on the .D1, .D2, .M1, .M2, .L1, .L2, .S1 and .S2 units over cycles 1 to 7. A new pair of loads (ldh, ldh) starts while the mpy and add of earlier iterations are still in progress. Completes in only 7 cycles.]

In the pipelined code the execute packets grow as the pipeline fills:

    c1:         LDH
            ||  LDH

    c2:         MPY
            ||  LDH
            ||  LDH

    c3:         ADD
            ||  MPY
            ||  LDH
            ||  LDH
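The two cycle counts on these slides, 15 and 7, follow from a simple model. The sketch below assumes every stage (load, multiply, add) takes one cycle, as in the simplified picture on the slides; real 'C6000 loads, multiplies and branches have delay slots that this model ignores. The function names are hypothetical.

```c
/* Cycles when each iteration finishes before the next one starts. */
unsigned serial_cycles(unsigned iterations, unsigned stages)
{
    return iterations * stages;
}

/* Cycles when a new iteration starts every cycle: the first result
   appears after 'stages' cycles and one more completes each cycle. */
unsigned pipelined_cycles(unsigned iterations, unsigned stages)
{
    return iterations == 0 ? 0 : iterations + stages - 1;
}
```

For 5 iterations of 3 stages this gives 5 x 3 = 15 cycles without pipelining and 5 + 3 - 1 = 7 cycles with it.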
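DSK

Code Composer Studio

C6416 DSK

[Figure: C6416 DSK utility: verify the DSK hardware and the USB emulation link; use Advanced tests to facilitate debugging; reset DSK hardware]

[Figure: Code Composer Studio components: Edit, Compiler, Asm Opto, Asm, Link and Debug, together with the Standard Runtime Libraries, the DSP/BIOS Config Tool, the DSP/BIOS Libraries and Third Party tools. The .out file runs on a DSK, an EVM, or a DSP Board reached through an XDS.]

Code Generation

[Figure: code generation flow. Editor produces .c / .cpp, .sa and .asm files. Compiler turns .c / .cpp into .asm; Asm Optimizer turns .sa into .asm; Asm turns .asm into .obj; Linker combines .obj files with Link.cmd to produce .out and .map.]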
What is a Project?

A project (.PJT) file contains:

References to files:
    Source
    Libraries
    Linker, etc
Project settings:
    Compiler Options
    DSP/BIOS
    Linking, etc

Project Menu

Access via the pull-down menu, or by right-clicking the .pjt file in the project explorer window.

Hint: Create and open projects from the Project menu, not the File menu.

Build Options...
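Build Options

    -g -q -fr"c:\modem\Debug" -mv6700

Eight categories of compiler options

    Option       Description
    -mv6700      Generate 'C67x code
    -mv6400      Generate 'C64x code
    -fr <dir>    Directory for object/output files
    -fs <dir>    Directory for assembly files
    -q           Quiet mode
    -g           Enable source-level symbolic debugging (debug option)
    -s           Interlist C statements into the assembly listing (debug option)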
Simplifies system design by:
    Automatically includes the appropriate runtime support libraries
    Automatically handles interrupt vectors and system reset
    Handles system memory configuration (builds CMD file)
    Generates 5 files when CDB file is saved: C file, Asm file, 2 header files and a linker command (.cmd) file
More to be discussed later
    Type           Size       Representation
    char           8 bits     ASCII
    short          16 bits    binary, 2's complement
    int            32 bits    binary, 2's complement
    long           40 bits    binary, 2's complement
    float          32 bits    IEEE 32-bit
    double         64 bits    IEEE 64-bit
    long double    64 bits    IEEE 64-bit
    pointers       32 bits    binary
GEL Scripting

GEL: General Extension Language
    C style syntax
    Large number of debugger commands as GEL functions
    Write your own functions
    Create GEL menu items
CCS Scripting

hello_dsk62cfg.tcf, hello.tci

    /* create a new user log, named trace */
    /* initialize its length to 32 (words) */
    trace.bufLen = 32;
    /* generate cfg files (and CDB file) */

Your Application: hello.c

    #include <log.h>
    extern LOG_Obj trace;    /* created in hello.tci */
    int main() {
        LOG_printf(&trace, "Hello World!\n");
        return (0);
    }

A textual way to configure CDB files
Runs on both PC and Unix
Create #include type files (.tci)
More flexible than Config Tool
Chapter 2
TMS320C6000 Architectural Overview
- End -