
In the LSTM cell box, we use the suffixes $\iota$, $\phi$ and $\omega$ to refer to the input gate, forget gate and output gate weights. $c$ refers to an element of the set of cells $C$. $w_{ij}$ is the weight from unit $j$ to unit $i$. $f$ is the squashing function of the gates, and $g$ and $h$ are respectively the cell input and output squashing functions.
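The derivation keeps $f$, $g$ and $h$ abstract. For the code sketches that follow, assume the conventional choices (an assumption, the text does not fix them): $f$ is the logistic sigmoid and $g$, $h$ are $\tanh$.

```python
import numpy as np

# The text leaves f, g and h abstract; these are conventional choices
# (an assumption), used in the sketches below.
def f(x):                 # gate squashing: logistic sigmoid
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):           # f'(x) = f(x)(1 - f(x))
    s = f(x)
    return s * (1.0 - s)

g = h = np.tanh           # cell input/output squashing

def g_prime(x):           # g'(x) = 1 - tanh(x)^2 (h' is the same)
    return 1.0 - np.tanh(x) ** 2

h_prime = g_prime
```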
Cell Input:
$$\forall c \in C, \quad x_c = \sum_{j \in N} w_{cj}\, y_j(\tau-1)$$
where $y_j(\tau-1)$ is the activation of network input $x_j(\tau-1)$ to unit $j$ at time $\tau-1$.
Input Gate:

$$x_{in} = \sum_{j \in N} w_{\iota j}\, y_j(\tau-1) + \sum_{c \in C} w_{\iota c}\, s_c(\tau-1)$$
$$y_{in} = f(x_{in})$$
Forget Gate:
$$x_{for} = \sum_{j \in N} w_{\phi j}\, y_j(\tau-1) + \sum_{c \in C} w_{\phi c}\, s_c(\tau-1)$$
$$y_{for} = f(x_{for})$$
The cell value, which plays the role of the hidden node in an RNN:
$$s_c = y_{for}\, s_c(\tau-1) + y_{in}\, g(x_c)$$

Output Gate:
$$x_{out} = \sum_{j \in N} w_{\omega j}\, y_j(\tau-1) + \sum_{c \in C} w_{\omega c}\, s_c(\tau)$$
$$y_{out} = f(x_{out})$$
Cell Output:
$$\forall c \in C, \quad y_c = y_{out}\, h(s_c)$$
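A minimal NumPy sketch of one time step of the forward pass above, for a single LSTM block whose one input, forget and output gate are shared across its cells; all weights and the previous activations $y(\tau-1)$, $s(\tau-1)$ are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 4, 3                                   # |N| source units, |C| cells per block
f = lambda x: 1.0 / (1.0 + np.exp(-x))        # gate squashing (assumed sigmoid)
g = h = np.tanh                               # cell squashing (assumed tanh)

# Hypothetical weights and previous-step activations.
W_c = rng.normal(size=(C, N))                 # w_cj : cell input weights
w_iota_j = rng.normal(size=N); w_iota_c = rng.normal(size=C)    # input gate
w_phi_j = rng.normal(size=N); w_phi_c = rng.normal(size=C)      # forget gate
w_omega_j = rng.normal(size=N); w_omega_c = rng.normal(size=C)  # output gate
y_prev = rng.normal(size=N)                   # y_j(tau-1)
s_prev = rng.normal(size=C)                   # s_c(tau-1)

x_c = W_c @ y_prev                                  # cell input
y_in = f(w_iota_j @ y_prev + w_iota_c @ s_prev)     # input gate (scalar per block)
y_for = f(w_phi_j @ y_prev + w_phi_c @ s_prev)      # forget gate
s = y_for * s_prev + y_in * g(x_c)                  # cell state update
y_out = f(w_omega_j @ y_prev + w_omega_c @ s)       # output gate: note it sees s(tau)
y_c = y_out * h(s)                                  # cell outputs
```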
Output layer:
$$x_k = \sum_{c \in C} w_{kc}\, y_c, \qquad y_k = \operatorname{softmax}(x)_k$$
The error function is the cross-entropy for a softmax output layer:
$$E_{total} = \sum_{\tau=0}^{T-1} E(\tau), \qquad E(\tau) = -\sum_{k=1}^{n} \sum_{i=1}^{C} t_{ki} \log(y_{ki})$$
where $C$ is the number of classes and $n$ is the sample size.

Using BPTT, propagate the output errors backwards through the net (error from answer). Define
$$\delta_k(\tau) = \frac{\partial E(\tau)}{\partial x_k} = y_k(\tau) - t_k(\tau), \qquad k \in \text{output units}$$
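The identity $\delta_k(\tau) = y_k(\tau) - t_k(\tau)$ is the standard softmax cross-entropy gradient; a quick finite-difference check on hypothetical values (a sketch, not from the source):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([0.2, -1.0, 0.5])     # hypothetical pre-activations x_k
t = np.array([0.0, 1.0, 0.0])      # one-hot target t_k
E = lambda x: -np.sum(t * np.log(softmax(x)))   # cross-entropy E(tau)

delta = softmax(x) - t             # claimed: delta_k = y_k - t_k
eps = 1e-6
numeric = np.array([(E(x + eps * np.eye(3)[k]) - E(x - eps * np.eye(3)[k])) / (2 * eps)
                    for k in range(3)])
print(np.allclose(delta, numeric))   # True: dE/dx_k = y_k - t_k
```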
For each LSTM block the deltas are calculated as follows:
Cell Output:
$$\forall c \in C, \ \text{define} \quad \epsilon_c(\tau) = \frac{\partial E(\tau)}{\partial y_c} = \sum_{j \in N} w_{jc}\, \delta_j(\tau)$$
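In code, $\epsilon_c(\tau)$ is just the transposed downstream weight matrix applied to the downstream deltas; `W` and `delta_j` are hypothetical placeholders here:

```python
import numpy as np

rng = np.random.default_rng(1)
K, C = 5, 3
W = rng.normal(size=(K, C))        # W[j, c] = w_jc, cell c -> downstream unit j
delta_j = rng.normal(size=K)       # delta_j(tau) from the layer above

eps_c = W.T @ delta_j              # epsilon_c(tau) = sum_j w_jc * delta_j(tau)
```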

Output Gate:
$$\delta_{out}(\tau) = \frac{\partial E(\tau)}{\partial x_{out}} = \sum_{c \in C} \frac{\partial E(\tau)}{\partial y_c} \frac{\partial y_c}{\partial y_{out}} \frac{\partial y_{out}}{\partial x_{out}} = f'(x_{out}) \sum_{c \in C} \epsilon_c(\tau)\, h(s_c)$$
where
$$\frac{\partial y_{out}}{\partial x_{out}} = f'(x_{out}), \qquad \frac{\partial y_c}{\partial y_{out}} = h(s_c)$$
$$\Rightarrow \nabla_{x_{out}(\tau)} E = f'(x_{out}) \left\langle \nabla_{y(\tau)} E,\; h(s_c) \right\rangle$$
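A sketch of this delta, treating the output gate as a scalar per block as in the equations (all inputs hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
C = 3
x_out = rng.normal()               # output-gate pre-activation (scalar per block)
s = rng.normal(size=C)             # cell states s_c
eps_c = rng.normal(size=C)         # epsilon_c(tau) from the previous step
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
f_prime = lambda x: sigmoid(x) * (1.0 - sigmoid(x))
h = np.tanh

delta_out = f_prime(x_out) * np.sum(eps_c * h(s))   # sum runs over the block's cells
```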

States:
$$\frac{\partial E(\tau)}{\partial s_c} = \frac{\partial E(\tau)}{\partial y_c} \frac{\partial y_c}{\partial s_c} + \frac{\partial E(\tau)}{\partial s_c(\tau+1)} \frac{\partial s_c(\tau+1)}{\partial s_c} + \frac{\partial E(\tau)}{\partial x_{in}(\tau+1)} \frac{\partial x_{in}(\tau+1)}{\partial s_c} + \frac{\partial E(\tau)}{\partial x_{for}(\tau+1)} \frac{\partial x_{for}(\tau+1)}{\partial s_c} + \frac{\partial E(\tau)}{\partial x_{out}} \frac{\partial x_{out}}{\partial s_c}$$
$$= \epsilon_c\, y_{out}\, h'(s_c) + \frac{\partial E(\tau+1)}{\partial s_c}\, y_{for}(\tau+1) + \delta_{in}(\tau+1)\, w_{\iota c} + \delta_{for}(\tau+1)\, w_{\phi c} + \delta_{out}\, w_{\omega c}$$
$$\Rightarrow \nabla_{s_c(\tau)} E = \nabla_{y_c(\tau)} E\; y_{out}\, h'(s_c) + \nabla_{s_c(\tau+1)} E\; y_{for}(\tau+1) + \nabla_{x_{in}(\tau+1)} E\; w_{\iota c} + \nabla_{x_{for}(\tau+1)} E\; w_{\phi c} + \nabla_{x_{out}(\tau)} E\; w_{\omega c}$$
This result differs from the plain RNN: the recurrent term is multiplied by the forget gate value, which controls whether the hidden layer remembers or forgets.
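A sketch of this recursion run backwards in time. The gate deltas at $\tau+1$ are placeholders here (in a full backward pass they are computed from `dE_ds[tau + 1]` first); note the `y_for[tau + 1]` factor gating the recurrent term, which a plain RNN lacks:

```python
import numpy as np

rng = np.random.default_rng(3)
T, C = 6, 3
eps = rng.normal(size=(T, C))        # epsilon_c(tau), from the cell-output step
y_out = rng.normal(size=T)           # output gate values per step
s = rng.normal(size=(T, C))          # cell states
y_for = 1.0 / (1.0 + np.exp(-rng.normal(size=T)))   # forget gate values
# Gate deltas per step; placeholders here, but in a full backward pass
# delta_in/for/out at tau+1 are computed from dE_ds[tau + 1] first.
delta_in = rng.normal(size=T)
delta_for = rng.normal(size=T)
delta_out = rng.normal(size=T)
w_iota_c = rng.normal(size=C)        # peephole weights w_{iota c}
w_phi_c = rng.normal(size=C)
w_omega_c = rng.normal(size=C)
h_prime = lambda x: 1.0 - np.tanh(x) ** 2

dE_ds = np.zeros((T, C))
for tau in reversed(range(T)):
    # terms local to time tau
    dE_ds[tau] = eps[tau] * y_out[tau] * h_prime(s[tau]) + delta_out[tau] * w_omega_c
    if tau + 1 < T:
        # recurrent terms, gated by the forget gate at tau+1
        dE_ds[tau] += (dE_ds[tau + 1] * y_for[tau + 1]
                       + delta_in[tau + 1] * w_iota_c
                       + delta_for[tau + 1] * w_phi_c)
```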
Cells:
$$\forall c \in C, \quad \delta_c(\tau) = \frac{\partial E(\tau)}{\partial x_c} = \frac{\partial E(\tau)}{\partial s_c} \frac{\partial s_c}{\partial x_c} = y_{in}\, g'(x_c)\, \frac{\partial E(\tau)}{\partial s_c}$$
$$\Rightarrow \nabla_{x_c(\tau)} E = y_{in}\, g'(x_c)\, \nabla_{s_c(\tau)} E$$
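The cell delta is a purely elementwise product over the cells; a sketch with hypothetical inputs:

```python
import numpy as np

rng = np.random.default_rng(4)
C = 3
y_in = 0.7                         # input gate value (hypothetical)
x_c = rng.normal(size=C)           # cell inputs x_c
dE_ds = rng.normal(size=C)         # state errors from the recursion above
g_prime = lambda x: 1.0 - np.tanh(x) ** 2

delta_c = y_in * g_prime(x_c) * dE_ds   # elementwise over the cells
```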

Forget Gate:
$$\delta_{for}(\tau) = \frac{\partial E(\tau)}{\partial x_{for}} = \sum_{c \in C} \frac{\partial E(\tau)}{\partial s_c} \frac{\partial s_c}{\partial y_{for}} \frac{\partial y_{for}}{\partial x_{for}} = f'(x_{for}) \sum_{c \in C} \frac{\partial E(\tau)}{\partial s_c}\, s_c(\tau-1)$$
where
$$\frac{\partial s_c}{\partial y_{for}} = s_c(\tau-1), \qquad \frac{\partial y_{for}}{\partial x_{for}} = f'(x_{for})$$
$$\Rightarrow \nabla_{x_{for}(\tau)} E = f'(x_{for}) \sum_{c \in C} \frac{\partial E(\tau)}{\partial s_c}\, s_c(\tau-1)$$
Input Gate:
$$\delta_{in}(\tau) = \frac{\partial E(\tau)}{\partial x_{in}} = \sum_{c \in C} \frac{\partial E(\tau)}{\partial s_c} \frac{\partial s_c}{\partial y_{in}} \frac{\partial y_{in}}{\partial x_{in}} = f'(x_{in}) \sum_{c \in C} \frac{\partial E(\tau)}{\partial s_c}\, g(x_c)$$
where
$$\frac{\partial s_c}{\partial y_{in}} = g(x_c), \qquad \frac{\partial y_{in}}{\partial x_{in}} = f'(x_{in})$$
$$\Rightarrow \nabla_{x_{in}(\tau)} E = f'(x_{in}) \sum_{c \in C} \frac{\partial E(\tau)}{\partial s_c}\, g(x_c)$$
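The two gate deltas above have the same shape and differ only in what multiplies the state error: $s_c(\tau-1)$ for the forget gate, $g(x_c)$ for the input gate. A combined sketch (hypothetical inputs):

```python
import numpy as np

rng = np.random.default_rng(5)
C = 3
x_for, x_in = rng.normal(), rng.normal()   # gate pre-activations (scalars per block)
s_prev = rng.normal(size=C)                # s_c(tau-1)
x_c = rng.normal(size=C)                   # cell inputs
dE_ds = rng.normal(size=C)                 # dE(tau)/ds_c
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
f_prime = lambda x: sigmoid(x) * (1.0 - sigmoid(x))
g = np.tanh

delta_for = f_prime(x_for) * np.sum(dE_ds * s_prev)   # pairs with s_c(tau-1)
delta_in = f_prime(x_in) * np.sum(dE_ds * g(x_c))     # pairs with g(x_c)
```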

Now using the $\delta$'s to get the partial derivatives of the cumulative sequence error
$$E_{total}(S) = \sum_{\tau=0}^{T-1} E(\tau), \qquad \text{define} \quad \nabla_w(S) = \frac{\partial E_{total}(S)}{\partial w}$$
Finally, the gradients for the weights:
$$\frac{\partial E_{total}}{\partial w_{\star j}} = \sum_{t=0}^{T-1} \left\langle \delta_{\star}(t),\; y_j(t-1) \right\rangle$$
where $\star$ can be any of $\{\iota,\ \phi,\ \omega,\ c\}$,
$$\frac{\partial E_{total}}{\partial w_{\star c}} = \sum_{t=0}^{T-1} \left\langle \delta_{\star}(t),\; s_c(t-1) \right\rangle$$
where $\star$ can be any of $\{\iota,\ \phi\}$, and
$$\frac{\partial E_{total}}{\partial w_{\omega c}} = \sum_{t=0}^{T-1} \left\langle \delta_{\omega}(t),\; s_c(t) \right\rangle$$
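A sketch of the accumulation for three representative weight groups, one per formula above; the deltas and activations are hypothetical placeholders, and the sum starts at $t = 1$ here so that $y(t-1)$ and $s(t-1)$ exist:

```python
import numpy as np

rng = np.random.default_rng(6)
T, N, C = 6, 4, 3
delta_in = rng.normal(size=T)        # delta_iota(t), scalar per block
delta_c = rng.normal(size=(T, C))    # cell deltas delta_c(t)
delta_out = rng.normal(size=T)       # delta_omega(t)
y = rng.normal(size=(T, N))          # y_j(t)
s = rng.normal(size=(T, C))          # s_c(t)

# Gradients accumulate over the sequence; t starts at 1 so that the
# previous-step activations y[t - 1] and s[t - 1] exist.
dE_dW_iota_j = sum(delta_in[t] * y[t - 1] for t in range(1, T))       # shape (N,)
dE_dW_cj = sum(np.outer(delta_c[t], y[t - 1]) for t in range(1, T))   # shape (C, N)
dE_dw_omega_c = sum(delta_out[t] * s[t] for t in range(1, T))         # uses s_c(t)
```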
