Stochastic Gradient Descent - Mini-batch and more

In the neural networks tutorial, I introduced the gradient descent algorithm that is used to train the weights in an artificial neural network. In reality, for deep learning and big data tasks standard gradient descent is not often used. Rather, a variant of gradient descent called stochastic gradient descent, and in particular its cousin mini-batch gradient descent, is used. That is the focus of this post.
Recall that in gradient descent the weight update takes the form:

$$W = W - \alpha \nabla J(W, b)$$

where $W$ are the weights, $\alpha$ is the learning rate and $\nabla J(W, b)$ is the gradient of the cost function $J(W, b)$ with respect to changes in the weights. More details can be found in the neural networks tutorial, but in that tutorial the cost function $J$ was defined as:
$$J(W, b) = \frac{1}{m} \sum_{z=0}^{m} J(W, b, x^{(z)}, y^{(z)})$$
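To make this concrete, here is a minimal numpy sketch of a single batch gradient descent step. The helper grad_J, which returns the gradient of the cost for one training pair, is a hypothetical stand-in and not part of the tutorial's code:

```python
import numpy as np

def batch_gradient_step(W, b, X, y, alpha, grad_J):
    """One step of batch gradient descent: average the per-sample
    gradients over all m training samples, then update the weights."""
    m = len(y)
    # accumulate the gradient of the cost over every training sample
    total_grad = sum(grad_J(W, b, X[z], y[z]) for z in range(m))
    # update the weights using the mean gradient over the whole batch
    return W - alpha * (total_grad / m)
```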
As can be observed, the overall cost function (and therefore the gradient) depends on the mean cost function calculated on all of the m training samples ($x^{(z)}$ and $y^{(z)}$ refer to each training sample pair). Is this the best way of doing things? Batch gradient descent is good because the training progress is nice and smooth: if you plot the average value of the cost function over the number of iterations / epochs it will look something like this:

(Figure: the average cost function decreasing smoothly over training iterations for batch gradient descent.)
In stochastic gradient descent, by contrast, the weights are updated after every single training sample:

$$W = W - \alpha \nabla J(W, b, x^{(z)}, y^{(z)})$$
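A minimal sketch of one SGD epoch, under the same assumption of a hypothetical per-sample gradient function grad_J:

```python
import numpy as np

def sgd_epoch(W, b, X, y, alpha, grad_J):
    """One epoch of stochastic gradient descent: visit the training
    samples in a random order and update W after each one."""
    for z in np.random.permutation(len(y)):
        # step down the gradient of the cost for a single (x, y) pair
        W = W - alpha * grad_J(W, b, X[z], y[z])
    return W
```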
Inside the training loop, the backpropagated gradients for each sample are applied immediately, rather than being accumulated over the whole training set:

```python
        # triW^(l) = triW^(l) + delta^(l+1) * transpose(h^(l))
        tri_W[l] = np.dot(delta[l+1][:, np.newaxis],
                          np.transpose(h[l][:, np.newaxis]))
        # trib^(l) = trib^(l) + delta^(l+1)
        tri_b[l] = delta[l+1]
    # perform the gradient descent step for the weights in each layer
    for l in range(len(nn_structure) - 1, 0, -1):
        W[l] += -alpha * (tri_W[l] + lamb * W[l])
        b[l] += -alpha * tri_b[l]
```
(Figure: the average cost function over training iterations for stochastic gradient descent compared with batch gradient descent.)
Some interesting things can be noted from the above figure. First, SGD converges much more rapidly than batch gradient descent. In fact, SGD converges on a minimum J after fewer than 20 iterations. Secondly, despite what the average cost function plot suggests, batch gradient descent after 1000 iterations outperforms SGD. On the MNIST test set, the SGD run has an accuracy of 94% compared to a BGD accuracy of 96%. Why is that? Let's zoom into the SGD run to have a closer look:
(Figure: Noisy SGD - a close-up of the SGD cost curve.)
As you can see in the figure above, SGD is noisy. That is because it responds to the effects of each and every sample, and the samples themselves will no doubt contain an element of noisiness. While this can be a benefit in that it can act to kick the gradient descent out of local minimum values of the cost function, it can also hinder it settling down into a good minimum. This is why, eventually, batch gradient descent has outperformed SGD after 1000 iterations. It might be argued that this is a worthwhile pay-off, as the running time of SGD versus BGD is greatly reduced. However, you might ask: is there a middle road, a trade-off?
This is where mini-batch gradient descent comes in. Rather than updating the weights on a single sample or on the full training set, the update is performed on a mini-batch of bs training samples:

$$W = W - \alpha \nabla J(W, b, x^{(z:z+bs)}, y^{(z:z+bs)})$$

where the cost over the mini-batch is the average of the per-sample costs:

$$J(W, b, x^{(z:z+bs)}, y^{(z:z+bs)}) = \frac{1}{bs} \sum_{z=0}^{bs} J(W, b, x^{(z)}, y^{(z)})$$
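One straightforward way to slice the training data into such mini-batches is sketched below; the helper name get_mini_batches and its exact signature are assumptions for illustration:

```python
import numpy as np

def get_mini_batches(X, y, bs):
    """Shuffle the training data and slice it into mini-batches of size bs."""
    idx = np.random.permutation(len(y))
    X_shuf, y_shuf = X[idx], y[idx]
    # step through the shuffled data bs samples at a time
    return [(X_shuf[i:i + bs], y_shuf[i:i + bs])
            for i in range(0, len(y), bs)]
```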
What's the benefit of doing it this way? First, it smooths out some of the noise in SGD, but not all of it, thereby still allowing the kick out of local minimums of the cost function. Second, the mini-batch size is still small, thereby keeping the performance benefits of SGD. Then our new neural network training algorithm looks like this:
```python
        # triW^(l) = triW^(l) + delta^(l+1) * transpose(h^(l))
        tri_W[l] += np.dot(delta[l+1][:, np.newaxis],
                           np.transpose(h[l][:, np.newaxis]))
        # trib^(l) = trib^(l) + delta^(l+1)
        tri_b[l] += delta[l+1]
    # perform the gradient descent step for the weights in each layer
    for l in range(len(nn_structure) - 1, 0, -1):
        W[l] += -alpha * (1.0/bs * tri_W[l] + lamb * W[l])
        b[l] += -alpha * (1.0/bs * tri_b[l])
    # complete the average cost calculation
    avg_cost = 1.0/m * avg_cost
    avg_cost_func.append(avg_cost)
    cnt += 1
return W, b, avg_cost_func
```
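Once training returns, the recorded average cost can be plotted to inspect convergence. In this sketch the call to the training routine is hypothetical: the name train_nn_MBGD, its signature, and the variables X_train and y_train are assumptions for illustration, not taken from the excerpt above:

```python
import matplotlib.pyplot as plt

# hypothetical call to the mini-batch training routine above
W, b, avg_cost_func = train_nn_MBGD(nn_structure, X_train, y_train,
                                    bs=100, iter_num=1000, alpha=0.25)

# plot the average cost per iteration to see how training progressed
plt.plot(avg_cost_func)
plt.xlabel('Iteration number')
plt.ylabel('Average cost J')
plt.show()
```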