Stochastic Gradient Descent - Mini-batch and more

March 30, 2017 | Andy | Deep learning, Neural networks, Optimisation
In the neural network tutorial, I introduced the gradient descent algorithm which is used to train the weights in an artificial neural network. In reality, for deep learning and big data tasks, standard gradient descent is not often used. Rather, a variant of gradient descent called stochastic gradient descent, and in particular its cousin mini-batch gradient descent, is used. That is the focus of this post.

Gradient descent review


The gradient descent optimisation algorithm aims to minimise some cost/loss function based on that function's gradient. Successive iterations are employed to progressively approach either a local or global minimum of the cost function. The figure below shows an example of gradient descent operating in a single dimension:

Simple, one-dimensional gradient descent
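To make this concrete, here is a minimal numerical sketch of the one-dimensional case pictured above, using the toy cost $J(w) = w^2$ (an assumed example for illustration, not a cost function from this post):

def gradient_descent_1d(w, alpha=0.1, steps=20):
    for _ in range(steps):
        grad = 2 * w          # dJ/dw for J(w) = w**2
        w = w - alpha * grad  # step in the direction of steepest descent
    return w

print(gradient_descent_1d(5.0))  # moves steadily towards the minimum at w = 0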

When training weights in a neural network, normal batch gradient descent usually takes the mean squared error of all the training samples when it is updating the weights of the network:

$$W = W - \alpha \nabla J(W, b)$$

where $W$ are the weights, $\alpha$ is the learning rate and $\nabla J$ is the gradient of the cost function $J(W, b)$ with respect to changes in the weights. More details can be found in the neural networks tutorial, but in that tutorial the cost function $J$ was defined as:

$$J(W, b) = \frac{1}{m} \sum_{z=0}^{m} J(W, b, x^{(z)}, y^{(z)})$$
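As a toy illustration of this formula (and not the network code used later in this post), a single batch gradient descent update for a simple linear model with a mean squared error cost might look like the sketch below. The point to notice is that every update uses all $m$ training samples:

import numpy as np

def batch_update(W, b, X, y, alpha=0.1):
    # one update of W and b using the mean gradient over all m samples,
    # for the cost J(W, b) = 1/(2m) * sum((X.dot(W) + b - y)**2)
    m = len(y)
    err = X.dot(W) + b - y
    grad_W = X.T.dot(err) / m
    grad_b = np.mean(err)
    return W - alpha * grad_W, b - alpha * grad_b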

As can be observed, the overall cost function (and therefore the gradient) depends on the mean cost function calculated on all of the $m$ training samples ($x^{(z)}$ and $y^{(z)}$ refer to each training sample pair). Is this the best way of doing things? Batch gradient descent is good because the training progress is nice and smooth: if you plot the average value of the cost function over the number of iterations / epochs it will look something like this:

Example batch gradient descent progress

As you can see, the line is mostly smooth and predictable.
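A plot like the one above can be produced from the avg_cost_func list returned by the training functions in this post; a minimal sketch, assuming matplotlib is installed (the placeholder values below simply stand in for a real training run):

import matplotlib.pyplot as plt

avg_cost_func = [1.0 / (i + 1) for i in range(1000)]  # placeholder for a real run
plt.plot(avg_cost_func)
plt.xlabel('Iteration number')
plt.ylabel('Average cost J')
plt.show()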


However, a problem with batch gradient descent in neural networks is that for every gradient descent update of the weights, you have to cycle through every training sample. For big data sets (i.e. > 50,000 training samples), this can be time prohibitive. Batch gradient descent also has the following disadvantages:

- It requires the loading of the whole dataset into memory, which can be problematic for big data sets.
- Batch gradient descent can't be efficiently parallelised (compared to the techniques about to be presented); this is because each update in the weight parameters requires a mean calculation of the cost function over all the training samples.
- The smooth nature of the reducing cost function tends to ensure that the neural network training will get stuck in local minimums, which makes it less likely that a global minimum of the cost function will be found.

Stochastic gradient descent is an algorithm that attempts to address some of these issues.

Stochastic gradient descent


Stochastic gradient descent updates the weight parameters after evaluating the cost function on each sample. That is, rather than summing up the cost function results for all the samples and then taking the mean, stochastic gradient descent (or SGD) updates the weights after every training sample is analysed. Therefore, the updates look like this:

$$W = W - \alpha \nabla J(W, b, x^{(z)}, y^{(z)})$$

Notice that an update to the weights (and bias) is performed after every sample $z$ in $m$. This is easily implemented by a minor variation of the batch gradient descent code in Python, by simply shifting the update component into the sample loop (the original train_nn function can be found in the neural networks tutorial and here):

def train_nn_SGD(nn_structure, X, y, iter_num=3000, alpha=0.25, lamb=0.000):
    W, b = setup_and_init_weights(nn_structure)
    cnt = 0
    m = len(y)
    avg_cost_func = []
    print('Starting gradient descent for {} iterations'.format(iter_num))
    while cnt < iter_num:
        if cnt%50 == 0:
            print('Iteration {} of {}'.format(cnt, iter_num))
        tri_W, tri_b = init_tri_values(nn_structure)
        avg_cost = 0
        for i in range(len(y)):
            delta = {}
            # perform the feed forward pass and return the stored h and z values,
            # to be used in the gradient descent step
            h, z = feed_forward(X[i, :], W, b)
            # loop from nl-1 to 1 backpropagating the errors
            for l in range(len(nn_structure), 0, -1):
                if l == len(nn_structure):
                    delta[l] = calculate_out_layer_delta(y[i,:], h[l], z[l])
                    avg_cost += np.linalg.norm((y[i,:]-h[l]))
                else:
                    if l > 1:
                        delta[l] = calculate_hidden_delta(delta[l+1], W[l], z[l])
                    # triW^(l) = triW^(l) + delta^(l+1) * transpose(h^(l))
                    tri_W[l] = np.dot(delta[l+1][:,np.newaxis],
                                      np.transpose(h[l][:,np.newaxis]))
                    # trib^(l) = trib^(l) + delta^(l+1)
                    tri_b[l] = delta[l+1]
            # perform the gradient descent step for the weights in each layer
            for l in range(len(nn_structure) - 1, 0, -1):
                W[l] += -alpha * (tri_W[l] + lamb * W[l])
                b[l] += -alpha * (tri_b[l])
        # complete the average cost calculation
        avg_cost = 1.0/m * avg_cost
        avg_cost_func.append(avg_cost)
        cnt += 1
    return W, b, avg_cost_func

In the above function, to implement stochastic gradient descent, the following code was simply indented into the sample loop for i in range(len(y)): (and the averaging over m samples removed):

for l in range(len(nn_structure) - 1, 0, -1):
    W[l] += -alpha * (tri_W[l] + lamb * W[l])
    b[l] += -alpha * (tri_b[l])

In other words, this is a very easy transition from batch to stochastic gradient descent. Where does the stochastic part come in? The stochastic component is in the random selection of the training samples. However, if we use the scikit-learn train_test_split function, the random selection has already occurred, so we can simply iterate through each training sample, which is already in a randomised order.
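A minimal check of this point, assuming the scikit-learn digits data (referred to as the MNIST dataset later in this post): train_test_split shuffles the samples before splitting, so looping over X_train in order already visits them in a random order.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.4)  # shuffle=True is the default
print(y_train[:10])  # the labels come out in a randomised order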

Stochastic gradient descent performance
So how does SGD perform? Let's take a look. The plot below shows the average cost versus the number of training epochs / iterations for batch gradient descent and SGD on the scikit-learn MNIST dataset. Note that both of these are operating off the same optimised learning parameters (i.e. learning rate, regularisation parameter), which were determined according to the methods described in this post.

Batch gradient descent versus SGD

Some interesting things can be noted from the above figure. First, SGD converges much more rapidly than batch gradient descent. In fact, SGD converges on a minimum J after fewer than 20 iterations. Secondly, despite what the average cost function plot says, batch gradient descent after 1000 iterations outperforms SGD. On the MNIST test set, the SGD run has an accuracy of 94% compared to a BGD accuracy of 96%. Why is that? Let's zoom into the SGD run to have a closer look:

Noisy SGD

As you can see in the figure above, SGD is noisy. That is because it responds to the effects of each and every sample, and the samples themselves will no doubt contain an element of noisiness. While this can be a benefit in that it can act to kick the gradient descent out of local minimum values of the cost function, it can also hinder it settling down into a good minimum. This is why, eventually, batch gradient descent has outperformed SGD after 1000 iterations. It might be argued that this is a worthwhile pay-off, as the running time of SGD versus BGD is greatly reduced. However, you might ask: is there a middle road, a trade-off?

There is, and it is called mini-batch gradient descent.

Mini-batch gradient descent

Mini-batch gradient descent is a trade-off between stochastic gradient descent and batch gradient descent. In mini-batch gradient descent, the cost function (and therefore the gradient) is averaged over a small number of samples, from around 10 to 500. This is opposed to the SGD batch size of 1 sample, and the BGD size of all the training samples. It looks like this:

$$W = W - \alpha \nabla J(W, b, x^{(z:z+bs)}, y^{(z:z+bs)})$$

Where bs is the mini-batch size and the cost function is:

$$J(W, b, x^{(z:z+bs)}, y^{(z:z+bs)}) = \frac{1}{bs} \sum_{z=0}^{bs} J(W, b, x^{(z)}, y^{(z)})$$

What's the benefit of doing it this way? First, it smooths out some of the noise in SGD, but not all of it, thereby still allowing the kick out of local minimums of the cost function. Second, the mini-batch size is still small, thereby keeping the performance benefits of SGD.

To create the mini-batches, we can use the following function:

from numpy import random

def get_mini_batches(X, y, batch_size):
    random_idxs = random.choice(len(y), len(y), replace=False)
    X_shuffled = X[random_idxs,:]
    y_shuffled = y[random_idxs]
    mini_batches = [(X_shuffled[i:i+batch_size,:], y_shuffled[i:i+batch_size]) for
                    i in range(0, len(y), batch_size)]
    return mini_batches
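A quick check of this helper on some toy data (the array sizes here are purely illustrative):

import numpy as np

X_toy = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y_toy = np.arange(10)
for X_mb, y_mb in get_mini_batches(X_toy, y_toy, batch_size=4):
    print(X_mb.shape, y_mb.shape)  # (4, 2) (4,) ... the final batch is smaller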

Then our new neural network training algorithm looks like this:

def train_nn_MBGD(nn_structure, X, y, bs=100, iter_num=3000, alpha=0.25, lamb=0.000):
    W, b = setup_and_init_weights(nn_structure)
    cnt = 0
    m = len(y)
    avg_cost_func = []
    print('Starting gradient descent for {} iterations'.format(iter_num))
    while cnt < iter_num:
        if cnt%1000 == 0:
            print('Iteration {} of {}'.format(cnt, iter_num))
        avg_cost = 0
        mini_batches = get_mini_batches(X, y, bs)
        for mb in mini_batches:
            X_mb = mb[0]
            y_mb = mb[1]
            # reset the accumulated gradients at the start of each mini-batch
            tri_W, tri_b = init_tri_values(nn_structure)
            for i in range(len(y_mb)):
                delta = {}
                # perform the feed forward pass and return the stored h and z values,
                # to be used in the gradient descent step
                h, z = feed_forward(X_mb[i, :], W, b)
                # loop from nl-1 to 1 backpropagating the errors
                for l in range(len(nn_structure), 0, -1):
                    if l == len(nn_structure):
                        delta[l] = calculate_out_layer_delta(y_mb[i,:], h[l], z[l])
                        avg_cost += np.linalg.norm((y_mb[i,:]-h[l]))
                    else:
                        if l > 1:
                            delta[l] = calculate_hidden_delta(delta[l+1], W[l], z[l])
                        # triW^(l) = triW^(l) + delta^(l+1) * transpose(h^(l))
                        tri_W[l] += np.dot(delta[l+1][:,np.newaxis],
                                           np.transpose(h[l][:,np.newaxis]))
                        # trib^(l) = trib^(l) + delta^(l+1)
                        tri_b[l] += delta[l+1]
            # perform the gradient descent step for the weights in each layer,
            # using the mean gradient over the bs samples in this mini-batch
            for l in range(len(nn_structure) - 1, 0, -1):
                W[l] += -alpha * (1.0/bs * tri_W[l] + lamb * W[l])
                b[l] += -alpha * (1.0/bs * tri_b[l])
        # complete the average cost calculation
        avg_cost = 1.0/m * avg_cost
        avg_cost_func.append(avg_cost)
        cnt += 1
    return W, b, avg_cost_func
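For reference, here is a sketch of how train_nn_MBGD might be driven end to end. The data preparation (the scikit-learn digits set standing in for the MNIST data referred to in this post, scaling, one-hot labels) and the 64-30-10 layer structure are assumptions carried over from the earlier neural networks tutorial, as are the helper functions (setup_and_init_weights, feed_forward and friends) that the training functions above rely on:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
X_train, X_test, y_train, y_test = train_test_split(X, digits.target, test_size=0.4)
y_v_train = np.eye(10)[y_train]  # one-hot encode the digit labels
nn_structure = [64, 30, 10]      # input, hidden and output layer sizes
W, b, avg_cost_func = train_nn_MBGD(nn_structure, X_train, y_v_train, bs=100)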

Let's see how it performs with a mini-batch size of 100 samples:

Mini-batch gradient descent versus the rest

As can be observed in the figure above, mini-batch gradient descent appears to be the superior method of gradient descent to be used in neural network training. The jagged decline in the average cost function is evidence that mini-batch gradient descent is kicking the cost function out of local minimum values to reach better, perhaps even the best, minimum. However, it is still able to find a good minimum and stick to it. This is confirmed in the test data: the mini-batch method achieves an accuracy of 98% compared to the next best, batch gradient descent, which has an accuracy of 96%. The great thing is that it gets to these levels of accuracy after only 150 iterations or so.

One final benefit of mini-batch gradient descent is that it can be performed in a distributed manner. That is, each mini-batch can be computed in parallel by workers across multiple servers, CPUs and GPUs to achieve significant improvements in training speeds. There are multiple algorithms and architectures to perform this parallel operation, but that is a topic for another day. In the meantime, enjoy trying out mini-batch gradient descent in your neural networks.
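As a toy sketch of that data-parallel idea (this is not code from this post): each worker below computes the gradient of a simple linear least-squares cost on its own mini-batch, and the results are averaged before a single, synchronous weight update.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def mini_batch_grad(args):
    # gradient of 1/(2*bs) * ||X_mb.dot(W) - y_mb||^2 for one mini-batch
    W, X_mb, y_mb = args
    return X_mb.T.dot(X_mb.dot(W) - y_mb) / len(y_mb)

rng = np.random.RandomState(0)
X, true_W = rng.randn(1000, 5), np.arange(5.0)
y = X.dot(true_W)
W, alpha, bs = np.zeros(5), 0.1, 100

with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(100):
        batches = [(W, X[i:i+bs], y[i:i+bs]) for i in range(0, len(y), bs)]
        grads = list(pool.map(mini_batch_grad, batches))
        W -= alpha * np.mean(grads, axis=0)  # one update per parallel pass

print(np.round(W, 2))  # approaches the true weights [0. 1. 2. 3. 4.]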
