
Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning

2018-11-05 by Tim Dettmers

Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. With no GPU, this might look like months of waiting for an experiment to finish, or running an experiment for a day or more only to see that the chosen parameters were off and the model diverged. With a good, solid GPU, one can quickly iterate over designs and parameters of deep networks, and run experiments in days instead of months, hours instead of days, minutes instead of hours. So making the right choice when buying a GPU is critical. How do you select the GPU that is right for you? This blog post delves into that question and offers advice that will help you make the choice that is right for you.

Having a fast GPU is very important when one begins to learn deep learning, as it allows for rapid gains in practical experience, which is key to building the expertise with which you will be able to apply deep learning to new problems. Without this rapid feedback, it simply takes too much time to learn from one’s mistakes, and it can be discouraging and frustrating to go on with deep learning. With GPUs, I quickly learned how to apply deep learning to a range of Kaggle competitions, and I managed to earn second place in the Partly Sunny with a Chance of Hashtags Kaggle competition, where the task was to predict weather ratings for a given tweet. In the competition, I used a rather large two-layer deep neural network with rectified linear units and dropout for regularization, and this deep net barely fit into my 6GB of GPU memory. The GTX Titan GPUs that powered me in the competition were a main factor in my reaching 2nd place.
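To make this concrete, here is a minimal PyTorch sketch of a network of that shape: two fully connected layers with ReLU and dropout. The layer sizes and dropout rate are hypothetical placeholders, not the exact values I used in the competition.

```python
import torch.nn as nn

# A minimal sketch of a two-layer fully connected network with rectified
# linear units and dropout. The sizes below are hypothetical placeholders,
# not the actual competition configuration.
model = nn.Sequential(
    nn.Linear(10_000, 4096),  # hypothetical input and hidden dimensions
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 24),      # hypothetical output dimension
)
model = model.cuda()          # the whole model has to fit into GPU memory
```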

Should I get multiple GPUs?


Excited by what deep learning can do with GPUs, I plunged myself into multi-GPU territory by assembling a small GPU cluster with a 40Gbit/s InfiniBand interconnect. I was thrilled to see whether even better results could be obtained with multiple GPUs.
I quickly found that it is not only very difficult to parallelize neural networks across multiple GPUs efficiently, but also that the speedup was only mediocre for dense neural networks. Small neural networks could be parallelized rather efficiently using data parallelism, but larger neural networks, like the one I used in the Partly Sunny with a Chance of Hashtags Kaggle competition, received almost no speedup.

I analyzed parallelization in deep learning in depth, developed a technique to increase the speedup in GPU clusters from 23x to 50x for a system of 96 GPUs, and published my research at ICLR 2016. In my analysis, I also found that convolutional and recurrent networks are rather easy to parallelize, especially if you use only a single computer with up to 4 GPUs. So while modern tools are not highly optimized for parallelism, you can still attain good speedups.

Figure 1: Setup in my main computer: you can see three GPUs and an InfiniBand card. Is this a good setup for doing deep learning?
The user experience of using parallelization techniques in the most popular frameworks is also pretty good now compared to three years ago. Their algorithms are rather naive and will not scale to GPU clusters, but they deliver good performance for up to 4 GPUs. For convolution, you can expect a speedup of 1.9x/2.8x/3.5x for 2/3/4 GPUs; for recurrent networks, the sequence length is the most important parameter, and for common NLP problems one can expect similar or slightly worse speedups than for convolutional networks. Fully connected networks usually perform poorly under data parallelism, and more advanced algorithms are necessary to accelerate these parts of the network.
So today, using multiple GPUs can make training much more convenient due to the increased speed, and if you have the money for it, multiple GPUs make a lot of sense.
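For reference, here is a minimal sketch of what data parallelism looks like in PyTorch (just one possible framework choice, used here as an illustration): each mini-batch is split across the visible GPUs and the gradients are averaged.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# A minimal data-parallelism sketch: DataParallel splits each mini-batch
# across all visible GPUs and averages the gradients. The ResNet-50 model
# and batch size here are illustrative choices.
model = models.resnet50()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # uses all visible GPUs by default
model = model.cuda()

x = torch.randn(256, 3, 224, 224).cuda()   # the batch gets split across GPUs
out = model(x)                             # forward pass runs on all GPUs
```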

Using Multiple GPUs Without Parallelism


Another advantage of using multiple GPUs, even if you do not parallelize algorithms, is that you can run multiple algorithms or experiments separately, one on each GPU. You gain no speedups, but you get more information about performance by trying different algorithms or parameters at once. This is highly useful if your main goal is to gain deep learning experience as quickly as possible, and it is also very useful for researchers who want to try multiple versions of a new algorithm at the same time.
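As an illustration (my own, hypothetical setup), running two independent experiments on two GPUs is as simple as pinning each model to its own device; in practice you would usually launch two separate scripts, for example with CUDA_VISIBLE_DEVICES=0 and CUDA_VISIBLE_DEVICES=1.

```python
import torch
import torch.nn as nn

# Two independent runs, each pinned to its own GPU; they do not interact,
# but you get feedback on two configurations at the same time. The models
# below are placeholders.
device_a = torch.device("cuda:0")
device_b = torch.device("cuda:1")

model_a = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10)).to(device_a)
model_b = nn.Sequential(nn.Linear(100, 200), nn.ReLU(), nn.Linear(200, 10)).to(device_b)
```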

This is psychologically important if you want to learn deep learning. The shorter the interval between performing a task and receiving feedback on that task, the better the brain is able to integrate relevant pieces of memory for that task into a coherent picture. If you train two convolutional nets on separate GPUs on small datasets, you will more quickly get a feel for what is important to perform well; you will more readily be able to detect patterns in the cross-validation error and interpret them correctly. You will be able to detect patterns which give you hints on what parameter or layer needs to be added, removed, or adjusted.

I personally think using multiple GPUs in this way is more useful than parallelizing a single network, as one can quickly search for a good configuration. Once one has found a good range of parameters or architectures, one can then use parallelism across multiple GPUs to train the final network.

So overall, one can say that one GPU should be sufficient for almost any task, but that multiple GPUs are becoming more and more important for accelerating your deep learning models. Multiple cheap GPUs are also excellent if you want to learn deep learning quickly. I personally would rather have many small GPUs than one big one, even for my research experiments.

NVIDIA vs AMD vs Intel vs Google vs Amazon


NVIDIA: The Leader
NVIDIA’s standard libraries made it very easy to establish the first deep learning libraries in CUDA, while there were no such powerful standard libraries for AMD’s OpenCL. This early advantage, combined with strong community support from NVIDIA, grew the CUDA community rapidly. This means that if you use NVIDIA GPUs, you will easily find support if something goes wrong, you will find support and advice if you program CUDA yourself, and you will find that most deep learning libraries have the best support for NVIDIA GPUs. This is a very strong point for NVIDIA GPUs.

On the other hand, NVIDIA now has a policy that the use of CUDA in data centers is only allowed for Tesla GPUs and not for GTX or RTX cards. It is not clear what is meant by “data centers,” but this means that organizations and universities are often forced to buy the expensive and cost-inefficient Tesla GPUs for fear of legal issues. However, Tesla cards have no real advantage over GTX and RTX cards and cost up to 10 times as much.

That NVIDIA can just do this without any major hurdles shows the power of their monopoly: they can do as they please, and we have to accept the terms. If you pick NVIDIA GPUs for their major advantages in terms of community and support, you will also need to accept that you can be pushed around at will.

AMD: Powerful But Lacking Support


HIP via ROCm unifies NVIDIA and AMD GPUs under a common programming language which is compiled into the respective GPU language before it is compiled to GPU assembly. If all our GPU code were in HIP, this would be a major milestone, but this is rather difficult because the TensorFlow and PyTorch code bases are hard to port. TensorFlow has some support for AMD GPUs, and all major networks can be run on AMD GPUs, but if you want to develop new networks, some details might be missing, which could prevent you from implementing what you need. The ROCm community is also not very large, and thus it is not straightforward to fix issues quickly. There also does not seem to be much money allocated for deep learning development and support on AMD’s side, which slows the momentum.

However, AMD GPUs show strong performance compared to NVIDIA GPUs, and the next AMD GPU, the Vega 20, will be a computing powerhouse featuring Tensor-Core-like compute units.
Overall, I think I still cannot give a clear recommendation for AMD GPUs for ordinary users who just want their GPUs to work smoothly. More experienced users should have fewer problems, and by supporting AMD GPUs and ROCm/HIP developers they help combat NVIDIA’s monopoly position, which will greatly benefit everyone in the long term. If you are a GPU developer and want to make important contributions to GPU computing, then an AMD GPU might be the best way to make a good impact over the long term. For everyone else, NVIDIA GPUs might be the safer choice.

Intel: Trying Hard


My personal experience with Intel’s Xeon Phis has been very disappointing, and I do not see them as a real competitor to NVIDIA or AMD cards, so I will keep it short: if you decide to go with a Xeon Phi, take note that you might encounter poor support, code sections that run slower than on a CPU, great difficulty writing optimized code, no full support for C++11 features, a compiler that does not support some important GPU design patterns, poor compatibility with other libraries that rely on BLAS routines (NumPy and SciPy), and probably many other frustrations that I have not yet run into.

I was really looking forward to the Intel Nervana neural network processor (NNP) because its specs would be extremely powerful in the hands of a GPU developer and it would have allowed for novel algorithms which might redefine how neural networks are used, but it has been delayed endlessly and there are rumors that large parts of the development team jumped ship. The NNP is planned for Q3/Q4 2019. If you want to wait that long, keep in mind that good hardware is not everything, as we can see from AMD and Intel’s own Xeon Phi. It might well be into 2020 until the NNP is usable in a mature way.

Google: Cheaper On-Demand Processing?


The Google TPU has developed into a very mature cloud-based product that is extremely cost-efficient. The easiest way to make sense of the TPU is to see it as multiple GPUs packaged together. If we look at performance measures of the Tensor-Core-enabled V100 versus the TPUv2, we find that both systems have nearly the same performance for ResNet50. However, the Google TPU is more cost-efficient.

So is the TPU a cost-efficient cloud-based solution? Yes and no. On paper and for regular use it is more cost-efficient. However, if you use the best practices and guidelines of the fast.ai team and the fastai library, you can achieve faster convergence at a lower price, at least for convolutional networks for object recognition.

With the same software, the TPU could be even more cost-efficient, but here also lies the problem: (1) TPUs cannot be used with the fastai library, that is, with PyTorch; (2) TPU algorithms rely mostly on the internal Google team; (3) no uniform high-level library exists which enforces good standards for TensorFlow.

All three points hurt the TPU, as it requires separate software to keep up with new additions to the deep learning algorithm family. I am sure the grunt work has already been done by the Google team, but it is unclear how good the support is for some models. For example, the official repository has only a single model for NLP, with the rest being computer vision models. All models use convolution and none of them are recurrent neural networks. This comes together with a now rather old report from February that the TPUv2 did not converge when LSTMs were used. I could not find a source confirming whether this problem has been fixed yet. On the other hand, one big milestone in NLP was BERT, a big bidirectional transformer architecture which can be fine-tuned to reach state-of-the-art performance on a wide range of NLP tasks. TPUs were critical for training these bidirectional transformers on a lot of data. In total, 256 TPU-hours were needed to train a base model of BERT. How does this compare to GPUs? I wrote a detailed analysis of this and found that the new RTX GPUs are critical for transformer performance and that one can expect a run-time of about 400 GPU-hours. This shows that TPUs perform quite well on this task and have a big advantage over GPUs for training transformers.

To conclude, TPUs currently seem to be best used for training convolutional networks or large transformers, and they should be seen as a supplement to other compute resources rather than as a main deep learning resource.

Amazon: Reliable but Expensive


A lot of new GPUs have been added to AWS since the last update of this blog post. However, the prices are still a bit high. AWS GPU instances can be a very useful solution when additional compute is needed suddenly, for example when all GPUs are in use, as is common before research paper deadlines.
However, to be cost-efficient, one should make sure to run only a few networks and to know with good certainty that the parameters chosen for the training run are near-optimal. Otherwise, the cost will cut quite deep into your pocket and a dedicated GPU might be more useful. Even if a fast AWS GPU is tempting, a solid GTX 1070 or better will provide good compute performance for a year or two without costing too much.

So AWS GPU instances are very useful, but they need to be used wisely and with caution to be cost-efficient. For more discussion of cloud computing, see the section below.

What Makes One GPU Faster Than Another?


TL;DR
Your first question might be: what is the most important feature for fast GPU performance in deep learning? Is it CUDA cores? Clock speed? RAM size?

While a good simplified piece of advice used to be “pay attention to the memory bandwidth,” I would no longer recommend doing that. This is because GPU hardware and software have developed over the years in a way that makes bandwidth no longer a good proxy for a GPU’s performance. The introduction of Tensor Cores in consumer-grade GPUs complicates the issue further. Now a combination of bandwidth, FLOPS, and Tensor Cores is the best indicator of a GPU’s performance.

One thing that will deepen your understanding and help you make an informed choice is to learn a bit about which parts of the hardware make GPUs fast for the two most important tensor operations: matrix multiplication and convolution.

A simple and effective way to think about matrix multiplication is that it is bandwidth bound. That is, memory bandwidth is the most important feature of a GPU if you want to use LSTMs and other recurrent networks that do lots of matrix multiplications.
Similarly, convolution is bound by computation speed. Thus, the TFLOPS of a GPU are the best indicator of performance for ResNets and other convolutional architectures.
Tensor Cores change the equation slightly. They are very straightforward specialized compute units which speed up computation, but not memory bandwidth, and thus the largest benefit is seen for convolutional nets, which are about 30% to 100% faster with Tensor Cores.
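To get a feel for these two regimes, here is a rough timing sketch (the sizes are hypothetical) that compares a large matrix multiplication with a ResNet-style convolution:

```python
import time
import torch

# Rough benchmark: a large matrix multiplication (bandwidth bound, LSTM-like)
# versus a ResNet-style 3x3 convolution (compute bound). Sizes are hypothetical.
a = torch.randn(2048, 2048, device="cuda")
b = torch.randn(2048, 2048, device="cuda")
x = torch.randn(64, 256, 56, 56, device="cuda")
conv = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1).cuda()

def bench(fn, iters=50):
    fn()                           # warm-up (kernel selection, caches)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - t0) / iters

print("matmul:     ", bench(lambda: a @ b), "s/iter")
print("convolution:", bench(lambda: conv(x)), "s/iter")
```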

While Tensor Cores only make the computation faster, they also enable computation with 16-bit numbers. This is also a big advantage for matrix multiplication, because with numbers being only 16 bits instead of 32 bits, one can transfer twice as many numbers over the same memory bandwidth. I wrote in detail about how this change from 32-bit to 16-bit affects matrix multiplication performance, but in general, one can hope for speedups of 100-300% when switching from 32-bit to 16-bit and an additional 20% to 60% for LSTMs when using Tensor Cores.

These are some big increases in performance, and 16-bit training should become standard with RTX cards: never use 32-bit! If you encounter problems with 16-bit training, then you should use loss scaling: (1) multiply your loss by a big number, (2) calculate the gradients, (3) divide the gradients by the big number, (4) update your weights. Usually, 16-bit training should be just fine, but if you are having trouble replicating results with 16-bit training, loss scaling will usually solve the issue.
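Here is a minimal sketch of these four loss-scaling steps in PyTorch (my own illustration; the scale factor is a hypothetical choice, and frameworks also offer mixed-precision tools that do this for you):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A minimal sketch of 16-bit training with manual loss scaling. The model,
# data, and scale factor are placeholders.
model = nn.Linear(512, 10).cuda().half()           # 16-bit weights
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_scale = 1024.0                                # the "big number"

x = torch.randn(64, 512).cuda().half()
y = torch.randint(0, 10, (64,)).cuda()

opt.zero_grad()
loss = F.cross_entropy(model(x), y)
(loss * loss_scale).backward()                     # (1) scale the loss, (2) compute gradients
for p in model.parameters():
    p.grad.data.div_(loss_scale)                   # (3) divide the gradients by the big number
opt.step()                                         # (4) update the weights
```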

So overall, the best rule of thumb would be: Look at bandwidth if you use
RNNs; look at FLOPS if you use convolution; get Tensor Cores if you can
afford them (do not buy Tesla cards unless you have to).

Figure 2: Normalized raw performance data of GPUs and TPU. Higher is better. An RTX 2080 Ti is about twice as fast as a GTX 1080 Ti: 0.77 vs 0.4.
Cost Efficiency Analysis
The cost-efficiency of a GPU is probably the most important criterion for selecting a GPU. I did a new cost/performance analysis that incorporates memory bandwidth, TFLOPS, and Tensor Cores. I looked at prices on eBay and Amazon and weighted them 50:50; then I looked at performance indicators for LSTMs and CNNs, with and without Tensor Cores. I took these performance numbers and averaged them to get average performance ratings, from which I then calculated performance/cost numbers. This is the result:

Figure 3: Normalized performance/cost numbers for convolutional networks (CNN), recurrent networks (RNN), and Transformers. Higher is better. An RTX 2070 is more than 5 times more cost-efficient than a Tesla V100.
From this data (1, 2, 3, 4, 5, 6, 7), we see that the RTX 2070 is more cost-efficient than the RTX 2080 or the RTX 2080 Ti. Why is this so? The ability to do 16-bit computation with Tensor Cores is much more valuable than just having a bigger chip with more Tensor Cores. With the RTX 2070, you get these features for the lowest price.
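To make the methodology transparent, here is a minimal sketch of the kind of performance/cost calculation described above; the performance numbers and prices below are placeholders, not the figures behind the chart:

```python
# Placeholder performance numbers and prices; the real analysis uses the
# measured values behind Figure 3.
cards = {
    # name: (normalized performance, eBay price in $, Amazon price in $)
    "RTX 2070":    (0.55, 500, 550),
    "RTX 2080":    (0.65, 750, 800),
    "RTX 2080 Ti": (0.77, 1150, 1250),
}

for name, (perf, ebay, amazon) in cards.items():
    price = 0.5 * ebay + 0.5 * amazon   # 50:50 weighting of eBay and Amazon prices
    print(f"{name}: {1000 * perf / price:.2f} performance per 1000 dollars")
```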

However, this analysis also has certain biases which should be taken into account:
(1) Prices fluctuate. Currently, GTX 1080 Ti, RTX 2080, and RTX 2080 Ti cards seem to be overpriced, and they could become more favorable in the future.
(2) This analysis favors smaller cards. It does not take into account how much memory you need for your networks, nor how many GPUs you can fit into your computer. One computer with 4 fast GPUs is much more cost-efficient than 2 computers with the most cost-efficient cards.

Warning: Multi-GPU RTX Heat Problems


There are problems with the RTX 2080 Ti and other RTX GPUs with the standard dual fan if you use multiple GPUs that run next to each other. This is especially true for multiple RTX 2080 Tis in one computer, but multiple RTX 2080s and RTX 2070s can also be affected. The fan on some of the RTX cards is a new design developed by NVIDIA to improve the experience for gamers who run a single GPU (silent, lower heat for one GPU). However, the design is terrible if you use multiple GPUs that have this open dual-fan design. If you want to use multiple RTX cards that run next to each other (directly in the next PCIe slot), then you should get a version that has a “blower-style” single-fan design. This is especially true for RTX 2080 Ti cards. ASUS and PNY currently have RTX 2080 Ti models on the market with a blower-style fan. If you use two RTX 2070s, any fan should be fine; however, I would also get a blower-style fan if you run more than 2 RTX 2070s next to each other.

Required Memory Size and 16-bit Training


The memory on a GPU can be critical for some applications like computer vision, machine translation, and certain other NLP applications, and you might think that the RTX 2070 is cost-efficient but that its 8 GB of memory is too small. However, note that through 16-bit training you virtually have 16 GB of memory, and any standard model should fit into your RTX 2070 easily if you use 16-bit. The same is true for the RTX 2080 and RTX 2080 Ti.
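A quick way to see where this “virtual” doubling comes from is to compare element sizes: each 16-bit number takes 2 bytes instead of 4, so the same weights and activations occupy roughly half the memory.

```python
import torch

# Each float32 element takes 4 bytes, each float16 element takes 2 bytes,
# so 16-bit roughly doubles the number of values that fit into GPU memory.
x32 = torch.randn(1024, 1024, device="cuda")   # float32
x16 = x32.half()                               # float16
print(x32.element_size(), "bytes/element,", x32.numel() * x32.element_size() / 2**20, "MiB")
print(x16.element_size(), "bytes/element,", x16.numel() * x16.element_size() / 2**20, "MiB")
```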

General GPU Recommendations


Currently, my main recommendation is to get an RTX 2070 GPU and use 16-bit training. I would never recommend buying a Titan XP, Titan V, any Quadro card, or any Founders Edition GPU. However, there are some specific GPUs which also have their place:
(1) For extra memory, I would recommend an RTX 2080 Ti.
(2) For extra performance, I would recommend either (a) an RTX 2080 Ti, or (b) an RTX 2070 now, then sell it and upgrade to an RTX Titan in 2019 Q1/Q2.
(3) If you are short on money, I would recommend a Titan X (Pascal) from eBay or a GTX 1060 (6GB). If that is too expensive, go for a GTX 1050 Ti. If that is still too expensive, have a look at Colab.
(4) If you just want to get started with deep learning, a GTX 1050 Ti (4GB) is a good option.
(5) If you can wait, definitely wait it out: both the GTX 1080 Ti and the RTX 2080 Ti are great cards, but they have crazy prices right now. Their prices will likely stabilize in a month or two.
(6) If you want to learn quickly how to do deep learning: multiple GTX 1060 (6GB) cards.
(7) If you already have a GTX 1080 Ti or GTX Titan (Pascal), you might want to wait until the RTX Titan is released. Your GPUs are still okay.

I personally wanted to get an RTX 2080 Ti, but since its release the RTX 2070 is the much more cost-efficient card, and with its virtual 16-bit memory, which is equivalent to 16 GB in 32-bit, I will be able to run any model that is out there.

Deep Learning in the Cloud


Both GPU instances on AWS and TPUs in the Google Cloud are viable options for deep learning. While the TPU is a bit cheaper, it lacks the versatility and flexibility of AWS GPUs. TPUs might be the weapon of choice for training object recognition pipelines. For other workloads, AWS GPUs are a safer bet; the good thing about cloud instances is that you can switch between GPUs and TPUs at any time or even use both at the same time.

However, mind the opportunity cost here: if you learn the skills for a smooth workflow with AWS instances, you lose time that could be spent working on a personal GPU, and you will also not have acquired the skills to use TPUs. If you use a personal GPU, you will not have the skills to expand into more GPUs/TPUs via the cloud. If you use TPUs, you are stuck with TensorFlow, and it will not be straightforward to switch to AWS. Learning a smooth cloud workflow is expensive, and you should weigh this cost when choosing between TPUs and AWS GPUs.

Another question is when to use cloud services. If you are trying to learn deep learning or you need to prototype, then a personal GPU might be the best option, since cloud instances can be pricey. However, once you have found a good deep network configuration and you just want to train the final model, training with data parallelism on cloud instances is a solid approach. This means that a small GPU will be sufficient for prototyping, and one can rely on the power of cloud computing to scale up to larger experiments.
If you are short on money, cloud computing instances might also be a good solution, but the problem is that you have to buy a lot of compute per hour when you only need a little for prototyping. In this case, one might want to prototype on a CPU and then roll out to GPU/TPU instances for a quick training run. This is not the best workflow, since prototyping on a CPU can be a big pain, but it is a cost-efficient solution.
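A minimal sketch of such a workflow: the same script runs on a CPU or a small personal GPU for prototyping and, unchanged, on a rented multi-GPU cloud instance for the final data-parallel run (the model here is a placeholder):

```python
import torch
import torch.nn as nn

# The same script prototypes on a CPU (or a small personal GPU) and scales
# to a multi-GPU cloud instance without changes. The model is a placeholder.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
if torch.cuda.device_count() > 1:      # e.g. a rented multi-GPU instance
    model = nn.DataParallel(model)     # data parallelism for the final training run
model = model.to(device)

x = torch.randn(128, 784, device=device)   # placeholder batch
out = model(x)
```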

Conclusion
With the information in this blog post, you should be able to reason about which GPU is suitable for you. In general, I see two main strategies that make sense: either go with an RTX 20 series GPU for a quick upgrade, or go with a cheap GTX 10 series GPU and upgrade once the RTX Titan becomes available. If you are less serious about performance or simply do not need it, for example for Kaggle, startups, prototyping, or learning deep learning, you can also benefit greatly from cheap GTX 10 series GPUs. If you go for a GTX 10 series GPU, be careful that the GPU memory size fulfills your requirements.

TL;DR advice
Best GPU overall: RTX 2070
GPUs to avoid: Any Tesla card; any Quadro card; any Founders Edition card; Titan V, Titan XP
Cost-efficient but expensive: RTX 2070
Cost-efficient and cheap: GTX Titan (Pascal) from eBay, GTX 1060 (6GB), GTX 1050 Ti (4GB)
I have little money: GTX Titan (Pascal) from eBay, or GTX 1060 (6GB), or GTX 1050 Ti (4GB)
I have almost no money: GTX 1050 Ti (4GB); CPU (prototyping) + AWS/TPU (training); or Colab.
I do Kaggle: RTX 2070. If you do not have enough money, go for a GTX 1060 (6GB) or a GTX Titan (Pascal) from eBay for prototyping and AWS for final training. Use the fastai library.
I am a competitive computer vision or machine translation researcher: RTX 2080 Ti with the blower fan design; upgrade to an RTX Titan in 2019.
I am an NLP researcher: RTX 2070; use 16-bit training.
I want to build a GPU cluster: This is really complicated; you can get some ideas here.
I started deep learning and I am serious about it: Start with an RTX 2070. Buy more RTX 2070s after 6-9 months if you still want to invest more time into deep learning. Depending on what area you choose next (startup, Kaggle, research, applied deep learning), sell your GPUs and buy something more appropriate after about two years.
I want to try deep learning, but I am not serious about it: GTX 1050 Ti (4GB or 2GB). This often fits into your standard desktop. If it does, do not buy a new computer!

Update 2018-11-26: Added discussion of the overheating issue of the RTX 2080 Ti.
Update 2018-11-05: Added RTX 2070 and updated recommendations. Updated charts with hard performance data. Updated TPU section.
Update 2018-08-21: Added RTX 2080 and RTX 2080 Ti; reworked performance analysis
Update 2017-04-09: Added cost efficiency analysis; updated
recommendation with NVIDIA Titan Xp
Update 2017-03-19: Cleaned up blog post; added GTX 1080 Ti
Update 2016-07-23: Added Titan X Pascal and GTX 1060; updated
recommendations
Update 2016-06-25: Reworked multi-GPU section; removed simple
neural network memory section as no longer relevant; expanded
convolutional memory section; truncated AWS section due to not being
efficient anymore; added my opinion about the Xeon Phi; added updates
for the GTX 1000 series
Update 2015-08-20: Added section for AWS GPU instances; added GTX
980 Ti to the comparison relation
Update 2015-04-22: GTX 580 no longer recommended; added
performance relationships between cards
Update 2015-03-16: Updated GPU recommendations: GTX 970 and
GTX 580
Update 2015-02-23: Updated GPU recommendations and memory
calculations
Update 2014-09-28: Added emphasis for memory requirement of CNNs

Acknowledgments

I want to thank Mat Kelcey for helping me to debug and test custom
code for the GTX 970; I want to thank Sander Dieleman for making me
aware of the shortcomings of my GPU memory advice for convolutional
nets; I want to thank Hannes Bretschneider for pointing out software
dependency problems for the GTX 580; and I want to thank Oliver
Griesel for pointing out notebook solutions for AWS instances.
