You are on page 1of 6

Why You Should Forget ‘for-loop’ for Data Science Code and Embrace ... https://www.kdnuggets.com/2017/11/forget-for-loop-data-science-code-v...

KDnuggets

Subscribe to KDnuggets News | | Contact

SOFTWARE
News/Blog
Top stories
Opinions
Tutorials
JOBS
Companies
Courses
Datasets
EDUCATION
Certificates
Meetings
Webinars

KDnuggets Home » News » 2017 » Nov » Tutorials, Overviews » Why You Should Forget ‘for-loop’ for
Data Science Code and Embrace Vectorization ( 17:n46 )

Why You Should Forget ‘for-loop’ for Data


Science Code and Embrace Vectorization
Previous post
Next post

Like 154 Tweet Share 53

Tags: numpy, Python, Scientific Computing

Data science needs fast computation and transformation of data. NumPy objects in Python provides that
advantage over regular programming constructs like for-loop. How to demonstrate it in few easy lines of
code?

1 of 6 2018-03-04, 5:06 PM
Why You Should Forget ‘for-loop’ for Data Science Code and Embrace ... https://www.kdnuggets.com/2017/11/forget-for-loop-data-science-code-v...

comments

By Tirthajyoti Sarkar, ON Semiconductor

We all have used for-loops for majority of the tasks which needs an iteration over a long list of elements. I am
sure almost everybody, who is reading this article, wrote their first code for matrix or vector multiplication
using a for-loop back in high-school or college. For-loop has served programming community long and
steady.

However, it comes with some baggage and is often slow in execution when it comes to processing large data
sets (many millions of records as in this age of Big Data). This is particularly true for interpreted language
like Python, where, if the body of your loop is simple, the interpreter overhead of the loop itself can be a
substantial amount of the overhead.

Fortunately, in almost all major programming ecosystem there is an alternative. Python has a beautiful one.

Numpy, short for Numerical Python, is the fundamental package required for high performance scientific
computing and data analysis in Python ecosystem. It is the foundation on which nearly all of the higher-level
tools such as Pandas and scikit-learn are built. TensorFlow uses NumPy arrays as the fundamental building
block on top of which they built their Tensor objects and graphflow for deep learning tasks (which makes
heavy use of linear algebra operations on a long list/vector/matrix of numbers).

Two of the most important advantages Numpy provides, are:

ndarray, a fast and space-efficient multidimensional array providing vectorized arithmetic operations
and sophisticated broadcasting capabilities
Standard mathematical functions for fast operations on entire arrays of data without having to write
loops
S HAR E S

2 of 6 2018-03-04, 5:06 PM
Why You Should Forget ‘for-loop’ for Data Science Code and Embrace ... https://www.kdnuggets.com/2017/11/forget-for-loop-data-science-code-v...

You will often come across this assertion in the data science, machine learning, and Python
community that Numpy is much faster due to its vectorized implementation and due to the fact
that many of its core routines are written in C (based on CPython framework).

And it is indeed true (this article is a beautiful demonstration of various options that one can work with
Numpy, even writing bare-bone C routines with Numpy APIs). Numpy arrays are densely packed arrays of
homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of
the same type. So, you get the benefits of locality of reference. Many Numpy operations are implemented in
C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking.
The speed boost depends on which operations you’re performing. For data science and modern machine
learning tasks, this is an invaluable advantage, as often the data set size runs into millions if not billions of
records and you do not want to iterate over it using a for-loop along with its associated baggage.

How to prove it definitively with an example of a moderately sized data set?

Here is the link to my Github code (Jupyter notebook) that shows, in a few easy lines of code, the difference
in speed of Numpy operation from that of regular Python programming constructs like for-loop, map-
function, or list-comprehension.

I just outline the basic flow:

Create a list of a moderately large number of floating point numbers, preferably drawn from a
continuous statistical distribution like a Gaussian or Uniform random. I chose 1 million for the demo.
Create a ndarray object out of that list i.e. vectorize.
Write short code blocks to iterate over the list and use a mathematical operation on the list say taking
logarithm of base 10. Use for-loop, map-function, and list-comprehension. Each time
use time.time() function to determine how much time it takes in total to process the 1 million
records.

t1=time.time()
for item in l1:
l2.append(lg10(item))
t2 = time.time()
print("With for loop and appending it took {} seconds".format(t2-t1))
speed.append(t2-t1)

Do the same operation using Numpy’s built-in mathematical method (np.log10) over
the ndarray object. Time it.

t1=time.time()
a2=np.log10(a1)
t2 = time.time()
print("With direct Numpy log10 method it took {} seconds".format(t2-t1))
speed.append(t2-t1)

S HAR E S

3 of 6 2018-03-04, 5:06 PM
Why You Should Forget ‘for-loop’ for Data Science Code and Embrace ... https://www.kdnuggets.com/2017/11/forget-for-loop-data-science-code-v...

Store the execution times in a list and plot a bar chart showing the comparative difference.

Here is the result. And, you can repeat the whole process by running all the cells of the Jupyter notebook.
Every time it will generate a new set of random numbers, so the exact execution time may vary a little bit but
overall the trend will always be the same. You can try with various other mathematical functions/string
operations or combination thereof, to check if this holds true in general.

There is an entire open-source, online book on this topic by a French neuroscience researcher. Check it out
here.

Bar chart of comparative speeds of execution of simple mathematical operations

If you have any questions or ideas to share, please contact the author at tirthajyoti[AT]gmail.com. Also you
can check author’s GitHub repositories for other fun code snippets in Python, R, or MATLAB and machine
learning resources.

Bio: Tirthajyoti Sarkar is a semiconductor technologist, machine learning/data science zealot, Ph.D. in EE,
blogger and writer.

Original. Reposted with permission.

Related:

S HAR E S

4 of 6 2018-03-04, 5:06 PM
Why You Should Forget ‘for-loop’ for Data Science Code and Embrace ... https://www.kdnuggets.com/2017/11/forget-for-loop-data-science-code-v...

Python Data Preparation Case Files: Group-based Imputation

1 Comment KDnuggets 1 Login

Recommend 2 Share Sort by Best

Join the discussion…

LOG IN WITH
OR SIGN UP WITH DISQUS ?

Henri de Feraudy • 3 months ago


If you programme in Julia, the for loop is a lot less expensive.
• Reply • Share ›

Subscribe Add Disqus to your siteAdd DisqusAdd Privacy

Previous post
Next post

Top Stories Past 30 Days

Most Popular Most Shared

1. Neural network AI is simple. So... Stop 1. Neural network AI is simple. So… Stop
pretending you are a genius pretending you are a genius
2. Top 10 Machine Learning Algorithms 2. Top 20 Python AI and Machine
for Beginners Learning Open Source Projects
3. Comparing Machine Learning as a 3. 5 Fantastic Practical Machine Learning
Service: Amazon, Microsoft Azure, Resources
Google Cloud AI 4. The 8 Neural Network Architectures
4. Top 20 Python AI and Machine Machine Learning Researchers Need to
Learning Open Source Projects Learn
5. 5 Fantastic Practical Machine Learning 5. Web Scraping Tutorial with Python:
Resources Tips and Tricks
S HAR E S

5 of 6 2018-03-04, 5:06 PM
Why You Should Forget ‘for-loop’ for Data Science Code and Embrace ... https://www.kdnuggets.com/2017/11/forget-for-loop-data-science-code-v...

Should Not Overlook 7. Deep Learning Development with


7. Want to Become a Data Scientist? Try Google Colab, TensorFlow, Keras &
Feynman Technique PyTorch

Latest News
TDWI Chicago, May 6-11: Get Your Hands Dirty With Data ...
Upcoming Meetings in AI, Analytics, Big Data, Data Scie...
For GPU Databases of today, the big challenge is doing ...
Data Science in Fashion
Data Science for Javascript Developers 
Unleash a faster Python on your data

KDnuggets Home » News » 2017 » Nov » Tutorials, Overviews » Why You Should Forget ‘for-loop’ for
Data Science Code and Embrace Vectorization ( 17:n46 )

© 2018 KDnuggets. About KDnuggets

S HAR E S

6 of 6 2018-03-04, 5:06 PM

You might also like