

Why becoming a data scientist might be easier than you think
by Derrick Harris
OCT. 14, 2012 - 10: 02 PM PST
Several novice programmers who signed up for a free machine-learning class on Coursera have gone on
recently to win predictive-modeling competitions. Maybe it's not that hard to mint new data scientists after all.
Maybe the business world has jumped the gun with all the talk about a looming skills
shortage in big data and advanced analytics. There's mounting evidence that it doesn't
take much to turn a novice programmer or statistician into a perfectly capable data
scientist. Maybe all it takes is just some cheap cloud computing servers, or a few weeks
studying machine learning with Stanford professor Andrew Ng on Coursera.
Much of this evidence comes via Kaggle, a platform where companies and
organizations award prizes for the best solutions to their predictive-modeling needs. In
September, for example, I covered a first-time Kaggle user and admitted data science
neophyte named Carter S. who won a competition using a simple but effective method
he dubbed "overkill analytics."
Impressive, sure, but Carter builds insurance-industry risk models for a living. While
he's able to learn new techniques such as natural-language processing and social
network analysis as he goes, he's no stranger to a linear regression. But what if
someone's only formal experience with computer science was a single undergraduate
programming course?
Ask Luis Tandalla. That was his case before he took a handful of free online classes
last year on Coursera. Yet the University of New Orleans senior recently scored his
first victory in a Kaggle competition hosted by the Hewlett Foundation where he had to
devise a model for accurately grading short-answer questions on exams. Not bad for a
college senior who didn't really know what artificial intelligence and machine learning
were before he signed up to learn them.
Once Tandalla got started, he told me, he got passionate about learning more. So he
also took Coursera classes on natural-language processing and probabilistic models,
began studying on his own outside the online lectures and even got active on Kaggle
(this was his first victory in five competitions). He'll receive his bachelor's degree in
mechanical engineering in May 2013, but now Tandalla says he wants to pursue a
master's degree in machine learning and start his own predictive-software company.
The Coursera connection
Maybe Tandalla isn't so unique after all. The second- and third-place finishers in the
Hewlett Foundation competition, it turns out, also learned machine learning on
Coursera. The latter, Xavier Conort, is a 39-year-old actuary from Singapore who just
decided to become a data scientist last year and is now Kaggle's top-ranked competitor.
Andrew Ng
Stanford professor and Coursera co-founder Andrew Ng, who teaches the
machine-learning class that all three top finishers took, doesn't think their success is
just coincidence. If you're not trying to make the types of contacts students at top
universities are after, and your goal isn't to perform advanced research, he explained,
online education platforms such as Coursera (and, I'll add, Udacity and EdX) can be
incredibly valuable.
In particular, Ng said, "Machine learning has matured to the point where if you take
one class you can actually become pretty good at applying it." Familiarity with algebra
and probability is certainly helpful, he added, but the only real prerequisite to his
course is a basic understanding of programming.
And with machine learning becoming one of the more highly sought-after skills in
Silicon Valley, Ng said, corporate recruiters say just completing a single course can
significantly boost someone's salary and job prospects at companies where such
knowledge is still in short supply.
"I bet many students are going on to [do] great things because of these courses [even if
we never hear about it]," Ng said.
Why it works, and why it could change the world
Ng thinks the current incarnation of online education platforms works so well because
they're essentially nurturing the already-talented students who seek them out. Some
professionals, he explained, take courses to learn skills such as machine learning or
iOS programming that weren't in vogue or didn't even exist when they earned their
computer science degrees just a decade ago.
Furthermore, with students able to learn at their own pace, there's a lot of valuable
information disseminated in the discussion forums.
Free access to the best teachers around doesn't hurt either. Ng said he couldn't teach
his course so well if he hadn't spent so much time living in Silicon Valley learning best
practices from some of the smartest computer scientists on the planet. That experience
lets him spend less time teaching algorithms for the sake of algorithms and more time
talking about how one might actually apply machine learning in the field.
Ng says that's more important than just understanding the algorithms in a vacuum. He
compares it to learning how to write a computer program instead of just learning the
syntax of a programming language but not being able to string commands together into
something useful. This approach isn't entirely unique among the new order of online
educators: On Udacity, for example, Google VP and Stanford professor Sebastian
Thrun centers the Computer Science 101 curriculum around learning Python in the
context of building a working search engine.
The value of this opportunity wasn't lost on Tandalla. He said he can feel the passion
that professors have even through the pre-recorded video lectures, and it feels good
knowing you're learning from the people who literally wrote the book on the subject
you're studying.
Who knows who's the next Einstein
But ultimately, minting new data scientists, even Kaggle winners, is low-hanging
fruit. Ng said we don't yet know how much impact online education platforms like
Coursera can have. In all fields, there are talented people all over the world who just
need an avenue to hone their skills and a chance to distinguish themselves.
"It makes me wonder," Ng said, "if the next Albert Einstein is a little girl in Afghanistan
who just needs [the opportunity to access quality education]."
// comments :
---- x
I think that Kaggle is an immature way of defining a data scientist because many Kaggle
problems are clearly defined, i.e. the data is in place and there is a metric that needs to
be minimized. Data scientists need to have some sort of intuition about how to quantify
causation vs. correlation, and that intuition is tougher to build through Coursera's ML
course. The real value data scientists provide is figuring out the correct questions to ask
that can be answered by the data, and having the vision to see implementable results
from raw data. That being said, it's pretty damn easy to get that start with the toolbox.
---- x
Good comment, and I agree to a degree. Kaggle competitions don't necessarily align
with the accepted definition of what a data scientist does, but they certainly require
exercising part of that skill set.
The bigger picture, though, probably is online education platforms such as Coursera et
al. If you're already working with a company's data and know the business inside and
out, you can become dangerous with the right new skills.
--- x
The top three Hewlett finishers took machine learning w/ Andrew Ng. I don't know about
the rest, but Luis also took AI, probabilistic models (I believe) and NLP.
I along with 4 colleagues of my company also took the free ML and AI class offered last
year by Andrew and Sebastian / Norvig. We completed both the courses and the ML
course was really helpful. It introduced me to a lot of things I wasn't familiar with (I am a
CSE graduate but didn't have an ML course in my undergrad programme). After the
completion of the courses, my colleagues also participated in a contest on Kaggle,
though they didn't win. The outcome of these courses is that I am working remotely (from Dhaka,
Bangladesh) for a US startup and helping them build their product recommendation
engine. Though it's not rocket science, I got the idea and confidence to work in
this field after I took the ML course. I am also doing another course on Big Data and
Web Intelligence.
Kudos to Coursera and other similar online education ventures
What Does it Mean When a College Kid From Ecuador Beats the Best?
October 6, 2012 - by Tom Vander Ark
Yesterday the William and Flora Hewlett Foundation awarded $100,000 to the top five
teams in an assessment scoring competition. The goal was to build software
systems that could grade short answer responses on state standardized tests as
accurately as trained experts, a more difficult challenge than the essay
competition held in the first quarter of this year where the winning algorithms equaled
Hewlett wants to see deeper learning in American classrooms and recognizes that
assessment can influence instruction. The online assessment of Common Core State
Standards will frame U.S. education the way NCLB and AYP framed the last. The short
window to influence the quality of new state tests, which roll out in 2014-15, offers a big
The Automated Student Assessment Prize (ASAP) was constructed to support the aims
of the two state testing consortia, PARCC and SBAC: better tests of higher order skills
at a lower price point. Meeting these objectives will require automated scoring of
constructed response tasks. Anyone who has taken the time to investigate the
consortia's projected use of Race to the Top funding to influence the comprehensive
overhaul of summative assessment among their 47 state partners can see that building
a next generation of high stakes assessment can only work with a big boost from
technology. So, there's a lot at stake, and ASAP is now the new platform for fairly
testing the capabilities of those solutions.
ASAP was hosted on Kaggle, a venture-backed platform that delivers a compelling data
challenge to a pool of 55,000 machine-learning scientists from around the world. As
evidenced by many Kaggle competitions, well-constructed prizes mobilize talent and
accelerate innovation. ASAP appears to have succeeded on both counts. Like the
first competition, ASAP Phase Two attracted amazing and diverse talent. The three top
competitors came to a CCSSO meeting in Indianapolis to pick up their checks. I'd like to
introduce you to the winners.
Luis Tandalla, 1st place. A Fulbright Scholar at the University of New Orleans, Luis is
from Quito, Ecuador. His mother is an elementary school teacher, but it was a
secondary school teacher that really turned him on to math. He entered about 20 math
competitions and won many of them. He skyped his mom into the award ceremony, so
that she could see her proud son pick up a $50,000 check.
Luis Tandalla
With initial interest in aerospace, Luis is majoring in mechanical engineering. A
newcomer to data science, Luis participated in Stanford's Massive Open Online
Course (MOOC) in machine learning last year from Andrew Ng, and it ignited a passion.
He entered the first phase of ASAP and placed 13th, equaling vendors who have been in
the testing business for decades. After doing so well in data competitions, he's applying
to top computer science graduate programs. (How about we get the kid an H-1B visa so
he can work here?)
Jure Zbontar, 2nd place. Jure grew up and lives in Ljubljana, Slovenia, where he is a
teaching assistant at the local college. He's pursuing a Ph.D. in computer science in the
field of machine learning, but given how well he did we think he'll have a lot of bright
opportunities ahead of him. He was approached by several of the leading vendors in
automated assessment at the Indianapolis conference, who are thirsty for bright young
talent, especially an Eastern Bloc grad student who can beat their team of Ph.D.s.
Jure began programming in high school. He has entered seven Kaggle competitions.
He won a challenge to predict the Eurovision winner.
Luis Tandalla, 1st place (middle); Jure Zbontar, 2nd place (right); Xavier Conort, 3rd
place (left)
Xavier Conort, 3rd place. A French-born actuary, Xavier consults to insurance
companies in Singapore. He holds two master's degrees and is a Chartered Enterprise
Risk Analyst. He has worked in Brazil and China and is married to a Malaysian.
Xavier is the top-rated competitor in the Kaggle community. He has entered at least a
dozen competitions and placed in most.
Xavier notes that there are two types of competitors on Kaggle: computer scientists and
statisticians. He modestly says, "I'm not a very good programmer." It looks like he's
pretty good at pattern recognition.
"It's how I keep learning," said Conort. About the competitions he chooses, Xavier said,
"I like to work on interesting problems where I have a chance to make a contribution."
He's looking forward to getting back home to see his daughter who turned one last
week. He's flying LA-Paris-Tokyo-Singapore.
What it means. Jaison Morgan is my co-director of ASAP. We spent a couple years
searching for a mega-prize that would transform global education. We didn't find one.
But we are convinced after a couple very successful targeted efforts that a sequence of
small, targeted prizes can focus and accelerate innovation.
Second, it's interesting to note that the winners all learned a lot from Andrew Ng's
Stanford MOOC last year. After the success of the first course, Andrew won backing
from Kleiner Perkins to co-found Coursera. Learning is global. A lot of it is free.
Third, the stories above illustrate the fact that well-constructed prizes mobilize global
talent. Seriously, you could not make this stuff up: a kid from Ecuador beat some of the
best testing companies in the world after six weeks of work. Young men (yes, it still is
mostly young men) from Slovenia to Singapore, from Pittsburgh to Poland poured 100
hours a week into the competition hoping to see their name creep up the leaderboard.
Crowdsourcing works.
The fourth lesson from ASAP is that most innovation is translational: something that
worked in one field may work in another. The question is how to expose cross-discipline
and cross-sector innovation? Prizes are a super-efficient means of promoting
translational innovation. Some of the winners from ASAP Phase One are already
helping some of the big testing companies improve their services.
Fifth, in Phase One, we let competitors keep their intellectual property. But in Phase
Two we required that competitors open source their code (GPLv3 license) along with an
instruction manual. Interestingly, there was exactly the same number of competitors in
Phase Two. This suggests that the world, at least data science, is becoming more open.
Last, this is just the beginning of Big Data in education. Next week Digital Learning
Now! will release a paper on comprehensive learner profiles. A warehouse of data on
each student will unleash the power of predictive analytics that will empower teachers to
personalize learning in new and powerful ways. Prizes will accelerate innovation in
It's just about to get fun.
October 15, 2012
Why becoming a data scientist is NOT actually easier than you think
I was just doing some late night reading and came across this article:
TL;DR - You can take the ML course on Coursera and you're magically a data scientist,
because three really intelligent people did it. I disagree.
I'm not claiming the people referenced in this article are not data scientists who score
high in Kaggle competitions. They're probably really intelligent people who picked up a
new skill and excelled at it (although one was already an actuary, so he is basically
doing machine learning in some form already).
Here is my problem with it - being a data scientist usually requires a much larger skill
set than a basic understanding of a few learning algorithms. I'm taking the Coursera ML
course right now, and I think it is great! Here is what I didn't learn, though:
Programming Languages and Other Technologies:
Most data scientists and the companies that employ them are not using Matlab/Octave.
They have backend web services written in Java, Python, Scala, or Ruby. These
languages are not covered. Python has libraries like SciPy, NumPy, and scikit-learn that
are great for solving numerical problems. Java has a bunch of libraries too like the
Mahout math library [2]. R is used by most statisticians (again not covered in the
course). When your boss (or a customer) comes to you and says you need to integrate
an algorithm into a pre-existing web service (for example, they need a recommendation
engine), and you say "I only know Matlab," that is going to be a huge problem. You don't
just pick up Java/Python/C++/Scala/whatever in a few days on the job. You have to be
somewhat familiar with these languages to understand large, pre-existing code bases. It
wouldn't hurt to have a decent understanding of existing technologies like Django, ROR,
Groovy, Lift, etc. because you're going to have to integrate your amazing algorithm into
one of them. If you only know Python but the rest of the company is using Java, you
better know about Thrift, Avro, Google ProtoBufs or something similar. I digress ...
Big Data Software:
Most data scientists are working on problems that can't be run on a single 512MB RAM
machine (the data sets on Coursera are tiny). They have large data sets that require
distributed processing. To do this, you need to understand map-reduce and distributed file
systems, and be able to utilize Hadoop. Even if you don't know Java, you still need to
know that Hadoop streaming exists, how to use it, and know a scripting language
(Hadoop streaming does not currently support Matlab or Octave). Again - not something
you just pick up in a few days. If you're going to be doing distributed machine learning,
you will probably want to use Mahout. Some of the algorithms covered in the Coursera
course have distributed versions implemented in Mahout (Clustering, PCA,
Regression), but most do not exist yet (SVM, Distributed-User-Recommendations, and
cross-validation/performance metrics for distributed versions of the algorithms). You're
going to have to know Java (and the Hadoop/Mahout APIs) to implement them from
scratch, or use a different algorithm that you may or may not be familiar with. Even if
you do not need to use a distributed algorithm, it would probably be good to know how
to spin up a 64GB instance on EC2, log in, install some software, and run your algorithm
on the cloud.
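To make the Hadoop streaming point concrete, here is a minimal sketch in Python of the mapper/reducer pattern it relies on: a mapper emits tab-separated key/value records, the framework sorts them by key, and a reducer aggregates each run of identical keys. The word-count task, the function names, and the local `run_locally` helper are all assumptions chosen for illustration; on a real cluster the two functions would live in separate scripts that read standard input and are handed to the streaming jar through its -mapper and -reducer options.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records: Iterable[str]) -> Iterator[str]:
    """Sum the counts for each key; the input must already be sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines: Iterable[str]) -> List[str]:
    """Simulate map -> shuffle/sort -> reduce on a single machine (hypothetical helper)."""
    return list(reducer(sorted(mapper(lines))))
```

Calling `run_locally(["big data big", "data science"])` walks the same three stages a cluster would, just without the distribution. That also makes it a cheap way to debug a mapper and reducer before submitting a job.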
Other useful learning algorithms
Coursera skipped over Bayesian learning [1]. A lot of systems use this (or some form of
this) in production, but you would know nothing about it (I'm not saying you couldn't
learn it, but I am obviously disagreeing with this article).
Feature Extraction:
You can use every algorithm from the ML course and build hundreds of different
classifiers to solve a real-world problem (you could even combine them!), but if your
features suck, the performance of your classifier is going to suck also. Extracting good
features usually requires a deep understanding of the problem, the underlying
distributions of the data, and/or a familiarity with how the data is being generated. It
might help to also know about Convolution, Wavelets, Time Series Analysis, Digital
Signal Processing, Fourier Transforms, etc. Feature extraction could be a whole
Coursera course by itself.
Data cleaning:
Data preprocessing - Coursera sets up all the data sets for you. They even write the
scripts to load the data (see the week 6 SVM email classification; they wrote all of the
regular expressions to clean the emails for you). That doesn't work in the real world.
Real-world data is ugly and unstructured. You need to know regular expressions
and UNIX commands like sed, grep, tr, cut, sort, awk, and map/reduce to clean these
data sets up and put them into "Coursera" format. Notice I said UNIX commands, which
implies you need to be somewhat comfortable on UNIX/Linux, which may be a steep
learning curve if you're currently using Windows.
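As a rough illustration of the kind of regular-expression cleanup described above, here is a small Python sketch that normalizes a raw email body before feature extraction. It is not the course's actual preprocessing script. The function name and the placeholder tokens (`httpaddr`, `emailaddr`, `number`, `dollar`) are assumptions picked for the example.

```python
import re


def clean_email(text: str) -> str:
    """Normalize a raw email body into lower-case, space-separated tokens."""
    text = text.lower()
    text = re.sub(r"<[^<>]+>", " ", text)                  # strip HTML tags
    text = re.sub(r"(http|https)://[^\s]*", "httpaddr", text)  # collapse URLs
    text = re.sub(r"[^\s]+@[^\s]+", "emailaddr", text)     # collapse email addresses
    text = re.sub(r"[0-9]+", " number ", text)             # collapse digit runs
    text = re.sub(r"[$]+", " dollar ", text)               # collapse dollar signs
    text = re.sub(r"[^a-z\s]", " ", text)                  # drop remaining punctuation
    return " ".join(text.split())                          # squeeze whitespace
```

The order of the substitutions matters: URLs and addresses are collapsed before digits, so a number inside a link does not get rewritten separately. Real data usually needs many more rules than this, which is the point of the section.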
Probability & Statistics
The ML course touches on some of these topics, but real-world problems usually
require a much deeper understanding. Examples:
Are your features dependent or independent (chi-squared test)?
How do you interpret p-values?
How do you set up confidence intervals?
What is the F-test?
What is standard error?
Should I use AROCs to test the performance of this algorithm?
What is a ROC curve?
What is hypothesis testing, and when can you / do you use it?
I could go on, but you get the point. It's important to know this stuff and know how and
when to apply it.
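A few of the questions above can be answered in a handful of lines, which may help show what is being asked. The sketch below uses only the Python standard library and is illustrative rather than production-grade: the function names are hypothetical, the 95% interval uses the normal approximation (a t-distribution would be more appropriate for small samples), and the AUC is computed by brute force over all positive/negative pairs.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def standard_error(sample: Sequence[float]) -> float:
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))


def confidence_interval_95(sample: Sequence[float]) -> Tuple[float, float]:
    """Normal-approximation 95% interval: mean +/- 1.96 * standard error."""
    m, se = mean(sample), standard_error(sample)
    return m - 1.96 * se, m + 1.96 * se


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs in which the positive example scores higher; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For instance, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` returns 0.75, because three of the four positive/negative pairs are ranked correctly. Knowing when such a number is the right thing to report is the harder part.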
Where is all this data being stored? It's fine in flat files for the purposes of the ML
course, but when you show up to your first day at the job, it's going to be stored in
MySQL, Postgres, MongoDB, Cassandra, CouchDB, and/or on the HDFS. You're going to
have to improvise.
If you think Matlab 2D plots are awesome, check out D3.js. Oh, I forgot to mention, you
need to know JavaScript and functional programming, and get past the general learning curve for the
D3.js APIs.
Debugging is briefly covered in the ML course (Neural Networks - Gradient Checking,
which was an optional programming exercise), but when an algorithm isn't working
correctly, you're going to have to quit treating it like a black box and dive in. That's when
shit gets real - you actually do need to know about Conjugate Gradients, Partial
Differential Equations, Numerical Analysis, Lagrange Multipliers, Numerical Linear
Algebra, Convex Optimization, Vector Calculus, Stochastic Processes, and potentially a
lot of other subjects. The ML course is only 8 weeks, and they can't cover these topics,
because they require years of experience, and probably some form of technical training
beyond a BS in whatever. You could just ask the guy with a few years more experience
that is about to take your job though.
Beyond what I have highlighted above, being a good data scientist (usually) takes years
of experience. It requires more than just knowing how machine learning algorithms
work. It's knowing what questions to ask and how to convey the answers to investors,
management, and customers. No online course, even from Stanford professors, is
going to be able to teach you that.
Discussion on HN here:
You can follow me on twitter: @josephmisiti
// -- Comments
I think you made really good points.
Would be nice if you post some references to the various topics you mentioned though!
I don't think Coursera makes you a data scientist, for which, as you pointed out, a whole
set of skills and experience is required. I still think, though, that it is a valuable initiative that
at least gives you a sense of what Machine Learning involves.
I think I can be a Data Scientist someday (3-4 years?). I am comfortable with Java,
JavaScript, Data Structures & Algorithms, Learning Hadoop & MapReduce. The only
challenge I see is Math.
December 29, 2012
Permutations vs. Combinations: A refresher
Here is a refresher for anyone who has forgotten the difference
between permutations and combinations.
Permutations: order matters
Combinations : order does not matter
The four different situations you can encounter:
n = number of items
r = number of choices

Permutations with replacement:
n^r
Permutations without replacement:
n! / (n-r)!
Combinations with replacement:
(n+r-1)! / (r! (n-1)!)
Combinations without replacement (think the lottery):
1. pretend that order matters (permutation)
2. alter the equation so that it does not (divide by r!, the number of orderings of each selection)
"n choose r" = n! / (n-r)! * 1/r! = n! / (r! (n-r)!)
How 0xdata wants to help everyone become data scientists
by Derrick Harris
AUG. 14, 2012
Although it's still a work in progress, 0xdata thinks it has the answer to the problem of doing advanced
statistical analysis at scale: Build on HDFS for scale, use the widely known R programming language and
hide it all under a simple interface.
There's a trend afoot in the big data space to turn data science from black magic into
child's play, and one of the newest companies trying to pull off this technological
alchemy is 0xdata. The bootstrapped startup, pronounced "hexadata," is the brainchild
of former DataStax engineer and Platfora co-founder SriSatish Ambati, and it's trying
to blend Hadoop, R and Google BigQuery into the ultimate tool for statistical analysis.
Scientists, data analysts or whoever ultimately uses the product only need to be experts
in their domains, not in statistics.
At its core, 0xdata's flagship product, called H2O, is a statistical analysis engine that
uses the Hadoop Distributed File System (HDFS) as its storage platform, but the goal is
to make it as simple as using a Google service such as BigQuery. Users will interact
with H2O via a simple web-search-like bar and standard R statistical-analysis syntax, but
H2O will run machine-learning algorithms behind the scenes. Alternatively, users can
call out to H2O from Microsoft Excel or the RStudio integrated development
environment using a REST API.
Although BigQuery is a SQL service hosted by Google, 0xdata follows a similar
theory on simplicity.
However they choose to leverage the product, Ambati said, the scale of the underlying
data and the complexity of running advanced analysis are details that need to be
hidden. It's the same theory that underlies Platfora, the company Ambati co-founded
last year with his former DataStax colleague Ben Werther, although their approaches
appear to be different. Whereas Platfora is trying to disrupt the data warehouse
market by building a next-generation user experience atop Hadoop, 0xdata is trying to
change the way users interact with popular statistical software such as R.
But either way, Ambati says of new data-analysis products, "[There are] no bragging
rights for making it simple. If you don't do that, you won't be able to go forward."
0xdata is also putting a focus on speed, both in terms of how fast it processes data and
how fast it lets users react. "Google search changed our thinking around how many
questions people can ask successively," Ambati explained, and data analysts should
have the same experience. That's why H2O provides approximate results at every step
in the analysis process. Rather than wait for the entire job to run and the exact results to
be computed, users can get a general idea of results and kill the job and start over
quicker if they're completely outside the expected range.
But it will be a while before the public gets a chance to see whether H2O lives up to its
promises. Ambati said the product is just four months into development and won't have
its first set of algorithms available for another few months. His team of eight engineers
has built a lot of cool stuff, but now it needs to round out the process and turn its code
for H2O into an actual product.
Still, having decided to tackle data as a system, Ambati and his team are having a lot of
fun. "We are live-and-die-with-infrastructure people," he said, but for a bunch of folks
who spent a lot of time learning math, it's like going back to their days as computer
science students.