Lab 4
Andreas Papadopoulos
andpapad+epl451@gmail.com
andpapad@cs.ucy.ac.cy
Intro to Mahout!
Mahout
Scalable Machine Learning and Data Mining
On top of Apache Hadoop
Use the map/reduce paradigm
http://mahout.apache.org/
Lab 4 - Mahout
History
Challenges
Large amounts of input data
Techniques work better with more data
Nature of the deploying context
Scalability
Mahout core algorithms are implemented on
top of Apache Hadoop using the Map/Reduce
paradigm
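As a rough illustration of the Map/Reduce paradigm mentioned above, here is a minimal single-machine sketch in plain Python. It does not use Hadoop or Mahout; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and the word-count task is only the usual teaching example, not a Mahout algorithm.

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: combine all values emitted for one key."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

docs = ["Mahout runs on Hadoop", "Hadoop runs map and reduce"]
print(word_count(docs))
```

On a real cluster the mappers and reducers run on different machines and the shuffle moves data over the network; the structure of the computation stays the same.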
Clustering
takes documents and groups topically related documents
Classification
tries to assign documents to a correct category based on examples
Recommend friends/dates/products
Classify content into predefined groups
Find similar content based on object properties
Find associations/patterns in actions/behaviors
Identify key topics in large collections of text
Detect anomalies in machine output
Ranking search results
Others?
Mahout
For a complete list of algorithms visit :
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Goal: Be as fast and efficient as possible given the
intrinsic design of the algorithm
Some algorithms won't scale to massive machine clusters
Others fit logically on a Map Reduce framework like Apache
Hadoop
Still others will need other distributed programming models
Be pragmatic
Clustering
Call it fuzzy grouping based on a notion of
similarity
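To make "grouping based on a notion of similarity" concrete, here is a minimal k-means sketch in plain Python. It is a single-machine illustration, not Mahout's implementation; the function names, the fixed iteration count, and the sample points are assumptions chosen for the example.

```python
import math

def mean(points):
    """Component-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, [(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Here similarity is Euclidean distance (math.dist, Python 3.8+); swapping in another distance measure changes which points end up grouped together.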
Mahout Clustering
Plenty of Algorithms: K-Means,
Fuzzy K-Means, Mean Shift,
Canopy, Dirichlet
Group similar looking objects
Notion of similarity: Distance measure:
Euclidean
Cosine
Tanimoto
Manhattan
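The four distance measures listed above can be sketched in a few lines of plain Python. These are the textbook definitions, written for dense, non-zero vectors of equal length; they are not Mahout's classes, and the function names are chosen for this example only.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the vectors.
    Assumes neither vector is all zeros."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def tanimoto_distance(a, b):
    """1 minus the Tanimoto coefficient a.b / (|a|^2 + |b|^2 - a.b).
    Assumes the vectors are not both all zeros."""
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)

u, v = [1, 1, 0], [1, 0, 1]
for measure in (euclidean, manhattan, cosine_distance, tanimoto_distance):
    print(measure.__name__, measure(u, v))
```

Euclidean and Manhattan depend on vector magnitude, while cosine distance depends only on direction, which is why cosine is a common choice for text vectors of different document lengths.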
Classification
Predicting the type of a new object based on its features
The types are predetermined
Dog
Cat
Mahout Classification
Plenty of algorithms
Naive Bayes
Complementary Naive Bayes
Random Forests
Logistic Regression (SGD)
Support Vector Machines
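As an illustration of the first algorithm in the list, here is a minimal Naive Bayes classifier in plain Python, using Laplace (add-one) smoothing. It is a sketch of the general technique, not Mahout's implementation; the function names and the tiny dog/cat training set are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (features, label) pairs, features being a list of strings."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        label_counts[label] += 1
        feature_counts[label].update(features)
        vocabulary.update(features)
    return label_counts, feature_counts, vocabulary

def classify(model, features):
    """Return the label with the highest log-probability for the given features."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        for f in features:
            # Add-one smoothing keeps unseen features from zeroing the product.
            score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    (["barks", "fetches", "tail"], "dog"),
    (["barks", "tail", "loyal"], "dog"),
    (["meows", "purrs", "tail"], "cat"),
    (["meows", "climbs", "whiskers"], "cat"),
]
model = train(training)
print(classify(model, ["barks", "tail"]))
print(classify(model, ["meows", "purrs"]))
```

The types (dog, cat) are fixed by the training examples, which matches the point that the categories are predetermined.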
Recommendations
Predict what the user likes based on
Their historical behavior
Aggregate behavior of people similar to them
Mahout Recommenders
Different types of recommenders
User based
Item based
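A user-based recommender can be sketched as follows: find users whose ratings resemble the target user's, then predict ratings for unseen items as a similarity-weighted average. This is a plain-Python illustration, not Mahout's recommender API; the function names, the use of cosine similarity, and the sample ratings are assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two {item: rating} dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only recommend unseen items
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]

ratings = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 5, "C": 4, "D": 1},
    "carol": {"A": 1, "C": 2, "D": 5},
}
print(recommend(ratings, "alice"))
```

An item-based recommender uses the same idea with the roles swapped: it compares items by the users who rated them, rather than users by the items they rated.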
Spam Detection
Yahoo!
http://www.slideshare.net/hadoopusergroup/mailantispam
Install Mahout
sudo apt-get install maven2
wget http://apache.osuosl.org//mahout/0.5/mahout-distribution-0.5-src.tar.gz
tar xvf mahout-distribution-0.5-src.tar.gz
Pattern mining
From: PFP: Parallel FP-Growth for Query Recommendation, by Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang.
In Proceedings of the 2008 ACM Conference on Recommender Systems.
Key: 218: Value: ([17, 18, 16, 171, 218],7), ([17, 18, 16, 21, 15, 171, 218],6), ([17, 18, 16, 21, 171, 218],6), ([17, 18,
16, 43, 171, 218],6), ([17, 18, 16, 31, 218],6), ([18, 16, 24, 171, 218],6), ([17, 18, 16, 29, 218],6), ([17, 18, 25, 171,
218],6), ([18, 16, 31, 171, 218],6), ([17, 18, 16, 24, 218],6), ([18, 16, 25, 171, 218],6), ([17, 18, 16, 25, 218],6), ([17,
18, 43, 218],6), ([17, 18, 16, 21, 29, 15, 218],5), ([17, 18, 16, 31, 21, 15, 218],5), ([17, 18, 16, 21, 43, 15, 218],5),
([17, 18, 16, 31, 29, 218],5), ([17, 18, 21, 29, 15, 218],5), ([17, 18, 16, 21, 29, 218],5), ([18, 16, 31, 21, 15, 218],5),
([17, 16, 31, 21, 15, 218],5), ([17, 18, 16, 31, 21, 218],5), ([17, 18, 16, 21, 43, 218],5), ([17, 18, 31, 43, 218],5), ([18,
16, 43, 15, 218],5), ([17, 18, 29, 43, 218],5), ([17, 18, 16, 21, 29, 27, 43, 218],4), ([17, 18, 16, 29, 27, 43, 218],4),
([17, 18, 16, 31, 21, 29, 218],4), ([17, 18, 21, 29, 27, 43, 218],4), ([17, 18, 16, 21, 27, 43, 218],4), ([17, 18, 16, 31,
21, 43, 218],4), ([16, 21, 29, 27, 43, 218],4), ([17, 16, 31, 21, 43, 218],4), ([21, 29, 27, 43, 218],4)
Key: 175: Value: ([17, 12, 18, 16, 43, 28, 1, 175],7), ([17, 12, 18, 16, 28, 1, 175],7), ([17, 12, 18, 16, 43, 28, 175],7),
([17, 12, 18, 16, 43, 1, 175],7), ([17, 12, 18, 16, 43, 175],7), ([12, 18, 28, 1, 175],7), ([16, 43, 28, 1, 175],7), ([12, 16,
43, 1, 175],7), ([17, 12, 18, 16, 31, 43, 175],6), ([17, 12, 18, 16, 29, 43, 175],6), ([17, 12, 18, 16, 31, 175],6), ([17,
18, 16, 31, 43, 175],6), ([17, 12, 18, 29, 43, 175],6), ([12, 16, 31, 43, 175],6), ([17, 12, 18, 16, 31, 21, 29, 27, 43,
175],5), ([12, 16, 31, 29, 175],5), ([21, 43, 175],5)
Key: 306: Value: ([17, 18, 13, 93, 306],8), ([17, 18, 43, 306],7), ([17, 12, 18, 306],7), ([17, 12, 18, 43, 306],6), ([17,
12, 18, 28, 306],6), ([17, 18, 29, 306],6), ([43, 15, 306],6), ([17, 12, 18, 43, 28, 306],5), ([17, 12, 18, 29, 306],5),
([17, 12, 18, 27, 306],5), ([17, 18, 29, 43, 306],5), ([17, 12, 18, 29, 27, 306],4), ([17, 12, 18, 29, 43, 306],4), ([17, 12,
18, 27, 28, 306],4), ([17, 12, 18, 29, 28, 306],4), ([17, 12, 18, 27, 43, 306],4), ([17, 12, 18, 21, 27, 306],4), ([17, 18,
21, 27, 306],4), ([17, 18, 31, 306],4), ([17, 12, 18, 29, 27, 28, 306],3), ([17, 12, 18, 21, 27, 28, 306],3), ([17, 12, 18,
29, 27, 43, 306],3), ([17, 12, 18, 27, 43, 28, 306],3), ([17, 12, 18, 21, 27, 43, 306],3), ([17, 12, 18, 21, 29, 27, 306],3),
([17, 12, 18, 29, 43, 28, 306],3), ([17, 18, 16, 31, 29, 306],3), ([17, 18, 21, 29, 27, 306],3), ([17, 18, 21, 27, 28,
306],3), ([17, 12, 18, 21, 43, 306],3), ([17, 21, 27, 43, 306],3), ([17, 18, 31, 43, 306],3), ([18, 21, 27, 43, 306],3),
([17, 12, 18, 31, 306],3), ([18, 16, 306],3), ([16, 29, 306],3)
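Each record in the output above appears to pair a key item with a list of (pattern, count) entries, where the count is presumably the support of that item set. The following plain-Python sketch parses one such record, with any line wrapping removed, into a structure that is easier to work with. The parse_record name and the regular expressions are assumptions based only on the text format shown here, not on any Mahout API.

```python
import re

# One "([item, item, ...],count)" entry.
PATTERN_RE = re.compile(r"\(\[([^\]]*)\],\s*(\d+)\)")
# The "Key: <item>: Value: <entries>" wrapper around the entries.
RECORD_RE = re.compile(r"\s*Key:\s*(\S+?):\s*Value:\s*(.*)", re.S)

def parse_record(record):
    """Return (key, [(items, count), ...]) for one output record."""
    match = RECORD_RE.match(record)
    if not match:
        raise ValueError("not a 'Key: ...: Value: ...' record")
    key, body = match.groups()
    patterns = [
        ([item.strip() for item in items.split(",")], int(count))
        for items, count in PATTERN_RE.findall(body)
    ]
    return key, patterns

sample = "Key: 218: Value: ([17, 18, 16, 171, 218],7), ([17, 18, 16, 21, 15, 171, 218],6)"
key, patterns = parse_record(sample)
print(key)
for items, count in patterns:
    print(count, items)
```

Items are kept as strings because the identifiers are opaque labels; convert them with int() if the input data used numeric item IDs.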
Sources
Mahout Hands on! Ted Dunning, Robin Anil
OSCON, Portland 2011
Introduction to Scalable Machine Learning
with Apache Mahout, Grant Ingersoll, Thinking
Lucid, February 15, 2010
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
http://mahout.apache.org/