
Web Data Mining

Lab 4
Andreas Papadopoulos
andpapad+epl451@gmail.com
andpapad@cs.ucy.ac.cy

Intro to Mahout!

Mahout
Scalable Machine Learning and Data Mining
On top of Apache Hadoop
Use the map/reduce paradigm
http://mahout.apache.org/
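To make the map/reduce paradigm concrete, here is a toy word count in plain Python that mimics the map, shuffle/sort, and reduce phases of a Hadoop job. This is an illustrative sketch only, not Mahout or Hadoop code.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Emit (word, 1) pairs, as a Hadoop mapper would."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Group pairs by key and sum counts, as the shuffle + reducer would."""
    by_key = sorted(pairs, key=itemgetter(0))  # shuffle/sort step
    return {key: sum(v for _, v in group)
            for key, group in groupby(by_key, key=itemgetter(0))}

docs = ["mahout rides hadoop", "hadoop scales", "mahout scales"]
counts = reduce_phase(map_phase(docs))
print(counts)  # {'hadoop': 2, 'mahout': 2, 'rides': 1, 'scales': 2}
```

In a real cluster the mapper and reducer run on different machines and the framework handles the shuffle; the logic per record is the same.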

Lab 4 - Mahout

Who uses Mahout?

What's in the name?


A mahout is a person who drives an elephant.
The name Mahout comes from the project's
use of Apache Hadoop for scalability and
fault tolerance

History

Challenges
Large amounts of input data
Techniques work better with more data
Nature of the deployment context

Must produce results quickly


The amount of input is so large that it is not
feasible to process it all on one computer,
even a powerful one

Scalability
Mahout core algorithms are implemented on
top of Apache Hadoop using the Map/Reduce
paradigm

Four main use cases (currently)


Recommendation
takes user behavior and tries to predict items users might like

Clustering
takes documents and groups topically related documents

Classification
tries to assign documents to a correct category based on examples

Frequent itemset mining


takes a set of item groups and identifies which items usually appear
together

Common use cases

Recommend friends/dates/products
Classify content into predefined groups
Find similar content based on object properties
Find associations/patterns in actions/behaviors
Identify key topics in large collections of text
Detect anomalies in machine output
Ranking search results
Others?

Mahout
For a complete list of algorithms visit :

https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Goal: Be as fast and efficient as possible given the
intrinsic design of the algorithm
Some algorithms won't scale to massive machine clusters
Others fit logically on a MapReduce framework like Apache
Hadoop
Still others will need other distributed programming models
Be pragmatic

Most Mahout implementations are Map Reduce enabled

Algorithms and Applications

Clustering
Call it fuzzy grouping based on a notion of
similarity

Mahout Clustering
Plenty of Algorithms: K-Means,
Fuzzy K-Means, Mean Shift,
Canopy, Dirichlet
Group similar looking objects
Notion of similarity: Distance measure:

Euclidean
Cosine
Tanimoto
Manhattan
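The four distance measures above can be sketched in a few lines of plain Python. These are illustrative only; Mahout ships its own `DistanceMeasure` implementations.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between the vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def tanimoto(a, b):
    """Extended Jaccard coefficient over real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

a, b = [1.0, 2.0, 3.0], [2.0, 2.0, 2.0]
print(euclidean(a, b), manhattan(a, b), cosine_similarity(a, b), tanimoto(a, b))
```

Note that cosine and Tanimoto are similarities (higher = closer), while Euclidean and Manhattan are distances (lower = closer); clustering code must treat them accordingly.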

Classification
Predicting the type of a new object based on its features
The types are predetermined

Dog

Cat

Mahout Classification
Plenty of algorithms

Naïve Bayes
Complementary Naïve Bayes
Random Forests
Logistic Regression (SGD)
Support Vector Machines

Learn a model from manually classified data


Predict the class of a new object based on its
features and the learned model
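The learn-then-predict workflow can be sketched as a minimal naive Bayes text classifier in plain Python. This is a toy single-machine illustration; Mahout's implementation runs as distributed MapReduce jobs.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (label, text). Count words per class and class sizes."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for label, text in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def predict(model, text):
    """Pick the class with the highest log prior + smoothed log likelihood."""
    word_counts, class_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)  # Laplace smoothing
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

model = train([("cat", "meow purr"), ("dog", "woof bark"), ("dog", "bark growl")])
print(predict(model, "bark"))  # dog
```

The same two steps (fit a model on labeled examples, then score new objects against it) apply to all the classifiers listed above.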

Recommendations
Predict what the user likes based on
His/Her historical behavior
Aggregate behavior of people similar to him

Mahout Recommenders
Different types of recommenders
User based
Item based

Full framework for storage, online and offline
computation of recommendations
Like clustering, there is a notion of similarity in users
or items
Cosine, Tanimoto, Pearson and LLR
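A user-based recommender can be sketched in a few lines: score each unrated item by the similarity-weighted ratings of other users. This toy Python version uses cosine similarity over dense rating vectors; Mahout's Taste framework does the same idea at scale with pluggable similarity measures.

```python
import math

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(ratings, user, top_n=1):
    """Rank items the user hasn't rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, r in enumerate(other_ratings):
            if ratings[user][item] == 0 and r > 0:  # 0 = not yet rated
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

ratings = {
    "alice": [5, 4, 0],   # alice hasn't rated item 2
    "bob":   [5, 5, 4],
    "carol": [1, 0, 5],
}
print(recommend(ratings, "alice"))  # [2]
```

Because alice's tastes resemble bob's far more than carol's, bob's high rating for item 2 dominates and item 2 is recommended.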

Frequent Pattern Mining


Find interesting groups of items based on how
they co-occur in a dataset

Mahout Parallel FPGrowth


Identify the most commonly
occurring patterns from
Sales Transactions
buy Milk, eggs and bread
Query Logs
ipad -> apple, tablet, iphone
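The idea of finding items that co-occur can be illustrated by naive pair counting over market baskets. This brute-force sketch is only for intuition; Mahout's Parallel FPGrowth avoids enumerating all pairs by building FP-trees across the cluster.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support=2):
    """Count every item pair per basket; keep pairs meeting min_support."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["milk", "eggs", "bread"],
    ["milk", "bread"],
    ["eggs", "beer"],
    ["milk", "eggs", "bread", "beer"],
]
print(frequent_pairs(baskets))  # e.g. ('bread', 'milk') appears 3 times
```

The pair-counting approach explodes combinatorially on real datasets, which is exactly why FP-Growth (and its parallel variant) exists.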

Spam Detection
Yahoo!
http://www.slideshare.net/hadoopusergroup/mailantispam

Install Mahout
sudo apt-get install maven2
wget http://apache.osuosl.org//mahout/0.5/mahout-distribution-0.5-src.tar.gz
tar xvf mahout-distribution-0.5-src.tar.gz

sudo mvn install


cd core/
mvn compile
cd ../examples
mvn compile
cd ..
./bin/mahout

Pattern mining

Frequent Pattern Mining


Data: http://fimi.cs.helsinki.fi/data/
Download
http://www.cs.ucy.ac.cy/courses/EPL451/lectures/Lab/new_accidents.tar.gz
Extract data file: tar xvf new_accidents.tar.gz
Start hadoop
Upload data
hadoop dfs -copyFromLocal ./new_accidents.dat new_accidents.dat
<MAHOUT_PATH>/mahout fpg -i new_accidents.dat -o patterns -method mapreduce -g 10 -regex '[\ ]'
./mahout seqdumper --seqFile patterns/fpgrowth/part-r-00000

Frequent Pattern Mining


mahout fpg
MapReduce (parallel) implementation of FP Growth
Algorithm for frequent Itemset Mining
Runs each stage of PFPGrowth as described in the
paper:
http://infolab.stanford.edu/~echang/recsys08-69.pdf

Frequent Pattern Mining

From: PFP: Parallel FP-Growth for Query Recommendation, by: Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang
In Proceedings of the 2008 ACM conference on Recommender systems

Frequent Pattern Mining


Example Output

.
Key: 218: Value: ([17, 18, 16, 171, 218],7), ([17, 18, 16, 21, 15, 171, 218],6), ([17, 18, 16, 21, 171, 218],6), ([17, 18,
16, 43, 171, 218],6), ([17, 18, 16, 31, 218],6), ([18, 16, 24, 171, 218],6), ([17, 18, 16, 29, 218],6), ([17, 18, 25, 171,
218],6), ([18, 16, 31, 171, 218],6), ([17, 18, 16, 24, 218],6), ([18, 16, 25, 171, 218],6), ([17, 18, 16, 25, 218],6), ([17,
18, 43, 218],6), ([17, 18, 16, 21, 29, 15, 218],5), ([17, 18, 16, 31, 21, 15, 218],5), ([17, 18, 16, 21, 43, 15, 218],5),
([17, 18, 16, 31, 29, 218],5), ([17, 18, 21, 29, 15, 218],5), ([17, 18, 16, 21, 29, 218],5), ([18, 16, 31, 21, 15, 218],5),
([17, 16, 31, 21, 15, 218],5), ([17, 18, 16, 31, 21, 218],5), ([17, 18, 16, 21, 43, 218],5), ([17, 18, 31, 43, 218],5), ([18,
16, 43, 15, 218],5), ([17, 18, 29, 43, 218],5), ([17, 18, 16, 21, 29, 27, 43, 218],4), ([17, 18, 16, 29, 27, 43, 218],4),
([17, 18, 16, 31, 21, 29, 218],4), ([17, 18, 21, 29, 27, 43, 218],4), ([17, 18, 16, 21, 27, 43, 218],4), ([17, 18, 16, 31,
21, 43, 218],4), ([16, 21, 29, 27, 43, 218],4), ([17, 16, 31, 21, 43, 218],4), ([21, 29, 27, 43, 218],4)
Key: 175: Value: ([17, 12, 18, 16, 43, 28, 1, 175],7), ([17, 12, 18, 16, 28, 1, 175],7), ([17, 12, 18, 16, 43, 28, 175],7),
([17, 12, 18, 16, 43, 1, 175],7), ([17, 12, 18, 16, 43, 175],7), ([12, 18, 28, 1, 175],7), ([16, 43, 28, 1, 175],7), ([12, 16,
43, 1, 175],7), ([17, 12, 18, 16, 31, 43, 175],6), ([17, 12, 18, 16, 29, 43, 175],6), ([17, 12, 18, 16, 31, 175],6), ([17,
18, 16, 31, 43, 175],6), ([17, 12, 18, 29, 43, 175],6), ([12, 16, 31, 43, 175],6), ([17, 12, 18, 16, 31, 21, 29, 27, 43,
175],5), ([12, 16, 31, 29, 175],5), ([21, 43, 175],5)
Key: 306: Value: ([17, 18, 13, 93, 306],8), ([17, 18, 43, 306],7), ([17, 12, 18, 306],7), ([17, 12, 18, 43, 306],6), ([17,
12, 18, 28, 306],6), ([17, 18, 29, 306],6), ([43, 15, 306],6), ([17, 12, 18, 43, 28, 306],5), ([17, 12, 18, 29, 306],5),
([17, 12, 18, 27, 306],5), ([17, 18, 29, 43, 306],5), ([17, 12, 18, 29, 27, 306],4), ([17, 12, 18, 29, 43, 306],4), ([17, 12,
18, 27, 28, 306],4), ([17, 12, 18, 29, 28, 306],4), ([17, 12, 18, 27, 43, 306],4), ([17, 12, 18, 21, 27, 306],4), ([17, 18,
21, 27, 306],4), ([17, 18, 31, 306],4), ([17, 12, 18, 29, 27, 28, 306],3), ([17, 12, 18, 21, 27, 28, 306],3), ([17, 12, 18,
29, 27, 43, 306],3), ([17, 12, 18, 27, 43, 28, 306],3), ([17, 12, 18, 21, 27, 43, 306],3), ([17, 12, 18, 21, 29, 27, 306],3),
([17, 12, 18, 29, 43, 28, 306],3), ([17, 18, 16, 31, 29, 306],3), ([17, 18, 21, 29, 27, 306],3), ([17, 18, 21, 27, 28,
306],3), ([17, 12, 18, 21, 43, 306],3), ([17, 21, 27, 43, 306],3), ([17, 18, 31, 43, 306],3), ([18, 21, 27, 43, 306],3),
([17, 12, 18, 31, 306],3), ([18, 16, 306],3), ([16, 29, 306],3)

Sources
Mahout Hands on! Ted Dunning, Robin Anil
OSCON, Portugal 2011
Introduction to Scalable Machine Learning
with Apache Mahout, Grant Ingersoll, Thinking
Lucid, February 15, 2010
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

http://mahout.apache.org/
