
Web Data Mining

Lab 4
Andreas Papadopoulos
andpapad+epl451@gmail.com
andpapad@cs.ucy.ac.cy

Intro to Mahout!

Mahout
Scalable Machine Learning and Data Mining
On top of Apache Hadoop
Use the map/reduce paradigm
http://mahout.apache.org/
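To make the map/reduce paradigm concrete, here is a toy word count in plain Python that mimics the map, shuffle/sort, and reduce phases of a Hadoop job. This is an illustrative sketch only, not Mahout or Hadoop code.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Emit (word, 1) pairs, as a Hadoop mapper would."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Group pairs by key and sum counts, as the shuffle + reducer would."""
    by_key = sorted(pairs, key=itemgetter(0))  # shuffle/sort step
    return {key: sum(v for _, v in group)
            for key, group in groupby(by_key, key=itemgetter(0))}

docs = ["mahout rides hadoop", "hadoop scales", "mahout scales"]
counts = reduce_phase(map_phase(docs))
print(counts)  # {'hadoop': 2, 'mahout': 2, 'rides': 1, 'scales': 2}
```

In a real cluster the mapper and reducer run on different machines and the framework handles the shuffle; the logic per record is the same.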

Lab 4 - Mahout

Who uses Mahout?

What's in the name?


A mahout is a person who drives an elephant.
The name Mahout comes from the project's
use of Apache Hadoop for scalability and
fault tolerance

History

Challenges
Large amounts of input data
Techniques work better with more data
Nature of the deployment context

Must produce results quickly


The amount of input is so large that it is not
feasible to process it all on one computer,
even a powerful one

Scalability
Mahout core algorithms are implemented on
top of Apache Hadoop using the Map/Reduce
paradigm

Four main use cases (currently)


Recommendation
takes user behavior and tries to predict items users might like

Clustering
takes documents and groups topically related documents

Classification
tries to assign documents to a correct category based on examples

Frequent itemset mining


takes a set of item groups and identifies which items usually appear
together

Common use cases

Recommend friends/dates/products
Classify content into predefined groups
Find similar content based on object properties
Find associations/patterns in actions/behaviors
Identify key topics in large collections of text
Detect anomalies in machine output
Ranking search results
Others?

Mahout
For a complete list of algorithms visit :

https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Goal: Be as fast and efficient as possible given the
intrinsic design of the algorithm
Some algorithms won't scale to massive machine clusters
Others fit logically on a MapReduce framework like Apache
Hadoop
Still others will need other distributed programming models
Be pragmatic

Most Mahout implementations are Map Reduce enabled

Algorithms and Applications

Clustering
Call it fuzzy grouping based on a notion of
similarity

Mahout Clustering
Plenty of Algorithms: K-Means,
Fuzzy K-Means, Mean Shift,
Canopy, Dirichlet
Group similar looking objects
Notion of similarity: Distance measure:

Euclidean
Cosine
Tanimoto
Manhattan
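The four distance measures above can be sketched in a few lines of plain Python. These are illustrative only; Mahout ships its own `DistanceMeasure` implementations.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between the vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def tanimoto(a, b):
    """Extended Jaccard coefficient over real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

a, b = [1.0, 2.0, 3.0], [2.0, 2.0, 2.0]
print(euclidean(a, b), manhattan(a, b), cosine_similarity(a, b), tanimoto(a, b))
```

Note that cosine and Tanimoto are similarities (higher = closer), while Euclidean and Manhattan are distances (lower = closer); clustering code must treat them accordingly.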

Classification
Predicting the type of a new object based on its features
The types are predetermined

Dog

Cat

Mahout Classification
Plenty of algorithms

Naïve Bayes
Complementary Naïve Bayes
Random Forests
Logistic Regression (SGD)
Support Vector Machines

Learn a model from manually classified data


Predict the class of a new object based on its
features and the learned model
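The learn-then-predict workflow can be sketched as a minimal naive Bayes text classifier in plain Python. This is a toy single-machine illustration; Mahout's implementation runs as distributed MapReduce jobs.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (label, text). Count words per class and class sizes."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for label, text in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def predict(model, text):
    """Pick the class with the highest log prior + smoothed log likelihood."""
    word_counts, class_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)  # Laplace smoothing
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

model = train([("cat", "meow purr"), ("dog", "woof bark"), ("dog", "bark growl")])
print(predict(model, "bark"))  # dog
```

The same two steps (fit a model on labeled examples, then score new objects against it) apply to all the classifiers listed above.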

Recommendations
Predict what the user likes based on
His/Her historical behavior
Aggregate behavior of people similar to him

Mahout Recommenders
Different types of recommenders
User based
Item based

Full framework for storage, online and offline
computation of recommendations
Like clustering, there is a notion of similarity in users
or items
Cosine, Tanimoto, Pearson and LLR
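A user-based recommender can be sketched in a few lines: score each unrated item by the similarity-weighted ratings of other users. This toy Python version uses cosine similarity over dense rating vectors; Mahout's Taste framework does the same idea at scale with pluggable similarity measures.

```python
import math

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(ratings, user, top_n=1):
    """Rank items the user hasn't rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, r in enumerate(other_ratings):
            if ratings[user][item] == 0 and r > 0:  # 0 = not yet rated
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

ratings = {
    "alice": [5, 4, 0],   # alice hasn't rated item 2
    "bob":   [5, 5, 4],
    "carol": [1, 0, 5],
}
print(recommend(ratings, "alice"))  # [2]
```

Because alice's tastes resemble bob's far more than carol's, bob's high rating for item 2 dominates and item 2 is recommended.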

Frequent Pattern Mining


Find interesting groups of items based on how
they co-occur in a dataset

Mahout Parallel FPGrowth


Identify the most commonly
occurring patterns from
Sales Transactions
buy Milk, eggs and bread
Query Logs
ipad -> apple, tablet, iphone
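The idea of finding items that co-occur can be illustrated by naive pair counting over market baskets. This brute-force sketch is only for intuition; Mahout's Parallel FPGrowth avoids enumerating all pairs by building FP-trees across the cluster.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support=2):
    """Count every item pair per basket; keep pairs meeting min_support."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["milk", "eggs", "bread"],
    ["milk", "bread"],
    ["eggs", "beer"],
    ["milk", "eggs", "bread", "beer"],
]
print(frequent_pairs(baskets))  # e.g. ('bread', 'milk') appears 3 times
```

The pair-counting approach explodes combinatorially on real datasets, which is exactly why FP-Growth (and its parallel variant) exists.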

Spam Detection
Yahoo!
http://www.slideshare.net/hadoopusergroup/mailantispam

Install Mahout
sudo apt-get install maven2
wget http://apache.osuosl.org//mahout/0.5/mahout-distribution-0.5-src.tar.gz
tar xvf mahout-distribution-0.5-src.tar.gz

sudo mvn install


cd core/
mvn compile
cd ../examples
mvn compile
cd ..
./bin/mahout

Pattern mining

Frequent Pattern Mining


Data: http://fimi.cs.helsinki.fi/data/
Download
http://www.cs.ucy.ac.cy/courses/EPL451/lectures/Lab/new_accidents.tar.gz
Extract data file: tar xvf new_accidents.tar.gz
Start hadoop
Upload data
hadoop dfs -copyFromLocal ./new_accidents.dat new_accidents.dat
<MAHOUT_PATH>/mahout fpg -i new_accidents.dat -o patterns -method mapreduce -g 10 -regex '[\ ]'
./mahout seqdumper --seqFile patterns/fpgrowth/part-r-00000

Frequent Pattern Mining


mahout fpg
MapReduce (parallel) implementation of FP Growth
Algorithm for frequent Itemset Mining
Runs each stage of PFPGrowth as described in the
paper:
http://infolab.stanford.edu/~echang/recsys08-69.pdf

Frequent Pattern Mining

From: PFP: Parallel FP-Growth for Query Recommendation, by: Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang
In Proceedings of the 2008 ACM conference on Recommender systems

Frequent Pattern Mining


Example Output

.
Key: 218: Value: ([17, 18, 16, 171, 218],7), ([17, 18, 16, 21, 15, 171, 218],6), ([17, 18, 16, 21, 171, 218],6), ([17, 18,
16, 43, 171, 218],6), ([17, 18, 16, 31, 218],6), ([18, 16, 24, 171, 218],6), ([17, 18, 16, 29, 218],6), ([17, 18, 25, 171,
218],6), ([18, 16, 31, 171, 218],6), ([17, 18, 16, 24, 218],6), ([18, 16, 25, 171, 218],6), ([17, 18, 16, 25, 218],6), ([17,
18, 43, 218],6), ([17, 18, 16, 21, 29, 15, 218],5), ([17, 18, 16, 31, 21, 15, 218],5), ([17, 18, 16, 21, 43, 15, 218],5),
([17, 18, 16, 31, 29, 218],5), ([17, 18, 21, 29, 15, 218],5), ([17, 18, 16, 21, 29, 218],5), ([18, 16, 31, 21, 15, 218],5),
([17, 16, 31, 21, 15, 218],5), ([17, 18, 16, 31, 21, 218],5), ([17, 18, 16, 21, 43, 218],5), ([17, 18, 31, 43, 218],5), ([18,
16, 43, 15, 218],5), ([17, 18, 29, 43, 218],5), ([17, 18, 16, 21, 29, 27, 43, 218],4), ([17, 18, 16, 29, 27, 43, 218],4),
([17, 18, 16, 31, 21, 29, 218],4), ([17, 18, 21, 29, 27, 43, 218],4), ([17, 18, 16, 21, 27, 43, 218],4), ([17, 18, 16, 31,
21, 43, 218],4), ([16, 21, 29, 27, 43, 218],4), ([17, 16, 31, 21, 43, 218],4), ([21, 29, 27, 43, 218],4)
Key: 175: Value: ([17, 12, 18, 16, 43, 28, 1, 175],7), ([17, 12, 18, 16, 28, 1, 175],7), ([17, 12, 18, 16, 43, 28, 175],7),
([17, 12, 18, 16, 43, 1, 175],7), ([17, 12, 18, 16, 43, 175],7), ([12, 18, 28, 1, 175],7), ([16, 43, 28, 1, 175],7), ([12, 16,
43, 1, 175],7), ([17, 12, 18, 16, 31, 43, 175],6), ([17, 12, 18, 16, 29, 43, 175],6), ([17, 12, 18, 16, 31, 175],6), ([17,
18, 16, 31, 43, 175],6), ([17, 12, 18, 29, 43, 175],6), ([12, 16, 31, 43, 175],6), ([17, 12, 18, 16, 31, 21, 29, 27, 43,
175],5), ([12, 16, 31, 29, 175],5), ([21, 43, 175],5)
Key: 306: Value: ([17, 18, 13, 93, 306],8), ([17, 18, 43, 306],7), ([17, 12, 18, 306],7), ([17, 12, 18, 43, 306],6), ([17,
12, 18, 28, 306],6), ([17, 18, 29, 306],6), ([43, 15, 306],6), ([17, 12, 18, 43, 28, 306],5), ([17, 12, 18, 29, 306],5),
([17, 12, 18, 27, 306],5), ([17, 18, 29, 43, 306],5), ([17, 12, 18, 29, 27, 306],4), ([17, 12, 18, 29, 43, 306],4), ([17, 12,
18, 27, 28, 306],4), ([17, 12, 18, 29, 28, 306],4), ([17, 12, 18, 27, 43, 306],4), ([17, 12, 18, 21, 27, 306],4), ([17, 18,
21, 27, 306],4), ([17, 18, 31, 306],4), ([17, 12, 18, 29, 27, 28, 306],3), ([17, 12, 18, 21, 27, 28, 306],3), ([17, 12, 18,
29, 27, 43, 306],3), ([17, 12, 18, 27, 43, 28, 306],3), ([17, 12, 18, 21, 27, 43, 306],3), ([17, 12, 18, 21, 29, 27, 306],3),
([17, 12, 18, 29, 43, 28, 306],3), ([17, 18, 16, 31, 29, 306],3), ([17, 18, 21, 29, 27, 306],3), ([17, 18, 21, 27, 28,
306],3), ([17, 12, 18, 21, 43, 306],3), ([17, 21, 27, 43, 306],3), ([17, 18, 31, 43, 306],3), ([18, 21, 27, 43, 306],3),
([17, 12, 18, 31, 306],3), ([18, 16, 306],3), ([16, 29, 306],3)

Sources
Mahout Hands on! Ted Dunning, Robin Anil
OSCON, Portugal 2011
Introduction to Scalable Machine Learning
with Apache Mahout, Grant Ingersoll, Thinking
Lucid, February 15, 2010
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

http://mahout.apache.org/
