
CIS787: FINAL REVIEW
Reza Zafarani (reza@data.syr.edu)

DATA MINING
Given lots of data, the data mining process discovers
patterns and models that are:

1) Valid: they hold on new data with some certainty
2) Useful (Actionable): it should be possible to act on them
3) Unexpected: non-obvious to the system
4) Understandable: humans should be able to interpret the pattern

Patterns are the relationships and summaries derived through
a data mining exercise.

THIS CLASS: CIS787
This class overlaps with machine learning, statistics,
artificial intelligence, and databases, but with more stress on:
 Fundamental Data Mining Algorithms
 Scalability (big data)
 Algorithms
 Hands-on Experience
 Go to www.awseducate.com

[Figure: Venn diagram placing Data Mining at the overlap of Statistics, Machine Learning, and Database Systems]
WHAT WE HAVE COVERED
 Data Types / Data Preprocessing
 Sampling / Stratified Sampling
 Curse of Dimensionality
 PCA
 Discretization
 Decision Trees
 Hunt’s / ID3 / C4.5 / Oblique Trees
 Gini / Entropy
 Evaluation (Accuracy / Recall / F-Measure / AUC)
 KNN / Naïve Bayes
 Support Vectors / SVM
 DTW
 Linear / Multivariate / Logistic / Lasso Regression
 Bagging / Bootstrap Samples / Boosting / AdaBoost
 Partitional Clustering / K-means / Objective Function / Centroids
 Bisecting K-means
 Hierarchical Clustering (Min, Max, group average, centroid, Ward) -> Dendrogram
 Inversion / Globular Clusters / Chain Effect
 Lance-Williams Formula
 Minimum Spanning Tree Divisive Clustering
 SSE / Cohesion / Separation / Silhouette Index
 DBScan / Chameleon
 Frequent Itemsets
 Support / Confidence / Lift
 Apriori
 Maximal / Closed Itemsets
 Rank of a matrix
 SVD / Dimensionality Reduction with SVD
 SVD and Eigen Decomposition
 CUR Decomposition
 Power Method
 MapReduce / Mapper / Reducer / Combiner
 Sampling a Fixed Proportion
 Reservoir Sampling
 Bloom Filter
 Flajolet-Martin
 AMS Method for Computing Moments
 Shingling / Minhashing / LSH
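As a quick refresher on the streaming topics above, here is a minimal sketch of reservoir sampling (Algorithm R) in Python. This is illustrative code written for this review, not the exact implementation from the lectures:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i survives with probability k/(i+1): pick a slot in [0, i]
            # and replace only if the slot falls inside the reservoir.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1000), 10)
print(len(sample))  # 10
```

Each item in the stream ends up in the final reservoir with equal probability k/n, using O(k) memory and a single pass, which is exactly why it suits data-stream settings.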
WHAT WE HAVEN’T COVERED
 Data Preprocessing
   Wavelet Transforms (Haar)
 * MDS / Multidimensional Scaling
 * Embedding Techniques
   ISOMAP!
   MLLE
 Similarity-based methods
   Similarity between graphs
 Other Association Rule Mining methods
   * FP-Tree / FP-Growth
 Other clustering methods
   CLARANS / DENCLUE / NMF / BIRCH / CURE / CLIQUE / PROCLUS / ORCLUS
 Ensembles
   Random Forest
 Classification
   Fisher LDA
   Rule Induction
   Bayesian Networks
   Kernel Methods
   Neural Networks (MLP / Perceptron / Hebbian Learning)
 Semi-Supervised Learning
   Co-training / Self-training
   Active Learning
 Clustering data streams
   STREAM / CluStream
 Text Mining
   Co-clustering
   PLSA
   Rocchio Method
   Topic Models
 Discrete Sequences
   HMMs
   Prob. Suffix Trees
 Spatial Data
   Trajectory Mining
 Graph Mining
   Graph Isomorphism
   MCGs
 Web Mining
   Collaborative Filtering
   SimRank / TrustRank
   PageRank / HITS
 Social Network Analysis
   Community Detection
   Collective Classification
 Privacy Preserving Algorithms
   K-anonymity / Samarati’s method
WHAT CAN I READ TO KNOW MORE
1. Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Elsevier, 2011.
2. Aggarwal, Charu C. Data Mining: The Textbook. Springer, 2015.
3. Zaki, Mohammed J., and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, 2014.
4. Quinlan, J. Ross. C4.5: Programs for Machine Learning. Elsevier, 2014.
5. Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics, 2001.
6. Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
7. Schölkopf, Bernhard, and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
8. Koller, Daphne, and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
ENGINEERING AND COMPUTER SCIENCE | SYRACUSE UNIVERSITY
WE DON’T NEED NO BOOKS!
 Stanford - Machine Learning Course
   https://www.youtube.com/playlist?list=PLA89DCFA6ADACE599
 Caltech - Machine Learning Course - CS 156
   https://www.youtube.com/playlist?list=PLD63A284B7615313A


COMPETITIONS
 Kaggle.com
 KDD Cup
   http://www.kdd.org/kdd-cup
   Participate in KDD Cups!
 CIKM Cup


WHAT’S NEXT: 3RD EXAM



WHAT’S AFTER THE CLASS
[possibility] Advanced Data Mining (Fall 2018)
 Sparse Learning / More math and stats
 Large-Scale Machine Learning/Sketching/etc.
 Graph Mining
[possibility] Machine Learning (Fall 2018)
[possibility] Spectral Graph Theory (Fall 2018)
[possibility] or the same course, but more refined
[possibility] Social Media Mining (Spring 2019)
 Mine Big Data on Social Media
 Study Networks, Users, Influentials, Content on Social Media
 How does information propagate on Social Networks?
 Measure Influence, Model Human Behavior, Determine Communities
FINALLY!
• You have done a lot!!!
• I hope you have learned a lot!
• You spent time answering questions, proving results, and preparing for quizzes
• You have implemented a number of methods
• And you will do great on the 3rd exam!

Thank you for the Hard Work!!!
