Lecture 1: Introduction
Lectures: Prof. Dr. Matthias Renz
Exercises: TBA
About the course
• Class schedule
  • Lectures: Monday 12:15 - 1:45 pm, CAP 2, HS F; Tuesday 8:15 – 9:45 am, LMS4, R325.
  • Exercise Class: Tuesday 10:15 – 11:45 am, CAP 2, LMS2 - R.Ü3.
  • Homework Assignments: assignments almost every week, discussed in the Exercise Class.
• Office hours
  • Prof. Renz: Tuesdays, 1:00 - 2:00 pm, CAP 4, R708.
  • Assistants (Lukas Galke): by appointment (email: mras@informatik.uni-kiel.de).
• Exam
  • Final Exam: date/time TBA, in written form.
  • The exam will be based on the material discussed in class, with a focus on the exercises.
  • There are no requirements for admission to the final exam.
• Grade
  • Final written exam at the end.
Knowledge Discovery and Data Mining: Introduction 2
Who am I?
• Diploma in Electrical Engineering
  Munich University of Applied Sciences, Germany, 1997-2002.
• Diploma in Computer Science
  Department of Computer Science, Ludwig-Maximilians Universität (LMU) Munich, Germany, 2002.
• PhD in Computer Science
  Department of Computer Science, Ludwig-Maximilians Universität (LMU) Munich, Germany, 2006.
• Habilitation in Computer Science
  Department of Computer Science, Ludwig-Maximilians Universität (LMU) Munich, Germany, 2011.
• Acting chair of the Database Systems Group
  Department of Computer Science, Ludwig-Maximilians Universität (LMU) Munich, Germany, 2015.
• Associate Professor
  Department of Computational & Data Science, George Mason University, 01/15/2016 – 06/01/2018.
• Professor at CAU Kiel, 07/01/2018 – present.
• My Portfolio:
  Applications: Big Data Analytics, Industrie4.0, eScience, Data-driven Science
  Environments: Sensor Networks, Embedded Systems, Mobile Devices, Privacy, Reliability
  Data: High Dimensional, Enriched Geo, Uncertain, Multimedia
  Methods:
  • Managing: Data Management, Similarity Models, Indexing
  • Searching: Similarity Search, Query Processing
  • Pattern Mining: Clustering, Outlier Detection, Frequent Pattern Mining
• Data Mining is often associated with Machine Learning. It is an area that has taken much of its inspiration and techniques from machine learning (and some, also, from statistics), but is put to different ends.
• Data Mining is about using statistics as well as other programming methods to find patterns hidden in the data so that you can explain some phenomenon.
• Machine Learning uses Data Mining techniques and other learning algorithms to build models of what is happening behind some data so that it can predict future outcomes.
• “A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Microsoft)
• “Machine learning is the next Internet” (Tony Tether, Director, DARPA)
• “Machine learning is the hot new thing” (John Hennessy, President, Stanford)
• “Machine learning is going to result in a real revolution” (Greg Papadopoulos, Former CTO, Sun)
• “Machine learning today is one of the hottest aspects of computer science” (Steve Ballmer, CEO, Microsoft)
Source: Pedro Domingos, http://courses.cs.washington.edu/courses/cse446/15sp/slides/intro.pdf
“If “sexy” means having rare qualities that are much in demand, data scientists are already
there. They are difficult and expensive to hire and, given the very competitive market for their
services, difficult to retain. There simply aren’t a lot of people with their combination of
scientific background and computational and analytical skills.”
Source: Harvard Business Review, “Data Scientist: The Sexiest Job of the 21st Century”, October 2012.
• “Big data” and “analytics” are business terms rather than technical ones, and both are ill-defined.
“Big data is like teenage sex: everyone talks about it, nobody really knows
how to do it, everyone thinks everyone else is doing it, so everyone claims
they are doing it.”
Source: Dan Ariely, Duke University
• Nevertheless, we now have more data than ever, and the infrastructure to deal with it
  • → more opportunities and challenges for data mining and machine learning
• Huge amounts of data are collected nowadays from different application domains.
• “We are drowning in information but starving for knowledge” (John Naisbitt)
• The amount and complexity of the collected data do not allow for manual analysis.
[Figure: examples of data sources: IoT, Telecommunication, Internet]
• The Internet of Things (IoT) is the network of physical objects or "things" embedded with electronics, software, sensors, and network connectivity, which enables these objects to collect and exchange data.
Source: https://en.wikipedia.org/wiki/Internet_of_Things
Source: http://blogs.cisco.com/diversity/the-internet-of-things-infographic
Image source: http://tinyurl.com/prtfqxf
Paradigms of science over time:
• Empirical (thousand years ago): description of natural phenomena
• Theoretical (last few hundred years): development of models
• Computer-driven (last few decades): simulation of complex phenomena
• Today: data-intensive science
Source: http://research.microsoft.com/en-us/um/people/gray/talks/nrc-cstb_escience.ppt
Examples of data sources: data intensive science
“Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help
researchers manipulate and explore massive datasets.”
Slide from: http://research.microsoft.com/en-us/um/people/gray/talks/nrc-cstb_escience.ppt
From data to knowledge
Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data.
[Fayyad, Piatetsky-Shapiro, and Smyth 1996]
Remarks:
• valid: the discovered patterns should also hold for new, previously unseen problem instances.
• novel: at least to the system and preferably to the user
• potentially useful: they should lead to some benefit to the user or task
• ultimately understandable: the end user should be able to interpret the patterns either immediately or after
some post-processing
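The "valid" criterion can be made concrete with a small check. The following is only a minimal sketch under stated assumptions: the data, the threshold rule, the function names `rule_accuracy` and `is_valid`, and the tolerance are all hypothetical and not part of the lecture. It treats a pattern as valid if it holds on unseen data almost as well as on the data it was discovered on.

```python
# Hypothetical pattern: "value > threshold implies a positive label".
# All data and names below are made up for illustration.

def rule_accuracy(records, threshold):
    """Fraction of records on which the threshold rule agrees with the label."""
    hits = sum((value > threshold) == label for value, label in records)
    return hits / len(records)

def is_valid(threshold, discovery, unseen, tolerance=0.3):
    """Call the pattern valid if its accuracy on unseen data drops by at
    most `tolerance` compared with the data it was discovered on."""
    drop = rule_accuracy(discovery, threshold) - rule_accuracy(unseen, threshold)
    return drop <= tolerance

# Data the pattern was found on, and new, previously unseen instances.
discovery = [(1.0, False), (2.0, False), (3.0, True),
             (4.0, True), (2.5, False), (3.5, True)]
unseen = [(0.5, False), (3.2, True), (2.6, True), (4.5, True)]

threshold = 2.75  # assumed to have been found on the discovery data
print(rule_accuracy(discovery, threshold))
print(rule_accuracy(unseen, threshold))
print(is_valid(threshold, discovery, unseen))
```

The tolerance is a design choice: a stricter value rejects more patterns that fit the discovery data by chance, at the cost of rejecting some that generalize.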
Target data
Preprocessing/Cleaning:
• Integration of data from
different data sources
• Missing values
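The two preprocessing tasks above can be sketched as follows. This is an illustrative example only: the two sources (`crm`, `shop`), their fields, and the choice of mean imputation are assumptions, not something the lecture prescribes.

```python
from statistics import mean

# Two hypothetical sources describing the same customers, keyed by customer id.
crm = {1: {"age": 34}, 2: {"age": None}, 3: {"age": 52}}
shop = {1: {"spend": 120.0}, 2: {"spend": 80.0}, 4: {"spend": 15.0}}

def integrate(left, right):
    """Outer-join two keyed sources into one record per key.
    Fields that a source does not provide stay None (missing)."""
    merged = {}
    for key in sorted(set(left) | set(right)):
        record = {"age": None, "spend": None}
        record.update(left.get(key, {}))
        record.update(right.get(key, {}))
        merged[key] = record
    return merged

def impute_mean(records, field):
    """Replace missing values of `field` with the mean of the observed values."""
    observed = [r[field] for r in records.values() if r[field] is not None]
    fill = mean(observed)
    for r in records.values():
        if r[field] is None:
            r[field] = fill
    return records

data = integrate(crm, shop)
for field in ("age", "spend"):
    impute_mean(data, field)

for key, record in data.items():
    print(key, record)
```

Mean imputation is only one of several options; dropping incomplete records or using the median are common alternatives, and the right choice depends on the data.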
Preprocessed data
Transformation:
• Select useful features
• Feature transformation/
discretization
• Dimensionality reduction
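A minimal sketch of two of the transformation steps listed above, under stated assumptions: feature selection is approximated by dropping zero-variance columns, and discretization uses equal-width bins. The data, column names, and function names are hypothetical. Dimensionality reduction (for example, PCA) is left out to keep the example short.

```python
from statistics import pvariance

# Made-up records: each row is (age, spend, constant_flag).
rows = [(23, 10.0, 1), (35, 55.0, 1), (47, 20.0, 1), (59, 90.0, 1)]
names = ["age", "spend", "constant_flag"]

def select_features(rows, names, min_variance=1e-9):
    """Keep only the columns whose variance exceeds `min_variance`;
    a column that never changes carries no information."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    kept_rows = [tuple(row[i] for i in keep) for row in rows]
    return kept_rows, [names[i] for i in keep]

def discretize(values, bins):
    """Equal-width binning: map each value to a bin index in 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just outside the last bin, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]

selected, kept_names = select_features(rows, names)
age_bins = discretize([row[0] for row in selected], bins=3)

print(kept_names)
print(age_bins)
```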
The KDD process and the Data Mining step
Data Mining:
Transformed data
Evaluation:
• Evaluate patterns based on
interestingness measures
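The slide does not say which interestingness measures are meant. As one common illustration, the sketch below computes support, confidence, and lift for an association rule over made-up transactions. The data and function names are assumptions for this example only.

```python
# Made-up market-basket transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent).
    Assumes the antecedent occurs in at least one transaction."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's base rate;
    values above 1 indicate a positive association."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

rule = ({"bread"}, {"butter"})
print("support:   ", support(rule[0] | rule[1], transactions))
print("confidence:", confidence(*rule, transactions))
print("lift:      ", lift(*rule, transactions))
```

A pattern would then be kept or discarded by comparing such values against user-chosen thresholds.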
Patterns
KDD and related fields: Statistics, Machine Learning, Data visualization, Databases, Pattern recognition
KDD and Databases:
• Scalability to large data sets
• New data types (web data, micro-arrays, social data, ...)
• Integration with commercial databases
[Chen, Han & Yu 1996]