Lecture 1: Introduction
Logistics - Basic Info.
Class meetings:
Time: M/W 4:30-5:45pm
Location: Tempe COOR 170
Format: instructor presentation (before spring break) + student presentation (after break)
Logistics - Grades
Grades (total 100% + 5%)
Exam: 30%
To test the basic material
Before the spring break
Limited open book: one A4 sheet of paper, with any font size/margin
Assignment: 30% - group paper-reading presentation
Each (advanced) topic covers 3-5 papers.
Each topic will be assigned to one presentation team and one challenging team.
Before class: both teams will read these 3-5 papers.
In class: (both teams will be graded)
The presentation team will present the topic (20-30 minutes);
The challenging team will ask ~5 (challenging) questions
Each person will be assigned to (assignments will be announced before spring break)
one presentation team (20%)
one challenging team (10%)
Class Project: 40% (details later)
Project proposal 10%
Final presentation 15%
Final report 15%
Class Participation: 5% (bonus points)
Grade Assignment: (No Scaling will be applied)
A+ (97-100%), A (94-97%), A- (90-94%), B+ (87-90%), B (84-87%), B- (80-84%), C+ (76-80%), C (70-76%), D (60-70%), E (0-60%).
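The weights and letter bands above can be checked with a short calculation. The following is a minimal sketch, not the official grading script. It assumes each component is scored on a 0-100 scale, that the participation bonus is capped at 5 points, and that a score sitting exactly on a boundary (for example 97) falls in the higher band; the slides do not say which band owns a boundary value. The function names are illustrative.

```python
# Component weights from the grading slide (exam, assignment, project).
WEIGHTS = {"exam": 0.30, "assignment": 0.30, "project": 0.40}

# (lower bound, letter) pairs, highest band first.
# Assumption: the lower bound of each band is inclusive.
GRADE_BANDS = [
    (97, "A+"), (94, "A"), (90, "A-"),
    (87, "B+"), (84, "B"), (80, "B-"),
    (76, "C+"), (70, "C"), (60, "D"),
    (0, "E"),
]


def final_percentage(scores, bonus=0.0):
    """Weighted total of 0-100 component scores plus up to 5 bonus points."""
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return total + min(max(bonus, 0.0), 5.0)


def letter_grade(percentage):
    """Map a final percentage to a letter; no scaling is applied."""
    for lower_bound, letter in GRADE_BANDS:
        if percentage >= lower_bound:
            return letter
    return "E"


example = final_percentage({"exam": 90, "assignment": 95, "project": 92}, bonus=3)
print(round(example, 1), letter_grade(example))
```

Because of the bonus, a final percentage can exceed 100; the sketch maps any such value to A+.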
Logistics - Late Policy
Late policy (for assignment + project proposal, presentation and reports, etc.):
Each person has 2 slip days in total for the whole semester. After that, 20% deduction per day of delay
The minimum unit for delay is 1 day
No penalty for a medical emergency (a doctor's note is required)
All assignments are due at the beginning of the class meeting time
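As a worked illustration of the late policy, here is a minimal sketch under stated assumptions: slip days are consumed automatically before any deduction, the 20% deduction is taken from the score the submission earned (the slide does not say whether it is relative or in absolute points), and the score cannot go below zero. The function name and parameters are hypothetical.

```python
import math


def apply_late_policy(score, days_late, slip_days_left=2,
                      daily_deduction=0.20, excused=False):
    """Return (penalized score, slip days remaining).

    days_late may be fractional; it is rounded up because the
    minimum unit of delay is one day.
    """
    if excused or days_late <= 0:
        # Documented medical emergency or on-time submission: no penalty.
        return float(score), slip_days_left

    days = math.ceil(days_late)
    slip_used = min(days, slip_days_left)
    penalized_days = days - slip_used

    # Assumption: deduction is a fraction of the earned score, floored at 0.
    factor = max(0.0, 1.0 - daily_deduction * penalized_days)
    return score * factor, slip_days_left - slip_used


# Half a day late counts as one full day and uses one slip day.
print(apply_late_policy(100, 0.5))
# Three days late with one slip day left: two penalized days.
print(apply_late_policy(80, 3, slip_days_left=1))
```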
Logistics - Textbooks
Required Text: Mining the Web: Discovering Knowledge from Hypertext Data, by Soumen Chakrabarti. Morgan Kaufmann
Optional: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, by Bing Liu. Springer
Logistics - Misc.
Sitting in: OK,
but you will be asked to leave if the room exceeds capacity
Add to portfolio (for master's students): OK
Check your department's policy
Food: no food in class (except water)
Cell phones: keep on silent
What is "Semantic Web Mining"?
Q0: Why Web Retrieval/Search is not Enough?
Traditional Web (Info.) Retrieval Model
Task: get rid of mice in a politically correct way
Classic IR Goals
Relevance vs. Semantics
Challenges in Web Retrieval System: Document Base
Document Base: Web
Challenge: Web Data
Challenges in Web Retrieval System: Users (behind the query)
Web Search Query
Different User Needs
Challenges in Web Retrieval System: Interaction
Challenges: Interaction
Query Distribution
Challenges: Interaction
Challenges in Web Retrieval System: IR System
Challenges: Bag-of-Words Representation
Challenge: Text Similarity Models
Challenges: Summary
The Big Challenge of Web Retrieval:
Meet diverse user needs, given
their poorly formed queries
the size and heterogeneity of the web corpus
Possible (and Promising) Solution:
Semantic Web Mining (This Course)
What is "Semantic Web Mining"?
Q0: Why Web Retrieval/Search is not Enough?
Semantic Web Mining
Semantic Web Mining = Semantic Web +
Web Mining
A1: use semantics to improve mining
A2: use mining results to generate semantics
Web Mining: extracts implicit knowledge
The Semantic Web: makes knowledge machine-understandable
Key Word: Machine Understandable
Tim Berners-Lee's Vision:
Web as a means of collaboration for people
Web as a means of collaboration for machines
Semantic Web is a web of data that machines can understand too.
Web Mining
Knowledge discovery (aka data mining):
the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (formal def.)
Finding interesting patterns from data (CF's def.)
keywords: data + patterns
Examples: neighbourhood, association, anomalies
Web Mining:
[Figure: conferences PKDD, SDM, PAKDD, KDD, ICML, ICDM, CIKM, ICDE, annotated with numeric weights]
Data mining: the textbook version
(A simplified extract from the adult dataset in the UCI machine learning repository)
Data analysis: the reality
data mining / knowledge discovery process
...
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] "GET /search.html?t=jane%20austen&SID=023785&ord=asc HTTP/1.0" 200 1759
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] "GET /search.html?t=jane%20austen&m=video&SID=023785&ord=desc HTTP/1.0" 200 8450
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06:41 +0100] "GET /view.asp?id=3456&SID=023785 HTTP/1.0" 200 3478
...
What is the meaning of the attributes?
What is the meaning of the attribute values?
Data modelling is only one part!
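To make the "meaning of the attributes" question concrete, the access-log lines above can be split into named fields before any mining starts. This is a minimal sketch, assuming the lines follow the common web-server access-log layout (host, identity, user, timestamp, request, status, size). The field names and the `parse_line` helper are illustrative choices, not part of the lecture.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit, parse_qs

# One access-log line: host ident user [time] "METHOD url PROTOCOL" status size.
# \s* after the timestamp tolerates a missing space before the request.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\]\s*'
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)"\s+'
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = None if rec["size"] == "-" else int(rec["size"])
    parts = urlsplit(rec["url"])
    rec["path"] = parts.path
    # Keep the first value of each query parameter, URL-decoded.
    rec["params"] = {k: v[0] for k, v in parse_qs(parts.query).items()}
    return rec


LINES = [
    'p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] '
    '"GET /search.html?t=jane%20austen&SID=023785&ord=asc HTTP/1.0" 200 1759',
    'p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] '
    '"GET /search.html?t=jane%20austen&m=video&SID=023785&ord=desc HTTP/1.0" 200 8450',
    'p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06:41 +0100] '
    '"GET /view.asp?id=3456&SID=023785 HTTP/1.0" 200 3478',
]

records = [parse_line(line) for line in LINES]
for rec in records:
    print(rec["time"].isoformat(), rec["path"], rec["params"])
```

Parsing recovers the syntax only. That `t` is a search term, `m` a media filter, and `SID` a session identifier is background knowledge about the site, which is exactly the gap the slide points at.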
Where does semantics come in?
Semantics
Web Mining - Link Ranking
Web Mining - Click Graph
Click Graph - Construction
Click Graph - Node Distribution
Click Graph - Connected Component
Web Mining - User Intention
Classified Queries
Classified Queries by Topics
Web Mining - Spam
Semantic Web
How is information represented in the actual Web?
As documents written in natural language
As graphs, pictures, tables, videos, and other multimedia
Humans are good at:
deducing facts from some (incomplete) information
creating associations between facts
aggregating information from several sources
But, machines:
cannot use partial (or incomplete) information
have difficulties aggregating several sources of information
can read but cannot understand information
Semantic Web: Integrating Data
Semantic Web
Representing existing data (which is meaningful only to people) in a form that machines can understand.
This means annotating data with metadata.
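One common way to annotate data with metadata is to state facts as (subject, predicate, object) triples that a program can query. The following is a minimal plain-Python sketch of that idea, not a real RDF toolkit. The `ex:` identifiers and the `match` helper are hypothetical names chosen for illustration; the facts themselves are the two textbooks listed for this course.

```python
# Each statement is a (subject, predicate, object) triple.
# The "ex:" names are made-up identifiers, not a published vocabulary.
TRIPLES = [
    ("ex:MiningTheWeb", "ex:type", "ex:Book"),
    ("ex:MiningTheWeb", "ex:author", "Soumen Chakrabarti"),
    ("ex:MiningTheWeb", "ex:publisher", "Morgan Kaufmann"),
    ("ex:WebDataMining", "ex:type", "ex:Book"),
    ("ex:WebDataMining", "ex:author", "Bing Liu"),
    ("ex:WebDataMining", "ex:publisher", "Springer"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples that agree with every given (non-None) position."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Which things are books?" and "Which book is published by Springer?"
books = [s for s, _, _ in match(TRIPLES, p="ex:type", o="ex:Book")]
springer_books = [s for s, _, _ in match(TRIPLES, p="ex:publisher", o="Springer")]
print(books)
print(springer_books)
```

Once the facts are explicit, a machine can answer such questions without parsing natural-language text.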
What does an ontology look like? Examples
A2: Mining to Learn Ontologies
A1: Use Ontology to Improve Mining
Recap & Announcements
Doing a thesis (for CSE master's students)
Office visit outside OH
Enroll in class
Prerequisite
Slides for Bing Liu's textbook
http://www.cs.uic.edu/~liub/WebMiningBook.html
Lecture Slides - BB (this week)
No Class Meeting Next Monday - enjoy MLK Day
A real example:
Watson DeepQA
Topics not covered in this class
(but very important)
Crawling
Software Architecture (Yahoo! Challenge)
Data Cleaning and Pre-processing
Web Design
Privacy & Security (2014: year of data breach)
Human Computation
Crawling - the very first step
Data Cleaning & Pre-processing
Software Architecture
Web Design
Privacy, Privacy, Privacy
Human Computation / Crowdsourcing
Computers are incredibly fast, accurate, and stupid