
CSE 591 Semantic Web Mining

Arizona State University, Spring 2015

Lecture 1: Introduction

Instructor: Hanghang Tong


Hanghang.tong@asu.edu

1
Logistics - Basic Info.
Class meetings:
Time: M/W 4:30-5:45pm
Location: Tempe COOR 170
Format: instructor presentation (before spring break) + student presentation (after break)

Instructor: Hanghang Tong hanghang.tong@asu.edu

TA: Ms. Chen Chen (chenannie45@gmail.com)


Grader: Mr. Vishal Ruhela (ruhela.vishal@gmail.com)

Office Hours - the major source for your outside-classroom questions:


M 3:00-4:00pm, BYENG 416 (Tong)
T 3:30-5:00pm, BYENG 411AC (Chen)
W 1:00-4:00pm, BYENG 415AB (Ruhela)
starting from the week of Feb. 1st
Before that, email him for help
Th 3:30-5:00pm, BYENG 411AC (Chen)

2
Logistics - Grades
Grades (total 100% + 5%)
Exam: 30%
To test the basic material
Before the spring break
Limited open book: 1 A4 sheet, with any font size/margin
Assignment: 30% - group paper-reading presentation
Each (advanced) topic covers 3-5 papers.
Each topic will be assigned to one presentation team and one challenging team.
Before class: both teams will read these 3-5 papers.
In class: (both teams will be graded)
The presentation team will present the topic (20-30 minutes);
The challenging team will ask ~5 (challenging) questions
Each person will be assigned to (to be announced before spring break):
one presentation team (20%)
one challenging team (10%)
Class Project: 40% (details later)
Project proposal 10%
Final presentation 15%
Final report 15%
Class Participation: 5% (bonus points)
Grade Assignment: (No Scaling will be applied)
A+ (97-100%), A (94-97%), A- (90-94%), B+ (87-90%), B (84-87%), B- (80-84%), C+ (76-80%), C (70-76%), D (60-70%), E (0-60%).

3
Logistics - Late Policy
Late policy (for assignment + project proposal, presentation and reports, etc.):
Each person has 2 slip days in total for the whole semester. After that, 20% deduction per day of delay
The minimum unit for delay is 1 day
No penalty for a medical emergency (doctor's note required)
All assignments are due at the beginning of the class meeting time
4
Logistics - textbooks
Required Text: Mining the Web: Discovering Knowledge from Hypertext Data, by Soumen Chakrabarti. Morgan Kaufmann
Optional: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, by Bing Liu. Springer
Logistics - Misc.
Sit-in OK.
But you will be asked to leave if the room capacity is exceeded
Add to portfolio (for master's students) OK
Check your department's policy
Food - no food in class (except water)
Cell phone - keep on silent

6
What is "Semantic Web Mining"?
Q0: Why Web Retrieval/Search is not Enough?

Q1: What is Semantic Web Mining?

7
Traditional Web (Info.) Retrieval Model

8
Traditional Web (Info.) Retrieval Model
Task: Get rid of mice in a politically correct way

9
Classic IR Goals

10
Relevance vs. Semantics

11
Challenges in Web Retrieval System:
Document Base

12
Document Base: Web

13
Challenge: Web Data

14
Challenges in Web Retrieval System:
Users (behind the query)

15
Web Search Query

16
Different User Needs

17
Challenges in Web Retrieval System:
Interaction

18
Challenges: Interaction

19
Query Distribution

20
Challenges: Interaction

21
Challenges in Web Retrieval System:
IR System

22
Challenges: Bag-of-Words
Representation
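
A minimal sketch of why a bag-of-words representation can be a challenge, assuming whitespace tokenization and raw term counts. The helper names `bag_of_words` and `cosine` and the example sentences are illustrative assumptions, not taken from the lecture. Word order is discarded, so sentences with opposite meanings can look identical, while synonyms share no dimension and look unrelated.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Lower-case, split on whitespace, and count terms (word order is lost)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Same words, different meaning: the two bags are identical.
d1 = bag_of_words("dog bites man")
d2 = bag_of_words("man bites dog")

# Similar meaning, different words: the two bags are disjoint.
d3 = bag_of_words("buy cheap car")
d4 = bag_of_words("purchase inexpensive automobile")

sim_reordered = cosine(d1, d2)  # close to 1 although the meaning differs
sim_synonyms = cosine(d3, d4)   # 0 although the meaning is similar
```

Both failure cases come from treating terms as opaque strings, which is the gap that semantics is meant to fill.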

23
Challenge: Text Similarity Models

24
Challenges: Summary
The Big Challenge of Web Retrieval:
Meet the diverse user needs, given:
their poorly made queries
the size and heterogeneity of the web corpus
Possible (and Promising) Solution:
Semantic Web Mining (This Course)

25
What is "Semantic Web Mining"?
Q0: Why Web Retrieval/Search is not Enough?

Q1: What is Semantic Web Mining?

26
Semantic Web Mining
Semantic Web Mining = Semantic Web +
Web Mining
A1: use semantics to improve mining
A2: use mining results to generate semantics
Web Mining: extracts implicit knowledge
The Semantic Web: makes knowledge machine-understandable
Key Word: Machine Understandable
Tim Berners-Lee's Vision:
Web as a means of collaboration for people
Web as a means of collaboration for machines
Semantic Web is a web of data that machines can understand too.
27
Web Mining
Knowledge discovery (aka Data mining):
"the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." (formal def.)
"Finding interesting patterns from data" (CF's def.)
keywords: data + patterns
Examples: neighbourhood, association, anomalies
Web Mining:
the application of data mining techniques on the content, (hyperlink) structure, and usage of Web resources.
[Figure: graph of venues (PKDD, SDM, PAKDD, KDD, ICML, ICDM, CIKM, ICDE, ECML, SIGMOD, DMKD) annotated with numeric scores]

Web Mining Areas


Web Content Mining
Web Structure Mining
Web Usage Mining

28
Data mining: the textbook version

The meaning of attributes is clear


The meaning of attribute values is clear
Data modelling can be applied directly (e.g.,
regression, classification, clustering, association-
rule discovery)

(A simplified extract from the adult dataset in the UCI machine learning repository)
Data analysis: the reality
(data mining / knowledge discovery process)
...
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] "GET /search.html?t=jane%20austen&SID=023785&ord=asc HTTP/1.0" 200 1759
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] "GET /search.html?t=jane%20austen&m=video&SID=023785&ord=desc HTTP/1.0" 200 8450
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06:41 +0100] "GET /view.asp?id=3456&SID=023785 HTTP/1.0" 200 3478
...
What is the meaning of the attributes?
What is the meaning of the attribute values?
Data modelling is only one part!
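
The log lines above can be turned into named attributes before any modelling starts. The following is a minimal sketch, assuming the entries follow a Common Log Format-like layout (host, two unused fields, bracketed timestamp, quoted request, status code, response size). The field names (`host`, `page`, `params`, and so on) are labels chosen here for illustration. The meaning of the query parameters `t`, `m`, `SID`, and `ord` is not given, so the code only exposes them and does not interpret them.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: host, two unused fields, [timestamp],
# "METHOD path PROTOCOL", status code, response size in bytes.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\]\s*'
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

def parse_line(line):
    """Return a dict of named attributes, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    url = urlsplit(record.pop("path"))
    record["page"] = url.path
    # parse_qs decodes %20 and returns lists; keep the first value per key.
    record["params"] = {k: v[0] for k, v in parse_qs(url.query).items()}
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record

line = (
    'p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] '
    '"GET /search.html?t=jane%20austen&m=video&SID=023785&ord=desc HTTP/1.0" 200 8450'
)
record = parse_line(line)
```

Parsing only recovers the syntax. Deciding that `t` is a search term or that `SID` identifies a session still requires knowledge about the site, which is where semantics enters the process.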

30
Where does semantics come in?

Semantics

31
Web Mining - Link Ranking

32
Web Mining - Click Graph

33
Click Graph - Construction
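
A minimal sketch of one common way to construct a click graph, assuming the query log is available as (query, clicked URL) pairs. The sample log, the node labelling with `"q"` and `"u"` prefixes, and the function names are hypothetical, and the construction shown on the slide may differ. Queries and URLs form the two sides of a bipartite graph, and each edge carries a click count.

```python
from collections import defaultdict

def build_click_graph(clicks):
    """Bipartite graph: each query node is linked to the URLs clicked for it."""
    graph = defaultdict(lambda: defaultdict(int))
    for query, url in clicks:
        # Store the edge in both directions so traversal is symmetric.
        graph[("q", query)][("u", url)] += 1
        graph[("u", url)][("q", query)] += 1
    return graph

def connected_components(graph):
    """Group nodes that are reachable from each other (iterative DFS)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components

# Hypothetical click log: (query, clicked URL).
clicks = [
    ("jaguar", "cars.example/jaguar"),
    ("jaguar car", "cars.example/jaguar"),
    ("jaguar", "zoo.example/big-cats"),
    ("python tutorial", "docs.example/python"),
    ("jaguar", "cars.example/jaguar"),
]
graph = build_click_graph(clicks)
components = connected_components(graph)
```

In this toy log, the two "jaguar" queries end up in the same component because they share a clicked URL, while the unrelated query forms its own component.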

34
Click Graph - Node Distribution

35
Click Graph - Connected Component

36
Web Mining - User Intention

37
Classified Queries

38
Classified Queries

39
Classified Queries by Topics

40
Web Mining - Spam

41
Semantic Web
How is information represented in the actual Web?
As documents written in natural language
As graphs, pictures, tables, videos, and other multimedia
Humans are good at:
deducing facts from some (incomplete) information
creating associations between facts
aggregating information from several sources
But, machines:
cannot use partial (or incomplete) information
have difficulties aggregating several sources of information
can read but cannot understand information

42
Semantic Web: Integrating Data

43
Semantic Web
Representing the existing data (which are meaningful only to people) in a form understandable for machines.
This means annotating data with metadata.

Metadata are data about data.

Ontologies: documents that define relations among terms (the enabling technology of the Semantic Web).
Software agents that can process the data on behalf of humans, and automated web services that provide data.
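
A minimal sketch of what annotating data with metadata can look like, assuming a subject-predicate-object triple representation held in plain Python tuples rather than any RDF library. The `ex:` identifiers and the tiny class hierarchy are hypothetical and chosen for illustration. `rdf:type` and `rdfs:subClassOf` are used with their usual RDF/RDFS reading. The ontology part (the subclass relations) lets a program derive facts that were never stated directly.

```python
# Metadata as (subject, predicate, object) triples.
triples = {
    ("ex:MiningTheWeb", "rdf:type", "ex:Textbook"),
    ("ex:MiningTheWeb", "ex:author", "ex:SoumenChakrabarti"),
    # A tiny ontology: relations among terms.
    ("ex:Textbook", "rdfs:subClassOf", "ex:Book"),
    ("ex:Book", "rdfs:subClassOf", "ex:Publication"),
}

def superclasses(cls, triples):
    """All classes reachable from cls via rdfs:subClassOf (transitive)."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found

def types_of(entity, triples):
    """Stated types of an entity plus the types implied by the ontology."""
    direct = {o for s, p, o in triples if s == entity and p == "rdf:type"}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return direct | inferred

book_types = types_of("ex:MiningTheWeb", triples)
```

Only `ex:Textbook` is stated for the entity. `ex:Book` and `ex:Publication` follow from the ontology, which is the sense in which the data becomes usable by a software agent.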

44
What does an ontology look like? Examples

45
A1: Mining to Learn Ontologies

46
A2: Use Ontology to Improve Mining

48
Recap & Announcements
Doing a thesis (for CSE master's students)
Office visit outside OH
Enroll in class
Prerequisite
Slides for Bing Liu's textbook
http://www.cs.uic.edu/~liub/WebMiningBook.html
Lecture Slides - BB (this week)
No Class Meeting Next Monday - enjoy MLK Day
Click Graph - Node Distribution

See more details:
Ricardo Baeza-Yates and Alessandro Tiberi: Extracting semantic relations from query logs. KDD 2007
Click Graph - Connected Component

51
A real example:
Watson DeepQA

52
Topics not covered in this class
(but very important)
Crawling
Software Architecture (Yahoo! Challenge)
Data Cleaning and Pre-processing
Web Design
Privacy & Security (2014: year of data breach)
Human Computation

53
Crawling - the very first step

54
Data Cleaning & Pre-processing

55
Software Architecture

56
Web Design

57
Privacy, Privacy, Privacy

58
Human Computation / Crowdsourcing
Computers are incredibly fast, accurate, and stupid.

Human beings are incredibly slow, inaccurate, and brilliant.

Together they are powerful beyond imagination.

Example: CAPTCHA and reCAPTCHA

59


Schedule (subject to change)
Class meeting Part 1 - Instructor Presentation

Class meeting Part 2 - Student Presentation (Topics TBA)

60
