
Web Mining

Subject Code:
L, T, P, J, C: 3, 0, 0, 4, 4
Objectives
1. To give a detailed overview of the web mining process and its techniques.
2. To understand the basics of web search, with special emphasis on web crawling.
3. To understand the basics of indexing and the various types of query processing
approaches.
4. To appreciate the use of machine learning approaches for web content mining.
5. To understand the role of hyperlinks in web structure mining.
6. To appreciate the various aspects of web usage mining.

Expected Outcome
Upon completion of the course, the students will be able to:
1. Build a sample search engine using available open source tools.
2. Describe the browser security model in web security.
3. Identify the different components of a web page that can be used for mining.
4. Apply machine learning concepts to web content mining.
5. Implement the PageRank algorithm and modify it for mining information.
6. Design a system to harvest information available on the web to build
recommender systems.
7. Analyse social media data using appropriate data/web mining techniques.
8. Modify an existing search engine to make it personalized.

Module Topics L Hrs SLO


1 INTRODUCTION (L Hrs: 5; SLO: 2)
Introduction of WWW - Architecture of the WWW - Web Document
Representation - Web Search Engine Challenges - Web security
overview and concepts, Web application security, Basic web security
model - Web Hacking Basics - HTTP & HTTPS URL - Web Under the
Cover - Overview of Java security - Reading the HTML source.
2 WEB CRAWLING (L Hrs: 5; SLO: 7, 1)
Basic Crawler Algorithm: Breadth-First/Depth-First Crawlers -
Universal Crawlers - Preferential Crawlers: Focused Crawlers - Topical
Crawlers.
3 INDEXING (L Hrs: 5; SLO: 2)
Static and Dynamic Inverted Index - Index Construction and Index
Compression - Latent Semantic Indexing. Searching using an Inverted
Index: Sequential Search - Pattern Matching - Similarity Search.

4 WEB STRUCTURE MINING (L Hrs: 8; SLO: 7, 1)
Link Analysis - Social Network Analysis - Co-Citation and
Bibliographic Coupling - Page Rank - Weighted Page Rank - HITS -
Community Discovery - Web Graph Measurement and Modelling -
Using Link Information for Web Page Classification.

5 WEB CONTENT MINING (L Hrs: 8; SLO: 7, 1)
Classification: Decision Tree for Text Document - Naive Bayesian Text
Classification - Ensemble of Classifiers. Clustering: K-means
Clustering - Hierarchical Clustering - Markov Models - Probability-
Based Clustering. Vector Space Model - Latent Semantic Indexing -
Automatic Topic Extraction from Web Documents.
6 WEB USAGE MINING (L Hrs: 9; SLO: 7, 1)
Web Usage Mining - Click Stream Analysis - Log Files - Data
Collection and Pre-Processing - Data Modelling for Web Usage Mining
- The BIRCH Clustering Algorithm - Modelling Web User Interests
using Clustering - Affinity Analysis and the A Priori Algorithm -
Binning - Web Usage Mining using Probabilistic Latent Semantic
Analysis - Finding User Access Patterns via the Latent Dirichlet
Allocation Model.

7 QUERY PROCESSING (L Hrs: 3; SLO: 11)
Relevance Feedback and Query Expansion - Automatic Local and
Global Analysis - Measuring Effectiveness and Efficiency.

8 RECENT TRENDS (L Hrs: 2; SLO: 5)

Project (60 Non-Contact hrs; SLO: 17)
# Generally a team project [5 to 10 members]
# Concepts studied in XXXX should have been used
# Down-to-earth application and innovative idea should have been attempted
# Report in digital format, with all drawings made using a software package, to be
submitted. [Ex. 1. Design of a traffic light system using sequential circuits OR
2. Design of digital clock]
# Assessment on a continuous basis with a minimum of 3 reviews.

Projects may be given as group projects

The following are sample projects that can be given to students for
implementation:
1. Develop a search engine for the retrieval process.
2. Develop a domain-based crawler.
3. Efficiently extract related textual information and multimedia content
from documents.
4. Implement an indexing structure for multi-dimensional data of a dynamic
nature.
5. Perform opinion mining and sentiment analysis on documents using web mining.
6. Implement a recommendation system.
7. Implement effective compression schemes for storing data in less
storage space.
8. Develop a mechanism for a query manager.
9. Develop an effective query refinement mechanism based on query algebra.
10. Personalize a search engine.
11. Solve data science problems from the Kaggle website.
List of Case Studies:
12. Market-customer analysis
13. Biological/ DNA sequence analysis
14. Detecting software bugs
15. Improving storage performance
16. Design of structured pattern mining methods
17. Network alarm pattern mining
18. XML query access pattern analysis
19. System performance
20. Telecommunication network
21. Financial and Scientific data
22. Creating adaptive web sites
23. System improvement
24. Navigation patterns in web logs.

Text Books
1. Bing Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric
Systems and Applications), Springer, 2nd Edition, 2009.
2. Zdravko Markov, Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content,
Structure, and Usage, John Wiley & Sons, Inc., 2007.

Reference Books
1. Guandong Xu, Yanchun Zhang, Lin Li, Web Mining and Social Networking: Techniques and
Applications, Springer, 1st Edition, 2010.
2. Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan
Kaufmann, 2002.
3. Adam Schenker, Graph-Theoretic Techniques for Web Content Mining, World Scientific Pub Co
Inc, 2005.
4. Min Song, Yi Fang and Brook Wu, Handbook of Research on Text and Web Mining Technologies,
IGI Global, Information Science Reference (imprint of IGI Publishing), 2008.

Web Mining
Knowledge Areas that contain topics and learning outcomes covered in the course

Knowledge Area: Total Hours of Coverage

CS: IAS (Information Assurance and Security): 5

CS: IM (Information Management): 13

CS: IS (Intelligent Systems): 27

Body of Knowledge coverage


[List the Knowledge Units covered in whole or in part in the course. If in part, please indicate
which topics and/or learning outcomes are covered. For those not covered, you might want to
indicate whether they are covered in another course or not covered in your curriculum at all.
This section will likely be the most time-consuming to complete, but is the most valuable for
educators planning to adopt the CS2013 guidelines.]

KA: CS: IAS. Knowledge Unit: IAS/Web Security. Hours: 5.
Topics Covered:
Web security model and its applications
Browser security model
HTTP security extensions

KA: CS: IM. Knowledge Unit: IM/Information Management Concepts. Hours: 4.
Topics Covered:
Basic information storage and retrieval (IS&R) concepts
Information capture and representation
Supporting human needs: searching, retrieving, linking, browsing, navigating
Analysis and indexing

KA: CS: IM. Knowledge Unit: IM/Indexing. Hours: 6.
Topics Covered:
The impact of indices on query performance
The basic structure of an index
Indexing text
Indexing the web (e.g., web crawling)

KA: CS: IS. Knowledge Unit: IS/Basic Search Strategies. Hours: 3.
Topics Covered:
Uninformed search (breadth-first, depth-first, depth-first with iterative deepening)
Heuristics and informed search

KA: CS: IS. Knowledge Unit: IS/Basic Machine Learning. Hours: 23.
Topics Covered:
Definition and examples of a broad variety of machine learning tasks, including
classification
Inductive learning
Simple statistical-based learning, such as Naive Bayesian Classifier, decision trees
The over-fitting problem
Measuring classifier accuracy

KA: CS: IS. Knowledge Unit: IS/Advanced Machine Learning. Hours: 4.
Topics Covered:
Learning graphical models (Cross-reference IS/Reasoning under Uncertainty)



Total hours 45
Where does the course fit in the curriculum?
[In what year do students commonly take the course? Is it compulsory? Does it have pre-
requisites, required following courses? How many students take it?]

This course is an elective course, suitable from the 4th semester onwards.
Knowledge of basic mathematics is essential.

What is covered in the course?


[A short description, and/or a concise list of topics - possibly from your course syllabus.(This is
likely to be your longest answer)]

Part I: Introduction to Web Mining

This section introduces web mining, the architecture of the WWW, its challenges, and
security on the web.

Part II: Web Crawling and Indexing


This section covers how to fetch data from the web (crawling) and how to store it for
search (indexing), using recent algorithms.
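
As a rough illustration of the two ideas in this part, the sketch below crawls a tiny in-memory "web" breadth-first and builds an inverted index over the pages it reaches. The `PAGES` dictionary and the names `crawl_bfs`, `build_index` and `search` are hypothetical and chosen only for this example; a real crawler would fetch pages over HTTP and parse links from the HTML.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.html": ("web mining overview", ["b.html", "c.html"]),
    "b.html": ("web crawling basics", ["c.html"]),
    "c.html": ("inverted index basics", ["a.html"]),
    "d.html": ("unreachable page", []),
}


def crawl_bfs(seed, pages):
    """Visit pages breadth-first from the seed; return URLs in visit order."""
    order = []
    seen = {seed}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        order.append(url)
        _text, links = pages[url]
        for link in links:
            # Skip links that are unknown or already queued.
            if link in pages and link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


def build_index(urls, pages):
    """Build an inverted index: term -> set of URLs containing the term."""
    index = defaultdict(set)
    for url in urls:
        text, _links = pages[url]
        for term in text.split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term of the query."""
    terms = query.split()
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


crawled = crawl_bfs("a.html", PAGES)
index = build_index(crawled, PAGES)
print(crawled)
print(search(index, "web basics"))
```

Only pages reachable from the seed are indexed, so `d.html` never appears in a search result here.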

Part III: Three categories of web mining


This section explains the three categories of web mining (web structure mining, web
content mining, and web usage mining) and the recent algorithms used in each.
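
For the web structure mining category, a minimal sketch of PageRank by power iteration is given below. The graph `LINKS`, the function name `pagerank`, the damping factor of 0.85 and the fixed iteration count are assumptions made for illustration; the sketch also assumes every link target appears as a key in the graph.

```python
# Hypothetical link graph: page -> list of pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores with a fixed number of power iterations."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[node] for node in nodes if not links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            for target in links[node]:
                # Each page shares its rank equally among its outgoing links.
                new_rank[target] += damping * rank[node] / len(links[node])
        rank = new_rank
    return rank


scores = pagerank(LINKS)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

In this small graph, page `c` collects links from three pages and ends up with the highest score, while `d`, which nothing links to, keeps only the baseline share.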

What is the format of the course?


[Is it face to face, online or blended? How many contact hours? Does it have lectures, lab
sessions, discussion classes?]

This course is designed with 100 minutes of in-classroom sessions per week, 60 minutes of
video/reading instructional material per week, 100 minutes of lab hours per week, as well as
200 minutes of non-contact time spent on implementing a course-related project. Generally, this
course combines lectures, in-class discussion, case studies, guest lectures,
mandatory off-class reading material, and quizzes.

How are students assessed?


[What type, and number, of assignments are students are expected to do? (papers, problem sets,
programming projects, etc.). How long do you expect students to spend on completing assessed
work?]

Students are assessed on a combination of group activities, classroom discussion, projects,
continuous assessment tests, and a final assessment test.

Additional weightage will be given based on their rank in crowdsourced projects or
Kaggle-like competitions.

Students can earn additional weightage based on a certificate of completion of a related
MOOC course.
Additional topics
[List notable topics covered in the course that you do not find in the CS2013 Body of
Knowledge]

Other comments
[optional]

Session wise plan


Student Outcomes Covered: 2, 11, 14, 17

Columns: Class Hour, Lab Hour, Topic Covered, Levels of mastery, Reference Book, Remarks

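2 hours: Introduction and Architecture of the WWW. Level of mastery: Familiarity.
Reference Book: 1.
1 hour: Web Document Representation - Web Search Engine Challenges. Level of mastery:
Usage. Reference Book: 1.
1 hour: Web security overview and concepts, Web application security, Basic web security
model. Level of mastery: Familiarity. Reference Book: 1.
1 hour: Web Hacking Basics - HTTP & HTTPS URL - Web Under the Cover - Overview of Java
security - Reading the HTML source. Level of mastery: Familiarity. Reference Book: 1.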
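2 hours: Basic Crawler Algorithm: Breadth-First/Depth-First Crawlers. Level of mastery:
Usage. Reference Book: 1, 2.
1 hour: Universal Crawlers. Level of mastery: Usage. Reference Book: 1, 2.
2 hours: Preferential Crawlers: Focused Crawlers - Topical Crawlers. Level of mastery:
Usage. Reference Book: 1, 2.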

3 hours: Static and Dynamic Inverted Index - Index Construction and Index Compression -
Latent Semantic Indexing. Level of mastery: Familiarity. Reference Book: 1.
2 hours: Searching using an Inverted Index: Sequential Search - Pattern Matching -
Similarity Search. Level of mastery: Usage. Reference Book: 1, 2.
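3 hours: Link Analysis - Social Network Analysis - Co-Citation and Bibliographic
Coupling. Level of mastery: Familiarity. Reference Book: 1, 2.
3 hours: Page Rank - Weighted Page Rank. Level of mastery: Usage. Reference Book: 1, 2.
2 hours: Community Discovery - Web Graph Measurement and Modelling - Using Link
Information for Web Page Classification. Level of mastery: Familiarity.
Reference Book: 1, 2.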

3 hours: Classification: Decision Tree for Text Document - Naive Bayesian Text
Classification - Ensemble of Classifiers. Level of mastery: Assessment.
Reference Book: 1, 2, 3.
3 hours: Clustering: K-means Clustering - Hierarchical Clustering - Markov Models -
Probability-Based Clustering. Level of mastery: Assessment. Reference Book: 1, 2.
2 hours: Vector Space Model - Latent Semantic Indexing - Automatic Topic Extraction from
Web Documents. Level of mastery: Usage. Reference Book: 1.

2 hours: Web Usage Mining - Click Stream Analysis - Web Server Log Files - Data
Collection and Pre-Processing - Data Modelling for Web Usage Mining. Level of mastery:
Usage. Reference Book: 1, 2.
4 hours: The BIRCH Clustering Algorithm - Modelling Web User Interests using Clustering -
Affinity Analysis and the A Priori Algorithm - Binning. Level of mastery: Usage.
Reference Book: 1, 2.
2 hours: Web Usage Mining using Probabilistic Latent Semantic Analysis - Finding User
Access Patterns via the Latent Dirichlet Allocation Model. Level of mastery: Usage.
Reference Book: 1, 2.

2 hours: Relevance Feedback and Query Expansion - Automatic Local and Global Analysis.
Level of mastery: Usage. Reference Book: 1, 2.
2 hours: Application. Level of mastery: Assessment.

Total: 45 Hours (3 credit hours/week, 15-week schedule)
