Uploaded by: amrisundar
Copyright: Attribution Non-Commercial (BY-NC)

Internet Data Mining

Sponsor: Ling Liu / David Buttler
{lingliu, buttler}@cc.gatech.edu
223 / 260 CCB
Area: Systems and Databases
Related Projects

Problem
The explosive growth of the Internet has become an overused cliche, yet the problems of
information overload remain as real as ever. Web search engines provide one way to manage the
deluge of information on the Internet, but they have some serious drawbacks for many
applications. Common search engines do not index dynamic content; any URL with a '?' is
ignored. Neither do search engines provide finer granularity than a single HTML page. Their
design makes them unsuitable for comparison shopping or data integration.
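
To make the '?' limitation concrete, here is a minimal sketch of the kind of filter such a crawler might apply. The class name, method name, and example URLs are hypothetical; only java.net.URI is a real library class, and real search engines may apply different or more nuanced rules.

```java
import java.net.URI;
import java.util.List;

final class DynamicUrlFilter {
    /** Returns true when the URL carries a query component (the part after '?'). */
    static boolean hasQuery(String url) {
        return URI.create(url).getRawQuery() != null;
    }

    public static void main(String[] args) {
        // Hypothetical example URLs: one static page, one dynamically generated page.
        List<String> urls = List.of(
            "http://example.com/catalog/index.html",
            "http://example.com/search?q=java+books");
        for (String url : urls) {
            // A crawler following the policy described above would skip the second URL.
            System.out.println((hasQuery(url) ? "skip   " : "index  ") + url);
        }
    }
}
```

Under this policy every result page of a search form is invisible to the index, which is the gap the projects below aim to close.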
The DISL group has constructed a powerful set of information extraction tools to address
some of these problems. However, several research challenges remain. The
following figure presents a simple architecture for a dynamic search engine.

Within this framework there are several possible short projects suitable for a 7001 mini project,
or an extended Special Problems project.
1. Design and implement a robot crawler that discovers new dynamic search engine
interfaces
2. Design a technique to categorize a search engine by its contents (the pages that it
dynamically generates), the types of queries it responds to (query interface), or the
context of the search interface.
3. In conjunction with the categorization system, develop a user interface that assists users
in selecting the appropriate types of sources that are applicable to their query (see the
AQR project for an example static system)
4. Improve the automated object extraction system. This may be broken down into
individual projects by itself.

Currently, the automated object extraction system works in two phases: (1) identify the
region of a dynamically generated web page that contains data objects; (2) discover how
the objects are separated (e.g. is there a single tag that separates objects?), and use the
separator to split the data region into objects.
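
The two phases can be sketched roughly as follows. This is not the DISL implementation: the Node type is a hypothetical stand-in for a parsed HTML tree, and both heuristics (largest fan-out for the data region, most frequent child tag for the separator) are assumptions chosen only to illustrate the shape of the pipeline.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ObjectExtractionSketch {
    /** Minimal stand-in for a node of a parsed HTML tag tree (hypothetical type). */
    static final class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();
        Node(String tag, Node... kids) {
            this.tag = tag;
            for (Node k : kids) children.add(k);
        }
    }

    /** Phase 1 (assumed heuristic): pick the node with the most direct children. */
    static Node findDataRegion(Node root) {
        Node best = root;
        for (Node child : root.children) {
            Node candidate = findDataRegion(child);
            if (candidate.children.size() > best.children.size()) best = candidate;
        }
        return best;
    }

    /** Phase 2a (assumed heuristic): most frequent child tag; the earliest tag wins ties. */
    static String findSeparator(Node region) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Node c : region.children) counts.merge(c.tag, 1, Integer::sum);
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
        }
        return best;
    }

    /** Phase 2b: start a new object at each separator; leading nodes form their own group. */
    static List<List<Node>> split(Node region, String separator) {
        List<List<Node>> objects = new ArrayList<>();
        List<Node> current = null;
        for (Node c : region.children) {
            if (current == null || c.tag.equals(separator)) {
                current = new ArrayList<>();
                objects.add(current);
            }
            current.add(c);
        }
        return objects;
    }

    /** A made-up result page: one table cell holds three hr-separated objects. */
    static Node samplePage() {
        Node results = new Node("td",
            new Node("hr"), new Node("b"),
            new Node("hr"), new Node("b"), new Node("i"),
            new Node("hr"), new Node("i"));
        return new Node("html",
            new Node("head", new Node("title")),
            new Node("body", new Node("table",
                new Node("tr", new Node("td", new Node("img")), results))));
    }

    public static void main(String[] args) {
        Node region = findDataRegion(samplePage());
        String separator = findSeparator(region);
        System.out.println("region=<" + region.tag + "> separator=<" + separator
            + "> objects=" + split(region, separator).size());
    }
}
```

Keeping the phases as separate methods means a new region heuristic or a new separator heuristic can be swapped in without touching the other phase.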

Mini-projects in this area may include the following:

○ Develop a new heuristic to identify where the data objects are; validate the
effectiveness of the heuristic
○ Develop a new heuristic to split the data region into data objects; validate the
effectiveness of the heuristic
○ Implement a more sophisticated technique to combine individual heuristics to
produce a better result, either for the data region identification heuristics, or the
object separator discovery heuristics.
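
One possible way to combine individual heuristics is weighted rank voting, sketched below. The scheme, the class name, and the weights are assumptions for illustration only; the source does not say how the existing system combines heuristics. Each heuristic is assumed to return its candidates (for example, separator tags) ordered from most to least likely.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HeuristicCombiner {
    /**
     * Scores each candidate by summing weight / (rank + 1) over all heuristics,
     * where rank 0 is a heuristic's top choice, and returns the highest scorer.
     */
    static String combine(List<List<String>> rankings, double[] weights) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (int h = 0; h < rankings.size(); h++) {
            List<String> ranking = rankings.get(h);
            for (int rank = 0; rank < ranking.size(); rank++) {
                scores.merge(ranking.get(rank), weights[h] / (rank + 1), Double::sum);
            }
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) { best = e.getKey(); bestScore = e.getValue(); }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three hypothetical separator heuristics, each ranking the same candidate tags.
        List<List<String>> rankings = List.of(
            List.of("hr", "p", "br"),
            List.of("p", "hr", "br"),
            List.of("hr", "br", "p"));
        System.out.println(combine(rankings, new double[] {1.0, 1.0, 1.0}));
        // Trusting the second heuristic far more than the others changes the outcome.
        System.out.println(combine(rankings, new double[] {0.2, 3.0, 0.2}));
    }
}
```

The weights are the natural place to encode how reliable each heuristic proved on the test pages; learning them from data would be one form of the "more sophisticated technique" mentioned above.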
There are several interesting projects related to this topic. Please see either David or Prof. Ling
Liu to discuss other options.
Resources that may be helpful:
• Local Java code library (convert an HTML file into a tree, automatically extract textual
objects from a page, and more).
• A Java framework to automatically run a heuristic over a large set of test web pages
• A set of web pages to test solutions, plus a method to evaluate whether a data-region
heuristic or an object separator heuristic succeeded on a given web page.
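
As a rough idea of what running a heuristic over a labeled test set involves, the sketch below computes a success rate. It does not reproduce the local framework's API, which is not described here; the class, the toy heuristic, and the four labeled snippets are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class HeuristicEvaluation {
    /** Fraction of labeled pages on which the heuristic returns the expected answer. */
    static double successRate(Function<String, String> heuristic,
                              Map<String, String> labeledPages) {
        if (labeledPages.isEmpty()) return 0.0;
        int hits = 0;
        for (Map.Entry<String, String> e : labeledPages.entrySet()) {
            if (e.getValue().equals(heuristic.apply(e.getKey()))) hits++;
        }
        return (double) hits / labeledPages.size();
    }

    /** Deliberately naive separator heuristic, used only to exercise the harness. */
    static String naiveSeparatorGuess(String html) {
        return html.contains("<hr") ? "hr" : "p";
    }

    /** Made-up test set: page snippet mapped to the separator tag a human labeled. */
    static Map<String, String> samplePages() {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("<hr><b>A</b><hr><b>B</b>", "hr");
        pages.put("<p>A</p><p>B</p>", "p");
        pages.put("<table><tr><td>A</td></tr><tr><td>B</td></tr></table>", "tr");
        pages.put("<hr><i>A</i><hr><i>B</i>", "hr");
        return pages;
    }

    public static void main(String[] args) {
        double rate = successRate(HeuristicEvaluation::naiveSeparatorGuess, samplePages());
        System.out.println("success rate: " + rate);
    }
}
```

Reporting a single success rate per heuristic makes it straightforward to compare a new heuristic against existing ones on the same pages.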

Background
You are expected to have a solid grasp of Java programming. Familiarity with XML is useful but
not required.

Deliverables
A report describing the work you did and how you evaluated your results, together with any
source code you produced to accomplish your results.
Evaluation
You will be graded on the novelty and quality of your report and implementation.
