Web Scraping Using Nutch and Solr 2/3

A short presentation (part 2 of 3) describing the use of the open source tools Nutch and Solr to crawl the web and process the data.

Copyright:

Attribution Non-Commercial (BY-NC)

Uploaded by

Mike Frampton

Web Scraping Using Nutch and Solr - Part 2

The following example assumes that you have

- Watched "Web scraping with Nutch and Solr"
- (The identity of the above movie is cAiYBD4BQeE)
- Set up a Linux-based Nutch/Solr environment
- Run the web scrape shown in the above movie

Now we will

- Clean up that environment
- Web scrape a parameterised url
- View the urls in the data

Empty Nutch Database

Clean up the Nutch crawl database

- Previously used apache-nutch-1.6/nutch_start.sh
- This contained the -dir crawl option
- This created the apache-nutch-1.6/crawl directory
- Which contains our Nutch data

Clean this as

    cd apache-nutch-1.6; rm -rf crawl

- Only because it contained dummy data!
- The next run of the script will create the directory again
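The same clean-up can be scripted. The following is a minimal sketch, assuming the crawl data lives in a `crawl` directory directly under the Nutch home directory, as described above. The function name `empty_crawl_db` is hypothetical, not part of Nutch.

```python
import shutil
from pathlib import Path


def empty_crawl_db(nutch_home: str, crawl_dir_name: str = "crawl") -> bool:
    """Delete the Nutch crawl directory; return True if something was removed.

    Hypothetical helper: nutch_home would be e.g. "apache-nutch-1.6".
    """
    crawl_dir = Path(nutch_home) / crawl_dir_name
    if not crawl_dir.is_dir():
        # Nothing to clean; the next crawl creates the directory anyway.
        return False
    shutil.rmtree(crawl_dir)
    return True
```

Calling `empty_crawl_db("apache-nutch-1.6")` twice in a row is harmless, because the second call finds no directory and returns False.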

Empty Solr Database

Clean Solr database via a url

- Bookmark this url
- Only use it if you need to empty your data

Run the following with curl ( with solr server running )

    curl 'http://localhost:8983/solr/update?commit=true' -d '<delete><query>*:*</query></delete>'
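The delete request can also be sent from a script. This is a sketch under stated assumptions: Solr is reachable at `http://localhost:8983/solr` as in the slide, and the XML body is posted with an explicit `text/xml` content type, which is an addition not shown in the original command. The names `build_delete_all_request` and `empty_solr` are hypothetical.

```python
import urllib.request

# The delete-everything message from the slide.
DELETE_ALL_XML = "<delete><query>*:*</query></delete>"


def build_delete_all_request(
    solr_url: str = "http://localhost:8983/solr",
) -> urllib.request.Request:
    """Build (but do not send) a request that deletes every document and commits."""
    return urllib.request.Request(
        url=f"{solr_url}/update?commit=true",
        data=DELETE_ALL_XML.encode("utf-8"),
        # Assumption: declare the body as XML explicitly.
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def empty_solr(solr_url: str = "http://localhost:8983/solr") -> int:
    """Send the delete request to a running Solr server; return the HTTP status."""
    with urllib.request.urlopen(build_delete_all_request(solr_url)) as response:
        return response.status
```

Keeping the request construction separate from the sending step makes it possible to inspect exactly what would be posted before any data is removed.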

Set up Nutch

Now we will do something more complex

- Web scrape a url that has parameters i.e.
  http://<site>/<function>?var1=val1&var2=val2

This web scrape will

- Have extra url characters '?=&'
- Need greater search depth
- Need better url filtering

Remember that you need to get permission to scrape a third party web site

Nutch Configuration

Change seed file for Nutch

- apache-nutch-1.6/urls/seed.txt

In this instance I will use a url of the form

    http://somesite.co.nz/Search?DateRange=7&industry=62
    ( this is not a real url, just an example )
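To see which parts of such a url the crawl has to cope with, the example seed url can be split into its host, path and parameters. This is only an illustration using the Python standard library; it is not part of the Nutch set-up.

```python
from urllib.parse import parse_qs, urlsplit

# The example seed url from the slide (not a real site).
seed_url = "http://somesite.co.nz/Search?DateRange=7&industry=62"

parts = urlsplit(seed_url)
params = parse_qs(parts.query)

print(parts.netloc)  # somesite.co.nz
print(parts.path)    # /Search
print(params)        # {'DateRange': ['7'], 'industry': ['62']}
```

The '?', '=' and '&' characters only appear in the query part, which is why the url filter has to allow them for this crawl.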

Change the conf/regex-urlfilter.txt entries i.e.

    # skip URLs containing certain characters
    -[*!@]
    # accept only somesite.co.nz Search urls
    +^http://([a-z0-9]*\.)*somesite\.co\.nz/Search

The skip rule no longer lists '?' and '=', so parameterised urls get through. This will only consider somesite Search urls
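The effect of the two rules can be checked offline before crawling. The sketch below imitates how the rules are understood to be applied: they are tried in order, the first pattern found anywhere in the url decides ('+' accepts, '-' rejects), and a url that matches no rule is rejected. This is an approximation written for illustration, not Nutch code; the name `accepts` is hypothetical, and the dots in the host name are escaped so that they match literally.

```python
import re

# (sign, pattern) pairs in the same order as in regex-urlfilter.txt.
RULES = [
    ("-", re.compile(r"[*!@]")),
    ("+", re.compile(r"^http://([a-z0-9]*\.)*somesite\.co\.nz/Search")),
]


def accepts(url: str) -> bool:
    """Return True if the first matching rule is an accept ('+') rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    # Assumption: a url that matches no rule is dropped.
    return False


for candidate in (
    "http://somesite.co.nz/Search?DateRange=7&industry=62",
    "http://www.somesite.co.nz/Search?DateRange=7",
    "http://somesite.co.nz/About",
    "http://othersite.co.nz/Search?DateRange=7",
):
    print(candidate, accepts(candidate))
```

Only the first two candidates are accepted; the other two fail the accept pattern and fall through to the default rejection.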

Run Nutch

Now run nutch using the start script

    cd apache-nutch-1.6 ; ./nutch_start.bash

- Monitor for errors in the solr admin log window

The Nutch crawl should end with

    crawl finished: crawl
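If the crawl is started from a script rather than by hand, the final message can be checked automatically. This is a rough sketch, assuming the start script writes its progress to standard output and that the last non-empty line begins with "crawl finished:" as shown above. The helper names `crawl_finished` and `run_crawl` are hypothetical.

```python
import subprocess


def crawl_finished(output: str) -> bool:
    """Return True if the last non-empty output line reports a finished crawl."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return bool(lines) and lines[-1].startswith("crawl finished:")


def run_crawl(
    nutch_home: str = "apache-nutch-1.6",
    script: str = "./nutch_start.bash",
) -> bool:
    """Run the start script inside nutch_home and report whether it finished."""
    result = subprocess.run(
        [script], cwd=nutch_home, capture_output=True, text=True
    )
    return result.returncode == 0 and crawl_finished(result.stdout)
```

`run_crawl()` only tells you that the crawl ended normally; errors during indexing still need to be checked in the solr admin log window.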

Checking Data

Data should have been indexed in Solr

In the Solr Admin window

- Set 'Core Selector' = collection1
- Click 'Query'
- In the Query window set the fl field = url
- Click Execute Query

The result shows the filtered list of urls in Solr
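The same check can be made without the admin window by calling the select handler of the core directly. This is a sketch under assumptions: Solr runs at `http://localhost:8983/solr`, the core is `collection1` as above, and the response is requested as JSON with `wt=json`. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url_query(
    solr_url: str = "http://localhost:8983/solr",
    core: str = "collection1",
    rows: int = 10,
) -> str:
    """Build a select query that returns only the url field of each document."""
    params = urlencode({"q": "*:*", "fl": "url", "wt": "json", "rows": rows})
    return f"{solr_url}/{core}/select?{params}"


def extract_urls(response_text: str) -> list:
    """Pull the url values out of a Solr JSON select response."""
    docs = json.loads(response_text)["response"]["docs"]
    return [doc["url"] for doc in docs if "url" in doc]


def fetch_urls(solr_url: str = "http://localhost:8983/solr") -> list:
    """Query a running Solr server and return the indexed urls."""
    with urlopen(build_url_query(solr_url)) as response:
        return extract_urls(response.read().decode("utf-8"))
```

With the filter set up as described, every url returned by `fetch_urls()` should be a somesite Search url.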

Results

Congratulations, you have completed your second crawl

- With parameterised urls
- More complex url filtering
- With a Solr Query search

Feel free to contact us at

- www.semtech-solutions.co.nz
- info@semtech-solutions.co.nz

We offer IT project consultancy

- We are happy to hear about your problems
- You can just pay for those hours that you need
- To solve your problems
