
Web Scraping Using Nutch and Solr - Part 2

The following example assumes that you have

- Watched web scraping with Nutch and Solr
- The above movie identity is cAiYBD4BQeE
- Set up Linux based Nutch/Solr environment
- Run the web scrape in the above movie

Now we will

- Clean up that environment
- Web scrape a parameterised url
- View the urls in the data

Empty Nutch Database

Clean up the Nutch crawl database

- Previously used apache-nutch-1.6/
- This contained -dir crawl option
- This created apache-nutch-1.6/crawl directory
- Which contains our Nutch data

Clean this as

    cd apache-nutch-1.6; rm -rf crawl

- Only because it contained dummy data!
- Next run of script will create dir again
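
The same clean-up can be scripted. The following is a minimal Python sketch, assuming Nutch lives in apache-nutch-1.6 under the current directory; the function name remove_crawl_dir is hypothetical and is not part of Nutch.

```python
import shutil
from pathlib import Path


def remove_crawl_dir(nutch_home="apache-nutch-1.6"):
    """Delete <nutch_home>/crawl if it exists; return True if it was removed."""
    crawl_dir = Path(nutch_home) / "crawl"
    if not crawl_dir.is_dir():
        return False          # nothing to clean
    shutil.rmtree(crawl_dir)  # same effect as: rm -rf crawl
    return True
```

Calling remove_crawl_dir() a second time simply returns False, so it is safe to run before every crawl.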

Empty Solr Database

Clean Solr database via a url

- Bookmark this url
- Only use it if you need to empty your data

Run the following ( with Solr server running )

    curl http://localhost:8983/solr/update?commit=true -d '<delete><query>*:*</query></delete>'
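
The delete request above can also be built from Python. This is only a sketch: the function name build_delete_all_request is hypothetical, and the text/xml content type is my assumption about what the Solr update handler expects for an XML body. The function builds the request without sending it, so nothing is deleted until you explicitly open it.

```python
import urllib.request

# XML body that matches every document in the index
DELETE_ALL = b"<delete><query>*:*</query></delete>"


def build_delete_all_request(solr_url="http://localhost:8983/solr"):
    """Build (but do not send) a POST that deletes all documents and commits."""
    return urllib.request.Request(
        solr_url.rstrip("/") + "/update?commit=true",
        data=DELETE_ALL,                       # a body makes this a POST
        headers={"Content-Type": "text/xml"},  # assumed content type
    )

# To actually empty the index (irreversible), with the Solr server running:
#   urllib.request.urlopen(build_delete_all_request())
```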

Set up Nutch

Now we will do something more complex

- Web scrape a url that has parameters i.e.
  http://<site>/<function>?var1=val1&var2=val2

This web scrape will

- Have extra url characters '?=&'
- Need greater search depth
- Need better url filtering

Remember that you need to get permission to scrape a third party web site
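
To make the parameterised url form http://<site>/<function>?var1=val1&var2=val2 concrete, here is a small Python sketch that splits such a url into its parts. The function name describe_url and the example.com address are hypothetical and used for illustration only.

```python
from urllib.parse import parse_qs, urlsplit


def describe_url(url):
    """Split a parameterised url into site, function and parameters."""
    parts = urlsplit(url)
    return {
        "site": parts.netloc,
        "function": parts.path.lstrip("/"),
        # parse_qs returns a list per key; keep the first value of each
        "params": {k: v[0] for k, v in parse_qs(parts.query).items()},
    }
```

The characters '?', '=' and '&' are what separate the function from its parameters, which is why the url filter has to let them through.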

Nutch Configuration

Change seed file for Nutch

- apache-nutch-1.6/urls/seed.txt
- In this instance I will use a url of the form
  ( this is not a real url, just an example )

Change conf regex-urlfilter.txt entry i.e.

    # skip URLs containing certain characters
    -[*!@]
    # accept anything else
    +^http://([a-z0-9]*\.)*\/Search

This will only consider some site Search urls
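
The effect of these filter rules can be tried out before running a crawl. The Python sketch below imitates how, as I understand it, the Nutch regex url filter works: rules are tested in file order, the first rule whose pattern is found in the url decides ('+' accepts, '-' rejects), and a url that matches no rule is dropped. The accept pattern uses example.com as a placeholder domain; that domain and the function name accepts are assumptions, not taken from the slides.

```python
import re

# Rules in file order: (sign, regex).
# example.com is a placeholder domain, not the site used in the slides.
RULES = [
    ("-", r"[*!@]"),                                      # skip these characters
    ("+", r"^http://([a-z0-9]*\.)*example\.com/Search"),  # accept Search urls
]


def accepts(url, rules=RULES):
    """Return True if the first rule found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the url is dropped
```

Because '?' and '=' are not in the skip list, a url such as http://www.example.com/Search?var1=val1&var2=val2 gets past the first rule and is accepted by the second.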

Run Nutch

Now run Nutch using the start script

    cd apache-nutch-1.6 ; ./nutch_start.bash

- Monitor for errors in the Solr admin log window
- The Nutch crawl should end with

crawl finished: crawl
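
If the script output is captured to a file or string, the final line can be checked automatically. A minimal Python sketch, assuming the output is available as text; the function name crawl_finished is hypothetical.

```python
def crawl_finished(output, crawl_dir="crawl"):
    """Return True if the last non-empty line ends with the finish marker."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return bool(lines) and lines[-1].endswith("crawl finished: " + crawl_dir)
```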

Checking Data

Data should have been indexed in Solr

In Solr Admin window

- Set 'Core Selector' = collection1
- Click 'Query'
- In Query window set fl field = url
- Click Execute Query

The result ( next ) shows the filtered list of urls in Solr
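
The same query can be issued without the admin window. The Python sketch below only builds the select url for the collection1 core with fl set to url; the function name build_url_query, the wt=json output format and the rows value are my assumptions, not settings taken from the slides.

```python
from urllib.parse import urlencode


def build_url_query(solr_url="http://localhost:8983/solr",
                    core="collection1", rows=10):
    """Build a Solr select url that returns only the url field."""
    params = urlencode({"q": "*:*", "fl": "url", "wt": "json", "rows": rows})
    return f"{solr_url.rstrip('/')}/{core}/select?{params}"
```

With the Solr server running, opening the returned address in a browser, or with urllib.request.urlopen, should list the indexed urls under response.docs in the JSON result.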

Checking Data

[Screenshot: Solr query result showing the filtered list of urls]

Congratulations, you have completed your second crawl

- With parameterised urls
- More complex url filtering
- With a Solr Query search

Contact Us

Feel free to contact us at

- We offer IT project consultancy
- We are happy to hear about your problems
- You can just pay for those hours that you need to solve your problems
