Web Scraping Using Nutch and Solr - Part 2

The following example assumes that you have

- Watched web scraping with Nutch and Solr
- The above movie identity is cAiYBD4BQeE
- Set up a Linux-based Nutch/Solr environment
- Run the web scrape in the above movie

Now we will

- Clean up that environment
- Web scrape a parameterised url
- View the urls in the data

Empty Nutch Database

Clean up the Nutch crawl database

- Previously used apache-nutch-1.6/nutch_start.sh
- This contained the -dir crawl option
- This created the apache-nutch-1.6/crawl directory
- Which contains our Nutch data

Clean this as

cd apache-nutch-1.6; rm -rf crawl

- Only because it contained dummy data!
- The next run of the script will create the dir again

Empty Solr Database

Clean Solr database via a url

- Bookmark this url
- Only use it if you need to empty your data

Run the following ( with the Solr server running )

curl http://localhost:8983/solr/update?commit=true -d '<delete><query>*:*</query></delete>'
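The delete request above can be wrapped in a small shell function so that the Solr address is stated only once. This is a minimal sketch, assuming the default Solr address used in this example; the names `SOLR_BASE`, `solr_update_url` and `solr_delete_all` are hypothetical, and the explicit Content-Type header is an addition that is not part of the original command.

```shell
#!/bin/sh
# Base address of the Solr server. The default is the address used in this
# example; SOLR_BASE is a hypothetical override, not a Solr setting.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Hypothetical helper: print the update url with an immediate commit.
solr_update_url() {
    printf '%s/update?commit=true\n' "$SOLR_BASE"
}

# Hypothetical helper: delete every document in the index.
# The Content-Type header is an assumption added here to declare the body as XML.
solr_delete_all() {
    curl -sS "$(solr_update_url)" \
        -H 'Content-Type: text/xml' \
        --data-binary '<delete><query>*:*</query></delete>'
}
```

Calling `solr_delete_all` with the Solr server running sends the request; nothing is sent when the functions are only defined.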

Set up Nutch

Now we will do something more complex

- Web scrape a url that has parameters, i.e.

http://<site>/<function>?var1=val1&var2=val2

This web scrape will

- Have extra url characters '?=&'
- Need greater search depth
- Need better url filtering


Remember that you need to get permission to scrape a third party web site

Nutch Configuration

Change the seed file for Nutch

apache-nutch-1.6/urls/seed.txt

In this instance I will use a url of the form

http://somesite.co.nz/Search?DateRange=7&industry=62
( this is not a real url, just an example )

Change the conf/regex-urlfilter.txt entries, i.e.

# skip URLs containing certain characters
-[*!@]
# accept anything else
+^http://([a-z0-9]*\.)*somesite.co.nz\/Search

The default skip list also contains '?' and '='; they are removed here so that parameterised urls are not rejected.

This will only consider somesite Search urls
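Before running a crawl it can help to check which urls the two filter rules would let through. The sketch below imitates them with grep -E, which is only an approximation of the Java regular expressions that Nutch uses. It assumes that rules are applied top to bottom, that the first matching rule decides, and that a url matching no rule is dropped. The helper name `url_passes_filter` is hypothetical, and the accept pattern is a slightly tightened variant of the one above (dots escaped, no backslash before the slash).

```shell
#!/bin/sh
# Stand-ins for the two regex-urlfilter.txt rules (grep -E syntax).
REJECT_CHARS='[*!@]'
ACCEPT_PATTERN='^http://([a-z0-9]*\.)*somesite\.co\.nz/Search'

# Hypothetical helper: exit status 0 if the url would pass both rules.
url_passes_filter() {
    url=$1
    # "-" rule: reject any url containing *, ! or @
    if printf '%s\n' "$url" | grep -Eq "$REJECT_CHARS"; then
        return 1
    fi
    # "+" rule: accept only Search urls on somesite.co.nz (or a subdomain);
    # anything that does not match is dropped
    printf '%s\n' "$url" | grep -Eq "$ACCEPT_PATTERN"
}
```

For example, `url_passes_filter 'http://somesite.co.nz/Search?DateRange=7&industry=62' && echo accepted` prints "accepted", while a url on another site or one containing '@' is rejected.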

Run Nutch

Now run Nutch using the start script

cd apache-nutch-1.6 ; ./nutch_start.bash

- Monitor for errors in the Solr admin log window
- The Nutch crawl should end with

crawl finished: crawl
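The start script itself is not listed in this part. As a rough sketch of what such a script might run, assuming the all-in-one `crawl` command of Nutch 1.6: the depth and topN values below are illustrative assumptions (the slides only say that a greater search depth is needed), and the helper name `crawl_command` is hypothetical. The function only prints the command line so that it can be reviewed before use.

```shell
#!/bin/sh
# Assumed values, not taken from the original script: adjust for your crawl.
SEED_DIR="urls"                          # directory holding seed.txt
CRAWL_DIR="crawl"                        # matches the -dir crawl option
SOLR_URL="http://localhost:8983/solr/"   # Solr server to index into
DEPTH=3                                  # link depth to follow (assumed)
TOPN=100                                 # max pages per level (assumed)

# Hypothetical helper: print the Nutch command line for review.
crawl_command() {
    printf 'bin/nutch crawl %s -solr %s -dir %s -depth %s -topN %s\n' \
        "$SEED_DIR" "$SOLR_URL" "$CRAWL_DIR" "$DEPTH" "$TOPN"
}
```

The printed command is meant to be run from inside the apache-nutch-1.6 directory, with the Solr server already started.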

Checking Data

Data should have been indexed in Solr

In the Solr Admin window

- Set 'Core Selector' = collection1
- Click 'Query'
- In the Query window set the fl field = url
- Click Execute Query

The result shows the filtered list of urls in Solr
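The same check can be made from the command line through Solr's select handler. This is a sketch under the assumption that Solr is at its default address and the core is collection1 as above; the helper name `solr_url_query` and the `SOLR_BASE` variable are hypothetical, and the rows and wt parameters are additions chosen here for readability of the output.

```shell
#!/bin/sh
# Base address of the Solr server (hypothetical override variable).
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Hypothetical helper: build a select url that returns only the url field.
# $1 = core name, $2 = number of rows to return
solr_url_query() {
    printf '%s/%s/select?q=*:*&fl=url&rows=%s&wt=json\n' "$SOLR_BASE" "$1" "$2"
}
```

With the Solr server running, `curl -sS "$(solr_url_query collection1 50)"` would request the url field of up to 50 indexed documents as JSON.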

Results

Congratulations, you have completed your second crawl

- With parameterised urls
- More complex url filtering
- With a Solr Query search

Contact Us

Feel free to contact us at

- www.semtech-solutions.co.nz
- info@semtech-solutions.co.nz

We offer IT project consultancy
We are happy to hear about your problems
You can just pay for those hours that you need to solve your problems
