Previously we have:
- Watched web scraping with Nutch and Solr (the movie id is cAiYBD4BQeE)
- Set up a Linux-based Nutch/Solr environment
- Run the web scrape shown in that movie

Now we will:
- Clean up that environment
- Web scrape a parameterised URL
- View the URLs in the data
Previously we used apache-nutch-1.6/nutch_start.sh, which contained the -dir crawl option. This created the apache-nutch-1.6/crawl directory, which contains our Nutch data.

Clean this up as follows:

cd apache-nutch-1.6 ; rm -rf crawl

We only do this because the directory contained dummy data! The next run of the script will create the directory again.
Bookmark this URL, and only use it if you need to empty your Solr data:

curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: text/xml' -d '<delete><query>*:*</query></delete>'
Set up Nutch
Now we will do something more complex: web scrape a URL that has parameters, i.e.

http://<site>/<function>?var1=val1&var2=val2

Such URLs:
- Have extra URL characters '?=&'
- Need greater search depth
- Need better URL filtering
Remember that you need to get permission to scrape a third party web site
Nutch Configuration
Change the seed file for Nutch, apache-nutch-1.6/urls/seed.txt. In this instance I will use a URL of the form

http://somesite.co.nz/Search?DateRange=7&industry=62

(this is not a real URL, just an example)
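One way to put this in place from the shell, assuming the Nutch install path used earlier in this deck (the URL itself is the made-up example above):

```shell
# Write the single start URL into Nutch's seed list.
# mkdir -p is a no-op if the install directory already exists.
mkdir -p apache-nutch-1.6/urls
echo 'http://somesite.co.nz/Search?DateRange=7&industry=62' > apache-nutch-1.6/urls/seed.txt
```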
Then edit the URL filter rules (conf/regex-urlfilter.txt):

# skip URLs containing certain characters
-[*!@]
# accept anything else
+^http://([a-z0-9]*\.)*somesite.co.nz\/Search

Note that '?' and '=' have been removed from the default skip list, so parameterised URLs are no longer filtered out.
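To see what these two rules do, here is a small sketch that mimics Nutch's first-match rule order with grep. passes_filter and the sample URLs are illustrative helpers, not part of Nutch:

```shell
# Sketch of the regex-urlfilter.txt rules above, mimicking Nutch's
# rule order: a matching skip rule rejects before the accept rule runs.
# passes_filter is a hypothetical helper, not a Nutch command.
passes_filter() {
  # rule 1: skip URLs containing certain characters
  printf '%s' "$1" | grep -q '[*!@]' && return 1
  # rule 2: accept anything matching the site/function pattern
  printf '%s' "$1" | grep -Eq '^http://([a-z0-9]*\.)*somesite.co.nz/Search'
}

passes_filter 'http://somesite.co.nz/Search?DateRange=7&industry=62' && echo kept   # prints: kept
passes_filter 'http://somesite.co.nz/About' || echo filtered                        # prints: filtered
```

Because '?', '=' and '&' pass rule 1, the parameterised search URL survives the filter.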
Run Nutch
cd apache-nutch-1.6 ; ./nutch_start.bash
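The deck does not show the script body. As a sketch only, assuming the layout from the previous session (seeds in urls/, output under crawl/, Solr on localhost:8983) and guessed -depth/-topN values, a Nutch 1.x crawl script could look like:

```shell
#!/bin/bash
# Hypothetical nutch_start.bash -- a sketch, not the script from the deck.
# Assumes: seed list in urls/, crawl data in crawl/, Solr at localhost:8983.
# -depth is raised because the parameterised search page needs greater
# search depth; -topN caps the pages fetched per round.
bin/nutch crawl urls -dir crawl -depth 5 -topN 1000 \
  -solr http://localhost:8983/solr/
```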
Monitor the Solr admin log window for errors; the Nutch crawl should run to completion without any being reported.
Checking Data
Set 'Core Selector' = collection1
Click 'Query'
In the Query window set the fl field = url
Click 'Execute Query'
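Equivalently, the same query can be run from the command line, assuming the default Solr 4 core name collection1 used above:

```shell
# Query Solr for all documents, returning only each document's url field.
curl 'http://localhost:8983/solr/collection1/select?q=*:*&fl=url&wt=json&indent=true'
```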
Results
We have now web scraped:
- With parameterised URLs
- With more complex URL filtering
- With a Solr query search
Contact Us
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
We offer IT project consultancy. We are happy to hear about your problems. You can pay for just the hours that you need to solve them.