
Solr Extracting Data

Start this session with a full Solr indexed repository

Movie cAiYBD4BQeE showed installation
Movie Th5Scvlyt-E showed Nutch web crawl

This movie will show how to

Extract data from Solr
Extract to XML or CSV
Show aim to load into data warehouse

This movie assumes you know Linux

Solr Extracting Data

Progress so far, greyed out area yet to be examined

Checking Solr Data

Data should have been indexed in Solr
In Solr Admin window

Set 'Core Selector' = collection1
Click 'Query'
In Query window set fl field = url
Click Execute Query

The result (next slide) shows the filtered list of URLs in Solr
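The same check can be made without the admin window. The lines below are a minimal sketch that only builds the query URL; it assumes Solr is listening on localhost:8983 and that the select handler is reachable at /solr/select, as in the URL used later in this session. The variable names are illustrative.

```shell
# Assumed base URL, taken from the http call used later in this session.
SOLR_BASE='http://localhost:8983/solr'

# Same filter as the admin Query window: match everything, return only the url field.
QUERY_URL="${SOLR_BASE}/select?q=*:*&fl=url"
echo "$QUERY_URL"

# With Solr running, the response could be fetched with:
#   curl "$QUERY_URL"
```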

Checking Solr Data

How To Extract

How could we get at Solr data?

In admin console via query
Via http solr select
Via curl -o call using solr http select

What format of data suits this purpose?

XML
Comma-separated values (CSV)


How To Extract

We want to extract two columns from Solr

tstamp, url

We want to extract as csv (csv in call below could be xml)
We want to extract to a file
So we will use an http call

http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv

We will also use a curl call

curl -o <csv file> '<http call>'
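As a sketch of how the two calls fit together, the helper below builds the select URL from a field list and an output format, so the same call can produce csv or xml. The function name solr_select_url and the SOLR_BASE variable are hypothetical; only the URL layout comes from the call shown above.

```shell
#!/bin/bash
# Hypothetical helper: build a Solr select URL for a field list and output format.
solr_select_url() {
    local fields="$1"          # comma separated field list, e.g. tstamp,url
    local format="${2:-csv}"   # wt value: csv (default here) or xml
    printf '%s/select?q=*:*&fl=%s&wt=%s\n' \
        "${SOLR_BASE:-http://localhost:8983/solr}" "$fields" "$format"
}

solr_select_url tstamp,url csv
solr_select_url tstamp,url xml

# With Solr running, the csv form could be written to a file with:
#   curl -o result.csv "$(solr_select_url tstamp,url csv)"
```

The URL is kept in quotes when passed to curl because it contains & and *, which the shell would otherwise interpret.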

How To Extract

Create a bash file in Solr install directory

cd solr-4-2-1/extract ; touch solr_url_extract.bash
chmod 755 solr_url_extract.bash

Add contents to bash file

#!/bin/bash
curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'
mv result.csv result.csv.$(date +%Y%m%d.%H%M%S)


Now run the bash script

./solr_url_extract.bash
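The script above renames result.csv even if curl could not reach Solr. A slightly more defensive variant might look like the sketch below. It is not the script from the movie: the function names and the "run" argument are assumptions added for illustration, while the URL and file naming follow the original.

```shell
#!/bin/bash
# Sketch of a more defensive extract script; function names are hypothetical.
SOLR_URL='http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'

# Build the timestamped file name in the same pattern as the original script.
stamped_name() {
    echo "result.csv.$(date +%Y%m%d.%H%M%S)"
}

# Download to result.csv and rename it only if curl reports success.
extract_urls() {
    local target
    target="$(stamped_name)"
    if curl --fail --silent --show-error -o result.csv "$SOLR_URL"; then
        mv result.csv "$target"
        echo "wrote $target"
    else
        echo "extract failed, nothing renamed" >&2
        return 1
    fi
}

# Contact Solr only when called with the argument "run",
# so the functions can be loaded and checked without a server.
if [ "${1:-}" = "run" ]; then
    extract_urls
fi
```

With this variant the extract would be started as ./solr_url_extract.bash run.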

Check Output

Now we check whether we have data
ls -l shows

result.csv.20130506.124857

Check the content, wc -l shows 11 lines (one header line plus 10 rows; Solr returns 10 rows by default unless a rows parameter is set)
Check the content, head -2 shows

tstamp,url
2013-05-04T01:56:58.157Z,http://www.mysite.co.nz/Search?DateRange=7& ...
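The manual checks can also be scripted. The sketch below counts the data rows in an extract, leaving out the header line. The function name csv_data_rows is hypothetical, and the sample file is made-up data used only so the example runs without Solr.

```shell
# Hypothetical helper: count data rows in a CSV extract, excluding the header line.
csv_data_rows() {
    local file="$1"
    # tail -n +2 skips the header; wc -l counts the rest; tr strips any padding.
    tail -n +2 "$file" | wc -l | tr -d ' '
}

# Self-contained demonstration with a made-up two-row file.
sample="$(mktemp)"
printf 'tstamp,url\n2013-05-04T01:56:58.157Z,http://example.com/a\n2013-05-04T01:57:02.000Z,http://example.com/b\n' > "$sample"
csv_data_rows "$sample"   # prints 2
rm -f "$sample"
```

Against a real extract the call would be csv_data_rows result.csv.20130506.124857, which should report one less than wc -l.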

Congratulations, you have extracted data from Solr.
It's in CSV format, ready to be loaded into a data warehouse.

Possible Next Steps

Choose more fields to extract from data
Allow Nutch crawl to go deeper
Allow Nutch crawl to collect a lot more data
Look at facets in Solr data
Load CSV files into Data Warehouse Staging schema

Next movie will show next step in progress

Contact Us

Feel free to contact us at


www.semtech-solutions.co.nz
info@semtech-solutions.co.nz

We offer IT project consultancy
We are happy to hear about your problems
You can just pay for those hours that you need to solve your problems
