Nutch Configuration

Configuring Nutch and Creating the Index
Introduction:
Setting up Nutch to run in Eclipse is tricky, because many of its files depend on
one another. It is nevertheless very useful to be able to debug Nutch in Eclipse,
although looking at the logs (logs/hadoop.log) is often quicker.
Step 1: Installing Nutch 0.9 in Eclipse on Windows
First install Cygwin and add it to the PATH variable (Control Panel > System >
Advanced tab > Environment Variables, then edit or add PATH). PATH: C:\cygwin\bin
Grab a fresh release of Nutch 0.9:
http://lucene.apache.org/nutch/version_control.html
Step 2: Create a new Java project in Eclipse
• File > New > Project > Java project > click Next
• Name the project (Nutch_Trunk, for instance)
• Select "Create project from existing source" and use the location where you downloaded
Nutch
• Click Next and wait while Eclipse scans the folders
• Add the folder "conf" to the classpath (third tab, then add class folder)
• Go to the "Order and Export" tab, find the entry for the added "conf" folder and move it to
the top. This is required so that Eclipse takes the config resources (nutch-default.xml,
nutch-final.xml, etc.) from our "conf" folder and not from anywhere else.
• Eclipse should have guessed all the Java files that must be added to your classpath. If it
has not, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to
your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries.
• Set the output dir to "tmp_build"; create it if necessary
• DO NOT add "build" to the classpath
Change the files
• In nutch-site.xml, include the protocol-file plugin:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>plugin.includes</name>
<value>protocol-file|urlfilter-regex|parse-(xml|html|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>
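• In crawl-urlfilter.txt
• Replace
-^(file|ftp|mailto):
with
-^(http|ftp|mailto):
• Replace
-.
with
+.
Explanation: a rule that starts with "+" accepts every URL matching the regular
expression that follows it, and a rule that starts with "-" rejects every matching
URL. The rules are tried in order and the first one that matches decides. A "."
matches any character, so "+." accepts every URL that no earlier rule rejected.
A narrower expression restricts the crawl to matching hosts. For instance, with a
host pattern that allows only letters and digits in front of ist.psu.edu (such as
+^http://([a-z0-9]*\.)*ist.psu.edu), a page whose base URL is
http://241sdgsd.ist.psu.edu is accepted, but http://twet%dfs.ist.psu.edu is not.
File extensions can be filtered in the same way.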
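The rule evaluation described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the Nutch filter plugin itself: the class UrlFilterSketch and its methods are hypothetical names, only the two edited rules are included (the real crawl-urlfilter.txt contains more), and a rule is assumed to match when its expression is found anywhere in the URL.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not Nutch's own filter class.
class UrlFilterSketch {
    // One rule line: '+' accepts, '-' rejects, followed by a regular expression.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // Only the two rules edited above, in file order (assumption: other rules omitted).
    static final List<Rule> RULES = Arrays.asList(
            new Rule("-^(http|ftp|mailto):"),
            new Rule("+."));

    // The first rule whose expression is found in the URL decides;
    // a URL that matches no rule is rejected.
    static boolean accepts(String url) {
        for (Rule rule : RULES) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("file:///home/cf/tests")); // prints true
        System.out.println(accepts("http://example.org/"));   // prints false
    }
}
```

With these two rules, local file: URLs pass while http, ftp and mailto URLs are dropped, which is the behavior wanted for a crawl of local directories.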
• Create a file named "urls" that lists the locations of all the directories to be
crawled. For example: file:///home/cf/tests
• In nutch-default.xml, change
<property>
<name>plugin.folders</name>
<value>plugins</value>
</property>
to
<property>
<name>plugin.folders</name>
<value>./src/plugin</value>
</property>
Don't crawl the parent directories
The following took some time. The crawler was not restricted to the directories that we specified
in the urls file; it was jumping into the parent directories as well. The code responsible
for this is in org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f),
found under nutch-0.9\src\plugin\protocol-file\src\java\org\apache\nutch\protocol\file.
While it is obvious from the code that this behavior was intended by the author,
we failed to understand the motivation behind it. For my own crawls I changed the code so
that only directories beneath the directories I specify get crawled.
I changed the following line:
this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);
to
this.content = list2html(f.listFiles(), path, false);
and recompiled.
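Why a single boolean is enough to stop the climb into parent directories can be seen in a simplified stand-in for the directory-listing step. The sketch below is not the Nutch source: DirListingSketch, listToHtml and the parameter name includeDotDot are hypothetical, and it assumes that the third argument controls whether the generated HTML listing contains a "../" link, which a crawler would then follow like any other outlink.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the directory-listing step; not the Nutch source.
class DirListingSketch {
    // Builds an HTML listing for a directory. When includeDotDot is true the
    // listing links to the parent directory, which a crawler treats as an outlink.
    static String listToHtml(List<String> names, String path, boolean includeDotDot) {
        StringBuilder html = new StringBuilder();
        html.append("<html><body><h1>Index of ").append(path).append("</h1>\n");
        if (includeDotDot) {
            html.append("<a href=\"../\">../</a>\n");
        }
        for (String name : names) {
            html.append("<a href=\"").append(name).append("\">")
                .append(name).append("</a>\n");
        }
        return html.append("</body></html>").toString();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a.txt", "docs/");
        // With the flag fixed to false, no parent link is emitted.
        System.out.println(listToHtml(names, "/home/cf/tests", false));
    }
}
```

Passing false unconditionally, as in the changed line, means the listing only ever links downwards, so the crawl stays beneath the seed directories.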
Commands to Search:
Main class:
org.apache.nutch.searcher.NutchBean
VM arguments:
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
Program arguments:
GERHARD WANGLER system (the query string)

Command to Crawl:
urls -dir crawl -depth 3 -topN 50
• "urls" is the file that contains the seed URLs.
• "-dir crawl" supplies the name of the directory where all the segments will be
stored. Make sure that "crawl" does not already exist in the working Nutch directory.
• "-depth 3" specifies the crawl depth, i.e. pages at a distance of up to 2 links
from the seed URLs will be fetched.
• "-topN 50" limits each round of the crawl to at most 50 pages.
This completes the crawl operation. When crawling has finished, you will see a
directory named crawl in the Nutch directory.
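The relation between "-depth 3" and "a distance of up to 2" is easier to see in a small model of a round-based crawl. The sketch below does not call any Nutch API; CrawlDepthSketch, its crawl method and the in-memory link map are hypothetical, and the per-round limit simply takes URLs in queue order rather than by score.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a depth-limited crawl; it does not call any Nutch API.
class CrawlDepthSketch {
    // Runs `depth` rounds. Each round fetches at most topN not-yet-fetched URLs
    // from the queue and adds their outlinks to the queue for the next round.
    static Set<String> crawl(Map<String, List<String>> links, List<String> seeds,
                             int depth, int topN) {
        Set<String> fetched = new LinkedHashSet<>();
        List<String> queue = new ArrayList<>(seeds);
        for (int round = 1; round <= depth && !queue.isEmpty(); round++) {
            List<String> next = new ArrayList<>();
            int fetchedThisRound = 0;
            for (String url : queue) {
                if (fetched.contains(url)) {
                    continue; // already fetched in an earlier step
                }
                if (fetchedThisRound < topN) {
                    fetched.add(url);
                    fetchedThisRound++;
                    next.addAll(links.getOrDefault(url, Collections.<String>emptyList()));
                } else {
                    next.add(url); // over this round's budget: stays queued
                }
            }
            queue = next;
        }
        return fetched;
    }
}
```

With a chain a → b → d → e and depth 3, round 1 fetches the seed a (distance 0), round 2 fetches b (distance 1) and round 3 fetches d (distance 2); e, at distance 3, is discovered but never fetched.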
Working with the Tomcat Server
Objective: Set up a web servlet container (Tomcat in this case) and perform the
search operation on the crawled pages.
Assumption: Java, Nutch and Tomcat are installed on the system.
In our case both Nutch and Tomcat (binary distribution) are already installed in the
D:/BTP directory, so we only need to configure these two tools and we are ready to go.
Tomcat is located at D:\BTP\apache-tomcat-6.0.26. This directory contains "bin",
where all the executables are located, and "webapps", where all the web applications
that run inside Tomcat are deployed.
Next, we need to deploy the Nutch application to this Tomcat server. Here is how
this is done.
• First remove the root application (the ROOT directory in webapps) that ships with
Tomcat, then copy the Nutch web application file (nutch-0.9.war) into the webapps
directory.
• Rename the copied WAR file to ROOT.war in the webapps directory.
Next, start the Tomcat server with the following command in cmd (Windows):
catalina.bat start
Make sure you are in Tomcat's bin directory before executing this command.
The Tomcat server is now running, but it is not configured yet and therefore cannot
find the index that Nutch created in the previous step. So we will configure the
Tomcat server before trying out the search.
In the file webapps/ROOT/WEB-INF/classes/nutch-site.xml, change the content
from
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>
to
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>searcher.dir</name>
<value>D:\BTP\nutch-0.9\crawl</value>
</property>
</configuration>
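Since a wrong searcher.dir path typically shows up only as an empty result list, it can help to check the directory before restarting Tomcat. The following is a minimal sketch under assumptions: CrawlDirCheckSketch and looksLikeCrawlDir are hypothetical names, and the check assumes that a finished crawl directory contains a "segments" folder plus an "index" or "indexes" folder.

```java
import java.io.File;

// Hypothetical helper; the expected folder names are an assumption about
// the layout of a finished crawl directory.
class CrawlDirCheckSketch {
    // Returns true if dir exists and contains "segments" plus "index" or "indexes".
    static boolean looksLikeCrawlDir(File dir) {
        boolean hasSegments = new File(dir, "segments").isDirectory();
        boolean hasIndex = new File(dir, "index").isDirectory()
                || new File(dir, "indexes").isDirectory();
        return dir.isDirectory() && hasSegments && hasIndex;
    }

    public static void main(String[] args) {
        // Default path taken from the searcher.dir value used above.
        File dir = new File(args.length > 0 ? args[0] : "D:\\BTP\\nutch-0.9\\crawl");
        System.out.println(dir + " looks like a crawl directory: " + looksLikeCrawlDir(dir));
    }
}
```

If the check fails, the crawl step probably did not finish, or searcher.dir points one level too high or too low.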
Now we need to restart the Tomcat server: stop it first and then start it again.
These two commands should do the work:
catalina.bat stop
catalina.bat start
Now type http://localhost:8080/ in the browser to open the deployed Nutch application.
Reference:
http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene
