
Configuring Nutch and Creating the Index

Introduction:

Setting up Nutch to run in Eclipse is tricky, since many of its files depend on one
another. However, it is very useful to be able to debug Nutch in Eclipse. That said,
it is often quicker to look at the logs first (logs/hadoop.log).

Step 1: Installing Nutch 0.9 in Eclipse on the Windows platform

First install Cygwin and add it to the PATH variable (Control Panel > System > Advanced
tab > Environment Variables, then edit or add PATH). PATH: C:\cygwin\bin

Grab a fresh release of Nutch 0.9:

http://lucene.apache.org/nutch/version_control.html

Step 2: Creating a new Java project in Eclipse

• File > New > Project > Java project > click Next
• Name the project (Nutch_Trunk for instance)
• Select "Create project from existing source" and use the location where you downloaded
Nutch
• Click on Next, and wait while Eclipse is scanning the folders
• Add the folder "conf" to the classpath (third tab, then "Add Class Folder")
• Go to the "Order and Export" tab, find the entry for the added "conf" folder and move it to
the top. This is required so that Eclipse takes the config resources (nutch-default.xml,
nutch-final.xml, etc.) from our "conf" folder and not from anywhere else.
• Eclipse should have guessed all the Java files that must be added to your classpath. If that
is not the case, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to
your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries
• Set the output dir to "tmp_build"; create it if necessary
• DO NOT add "build" to the classpath

Change the files

• In nutch-site.xml

Change nutch-site.xml to include the protocol-file plugin:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>plugin.includes</name>
<value>protocol-file|urlfilter-regex|parse-(xml|html|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

</configuration>
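
The plugin.includes value is a regular expression over plugin directory names. The following is a minimal sketch, assuming that a plugin is loaded only when its whole id matches the expression; the class and method names are hypothetical and this is not Nutch's own plugin loader.

```java
import java.util.regex.Pattern;

public class PluginIncludesSketch {
    // The plugin.includes value from nutch-site.xml above, joined onto one line.
    static final String INCLUDES =
        "protocol-file|urlfilter-regex|parse-(xml|html|text)|index-(basic|anchor)"
        + "|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)";

    // Assumption: the whole plugin id must match the expression.
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"protocol-file", "protocol-http", "parse-html", "parse-pdf"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(INCLUDES, id));
        }
    }
}
```

Under this assumption protocol-file and parse-html are included, while protocol-http and parse-pdf are not, which is why a local file crawl needs protocol-file listed explicitly.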

• In crawl-urlfilter.txt

• Replace
-^(file|ftp|mailto):
to

-^(http|ftp|mailto):

• Replace
-.
to
+.
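Explanation: “+” means that a URL matching the regular expression that
follows it is accepted, and “-” means that such a URL is rejected. “.”
matches any character, so “+.” accepts every URL that was not rejected
by an earlier rule. A narrower pattern restricts the crawl: for example,
under a host pattern of the form +^http://([a-z0-9]*\.)*ist.psu.edu/
(the form used in Nutch's default crawl-urlfilter.txt), a page with the
base URL http://241sdgsd.ist.psu.edu will be accepted but
http://twet%dfs.ist.psu.edu will not. The same can be done for any
file extension as well.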
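
The rule semantics can be sketched in a few lines of Java. This is a simplified model, not Nutch's filter plugin: it assumes that rules are tried in order, that the first matching pattern decides, that a pattern may match anywhere in the URL unless anchored, and that a URL matching no rule is rejected. The class name and the ist.psu.edu host pattern are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;
        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each line is '+' or '-' followed by a regular expression.
    public UrlFilterSketch(String... lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    // Assumption: the first matching rule decides; no match means reject.
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch local = new UrlFilterSketch("-^(http|ftp|mailto):", "+.");
        System.out.println(local.accepts("file:///home/cf/tests")); // true
        System.out.println(local.accepts("http://example.com/"));   // false

        UrlFilterSketch host = new UrlFilterSketch("+^http://([a-z0-9]*\\.)*ist.psu.edu/");
        System.out.println(host.accepts("http://241sdgsd.ist.psu.edu/")); // true
        System.out.println(host.accepts("http://twet%dfs.ist.psu.edu/")); // false
    }
}
```

With the two rules used in this setup, file: URLs pass because only the final “+.” rule matches them, while http: URLs are stopped by the first rule.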

• Create a file named “urls” that contains the locations of all the
directories to be crawled, one per line.
For example: file:///home/cf/tests

• In the file nutch-default.xml

Change

<property>
<name>plugin.folders</name>
<value>plugins</value>
</property>

to

<property>
<name>plugin.folders</name>
<value>./src/plugin</value>
</property>
Don’t crawl the parent directories

The relevant source is in nutch-0.9\src\plugin\protocol-file\src\java\org\apache\nutch\protocol\file

The following took some time. The crawler was not restricted to the directories specified
in the urls file; it was jumping into the parent directories as well. The code responsible
for this is in org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f).

While it is obvious from the code that this behavior was intended by the author, the
motivation behind it is unclear. For my own crawls I changed the code so that only
directories beneath the ones I specify get crawled.

I changed the following line:

this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);

to

this.content = list2html(f.listFiles(), path, false);

and recompiled.
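
To see why this change keeps the crawler out of parent directories, here is a simplified model of a directory listing whose third argument controls whether a link to the parent is emitted. The class, the method listToHtml and the HTML layout are hypothetical stand-ins, not the code of FileResponse.list2html.

```java
import java.util.Arrays;
import java.util.List;

public class DirListingSketch {
    // Builds an HTML page listing the entries of a directory.
    // When includeParentLink is true, a "../" link is added, which a
    // crawler would follow upwards out of the seed directory.
    static String listToHtml(List<String> names, String path, boolean includeParentLink) {
        StringBuilder html = new StringBuilder();
        html.append("<html><body><h1>Index of ").append(path).append("</h1><pre>\n");
        if (includeParentLink) {
            html.append("<a href='../'>../</a>\n");
        }
        for (String name : names) {
            html.append("<a href='").append(name).append("'>").append(name).append("</a>\n");
        }
        html.append("</pre></body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a.txt", "sub/");
        // Original behavior for any path other than "/": parent link present.
        System.out.println(listToHtml(names, "/home/cf/tests", true));
        // Patched behavior: the parent link is never emitted.
        System.out.println(listToHtml(names, "/home/cf/tests", false));
    }
}
```

Since the crawler only discovers new URLs through links in fetched pages, a listing without the "../" entry gives it no path upwards.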

Commands to Search:

In Main Class:

org.apache.nutch.searcher.NutchBean

In VM Arguments:

-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log

In Program Arguments:

GERHARD WANGLER system (query string)


Command to Crawl:

urls -dir crawl -depth 3 -topN 50

• urls is the file that contains the seed URLs.

• “-dir crawl” is the argument that supplies the name of the directory where
all the segments will be stored. Make sure that crawl does not already exist
in the working Nutch directory.

• “-depth 3” specifies the crawl depth, i.e. pages up to 2 links away from
the seed URLs will be fetched.

• “-topN 50” limits each round of the crawl to the 50 top-scoring URLs.

This completes the crawl operation. When it has finished, you will see a directory
named crawl in the Nutch directory.
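
The meaning of the depth argument can be illustrated with a small breadth-first simulation over an in-memory link graph. This is only a sketch under the assumption that each unit of depth is one fetch round; the class name and the graph are made up, and scoring and topN are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CrawlDepthSketch {
    // Fetches pages round by round; each round fetches the links
    // discovered in the previous one, for at most 'depth' rounds.
    static Set<String> crawl(Map<String, List<String>> links, List<String> seeds, int depth) {
        Set<String> fetched = new LinkedHashSet<>();
        List<String> frontier = new ArrayList<>(seeds);
        for (int round = 0; round < depth && !frontier.isEmpty(); round++) {
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                if (!fetched.add(url)) {
                    continue; // already fetched in an earlier round
                }
                for (String out : links.getOrDefault(url, Collections.<String>emptyList())) {
                    if (!fetched.contains(out)) {
                        next.add(out);
                    }
                }
            }
            frontier = next;
        }
        return fetched;
    }

    public static void main(String[] args) {
        // A chain of pages: a links to b, b to c, c to d.
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Collections.singletonList("b"));
        links.put("b", Collections.singletonList("c"));
        links.put("c", Collections.singletonList("d"));

        // Depth 3 fetches the seed plus pages up to 2 links away.
        System.out.println(crawl(links, Collections.singletonList("a"), 3)); // [a, b, c]
    }
}
```

Page d is three links away from the seed, so a depth of 3 discovers it but never fetches it.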

Working with the Tomcat Server

Objective: Set up a web servlet container (Tomcat in this case) and perform the
search operation on the crawled pages.

Assumption: Java, Nutch and Tomcat are installed on the system.

We have Nutch and Tomcat in the D:/BTP directory. In our case, both Nutch and
Tomcat (bin configuration) are already installed, so we just need to configure
these two tools and we are ready to go.

In our case Tomcat is at D:\BTP\apache-tomcat-6.0.26. This directory contains
“bin”, where all the executables are located, and “webapps”, where all the web
applications that run inside Tomcat are deployed.

Next, we need to deploy our nutch application to this tomcat server. Here is how
this is done.

• First, remove the root application (the ROOT directory in webapps) that
ships with Tomcat, then copy the Nutch web application file (nutch-0.9.war)
into the webapps directory.

• Rename nutch-0.9.war to ROOT.war in the webapps directory.

Next, start the Tomcat server with the following command in cmd (Windows):

catalina.bat start

Before executing the command, make sure you are in Tomcat's bin directory.

Now the Tomcat server is running, but it is not configured yet and therefore
cannot find the index that Nutch created in the last step. So we will configure
the Tomcat server before trying out the search.

In the file webapps/ROOT/WEB-INF/classes/nutch-site.xml

The content of the file should be changed

From

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

</configuration>

To

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>searcher.dir</name>
<value>D:\BTP\nutch-0.9\crawl</value>
</property>

</configuration>
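
A quick way to check that the edited file is well-formed and carries the expected value is to parse it and look the property up. The sketch below uses the JDK's XML parser as a stand-in; it is not how Nutch itself loads its configuration, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SiteConfigSketch {
    // Returns the <value> of the <property> whose <name> equals 'name',
    // or null if the configuration has no such property.
    static String property(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                String propName = prop.getElementsByTagName("name").item(0).getTextContent().trim();
                if (propName.equals(name)) {
                    return prop.getElementsByTagName("value").item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            // Malformed XML (for example curly quotes in the declaration) ends up here.
            throw new IllegalStateException("could not parse configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<configuration>\n"
                + "<property>\n"
                + "<name>searcher.dir</name>\n"
                + "<value>D:\\BTP\\nutch-0.9\\crawl</value>\n"
                + "</property>\n"
                + "</configuration>\n";
        System.out.println(property(xml, "searcher.dir"));
    }
}
```

If the file was pasted from a word processor and still contains typographic quotes, the parser rejects it, which is a common reason for the web application not finding the index.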

Now we need to restart the Tomcat server: stop it first and then start it again.
These two commands should do the work:

catalina.bat stop

catalina.bat start

Now type localhost:8080/ in the browser; the Nutch application is deployed there.

Reference:

http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene
