Introduction:
Setting up Nutch to run in Eclipse is tricky and hard to follow, since many of its
files depend on one another. However, being able to debug Nutch in Eclipse is very
useful. That said, looking at the logs (logs/hadoop.log) is sometimes quicker.
Step 1:
First install Cygwin and add it to the PATH variable (Control Panel > System > Advanced
tab > Environment Variables, then edit or add PATH).
PATH: C:\cygwin\bin
Step 2:
• File > New > Project > Java project > click Next
• Name the project (Nutch_Trunk for instance)
• Select "Create project from existing source" and use the location where you downloaded
Nutch
• Click on Next, and wait while Eclipse is scanning the folders
• Add the folder "conf" to the classpath (third tab and then add class folder)
• Go to the "Order and Export" tab, find the entry for the added "conf" folder and move it to
the top. This is required so that Eclipse takes the config resources (nutch-default.xml,
nutch-final.xml, etc.) from our "conf" folder and not from anywhere else.
• Eclipse should have guessed all the Java files that must be added to your classpath. If that
is not the case, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to
your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries.
• Set output dir to "tmp_build", create it if necessary
• DO NOT add "build" to classpath
• In nutch-site.xml, set plugin.includes as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>plugin.includes</name>
<value>protocol-file|urlfilter-regex|parse-(xml|html|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>
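As a rough illustration of what this value does: plugin.includes is a regular expression over plugin ids. The Java sketch below assumes that a plugin is loaded only when its id matches the whole expression. The class and method names are made up for this example and are not part of Nutch.

```java
import java.util.regex.Pattern;

/** Illustrative check of plugin ids against a plugin.includes value (not Nutch code). */
final class PluginIncludesSketch {

    // The value from the nutch-site.xml shown above, on one line.
    static final String INCLUDES =
        "protocol-file|urlfilter-regex|parse-(xml|html|text)|index-(basic|anchor)"
        + "|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)";

    /** Assumption: the id must match the entire expression, not just a part of it. */
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Hypothetical list of plugin ids to check.
        String[] ids = {"protocol-file", "protocol-http", "parse-html", "parse-pdf"};
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(INCLUDES, id));
        }
    }
}
```

With this value, protocol-file is included while protocol-http is not, which matches the goal of crawling the local file system only.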
• In crawl-urlfilter.txt:
• Replace
-^(file|ftp|mailto):
with
-^(http|ftp|mailto):
• Replace
-.
with
+.
Explanation: a leading "+" means that URLs matching the regular expression
that follows are accepted, and a leading "-" means that they are rejected.
"." matches any character, so "+." accepts every URL that was not rejected
by an earlier rule. A more restrictive expression limits the crawl: for
example, a rule that allows only letters and digits in the host prefix
accepts http://241sdgsd.ist.psu.edu but rejects http://twet%dfs.ist.psu.edu.
The same can be done for any file extension as well.
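The effect of the "+" and "-" prefixes can be sketched in plain Java. This is only an illustration under two assumptions: rules are tried in file order and the first expression found in the URL decides, and a URL that matches no rule is rejected. The class name and the restrictive ist.psu.edu pattern are hypothetical and chosen to mirror the example above; this is not Nutch's own filter class.

```java
import java.util.List;
import java.util.regex.Pattern;

/** First-match-wins URL filtering sketch (illustrative, not Nutch's RegexURLFilter). */
final class UrlFilterSketch {

    // The two rules as edited above: reject http/ftp/mailto, accept everything else.
    static final List<String> RULES = List.of("-^(http|ftp|mailto):", "+.");

    // Hypothetical restrictive rule: only letters and digits in the host prefix.
    static final List<String> HOST_RULES =
        List.of("+^http://([a-z0-9]*\\.)*ist\\.psu\\.edu");

    /** Returns true if the first rule whose expression is found in the URL starts with '+'. */
    static boolean accepts(List<String> rules, String url) {
        for (String rule : rules) {
            boolean accept = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false; // assumption: a URL that matches no rule is rejected
    }

    public static void main(String[] args) {
        System.out.println(accepts(RULES, "file:///D:/BTP/docs/a.txt"));
        System.out.println(accepts(RULES, "http://example.org/"));
        System.out.println(accepts(HOST_RULES, "http://241sdgsd.ist.psu.edu"));
        System.out.println(accepts(HOST_RULES, "http://twet%dfs.ist.psu.edu"));
    }
}
```

Because the first matching rule decides, the order of the lines in the filter file matters: a broad "+." placed first would accept everything before any "-" rule is consulted.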
• In nutch-default.xml, change
<property>
<name>plugin.folders</name>
<value>plugins</value>
</property>
to
<property>
<name>plugin.folders</name>
<value>./src/plugin</value>
</property>
In nutch-0.9\src\plugin\protocol-file\src\java\org\apache\nutch\protocol\file:
The following took some time. The crawler was not restricted to the directories that we specified
in the urls file; it was jumping into the parent directories as well. The code responsible for this
is in org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f).
While it is obvious from the code that this behavior was intended by the author, we failed to
understand the motivation behind it. For my own crawls I changed the code so that only
directories beneath the directories that I specify get crawled, and recompiled.
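The restriction described above comes down to a containment check: a path is followed only if it lies beneath one of the specified directories. The Java sketch below shows one way to express that check. It is not the actual change made to FileResponse, and the class name, method name and example paths are assumptions made for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

/** Illustrative "stay beneath the crawl roots" check (not the FileResponse patch). */
final class CrawlRootCheck {

    /** True if candidate is one of the roots or lies beneath one of them. */
    static boolean isUnderAnyRoot(List<Path> roots, Path candidate) {
        // normalize() resolves ".." segments, so a link to the parent is seen as the parent.
        Path normalized = candidate.toAbsolutePath().normalize();
        for (Path root : roots) {
            // Path.startsWith compares whole name elements, so "docs2" is not under "docs".
            if (normalized.startsWith(root.toAbsolutePath().normalize())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical crawl root and candidates.
        List<Path> roots = List.of(Paths.get("data", "docs"));
        System.out.println(isUnderAnyRoot(roots, Paths.get("data", "docs", "sub", "a.txt")));
        System.out.println(isUnderAnyRoot(roots, Paths.get("data", "docs", "..")));
    }
}
```

A check of this kind would reject the "../" entry of a directory listing, which is what leads the crawler into parent directories.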
Commands to Search:
In Main Class:
org.apache.nutch.searcher.NutchBean
In VM Arguments:
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
In Program Arguments (these are crawl arguments, used when the main class is
org.apache.nutch.crawl.Crawl):
• "-dir crawl" supplies the name of the directory where all the segments will be
stored. Make sure that crawl does not already exist in the working Nutch
directory.
• "-depth 3" specifies the depth, i.e. pages at a link distance of up to 2 from
the seed URLs will be fetched.
This completes the crawl operation. When the crawl is finished, you will see a
directory named crawl in the Nutch directory.
Objective: Set up a web servlet container (Tomcat in this case) and perform the
search operation on the crawled pages.
We have Nutch and Tomcat in the D:/BTP directory. In our case, both Nutch and
Tomcat (bin configuration) are already installed, so we just need to configure
these two tools and we are ready to go.
Next, we need to deploy our nutch application to this tomcat server. Here is how
this is done.
Next, we need to start the Tomcat server. Open cmd (Windows), change to the bin
directory of Tomcat, and run the following command:
catalina.bat start
Now the Tomcat server is running, but it is not configured yet and therefore
cannot find the index that Nutch created in the last step. So we will configure
the Tomcat server before trying out the search.
In the file webapps/ROOT/WEB-INF/classes/nutch-site.xml, change
<?xml version="1.0"?>
<configuration>
</configuration>
to
<?xml version="1.0"?>
<configuration>
<property>
<name>searcher.dir</name>
<value>D:\BTP\nutch-0.9\crawl</value>
</property>
</configuration>
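A mistyped searcher.dir is an easy way to end up with an empty search page, so it can help to read the value back before restarting Tomcat. The Java sketch below does this with the JDK's built-in XML parser. It is only a convenience check, and the class and method names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Reads one property value from a nutch-site.xml style document (illustrative only). */
final class SiteConfigCheck {

    /** Returns the value of the named property, or null if it is not present. */
    static String readProperty(String xml, String wanted) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                if (wanted.equals(textOf(property, "name"))) {
                    return textOf(property, "value");
                }
            }
            return null;
        } catch (Exception e) {
            // Malformed XML (for example a broken declaration) ends up here.
            throw new IllegalStateException("Could not parse configuration", e);
        }
    }

    private static String textOf(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><configuration><property>"
            + "<name>searcher.dir</name><value>D:\\BTP\\nutch-0.9\\crawl</value>"
            + "</property></configuration>";
        System.out.println(readProperty(xml, "searcher.dir"));
    }
}
```

The same parser also rejects a malformed declaration such as one with a missing "=" after version, which makes it a quick way to catch copy and paste damage in the file.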
Now we need to restart the Tomcat server: stop it first and then start it again.
These two commands do the work:
catalina.bat stop
catalina.bat start
Now open localhost:8080/ in the browser; the deployed Nutch application should appear.
Reference:
http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene