Friday, February 29, 2008

Installation guide for dummies: Nutch 0.9

By Peter P. Wang, Zillionics Inc.

Try the search engine I developed: Malachi Search

Please support my effort by using the best free/low-price web hosting: 1&1 Inc

peterwang@zillionics.com

The article is here:
http://trackgc.com/tr/resources/articles/index.htm

42 comments:

Anonymous said...

THANK YOU!!!

A few months ago I tried to install Nutch and almost went crazy :D but today I found your tutorial, and after a few hours and a few problems I was able to make everything work. I had a few problems with the heap size, so maybe you should include that in your tutorial?

I have a question: why did we create the urls folder and the text file in it? It seems we never used it. Or could we use this file instead of modifying the conf/crawl-urlfilter.txt file?

Will you maybe create a few more tutorials, for example for Ubuntu?

Thanks again, bye
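
A quick note on the heap-size problem mentioned above: a minimal sketch, assuming the stock bin/nutch launcher, which in the copies I have seen honours a NUTCH_HEAPSIZE environment variable (in megabytes):

export NUTCH_HEAPSIZE=1000    # give the JVM roughly 1 GB of heap before crawling
bin/nutch crawl urls -dir crawl -depth 3 -topN 50

If your copy of the script does not read NUTCH_HEAPSIZE, the fallback is to raise the -Xmx value (JAVA_HEAP_MAX) inside bin/nutch itself.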

Anonymous said...

hi, thanks for making the Nutch 0.9 tutorial. However, when I ran the
'nutch crawl' script from a cygwin bash shell I got syntax errors.
It seems that one needs to run the script file through 'd2u' to get it
to work! You may want to document this somewhere.

best,
arthur
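
A minimal sketch of the fix arthur describes, assuming Cygwin's d2u (dos2unix) utility is installed and the shell is sitting in the Nutch directory:

d2u bin/nutch                                         # strip the DOS line endings that break bash
bin/nutch crawl urls -dir crawl -depth 3 -topN 50     # re-run the crawl once the script is converted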

Anonymous said...

Hi Peter,

I just wanted to thank you for a great insight into Nutch. Your guide has been quite helpful to me in getting started with it. I was able to crawl our site and load the WAR file into Tomcat (and thereby perform some nice searches). On a side note, I found the earlier comment useful: run "d2u" in Cygwin on scripts such as bin/nutch. On another note, I was wondering whether you had any pointers on performing recrawls and/or crawls of various sites. You see, I crawled the wrong site initially and would like to change the configuration so that I can crawl the "right" site (while still using the same Nutch installation). Do you have pointers on how to do that?

Regards,
Max
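
For Max's question, a rough sketch under the guide's intranet-crawl setup (a urls/ seed file plus conf/crawl-urlfilter.txt); the domain name below is purely hypothetical:

echo "http://www.right-site.example/" > urls/seed.txt   # replace the old seed with the site you actually want
# in conf/crawl-urlfilter.txt, change the accept line to match the new host, e.g.
#   +^http://([a-z0-9]*\.)*right-site.example/
bin/nutch crawl urls -dir crawl-rightsite -depth 3 -topN 50   # crawl into a fresh directory
# finally, point searcher.dir in Tomcat's WEB-INF/classes/nutch-site.xml at crawl-rightsite and restart Tomcat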

Anonymous said...

Hi Peter!

Thanks for posting the nutch tutorial on the wiki. Got the 1.0 dev
release working on mac leopard because of it.

Best,
James

Anonymous said...

Hello Peter Wang,

I have been following your great 'Latest Step by Step Installation Guide for Dummies: Nutch 0.9' on a Windows Vista system and found a problem unique to Vista. As Tomcat is usually installed under 'Program Files', when editing 'WEB-INF\classes\nutch-site.xml' the user may end up editing a file in the VirtualStore instead.

It may be worth adding a note at the end of the tutorial, as it may take some people a while to figure out the problem.

Thanks.

Xue Yong Zhi
XRuby Compiler
http://xruby.googlecode.com
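
To check for the Vista redirection described above, one hedged approach from a Cygwin shell (default Tomcat path assumed) is to compare the file Tomcat reads with the copy Vista may have quietly created in the VirtualStore:

ls -l "/cygdrive/c/Program Files/Apache Software Foundation/Tomcat 6.0/webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml"
ls -l "$(cygpath "$LOCALAPPDATA")/VirtualStore/Program Files/Apache Software Foundation/Tomcat 6.0/webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml"
# if only the second copy contains your edits, Vista redirected them; edit the first file with an elevated (Administrator) editor instead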

Anonymous said...

Hi Peter,

I'm a developer from Israel and I'm currently looking at Nutch as part of a solution for an internet initiative I'm working for.
However, my time is very limited, so I am thinking of outsourcing some of the work.

Do you know of any talented developers who already have experience with Nutch and might be interested in working in this model?

Looking forward to hearing from you,

Udi Bahat

Anonymous said...

Do you have a Unix NutchGuideForDummies?

Also do you do consultative work?


Sean Scurlock, CISSP
Managing Consultant
Security & Privacy Solutions
IBM Global Business Services
901.240.3761

Anonymous said...

Hi Peter,

I have been following your Step by step guide in the Nutch Wiki. It is very well done! I do have one issue when I try to generate a crawl.

When I run bin/nutch crawl I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: files\apache

I am not sure why this is. Have you encountered this before?

Thanks in advance,

Bob Brehm
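
One guess about the error above, based only on how the message reads: "files\apache" looks like the tail of a path such as C:\Program Files\apache..., which suggests a space in a path (often NUTCH_JAVA_HOME or the install directory) is being split on the command line. A hedged workaround is to keep everything in space-free locations, for example:

export NUTCH_JAVA_HOME="/cygdrive/c/jdk1.6.0"   # hypothetical JDK copy installed outside "Program Files"
cd /cygdrive/c/nutch-0.9                        # run Nutch from a space-free directory as well
bin/nutch crawl urls -dir crawl -depth 3 -topN 50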

Vishal Jain said...

I am facing a problem with the injector/generator because it exits saying "0 records selected for fetching, exiting ...".
I have tried with the conf and urls described in your blog and in http://lucene.apache.org/nutch/tutorial.html (for intranet crawling).

bin/nutch crawl urls -dir crawl -depth 3 -topN 50
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080421044729
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
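
A common cause of the "0 records selected for fetching" run shown above is that conf/crawl-urlfilter.txt rejects every URL in the seed list, so nothing survives injection. A quick sanity check (the apache.org pattern is just an illustration):

cat urls/*                            # the seed URLs actually being injected
grep '^+' conf/crawl-urlfilter.txt    # the accept rules; at least one must match the seeds, e.g.
# +^http://([a-z0-9]*\.)*apache.org/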

Anonymous said...

Dude.. I am reading your nutch tutorial..
I decided to look at the web pages your example pointed to.. just curious.. MAN THAT IS SO COOL.. that is great. I am sharing my faith everywhere and of course only a few people listen, but.. man, this is such a great idea you have here.. I am so happy to see that you did this..
Man, it really lifted my spirits.. I am the only one where I work and.. you made my day..
this is just great..
take care man..

ray

Anonymous said...

Hi Peter,

Thanks for the tutorial on http://peterpuwang.googlepages.com/NutchGuideForDummies.htm
It really does provide a step-by-step guide on how to go about installing Nutch on my machine, even though mine is not Windows. I managed to do the installation with the help of your tutorial.

However, lately I have been stuck on the following issues, and I wonder if you could provide some assistance.

I have no idea about the following:
- How would I be able to limit the crawl so that it does not crawl various protected directories?
- How should I configure Nutch so that it is able to crawl into PDF files?
- How would I be able to redesign the current Nutch search interface?

Sorry for taking your time. Thanks in advance.

--
Regards,
Joyce
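
Two hedged sketches for Joyce's first two questions, assuming the stock Nutch 0.9 configuration. To keep the crawler out of particular directories, add skip rules above the accept rules in conf/crawl-urlfilter.txt (the paths are purely illustrative):

-^http://([a-z0-9]*\.)*example.com/private/
-^http://([a-z0-9]*\.)*example.com/admin/

For PDFs, the usual route is to add the parse-pdf plugin to the plugin.includes property in conf/nutch-site.xml (the default list only parses text, HTML and JS) and to raise http.content.limit so larger PDF files are not truncated. Redesigning the search interface mostly means editing the JSPs (search.jsp and friends) inside the deployed webapp.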

Anonymous said...

Hi Peter,

I followed your instructions on setting up Nutch, and it worked very well! Thanks so much for providing this help. I have a question for you about the search results: I don't know why my search results are displayed with Chinese support. I see you have a Chinese last name, so you might know what is going on here. I am wondering what settings on my system cause the search results to be displayed with Chinese support. Attached, please find two screen captures.

My environment:

1) Windows XP
2) Regional and language - English (United States). BTW, I did install the "far east" language support package
3) Apache 6.x
4) Java 1.6.x
5) Nutch 0.9

Thanks,

Anonymous said...

Hi, Peter

I saw your info on the Nutch site. On my personal side I will be looking into ways to use nutch.

On my bread-and-butter side, I have an entertainment agency currently hosted by a provider I am not really happy with, and I need a change.

The site has a store window or front side for clients and artists.

The back side is used by our five or so people mainly to carry on the business of establishing and providing entertainment contracts to our clients by our artists.

The agents do most of their work by phone. Every interface though requires a note logging process. Contracts are established through the software and automated email contracts are sent to the clients and copied to artists.

The database is an MS SQL database of about 200mb.

The software is written in VB Script.

We are currently on MS SQL 2005 on the database side and an MS server for the software.

I would like a host where we can run the system described above at the kind of prices 1&1 is advertising on its web site.

Please let me know if I have found a home for our system; if not, please suggest a host.

In time we will migrate this system to Linux/Unix. By "in time" I mean when I figure out how to translate everything easily and quickly.

I appreciate your attention.

jack mothershed
for auburnmoonagency.com

Anonymous said...

Dear Mr. Wang.
This is Kate writing to you again.
Firstly, I would like to thank you for your fast and friendly response to my letter yesterday.
Secondly, I decided to try to install Nutch on Windows instead of Linux, using your tutorial. I have a question about it. I'm pretty sure I did everything according to your directions, but when I got to the "Run the crawler" part I typed in the following: bin/nutch crawl urls -dir crawl -depth 3 -topN 50, and the Cygwin window says: "bash: bin/nutch: No such file or directory". I'm stuck and can't go any further with the tutorial.
What would be your suggestion?
Thank you.
Sincerely,
Kate H.
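
A hedged guess for Kate's error: "bash: bin/nutch: No such file or directory" usually just means the shell is not inside the Nutch directory when the command is typed. Assuming Nutch was unpacked to C:\nutch-0.9, something like this should get past it:

cd /cygdrive/c/nutch-0.9      # the folder that contains bin/, conf/ and urls/
ls bin/nutch                  # should list the launcher script; if not, find where Nutch was unpacked
bin/nutch crawl urls -dir crawl -depth 3 -topN 50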

Anonymous said...

Hi,

I have just started using Nutch, so thanks for your Nutch 0.9 tutorial. I'm using this configuration:

Nutch: version 0.9

Tomcat: version 6

JDK 1.6.0_05

OS: Windows Vista + Cygwin

After I finish the crawling step and deploy the Nutch project, I get "no results" 8-(

What is the matter? I copied your config file (nutch-site.xml)!!

So please, can you help me..



THANKS..
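
For the "no results" symptom above, the usual suspect (assuming the crawl itself produced segments and an index) is that the deployed webapp's searcher.dir does not point at the crawl directory. A quick check from Cygwin, with the default Tomcat path assumed:

grep -A 2 searcher.dir "/cygdrive/c/Program Files/Apache Software Foundation/Tomcat 6.0/webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml"
# the <value> should be the absolute path of the crawl folder, e.g. C:/nutch-0.9/crawl; restart Tomcat after changing it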

Anonymous said...

Hello,

I would like to ask for your guidance on how to install Nutch on Ubuntu 7.10. I'm just a noob, so if you can provide a step-by-step guide it would be really great.

Thanks---Zul'Izzi

Anonymous said...

Hello Peter

I have a major problem with Nutch. My name is Fotis Koutsoukos and I study informatics at the University of Piraeus, Greece.
I use Windows XP with SP2 and JDK 1.6 update 5. Although I followed your tutorial, I cannot open the Nutch 0.9 webapp with Tomcat 6.0.
The exact problem is that when I hit http://localhost:8080/nutch-0.9 the root page is displayed as if I hadn't changed a thing...
Can you help me please? Maybe I should change a property in an XML file to make the nutch-0.9 search site appear...
If you need any details, please contact me.

Thank you in advance

Fotis

kanna said...

I have been following your Step by step guide in the Nutch Wiki. It is very well done! I do have one issue when I try to generate a crawl.

When I run bin/nutch crawl I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: files\apache

I am not sure why this is. Have you encountered this before?

Thanks in advance,
kannabiran.G

Ankur said...

Hi Peter, I want to know how we can extend this so that we can fetch a URL from a web application and then crawl that URL too.

Aarya said...

Hi Peter,
Thanks for this great article.
It simply made a big task very easy..

But I am stuck at one point.
In addition to searching the online content, I want to search local content. Please have a look at the following steps:

I have in:

C:\nutch-0.9\ --> The nutch source
C:\apache-tomcat-6.0.16\ --> Tomcat
C:\cygwin --> Cygwin
C:\LocalSearch\localfiles --> Some sample html and text files
C:\nutch-0.9\crawl --> Folder automatically created for indexing

Now I did the following steps:

1) Created a folder called urls inside C:\nutch-0.9
2) Created a file, source.txt, with content:
http://www.apache.org
file:///c:/LocalSearch/localfiles/
3) Edited conf/crawl-urlfilter.txt and added the following entries:
# accept hosts in MY.DOMAIN.NAME
+^file:///c:/LocalSearch/localfiles/*
+^http://([a-z0-9]*\.)*apache.org/
4) Edited conf/nutch-site.xml and added the following entry inside the configuration tag:

searcher.dir --> C:\nutch-0.9\crawl


5) Built the project with the 'ant' command
6) Created the war file with the 'ant war' command
7) From the Cygwin console:
bin/nutch crawl urls -dir crawl -depth 3 -topN 10
8) Copied the war file from C:\nutch-0.9\build to Tomcat's webapps.
9) Made sure that C:\apache-tomcat-6.0.16\webapps\nutch-0.9\WEB-INF\classes\nutch-site.xml
contains:

searcher.dir --> C:\nutch-0.9\crawl

10) Accessed the search tool at http://localhost:8080/nutch-0.9/


The problem is, when I search for something, the content from the local machine is not displayed.

Expecting your reply,
thanks in advance
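
A hedged note on the local-files half of the setup above: in the stock Nutch 0.9 configuration, file: URLs are neither fetched (protocol-file is not in the default plugin.includes) nor let through the filter (crawl-urlfilter.txt opens with a rule that skips file:, ftp: and mailto: URLs). Under those assumptions, two extra changes are needed before the file:///c:/LocalSearch/localfiles/ seed can be crawled. In conf/crawl-urlfilter.txt, change the skip rule

-^(file|ftp|mailto):

to

-^(ftp|mailto):

and in conf/nutch-site.xml override plugin.includes so protocol-file sits alongside protocol-http (for example a value beginning protocol-(http|file)|urlfilter-regex|parse-(text|html|js)|...).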

Anonymous said...

Hi,

I'm trying to get this to work under WinXP (latest SP),
with all of the latest files as of 10/12/08.

I have installed it step by step as per your instructions, and everything seems to work until I try to make a search through the Nutch search page; then I get the following error:

HTTP Status 500 -

type Exception report

message

description The server encountered an internal error () that prevented it from fulfilling this request.

exception

org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value language + "/include/header.html" is quoted with " which must be escaped when used within the value
org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:40)
org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:407)
org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:198)
org.apache.jasper.compiler.Parser.parseQuoted(Parser.java:301)
org.apache.jasper.compiler.Parser.parseAttributeValue(Parser.java:250)
org.apache.jasper.compiler.Parser.parseAttribute(Parser.java:212)
org.apache.jasper.compiler.Parser.parseAttributes(Parser.java:155)
org.apache.jasper.compiler.Parser.parseInclude(Parser.java:869)
org.apache.jasper.compiler.Parser.parseStandardAction(Parser.java:1136)
org.apache.jasper.compiler.Parser.parseElements(Parser.java:1466)
org.apache.jasper.compiler.Parser.parse(Parser.java:138)
org.apache.jasper.compiler.ParserController.doParse(ParserController.java:216)
org.apache.jasper.compiler.ParserController.parse(ParserController.java:103)
org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:154)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:315)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:295)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:282)
org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:586)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:317)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)
javax.servlet.http.HttpServlet.service(HttpServlet.java:717)

note The full stack trace of the root cause is available in the Apache Tomcat/6.0.18 logs.

Any suggestions?

rrk AT magma DASH energy DOT com

Anonymous said...

Also tried this with nutch-0.8.1

same result


rrk AT magma DASH energy DOT com

Unknown said...

Same here. I have no idea where to look.

Jonathan Barbero said...

Kanna, this can resolve your problem:

http://news.skelter.net/articles/2008/09/24/nutch-0-9-quoted-with-must-be-escaped

It is really strange that the release has this problem.

Good tutorial!

Jonathan Barbero

Unknown said...

Hello.

I have the same problem as Fotis, who posted on June 26, 2008 9:52 AM.

Can anybody tell me how to change this, and what is wrong?
Thanks.

Anonymous said...

OK, I escaped the " quotes on line 151 of search.jsp and I still get the same error. I must be dense or something.

rrk AT magma DASH energy DOT com

Anonymous said...

Hello, this is really useful. But does anyone know how to add a custom field and then make a search based on it? I guess we have to recompile Indexer.java, but how? Thank you, guys!

Frank McCown said...

You need to change the raw HTML for your XML data to use the escaped equivalent of the less-than sign; otherwise your XML closing tags don't show up in Firefox.
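
As a generic illustration of that replacement (not specific to Nutch or this blog's software), the usual entities can be applied with sed, escaping & first so it is not double-escaped:

sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' raw.xml > escaped.xml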

Unknown said...

Hi Peter,

I liked the features of Nutch and hence thought of using it. But I am facing some problems.
This is what I did:

1) Downloaded the Nutch 0.9 file from http://www.apache.org/dyn/closer.cgi/lucene/nutch/
2) Installed Java 1.4, Tomcat 5.5 and Cygwin.
3) In Cygwin, set the JAVA_HOME variable.
4) Created a urls folder in the nutch-0.9 folder and created a text file, christian.txt, in it containing the specified URLs.
5) Then changed conf/crawl-urlfilter.txt and conf/nutch-site.xml.
6) Then entered bin/nutch crawl urls -dir crawl -depth 3 -topN 50.


But this is what's being displayed:

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)


Can you suggest what the problem is and a solution for it?

Thanks,
Katrina
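
A hedged observation on the steps above: Nutch 0.9 (via Hadoop) expects Java 5 or newer, so the Java 1.4 from step 2 alone could explain the injector failing. It is worth confirming which JVM the script actually uses and reading the real stack trace, which the local job runner writes to the log rather than the console:

java -version                 # should report 1.5 or newer
echo $NUTCH_JAVA_HOME         # if set, this is the JVM bin/nutch will launch
tail -n 50 logs/hadoop.log    # the exception behind "Job failed!" usually lands here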

Vijay Patil said...

org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value language + "/include

Hey, I am also facing this error.
How can I resolve it?
Please suggest a fix!

Anonymous said...

If you have Nutch 0.9, you have three copies of search.jsp. You have to change the one that you find in C:\Program Files\Apache Software Foundation\Tomcat 6.0\webapps\nutch-0.9\search.jsp.

To escape the " characters, follow the link posted by Jonathan Barbero.
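
A concrete sketch of that fix, with the usual install locations assumed: find every copy of search.jsp, then edit the one under Tomcat's webapps so the jsp:include attribute on the reported line uses single quotes around its value:

find /cygdrive/c -iname search.jsp 2>/dev/null | grep -i nutch
# in the copy under ...\webapps\nutch-0.9\, change (roughly)
#   <jsp:include page="<%= language + "/include/header.html"%>"/>
# to
#   <jsp:include page='<%= language + "/include/header.html"%>'/>
# then restart Tomcat (or clear work/Catalina/localhost/nutch-0.9) so the page is recompiled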

Diosa said...

Hi Peter,

Thanks a lot!! You really helped me and made it easy. I didn't know Nutch, but it seems very helpful, and so is your tutorial :).
Could you help me, please? I don't know how to crawl PDF documents. I'm reading about PDFBox and I downloaded it, but I don't know how to use it. Sorry for the question, for taking your time, and for my English :). Thanks again.

Angus Mcjockeyfart said...

How did that comment 2 comments up get through?

Anonymous said...

Hi everyone,

I used Nutch 1.0, installed it, and crawled a couple of sites and a local PDF; it all works well. But I am not able to crawl my intranet site. For some reason I cannot fetch any files on it.

I am wondering whether it is a network issue or something else.
Let me mention that even the public-facing site is not allowing crawling.

Does anyone have any idea what could be going wrong?

I do not find any errors in the logs.
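
Some hedged things to check for the intranet problem above, based on common Nutch fetch failures rather than anything specific to this site: http.agent.name must be non-empty (recent releases refuse to fetch without it), the target's robots.txt may be denying the crawler, and the URL filter must accept the intranet host. Even when the console looks quiet, the detail usually ends up in logs/hadoop.log:

grep -A 2 http.agent.name conf/nutch-site.xml   # must carry a non-empty value
bin/nutch readdb crawl/crawldb -stats           # adjust crawl/crawldb to your crawl directory; shows what was injected or fetched
grep -i "robots\|denied\|failed" logs/hadoop.log | tail -n 40   # fetch errors and robots denials, if any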

Anonymous said...

THANK YOU SO MUCH!

Akhil said...

Hi Peter,
Thanks a lot for the tutorials!

I am having the following exception while running the nutch crawl command specified in the beginner's tutorial.


Exception in thread "main" org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions of `C:\\tmp\\hadoop-gadaakhil\\mapred\\system\\job_local_0001': Permission denied

at org.apache.hadoop.util.Shell.runCommand(Shell.java:195)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:338)
at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:532)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:274)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:285)
at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:609)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)


**********************************
Please help me with this. Regards,

Thanks a lot!
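
A hedged workaround that is often suggested for this chmod failure on Windows (no guarantee it applies here): clear the local Hadoop temp directory named in the error and, if it keeps coming back, point hadoop.tmp.dir at a folder the user fully owns or run the Cygwin shell as Administrator.

rm -rf "/cygdrive/c/tmp/hadoop-$USERNAME"   # remove the stale local job directory from the error above
# if the error returns, try adding a hadoop.tmp.dir property to conf/nutch-site.xml pointing at a
# directory you own (for example C:/Users/<you>/nutch-tmp) before re-running bin/nutch crawl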

Anonymous said...

Hello,

I run the crawler and it fetches both

/index.jsp?news_id=111language_id='en'

and

/index.jsp?language_id='en'&news_id=111

which is the same page. How can I fetch it only once?
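
One hedged way to avoid fetching the same page twice, assuming only one of the two parameter orders is ever needed: add a skip rule for the redundant form near the top of conf/crawl-urlfilter.txt (and regex-urlfilter.txt for the search side), for example

-index\.jsp\?news_id=

A more general alternative is the urlnormalizer-regex plugin with conf/regex-normalize.xml, which can rewrite query strings into one canonical parameter order so both variants collapse into a single crawl db entry.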

araleling said...

Thank you so much for sharing this! I have searched high and low and had such a hard time trying to do this!

I just looked through your tutorial and managed to do it. Thank you so much =D

vj_ultimate said...

Hi Peter,

I am trying to install Nutch, but whenever I try to establish a connection through Nutch I get a java "Connection refused" exception. Please give me a solution for it.
Thank you,

vijay

Arie said...

hello,

I am having a problem with an error message like this when I try to hit the page http://localhost:8080/nutch-0.9/

Here is the error.
Could anyone handy show me what I did wrong to cause this error?

Thanks a lot for your thoughts!

*** here is the error from the browser ***
exception

org.apache.jasper.JasperException: org.apache.jasper.JasperException: Unable to load class for JSP
org.apache.jasper.servlet.JspServletWrapper.getServlet(JspServletWrapper.java:161)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:340)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
javax.servlet.http.HttpServlet.service(HttpServlet.java:723)

root cause

org.apache.jasper.JasperException: Unable to load class for JSP
org.apache.jasper.JspCompilationContext.load(JspCompilationContext.java:630)
org.apache.jasper.servlet.JspServletWrapper.getServlet(JspServletWrapper.java:149)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:340)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
javax.servlet.http.HttpServlet.service(HttpServlet.java:723)

root cause

java.lang.ClassNotFoundException: org.apache.jsp.search_jsp
java.net.URLClassLoader$1.run(Unknown Source)
java.net.URLClassLoader$1.run(Unknown Source)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(Unknown Source)
org.apache.jasper.servlet.JasperLoader.loadClass(JasperLoader.java:134)
org.apache.jasper.servlet.JasperLoader.loadClass(JasperLoader.java:66)
org.apache.jasper.JspCompilationContext.load(JspCompilationContext.java:628)
org.apache.jasper.servlet.JspServletWrapper.getServlet(JspServletWrapper.java:149)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:340)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
javax.servlet.http.HttpServlet.service(HttpServlet.java:723)

I used:
Nutch 0.9
Tomcat 6.0
Windows Vista
JDK 1.7

Anonymous said...

Hi, I have a problem while configuring Nutch 0.9.
My environment is
Windows Vista, Cygwin, Nutch 0.9

My settings in the Windows system:
JAVA_HOME = C:\Program Files\Java\jdk1.7.0_07
NUTCH_HOME = C:\cygwin\home\nutch-0.9

export JAVA_HOME='/cygdrive/c/Program Files/Java/jdk1.7.0_07/'

Running Nutch:
bin/nutch crawl urls -dir myCrawl4 -depth 5 -topN 15

and then I got an error during indexing:
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: myCrawl5/crawldb
CrawlDb update: segments: [myCrawl5/segments/20131201143950]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: myCrawl5/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: myCrawl5/segments/20131201143757
LinkDb: adding segment: myCrawl5/segments/20131201143807
LinkDb: adding segment: myCrawl5/segments/20131201143831
LinkDb: adding segment: myCrawl5/segments/20131201143910
LinkDb: adding segment: myCrawl5/segments/20131201143950
LinkDb: done
Indexer: starting
Indexer: linkdb: myCrawl5/linkdb
Indexer: adding segment: myCrawl5/segments/20131201143757
Indexer: adding segment: myCrawl5/segments/20131201143807
Indexer: adding segment: myCrawl5/segments/20131201143831
Indexer: adding segment: myCrawl5/segments/20131201143910
Indexer: adding segment: myCrawl5/segments/20131201143950
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:134)

Any suggestions, please?
Thanks in advance.
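
The console output above only shows the generic "Job failed!"; with the local job runner the real exception is written to the log, so that is the first place to look (path assumed relative to the Nutch directory):

tail -n 100 logs/hadoop.log
# side note: Nutch 0.9 predates Java 7 by several years, so if the log points at the Lucene
# indexing step, retrying with a Java 5 or 6 JDK is a reasonable experiment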

Anonymous said...

Hi Arie,

for the error org.apache.jasper.JasperException: org.apache.jasper.JasperException: Unable to load class for JSP

it means your app server (e.g. Tomcat) has not been set up correctly for serving servlet/JSP files.

Please create a folder under /work,
for example:
[tomcat_dir_inst]\work\Catalina\localhost\nutch-0.9\org\apache\jsp

and put the file index.jsp or index.java there,
and then open the URL in a browser: http://localhost:8080/nutch-0.9/

So you should always put your working directory under the work folder: "[tomcat_dir_inst]\work\Catalina\localhost\nutch-0.9\org\apache\jsp"

Hope this helps you!