By Peter P. Wang, Zillionics Inc.
Try the search engine I developed: Malachi Search
Please support my effort by using the best free/low price web hosting: 1&1 Inc
peterwang@zillionics.com
The article is here:
http://trackgc.com/tr/resources/articles/index.htm
42 comments:
THANK YOU!!!
A few months ago I tried to install Nutch and almost went crazy :D but today I found your tutorial, and after a few hours and a few problems I was able to make everything work. I had a few problems with heap size, so maybe you should include this in your tutorial?
I have a question: why did we create the urls folder and a text file in it, when it seems we never used it? Or could we use this file to replace the modifications to the conf/crawl-urlfilter.txt file?
Will you maybe create a few more tutorials, for example for Ubuntu?
Thanks again, bye
hi, thanks for making the Nutch 0.9 tutorial. However, when I ran the
'nutch crawl' script from a cygwin bash shell I got syntax errors.
It seems that one needs to run the script file through 'd2u' to get it
to work! You may want to document this somewhere.
best,
arthur
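On the d2u point above: d2u (dos2unix) rewrites Windows CRLF line endings as Unix LF, and bash can report syntax errors when a script still carries the stray carriage returns. For anyone without d2u at hand, the same conversion can be sketched in Python. The function names and the example path are only illustrative, not part of Nutch or Cygwin.

```python
from pathlib import Path


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF, leaving other bytes alone."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> bool:
    """Rewrite the file in place; return True if anything changed."""
    target = Path(path)
    original = target.read_bytes()
    converted = to_unix_line_endings(original)
    if converted != original:
        target.write_bytes(converted)
        return True
    return False


if __name__ == "__main__":
    # Hypothetical usage, run from the Nutch directory:
    # convert_file("bin/nutch")
    print(to_unix_line_endings(b"#!/bin/bash\r\necho hi\r\n"))
```

Working on bytes rather than text avoids guessing the file encoding.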
Hi Peter,
I just wanted to thank you for a great insight into Nutch. Your guide has been quite helpful to me in getting started with Nutch. I was able to crawl our site and load the WAR file into Tomcat (and so run some nice searches). On a side note, I wanted to let you know that I found a useful comment, which is to run "d2u" in Cygwin on files such as bin/nutch. On another note, I was wondering if you had any pointers on performing recrawls and/or crawls on various sites. You see, I crawled the wrong site initially and would like to change the configuration so that I can crawl the "right" site (while still using the same Nutch installation). Do you have pointers on how to do that?
Regards,
Max
Hi Peter!
Thanks for posting the nutch tutorial on the wiki. Got the 1.0 dev
release working on mac leopard because of it.
Best,
James
Hello Peter Wang,
I have been following your great 'Latest step by Step Installation guide for dummies: Nutch 0.9' on a Windows Vista system and found a problem unique to Vista. As Tomcat is usually installed under 'Program Files', when editing 'WEB-INF\classes\nutch-site.xml', the user may end up editing a file in VirtualStore.
It may be worth adding a note at the end of the tutorial, as it may take some people a while to figure out the problem.
Thanks.
Xue Yong Zhi
XRuby Compiler
http://xruby.googlecode.com
Hi Peter,
I'm a developer from Israel and I'm currently looking at Nutch as part of a solution for an internet initiative I'm working for.
However, my time is very limited, so I am thinking of outsourcing some of the work.
Do you know of any talented developers who already have experience with Nutch and might be interested in working in this model?
Looking forward to hearing from you,
Udi Bahat
Do you have a Unix NutchGuideForDummies?
Also do you do consultative work?
Sean Scurlock, CISSP
Managing Consultant
Security & Privacy Solutions
IBM Global Business Services
901.240.3761
Hi Peter,
I have been following your Step by step guide in the Nutch Wiki. It is very well done! I do have one issue when I try to generate a crawl.
When I run bin/nutch crawl I get the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: files\apache
I am not sure why this is. Have you encountered this before?
Thanks in advance,
Bob Brehm
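A possible reading of the error above, offered as an assumption since the thread never confirms it: the text "files\apache" looks like the tail of a path such as "C:\program files\apache...", cut at the space. If a launch script expands an option string without quotes, the shell splits it at the space, and the java launcher then takes the second half as the name of the main class. The sketch below only models that splitting; the path is hypothetical and the option handling is deliberately simplified.

```python
def expand_unquoted(value):
    """Mimic shell word splitting of an unquoted variable: split on whitespace."""
    return value.split()


def first_non_option(args):
    """Return the first argument not starting with '-'.

    Simplification: real launchers also have options that take a separate
    value (such as -classpath), which this sketch ignores.
    """
    for arg in args:
        if not arg.startswith("-"):
            return arg
    return None


# Hypothetical install location containing a space.
opts = r"-Dhadoop.log.dir=C:\program files\apache\nutch-0.9\logs"

broken = expand_unquoted(opts) + ["org.apache.nutch.crawl.Crawl"]
intact = [opts, "org.apache.nutch.crawl.Crawl"]  # what quoting would preserve

print(first_non_option(broken))  # files\apache\nutch-0.9\logs
print(first_non_option(intact))  # org.apache.nutch.crawl.Crawl
```

If this is the cause, moving Nutch to a path without spaces (for example directly under C:\) would sidestep it.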
I am facing problem with injector/generator because it exits saying "0 records selected for fetching, exiting ...".
I have tried with the conf and urls describe in your blog and http://lucene.apache.org/nutch/tutorial.html (for intranet crawling).
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080421044729
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
Dude.. I am reading your nutch tutorial..
I decided to look at the web pages your example pointed to.. just curious.. MAN THAT IS SO COOL.. that is great. i am sharing my faith everywhere and of course only a few people listen but.. Man.. this is a such a great idea you have here.. I am so happy to see that you did this..
Man it really lifted my spirits.. I am then only one where I work and.. you made my day..
this is just great..
take care man..
ray
Hi Peter,
Thanks for the tutorial on http://peterpuwang.googlepages.com/NutchGuideForDummies.htm
It really does provide me a step by step on how to go about installing Nutch on my machine, though mine is not a windows. I managed to do the installation with the help of your tutorial.
However, lately I am stuck in these issues, and wonder if you provide some assistance.
I have no idea on the following:
- How would I be able to limit the crawl so that it does not crawl on various protected directories?
- How should I set so that Nutch is able to crawl into pdf files?
- How would I be able to redesign the Nutch current search interface?
Sorry for taking your time. Thanks in advance.
--
Regards,
Joyce
Hi Peter,
I followed your instructions on setting up nutch. It worked very well! Thanks so much for providing this help. I have a question for you on the search results. I don't know why my search results are displayed with Chinese support. I see you have a Chinese last name so you might know what is going on here. I am wondering what settings are set on my system to display the search with Chinese support. Attached, please find two screen captures.
My environment:
1) Windows XP
2) Regional and language - English (United States). BTW, I did install the "far east" language support package
3) Apache 6.x
4) Java 1.6.x
5) Nutch 0.9
Thanks,
Hi, Peter
I saw your info on the Nutch site. On my personal side I will be looking into ways to use nutch.
On my bread and butter side, I have an entertainment agency currently hosted by a host I am not really happy with and need a change.
The site has a store window or front side for clients and artists.
The back side is used by our five or so people mainly to carry on the business of establishing and providing entertainment contracts to our clients by our artists.
The agents do most of their work by phone. Every interface though requires a note logging process. Contracts are established through the software and automated email contracts are sent to the clients and copied to artists.
The database is an MS SQL database of about 200mb.
The software is written in VB Script.
We are currently on a MS SQL 2005 database wise and an MS server for the software.
I would like a host where we can run the system described above at the kind of prices 1&1 are announcing on the web site.
Please let me know if I have found a home for our system or if not if you could suggest a host please do.
In time we will migrate this system to linux/unix. By in time I mean when I figure out how to translate everything easily and quickly.
I appreciate your attention.
jack mothershed
for auburnmoonagency.com
Dear Mr. Wang.
This is Kate writing to you again.
Firstly, I would like to thank you for your fast and friendly response to my letter yesterday.
Secondly, I decided to try to install Nutch on Windows instead of Linux, and to use your tutorial. I have a question about it. I'm pretty sure I did everything according to your directions, but when I got to the part "Run the crawler" I typed in the following: bin/nutch crawl urls -dir crawl -depth 3 -topN 50 and the Cygwin window says: "bash: bin/nutch: No such file or directory". I'm stuck and can't go any further with this tutorial.
What would be your suggestion?
Thank you.
Sincerely,
Kate H.
Hi,
I have just started using Nutch, so thanks for your Nutch 0.9 tutorial. I'm using this config:
Nutch: version 0-9
Tomcat: version6
JDK 1.6.0_05
OS: windows vista + Cygwin
After I finish the crawling step and deploy the Nutch project, I get "no results" 8-(
What is the problem? I copied your config file (nutch-site.xml)!!
So please, can you help me?
THANKS..
Hello,
I would like to ask for your guidance on how to install Nutch on Ubuntu 7.10. I'm just a noob, so if you can provide me with a step-by-step guide that would be really great.
Thanks---Zul'Izzi
Hello Peter
I have a major problem with Nutch. My name is Fotis Koutsoukos and I study informatics at the University of Piraeus, Greece.
I use Windows XP with SP2 and JDK 1.6 update 5. Although I used your tutorial, I cannot open the Nutch 0.9 webapp with Tomcat 6.0.
The exact problem is that when I hit http://localhost:8080/nutch-0.9 the root page is displayed as if I hadn't changed a thing...
Can you help me please? Maybe I should change a property in an XML file to make the nutch-0.9 search site appear...
If you need any details contact me please
Thank you in advance
Fotis
I have been following your Step by step guide in the Nutch Wiki. It is very well done! I do have one issue when I try to generate a crawl.
When I run bin/nutch crawl I get the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: files\apache
I am not sure why this is. Have you encountered this before?
Thanks in advance,
kannabiran.G
Hi Peter, I want to know how we can extend this so that we can fetch a URL from a web application and then crawl that URL too.
Hi Peter,
Thanks for this great article.
It simply made a big task very easy..
But I am stuck at some point.
In addition to search the online contents, I want to search the local contents. Please have a look at the following steps:
I have in:
C:\nutch-0.9\ --> The nutch source
C:\apache-tomcat-6.0.16\ --> Tomcat
C:\cygwin --> Cygwin
C:\LocalSearch\localfiles --> Some sample html and text files
C:\nutch-0.9\crawl --> Folder automatically created for indexing
Now I did the following steps:
1)Created a folder called urls inside C:\nutch-0.9
2)Created a file, source.txt, with content:
http://www.apache.org
file:///c:/LocalSearch/localfiles/
3)Edited conf/crawl-urlfilter.txt and added the following entries:
# accept hosts in MY.DOMAIN.NAME
+^file:///c:/LocalSearch/localfiles/*
+^http://([a-z0-9]*\.)*apache.org/
4) Edited conf/nutch-site.xml and added the following entries inside the configuration tag:
searcher.dir --> C:\nutch-0.9\crawl
5) Build the project with 'ant' command
6) Create the war file with 'ant war' command
7) From the Cygwin console:
bin/nutch crawl urls -dir crawl -depth 3 -topN 10
8) copy the war file from C:\nutch-0.9\build to tomcat's webapps.
9) made sure that C:\apache-tomcat-6.0.16\webapps\nutch-0.9\WEB-INF\classes\nutch-site.xml
contains:
searcher.dir --> C:\nutch-0.9\crawl
10) Access the search tool from http://localhost:8080/nutch-0.9/
The problem is that when I search for something, the contents from the local machine are not displayed.
Expecting your reply,
thanks in advance
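One thing worth checking for the setup above is rule order in conf/crawl-urlfilter.txt. As far as I understand the regex URL filter, rules are tried from top to bottom, the first pattern found in the URL decides, and a URL that matches no rule is dropped. The sketch below models that behaviour in Python. It is not Nutch code, Python's `re` only stands in for Java regular expressions, and the skip rule for file: URLs is an assumption about what may sit higher up in the same file.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides.

    A URL that matches no rule is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# The two accept rules are copied from the comment above.
ACCEPT_RULES = r"""
# accept hosts in MY.DOMAIN.NAME
+^file:///c:/LocalSearch/localfiles/*
+^http://([a-z0-9]*\.)*apache.org/
"""

# Assumption: a skip rule for file: URLs appears earlier in the same file.
SKIP_FIRST = "-^(file|ftp|mailto):\n" + ACCEPT_RULES

local_url = "file:///c:/LocalSearch/localfiles/sample.html"  # hypothetical file

print(accepts(parse_rules(SKIP_FIRST), local_url))    # False
print(accepts(parse_rules(ACCEPT_RULES), local_url))  # True
```

If an earlier rule does reject file: URLs, the accept rule for the local folder is never reached. Whether the file protocol plugin is enabled is a separate question this sketch does not cover.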
HI,
I'm trying to get this to work under WinXP, latest SP.
All of the latest files as of 10/12/08.
I have installed it step by step as per your instructions, and everything seems to work until I try to make a search through the Nutch search page, when I get the following error:
HTTP Status 500 -
type Exception report
message
description The server encountered an internal error () that prevented it from fulfilling this request.
exception
org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value language + "/include/header.html" is quoted with " which must be escaped when used within the value
org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:40)
org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:407)
org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:198)
org.apache.jasper.compiler.Parser.parseQuoted(Parser.java:301)
org.apache.jasper.compiler.Parser.parseAttributeValue(Parser.java:250)
org.apache.jasper.compiler.Parser.parseAttribute(Parser.java:212)
org.apache.jasper.compiler.Parser.parseAttributes(Parser.java:155)
org.apache.jasper.compiler.Parser.parseInclude(Parser.java:869)
org.apache.jasper.compiler.Parser.parseStandardAction(Parser.java:1136)
org.apache.jasper.compiler.Parser.parseElements(Parser.java:1466)
org.apache.jasper.compiler.Parser.parse(Parser.java:138)
org.apache.jasper.compiler.ParserController.doParse(ParserController.java:216)
org.apache.jasper.compiler.ParserController.parse(ParserController.java:103)
org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:154)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:315)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:295)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:282)
org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:586)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:317)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)
javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
note The full stack trace of the root cause is available in the Apache Tomcat/6.0.18 logs.
Any sugestions?
rrk AT magma DASH energy DOT com
Also tried this with nutch-0.8.1
same result
rrk AT magma DASH energy DOT com
Same here. I have no idea where to look.
Kanna, this can resolve your problem.
http://news.skelter.net/articles/2008/09/24/nutch-0-9-quoted-with-must-be-escaped
It is really strange that the release has this problem.
Good tutorial!
Jonathan Barbero
Hello.
I have the same problem as Fotis, who posted on June 26, 2008 9:52 AM.
Can anybody tell me how to change this, and what is wrong?
Thanks.
OK I escaped the " quotes on line 151 of search.jsp and I still get the same error. I must be dense or something.
rrk AT magma DASH energy DOT com
Hello, this is really useful. But does anyone know how to add a custom field and then search on that field? I guess we have to recompile indexer.java, but how? Thank you guys!
You need to change the raw HTML for your XML data to use the escaped equivalent for the less-than sign, otherwise your XML closing tags don't show up in Firefox.
Hi Peter,
I liked the features of Nutch and hence thought of using it. But I am facing some problems.
This is what I did :
1)Downloaded the nutch 0.9 file from http://www.apache.org/dyn/closer.cgi/lucene/nutch/
2)Installed java 1.4,tomcat 5.5
and cygwin.
3)In cygwin, set the JAVA_HOME variable.
4)created urls folder in nutch-0.9 folder and created a text file christian.txt in it having the urls specified.
5) Then changed the conf/crawl-urlfilter.txt and conf/nutch-site.xml.
6)Then entered bin/nutch crawl urls -dir crawl -depth 3 -topN 50.
But this is what is being displayed:
crawl started in:crawl
rootUrlDir = urls
threads = 10
depth = 3
topN=50
Injector : starting
Injector :crawlDb: crawl/crawldb
Injector : urlDir:urls
Injector : Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException : Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.Injector.inject.(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Can you suggest what the problem is and a solution for it?
Thanks,
Katrina
org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value language + "/include
Hey, I am also facing this error.
So how can I resolve it?
Please advise!
If you have nutch 0.9, you have 3 search.jsp. You have to change the one that you find in C:\Programm Files\Apache Software Foundation\Tomcat 6.0\webapps\nutch-0.9\search.jsp.
To escape the " follow the link of Jonathan Barbero.
Hi Peter,
Thanks a lot!! You really helped me and made it easy. I didn't know Nutch, but it seems very helpful, and your tutorial is too :).
Could you help me, please? I don't know how to crawl PDF documents. I'm reading about PDFBox and I downloaded it, but I don't know how to use it. Sorry for the question, for the time and for my English :). Thanks again.
Hi everyone,
I used Nutch 1.0: installed it, crawled a couple of sites and a local PDF. All works well, but I am not able to crawl my intranet site. For some reason I cannot fetch any files on it.
I am wondering whether it is a network issue or something else.
Let me mention that even the public-facing site is not allowing crawling.
Does anyone have any idea what could be going wrong?
I do not find any errors in the logs.
THANK YOU SO MUCH!
Hi peter ,
Thanks a lot for the tutorials!
I am getting the following exception while running the nutch crawl command specified in the beginner's tutorial.
Exception in thread "main" org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions of `C:\\tmp\\hadoop-gadaakhil\\mapred\\system\\job_local_0001': Permission denied
at org.apache.hadoop.util.Shell.runCommand(Shell.java:195)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:338)
at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:532)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:274)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:285)
at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:609)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)
**********************************
Please help me with this. Regards,
Thanks a lot!
Hello,
I run the crawler and the searcher fetches
/index.jsp?news_id=111language_id='en'
and
/index.jsp?language_id='en'&news_id=111
which is the same page. How can I fetch it only once?
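A common way to treat such URLs as one page is to bring them to a canonical form before comparing them, for instance by sorting the query parameters. The sketch below shows the idea in plain Python with the standard urllib.parse module. It is not a Nutch plugin, the host name is made up, and the parameter values are written without quotes to keep the example simple.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url):
    """Return the URL with its query parameters sorted and any fragment dropped."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), "")
    )


# Hypothetical host; only the parameter order differs between the two URLs.
first = "http://example.com/index.jsp?news_id=111&language_id=en"
second = "http://example.com/index.jsp?language_id=en&news_id=111"

print(canonical_url(first) == canonical_url(second))  # True
```

Sorting is only safe when the site ignores parameter order, which seems to be the case here since both URLs return the same page.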
Thank you so much for sharing this! I have searched high and low and had so much hard time trying to do this!
Just look through your tutorial and I managed to do them, thank you so much =D
Hi Peter,
I am trying to install Nutch, but whenever I try to establish a connection through Nutch I get a java: Connection refused exception. Please try to give me a solution for it.
thank you,
vijay
hello,
I am getting an error message like the one below when I try to open the page http://localhost:8080/nutch-0.9/
Could anyone lend a hand and show me what I did wrong to cause this error?
Thanks a lot for your thoughts!
***here the error from browser***
exception
org.apache.jasper.JasperException: org.apache.jasper.JasperException: Unable to load class for JSP
org.apache.jasper.servlet.JspServletWrapper.getServlet(JspServletWrapper.java:161)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:340)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
root cause
org.apache.jasper.JasperException: Unable to load class for JSP
org.apache.jasper.JspCompilationContext.load(JspCompilationContext.java:630)
org.apache.jasper.servlet.JspServletWrapper.getServlet(JspServletWrapper.java:149)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:340)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
root cause
java.lang.ClassNotFoundException: org.apache.jsp.search_jsp
java.net.URLClassLoader$1.run(Unknown Source)
java.net.URLClassLoader$1.run(Unknown Source)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(Unknown Source)
org.apache.jasper.servlet.JasperLoader.loadClass(JasperLoader.java:134)
org.apache.jasper.servlet.JasperLoader.loadClass(JasperLoader.java:66)
org.apache.jasper.JspCompilationContext.load(JspCompilationContext.java:628)
org.apache.jasper.servlet.JspServletWrapper.getServlet(JspServletWrapper.java:149)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:340)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
i used:
nutch 0.9
tomcat 6.0
windows vista
jdk1.7
Hi, I have a problem while configuring Nutch 0.9.
My environment is
Windows Vista, Cygwin, nutch-0.9
my setting at
windows system:
JAVA_HOME = C:\Program Files\Java\jdk1.7.0_07
NUTCH_HOME = C:\cygwin\home\nutch-0.9
export JAVA_HOME='/cygdrive/c/Program Files/Java/jdk1.7.0_07/'
running nutch
bin/nutch crawl urls -dir myCrawl4 -depth 5 -topN 15
and then I got an error during indexing
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: myCrawl5/crawldb
CrawlDb update: segments: [myCrawl5/segments/20131201143950]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: myCrawl5/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: myCrawl5/segments/20131201143757
LinkDb: adding segment: myCrawl5/segments/20131201143807
LinkDb: adding segment: myCrawl5/segments/20131201143831
LinkDb: adding segment: myCrawl5/segments/20131201143910
LinkDb: adding segment: myCrawl5/segments/20131201143950
LinkDb: done
Indexer: starting
Indexer: linkdb: myCrawl5/linkdb
Indexer: adding segment: myCrawl5/segments/20131201143757
Indexer: adding segment: myCrawl5/segments/20131201143807
Indexer: adding segment: myCrawl5/segments/20131201143831
Indexer: adding segment: myCrawl5/segments/20131201143910
Indexer: adding segment: myCrawl5/segments/20131201143950
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:134)
Any suggestions, please?
Thanks in advance.
Hi Arie,
For the error org.apache.jasper.JasperException: org.apache.jasper.JasperException: Unable to load class for JSP,
it means your app server (such as Tomcat) has not been set up correctly for reading servlet/JSP files.
Please create a folder under /work,
for example:
[tomcat_dir_inst]\work\Catalina\localhost\nutch-0.9\org\apache\jsp
and put the file index.jsp or index.java in it,
then open this URL in a browser: http://localhost:8080/nutch-0.9/
So you should always put it under the work folder, your working directory "[tomcat_dir_inst]\work\Catalina\localhost\nutch-0.9\org\apache\jsp".
Hope this helps!