You can also crawl websites by running the Nutch commands step by step:
#Initialize the crawldb with the selected URLs:
$ bin/nutch inject seedDir/
#Generate a fetch list from the crawldb:
$ bin/nutch generate -topN 100
#Fetch URLs for the generated batchIds:
$ bin/nutch fetch -all
#Parse all fetched URLs:
$ bin/nutch parse -all
#Update the crawldb:
$ bin/nutch updatedb -all
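The five commands above make up one crawl round. As a minimal sketch, assuming the script is started from the Nutch install directory so that `bin/nutch` resolves, the round could be scripted as follows. The function names `build_round` and `run_round` are illustrative and not part of Nutch.

```python
import subprocess


def build_round(seed_dir="seedDir/", top_n=100, nutch="bin/nutch"):
    """Return the command lines for one inject/generate/fetch/parse/updatedb round."""
    return [
        [nutch, "inject", seed_dir],
        [nutch, "generate", "-topN", str(top_n)],
        [nutch, "fetch", "-all"],
        [nutch, "parse", "-all"],
        [nutch, "updatedb", "-all"],
    ]


def run_round(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        # check=True makes subprocess.run raise CalledProcessError
        # when a command exits with a non-zero status.
        runner(argv, check=True)
```

Calling `run_round(build_round())` would execute the same sequence as the commands listed above. The `runner` parameter exists only so the sequence can be inspected without launching Nutch.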
The following example runs each of these commands and shows the resulting documents in MongoDB.
Initialize the crawldb with the selected URLs:
$ bin/nutch inject seedDir/
$ mongo
> use nutchdb
switched to db nutchdb
> show collections
webpage
> db.webpage.find()
{
"_id": "localhost:http/",
"fetchTime": "1111111111111",
"fetchInterval": 2592000,
"score": 1,
"markers": {
"dist": "0",
"_injmrk_": "y"
},
"metadata": {
"_csh_": BinData(0,"AAAAAA==")
}
}
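The `_id` in this document is not the URL itself but a key derived from it. The sketch below converts between the two, assuming the layout `reversed.host:protocol[:port]/path` that the sample value `localhost:http/` suggests. The position of the port segment and the helper names `url_to_key` and `key_to_url` are assumptions, not something the output above confirms.

```python
from urllib.parse import urlsplit


def url_to_key(url):
    """Build a row key: reversed host, then protocol, optional port, then path."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{host}:{parts.scheme}"
    if parts.port is not None:
        key += f":{parts.port}"  # assumed position of the port segment
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return key + path


def key_to_url(key):
    """Invert url_to_key under the same assumed layout."""
    slash = key.find("/")
    if slash == -1:
        slash = len(key)
    head, path = key[:slash], key[slash:]
    fields = head.split(":")
    host = ".".join(reversed(fields[0].split(".")))
    url = f"{fields[1]}://{host}"
    if len(fields) == 3:  # a third field is treated as the port
        url += f":{fields[2]}"
    return url + path


print(url_to_key("http://localhost/"))  # localhost:http/
```

Keeping the host reversed groups pages of the same domain under a common key prefix, which is convenient when scanning the collection by `_id`.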
Generate a fetch list from the crawldb:
$ bin/nutch generate -topN 100
$ mongo
> use nutchdb
switched to db nutchdb
> db.webpage.find()
{
"_id": "localhost:http/",
"fetchTime": "1111111111111",
"fetchInterval": 2592000,
"score": 1,
"markers": {
"dist": "0",
"_injmrk_": "y",
"_gnmrk_": "11111111-22222222"
},
"metadata": {
"_csh_": BinData(0,"AAAAAA==")
},
"batchId": "11111111-22222222"
}
Fetch URLs for the generated batchIds:
$ bin/nutch fetch -all
$ mongo
> use nutchdb
switched to db nutchdb
> db.webpage.find()
{
"_id": "localhost:http/",
"fetchTime": "1111111111111",
"fetchInterval": 2592000,
"score": 1,
"markers": {
"dist": "0",
"_injmrk_": "y",
"_gnmrk_": "11111111-22222222",
"_ftcmrk_": "11111111-22222222"
},
"metadata": {
"_rs_": BinData(0,"AAAA2g==")
},
"batchId": "11111111-22222222",
"baseUrl": "http://localhost:8080/",
"status": 2,
"prevFetchTime": "11111111111111",
"protocolStatus": {
"code": 1,
"lastModified": 0
},
"content": 0,
"contentType": "application/xhtml+xml",
"headers": {
"Connection": "close",
"Server": "Apache",
"Content-Type": "text/html"
}
}
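The values in the metadata field are binary, and the shell prints them as base64 strings such as "AAAA2g==". A small sketch for inspecting them follows. Reading the four bytes as a big-endian integer is an assumption about how the value was serialized, and `decode_metadata` is an illustrative helper.

```python
import base64


def decode_metadata(b64_value):
    """Return the raw bytes behind a base64-encoded metadata value."""
    return base64.b64decode(b64_value)


raw = decode_metadata("AAAA2g==")
print(raw.hex())                   # 000000da
print(int.from_bytes(raw, "big"))  # 218

# Some values are plain text, for example "dXRmLTg=",
# which appears in the updatedb output.
print(decode_metadata("dXRmLTg=").decode("ascii"))  # utf-8
```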
Parse all fetched URLs:
$ bin/nutch parse -all
$ mongo
> use nutchdb
switched to db nutchdb
> db.webpage.find()
{
"_id": "localhost:http/",
"fetchTime": "1111111111111",
"fetchInterval": 2592000,
"score": 1,
"markers": {
"_injmrk_": "y",
"__prsmrk__": "111111111-2222222222",
"dist": "0",
"_gnmrk_": "111111111-2222222222",
"_ftcmrk_": "111111111-2222222222"
},
"metadata": {
"CharEncodingForConversion": 0,
"OriginalCharEncoding": 0
},
"batchId": "111111111-2222222222",
"baseUrl": "http://localhost:8080/",
"status": 2,
"protocolStatus": {
"code": 1,
"lastModified": 0
},
"contentType": "application/xhtml+xml",
"headers": {
"Connection": "close",
"Server": "Apache",
"Content-Type": "text/html"
},
"title": "Tutorials: Java, EJB, WebSphere AS, Oracle, Ubuntu, Subversion, Nexus, SonarQube, Jenkins, …",
"text": "Tutorials: Java, EJB, WebSphere AS, Oracle, Ubuntu, Subversion, Nexus, SonarQube, Jenkins, … mtitek.com Home Samples Install Tutorials Contact © mtitek.com",
"outlinks": {
"http://localhost:8080/tools·php": "Install"
}
}
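Note that the outlink key is stored as `tools·php`, with a middle dot (U+00B7) in place of the period. Assuming this substitution is applied to every `.` in a map key (MongoDB field names traditionally could not contain dots), the original URLs could be recovered as sketched below. `decode_keys` is an illustrative helper.

```python
MIDDLE_DOT = "\u00b7"


def decode_keys(mapping):
    """Return a copy of a map field with the middle dot in its keys turned back into '.'."""
    return {key.replace(MIDDLE_DOT, "."): value for key, value in mapping.items()}


outlinks = {"http://localhost:8080/tools\u00b7php": "Install"}
print(decode_keys(outlinks))  # {'http://localhost:8080/tools.php': 'Install'}
```

A URL that genuinely contains a middle dot would be ambiguous under this scheme, so treat the decoded keys as a best effort.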
Update the crawldb:
$ bin/nutch updatedb -all
$ mongo
> use nutchdb
switched to db nutchdb
> db.webpage.find({},{_id:1})
{ "_id" : "localhost:http/" }
{ "_id" : "localhost:http/tools.php" }
> db.webpage.find()
{
"_id": "localhost:http/",
"fetchTime": "1111111111111",
"fetchInterval": 2592000,
"score": 1,
"markers": {
"_updmrk_": "1111111-22222222",
"dist": "0",
"_injmrk_": "y",
"_gnmrk_": null,
"_ftcmrk_": null,
"__prsmrk__": null
},
"metadata": {
"CharEncodingForConversion": BinData(0,"dXRmLTg="),
"OriginalCharEncoding": BinData(0,"dXRmLTg="),
"_rs_": 0,
"caching·forbidden": 0,
"_csh_": 0
},
"batchId": "111111111-222222222",
"baseUrl": "http://localhost:8080/",
"status": 2,
"protocolStatus": {
"code": 1,
"lastModified": 0
},
"content": 0,
"contentType": "application/xhtml+xml",
"headers": {
"Connection": "close",
"Server": "Apache",
"Content-Type": "text/html"
},
"signature": 0,
"title": "Tutorials: Java, EJB, WebSphere AS, Oracle, Ubuntu, Subversion, Nexus, SonarQube, Jenkins, …",
"text": "Tutorials: Java, EJB, WebSphere AS, Oracle, Ubuntu, Subversion, Nexus, SonarQube, Jenkins, … mtitek.com Home Samples Install Tutorials Contact mtitek.com",
"parseStatus": {
"majorCode": 1,
"minorCode": 0
},
"outlinks": {
"http://localhost:8080/tools·php": "Install"
},
"retriesSinceFetch": 0,
"modifiedTime": 0,
"prevModifiedTime": 0,
"inlinks": {
"http://localhost:8080/": "mtitek.com"
}
}
{
"_id": "localhost:http/tools.php",
"status": 1,
"fetchTime": "1111111111111",
"fetchInterval": 2592000,
"retriesSinceFetch": 0,
"score": 0,
"inlinks": {
"http://localhost:8080/": "Install"
},
"markers": {
"dist": "1"
},
"metadata": {
"_csh_": BinData(0,"AAAAAA==")
}
}
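Across the five steps, the markers field records how far a page has progressed, and status separates fetched pages from newly discovered ones. The sketch below summarizes a document on that basis, using only the marker names and status values visible in the output above. The stage labels and the `describe` helper are illustrative, and reading status 1 as unfetched and 2 as fetched is inferred from these samples.

```python
# Marker names as they appear in the sample documents, in pipeline order.
STAGES = [
    ("_injmrk_", "injected"),
    ("_gnmrk_", "generated"),
    ("_ftcmrk_", "fetched"),
    ("__prsmrk__", "parsed"),
    ("_updmrk_", "updated"),
]

# Inferred from the samples: the crawled page has status 2,
# the newly discovered outlink has status 1.
STATUS = {1: "unfetched", 2: "fetched"}


def describe(doc):
    """Return (latest stage with a non-null marker, status label) for a webpage document."""
    markers = doc.get("markers", {})
    stage = "discovered"  # no stage marker set yet, as for a new outlink
    for name, label in STAGES:
        if markers.get(name) is not None:
            stage = label
    status = STATUS.get(doc.get("status"), "unknown")
    return stage, status


new_page = {"_id": "localhost:http/tools.php", "status": 1, "markers": {"dist": "1"}}
print(describe(new_page))  # ('discovered', 'unfetched')
```

Markers set to null, as `_gnmrk_`, `_ftcmrk_` and `__prsmrk__` are after updatedb, are skipped, so the updated seed page reports `updated` rather than `parsed`.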