Apache Nutch | MTI TEK

References
Install Nutch
Configure Nutch
Configure Nutch with Solr
Start Nutch crawling job: crawl script
Start Nutch crawling job: Step-by-Step commands

References
See these pages for more details on how to install and use Apache Nutch:
https://wiki.apache.org/nutch/NutchTutorial
https://wiki.apache.org/nutch/CommandLineOptions

See these pages for more details on how to install MongoDB, Solr, Zookeeper:
http://www.mtitek.com/tutorials/tools/mongodb.php
http://www.mtitek.com/tutorials/solr
http://www.mtitek.com/tutorials/zookeeper

Install Nutch
Download Apache Nutch: https://nutch.apache.org/downloads.html

Extract the file "apache-nutch-2.3.1-src.tar.gz" into the folder where you want to install Nutch (here /opt, which gives /opt/apache-nutch-2.3.1):
```
$ tar -xf ~/Downloads/apache-nutch-2.3.1-src.tar.gz -C /opt/
```
Note: In the following sections, the environment variable ${NUTCH_SRC_HOME} will refer to this location '/opt/apache-nutch-2.3.1'

Nutch 2.x stores its crawl data through Apache Gora, so you need to choose a Gora backend before compiling.
For example, to use MongoDB as the Gora backend, enable the "gora-mongodb" dependency in '${NUTCH_SRC_HOME}/ivy/ivy.xml':
```

<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
```
You are ready to compile Nutch:
```
$ cd ${NUTCH_SRC_HOME}
$ ant runtime
```
This will create the Nutch runtime directory '${NUTCH_SRC_HOME}/runtime'.
The folder '${NUTCH_SRC_HOME}/runtime/local' is your Nutch home folder.

Note: In the following sections, the environment variable ${NUTCH_HOME} will refer to this location '${NUTCH_SRC_HOME}/runtime/local'
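
The two locations named in the notes above can be set once per shell session so that the later commands can be pasted as written. This is a minimal sketch that assumes the archive was extracted under /opt as shown earlier; adjust the first path if you installed elsewhere.

```shell
# Location where the source archive was extracted (see the tar command above).
export NUTCH_SRC_HOME=/opt/apache-nutch-2.3.1

# Runtime created by 'ant runtime'; this is the Nutch home used by all later commands.
export NUTCH_HOME="${NUTCH_SRC_HOME}/runtime/local"

echo "NUTCH_SRC_HOME=${NUTCH_SRC_HOME}"
echo "NUTCH_HOME=${NUTCH_HOME}"
```

Adding the two export lines to your shell profile keeps them available in new terminals.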

All the required configuration changes are made to files in this folder: '${NUTCH_HOME}/conf'.

To delete the Nutch runtime directory '${NUTCH_SRC_HOME}/runtime', run 'ant clean':
```
$ cd ${NUTCH_SRC_HOME}
$ ant clean
```

Configure Nutch

Use this file '${NUTCH_HOME}/conf/nutch-site.xml':
► to configure MongoDB as the Gora backend.
► to configure HTTP agent name.
► to use Nutch with Solr.

```
<configuration>
    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.mongodb.store.MongoStore</value>
    </property>

    <property>
        <name>http.agent.name</name>
        <value>mtitek nutch</value>
    </property>

    <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
</configuration>
```
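
Before starting a crawl it can help to confirm which values are actually in the file. The helper below is a rough sketch, not part of Nutch: `nutch_prop` is a hypothetical name, and it assumes each `<name>` and `<value>` tag sits on its own line as in the listing above. It is a plain text match, not an XML parser.

```shell
# nutch_prop FILE NAME: print the <value> that follows <name>NAME</name>.
# Assumes the one-tag-per-line layout of the listing; not a real XML parser.
nutch_prop() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/^.*<value>/, "")      # drop everything up to the opening tag
            sub(/<\/value>.*$/, "")    # drop the closing tag and what follows
            print
            exit
        }
    ' "$1"
}
```

For example, `nutch_prop "${NUTCH_HOME}/conf/nutch-site.xml" http.agent.name` should print the agent name configured above.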

Use this file '${NUTCH_HOME}/conf/gora.properties' to configure MongoStore as the default data store:

```
############################
# MongoDBStore properties  #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutchdb
#gora.mongodb.login=login
#gora.mongodb.secret=secret
```
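
The server and database names set here are the ones used later when inspecting the crawl results with the mongo shell. As a small sketch, the two hypothetical helpers below (`gora_prop` and `mongo_target` are illustrative names, not Gora or Nutch tools) read them back from the file and build the `host:port/db` argument that the legacy `mongo` shell accepts. It assumes a single server entry in `gora.mongodb.servers`.

```shell
# gora_prop FILE KEY: print the value of KEY from a key=value properties file.
# Commented-out lines never match because their first field starts with '#'.
gora_prop() {
    awk -F= -v key="$2" '$1 == key { print substr($0, length(key) + 2); exit }' "$1"
}

# mongo_target FILE: print "host:port/db" built from the Gora MongoDB settings.
mongo_target() {
    echo "$(gora_prop "$1" gora.mongodb.servers)/$(gora_prop "$1" gora.mongodb.db)"
}
```

With the values above, `mongo "$(mongo_target "${NUTCH_HOME}/conf/gora.properties")"` would open the nutchdb database directly.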

Configure Nutch with Solr
This section provides quick instructions on how to configure Nutch with Solr.

Download Apache Solr: http://lucene.apache.org/solr

Download Apache ZooKeeper: http://zookeeper.apache.org

Extract the file "solr-7.3.1.zip" into the folder where you want to install Solr (here /opt, which gives /opt/solr-7.3.1):
```
$ unzip ~/Downloads/solr-7.3.1.zip -d /opt/
```
Extract the file "zookeeper-3.5.4-beta.tar.gz" into the folder where you want to install ZooKeeper (here /opt, which gives /opt/zookeeper-3.5.4-beta):
```
$ tar -xf ~/Downloads/zookeeper-3.5.4-beta.tar.gz -C /opt/
```
Start Zookeeper:
```
$ cd /opt/zookeeper-3.5.4-beta
$ bin/zkServer.sh start
```
For the purpose of this tutorial you can use 'sample_techproducts_configs', the sample configuration shipped with Solr.
You can use this sample Solr schema, managed-schema (copy it into this folder: /opt/solr-7.3.1/server/solr/configsets/sample_techproducts_configs/conf).

Upload Solr config to ZooKeeper and create a new Solr collection:
```
$ cd /opt/solr-7.3.1/server/scripts/cloud-scripts

$ ./zkcli.sh \
-zkhost "localhost:2181/solr" \
-cmd upconfig \
-confname solr_config1 \
-confdir /opt/solr-7.3.1/server/solr/configsets/sample_techproducts_configs/conf

$ cd /opt/solr-7.3.1/

$ bin/solr start -c -z "localhost:2181/solr"

$ bin/solr create_collection -c collection1 -n solr_config1
```

Start Nutch crawling job: crawl script

Create a seed directory and create a file that contains the list of web sites to be crawled by Nutch:

```
$ cd ${NUTCH_HOME}
$ mkdir seedDir
$ echo 'http://localhost:8080/' > seedDir/url.txt
```
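
The seed file holds one URL per line, so several start pages can be listed in the same file. A minimal sketch of a helper that writes such a file follows; `write_seeds` is a hypothetical name chosen for this example.

```shell
# write_seeds DIR URL...: create DIR and write one URL per line to DIR/url.txt.
write_seeds() {
    dir=$1
    shift
    mkdir -p "$dir" && printf '%s\n' "$@" > "$dir/url.txt"
}
```

For example, `write_seeds seedDir 'http://localhost:8080/' 'http://localhost:8080/tools.php'` produces a two-line url.txt. Note that the helper overwrites an existing url.txt rather than appending to it.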

Start a crawling job:

```
$ cd ${NUTCH_HOME}
$ bin/crawl seedDir/ mtitek http://localhost:8983/solr/collection1 1
```
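
The crawl script takes its arguments by position, which makes the command above hard to read at a glance. The sketch below names each one, based on my reading of the Nutch 2.x usage line (seed directory, crawl ID, Solr URL, number of rounds); the variable names are my own, and you should check `bin/crawl` without arguments to confirm the order for your version.

```shell
SEED_DIR=seedDir/                                  # directory holding url.txt
CRAWL_ID=mtitek                                    # identifier for this crawl
SOLR_URL=http://localhost:8983/solr/collection1    # Solr collection created earlier
ROUNDS=1                                           # number of crawl rounds to run

# Print the command first; remove the leading 'echo' to actually start the crawl.
echo bin/crawl "$SEED_DIR" "$CRAWL_ID" "$SOLR_URL" "$ROUNDS"
```

Printing the command before running it is only a convenience for checking the values.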

Start Nutch crawling job: Step-by-Step commands

You can also run these step-by-step Nutch commands from ${NUTCH_HOME} to crawl web sites:

```
# Initialize the crawldb with the selected URLs:
$ bin/nutch inject seedDir/

# Generate a fetch list from the crawldb:
$ bin/nutch generate -topN 100

# Fetch URLs for the generated batchIds:
$ bin/nutch fetch -all

# Parse all fetched URLs:
$ bin/nutch parse -all

# Update the crawldb:
$ bin/nutch updatedb -all
```
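
After the one-time inject, the remaining four commands form one crawl round, and each additional round follows the links discovered by the previous one. The function below is a sketch of repeating that round; `crawl_rounds` and the `NUTCH` override are hypothetical names for this example, and the function expects to be called from ${NUTCH_HOME}.

```shell
# crawl_rounds N: run N generate/fetch/parse/updatedb rounds, stopping at the
# first command that fails. NUTCH is a hypothetical override for the launcher
# path; by default the function uses bin/nutch relative to the current folder.
crawl_rounds() {
    nutch=${NUTCH:-bin/nutch}
    round=1
    while [ "$round" -le "$1" ]; do
        echo "== round $round of $1 =="
        "$nutch" generate -topN 100 || return 1
        "$nutch" fetch -all         || return 1
        "$nutch" parse -all         || return 1
        "$nutch" updatedb -all      || return 1
        round=$((round + 1))
    done
}
```

A typical call would be `bin/nutch inject seedDir/ && crawl_rounds 2`. The options are copied from the commands above; nothing else is passed to Nutch.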

Here's an example of the execution of these commands and their results in MongoDB.

Initialize the crawldb with the selected URLs:

$ bin/nutch inject seedDir/

$ mongo

> use nutchdb
switched to db nutchdb

> show collections
webpage

> db.webpage.find()
{
    "_id": "localhost:http/",
    "fetchTime": "1111111111111",
    "fetchInterval": 2592000,
    "score": 1,
    "markers": {
        "dist": "0",
        "_injmrk_": "y"
    },
    "metadata": {
        "_csh_": BinData(0, "AAAAAA==")
    }
}

Generate a fetch list from the crawldb:

$ bin/nutch generate -topN 100

$ mongo

> use nutchdb
switched to db nutchdb

> db.webpage.find()
{
    "_id": "localhost:http/",
    "fetchTime": "1111111111111",
    "fetchInterval": 2592000,
    "score": 1,
    "markers": {
        "dist": "0",
        "_injmrk_": "y",
        "_gnmrk_": "11111111-22222222"
    },
    "metadata": {
        "_csh_": BinData(0, "AAAAAA==")
    },
    "batchId": "11111111-22222222"
}

Fetch URLs for the generated batchIds:

$ bin/nutch fetch -all

$ mongo

> use nutchdb
switched to db nutchdb

> db.webpage.find()
{
    "_id": "localhost:http/",
    "fetchTime": "1111111111111",
    "fetchInterval": 2592000,
    "score": 1,
    "markers": {
        "dist": "0",
        "_injmrk_": "y",
        "_gnmrk_": "11111111-22222222",
        "_ftcmrk_": "11111111-22222222"
    },
    "metadata": {
        "_rs_": BinData(0, "AAAA2g==")
    },
    "batchId": "11111111-22222222",
    "baseUrl": "http://localhost:8080/",
    "status": 2,
    "prevFetchTime": "11111111111111",
    "protocolStatus": {
        "code": 1,
        "lastModified": 0
    },
    "content": 0,
    "contentType": "application/xhtml+xml",
    "headers": {
        "Connection": "close",
        "Server": "Apache",
        "Content-Type": "text/html"
    }
}

Parse all fetched URLs:

$ bin/nutch parse -all

$ mongo

> use nutchdb
switched to db nutchdb

> db.webpage.find()
{
    "_id": "localhost:http/",
    "fetchTime": "1111111111111",
    "fetchInterval": 2592000,
    "score": 1,
    "markers": {
        "_injmrk_": "y",
        "__prsmrk__": "111111111-2222222222",
        "dist": "0",
        "_gnmrk_": "111111111-2222222222",
        "_ftcmrk_": "111111111-2222222222"
    },
    "metadata": {
        "CharEncodingForConversion": 0,
        "OriginalCharEncoding": 0
    },
    "batchId": "111111111-2222222222",
    "baseUrl": "http://localhost:8080/",
    "status": 2,
    "protocolStatus": {
        "code": 1,
        "lastModified": 0
    },
    "contentType": "application/xhtml+xml",
    "headers": {
        "Connection": "close",
        "Server": "Apache",
        "Content-Type": "text/html"
    },
    "title": "Tutorials: Java, EJB, WebSphere AS, Oracle, Ubuntu, Subversion, Nexus, SonarQube, Jenkins, …",
    "text": "Tutorials: Java, EJB, WebSphere AS, Oracle, Ubuntu, Subversion, Nexus, SonarQube, Jenkins, … mtitek.com   Home   Samples   Install   Tutorials   Contact  © mtitek.com",
    "outlinks": {
        "http://localhost:8080/tools·php": "Install"
    }
}

Update the crawldb:

$ bin/nutch updatedb -all

$ mongo

> use nutchdb
switched to db nutchdb

> db.webpage.find({},{_id:1})
{ "_id" : "localhost:http/" }
{ "_id" : "localhost:http/tools.php" }

> db.webpage.find()

{
    "_id": "localhost:http/",
    "fetchTime": "1111111111111",
    "fetchInterval": 2592000,
    "score": 1,
    "markers": {
        "_updmrk_": "1111111-22222222",
        "dist": "0",
        "_injmrk_": "y",
        "_gnmrk_": null,
        "_ftcmrk_": null,
        "__prsmrk__": null
    },
    "metadata": {
        "CharEncodingForConversion": BinData(0, "dXRmLTg="),
        "OriginalCharEncoding": BinData(0, "dXRmLTg="),
        "_rs_": 0,
        "caching·forbidden": 0,
        "_csh_": 0
    },
    "batchId": "111111111-222222222",
    "baseUrl": "http://localhost:8080/",
    "status": 2,
    "protocolStatus": {
        "code": 1,
        "lastModified": 0
    },
    "content": 0,
    "contentType": "application/xhtml+xml",
    "headers": {
        "Connection": "close",
        "Server": "Apache",
        "Content-Type": "text/html"
    },
    "signature": 0,
    "title": "Tutorials: Java, EJB, WebSphere AS, Oracle, Ubuntu, Subversion, Nexus, SonarQube, Jenkins, …",
    "text": "Tutorials: Java, EJB, WebSphere AS, Oracle, Ubuntu, Subversion, Nexus, SonarQube, Jenkins, … mtitek.com   Home   Samples   Install   Tutorials   Contact mtitek.com",
    "parseStatus": {
        "majorCode": 1,
        "minorCode": 0
    },
    "outlinks": {
        "http://localhost:8080/tools·php": "Install"
    },
    "retriesSinceFetch": 0,
    "modifiedTime": 0,
    "prevModifiedTime": 0,
    "inlinks": {
        "http://localhost:8080/": "mtitek.com"
    }
}

{
    "_id": "localhost:http/tools.php",
    "status": 1,
    "fetchTime": "1111111111111",
    "fetchInterval": 2592000,
    "retriesSinceFetch": 0,
    "score": 0,
    "inlinks": {
        "http://localhost:8080/": "Install"
    },
    "markers": {
        "dist": "1"
    },
    "metadata": {
        "_csh_": BinData(0, "AAAAAA==")
    }
}
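
Two fields in these documents are easier to follow once decoded. In the listings above, the fetched seed page has "status" 2 while the newly discovered tools.php page has "status" 1, and "fetchInterval" is 2592000. The sketch below labels those two status values as fetched and unfetched (my reading of Nutch's crawl status codes, so treat the labels as an assumption) and converts the interval from seconds to days. Both function names are hypothetical.

```shell
# status_name CODE: label for the "status" field. Only the two values that
# appear in the listings are mapped; anything else is reported as-is.
status_name() {
    case "$1" in
        1) echo "unfetched" ;;
        2) echo "fetched" ;;
        *) echo "other ($1)" ;;
    esac
}

# interval_days SECONDS: convert "fetchInterval" from seconds to whole days.
interval_days() {
    echo $(( $1 / 86400 ))
}

status_name 2            # the seed page after the fetch step
status_name 1            # the page discovered through the outlink
interval_days 2592000    # fetchInterval shown in the listings
```

The last call prints 30, meaning a page becomes due for re-fetching 30 days after its fetch time.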