Big Data | Apache Nutch
  1. References
  2. Install Nutch
  3. Configure Nutch
  4. Configure Nutch with Solr
  5. Start Nutch crawling job: crawl script
  6. Start Nutch crawling job: Step-by-Step commands

  1. References
    See these pages for more details on how to install and use Apache Nutch:
    https://wiki.apache.org/nutch/NutchTutorial
    https://wiki.apache.org/nutch/CommandLineOptions

    See these pages for more details on how to install MongoDB, Solr, Zookeeper:
    http://www.mtitek.com/tutorials/tools/mongodb.php
    http://www.mtitek.com/tutorials/solr
    http://www.mtitek.com/tutorials/zookeeper
  2. Install Nutch
    Download Apache Nutch: https://nutch.apache.org/downloads.html

    Extract the file "apache-nutch-2.3.1-src.tar.gz" into the folder where you want to install Nutch. The command below creates /opt/apache-nutch-2.3.1:
    $ tar -xf ~/Downloads/apache-nutch-2.3.1-src.tar.gz -C /opt/

    Note: In the following sections, the environment variable ${NUTCH_SRC_HOME} will refer to this location '/opt/apache-nutch-2.3.1'

    You need to choose a Gora backend for Nutch.
    For example, to use MongoDB as the Gora backend, enable the "gora-mongodb" dependency in '${NUTCH_SRC_HOME}/ivy/ivy.xml':
    <!-- Uncomment this to use MongoDB as Gora backend. -->
    <dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />

    You are ready to compile Nutch:
    $ cd ${NUTCH_SRC_HOME}
    $ ant runtime

    This will create the Nutch runtime directory '${NUTCH_SRC_HOME}/runtime'.
    The folder '${NUTCH_SRC_HOME}/runtime/local' is your Nutch home folder.

    Note: In the following sections, the environment variable ${NUTCH_HOME} will refer to this location '${NUTCH_SRC_HOME}/runtime/local'
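
    The two variables used throughout this tutorial can be set in the current shell. This is a minimal sketch that assumes the install location '/opt/apache-nutch-2.3.1' used above; adjust the path if you extracted Nutch elsewhere.

```shell
# Assumes Nutch was extracted to /opt/apache-nutch-2.3.1 as described above.
export NUTCH_SRC_HOME=/opt/apache-nutch-2.3.1
# The runtime/local folder exists only after 'ant runtime' has completed.
export NUTCH_HOME="${NUTCH_SRC_HOME}/runtime/local"
```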

    All the configuration files that you need to modify are in this folder: '${NUTCH_HOME}/conf'.

    Run 'ant clean' to delete the Nutch runtime directory '${NUTCH_SRC_HOME}/runtime':
    $ cd ${NUTCH_SRC_HOME}
    $ ant clean

  3. Configure Nutch
    Use this file '${NUTCH_HOME}/conf/nutch-site.xml':
    ► to configure MongoDB as the Gora backend.
    ► to configure HTTP agent name.
    ► to use Nutch with Solr.
    <configuration>
        <property>
            <name>storage.data.store.class</name>
            <value>org.apache.gora.mongodb.store.MongoStore</value>
        </property>
    
        <property>
            <name>http.agent.name</name>
            <value>mtitek nutch</value>
        </property>
    
        <property>
            <name>plugin.includes</name>
            <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        </property>
    </configuration>
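
    To check which value a property actually has in 'nutch-site.xml', a small lookup helper can be handy. The function name 'nutch_prop' is hypothetical (it is not part of Nutch), and the sketch assumes that each <name> and <value> element sits on its own line, as in the file above.

```shell
# Hypothetical helper (not part of Nutch): print the <value> of a named
# property from a Hadoop-style XML configuration file.
# Assumes <name> and <value> are each on their own line.
nutch_prop() {
    file=$1
    name=$2
    awk -v n="$name" '
        # Remember that the requested property name was seen.
        index($0, "<name>" n "</name>") { found = 1; next }
        # Print the first <value> that follows it, without the tags.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$file"
}
```

    For example, 'nutch_prop "${NUTCH_HOME}/conf/nutch-site.xml" http.agent.name' should print the agent name configured above.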

    Use this file '${NUTCH_HOME}/conf/gora.properties' to configure MongoStore as the default datastore:
    ############################
    # MongoDBStore properties  #
    ############################
    gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
    gora.mongodb.override_hadoop_configuration=false
    gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
    gora.mongodb.servers=localhost:27017
    gora.mongodb.db=nutchdb
    #gora.mongodb.login=login
    #gora.mongodb.secret=secret

  4. Configure Nutch with Solr
    This section provides quick instructions on how to configure Nutch with Solr.

    Download Apache Solr: http://lucene.apache.org/solr

    Download Apache ZooKeeper: http://zookeeper.apache.org

    Extract the file "solr-7.3.1.zip" into the folder where you want to install Solr. The command below creates /opt/solr-7.3.1:
    $ unzip ~/Downloads/solr-7.3.1.zip -d /opt/

    Extract the file "zookeeper-3.5.4-beta.tar.gz" into the folder where you want to install ZooKeeper. The command below creates /opt/zookeeper-3.5.4-beta:
    $ tar -xf ~/Downloads/zookeeper-3.5.4-beta.tar.gz -C /opt/

    Start ZooKeeper:
    $ cd /opt/zookeeper-3.5.4-beta
    $ bin/zkServer.sh start

    For the purpose of this tutorial, you can use the sample configuration 'sample_techproducts_configs' that ships with Solr.
    You can use this sample Solr schema 'managed-schema' (copy it into this folder: /opt/solr-7.3.1/server/solr/configsets/sample_techproducts_configs/conf).

    Upload Solr config to ZooKeeper and create a new Solr collection:
    $ cd /opt/solr-7.3.1/server/scripts/cloud-scripts
    
    $ ./zkcli.sh \
    -zkhost "localhost:2181/solr" \
    -cmd upconfig \
    -confname solr_config1 \
    -confdir /opt/solr-7.3.1/server/solr/configsets/sample_techproducts_configs/conf
    
    $ cd /opt/solr-7.3.1/
    
    $ bin/solr start -c -z "localhost:2181/solr"
    
    $ bin/solr create_collection -c collection1 -n solr_config1

  5. Start Nutch crawling job: crawl script
    Create a seed directory and, inside it, a file that lists the web sites to be crawled by Nutch:
    $ cd ${NUTCH_HOME}
    $ mkdir seedDir
    $ echo 'http://localhost:8080/' > seedDir/url.txt
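
    The seed file holds one URL per line, so several start pages can be listed in the same file. A minimal sketch, run from ${NUTCH_HOME}; the second URL is only an illustration and assumes that such a page exists on your local server:

```shell
# Create the seed directory if it does not exist yet.
mkdir -p seedDir
# Write one seed URL per line (the second URL is an illustrative assumption).
printf '%s\n' \
    'http://localhost:8080/' \
    'http://localhost:8080/tools.php' > seedDir/url.txt
```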

    Start a crawling job:
    $ cd ${NUTCH_HOME}
    $ bin/crawl seedDir/ mtitek http://localhost:8983/solr/collection1 1
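
    The positional arguments of the command above are easier to read when they are named. The sketch below reflects my reading of the crawl script usage in Nutch 2.x (seed directory, crawl ID, Solr URL, number of rounds); the variable names and the 'crawl_cmd' function are hypothetical. It only prints the command so that you can review it before running it.

```shell
SEED_DIR=seedDir/                                  # directory that contains the seed file(s)
CRAWL_ID=mtitek                                    # identifier of this crawl
SOLR_URL=http://localhost:8983/solr/collection1    # Solr collection that receives the index
ROUNDS=1                                           # number of generate/fetch/parse/update rounds

# Hypothetical helper: print the crawl command instead of executing it.
crawl_cmd() {
    echo bin/crawl "$SEED_DIR" "$CRAWL_ID" "$SOLR_URL" "$ROUNDS"
}

crawl_cmd
```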
  6. Start Nutch crawling job: Step-by-Step commands
    You can also run these step-by-step Nutch commands to crawl web sites:
    #Initialize the crawldb with the selected URLs:
    $ bin/nutch inject seedDir/
    
    #Generate a fetch list from the crawldb:
    $ bin/nutch generate -topN 100
    
    #Fetch URLs for the generated batchIds:
    $ bin/nutch fetch -all
    
    #Parse all fetched URLs:
    $ bin/nutch parse -all
    
    #Update the crawldb:
    $ bin/nutch updatedb -all
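
    After the one-time inject, the four remaining commands form one crawl round, and they can be wrapped in a function to repeat rounds. This is a sketch, not the logic of the official crawl script: the function name 'crawl_round' and the NUTCH variable are hypothetical, and the round stops at the first command that fails.

```shell
# Path to the nutch launcher; override NUTCH to point elsewhere (assumption).
NUTCH=${NUTCH:-bin/nutch}

# Hypothetical helper: run one generate/fetch/parse/updatedb round.
# $1 is the -topN value (defaults to 100, as in the commands above).
crawl_round() {
    "$NUTCH" generate -topN "${1:-100}" &&
    "$NUTCH" fetch -all &&
    "$NUTCH" parse -all &&
    "$NUTCH" updatedb -all
}
```

    For example, after 'bin/nutch inject seedDir/', calling 'crawl_round 100' twice from ${NUTCH_HOME} would run two rounds.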

    Here's an example of the execution of these commands and their results in MongoDB.

    Initialize the crawldb with the selected URLs:
    $ bin/nutch inject seedDir/
    
    $ mongo
    
    > use nutchdb
    switched to db nutchdb
    
    > show collections
    webpage
    
    > db.webpage.find()
    {
        "_id": "localhost:http/",
        "fetchTime": "1111111111111",
        "fetchInterval": 2592000,
        "score": 1,
        "markers": {
            "dist": "0",
            "_injmrk_": "y"
        },
        "metadata": {
            "_csh_": BinData(0, "AAAAAA==")
        }
    }

    Generate a fetch list from the crawldb:
    $ bin/nutch generate -topN 100
    
    $ mongo
    
    > use nutchdb
    switched to db nutchdb
    
    > db.webpage.find()
    {
        "_id": "localhost:http/",
        "fetchTime": "1111111111111",
        "fetchInterval": 2592000,
        "score": 1,
        "markers": {
            "dist": "0",
            "_injmrk_": "y",
            "_gnmrk_": "11111111-22222222"
        },
        "metadata": {
            "_csh_": BinData(0, "AAAAAA==")
        },
        "batchId": "11111111-22222222"
    }

    Fetch URLs for the generated batchIds:
    $ bin/nutch fetch -all
    
    $ mongo
    
    > use nutchdb
    switched to db nutchdb
    
    > db.webpage.find()
    {
        "_id": "localhost:http/",
        "fetchTime": "1111111111111",
        "fetchInterval": 2592000,
        "score": 1,
        "markers": {
            "dist": "0",
            "_injmrk_": "y",
            "_gnmrk_": "11111111-22222222",
            "_ftcmrk_": "11111111-22222222"
        },
        "metadata": {
            "_rs_": BinData(0, "AAAA2g==")
        },
        "batchId": "11111111-22222222",
        "baseUrl": "http://localhost:8080/",
        "status": 2,
        "prevFetchTime": "11111111111111",
        "protocolStatus": {
            "code": 1,
            "lastModified": 0
        },
        "content": 0,
        "contentType": "application/xhtml+xml",
        "headers": {
            "Connection": "close",
            "Server": "Apache",
            "Content-Type": "text/html"
        }
    }

    Parse all fetched URLs:
    $ bin/nutch parse -all
    
    $ mongo
    
    > use nutchdb
    switched to db nutchdb
    
    > db.webpage.find()
    {
        "_id": "localhost:http/",
        "fetchTime": "1111111111111",
        "fetchInterval": 2592000,
        "score": 1,
        "markers": {
            "_injmrk_": "y",
            "__prsmrk__": "111111111-2222222222",
            "dist": "0",
            "_gnmrk_": "111111111-2222222222",
            "_ftcmrk_": "111111111-2222222222"
        },
        "metadata": {
            "CharEncodingForConversion": 0,
            "OriginalCharEncoding": 0
        },
        "batchId": "111111111-2222222222",
        "baseUrl": "http://localhost:8080/",
        "status": 2,
        "protocolStatus": {
            "code": 1,
            "lastModified": 0
        },
        "contentType": "application/xhtml+xml",
        "headers": {
            "Connection": "close",
            "Server": "Apache",
            "Content-Type": "text/html"
        },
        "title": "Tutorials: Java, EJB, WebSphere AS, Oracle, Ubuntu, Subversion, Nexus, SonarQube, Jenkins, …",
        "text": "Tutorials: Java, EJB, WebSphere AS, Oracle, Ubuntu, Subversion, Nexus, SonarQube, Jenkins, … mtitek.com   Home   Samples   Install   Tutorials   Contact  © mtitek.com",
        "outlinks": {
            "http://localhost:8080/tools·php": "Install"
        }
    }

    Update the crawldb:
    $ bin/nutch updatedb -all
    
    $ mongo
    
    > use nutchdb
    switched to db nutchdb
    
    > db.webpage.find({},{_id:1})
    { "_id" : "localhost:http/" }
    { "_id" : "localhost:http/tools.php" }
    
    > db.webpage.find()
    
    {
        "_id": "localhost:http/",
        "fetchTime": "1111111111111",
        "fetchInterval": 2592000,
        "score": 1,
        "markers": {
            "_updmrk_": "1111111-22222222",
            "dist": "0",
            "_injmrk_": "y",
            "_gnmrk_": null,
            "_ftcmrk_": null,
            "__prsmrk__": null
        },
        "metadata": {
            "CharEncodingForConversion": BinData(0, "dXRmLTg="),
            "OriginalCharEncoding": BinData(0, "dXRmLTg="),
            "_rs_": 0,
            "caching·forbidden": 0,
            "_csh_": 0
        },
        "batchId": "111111111-222222222",
        "baseUrl": "http://localhost:8080/",
        "status": 2,
        "protocolStatus": {
            "code": 1,
            "lastModified": 0
        },
        "content": 0,
        "contentType": "application/xhtml+xml",
        "headers": {
            "Connection": "close",
            "Server": "Apache",
            "Content-Type": "text/html"
        },
        "signature": 0,
        "title": "Tutorials: Java, EJB, WebSphere AS, Oracle, Ubuntu, Subversion, Nexus, SonarQube, Jenkins, …",
        "text": "Tutorials: Java, EJB, WebSphere AS, Oracle, Ubuntu, Subversion, Nexus, SonarQube, Jenkins, … mtitek.com   Home   Samples   Install   Tutorials   Contact mtitek.com",
        "parseStatus": {
            "majorCode": 1,
            "minorCode": 0
        },
        "outlinks": {
            "http://localhost:8080/tools·php": "Install"
        },
        "retriesSinceFetch": 0,
        "modifiedTime": 0,
        "prevModifiedTime": 0,
        "inlinks": {
            "http://localhost:8080/": "mtitek.com"
        }
    }
    
    {
        "_id": "localhost:http/tools.php",
        "status": 1,
        "fetchTime": "1111111111111",
        "fetchInterval": 2592000,
        "retriesSinceFetch": 0,
        "score": 0,
        "inlinks": {
            "http://localhost:8080/": "Install"
        },
        "markers": {
            "dist": "1"
        },
        "metadata": {
            "_csh_": BinData(0, "AAAAAA==")
        }
    }
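
    The metadata values shown above, such as "dXRmLTg=" and "AAAAAA==", appear to be Base64-encoded bytes. They can be decoded from the command line to see what they contain. This is a sketch that assumes a 'base64' tool that accepts the '-d' option (as in GNU coreutils); the two function names are hypothetical.

```shell
# Hypothetical helper: decode a Base64 string and print it as text.
decode_b64() {
    printf '%s' "$1" | base64 -d
}

# Hypothetical helper: decode a Base64 string and print the bytes as hex,
# which is useful when the value is not printable text.
decode_b64_hex() {
    printf '%s' "$1" | base64 -d | od -An -tx1 | tr -d ' \n'
}
```

    For example, 'decode_b64 dXRmLTg=' prints utf-8, and 'decode_b64_hex AAAAAA==' prints 00000000 (four zero bytes).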
© 2025  mtitek