Big Data | Apache Nutch
  1. References
  2. Install Nutch
  3. Configure Nutch
  4. Configure Nutch with Solr
  5. Start Nutch crawling job: crawl script
  6. Start Nutch crawling job: Step-by-Step commands

  1. References
    See these pages for more details on how to install and use Apache Nutch:
    https://wiki.apache.org/nutch/NutchTutorial
    https://wiki.apache.org/nutch/CommandLineOptions

    See these pages for more details on how to install MongoDB, Solr, and ZooKeeper:
    http://www.mtitek.com/tutorials/tools/mongodb.php
    http://www.mtitek.com/tutorials/solr
    http://www.mtitek.com/tutorials/zookeeper
  2. Install Nutch
    Download Apache Nutch: https://nutch.apache.org/downloads.html

    Extract the file "apache-nutch-2.3.1-src.tar.gz" into the folder where you want to install Nutch: /opt/apache-nutch-2.3.1

    Note: In the following sections, the environment variable ${NUTCH_SRC_HOME} refers to this location: '/opt/apache-nutch-2.3.1'

    Nutch 2.x stores its data through Apache Gora, so you need to choose a Gora backend before compiling.
    For example, to use MongoDB as the Gora backend, enable (uncomment) the "gora-mongodb" dependency in '${NUTCH_SRC_HOME}/ivy/ivy.xml'.
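    As a quick check before compiling, the sketch below prints the dependency element and lists every line of ivy.xml that mentions it, so you can see whether it is still inside an XML comment. The rev value (0.6.1) and the fallback install path are assumptions; keep whatever revision your own ivy.xml already lists.

```shell
# Location of ivy.xml; the fallback mirrors the install folder used in this tutorial.
NUTCH_SRC_HOME="${NUTCH_SRC_HOME:-/opt/apache-nutch-2.3.1}"
IVY_XML="${NUTCH_SRC_HOME}/ivy/ivy.xml"

# Dependency element that must be active (not wrapped in <!-- ... -->).
# ASSUMPTION: rev="0.6.1"; use the revision already present in your ivy.xml.
GORA_DEP='<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />'
echo "expected in ivy.xml: ${GORA_DEP}"

# List the matching lines so you can check them against the expected element.
if [ -f "${IVY_XML}" ]; then
  grep -n 'gora-mongodb' "${IVY_XML}" || echo "gora-mongodb not found in ${IVY_XML}"
else
  echo "ivy.xml not found at ${IVY_XML}; adjust NUTCH_SRC_HOME"
fi
```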
    You are now ready to compile Nutch by running 'ant runtime' from '${NUTCH_SRC_HOME}'.
    This creates the Nutch runtime directory '${NUTCH_SRC_HOME}/runtime'.
    The folder '${NUTCH_SRC_HOME}/runtime/local' is your Nutch home folder.
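    The build step can be sketched as follows. The script is a minimal example that assumes Apache Ant is on the PATH and that the sources live under the fallback path shown; it skips the build when either is missing.

```shell
# Source root and the runtime folder that the build produces.
NUTCH_SRC_HOME="${NUTCH_SRC_HOME:-/opt/apache-nutch-2.3.1}"
NUTCH_HOME="${NUTCH_SRC_HOME}/runtime/local"

if command -v ant >/dev/null 2>&1 && [ -f "${NUTCH_SRC_HOME}/build.xml" ]; then
  # Build from the source root; the 'runtime' target creates runtime/local.
  (cd "${NUTCH_SRC_HOME}" && ant runtime)
else
  echo "ant or ${NUTCH_SRC_HOME}/build.xml not found; skipping the build"
fi

echo "NUTCH_HOME=${NUTCH_HOME}"
```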

    Note: In the following sections, the environment variable ${NUTCH_HOME} will refer to this location '${NUTCH_SRC_HOME}/runtime/local'

    All required configuration changes are made to files in this folder: '${NUTCH_HOME}/conf'.

    To delete the Nutch runtime directory '${NUTCH_SRC_HOME}/runtime', run 'ant clean' from '${NUTCH_SRC_HOME}'.
  3. Configure Nutch
    Use this file '${NUTCH_HOME}/conf/nutch-site.xml':
    ► to configure MongoDB as the GORA backend.
    ► to configure HTTP agent name.
    ► to use Nutch with Solr.
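    The three settings listed above could look like the sketch below. It writes a candidate nutch-site.xml to a scratch file so that nothing in '${NUTCH_HOME}/conf' is overwritten before you have reviewed it. The agent name "MyNutchCrawler" and the scratch path are assumptions; the plugin list is the usual Nutch 2.x set with 'indexer-solr' appended.

```shell
# Write to a scratch file first; copy it to ${NUTCH_HOME}/conf/nutch-site.xml after review.
OUT="${TMPDIR:-/tmp}/nutch-site.xml"

cat > "${OUT}" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Gora backend: MongoDB -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
  </property>
  <!-- HTTP agent name sent by the crawler (ASSUMPTION: placeholder value) -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
  <!-- Plugins to load; indexer-solr enables indexing into Solr -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
  </property>
</configuration>
EOF

echo "wrote ${OUT}"
```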

    Use this file '${NUTCH_HOME}/conf/gora.properties' to configure the MongoDB store (class 'org.apache.gora.mongodb.store.MongoStore') as the default datastore.
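    A minimal sketch of the MongoDB entries for gora.properties is shown below, again written to a scratch file for review. The server address (localhost:27017) and the database name "nutch" are assumptions for a local setup.

```shell
# Write to a scratch file first; merge it into ${NUTCH_HOME}/conf/gora.properties after review.
OUT="${TMPDIR:-/tmp}/gora.properties"

cat > "${OUT}" <<'EOF'
# Default datastore: MongoDB
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
# ASSUMPTION: local MongoDB on the default port, database named "nutch"
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
EOF

echo "wrote ${OUT}"
```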
  4. Configure Nutch with Solr
    This section provides quick instructions on how to configure Nutch with Solr.

    Download Apache Solr: http://lucene.apache.org/solr

    Download Apache ZooKeeper: http://zookeeper.apache.org

    Extract the file "solr-7.3.1.zip" into the folder where you want to install Solr: /opt/solr-7.3.1

    Extract the file "zookeeper-3.5.4-beta.tar.gz" into the folder where you want to install ZooKeeper: /opt/zookeeper-3.5.4-beta

    Start ZooKeeper (for example, with its 'bin/zkServer.sh start' script).
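    Starting a standalone ZooKeeper could be sketched like this. The variable name ZK_HOME is hypothetical, and the snippet assumes the default standalone configuration shipped as 'conf/zoo_sample.cfg' is acceptable.

```shell
# ZK_HOME is a hypothetical variable; the fallback is the install folder used above.
ZK_HOME="${ZK_HOME:-/opt/zookeeper-3.5.4-beta}"

if [ -x "${ZK_HOME}/bin/zkServer.sh" ]; then
  # zkServer.sh reads conf/zoo.cfg; start from the shipped sample if none exists yet.
  [ -f "${ZK_HOME}/conf/zoo.cfg" ] || cp "${ZK_HOME}/conf/zoo_sample.cfg" "${ZK_HOME}/conf/zoo.cfg"
  "${ZK_HOME}/bin/zkServer.sh" start
else
  echo "zkServer.sh not found under ${ZK_HOME}; skipping"
fi
```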
    For the purpose of this tutorial, you can use 'sample_techproducts_configs', the sample configuration set that ships with Solr.
    You can use the sample Solr schema 'managed-schema' (copy it into this folder: /opt/solr-7.3.1/server/solr/configsets/sample_techproducts_configs/conf)

    Upload Solr config to ZooKeeper and create a new Solr collection:
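    One way to do this with the 'bin/solr' script is sketched below. The config name "nutch_conf", the collection name "nutch", the variable names, and the ZooKeeper address localhost:2181 are all assumptions; the snippet also assumes Solr has to be started in SolrCloud mode first so that the collection can be created.

```shell
# Hypothetical names and addresses; adjust to your setup.
SOLR_DIR="${SOLR_DIR:-/opt/solr-7.3.1}"
ZK_HOST="${ZK_HOST:-localhost:2181}"
CONF_NAME="nutch_conf"
COLLECTION="nutch"
CONFIGSET_DIR="${SOLR_DIR}/server/solr/configsets/sample_techproducts_configs"

if [ -x "${SOLR_DIR}/bin/solr" ]; then
  # 1. Start Solr in SolrCloud mode against the running ZooKeeper.
  "${SOLR_DIR}/bin/solr" start -c -z "${ZK_HOST}" &&
  # 2. Upload the config set (the folder that contains conf/) to ZooKeeper.
  "${SOLR_DIR}/bin/solr" zk upconfig -z "${ZK_HOST}" -n "${CONF_NAME}" -d "${CONFIGSET_DIR}" &&
  # 3. Create a collection that uses the uploaded config.
  "${SOLR_DIR}/bin/solr" create -c "${COLLECTION}" -n "${CONF_NAME}" -shards 1 -replicationFactor 1
else
  echo "bin/solr not found under ${SOLR_DIR}; skipping"
fi
```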
  5. Start Nutch crawling job: crawl script
    Create a seed directory and create a file that contains the list of web sites to be crawled by Nutch:
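    For instance, the seed directory and seed file could be created as follows. The directory name "urls", the file name "seed.txt", and the single example URL are assumptions; list one URL per line.

```shell
# Hypothetical seed directory, relative to the current folder.
SEED_DIR="urls"
mkdir -p "${SEED_DIR}"

# One URL per line; replace with the sites you want to crawl.
cat > "${SEED_DIR}/seed.txt" <<'EOF'
https://nutch.apache.org/
EOF

cat "${SEED_DIR}/seed.txt"
```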
    Start a crawling job:
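    A hedged sketch of a crawl invocation follows. It assumes the Nutch 2.x argument order 'crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>'; run 'bin/crawl' without arguments to confirm the usage for your build. The crawl id "demo", the Solr URL, and the number of rounds are hypothetical values.

```shell
NUTCH_HOME="${NUTCH_HOME:-/opt/apache-nutch-2.3.1/runtime/local}"

# Hypothetical values; adjust to your setup.
SEED_DIR="urls"
CRAWL_ID="demo"
SOLR_URL="http://localhost:8983/solr/nutch"
ROUNDS=2

if [ -x "${NUTCH_HOME}/bin/crawl" ]; then
  # ASSUMPTION: argument order <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
  "${NUTCH_HOME}/bin/crawl" "${SEED_DIR}" "${CRAWL_ID}" "${SOLR_URL}" "${ROUNDS}"
else
  echo "skipped (bin/crawl not found): crawl ${SEED_DIR} ${CRAWL_ID} ${SOLR_URL} ${ROUNDS}"
fi
```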
  6. Start Nutch crawling job: Step-by-Step commands
    You can also run the individual Nutch commands step by step to crawl web sites:
    Here's an example of the execution of these commands and their results in MongoDB.

    Initialize the crawldb with the selected URLs:
    Generate a fetch list from the crawldb:
    Fetch URLs for the generated batchIds:
    Parse all fetched URLs:
    Update the crawldb:
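    The five steps above can be sketched as one sequence of 'bin/nutch' calls. The seed directory "urls" and the '-topN 10' limit are assumptions, and '-all' is used in place of a specific batchId. When 'bin/nutch' is not found, the helper only prints the command it would run.

```shell
NUTCH_HOME="${NUTCH_HOME:-/opt/apache-nutch-2.3.1/runtime/local}"

# Run bin/nutch if it is present; otherwise print the command (dry run).
nutch() {
  if [ -x "${NUTCH_HOME}/bin/nutch" ]; then
    "${NUTCH_HOME}/bin/nutch" "$@"
  else
    echo "dry run: bin/nutch $*"
  fi
}

# Each step runs only if the previous one succeeded.
nutch inject urls &&          # 1. initialize the crawldb with the seed URLs
nutch generate -topN 10 &&    # 2. generate a fetch list (a new batchId)
nutch fetch -all &&           # 3. fetch the URLs of all generated batches
nutch parse -all &&           # 4. parse everything that was fetched
nutch updatedb -all           # 5. update the crawldb with the results
```

    After the last step, newly discovered links are in the crawldb, so repeating steps 2 to 5 crawls one level deeper.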
© 2025  mtitek