Big Data | Install and configure Apache Hadoop (single node cluster)
  1. References
  2. Create "hadoop" user
  3. Install Hadoop
  4. Switch to "hadoop" user
  5. Update "~/.profile" file
  6. Create "hadoop.tmp.dir" directory
  7. Configure "${HADOOP_HOME}/etc/hadoop/core-site.xml"
  8. Configure "${HADOOP_HOME}/etc/hadoop/hdfs-site.xml"
  9. Configure "${HADOOP_HOME}/etc/hadoop/mapred-site.xml"
  10. Configure "${HADOOP_HOME}/etc/hadoop/hadoop-env.sh"
  11. Format HDFS filesystem
  12. Start single-node Hadoop cluster
  13. Set permission for "/" node in hdfs
  14. Hadoop Ports/Web UIs
  15. Hadoop: status, log files
  16. Stop single-node Hadoop cluster
  17. Uninstall Hadoop
  18. Hadoop "start-all.sh" permission denied: "ssh localhost: Permission denied (publickey, password)"

  1. References
    See this page for more details about Apache Hadoop:
    https://hadoop.apache.org/docs/current/
  2. Create "hadoop" user
    $ sudo addgroup hadoop
    $ sudo adduser --ingroup hadoop hadoop
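
    Optional: verify that the user and group were created (the uid/gid values will vary):
    $ id hadoop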
  3. Install Hadoop
    Download Apache Hadoop: http://hadoop.apache.org/releases.html

    Extract the archive "hadoop-3.3.0.tar.gz" into the folder where you want to install Hadoop, e.g. '/opt/hadoop-3.3.0':
    $ sudo tar -xf ~/Downloads/hadoop-3.3.0.tar.gz -C /opt/
    $ sudo chmod -R 755 /opt/hadoop-3.3.0
    $ sudo chown -R hadoop:hadoop /opt/hadoop-3.3.0

    Note: In the following sections, the environment variable ${HADOOP_HOME} will refer to this location '/opt/hadoop-3.3.0'.
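
    Optional: verify that the extracted files are now owned by the "hadoop" user:
    $ ls -ld /opt/hadoop-3.3.0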
  4. Switch to "hadoop" user
    $ su - hadoop
  5. Update "~/.profile" file
    $ vi ~/.profile
    export JAVA_HOME="/opt/jdk1.8.0_172"
    
    export HADOOP_HOME="/opt/hadoop-3.3.0"
    
    export CLASSPATH=$CLASSPATH:$HADOOP_HOME/share/hadoop/common/lib
    
    PATH="$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH"

    Load ".profile" environment variables:
    $ source ~/.profile

    Print Hadoop version:
    $ hadoop version
    Hadoop 3.3.0
  6. Create "hadoop.tmp.dir" directory
    $ mkdir /home/hadoop/hadoop-tmp-dir
  7. Configure "${HADOOP_HOME}/etc/hadoop/core-site.xml"
    See this page for more detail:
    https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml

    $ vi ${HADOOP_HOME}/etc/hadoop/core-site.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/home/hadoop/hadoop-tmp-dir</value>
        </property>
    
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:8020</value>
        </property>
    
        <property>
            <name>hadoop.proxyuser.hadoop.groups</name>
            <value>*</value>
        </property>
    
        <property>
            <name>hadoop.proxyuser.hadoop.hosts</name>
            <value>*</value>
        </property>
    </configuration>
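
    Optional: once the environment is set up (see the ".profile" section above), you can verify that Hadoop picks up this configuration; the command should print the configured value:
    $ hdfs getconf -confKey fs.defaultFS
    hdfs://localhost:8020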
  8. Configure "${HADOOP_HOME}/etc/hadoop/hdfs-site.xml"
    See this page for more detail:
    https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

    $ vi ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file://${hadoop.tmp.dir}/dfs/name</value>
        </property>
    
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file://${hadoop.tmp.dir}/dfs/data</value>
        </property>
    </configuration>
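
    Note: "${hadoop.tmp.dir}" in the values above resolves to the "hadoop.tmp.dir" property configured in "core-site.xml", so the NameNode and DataNode directories will be created under "/home/hadoop/hadoop-tmp-dir/dfs/". Once the filesystem has been formatted and the cluster started (see the sections below), you should see something like:
    $ ls /home/hadoop/hadoop-tmp-dir/dfs/
    data  name  namesecondary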
  9. Configure "${HADOOP_HOME}/etc/hadoop/mapred-site.xml"
    See this page for more detail:
    https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

    $ vi ${HADOOP_HOME}/etc/hadoop/mapred-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>mapreduce.jobtracker.address</name>
            <value>localhost:54311</value>
        </property>
    </configuration>
  10. Configure "${HADOOP_HOME}/etc/hadoop/hadoop-env.sh"
    Edit the "hadoop-env.sh" file and export the "JAVA_HOME" environment variable.
    $ vi ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
    export JAVA_HOME="/opt/jdk1.8.0_172"
  11. Format HDFS filesystem
    $ hadoop namenode -format
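
    Note: recent Hadoop releases deprecate the "hadoop namenode" command in favor of the "hdfs" command; the equivalent command is:
    $ hdfs namenode -format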
  12. Start single-node Hadoop cluster
    $ ${HADOOP_HOME}/sbin/start-all.sh
    Starting namenodes on [localhost]
    Starting datanodes
    Starting secondary namenodes [mtitek]
    Starting resourcemanager
    Starting nodemanagers

    You might get this error "Permission denied (publickey, password)" when you start Hadoop.
    To fix this error, see Hadoop "start-all.sh" permission denied: "ssh localhost: Permission denied (publickey, password)".
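
    Optional: once the daemons are up, you can check that the DataNode registered with the NameNode (the report should show one live datanode):
    $ hdfs dfsadmin -report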
  13. Set permission for "/" node in hdfs
    Check permission:
    $ hdfs dfs -getfacl /
    # file: /
    # owner: hadoop
    # group: supergroup
    user::rwx
    group::r-x
    other::r-x

    Set permission:
    $ hdfs dfs -chmod -R 775 /
    $ hdfs dfs -chown -R hadoop:hadoop /
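
    Optional: a quick smoke test of HDFS (the directory and file names used here are just examples):
    $ hdfs dfs -mkdir -p /user/hadoop
    $ echo "hello hadoop" > /tmp/hello.txt
    $ hdfs dfs -put /tmp/hello.txt /user/hadoop/
    $ hdfs dfs -ls /user/hadoop/
    $ hdfs dfs -cat /user/hadoop/hello.txt
    $ hdfs dfs -rm /user/hadoop/hello.txt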
  14. Hadoop Ports/Web UIs
    Ports used by Hadoop (Hadoop 3.x defaults):
    DataNode: 9866 (dfs.datanode.address)

    NameNode: 8020 (fs.defaultFS)

    Secondary NameNode: 9868 (dfs.namenode.secondary.http-address)

    • HDFS Web UI: http://localhost:9870

    • Resource Manager: http://localhost:8088

    • Node Manager: http://localhost:8042
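
    Optional: you can also check that the daemons are listening on these ports (the list of ports used here is an example; adjust it to your configuration):
    $ sudo netstat -plten | grep -E ':(8020|9870|9866|9868|8088|8042)'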
  15. Hadoop: status, log files
    Hadoop processes info:
    • Java virtual machine process status tool: jps
      $ jps -ml
      12422 org.apache.hadoop.hdfs.server.namenode.NameNode
      12640 org.apache.hadoop.hdfs.server.datanode.DataNode
      13524 org.apache.hadoop.yarn.server.nodemanager.NodeManager
      13156 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
      12901 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode

    • Display process info: ps -fp <pid> | less
      $ ps -fp 6476 | less
      UID        PID  PPID  C STIME TTY          TIME CMD
      hadoop    6476     1  0 09:55 ?        00:00:09 /opt/jdk1.8.0_172/bin/java -Dproc_datanode -Djava.net.preferIPv4Stack=true
      -Dhadoop.security.logger=ERROR,RFAS -Dyarn.log.dir=/opt/hadoop-3.3.0/logs -Dyarn.log.file=hadoop-hadoop-datanode-mtitek.log
      -Dyarn.home.dir=/opt/hadoop-3.3.0 -Dyarn.root.logger=INFO,console -Djava.library.path=/opt/hadoop-3.3.0/lib/native
      -Dhadoop.log.dir=/opt/hadoop-3.3.0/logs -Dhadoop.log.file=hadoop-hadoop-datanode-mtitek.log -Dhadoop.home.dir=/opt/hadoop-3.3.0
      -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.datanode.DataNode

    • Display active TCP connections: sudo netstat -plten

    Hadoop log files:
    • Hadoop log files can be found in "${HADOOP_HOME}/logs/"
    • Hadoop Jetty web application directories: "/tmp/jetty*"
    • Hadoop pid files: "/tmp/hadoop-*.pid"

    $ ls -al ${HADOOP_HOME}/logs/
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-namenode-mtitek.log
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-namenode-mtitek.out
    
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-datanode-mtitek.log
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-datanode-mtitek.out
    
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-secondarynamenode-mtitek.log
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-secondarynamenode-mtitek.out
    
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-nodemanager-mtitek.log
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-nodemanager-mtitek.out
    
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-resourcemanager-mtitek.log
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-resourcemanager-mtitek.out
    
    -rw-rw-r-- hadoop hadoop SecurityAuth-hadoop.audit
    
    drwxr-xr-x hadoop hadoop userlogs

    $ ls -al /tmp/ | grep hadoop
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-namenode.pid
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-datanode.pid
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-secondarynamenode.pid
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-nodemanager.pid
    -rw-rw-r-- hadoop hadoop hadoop-hadoop-resourcemanager.pid
    
    drwxrwxr-x hadoop hadoop jetty-0.0.0.0-9870-hdfs-_-any-4404239779972224131.dir
    drwxrwxr-x hadoop hadoop jetty-localhost-34228-datanode-_-any-3510019043507033994.dir
    drwxrwxr-x hadoop hadoop jetty-0.0.0.0-9868-secondary-_-any-5247285317055775022.dir
    drwxrwxr-x hadoop hadoop jetty-0.0.0.0-8042-node-_-any-3822503648428813577.dir
    drwxrwxr-x hadoop hadoop jetty-0.0.0.0-8088-cluster-_-any-3021095246783254089.dir
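
    To follow a daemon's log file while the cluster is running (the NameNode log file is used here as an example):
    $ tail -f ${HADOOP_HOME}/logs/hadoop-hadoop-namenode-*.log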
  16. Stop single-node Hadoop cluster
    $ ${HADOOP_HOME}/sbin/stop-all.sh
  17. Uninstall Hadoop
    Make sure that Hadoop is not running (see above how to stop Hadoop).

    $ hadoop namenode -format
    
    $ sudo rm -rf /opt/hadoop-3.3.0
    
    $ sudo rm -rf /home/hadoop/hadoop-tmp-dir
    
    $ sudo userdel hadoop
    $ sudo groupdel hadoop
    $ sudo rm -rf /home/hadoop/

    Note: You also need to delete Hadoop environment variables from "~/.profile" file.
  18. Hadoop "start-all.sh" permission denied: "ssh localhost: Permission denied (publickey, password)"
    To fix this issue, generate an SSH key for the "hadoop" user and add it to its authorized keys:
    $ mkdir -p ~/.ssh/
    $ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
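
    If SSH still rejects the key, check the permissions of the ".ssh" directory and the "authorized_keys" file (SSH ignores them if they are too permissive):
    $ chmod 700 ~/.ssh
    $ chmod 600 ~/.ssh/authorized_keys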

    You may also need to edit the "sshd_config" file and update the "PubkeyAuthentication" and "AllowUsers" directives (note: "AllowUsers" restricts SSH logins to the listed users only).
    $ sudo vi /etc/ssh/sshd_config
    PubkeyAuthentication yes
    AllowUsers hadoop

    Reload the SSH server configuration:
    $ sudo /etc/init.d/ssh reload
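
    Note: on systems using systemd, the equivalent command is (the service may be named "sshd" on some distributions):
    $ sudo systemctl reload ssh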

    To test SSH connection:
    $ ssh localhost

    To debug SSH connection:
    $ ssh -v localhost