Big Data | Install and configure Apache Spark (standalone)
  1. References
  2. Create "spark" user
  3. Install Spark
  4. Switch to "spark" user
  5. Update "~/.profile" file
  6. Start Spark
  7. Stop Spark
  8. Spark Ports/Web UIs
  9. Spark: status, log files
  10. Spark "start-all.sh" permission denied: "spark@localhost: Permission denied (publickey, password)"

  1. References
    See these pages for more details about Cluster Modes and Apache Spark Standalone Mode:
    https://spark.apache.org/docs/latest/spark-standalone.html
    https://spark.apache.org/docs/latest/cluster-overview.html

    See this page for more details about Apache Spark Configuration:
    https://spark.apache.org/docs/latest/configuration.html

    See these pages for more details about Apache Hadoop, Apache Hive:
    Install Apache Hadoop
    Install Apache Hive
  2. Create "spark" user
    $ sudo addgroup spark
    $ sudo adduser --ingroup spark spark
    $ sudo usermod -a -G hadoop spark
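
    To verify the user and its group memberships (the numeric uid/gid values shown here are illustrative and will differ on your system):
    $ id spark
    uid=1002(spark) gid=1002(spark) groups=1002(spark),1001(hadoop)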
  3. Install Spark
    Download Apache Spark: https://spark.apache.org/downloads.html

    - Choose a Spark release: 3.0.0 (Jun 18 2020)
    - Choose a package type: Pre-built for Apache Hadoop 3.2 and later
    - Download Spark: spark-3.0.0-bin-hadoop3.2.tgz

    Extract the file "spark-3.0.0-bin-hadoop3.2.tgz" into the folder where you want to install Spark, e.g. '/opt/spark-3.0.0-bin-hadoop3.2' (writing to "/opt/" and changing ownership both require sudo):
    $ sudo tar -xf ~/Downloads/spark-3.0.0-bin-hadoop3.2.tgz -C /opt/
    $ sudo chown -R spark:spark /opt/spark-3.0.0-bin-hadoop3.2
    $ sudo chmod -R 755 /opt/spark-3.0.0-bin-hadoop3.2

    Note: In the following sections, the environment variable ${SPARK_HOME} will refer to this location '/opt/spark-3.0.0-bin-hadoop3.2'
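
    To verify the installation, print the Spark version (this assumes the archive was extracted as shown above):
    $ /opt/spark-3.0.0-bin-hadoop3.2/bin/spark-submit --version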
  4. Switch to "spark" user
    $ su - spark
  5. Update "~/.profile" file
    $ vi ~/.profile
    export JAVA_HOME="/opt/jdk1.8.0_172"
    
    export SPARK_HOME="/opt/spark-3.0.0-bin-hadoop3.2"
    
    export HADOOP_HOME="/opt/hadoop-3.3.0"
    
    export HIVE_HOME="/opt/apache-hive-3.1.2-bin"
    
export CLASSPATH=$CLASSPATH:$SPARK_HOME/jars/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HIVE_HOME/lib/*
    
    PATH="$JAVA_HOME/bin:$SPARK_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin:$PATH"

    Load ".profile" environment variables:
    $ source ~/.profile
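
    To verify that the environment variables were loaded:
    $ echo ${SPARK_HOME}
    /opt/spark-3.0.0-bin-hadoop3.2

    $ which spark-shell
    /opt/spark-3.0.0-bin-hadoop3.2/bin/spark-shell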
  6. Start Spark
    $ ${SPARK_HOME}/sbin/start-all.sh

    To start the Spark master and worker separately:
    $ ${SPARK_HOME}/sbin/start-master.sh

    $ ${SPARK_HOME}/sbin/start-slave.sh spark://localhost:7077

    Note: starting with Spark 3.1, "start-slave.sh" and "stop-slave.sh" were renamed "start-worker.sh" and "stop-worker.sh".

    You might get this error "Permission denied (publickey, password)" when you start Spark.
    To fix this error, see Spark "start-all.sh" permission denied: "spark@localhost: Permission denied (publickey, password)".
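
    Once the master and worker are running, you can submit the bundled SparkPi example as a smoke test (the examples jar ships with the Spark distribution; adjust the jar name to match your Spark/Scala versions). The driver output should include a line like "Pi is roughly 3.14...":
    $ ${SPARK_HOME}/bin/spark-submit \
        --class org.apache.spark.examples.SparkPi \
        --master spark://localhost:7077 \
        ${SPARK_HOME}/examples/jars/spark-examples_2.12-3.0.0.jar 100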
  7. Stop Spark
    $ ${SPARK_HOME}/sbin/stop-all.sh
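
    To stop the Spark master and worker separately:
    $ ${SPARK_HOME}/sbin/stop-master.sh

    $ ${SPARK_HOME}/sbin/stop-slave.sh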
  8. Spark Ports/Web UIs
    Ports Used by Spark:
    Default master RPC port: 7077 (SPARK_MASTER_PORT)

    Default worker RPC port: random (SPARK_WORKER_PORT; set it explicitly to get a fixed port, e.g. 7078)

    Default master web UI port: 8080 (SPARK_MASTER_WEBUI_PORT)

    Default worker web UI port: 8081 (SPARK_WORKER_WEBUI_PORT)

    History server: 18080 (spark.history.ui.port)

    Shuffle service: 7337 (spark.shuffle.service.port)
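
    Most of these defaults can be overridden in "${SPARK_HOME}/conf/spark-env.sh". For example, to pin the worker RPC port and web UI port (SPARK_WORKER_PORT and SPARK_WORKER_WEBUI_PORT are documented standalone-mode variables):
    $ cp ${SPARK_HOME}/conf/spark-env.sh.template ${SPARK_HOME}/conf/spark-env.sh
    $ vi ${SPARK_HOME}/conf/spark-env.sh
    export SPARK_WORKER_PORT=7078
    export SPARK_WORKER_WEBUI_PORT=8081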

    ► Spark master web UI: http://localhost:8080

    [screenshot: Spark master web UI]

    ► Spark shell application UI (see below how to start the Spark shell; the application UI is available on port 4040 while the shell is running): http://localhost:4040

    [screenshot: Spark shell application UI]
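
    To start the Spark shell against the standalone master (if port 4040 is already in use, Spark binds the application UI to the next free port, e.g. 4041):
    $ ${SPARK_HOME}/bin/spark-shell --master spark://localhost:7077
    scala> spark.version
    res0: String = 3.0.0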
  9. Spark: status, log files
    Spark process info:
    • Java virtual machine process status tool: jps
      $ jps -ml
      14945 org.apache.spark.deploy.master.Master --host localhost --port 7077 --webui-port 8080
      15103 org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://localhost:7077

    • Display process info: ps -fp <pid> | less
      $ ps -fp 14945 | less
      UID        PID  PPID  C STIME TTY          TIME CMD
      spark    14945     1  1 13:51 pts/0    00:00:04 /opt/jdk1.8.0_172/bin/java
      -cp /opt/spark-3.0.0-bin-hadoop3.2/conf/:/opt/spark-3.0.0-bin-hadoop3.2/jars/*
      -Xmx1g org.apache.spark.deploy.master.Master --host localhost --port 7077 --webui-port 8080

      $ ps -fp 15103 | less
      UID        PID  PPID  C STIME TTY          TIME CMD
      spark    15103     1  1 13:51 pts/0    00:00:05 /opt/jdk1.8.0_172/bin/java
      -cp /opt/spark-3.0.0-bin-hadoop3.2/conf/:/opt/spark-3.0.0-bin-hadoop3.2/jars/*
      -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://localhost:7077
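
    To check which ports the master and worker are actually listening on (this assumes the "ss" utility is installed; "netstat -tlnp" works as an alternative):
    $ sudo ss -tlnp | grep -E '7077|8080|8081'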

    Spark log files:
    • Spark log files can be found in "${SPARK_HOME}/logs/"
    • Spark workers log files can be found in "${SPARK_HOME}/work/"
    • Spark pid files: "/tmp/spark-*.pid"

    $ ls -al ${SPARK_HOME}/logs/
    -rw-rw-r-- spark spark spark-spark-org.apache.spark.deploy.master.Master-1-mtitek.out
    -rw-rw-r-- spark spark spark-spark-org.apache.spark.deploy.worker.Worker-1-mtitek.out

    $ ls -al ${SPARK_HOME}/work/
    drwxrwxr-x spark spark app-11111111081718-0000
    drwxrwxr-x spark spark app-11111111081740-0001

    $ ls -al ${SPARK_HOME}/work/app-11111111081718-0000/0/
    -rw-rw-r-- spark spark stderr
    -rw-rw-r-- spark spark stdout

    $ ls -al /tmp/ | grep spark
    -rw-rw-r-- spark spark spark-spark-org.apache.spark.deploy.master.Master-1.pid
    -rw-rw-r-- spark spark spark-spark-org.apache.spark.deploy.worker.Worker-1.pid
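
    To follow a log file while troubleshooting startup issues (the file names include the user name and host name, as shown above):
    $ tail -f ${SPARK_HOME}/logs/spark-*-org.apache.spark.deploy.master.Master-*.out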
  10. Spark "start-all.sh" permission denied: "spark@localhost: Permission denied (publickey, password)"
    To fix this issue, generate an SSH key for the "spark" user and add it to the user's authorized keys (sshd rejects an "authorized_keys" file that is writable by others, hence the chmod):
    $ mkdir -p ~/.ssh && cd ~/.ssh/
    $ ssh-keygen -t rsa -P ""
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    $ chmod 600 ~/.ssh/authorized_keys

    Optional: You may also need to edit the "sshd_config" file and update the "PubkeyAuthentication" and "AllowUsers" directives (note: "AllowUsers", if set, restricts SSH logins to only the listed users).
    $ sudo vi /etc/ssh/sshd_config
    PubkeyAuthentication yes
    AllowUsers spark

    Reload the SSH configuration:
    $ sudo /etc/init.d/ssh reload
    (on systemd-based distributions: $ sudo systemctl reload ssh)

    To test the SSH connection:
    $ ssh localhost

    To debug the SSH connection:
    $ ssh -v localhost
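
    To verify that key-based authentication works without a password prompt ("BatchMode=yes" makes ssh fail immediately rather than prompt for a password):
    $ ssh -o BatchMode=yes localhost echo ok
    ok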