Big Data | Spark Interactive Shell (Scala): spark-shell
  1. spark-shell command line options
  2. Start spark-shell
  3. spark-shell commands help
  4. spark-shell default import
  5. spark-shell default variables
  6. Running Linux commands

  1. spark-shell command line options
    $ ${SPARK_HOME}/bin/spark-shell --help
    Usage: ./bin/spark-shell [options]

    • Scala REPL options:
      -I <file>                   preload <file>, enforcing line-by-line interpretation

    • Generic options:
      --name NAME                 A name of your application.
      
      --master MASTER_URL         local, spark://host:port, mesos://host:port, yarn, or k8s://https://host:port.
                                  (Default: local[*])
      
      --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client")
                                  or on one of the worker machines inside the cluster ("cluster").
                                  (Default: client)
      
      --conf PROP=VALUE           Arbitrary Spark configuration property.
      
      --properties-file FILE      Path to a file from which to load extra properties.
                                  If not specified, this will look for conf/spark-defaults.conf.
      
      --class CLASS_NAME          Your application's main class (for Java / Scala apps).
      
      --jars JARS                 Comma-separated list of jars to include on the driver and executor classpaths.
      
      --packages                  Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths.
                                  Will search the local maven repo, then maven central and any additional remote repositories given by --repositories.
                                  The format for the coordinates should be groupId:artifactId:version.
      
      --exclude-packages          Comma-separated list of groupId:artifactId,
                                  to exclude while resolving the dependencies provided in --packages
                                  to avoid dependency conflicts.
      
      --repositories              Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages.
      
      --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
      
      --files FILES               Comma-separated list of files to be placed in the working directory of each executor.
                                  File paths of these files in executors can be accessed via SparkFiles.get(fileName).
      
      --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
      --driver-java-options       Extra Java options to pass to the driver.
      --driver-library-path       Extra library path entries to pass to the driver.
      --driver-class-path         Extra class path entries to pass to the driver.
                                  Note that jars added with --jars are automatically included in the classpath.
      
      --executor-memory MEM       Memory per executor (e.g. 1000M, 2G).
                                  (Default: 1G)
      
      --proxy-user NAME           User to impersonate when submitting the application.
                                  This argument does not work with --principal / --keytab.
      
      --help, -h                  Show this help message and exit.
      --verbose, -v               Print additional debug output.
      --version,                  Print the version of current Spark.

    • Cluster deploy mode only:
      --driver-cores NUM          Number of cores used by the driver, only in cluster mode.
                                  (Default: 1)

    • Spark standalone or Mesos with cluster deploy mode only:
      --supervise                 If given, restarts the driver on failure.
      
      --kill SUBMISSION_ID        If given, kills the driver specified.
      
      --status SUBMISSION_ID      If given, requests the status of the driver specified.

    • Spark standalone and Mesos only:
      --total-executor-cores NUM  Total cores for all executors.

    • Spark standalone and YARN only:
      --executor-cores NUM        Number of cores per executor.
                                  (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)

    • YARN only:
      --queue QUEUE_NAME          The YARN queue to submit to.
                                  (Default: "default")
      
      --num-executors NUM         Number of executors to launch.
                                  If dynamic allocation is enabled, the initial number of executors will be at least NUM.
                                  (Default: 2)
      
      --archives ARCHIVES         Comma separated list of archives to be extracted into the working directory of each executor.
      
      --principal PRINCIPAL       Principal to be used to login to KDC, while running on secure HDFS.
      
      --keytab KEYTAB             The full path to the file that contains the keytab for the principal specified above.
                                  This keytab will be copied to the node running the Application Master via the Secure Distributed Cache,
                                  for renewing the login tickets and the delegation tokens periodically.
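
    For example, an invocation combining several of the options above could look like the following (a sketch only: the application name, master URL, memory settings, jar path, and configuration value are placeholders to adapt to your environment):

    $ ${SPARK_HOME}/bin/spark-shell \
        --name my-shell-session \
        --master local[4] \
        --driver-memory 2G \
        --executor-memory 2G \
        --jars /path/to/extra-lib.jar \
        --conf spark.sql.shuffle.partitions=8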
  2. Start spark-shell
    Spark provides an interactive shell to learn and explore the Spark API. It is available in either Scala (spark-shell) or Python (pyspark).

    $ spark-shell
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://localhost:4040
    Spark context available as 'sc' (master = local[*], app id = local-1558893967168).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
          /_/
    
    Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_172)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala>

    Default Spark Session configuration:
    scala> spark.conf.getAll.foreach(println(_))
    (spark.app.name,Spark shell)
    (spark.app.id,local-1562026247680)
    (spark.master,local[*])
    (spark.submit.deployMode,client)
    (spark.driver.host,192.168.2.33)
    (spark.driver.port,43485)
    (spark.executor.id,driver)
    (spark.sql.catalogImplementation,hive)
    (spark.home,/opt/spark-2.4.3-bin-hadoop2.7)
    (spark.repl.class.uri,spark://192.168.2.33:43485/classes)
    (spark.jars,)
    (spark.repl.class.outputDir,/tmp/spark-f9071428-76aa-49bc-84de-3163bd98e0b4/repl-e149d9f1-3060-4476-a11f-f2337c381a42)
    (spark.ui.showConsoleProgress,true)
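
    Individual settings can also be read or changed at runtime from the shell; for example (the values and resN indexes shown will differ per environment, and only runtime-modifiable properties can be set this way):

    scala> spark.conf.get("spark.master")
    res0: String = local[*]

    scala> spark.conf.set("spark.sql.shuffle.partitions", "8")

    scala> sc.setLogLevel("WARN")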
  3. spark-shell commands help
    scala> :help
    All commands can be abbreviated, e.g., :he instead of :help.
    :help [command]           |  print this summary or command-specific help
    :quit                     |  exit the interpreter
    
    :type [-v] <expr>         |  display the type of an expression without evaluating it
    :kind [-v] <expr>         |  display the kind of expression's type
    
    :imports [name name ...]  |  show import history, identifying sources of names
    :implicits [-v]           |  show the implicits in scope
    
    :history [num]            |  show the history (optional num is commands to show)
    :h? <string>              |  search the history
    :edit <id>|<line>         |  edit history
    :warnings                 |  show the suppressed warnings from the most recent line which had any
    
    :javap <path|class>       |  disassemble a file or class name
    :line <id>|<line>         |  place line(s) at the end of history
    :load <path>              |  interpret lines in a file
    :paste [-raw] [path]      |  enter paste mode or paste a file
    :power                    |  enable power user mode
    :replay [options]         |  reset the repl and replay all previous commands
    :require <path>           |  add a jar to the classpath
    :reset [options]          |  reset the repl to its initial state, forgetting all session entries
    :save <path>              |  save replayable session to a file
    :sh <command line>        |  run a shell command (result is implicitly => List[String])
    :settings <options>       |  update compiler options, if possible; see reset
    :silent                   |  disable/enable automatic printing of results
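
    A few of these commands in action: :type prints the static type of an expression without evaluating it, and a session can be saved with :save and replayed with :load (the file path below is just an example):

    scala> :type 1 + 1
    Int

    scala> :save /tmp/my-session.scala

    scala> :load /tmp/my-session.scala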
  4. spark-shell default import
    spark-shell automatically adds a number of default imports.
    To list them, use the :imports command:
    scala> :imports
     1) import org.apache.spark.SparkContext._ (70 terms, 1 are implicit)
     2) import spark.implicits._       (1 types, 67 terms, 37 are implicit)
     3) import spark.sql               (1 terms)
     4) import org.apache.spark.sql.functions._ (385 terms)
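
    These imports enable common shortcuts out of the box: spark.implicits._ provides the toDF/toDS conversions on local collections, import spark.sql lets you call sql(...) without the spark. prefix, and org.apache.spark.sql.functions._ brings the built-in column functions into scope. A minimal example:

    scala> Seq((1, "a"), (2, "b")).toDF("id", "name").show()
    +---+----+
    | id|name|
    +---+----+
    |  1|   a|
    |  2|   b|
    +---+----+

    scala> sql("SELECT 1 AS one").show()
    +---+
    |one|
    +---+
    |  1|
    +---+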
  5. spark-shell default variables
    spark-shell automatically creates an instance of Spark Session (accessible via the spark variable) and an instance of Spark Context (accessible via the sc variable).
    ► Spark Session available as "spark".
    ► Spark Context available as "sc".

    Spark Session variable:
    scala> :type spark
    org.apache.spark.sql.SparkSession

    Spark Context variable:
    scala> :type sc
    org.apache.spark.SparkContext
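
    The two variables are tied together: in a default spark-shell session, sc is the SparkContext owned by the spark session, so the following comparison holds (the resN index will differ):

    scala> spark.sparkContext eq sc
    res0: Boolean = true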

    You can use auto-completion (the Tab key) to list the available members:
    scala> spark.<tab>
    baseRelationToDataFrame   conf              emptyDataFrame   implicits         range        sessionState   sql          streams   udf
    catalog                   createDataFrame   emptyDataset     listenerManager   read         sharedState    sqlContext   table     version
    close                     createDataset     experimental     newSession        readStream   sparkContext   stop         time

    scala> sc.<tab>
    accumulable             binaryRecords           defaultParallelism        getPersistentRDDs     killExecutor       objectFile              setCallSite         submitJob
    accumulableCollection   broadcast               deployMode                getPoolForName        killExecutors      parallelize             setCheckpointDir    textFile
    accumulator             cancelAllJobs           doubleAccumulator         getRDDStorageInfo     killTaskAttempt    range                   setJobDescription   uiWebUrl
    addFile                 cancelJob               emptyRDD                  getSchedulingMode     listFiles          register                setJobGroup         union
    addJar                  cancelJobGroup          files                     hadoopConfiguration   listJars           removeSparkListener     setLocalProperty    version
    addSparkListener        cancelStage             getAllPools               hadoopFile            longAccumulator    requestExecutors        setLogLevel         wholeTextFiles
    appName                 clearCallSite           getCheckpointDir          hadoopRDD             makeRDD            requestTotalExecutors   sparkUser
    applicationAttemptId    clearJobGroup           getConf                   isLocal               master             runApproximateJob       startTime
    applicationId           collectionAccumulator   getExecutorMemoryStatus   isStopped             newAPIHadoopFile   runJob                  statusTracker
    binaryFiles             defaultMinPartitions    getLocalProperty          jars                  newAPIHadoopRDD    sequenceFile            stop
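
    For example, two of the members listed above, spark.range and sc.parallelize, can be used directly (the resN index will differ):

    scala> spark.range(3).show()
    +---+
    | id|
    +---+
    |  0|
    |  1|
    |  2|
    +---+

    scala> sc.parallelize(1 to 10).sum()
    res0: Double = 55.0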
  6. Running Linux commands
    First, import the package sys.process._.
    To run a Linux command, wrap it in double quotes and append .! (a dot followed by an exclamation mark); this runs the command, prints its output, and returns its exit code.

    scala> import sys.process._
    import sys.process._
    
    scala> "hdfs dfs -ls /".!
    drwxrwxr-x+ - hive hadoop /hive
    drwx-wx-wx  - hive hadoop /tmp
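
    The same package also provides !!, which returns the command's output as a String instead of printing it, and a command with arguments can be given as a Seq of strings (the ls command and the /tmp path below are just examples):

    scala> val exitCode = "ls /tmp".!      // prints the output and returns the exit code (an Int)

    scala> val output = "ls /tmp".!!       // returns the output as a String (throws if the exit code is non-zero)

    scala> Seq("ls", "-l", "/tmp").!       // explicit argument list; safer when arguments contain spaces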