Spark API: RDD, DataFrame, Dataset
  1. References
  2. Notes
  3. Create RDD, DataFrame, Dataset
  4. Read/Write supported options
  5. Read files (Text, CSV, JSON, Parquet, ORC)
    1. Supported methods of spark.read
    2. Read CSV files
    3. Read JSON files
    4. Read Text, Parquet, ORC files
    5. spark.read.format("...").load("...")
  6. Create schema
  7. Add schema to a DataFrame (change the schema of a DataFrame)
  8. Provide schema while reading CSV files
  9. Write Dataset/DataFrame to Text, CSV, JSON, Parquet, ORC files
    1. Supported methods of [Dataset/DataFrame].write
    2. [Dataset/DataFrame].write
    3. [Dataset/DataFrame].write.format("...").save("...")
  10. Use coalesce and repartition to manage partitions
  11. Dataset methods

  1. References
    Spark SQL, DataFrames and Datasets Guide: https://spark.apache.org/docs/2.4.3/sql-getting-started.html

    See this page for the API of Dataset: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html

    See this page for the API of RDD: https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html

    See this page for details on how to use spark-shell: spark-shell
  2. Notes
    RDD (Resilient Distributed Dataset) is the basic abstraction in Spark.
    RDD is an immutable distributed collection of elements partitioned across the nodes of the cluster that can be operated on in parallel (using a low-level API that allows applying transformations and performing actions on the RDD).

    DataFrame is an immutable, partitioned collection of elements that can be operated on in parallel.
    Elements are organized into named columns (similar to a relational table).

    Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations.
    Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row (Dataset[Row]).
  3. Create RDD, DataFrame, Dataset
    Create a simple-test text file:
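    A minimal sketch that can be run directly in spark-shell; the file path and contents are assumptions reused by the examples below:

      import java.nio.file.{Files, Paths}
      import java.nio.charset.StandardCharsets

      // hypothetical sample file used by the following examples
      val content = "line 1\nline 2\nline 3\n"
      Files.write(Paths.get("/tmp/simple-test.txt"), content.getBytes(StandardCharsets.UTF_8))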
    • Create RDD from text file:
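      A minimal spark-shell sketch, assuming the hypothetical file /tmp/simple-test.txt created above:

        // sc is the SparkContext provided by spark-shell
        val rdd = sc.textFile("/tmp/simple-test.txt")   // RDD[String], one element per line
        rdd.collect().foreach(println)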


    • Create Dataset from text file:
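      A minimal sketch (same hypothetical file):

        // spark is the SparkSession provided by spark-shell
        val ds = spark.read.textFile("/tmp/simple-test.txt")   // Dataset[String]
        ds.show()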


    • Create DataFrame from text file:
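      A minimal sketch (same hypothetical file):

        val df = spark.read.text("/tmp/simple-test.txt")   // DataFrame with a single "value" column
        df.show()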


    • Create DataFrame from RDD:
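      A sketch converting the RDD created above; the column name "value" is an assumption:

        import spark.implicits._

        val dfFromRdd = rdd.toDF("value")
        dfFromRdd.printSchema()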


    • Create DataFrame from Dataset:
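      A sketch using the Dataset[String] created above; toDF returns its untyped view (a Dataset of Row):

        val dfFromDs = ds.toDF()
        dfFromDs.printSchema()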


    • Create DataFrame from sequence:
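      A minimal sketch; the column names and sample values are assumptions, but the name dfSequence is reused by the write examples later in this page:

        import spark.implicits._

        val dfSequence = Seq(
          (1, "alice", 30),
          (2, "bob", 25),
          (3, "carol", 35)
        ).toDF("id", "name", "age")
        dfSequence.show()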

  4. Read/Write supported options
    To find out the list of supported options for reading and writing different file formats, you can visit the links below (Scala source code):
    • Text Options

    • CSV Options

    • JSON Options

    • Parquet Options

    • Orc Options
  5. Read files
    1. Supported methods of spark.read

      • text
      • textFile
      • csv
      • json
      • parquet
      • orc

      • load

      • option
      • options

      • format

      • schema

      • jdbc

      • table
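      Most of these methods are shown in the subsections below. The table and jdbc readers are not, so here is a hedged sketch; the table name, JDBC URL, and credentials are hypothetical, and the JDBC driver must be on the classpath:

        // read a table registered in the catalog
        val dfTable = spark.read.table("my_table")

        // read from a relational database over JDBC
        import java.util.Properties
        val props = new Properties()
        props.setProperty("user", "test")
        props.setProperty("password", "test")
        val dfJdbc = spark.read.jdbc("jdbc:postgresql://localhost:5432/testdb", "my_table", props)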
    2. Read CSV files
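      A minimal sketch; the file path /tmp/simple-test.csv and the presence of a header row are assumptions:

        val dfCsv = spark.read
          .option("header", "true")        // use the first line as column names
          .option("inferSchema", "true")   // infer column types instead of defaulting to string
          .csv("/tmp/simple-test.csv")
        dfCsv.show()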
    3. Read JSON files
      • Single line:
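        A sketch assuming a hypothetical file /tmp/simple-test.json with one JSON object per line (the default layout expected by Spark):

          val dfJson = spark.read.json("/tmp/simple-test.json")
          dfJson.show()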



      • Multi line:
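        A sketch assuming a hypothetical file whose JSON records span several lines:

          val dfJsonMulti = spark.read
            .option("multiLine", "true")   // allow a single JSON record to span multiple lines
            .json("/tmp/simple-test-multiline.json")
          dfJsonMulti.show()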



      • Character encoding (charset):

        By default, Spark detects the character encoding, but it's possible to explicitly specify the character encoding of a file using the charset option.
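        A sketch assuming a hypothetical ISO-8859-1 encoded file:

          val dfJsonCharset = spark.read
            .option("charset", "ISO-8859-1")   // explicit character encoding
            .json("/tmp/simple-test-latin1.json")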

    4. Read Text, Parquet, ORC files
      • Read Text file:
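        A minimal sketch (same hypothetical text file as above):

          val dfText = spark.read.text("/tmp/simple-test.txt")       // DataFrame with a single "value" column
          val dsText = spark.read.textFile("/tmp/simple-test.txt")   // Dataset[String]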



      • Read Parquet file:
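        A sketch assuming a Parquet file exists at the hypothetical path /tmp/df-parquet (for example, one produced by the write examples below):

          val dfParquet = spark.read.parquet("/tmp/df-parquet")
          dfParquet.printSchema()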


      • Read ORC file:
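        A sketch assuming an ORC file exists at the hypothetical path /tmp/df-orc:

          val dfOrc = spark.read.orc("/tmp/df-orc")
          dfOrc.printSchema()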

    5. spark.read.format("...").load("...")
      • Read Text file:
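        A sketch equivalent to spark.read.text (hypothetical path):

          val dfText = spark.read.format("text").load("/tmp/simple-test.txt")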


      • Read CSV file:
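        A sketch equivalent to spark.read.csv (hypothetical path, header assumed):

          val dfCsv = spark.read
            .format("csv")
            .option("header", "true")
            .load("/tmp/simple-test.csv")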



      • Read JSON file:
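        A sketch equivalent to spark.read.json (hypothetical path):

          val dfJson = spark.read.format("json").load("/tmp/simple-test.json")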



      • Read Parquet file:
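        A sketch equivalent to spark.read.parquet (hypothetical path):

          val dfParquet = spark.read.format("parquet").load("/tmp/df-parquet")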


      • Read ORC file:
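        A sketch equivalent to spark.read.orc (hypothetical path):

          val dfOrc = spark.read.format("orc").load("/tmp/df-orc")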

  6. Create schema
    • Create schema using StructType/StructField:
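      A minimal sketch; the field names and types are assumptions chosen to match the dfSequence columns used later:

        import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

        val schema = StructType(Seq(
          StructField("id", IntegerType, nullable = false),
          StructField("name", StringType, nullable = true),
          StructField("age", IntegerType, nullable = true)
        ))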


    • Create schema using StructType/StructField (dynamic fields list + single data type):
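      A sketch where the list of field names is built dynamically and every field gets the same data type (the field names are assumptions):

        import org.apache.spark.sql.types.{StructType, StructField, StringType}

        val fieldNames = Seq("id", "name", "age")   // could come from a header line or from configuration
        val dynamicSchema = StructType(fieldNames.map(name => StructField(name, StringType, nullable = true)))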

  7. Add schema to a DataFrame (change the schema of a DataFrame)
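    A hedged sketch, assuming the dfSequence DataFrame and the schema defined above (the schema must be compatible with the existing rows):

      // rebuild the DataFrame from its RDD of rows, using the new schema
      val dfWithSchema = spark.createDataFrame(dfSequence.rdd, schema)
      dfWithSchema.printSchema()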
  8. Provide schema while reading CSV files
    • Provide schema while reading a CSV file:
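      A sketch using the schema defined above (hypothetical CSV path):

        val dfCsvWithSchema = spark.read
          .schema(schema)              // use the explicit schema instead of inferring one
          .option("header", "true")    // skip the header line since the schema is provided
          .csv("/tmp/simple-test.csv")
        dfCsvWithSchema.printSchema()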


    • Use the header of the CSV file as the schema:
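      A sketch where the column names come from the header row and the column types are inferred from the data:

        val dfCsvHeader = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/tmp/simple-test.csv")
        dfCsvHeader.printSchema()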


  9. Write Dataset/DataFrame to Text, CSV, JSON, Parquet, ORC files
    Let's use the DataFrame dfSequence created from a sequence (see "Create DataFrame from sequence:" above).
    The same examples can be applied to a Dataset.

    1. Supported methods of [Dataset/DataFrame].write

      • text
      • csv
      • json
      • parquet
      • orc

      • save
      • saveAsTable
      • insertInto

      • option
      • options

      • mode
      • format

      • partitionBy
      • bucketBy

      • sortBy

      • jdbc
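      Most of these methods are shown in the subsections below. The mode and partitionBy methods deserve a quick hedged sketch; the output path and the partition column are assumptions:

        dfSequence.write
          .mode("overwrite")    // overwrite the target directory if it already exists
          .partitionBy("age")   // one sub-directory per distinct value of the column
          .parquet("/tmp/df-partitioned-parquet")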
    2. [Dataset/DataFrame].write
      • Write DataFrame to Text file:
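        A sketch of the failing call: the text data source supports only a single column of string type, so writing the multi-column dfSequence raises an error:

          dfSequence.write.text("/tmp/df-text")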


        To fix the error above, either use the csv method (see write.csv below) or use the following custom code (but be aware that it might not work properly in some cases):
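        A hedged sketch of such custom code: all columns are concatenated into a single comma-separated string column before writing (the separator and output path are assumptions; values containing the separator are not escaped, which is one reason the csv writer is preferable):

          import org.apache.spark.sql.functions.{col, concat_ws}

          dfSequence
            .select(concat_ws(",", dfSequence.columns.map(c => col(c).cast("string")): _*).as("value"))
            .write.text("/tmp/df-text")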
        To see the generated files:
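        A small sketch listing the output directory from the Scala REPL (an ls of the directory from a terminal works just as well); the output path is a directory containing one part file per partition:

          import java.nio.file.{Files, Paths}
          import scala.collection.JavaConverters._
          Files.list(Paths.get("/tmp/df-text")).iterator().asScala.foreach(println)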
      • Write DataFrame to CSV file:
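        A minimal sketch (hypothetical output path):

          dfSequence.write
            .mode("overwrite")
            .option("header", "true")   // write the column names as the first line
            .csv("/tmp/df-csv")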


      • Write DataFrame to JSON file:
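        A minimal sketch (hypothetical output path):

          dfSequence.write.mode("overwrite").json("/tmp/df-json")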


      • Write DataFrame to Parquet file:
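        A minimal sketch (hypothetical output path):

          dfSequence.write.mode("overwrite").parquet("/tmp/df-parquet")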


      • Write DataFrame to ORC file:
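        A minimal sketch (hypothetical output path):

          dfSequence.write.mode("overwrite").orc("/tmp/df-orc")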

    3. [Dataset/DataFrame].write.format("...").save("...")
      • Write DataFrame to CSV file:
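        A sketch equivalent to write.csv (hypothetical output path):

          dfSequence.write
            .format("csv")
            .mode("overwrite")
            .option("header", "true")
            .save("/tmp/df-csv")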


      • Write DataFrame to JSON file:
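        A sketch equivalent to write.json (hypothetical output path):

          dfSequence.write.format("json").mode("overwrite").save("/tmp/df-json")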


      • Write DataFrame to Parquet file:
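        A sketch equivalent to write.parquet (hypothetical output path):

          dfSequence.write.format("parquet").mode("overwrite").save("/tmp/df-parquet")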


      • Write DataFrame to ORC file:
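        A sketch equivalent to write.orc (hypothetical output path):

          dfSequence.write.format("orc").mode("overwrite").save("/tmp/df-orc")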

  10. Use coalesce and repartition to manage partitions
    Writing the DataFrame will result in many partition files on disk.
    To verify the number of partitions of a DataFrame:
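    For example, using dfSequence:

      dfSequence.rdd.getNumPartitions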
    To write the DataFrame to a single file, use either coalesce or repartition:
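    A hedged sketch (hypothetical output path); note that the result is still a directory containing a single part file:

      // coalesce avoids a full shuffle when reducing the number of partitions
      dfSequence.coalesce(1).write.mode("overwrite").option("header", "true").csv("/tmp/df-csv-single")

      // repartition triggers a shuffle, but can also increase the number of partitions
      dfSequence.repartition(1).write.mode("overwrite").option("header", "true").csv("/tmp/df-csv-single")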
  11. Dataset methods
    Let's use the DataFrame dfSequence created from a sequence (see "Create DataFrame from sequence:" above).
    The same examples can be applied to a typed Dataset.

    • Schema:
      • <dataset>.printSchema: Prints the schema to the console in a nice tree format.

      • <dataset>.dtypes: Returns all column names and their data types as an array.

      • <dataset>.schema: Returns the schema of this Dataset.

      • <dataset>.columns: Returns all column names as an array.
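      A combined sketch of the schema methods above, using dfSequence:

        dfSequence.printSchema()   // tree view of the schema
        dfSequence.dtypes          // Array[(String, String)] of column names and type names
        dfSequence.schema          // StructType describing the Dataset
        dfSequence.columns         // Array[String] of column names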

    • Columns:
      • <dataset>.withColumn(columnName, column): Returns a new Dataset by adding a column or replacing the existing column that has the same name.


      • <dataset>.withColumnRenamed(existingColumnName, newColumnName): Returns a new Dataset with a column renamed.


      • <dataset>.drop(columnNames): Returns a new Dataset with columns dropped.
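      A combined sketch of the column methods above; the new column names and the expression are assumptions:

        import org.apache.spark.sql.functions.col

        dfSequence.withColumn("age_plus_one", col("age") + 1).show()
        dfSequence.withColumnRenamed("name", "first_name").show()
        dfSequence.drop("age").show()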


    • Rows:
      • <dataset>.count: Returns the number of rows in the Dataset.

      • <dataset>.show: Displays the top 20 rows of Dataset in a tabular form.

      • <dataset>.head: Returns the first row.

      • <dataset>.first: Returns the first row.

      • <dataset>.take(n): Returns the first n rows in the Dataset.

      • <dataset>.limit(n): Returns a new Dataset by taking the first n rows.

      • <dataset>.distinct: Returns a new Dataset that contains only the unique rows from this Dataset.
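      A combined sketch of the row methods above, using dfSequence:

        dfSequence.count()           // number of rows
        dfSequence.show()            // top 20 rows in tabular form
        dfSequence.head()            // first Row
        dfSequence.first()           // alias of head
        dfSequence.take(2)           // Array of the first 2 rows
        dfSequence.limit(2).show()   // new Dataset with only the first 2 rows
        dfSequence.distinct().show() // unique rows only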

    • Select:
      • <dataset>.select(columnName, columnNames): Selects a set of columns.
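      A sketch selecting a subset of columns (the column names match the assumed dfSequence schema):

        dfSequence.select("name", "age").show()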



    • Filter:
      • <dataset>.filter(condition): Filters rows using the given condition.
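      A sketch using both a column expression and a SQL-style string condition (column names assumed):

        import org.apache.spark.sql.functions.col

        dfSequence.filter(col("age") > 28).show()
        dfSequence.filter("age > 28").show()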



    • Sort:
      • <dataset>.sort(columns): Returns a new Dataset sorted by the given expressions.


      • <dataset>.orderBy(columns): Returns a new Dataset sorted by the given expressions.
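      A sketch of both methods (orderBy is an alias of sort); the column names are assumptions:

        import org.apache.spark.sql.functions.col

        dfSequence.sort(col("age").desc).show()
        dfSequence.orderBy("name").show()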


    • Views:
      • <dataset>.createTempView(viewName): Creates a local temporary view using the given name.
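      A sketch creating a temporary view and querying it with Spark SQL; the view name is an assumption:

        dfSequence.createTempView("df_sequence_view")
        spark.sql("SELECT name, age FROM df_sequence_view WHERE age > 28").show()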
