Big Data | Parquet Tools
  1. References
  2. Quick notes about Parquet
  3. Usage
  4. Command: cat
  5. Command: head
  6. Command: schema
  7. Command: meta
  8. Command: dump
  9. Command: merge

  1. References
    See these pages for more details about Parquet Tools:
    https://github.com/apache/parquet-mr/tree/master/parquet-tools
    https://mvnrepository.com/artifact/org.apache.parquet/parquet-tools
  2. Quick notes about Parquet
    Parquet is a columnar (column-oriented) storage format for the Hadoop ecosystem.
    Instead of storing records row by row, it stores the values of each column together:

    Columnar layout:
    r{1}-c{1}, r{2}-c{1}, r{3}-c{1}, ...  r{1}-c{2}, r{2}-c{2}, r{3}-c{2}, ...  r{1}-c{3}, r{2}-c{3}, r{3}-c{3}, ...

    Row-based layout:
    r{1}-c{1}, r{1}-c{2}, r{1}-c{3}, ...  r{2}-c{1}, r{2}-c{2}, r{2}-c{3}, ...  r{3}-c{1}, r{3}-c{2}, r{3}-c{3}, ...

    Key concepts:
    - Block Size
    - Row Group
    - Page

    File:
    - Magic Number
    - Row group 0
      - Column 1
        - Page 0
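
    The following is a minimal Java sketch (not part of parquet-tools) showing how this structure can be inspected programmatically with the parquet-mr API; it assumes parquet-hadoop and the Hadoop client libraries are on the classpath, and the class name is illustrative. It reads only the file footer (the metadata stored at the end of the file) and prints the schema plus one line per row group, which is essentially what the schema and meta commands below print.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.parquet.hadoop.ParquetFileReader;
      import org.apache.parquet.hadoop.metadata.BlockMetaData;
      import org.apache.parquet.hadoop.metadata.ParquetMetadata;

      public class InspectParquetFooter {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Path path = new Path(args[0]); // e.g. hdfs://localhost:8020/test1.parquet

              // Read only the footer (file metadata); no column data is loaded.
              ParquetMetadata footer = ParquetFileReader.readFooter(conf, path);

              // The schema, as printed by the "schema" command.
              System.out.println(footer.getFileMetaData().getSchema());

              // One entry per row group, as printed by the "meta" command.
              int i = 0;
              for (BlockMetaData rowGroup : footer.getBlocks()) {
                  System.out.printf("row group %d: rows=%d, total byte size=%d, columns=%d%n",
                          i++, rowGroup.getRowCount(), rowGroup.getTotalByteSize(),
                          rowGroup.getColumns().size());
              }
          }
      }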
  3. Usage
    - Usage (hadoop): hadoop jar parquet-tools-*.jar COMMAND [GENERIC-OPTIONS] [COMMAND-OPTIONS] PARQUET-FILE-PATH

    - Usage (local): java -jar parquet-tools-*.jar COMMAND [GENERIC-OPTIONS] [COMMAND-OPTIONS] PARQUET-FILE-PATH

    Commands:
       cat  Prints out content for a given parquet file.
      head  Prints out the first n records for a given parquet file (default: 5).
    schema  Prints out the schema for a given parquet file.
      meta  Prints out metadata for a given parquet file.
      dump  Prints out row groups and metadata for a given parquet file.
     merge  Merges multiple Parquet files into one Parquet file.

    Generic options:
    --debug     |  Enable debug output.
    -h,--help   |  Show this help string.
    --no-color  |  Disable color output even if supported.

    parquet-tools-*.jar prints its help when invoked without parameters or with the "-h" or "--help" option:
    hadoop jar parquet-tools-*.jar --help

    To print the help of a specific command, use the following syntax:
    hadoop jar parquet-tools-*.jar COMMAND --help
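
    For example (illustrative; the local file path is assumed), a command can also be run without a Hadoop cluster by using the local usage form shown above:
    java -jar parquet-tools-1.9.0.jar schema /tmp/test1.parquet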
  4. Command: cat
    Prints out content for a given parquet file.

    • Usage:
      $ hadoop jar parquet-tools-1.9.0.jar cat --help
      usage: cat [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
      
      where <input> is the parquet file to print to standard output.

      Command options:
      -j,--json  |  Show records in JSON format.

    • Example:
      $ hadoop jar parquet-tools-1.9.0.jar cat hdfs://localhost:8020/test1.parquet
      INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
      INFO hadoop.ParquetFileReader: reading another 1 footers
      INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
      INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records.
      INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
      INFO compress.CodecPool: Got brand-new decompressor [.snappy]
      INFO hadoop.InternalParquetRecordReader: block read in memory in 24 ms. row count = 1
      field1 = 123
      field2 = abc
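
    • Programmatic sketch (not part of parquet-tools; assumes parquet-mr on the classpath, class name illustrative):
      The loop below reads records much like cat prints them, using the example Group read support that ships with parquet-mr; stopping after n records gives the behavior of the head command.

      import org.apache.hadoop.fs.Path;
      import org.apache.parquet.example.data.Group;
      import org.apache.parquet.hadoop.ParquetReader;
      import org.apache.parquet.hadoop.example.GroupReadSupport;

      public class CatParquet {
          public static void main(String[] args) throws Exception {
              Path path = new Path(args[0]); // e.g. hdfs://localhost:8020/test1.parquet
              try (ParquetReader<Group> reader =
                      ParquetReader.builder(new GroupReadSupport(), path).build()) {
                  Group record;
                  while ((record = reader.read()) != null) {
                      // Prints the record's fields, one per line.
                      System.out.println(record);
                  }
              }
          }
      }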
  5. Command: head
    Prints out the first n records for a given parquet file (default: 5).

    • Usage:
      $ hadoop jar parquet-tools-1.9.0.jar head --help
      usage: head [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
      
      where <input> is the parquet file to print to standard output.

      Command options:
      -n,--records <arg>  |  The number of records to show (default: 5).

    • Example:
      $ hadoop jar parquet-tools-1.9.0.jar head hdfs://localhost:8020/test1.parquet
      INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
      INFO hadoop.ParquetFileReader: reading another 1 footers
      INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
      INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records.
      INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
      INFO compress.CodecPool: Got brand-new decompressor [.snappy]
      INFO hadoop.InternalParquetRecordReader: block read in memory in 45 ms. row count = 1
      field1 = 123
      field2 = abc

      $ hadoop jar parquet-tools-1.9.0.jar head -n 10 hdfs://localhost:8020/test1.parquet
  6. Command: schema
    Prints out the schema for a given parquet file.

    • Usage:
      $ hadoop jar parquet-tools-1.9.0.jar schema --help
      usage: schema [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
      
      where <input> is the parquet file containing the schema to show.

      Command options:
      -d,--detailed  |  Show detailed information about the schema.

    • Example:
      $ hadoop jar parquet-tools-1.9.0.jar schema hdfs://localhost:8020/test1.parquet
      message spark_schema {
        optional int64 field1;
        optional binary field2 (UTF8);
      }
  7. Command: meta
    Prints out metadata for a given parquet file.

    • Usage:
      $ hadoop jar parquet-tools-1.9.0.jar meta --help
      usage: meta [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
      
      where <input> is the parquet file to print to standard output.

    • Example:
      $ hadoop jar parquet-tools-1.9.0.jar meta hdfs://localhost:8020/test1.parquet
      INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
      INFO hadoop.ParquetFileReader: reading another 1 footers
      INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
      file:          hdfs://localhost:8020/test1.parquet
      creator:       parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
      extra:         org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"field1","type":"long","nullable":true,"metadata":{}},{"name":"field2","type":"string","nullable":true,"metadata":{}}]}
      
      file schema:   spark_schema
      --------------------------------------------------------------------------------
      field1:       OPTIONAL INT64 R:0 D:1
      field2:       OPTIONAL BINARY O:UTF8 R:0 D:1
      
      row group 1:   RC:2 TS:925 OFFSET:1
      --------------------------------------------------------------------------------
      field1:        INT64 SNAPPY DO:0 FPO:4 SZ:87/87/1.00 VC:5 ENC:BIT_PACKED,RLE,PLAIN
      field2:        BINARY SNAPPY DO:0 FPO:180 SZ:315/319/1.01 VC:5 ENC:BIT_PACKED,RLE,PLAIN
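
    In this output, R and D are a column's maximum repetition and definition levels, O is its original (logical) type, RC and TS are a row group's row count and total byte size, and for each column chunk DO is the dictionary page offset, FPO the first data page offset, SZ the compressed size/uncompressed size/ratio, VC the value count, and ENC the encodings used.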
  8. Command: dump
    Prints out row groups and metadata for a given parquet file.

    • Usage:
      $ hadoop jar parquet-tools-1.9.0.jar dump --help
      usage: dump [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
      
      where <input> is the parquet file to print to standard output.

      Command options:
      -c,--column <arg>  |  Dump only the given column, can be specified more than once.
      -d,--disable-data  |  Do not dump column data.
      -m,--disable-meta  |  Do not dump row group and page metadata.
      -n,--disable-crop  |  Do not crop the output based on console width.

    • Example:
      $ hadoop jar parquet-tools-1.9.0.jar dump hdfs://localhost:8020/test1.parquet
      INFO compress.CodecPool: Got brand-new decompressor [.snappy]
      row group 0
      --------------------------------------------------------------------------------
      field1:        INT64 SNAPPY DO:0 FPO:4 SZ:87/87/1.00 VC:5 ENC:RLE,BIT [more]...
      field2:        BINARY SNAPPY DO:0 FPO:180 SZ:315/319/1.01 VC:5 ENC:RL [more]...
      
          field1 TV=1 RL=0 DL=1
          ----------------------------------------------------------------------------
          page 0:                           DLE:RLE RLE:BIT_PACKED VLE:PLAIN [more]... VC:5
      
          field2 TV=1 RL=0 DL=1
          ----------------------------------------------------------------------------
          page 0:                           DLE:RLE RLE:BIT_PACKED VLE:PLAIN [more]... VC:5
      
      INT64 field1
      --------------------------------------------------------------------------------
      *** row group 1 of 1, values 1 to 1 ***
      value 1: R:0 D:1 V:1173960334
      
      BINARY field2
      --------------------------------------------------------------------------------
      *** row group 1 of 1, values 1 to 1 ***
      value 1: R:0 D:1 V:0-c281a405-1dc6-3a60-90c1-790de339df40
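
    • Example (single column, metadata only; no output shown):
      The column and metadata options above can be combined to dump only the row group and page metadata of one column:
      $ hadoop jar parquet-tools-1.9.0.jar dump -c field1 -d hdfs://localhost:8020/test1.parquet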
  9. Command: merge
    Merges multiple Parquet files into one Parquet file.

    • Usage:
      $ hadoop jar parquet-tools-1.9.0.jar merge --help
      usage: merge [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input> [<input> ...] <output>
      
      where:
      <input> is the source parquet files/directory to be merged.
      <output> is the destination parquet file.

    • Example:
      $ hadoop jar parquet-tools-1.9.0.jar merge hdfs://localhost:8020/test1.parquet hdfs://localhost:8020/test2.parquet hdfs://localhost:8020/test1-test2.parquet
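
    • Note:
      merge appends the row groups of the input files to the output file; it does not combine small row groups into larger ones, so merging many small files still leaves many small row groups.

      The sketch below (not part of parquet-tools; assumes parquet-mr on the classpath, class name and argument handling illustrative) shows the same kind of append-based merge using the parquet-mr API.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.parquet.hadoop.ParquetFileReader;
      import org.apache.parquet.hadoop.ParquetFileWriter;
      import org.apache.parquet.schema.MessageType;

      import java.util.Collections;

      public class MergeParquetFiles {
          // Usage: MergeParquetFiles <input1> <input2> ... <output>
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Path output = new Path(args[args.length - 1]);

              // Take the schema from the first input; all inputs must share the same schema.
              MessageType schema = ParquetFileReader
                      .readFooter(conf, new Path(args[0]))
                      .getFileMetaData()
                      .getSchema();

              ParquetFileWriter writer = new ParquetFileWriter(conf, schema, output);
              writer.start();
              for (int i = 0; i < args.length - 1; i++) {
                  // Copies the row groups of each input as-is (no rewriting).
                  writer.appendFile(conf, new Path(args[i]));
              }
              writer.end(Collections.<String, String>emptyMap());
          }
      }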