Big Data | Parquet Tools
  1. References
  2. Quick notes about Parquet
  3. Usage
  4. Command: cat
  5. Command: head
  6. Command: schema
  7. Command: meta
  8. Command: dump
  9. Command: merge

  1. References
    See these pages for more details about Parquet Tools:
    https://github.com/apache/parquet-mr/tree/master/parquet-tools
    https://mvnrepository.com/artifact/org.apache.parquet/parquet-tools
  2. Quick notes about Parquet
    Parquet is a columnar (column-oriented) storage format for the Hadoop ecosystem.
    Instead of storing records row by row, it stores the values of each column together:

    Columnar layout:
    r{1}-c{1}, r{2}-c{1}, r{3}-c{1}, ...  r{1}-c{2}, r{2}-c{2}, r{3}-c{2}, ...  r{1}-c{3}, r{2}-c{3}, r{3}-c{3}, ...

    Row-based layout:
    r{1}-c{1}, r{1}-c{2}, r{1}-c{3}, ...  r{2}-c{1}, r{2}-c{2}, r{2}-c{3}, ...  r{3}-c{1}, r{3}-c{2}, r{3}-c{3}, ...

    Key concepts:
    - Block Size
    - Row Group
    - Page

    File:
    - Magic Number
    - Row group 0
      - Column 1
        - Page 0
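
    The following is a minimal Java sketch (not part of parquet-tools) showing how this structure can be inspected programmatically with the parquet-mr API; it assumes parquet-hadoop and the Hadoop client libraries are on the classpath, and the class name is illustrative. It reads only the file footer (the metadata stored at the end of the file) and prints the schema plus one line per row group, which is essentially what the schema and meta commands below print.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.parquet.hadoop.ParquetFileReader;
      import org.apache.parquet.hadoop.metadata.BlockMetaData;
      import org.apache.parquet.hadoop.metadata.ParquetMetadata;

      public class InspectParquetFooter {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Path path = new Path(args[0]); // e.g. hdfs://localhost:8020/test1.parquet

              // Read only the footer (file metadata); no column data is loaded.
              ParquetMetadata footer = ParquetFileReader.readFooter(conf, path);

              // The schema, as printed by the "schema" command.
              System.out.println(footer.getFileMetaData().getSchema());

              // One entry per row group, as printed by the "meta" command.
              int i = 0;
              for (BlockMetaData rowGroup : footer.getBlocks()) {
                  System.out.printf("row group %d: rows=%d, total byte size=%d, columns=%d%n",
                          i++, rowGroup.getRowCount(), rowGroup.getTotalByteSize(),
                          rowGroup.getColumns().size());
              }
          }
      }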
  3. Usage
    - Usage (hadoop): hadoop jar parquet-tools-*.jar COMMAND [GENERIC-OPTIONS] [COMMAND-OPTIONS] PARQUET-FILE-PATH

    - Usage (local): java -jar parquet-tools-*.jar COMMAND [GENERIC-OPTIONS] [COMMAND-OPTIONS] PARQUET-FILE-PATH

    Commands:
       cat  Prints out content for a given parquet file.
      head  Prints out the first n records for a given parquet file (default: 5).
    schema  Prints out the schema for a given parquet file.
      meta  Prints out metadata for a given parquet file.
      dump  Prints out row groups and metadata for a given parquet file.
     merge  Merges multiple Parquet files into one Parquet file.

    Generic options:
    --debug     |  Enable debug output.
    -h,--help   |  Show this help string.
    --no-color  |  Disable color output even if supported.

    parquet-tools-*.jar prints its help when invoked without parameters or with the "-h" or "--help" option:
    hadoop jar parquet-tools-*.jar --help

    To print the help of a specific command, use the following syntax:
    hadoop jar parquet-tools-*.jar COMMAND --help
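
    For example (illustrative; the local file path is assumed), a command can also be run without a Hadoop cluster by using the local usage form shown above:
    java -jar parquet-tools-1.9.0.jar schema /tmp/test1.parquet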
  4. Command: cat
    Prints out content for a given parquet file.

    • Usage:
      $ hadoop jar parquet-tools-1.9.0.jar cat --help
      usage: cat [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
      
      where <input> is the parquet file to print to standard output.

      Command options:
      -j,--json  |  Show records in JSON format.

    • Example:
      $ hadoop jar parquet-tools-1.9.0.jar cat hdfs://localhost:8020/test1.parquet
      INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
      INFO hadoop.ParquetFileReader: reading another 1 footers
      INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
      INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records.
      INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
      INFO compress.CodecPool: Got brand-new decompressor [.snappy]
      INFO hadoop.InternalParquetRecordReader: block read in memory in 24 ms. row count = 1
      field1 = 123
      field2 = abc
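
    • Programmatic sketch (not part of parquet-tools; assumes parquet-mr on the classpath, class name illustrative):
      The loop below reads records much like cat prints them, using the example Group read support that ships with parquet-mr; stopping after n records gives the behavior of the head command.

      import org.apache.hadoop.fs.Path;
      import org.apache.parquet.example.data.Group;
      import org.apache.parquet.hadoop.ParquetReader;
      import org.apache.parquet.hadoop.example.GroupReadSupport;

      public class CatParquet {
          public static void main(String[] args) throws Exception {
              Path path = new Path(args[0]); // e.g. hdfs://localhost:8020/test1.parquet
              try (ParquetReader<Group> reader =
                      ParquetReader.builder(new GroupReadSupport(), path).build()) {
                  Group record;
                  while ((record = reader.read()) != null) {
                      // Prints the record's fields, one per line.
                      System.out.println(record);
                  }
              }
          }
      }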
  5. Command: head
    Prints out the first n records for a given parquet file (default: 5).

    • Usage:
      $ hadoop jar parquet-tools-1.9.0.jar head --help
      usage: head [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
      
      where <input> is the parquet file to print to standard output.

      Command options:
      -n,--records <arg>  |  The number of records to show (default: 5).

    • Example:
      $ hadoop jar parquet-tools-1.9.0.jar head hdfs://localhost:8020/test1.parquet
      INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
      INFO hadoop.ParquetFileReader: reading another 1 footers
      INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
      INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records.
      INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
      INFO compress.CodecPool: Got brand-new decompressor [.snappy]
      INFO hadoop.InternalParquetRecordReader: block read in memory in 45 ms. row count = 1
      field1 = 123
      field2 = abc

      $ hadoop jar parquet-tools-1.9.0.jar head -n 10 hdfs://localhost:8020/test1.parquet
  6. Command: schema
    Prints out the schema for a given parquet file.

    • Usage:
      $ hadoop jar parquet-tools-1.9.0.jar schema --help
      usage: schema [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
      
      where <input> is the parquet file containing the schema to show.

      Command options:
      -d,--detailed  |  Show detailed information about the schema.

    • Example:
      $ hadoop jar parquet-tools-1.9.0.jar schema hdfs://localhost:8020/test1.parquet
      message spark_schema {
        optional int64 field1;
        optional binary field2 (UTF8);
      }
  7. Command: meta
    Prints out metadata for a given parquet file.

    • Usage:
      $ hadoop jar parquet-tools-1.9.0.jar meta --help
      usage: meta [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
      
      where <input> is the parquet file to print to standard output.

    • Example:
      $ hadoop jar parquet-tools-1.9.0.jar meta hdfs://localhost:8020/test1.parquet
      INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
      INFO hadoop.ParquetFileReader: reading another 1 footers
      INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
      file:          hdfs://localhost:8020/test1.parquet
      creator:       parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
      extra:         org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"field1","type":"long","nullable":true,"metadata":{}},{"name":"field2","type":"string","nullable":true,"metadata":{}}]}
      
      file schema:   spark_schema
      --------------------------------------------------------------------------------
      field1:       OPTIONAL INT64 R:0 D:1
      field2:       OPTIONAL BINARY O:UTF8 R:0 D:1
      
      row group 1:   RC:2 TS:925 OFFSET:1
      --------------------------------------------------------------------------------
      field1:        INT64 SNAPPY DO:0 FPO:4 SZ:87/87/1.00 VC:5 ENC:BIT_PACKED,RLE,PLAIN
      field2:        BINARY SNAPPY DO:0 FPO:180 SZ:315/319/1.01 VC:5 ENC:BIT_PACKED,RLE,PLAIN
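
    In this output, R and D are a column's maximum repetition and definition levels, O is its original (logical) type, RC and TS are a row group's row count and total byte size, and for each column chunk DO is the dictionary page offset, FPO the first data page offset, SZ the compressed size/uncompressed size/ratio, VC the value count, and ENC the encodings used.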
  8. Command: dump
    Prints out row groups and metadata for a given parquet file.

    • Usage:
      $ hadoop jar parquet-tools-1.9.0.jar dump --help
      usage: dump [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
      
      where <input> is the parquet file to print to standard output.

      Command options:
      -c,--column <arg>  |  Dump only the given column, can be specified more than once.
      -d,--disable-data  |  Do not dump column data.
      -m,--disable-meta  |  Do not dump row group and page metadata.
      -n,--disable-crop  |  Do not crop the output based on console width.

    • Example:
      $ hadoop jar parquet-tools-1.9.0.jar dump hdfs://localhost:8020/test1.parquet
      INFO compress.CodecPool: Got brand-new decompressor [.snappy]
      row group 0
      --------------------------------------------------------------------------------
      field1:        INT64 SNAPPY DO:0 FPO:4 SZ:87/87/1.00 VC:5 ENC:RLE,BIT [more]...
      field2:        BINARY SNAPPY DO:0 FPO:180 SZ:315/319/1.01 VC:5 ENC:RL [more]...
      
          field1 TV=1 RL=0 DL=1
          ----------------------------------------------------------------------------
          page 0:                           DLE:RLE RLE:BIT_PACKED VLE:PLAIN [more]... VC:5
      
          field2 TV=1 RL=0 DL=1
          ----------------------------------------------------------------------------
          page 0:                           DLE:RLE RLE:BIT_PACKED VLE:PLAIN [more]... VC:5
      
      INT64 field1
      --------------------------------------------------------------------------------
      *** row group 1 of 1, values 1 to 1 ***
      value 1: R:0 D:1 V:1173960334
      
      BINARY field2
      --------------------------------------------------------------------------------
      *** row group 1 of 1, values 1 to 1 ***
      value 1: R:0 D:1 V:0-c281a405-1dc6-3a60-90c1-790de339df40
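
    • Example (single column, metadata only; no output shown):
      The column and metadata options above can be combined to dump only the row group and page metadata of one column:
      $ hadoop jar parquet-tools-1.9.0.jar dump -c field1 -d hdfs://localhost:8020/test1.parquet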
  9. Command: merge
    Merges multiple Parquet files into one Parquet file.

    • Usage:
      $ hadoop jar parquet-tools-1.9.0.jar merge --help
      usage: merge [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input> [<input> ...] <output>
      
      where:
      <input> is the source parquet files/directory to be merged.
      <output> is the destination parquet file.

    • Example:
      $ hadoop jar parquet-tools-1.9.0.jar merge hdfs://localhost:8020/test1.parquet hdfs://localhost:8020/test2.parquet hdfs://localhost:8020/test1-test2.parquet
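
    • Note:
      merge appends the row groups of the input files to the output file; it does not combine small row groups into larger ones, so merging many small files still leaves many small row groups.

      The sketch below (not part of parquet-tools; assumes parquet-mr on the classpath, class name and argument handling illustrative) shows the same kind of append-based merge using the parquet-mr API.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.parquet.hadoop.ParquetFileReader;
      import org.apache.parquet.hadoop.ParquetFileWriter;
      import org.apache.parquet.schema.MessageType;

      import java.util.Collections;

      public class MergeParquetFiles {
          // Usage: MergeParquetFiles <input1> <input2> ... <output>
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Path output = new Path(args[args.length - 1]);

              // Take the schema from the first input; all inputs must share the same schema.
              MessageType schema = ParquetFileReader
                      .readFooter(conf, new Path(args[0]))
                      .getFileMetaData()
                      .getSchema();

              ParquetFileWriter writer = new ParquetFileWriter(conf, schema, output);
              writer.start();
              for (int i = 0; i < args.length - 1; i++) {
                  // Copies the row groups of each input as-is (no rewriting).
                  writer.appendFile(conf, new Path(args[i]));
              }
              writer.end(Collections.<String, String>emptyMap());
          }
      }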