| Layout | Stored first | Stored second | Stored third |
|---|---|---|---|
| Columnar | r{1}-c{1}, r{2}-c{1}, r{3}-c{1}, ... | r{1}-c{2}, r{2}-c{2}, r{3}-c{2}, ... | r{1}-c{3}, r{2}-c{3}, r{3}-c{3}, ... |
| Row-based | r{1}-c{1}, r{1}-c{2}, r{1}-c{3}, ... | r{2}-c{1}, r{2}-c{2}, r{2}-c{3}, ... | r{3}-c{1}, r{3}-c{2}, r{3}-c{3}, ... |
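To make the difference concrete, here is a minimal sketch (an assumption: Python with pyarrow installed; the file and column names are chosen to match the examples below) that writes a small Parquet file. Parquet stores all values of each column together on disk, as in the Columnar row of the table above:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a two-column table; Parquet will store all field1 values together,
# then all field2 values, rather than interleaving them row by row.
table = pa.table({"field1": [123], "field2": ["abc"]})
pq.write_table(table, "test1.parquet", compression="snappy")
```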
```
hadoop jar parquet-tools-*.jar COMMAND [GENERIC-OPTIONS] [COMMAND-OPTIONS] PARQUET-FILE-PATH
java -jar parquet-tools-*.jar COMMAND [GENERIC-OPTIONS] [COMMAND-OPTIONS] PARQUET-FILE-PATH
```
| Command | Description |
|---|---|
| cat | Prints out content for a given Parquet file. |
| head | Prints out the first n records for a given Parquet file (default: 5). |
| schema | Prints out the schema for a given Parquet file. |
| meta | Prints out metadata for a given Parquet file. |
| dump | Prints out row groups and metadata for a given Parquet file. |
| merge | Merges multiple Parquet files into one Parquet file. |
| Option | Description |
|---|---|
| --debug | Enable debug output. |
| -h,--help | Show this help string. |
| --no-color | Disable color output even if supported. |
parquet-tools-*.jar prints help when invoked without parameters or with the "-h" or "--help" parameter:

```
hadoop jar parquet-tools-*.jar --help
```

Help for an individual command is available the same way:

```
hadoop jar parquet-tools-*.jar COMMAND --help
```
```
$ hadoop jar parquet-tools-1.9.0.jar cat --help
usage: cat [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
where <input> is the parquet file to print to standard output.
```

| Option | Description |
|---|---|
| -j,--json | Show records in JSON format. |
```
$ hadoop jar parquet-tools-1.9.0.jar cat hdfs://localhost:8020/test1.parquet
INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
INFO hadoop.ParquetFileReader: reading another 1 footers
INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records.
INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
INFO compress.CodecPool: Got brand-new decompressor [.snappy]
INFO hadoop.InternalParquetRecordReader: block read in memory in 24 ms. row count = 1
field1 = 123
field2 = abc
```
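As a cross-check, a minimal pyarrow sketch (an assumption: pyarrow is installed and the file has been copied locally, since HDFS access needs extra setup) that prints the same records; the -j,--json option above does the same in JSON form:

```python
import pyarrow.parquet as pq

# Read the whole file and print each record as a Python dict,
# roughly what `cat` (or `cat -j` as JSON) writes to standard output.
for record in pq.read_table("test1.parquet").to_pylist():
    print(record)
```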
```
$ hadoop jar parquet-tools-1.9.0.jar head --help
usage: head [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
where <input> is the parquet file to print to standard output.
```

| Option | Description |
|---|---|
| -n,--records <arg> | The number of records to show (default: 5). |
```
$ hadoop jar parquet-tools-1.9.0.jar head hdfs://localhost:8020/test1.parquet
INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
INFO hadoop.ParquetFileReader: reading another 1 footers
INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records.
INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
INFO compress.CodecPool: Got brand-new decompressor [.snappy]
INFO hadoop.InternalParquetRecordReader: block read in memory in 45 ms. row count = 1
field1 = 123
field2 = abc
```

Show the first 10 records:

```
$ hadoop jar parquet-tools-1.9.0.jar head -n 10 hdfs://localhost:8020/test1.parquet
```
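A hedged pyarrow equivalent of head (same local-file assumption as above) that reads only the first batch instead of the whole file:

```python
import pyarrow.parquet as pq

# Stream the first 10 records without loading the entire file,
# analogous to `head -n 10`.
parquet_file = pq.ParquetFile("test1.parquet")
first_batch = next(parquet_file.iter_batches(batch_size=10))
for record in first_batch.to_pylist():
    print(record)
```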
```
$ hadoop jar parquet-tools-1.9.0.jar schema --help
usage: schema [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
where <input> is the parquet file containing the schema to show.
```

| Option | Description |
|---|---|
| -d,--detailed | Show detailed information about the schema. |
```
$ hadoop jar parquet-tools-1.9.0.jar schema hdfs://localhost:8020/test1.parquet
message spark_schema {
  optional int64 field1;
  optional binary field2 (UTF8);
}
```
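For comparison, pyarrow can print both the Arrow-level and the Parquet-level view of the schema (local-file assumption as before):

```python
import pyarrow.parquet as pq

# Arrow view of the schema: column names, types, nullability.
print(pq.read_schema("test1.parquet"))

# Parquet view, matching the `message spark_schema { ... }` output above.
print(pq.ParquetFile("test1.parquet").schema)
```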
```
$ hadoop jar parquet-tools-1.9.0.jar meta --help
usage: meta [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
where <input> is the parquet file to print to standard output.
```
```
$ hadoop jar parquet-tools-1.9.0.jar meta hdfs://localhost:8020/test1.parquet
INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
INFO hadoop.ParquetFileReader: reading another 1 footers
INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file:        hdfs://localhost:8020/test1.parquet
creator:     parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"field1","type":"long","nullable":true,"metadata":{}},{"name":"field2","type":"string","nullable":true,"metadata":{}}]}

file schema: spark_schema
--------------------------------------------------------------------------------
field1:      OPTIONAL INT64 R:0 D:1
field2:      OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:2 TS:925 OFFSET:1
--------------------------------------------------------------------------------
field1:      INT64 SNAPPY DO:0 FPO:4 SZ:87/87/1.00 VC:5 ENC:BIT_PACKED,RLE,PLAIN
field2:      BINARY SNAPPY DO:0 FPO:180 SZ:315/319/1.01 VC:5 ENC:BIT_PACKED,RLE,PLAIN
```
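The same footer metadata is reachable programmatically; a sketch with pyarrow (local-file assumption as before) that walks the row groups and column chunks shown above:

```python
import pyarrow.parquet as pq

# File-level metadata: creator, number of rows and row groups, extra key/values.
meta = pq.read_metadata("test1.parquet")
print(meta)

# Per-row-group, per-column-chunk detail: compression, sizes, encodings.
for rg_index in range(meta.num_row_groups):
    row_group = meta.row_group(rg_index)
    for col_index in range(row_group.num_columns):
        column = row_group.column(col_index)
        print(column.path_in_schema, column.compression,
              column.total_compressed_size, column.encodings)
```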
```
$ hadoop jar parquet-tools-1.9.0.jar dump --help
usage: dump [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input>
where <input> is the parquet file to print to standard output.
```

| Option | Description |
|---|---|
| -c,--column <arg> | Dump only the given column, can be specified more than once. |
| -d,--disable-data | Do not dump column data. |
| -m,--disable-meta | Do not dump row group and page metadata. |
| -n,--disable-crop | Do not crop the output based on console width. |
```
$ hadoop jar parquet-tools-1.9.0.jar dump hdfs://localhost:8020/test1.parquet
INFO compress.CodecPool: Got brand-new decompressor [.snappy]
row group 0
--------------------------------------------------------------------------------
field1:  INT64 SNAPPY DO:0 FPO:4 SZ:87/87/1.00 VC:5 ENC:RLE,BIT [more]...
field2:  BINARY SNAPPY DO:0 FPO:180 SZ:315/319/1.01 VC:5 ENC:RL [more]...

    field1 TV=1 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN [more]... VC:5

    field2 TV=1 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN [more]... VC:5

INT64 field1
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 ***
value 1: R:0 D:1 V:1173960334

BINARY field2
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 ***
value 1: R:0 D:1 V:0-c281a405-1dc6-3a60-90c1-790de339df40
```
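Page-level output like the above is specific to the dump command; pyarrow exposes column-chunk statistics, which cover part of the same information (a sketch under the same local-file assumption):

```python
import pyarrow.parquet as pq

# Min/max/null-count statistics for the first column chunk of row group 0;
# the per-page lines above (DLE/RLE/VLE) are not exposed at this level.
column = pq.read_metadata("test1.parquet").row_group(0).column(0)
print(column.statistics)
```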
```
$ hadoop jar parquet-tools-1.9.0.jar merge --help
usage: merge [GENERIC-OPTIONS] [COMMAND-OPTIONS] <input> [<input> ...] <output>
where:
    <input> is the source parquet files/directory to be merged.
    <output> is the destination parquet file.
```
```
$ hadoop jar parquet-tools-1.9.0.jar merge hdfs://localhost:8020/test1.parquet hdfs://localhost:8020/test2.parquet hdfs://localhost:8020/test1-test2.parquet
```
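If parquet-tools is not at hand, a rewrite-based merge can be sketched with pyarrow (hypothetical local file names; note this rewrites the data into new row groups, whereas parquet-tools merge concatenates the input row groups as they are):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Read both inputs (schemas must match) and write a single output file.
tables = [pq.read_table(path) for path in ["test1.parquet", "test2.parquet"]]
pq.write_table(pa.concat_tables(tables), "test1-test2.parquet")
```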