Accessing Parquet Files From Spark SQL Applications

Spark SQL supports loading and saving DataFrames from and to a variety of data sources and has native support for Parquet. For information about Parquet, see Using Apache Parquet Data Files with CDH.

To read Parquet files in Spark SQL, use the SQLContext.read.parquet("path") method.
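
For example, the following minimal Scala sketch (run in spark-shell, where an SQLContext named sqlContext is already available; the path /tmp/people.parquet is a hypothetical example) loads a Parquet file into a DataFrame:
// Load a Parquet file (or a directory of Parquet files) into a DataFrame
val df = sqlContext.read.parquet("/tmp/people.parquet")
df.show()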

To write Parquet files in Spark SQL, use the DataFrame.write.parquet("path") method.
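
For example, the following sketch (continuing from the read example above; the output path is a hypothetical example) writes the DataFrame back out as Parquet:
// Write the DataFrame as Parquet files under the given directory
df.write.parquet("/tmp/people_copy.parquet")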

To set the compression type, configure the spark.sql.parquet.compression.codec property:
sqlContext.setConf("spark.sql.parquet.compression.codec", "codec")
The supported codec values are: uncompressed, gzip, lzo, and snappy. The default is gzip.
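
For example, the following sketch (reusing the hypothetical DataFrame df from above) switches the codec to snappy before writing:
// Use Snappy compression for subsequent Parquet writes
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df.write.parquet("/tmp/people_snappy.parquet")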

For an example of writing Parquet files to Amazon S3, see Examples of Accessing S3 Data from Spark.