For some reason, I have to convert SAS data to HDFS and then analyse it with PySpark. After some research, I found that spark-sas7bdat is the best solution for me.
This package allows reading SAS files in local or distributed filesystem as Spark DataFrames.
Schema is automatically inferred from meta information embedded in the SAS file.
Thanks to the splittable SasInputFormat, we are able to convert a 200GB (1.5Bn rows) .sas7bdat file to .csv files using 2000 executors in under 2 minutes.
Here is an example that converts SAS data to HDFS (parquet):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11") \
    .enableHiveSupport() \
    .getOrCreate()

# Read the sas7bdat file; the schema is inferred from the SAS metadata.
df = spark.read.format("com.github.saurfang.sas.spark") \
    .load("file:///Users/steven/job/spark/sas_data.sas7bdat")

# Write to parquet.
df.write.parquet("file:///Users/steven/job/spark/sas_data")

# Read the parquet back as a DataFrame.
df = spark.read.parquet("file:///Users/steven/job/spark/sas_data")
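For the .csv conversion mentioned above, the same DataFrame can be written out with Spark's CSV writer; a minimal sketch (the output path is made up for illustration):

# Sketch: write the same DataFrame as CSV instead of parquet.
# The output path below is a hypothetical example.
df.write \
    .option("header", "true") \
    .mode("overwrite") \
    .csv("file:///Users/steven/job/spark/sas_data_csv")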
Another way is through Spark SQL:
CREATE TEMPORARY TABLE cars
USING com.github.saurfang.sas.spark
OPTIONS (path "sas_data.sas7bdat")
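Once the temporary table is registered, it can be queried like any other table; a minimal sketch, assuming the session created above:

# Sketch: query the temporary table backed by the sas7bdat file.
cars = spark.sql("SELECT * FROM cars")
cars.printSchema()   # schema inferred from the SAS metadata
print(cars.count())  # row count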
By the way, I also tried:
- using SAS itself to convert the SAS data to csv
- using readstat to convert the SAS data to csv
- using pandas.read_sas to convert the SAS data to a pandas DataFrame, then to a Spark DataFrame (a sketch follows this list)
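For completeness, the pandas route looks roughly like this; a minimal sketch, assuming the file fits in driver memory (the output path is illustrative):

import pandas as pd

# Sketch: read the SAS file into pandas on the driver.
# Everything is loaded into memory, so this only works for small files.
pdf = pd.read_sas("/Users/steven/job/spark/sas_data.sas7bdat")

# Convert to a Spark DataFrame and write to parquet.
sdf = spark.createDataFrame(pdf)
sdf.write.parquet("/Users/steven/job/spark/sas_data_from_pandas")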
I tried running this but got a NoClassDefFoundError for com.epam.parso.impl.SasFileReaderImpl. The docs for spark-sas7bdat specify that the parso library is a dependency, and I tried including that in the `spark.jars.packages` config option when creating the SparkSession, but that did not resolve the issue. How did you include this dependency?