
How to read SAS data with PySpark

For some reason, I have to convert SAS data to HDFS and then analyze it with PySpark. After some research, I found that spark-sas7bdat is the best solution for me.

This package allows reading SAS files in local or distributed filesystem as Spark DataFrames.

Schema is automatically inferred from meta information embedded in the SAS file.

Thanks to the splittable SasInputFormat, we are able to convert a 200GB (1.5Bn rows) .sas7bdat file to .csv files using 2000 executors in under 2 minutes.

Here is an example that converts SAS data to Parquet on HDFS:

from pyspark.sql import SparkSession

# Pull in the spark-sas7bdat package at session start
spark = SparkSession.builder \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11") \
    .enableHiveSupport() \
    .getOrCreate()

# Read the SAS file; the schema is inferred from its embedded metadata
df = spark.read.format("com.github.saurfang.sas.spark") \
    .load("file:///Users/steven/job/spark/sas_data.sas7bdat")

# Write to Parquet, then read it back
df.write.parquet("file:///Users/steven/job/spark/sas_data")
df = spark.read.parquet("/Users/steven/job/spark/sas_data")
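
After the load, it is worth checking what schema was inferred from the SAS metadata before committing to Parquet. A small sanity check, continuing from the df above:

# Inspect the schema inferred from the SAS file's metadata
df.printSchema()

# Quick sanity check on the row count
print(df.count())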

Another way is through Spark SQL:

CREATE TEMPORARY TABLE cars
USING com.github.saurfang.sas.spark
OPTIONS (path "sas_data.sas7bdat")
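
The same DDL can be issued from PySpark, and the result queried like any other table. A minimal sketch, assuming the SparkSession from the earlier example (note that newer Spark versions prefer CREATE TEMPORARY VIEW over CREATE TEMPORARY TABLE):

# Register the SAS file as a temporary table, then query it with SQL
spark.sql("""
    CREATE TEMPORARY TABLE cars
    USING com.github.saurfang.sas.spark
    OPTIONS (path "sas_data.sas7bdat")
""")
spark.sql("SELECT * FROM cars LIMIT 5").show()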

By the way, I also tried:

  1. using SAS itself to convert the data to CSV
  2. using ReadStat to convert SAS data to CSV
  3. using pandas.read_sas to convert SAS data to a pandas DataFrame, then to a Spark DataFrame (a sketch follows this list)
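
For option 3, the route is pandas on the driver followed by a conversion. A minimal sketch, assuming the file fits in driver memory (path reused from the example above):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the SAS file into pandas on the driver (single machine, in memory)
pdf = pd.read_sas("/Users/steven/job/spark/sas_data.sas7bdat")

# Hand it to Spark; this does not scale to files that exceed driver memory
df = spark.createDataFrame(pdf)

Everything passes through the driver here, which is why this approach only suits small files and why spark-sas7bdat is preferable for large data.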

2 thoughts on “How to read SAS data with PySpark”

  1. I tried running this but got a NoClassDefFoundError for com.epam.parso.impl.SasFileReaderImpl. The docs for spark-sas7bdat specify that the parso library is a dependency, and I tried including that in the `spark.jars.packages` config option when creating the SparkSession, but that did not resolve the issue. How did you include this dependency?
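
One way to resolve that error would be to list the parso artifact alongside spark-sas7bdat in spark.jars.packages. A hedged sketch, assuming the com.epam:parso coordinate on Maven Central (the 2.0.8 version here is an assumption):

from pyspark.sql import SparkSession

# Comma-separated Maven coordinates; the parso version here is an assumption
spark = SparkSession.builder \
    .config("spark.jars.packages",
            "saurfang:spark-sas7bdat:2.0.0-s_2.11,com.epam:parso:2.0.8") \
    .getOrCreate()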
