How to read SAS data with PySpark

I needed to convert SAS data to HDFS and then analyse it with PySpark. After some research, I found that spark-sas7bdat was the best solution for me.

This package allows reading SAS files in a local or distributed filesystem as Spark DataFrames.

Schema is automatically inferred from meta information embedded in the SAS file.

Thanks to the splittable SasInputFormat, we are able to convert a 200GB (1.5Bn rows) .sas7bdat file to .csv files using 2000 executors in under 2 minutes.

Here is an example that converts SAS data to Parquet on HDFS:
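A minimal sketch of that conversion, assuming the package is pulled in through `spark.jars.packages`. The package coordinates, the function names and the HDFS paths are assumptions for illustration; pick the build that matches your Spark and Scala versions. The data source name `com.github.saurfang.sas.spark` is the one the package registers.

```python
# Assumption: coordinates of the spark-sas7bdat build; adjust to your Spark/Scala version.
SAS_PACKAGE = "saurfang:spark-sas7bdat:3.0.0-s_2.12"
SAS_FORMAT = "com.github.saurfang.sas.spark"


def build_session(app_name="sas-to-parquet"):
    """Create a SparkSession that downloads the SAS reader package on startup."""
    from pyspark.sql import SparkSession  # imported lazily so the helper below stays importable

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages", SAS_PACKAGE)
        .getOrCreate()
    )


def sas_to_parquet(spark, sas_path, parquet_path):
    """Read a .sas7bdat file as a DataFrame and write it out as Parquet."""
    # The schema is inferred from the metadata embedded in the SAS file.
    df = spark.read.format(SAS_FORMAT).load(sas_path)
    df.write.mode("overwrite").parquet(parquet_path)
    return df


def main():
    spark = build_session()
    # Hypothetical paths; replace with your own locations.
    sas_to_parquet(
        spark,
        "hdfs:///data/raw/input.sas7bdat",
        "hdfs:///data/parquet/input",
    )
    spark.stop()
```

Calling `main()` runs the whole conversion; `mode("overwrite")` replaces any existing output directory, so drop it if that is not what you want.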

Another way is through Spark SQL:
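A sketch of the Spark SQL route, assuming `spark` is a session that already has the spark-sas7bdat package loaded. It registers the SAS file as a temporary view with `CREATE TEMPORARY VIEW ... USING`, then queries it like any other table. The function name, the view name and the paths are hypothetical.

```python
SAS_FORMAT = "com.github.saurfang.sas.spark"


def sas_to_parquet_via_sql(spark, sas_path, parquet_path, view_name="sas_source"):
    """Expose a .sas7bdat file as a temporary view, then save it as Parquet."""
    # Register the file as a view backed by the SAS data source.
    spark.sql(
        f"CREATE TEMPORARY VIEW {view_name} "
        f"USING {SAS_FORMAT} "
        f'OPTIONS (path="{sas_path}")'
    )
    # Any SQL can go here; SELECT * simply copies the whole table.
    result = spark.sql(f"SELECT * FROM {view_name}")
    result.write.mode("overwrite").parquet(parquet_path)
    return result
```

The view only lives for the current session, so filtering or column selection can be done in the `SELECT` before anything is written to HDFS.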

By the way, I also tried:

  1. using SAS to convert the SAS data to CSV
  2. using ReadStat to convert the SAS data to CSV
  3. using pandas.read_sas to convert the SAS data to a pandas DataFrame, then to a Spark DataFrame
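For comparison, the third option can be sketched roughly as follows. This is an illustration, not the exact code I ran: the function name is hypothetical and the `utf-8` encoding is an assumption about the file. It loads the whole file into memory on the driver, so it only suits files that fit there.

```python
def sas_to_spark_via_pandas(spark, sas_path, encoding="utf-8"):
    """Load a .sas7bdat file with pandas, then hand it to Spark."""
    import pandas as pd  # imported lazily; only needed for this route

    # Without an encoding, pandas returns string columns as bytes.
    pdf = pd.read_sas(sas_path, format="sas7bdat", encoding=encoding)
    # The entire pandas DataFrame is shipped from the driver to the cluster.
    return spark.createDataFrame(pdf)
```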

