For some reason, I have to convert SAS data to HDFS and then analyse it with PySpark. After some research, I found that spark-sas7bdat is the best solution for me.
This package allows reading SAS files in local or distributed filesystem as Spark DataFrames.
Schema is automatically inferred from meta information embedded in the SAS file.
Thanks to the splittable SasInputFormat, we are able to convert a 200GB (1.5Bn rows) .sas7bdat file to .csv files using 2000 executors in under 2 minutes.
Here is an example that converts SAS data to HDFS (parquet):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11") \
    .enableHiveSupport() \
    .getOrCreate()

# Read the sas7bdat file; the schema is inferred from the SAS metadata.
df = spark.read.format("com.github.saurfang.sas.spark") \
    .load("file:///Users/steven/job/spark/sas_data.sas7bdat")

# Write to parquet.
df.write.parquet("file:///Users/steven/job/spark/sas_data")

# Read the parquet back as a DataFrame.
df = spark.read.parquet("file:///Users/steven/job/spark/sas_data")
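For the .csv conversion mentioned above, the same DataFrame can be written out with Spark's CSV writer; a minimal sketch (the output path is made up for illustration):

# Sketch: write the same DataFrame as CSV instead of parquet.
# The output path below is a hypothetical example.
df.write \
    .option("header", "true") \
    .mode("overwrite") \
    .csv("file:///Users/steven/job/spark/sas_data_csv")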
Another way is through Spark SQL:
CREATE TEMPORARY TABLE cars
USING com.github.saurfang.sas.spark
OPTIONS (path "sas_data.sas7bdat")
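Once the temporary table is registered, it can be queried like any other table; a minimal sketch, assuming the session created above:

# Sketch: query the temporary table backed by the sas7bdat file.
cars = spark.sql("SELECT * FROM cars")
cars.printSchema()   # schema inferred from the SAS metadata
print(cars.count())  # row count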
By the way, I also tried:
- using SAS itself to convert the SAS data to csv
- using readstat to convert the SAS data to csv
- using pandas.read_sas to convert the SAS data to a pandas DataFrame, then to a Spark DataFrame (a sketch follows this list)
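For completeness, the pandas route looks roughly like this; a minimal sketch, assuming the file fits in driver memory (the output path is illustrative):

import pandas as pd

# Sketch: read the SAS file into pandas on the driver.
# Everything is loaded into memory, so this only works for small files.
pdf = pd.read_sas("/Users/steven/job/spark/sas_data.sas7bdat")

# Convert to a Spark DataFrame and write to parquet.
sdf = spark.createDataFrame(pdf)
sdf.write.parquet("/Users/steven/job/spark/sas_data_from_pandas")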
I tried running this but got a NoClassDefFoundError for com.epam.parso.impl.SasFileReaderImpl. The docs for spark-sas7bdat specify that the parso library is a dependency, and I tried including that in the `spark.jars.packages` config option when creating the SparkSession, but that did not resolve the issue. How did you include this dependency?