One of my customers asked how to package a PySpark application into a single file with PyInstaller. After some research I found an answer, and I am sharing it here.
PyInstaller freezes (packages) Python applications into stand-alone executables, under Windows, GNU/Linux, Mac OS X, FreeBSD, Solaris and AIX.
- OS: macOS Mojave 10.14.5
- Python: Anaconda 2019.03 for macOS
- Spark: spark-2.4.3-bin-hadoop2.7
- PostgreSQL: 11.2
- PostgreSQL JDBC: 42.2.5
- UPX: installed with brew install upx
The code is taken from PySpark Read/Write PostgreSQL:
    from __future__ import print_function

    import findspark

    # Point findspark at the Spark directory before importing pyspark.
    findspark.init(spark_home="spark")

    from pyspark.sql import SparkSession
    from pyspark import SparkConf, SparkContext

    spark = SparkSession.builder \
        .master("local[*]") \
        .appName('jdbc PG') \
        .getOrCreate()

    # Load the CSV and normalize column names (dots to underscores, lower case).
    df = spark.read.csv(path='/Users/steven/Desktop/hengshu/data/iris.csv',
                        header=True, inferSchema=True)
    df = df.toDF(*(c.replace('.', '_').lower() for c in df.columns))

    db_host = ""
    db_port = 5432
    table_name = "iris"
    db_name = "steven"
    db_url = "jdbc:postgresql://{}:{}/{}".format(db_host, db_port, db_name)

    options = {
        "url": db_url,
        "dbtable": table_name,
        "user": "steven",
        "password": "password",
        "driver": "org.postgresql.Driver",
        "numPartitions": 10,
    }
    options['dbtable'] = "iris"

    # Write the DataFrame to PostgreSQL, replacing any existing table.
    df.write.format('jdbc').options(**options).mode("overwrite").save()

    # Read the table back to verify the round trip.
    df1 = spark.read.format('jdbc').options(**options).load()
    df1.count()
    df1.printSchema()

    spark.stop()
After building with the following command, we get a 297 MB executable named pyspark_pg (PyInstaller writes it to the dist directory by default).
    pyinstaller pyspark_pg.py \
        --onefile \
        --hidden-import=py4j.java_collections \
        --add-data /Users/steven/spark/spark-2.4.3-bin-hadoop2.7:pyspark \
        --add-data /Users/steven/spark/jars/postgresql-42.2.5.jar:pyspark/jars/
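Each --add-data SRC:DEST option copies SRC into the bundle under DEST. In a one-file build, PyInstaller unpacks the bundle into a temporary directory at startup and records that directory in sys._MEIPASS. The sketch below shows one way a script could turn a DEST value into an absolute path at run time. The helper names bundle_root and resource_path are hypothetical and are not part of the original script; the fallback to the current directory is an assumption for running the script unfrozen.

```python
import os
import sys


def bundle_root():
    """Return the directory that holds files added with --add-data.

    Inside a PyInstaller one-file app, sys.frozen is set and sys._MEIPASS
    names the temporary directory the bundle was unpacked into. Outside a
    frozen app, fall back to the current working directory (an assumption
    made for this sketch).
    """
    if getattr(sys, "frozen", False) and hasattr(sys, "_MEIPASS"):
        return sys._MEIPASS
    return os.path.abspath(".")


def resource_path(relative):
    """Build an absolute path for a DEST value given to --add-data."""
    return os.path.join(bundle_root(), relative)
```

With the command above, resource_path("pyspark") would point at the unpacked Spark distribution and resource_path(os.path.join("pyspark", "jars")) at the directory holding the PostgreSQL driver. Whether to pass such a path to findspark.init depends on how the script is laid out, so treat this as a starting point.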

Let’s run it:

    ./pyspark_pg

By the way, this app bundles Python 3.7 and spark-2.4.3-bin-hadoop2.7 but not a JDK, so Java still has to be installed on the machine that runs it.
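Because the JDK is not bundled, it can help to check for Java before starting Spark, so that a missing runtime produces a clear message. The following is a minimal sketch using only the standard library. The function name find_java is hypothetical, and looking at JAVA_HOME first and then PATH is an assumption about how the target machine is set up.

```python
import os
import shutil


def find_java(environ=None):
    """Return the path to a java executable, or None if none is found.

    JAVA_HOME is checked first, then the directories listed in PATH.
    """
    environ = os.environ if environ is None else environ
    java_home = environ.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # shutil.which searches the given PATH string for an executable.
    return shutil.which("java", path=environ.get("PATH"))
```

A script could call find_java() before SparkSession.builder and exit with an explanatory message when it returns None.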
