In SQL, the UNION clause combines the results of two SQL queries into a single table of all matching rows. The two queries must return the same number of columns and compatible data types in order to unite. Any duplicate records are automatically removed unless UNION ALL is used.
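As a quick sketch of those rules (using Spark SQL here, with two hypothetical one-column views t1 and t2 that share a row):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two hypothetical one-column tables that both contain the row 2
spark.sql("SELECT 1 AS n UNION ALL SELECT 2").createOrReplaceTempView("t1")
spark.sql("SELECT 2 AS n UNION ALL SELECT 3").createOrReplaceTempView("t2")

# UNION removes the duplicate row: 1, 2, 3
spark.sql("SELECT n FROM t1 UNION SELECT n FROM t2").show()

# UNION ALL keeps it: 1, 2, 2, 3
spark.sql("SELECT n FROM t1 UNION ALL SELECT n FROM t2").show()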
UNION can be useful in data warehouse applications where tables aren't perfectly normalized.[2] A simple example would be a database having tables sales2005 and sales2006 that have identical structures but are separated because of performance considerations. A UNION query could combine results from both tables.
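A sketch of such a query, using small hypothetical stand-ins for the sales2005 and sales2006 tables:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins: two views with identical structure, split by year
spark.createDataFrame([(1, 100.0)], ["order_id", "amount"]).createOrReplaceTempView("sales2005")
spark.createDataFrame([(2, 150.0)], ["order_id", "amount"]).createOrReplaceTempView("sales2006")

# One query returns the rows of both years as a single table
spark.sql("""
    SELECT * FROM sales2005
    UNION                -- UNION ALL would keep rows duplicated across both tables
    SELECT * FROM sales2006
""").show()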
Note that UNION ALL does not guarantee the order of rows: rows from the second operand may appear before, after, or mixed in with rows from the first. In situations where a specific order is desired, ORDER BY must be used.
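For instance, in Spark SQL the ORDER BY applies to the whole united result, not to either operand alone (a minimal sketch with hypothetical views t1 and t2):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two single-column views whose values are not sorted
spark.sql("SELECT 3 AS n UNION ALL SELECT 1").createOrReplaceTempView("t1")
spark.sql("SELECT 2 AS n UNION ALL SELECT 4").createOrReplaceTempView("t2")

# ORDER BY sorts the combined result: 1, 2, 3, 4 regardless of how rows interleave
spark.sql("""
    SELECT n FROM t1
    UNION ALL
    SELECT n FROM t2
    ORDER BY n
""").show()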
Note that UNION ALL may be much faster than plain UNION, since it does not have to eliminate duplicates.
But in PySpark the semantics differ. The documentation for DataFrame.union says:
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name).
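Positional resolution can silently mismatch columns when two DataFrames list the same columns in a different order; unionByName (available since Spark 2.3) resolves by name instead. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

a = spark.createDataFrame([(1, 10)], ["id", "score"])
b = spark.createDataFrame([(20, 2)], ["score", "id"])   # same columns, different order

# union() pairs columns by position: b's score lands under a's id column
a.union(b).show()         # rows (1, 10) and (20, 2) -- silently mismatched

# unionByName() pairs columns by name instead
a.unionByName(b).show()   # rows (1, 10) and (2, 20) -- the intended result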
Here is some test code that demonstrates the difference:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
iris_tbl = spark.read.csv("iris.csv", header=True)

# DataFrame API: union() and unionAll() both keep duplicates
df = iris_tbl.union(iris_tbl)
print("union:", df.count())                      # union: 300

df1 = iris_tbl.unionAll(iris_tbl)
print("union all:", df1.count())                 # union all: 300

print("union distinct", df.distinct().count())   # union distinct 149

# SQL: UNION deduplicates, UNION ALL keeps duplicates
iris_tbl.registerTempTable("iris")               # createOrReplaceTempView in newer Spark
df_union = spark.sql("select * from iris union select * from iris")
print("union:", df_union.count())                # union: 149

df_union_all = spark.sql("select * from iris union all select * from iris")
print("union all:", df_union_all.count())        # union all: 300
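The DataFrame counts match because unionAll is just a deprecated alias for union (since Spark 2.0): both behave like SQL's UNION ALL, and only SQL's UNION, or an explicit .distinct(), removes duplicates.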