
Pay attention to the union function of PySpark

In SQL the UNION clause combines the results of two SQL queries into a single table of all matching rows. The two queries must result in the same number of columns and compatible data types in order to unite. Any duplicate records are automatically removed unless UNION ALL is used.

UNION can be useful in data warehouse applications where tables aren’t perfectly normalized. A simple example would be a database having tables sales2005 and sales2006 that have identical structures but are separated because of performance considerations. A UNION query could combine results from both tables.
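For a concrete picture, here is a minimal sketch of that data-warehouse case (the sales2005 / sales2006 table names come from the paragraph above; the columns and rows are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two hypothetical yearly tables with identical structure.
spark.createDataFrame([("alice", 100), ("bob", 200)], ["customer", "amount"]) \
    .createOrReplaceTempView("sales2005")
spark.createDataFrame([("bob", 200), ("carol", 300)], ["customer", "amount"]) \
    .createOrReplaceTempView("sales2006")

# UNION removes the duplicate ("bob", 200) row; UNION ALL would keep it.
spark.sql("select * from sales2005 union select * from sales2006").show()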

Note that UNION ALL does not guarantee the order of rows. Rows from the second operand may appear before, after, or mixed with rows from the first operand. In situations where a specific order is desired, ORDER BY must be used.
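Continuing the same sketch, adding ORDER BY to the combined query pins the row order:

# Without the ORDER BY, the row order of the combined result is unspecified.
spark.sql(
    "select * from sales2005 union all select * from sales2006 order by customer"
).show()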

Note that UNION ALL may be much faster than plain UNION.

But in PySpark, the documentation of DataFrame.union says:

This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.

Also as standard in SQL, this function resolves columns by position (not by name).
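Positional resolution matters when two DataFrames have the same columns in a different order. If you want name-based matching instead, Spark 2.3+ also offers unionByName; a small sketch with made-up data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1, "a")], ["id", "label"])
right = spark.createDataFrame([("b", 2)], ["label", "id"])

# union() pairs columns by position, so right's "b" would land in the id column;
# unionByName() matches columns by name instead.
left.unionByName(right).show()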

And here is the test code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The iris CSV has 150 rows, one of which is an exact duplicate,
# which is why distinct() below yields 149.
iris_tbl = spark.read.csv("iris.csv", header=True)

# DataFrame union keeps duplicates, i.e. it behaves like SQL UNION ALL.
df = iris_tbl.union(iris_tbl)
print("union:", df.count())  # union: 300
df1 = iris_tbl.unionAll(iris_tbl)
print("union all:", df1.count())  # union all: 300
print("union distinct", df.distinct().count())  # union distinct 149

# Register the DataFrame as a temporary table so it can be queried with SQL.
iris_tbl.registerTempTable("iris")
df_union = spark.sql("select * from iris union select * from iris")
print("union:", df_union.count())  # union: 149
df_union_all = spark.sql("select * from iris union all select * from iris")
print("union all:", df_union_all.count())  # union all: 300
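So both union and unionAll on DataFrames behave like SQL UNION ALL, and you have to chain .distinct() to get SQL's deduplicating UNION. One more small note: registerTempTable has been deprecated since Spark 2.0, so the same SQL test can be written with createOrReplaceTempView:

iris_tbl.createOrReplaceTempView("iris")
print(spark.sql("select * from iris union select * from iris").count())  # 149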

