In SQL the
UNIONclause combines the results of two SQL queries into a single table of all matching rows. The two queries must result in the same number of columns and compatible data types in order to unite. Any duplicate records are automatically removed unless
UNION ALLis used.
UNIONcan be useful in data warehouse applications where tables aren’t perfectly normalized. A simple example would be a database having tables
sales2006that have identical structures but are separated because of performance considerations. A
UNIONquery could combine results from both tables.
UNION ALLdoes not guarantee the order of rows. Rows from the second operand may appear before, after, or mixed with rows from the first operand. In situations where a specific order is desired,
ORDER BYmust be used.
UNION ALLmay be much faster than plain
but in pyspark
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name).
and here is the test code,
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print("union:",df.count()) #union: 300
print("union all:",df1.count()) #union all: 300
print("union distinct",df.distinct().count()) #union distinct 149
df_union=spark.sql("select * from iris union select * from iris")
print("union:",df_union.count()) #union: 149
df_union_all=spark.sql("select * from iris union all select * from iris")
print("union all:",df_union_all.count()) #union all: 300