
Pay attention to the union function of PySpark

In SQL the UNION clause combines the results of two SQL queries into a single table of all matching rows. The two queries must result in the same number of columns and compatible data types in order to unite. Any duplicate records are automatically removed unless UNION ALL is used.

UNION can be useful in data warehouse applications where tables aren’t perfectly normalized. A simple example would be a database having tables sales2005 and sales2006 that have identical structures but are separated because of performance considerations. A UNION query could combine results from both tables.
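For a concrete picture, here is a minimal sketch of that data-warehouse case (the sales2005 / sales2006 table names come from the paragraph above; the columns and rows are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two hypothetical yearly tables with identical structure.
spark.createDataFrame([("alice", 100), ("bob", 200)], ["customer", "amount"]) \
    .createOrReplaceTempView("sales2005")
spark.createDataFrame([("bob", 200), ("carol", 300)], ["customer", "amount"]) \
    .createOrReplaceTempView("sales2006")

# UNION removes the duplicate ("bob", 200) row; UNION ALL would keep it.
spark.sql("select * from sales2005 union select * from sales2006").show()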

Note that UNION ALL does not guarantee the order of rows. Rows from the second operand may appear before, after, or mixed with rows from the first operand. In situations where a specific order is desired, ORDER BY must be used.
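Continuing the same sketch, adding ORDER BY to the combined query pins the row order:

# Without the ORDER BY, the row order of the combined result is unspecified.
spark.sql(
    "select * from sales2005 union all select * from sales2006 order by customer"
).show()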

Note that UNION ALL may be much faster than plain UNION.

But in PySpark, the documentation of DataFrame.union says:

This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.

Also as standard in SQL, this function resolves columns by position (not by name).
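Positional resolution matters when two DataFrames have the same columns in a different order. If you want name-based matching instead, Spark 2.3+ also offers unionByName; a small sketch with made-up data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1, "a")], ["id", "label"])
right = spark.createDataFrame([("b", 2)], ["label", "id"])

# union() pairs columns by position, so right's "b" would land in the id column;
# unionByName() matches columns by name instead.
left.unionByName(right).show()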

And here is the test code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The iris CSV has 150 rows, one of which is an exact duplicate,
# which is why distinct() below yields 149.
iris_tbl = spark.read.csv("iris.csv", header=True)

# DataFrame union keeps duplicates, i.e. it behaves like SQL UNION ALL.
df = iris_tbl.union(iris_tbl)
print("union:", df.count())  # union: 300
df1 = iris_tbl.unionAll(iris_tbl)
print("union all:", df1.count())  # union all: 300
print("union distinct", df.distinct().count())  # union distinct 149

# Register the DataFrame as a temporary table so it can be queried with SQL.
iris_tbl.registerTempTable("iris")
df_union = spark.sql("select * from iris union select * from iris")
print("union:", df_union.count())  # union: 149
df_union_all = spark.sql("select * from iris union all select * from iris")
print("union all:", df_union_all.count())  # union all: 300
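So both union and unionAll on DataFrames behave like SQL UNION ALL, and you have to chain .distinct() to get SQL's deduplicating UNION. One more small note: registerTempTable has been deprecated since Spark 2.0, so the same SQL test can be written with createOrReplaceTempView:

iris_tbl.createOrReplaceTempView("iris")
print(spark.sql("select * from iris union select * from iris").count())  # 149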

