Shay Palachy-Affek
08/03/2023, 11:13 AM
Is there a shift operation on a column that operates in-partition only, enabling it to run efficiently in a distributed manner? The shift method of spark.DataFrame doesn’t fulfill this requirement, as stated in the documentation:
“The current implementation of shift uses Spark’s Window without specifying partition specification. This leads to move all data into a single partition in a single machine and could cause serious performance degradation. Avoid this method against very large dataset.”
Thank you, 🧀
Shai
jonathan rowe
08/04/2023, 12:02 PM
Itagyba Abondanza Kuhlmann
08/17/2023, 9:32 PM
dfN = spark.read.parquet("hdfs://data-lake/layer/table_name/partition=keyN")
You can try to find a way to parallelize that with your orchestrator (Airflow, Oozie, SFN).
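A minimal sketch of that idea, in case it helps. It assumes the table is laid out as partition=<key> directories (as in the path above) and that the rows have some column to order by. The names partition_path, lag_within_partition, order_col and value_col are hypothetical, not from the thread. The window is given an explicit partitionBy, so Spark does not have to pull everything into a single partition:

```python
def partition_path(base_path: str, key: str) -> str:
    """Build the directory path of one partition, e.g. .../partition=key1."""
    return f"{base_path.rstrip('/')}/partition={key}"


def lag_within_partition(df, partition_col, order_col, value_col, periods=1):
    """Shift value_col by `periods` rows, separately inside each partition_col group.

    order_col and value_col are placeholder column names (assumptions).
    """
    # Imported inside the function so an orchestrator can load this file
    # even on a machine where Spark is not installed.
    from pyspark.sql import Window, functions as F

    # An explicit partitionBy keeps the window local to each key instead of
    # collecting the whole dataset into one partition.
    w = Window.partitionBy(partition_col).orderBy(order_col)
    return df.withColumn(f"{value_col}_shifted", F.lag(value_col, periods).over(w))
```

If the whole table is read at once, lag_within_partition(df, "partition", ...) lets Spark spread the keys across executors. If each orchestrator task reads a single directory via partition_path(...), note that the partition column is not part of that DataFrame, so it would need to be added back (for example with F.lit(key)) before calling the helper.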