Shay Palachy-Affek

08/03/2023, 11:13 AM
Hey everyone! I’ve got a couple of questions related to the same solution design that I’d love your help with: 1) I have a DeltaTable in DataBricks that is partitioned based on a specific field. How can I instruct Spark to load the entire table to the cluster, ensuring that rows from the same partition always arrive at the same node? 2) After accomplishing the above, how can I perform a
operation on a column that operates in-partition only, enabling it to run efficiently in a distributed manner? The
method of
doesn’t fulfill this requirement, as stated in the documentation: “The current implementation of shift uses Spark’s Window without specifying partition specification. This leads to move all data into a single partition in a single machine and could cause serious performance degradation. Avoid this method against very large dataset.” Thank you, 🧀 Shai

jonathan rowe

08/04/2023, 12:02 PM
When I last looked at this, DataFrame doesn't allow custom partitioner, you have drop down into RDD which allows a Partitioner to be defined You need to think very carefully if you want to do this! try and implement logic in dataframes

Itagyba Abondanza Kuhlmann

08/17/2023, 9:32 PM
I would probably go for a filesystem level when loading the data since it is already partitioned. For instance: hdfs://data-lake/layer/table_name/partition=key1 hdfs://data-lake/layer/table_name/partition=key2 hdfs://data-lake/layer/table_name/partition=key3hdfs://data-lake/layer/table_name/partition=keyN I would do some:
Copy code
dfN ="<hdfs://data-lake/layer/table_name/partition=keyN>")
You can try to find a way to parallelize that with your orchestrator (Airflow, Oozie, SFN)