Lennart Skogmo

03/13/2023, 10:04 PM
Does anyone happen to know any spark tricks that can be used to influence on what workers/in memory partitions the scan/read 'stages' put induvidual records? Even though neither data sources/data frames have hard partitions per se, it would be nice to be able to co-locate based on key columns on read to avoid exchange/shuffle etc later on when doing joins or using window functions. I suppose the command repartition(partitionExprs : Column)* is the closest but even that happens after the data is read if I'm understanding correctly.

Dominique Brezinski

03/13/2023, 10:37 PM
It will always require a network shuffle after read except when the read is from a partitioned table, so that `repartition(partitionExprs : Column)` is what you are looking for.*

Lennart Skogmo

03/13/2023, 10:38 PM
Ok thanks :)

JosephK (exDatabricks)

03/13/2023, 11:46 PM
What are you trying to do? Repartition(columns) is likely an antipattern

Lennart Skogmo

03/13/2023, 11:56 PM
I'm playing around with a CDC style framework for spark/delta. Doing in memory/no staging CDC. So I guess you could say I'm trying to find code patterns that can operate on any source data frame but still be reasonable efficient. So far I havent exprimented with big data/big clusters so atm atleast I'm leaving repartion alone and letting spark exchange/shuffle/sort as it sees fit. I didnt notice much difference on my small tests.