Omer Ozsakarya

03/13/2023, 6:27 AM
Hi all, I am using concat_ws in a Databricks notebook. Is the concat_ws function deterministic? Does it return the same order in every run?

Yousry Mohamed

03/13/2023, 7:36 AM
If the inputs to be concatenated are deterministic, then the output will be deterministic. If one of the inputs is something like current_timestamp(), then you will get something different each time an action is executed on the DataFrame (unless it is cached).
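A minimal PySpark sketch of that point (the DataFrame and column names here are made up for illustration): with purely deterministic inputs, concat_ws gives the same value on every action, while mixing in current_timestamp() does not.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", "b"), ("c", "d")], ["col1", "col2"])

# Deterministic inputs: every action over the same data returns the same value.
stable = df.select(F.concat_ws("-", "col1", "col2").alias("joined"))

# Non-deterministic input: current_timestamp() is re-evaluated per action,
# so repeated actions can return different values unless the result is cached.
unstable = df.select(
    F.concat_ws("-", "col1", F.current_timestamp().cast("string")).alias("joined")
)

stable.show()
unstable.show()
```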

Omer Ozsakarya

03/13/2023, 2:11 PM
Thank you. How does it work under the hood? Spark is a distributed engine. How does it divide the dataset into n parts (number of nodes) and still return the same result each time?

Dominique Brezinski

03/13/2023, 5:39 PM
Because concat_ws is row oriented: A) the split across workers doesn't matter if you're operating on a single DataFrame (or table), and B) joins are deterministic so long as the join keys are deterministic, so the resulting rows will be the same when the inputs are the same in the resulting DataFrame (see A).
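A rough PySpark sketch of the row-oriented point (hypothetical data and column names): repartitioning the same DataFrame changes which worker processes which row, but not the value concat_ws computes for it.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "a", "x"), (2, "b", "y"), (3, "c", "z")], ["id", "col1", "col2"]
)

# concat_ws only looks at the columns of each individual row, so how the rows
# are physically split across partitions (and workers) does not affect the
# value computed for any given row.
few = df.repartition(1).withColumn("joined", F.concat_ws("-", "col1", "col2"))
many = df.repartition(8).withColumn("joined", F.concat_ws("-", "col1", "col2"))

# exceptAll returns an empty DataFrame when both sides hold the same rows.
diff = few.select("id", "joined").exceptAll(many.select("id", "joined"))
assert diff.count() == 0
```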

Yousry Mohamed

03/13/2023, 9:26 PM
Not sure if there is another facet to the question around the order of rows returned. That's another story, but for a single row (with a certain ID or PK), if one column is produced by concat_ws and the inputs don't change across different invocations, then the function will produce the same result. I took your question as a chance to write a post on caching yesterday that may be a bit relevant 🙂 https://yousry.medium.com/back-to-basics-spark-caching-key-ideas-789be2b04ebd
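On the caching angle, a small sketch (again with made-up column names) of how cache() can pin an otherwise non-deterministic value so that repeated actions agree:
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a",), ("b",)], ["col1"])

# Without caching, each action re-evaluates the plan, so a value built from
# current_timestamp() can differ between actions.
joined = df.select(
    F.concat_ws("-", "col1", F.current_timestamp().cast("string")).alias("joined")
)

# cache() materializes the rows on the first action; later actions read the
# cached data, so the concatenated value stays the same across actions.
cached = joined.cache()
first = cached.collect()
second = cached.collect()
assert first == second
```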

Christopher Grant

03/14/2023, 8:03 PM
I don't think there are any guarantees. Might be better to sort the arrays as an extra step if you need determinism.
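A sketch of that suggestion, assuming the arrays come from something order-unstable like collect_list (the data and names are illustrative): sorting the array before concatenating makes the output independent of how the elements were gathered.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "b"), (1, "a"), (2, "c")], ["id", "val"])

# collect_list gives no ordering guarantee, so concatenating its result
# directly could produce different strings across runs. Sorting the array
# first removes that source of non-determinism.
result = (
    df.groupBy("id")
    .agg(F.collect_list("val").alias("vals"))
    .withColumn("joined", F.concat_ws("-", F.sort_array("vals")))
)
result.show()
```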