
vijay

01/25/2023, 12:53 AM
Hi All, I am trying to understand the internals of a shuffle hash join, and I want to check whether my understanding is correct. Say I have two tables, t1 and t2, joined on column country (8 distinct values), with the number of shuffle partitions set to 4 and two executors. In this case, data from t1 on both executors is first split into 4 partitions (say part 0 – part 3), stored as intermediate files on disk or in memory, using hash(key) % 4, and the same is done with data from t2 on both executors. In the reduce phase, data from the same partition number is merged, which finally results in 4 partitions (e.g., part 0 data from t1 and t2 from both executors is merged into one big part 0) before the join is performed. Is my understanding correct? Thanks for the help!

JosephK (exDatabricks)

01/25/2023, 12:58 AM
This isn’t a Delta Lake question

vijay

01/25/2023, 1:43 AM
Do we have a channel to post spark related questions?

JosephK (exDatabricks)

01/25/2023, 1:04 PM
There is https://community.databricks.com/s/ for Spark and Databricks questions. Your question isn’t even specifically related to those, so you could just google hash joins; there are likely dozens of videos and blogs about it.