Martin Beaussart

03/24/2023, 11:59 AM
Hi all, my team and I are using PySpark to manage Delta tables in a Delta Lake folder on AWS S3. We have to manage big tables (the .parquet files in a table folder add up to terabytes) and run merge queries on them. Merge performance is poor, so our plan is to partition those big tables to speed up the merges. Is this the right way to improve merges on our tables? Are there other ways of improving merge performance with Spark?

Dave Carroll

03/24/2023, 1:31 PM
Hi Martin, partitioning sounds like a good idea. You could also look at Z-ordering on the join key field. If you are merging smaller tables onto the bigger ones, perhaps look into using a broadcast join. Maybe also check your individual Parquet file sizes: with tables in the TB range, you could consider files of up to perhaps 1 GB, though this comes with trade-offs.
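To make the partitioning and broadcast suggestions concrete, here is a minimal sketch, not a tested recipe for this workload. It assumes the target table is partitioned by a hypothetical column `event_date`, that rows are matched on a hypothetical key `id`, and that the `delta-spark` package is installed. The function names `build_merge_condition` and `merge_small_batch` are made up for illustration. The idea is to put the partition values of the small source batch into the merge condition so Delta can skip untouched partitions, and to hint that the small source should be broadcast.

```python
from typing import Iterable, Sequence


def build_merge_condition(
    key_cols: Sequence[str],
    partition_col: str,
    partition_values: Iterable[str],
) -> str:
    """Build a merge condition that restricts the target ("t") to the
    partitions present in the source ("s"), then matches on the keys."""
    values = sorted(set(partition_values))
    if not values:
        raise ValueError("source batch has no partition values")
    for v in values:
        # Assumption: values are simple literals such as dates.
        if "'" in v or "\\" in v:
            raise ValueError(f"unsupported partition value: {v!r}")
    quoted = ", ".join(f"'{v}'" for v in values)
    key_match = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    return f"t.{partition_col} IN ({quoted}) AND {key_match}"


def merge_small_batch(spark, target_path, source_df, key_cols, partition_col):
    """Upsert a small source DataFrame into a large partitioned Delta table."""
    # Imported lazily so the helper above can be used without Spark installed.
    from delta.tables import DeltaTable
    from pyspark.sql.functions import broadcast

    # The source is assumed to be small, so collecting its distinct
    # partition values to the driver is cheap.
    rows = source_df.select(partition_col).distinct().collect()
    condition = build_merge_condition(
        key_cols, partition_col, [str(r[0]) for r in rows]
    )

    (
        DeltaTable.forPath(spark, target_path)
        .alias("t")
        # broadcast() is only a hint; Spark may still pick another join.
        .merge(broadcast(source_df).alias("s"), condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```

A call might look like `merge_small_batch(spark, "s3://bucket/path/to/table", updates_df, ["id"], "event_date")`, where the path and DataFrame are placeholders. Whether the partition predicate helps depends on how many partitions each batch touches.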
❤️ 1

Martin Beaussart

03/24/2023, 1:44 PM
Hello Dave, thanks a lot for your answer, so I see we're on the right track. I'll have to look into Z-ordering. Broadcasting seems great for us because we merge small amounts of data, but we do it a lot. And then I really have to consider how we can manage the size of our Parquet files.
👍 1
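For the Z-order and file size points, a rough sketch follows, assuming open-source Delta Lake 2.0 or later (where `DeltaTable.optimize()` is available) and a hypothetical join key column. The names `estimate_file_count` and `compact_and_zorder` are illustrative only. The first helper is plain arithmetic showing how the target file size changes the number of files a merge may have to scan; the second compacts the table and clusters it by the join key.

```python
def estimate_file_count(table_bytes: int, target_file_bytes: int) -> int:
    """Rough number of data files if every file reached the target size."""
    if table_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division, avoids float rounding on large sizes.
    return -(-table_bytes // target_file_bytes)


def compact_and_zorder(spark, table_path, zorder_col, max_file_bytes=1024**3):
    """Compact small files and Z-order the table by the merge join key."""
    # Imported lazily so the estimate above works without Spark installed.
    from delta.tables import DeltaTable

    # Upper bound on the size of files written by OPTIMIZE.
    spark.conf.set(
        "spark.databricks.delta.optimize.maxFileSize", str(max_file_bytes)
    )
    DeltaTable.forPath(spark, table_path).optimize().executeZOrderBy(zorder_col)


TIB = 1024**4
print(estimate_file_count(2 * TIB, 128 * 1024**2))  # 16384 files at 128 MiB
print(estimate_file_count(2 * TIB, 1024**3))        # 2048 files at 1 GiB
```

Fewer, larger files mean less listing and metadata overhead, but a merge that touches one row rewrites the whole file containing it. That is the trade-off Dave mentions, so the right size depends on how scattered the updates are.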

Jim Hibbard

03/26/2023, 7:08 AM
Hi Martin, this talk on optimizing merges on Delta Lake from a past Data+AI Summit might be helpful as well. The slides are really concise too if you don't like watching videos. He highlights the metrics he uses to optimize a merge and the difference between optimizing a merge for reads and for writes. Hope it helps! https://www.databricks.com/session_eu20/delta-lake-optimizing-merge
👍 1
🙌 1

Martin Beaussart

03/27/2023, 5:39 AM
Thanks for sharing, it is really interesting and will help my research on improving merges, thanks!
👍 1

Jim Hibbard

03/27/2023, 5:40 AM
Of course! Don't hesitate to reach out if you have more questions :)
👍 1

Martin Beaussart

03/27/2023, 5:43 AM
Yes, actually I found a video of the presentation as well, which may be helpful for those who prefer videos:

https://www.youtube.com/watch?v=o2k9PICWdx0