Is there a way to do merge-on-read in Delta OSS with a Spark batch job?
Hi guys, any help on this? My MERGE command joins on `id`, and the ids are generally distributed across all the files, so even if one record is updated the whole file containing it is rewritten, which is not very efficient. Is there any way to bypass this issue?
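For context, a minimal sketch of the kind of merge being described (table and column names here are assumptions, not from the original post):

```sql
-- Hypothetical upsert pattern: updates touch ids scattered across many
-- files, so every file with at least one matched row is rewritten in full.
MERGE INTO target t          -- assumed Delta table name
USING updates u              -- assumed staging view of changed rows
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```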
03/24/2023, 12:13 PM
that's just the way Delta and most other file-based data formats work. It's indeed not optimized for merge/update/delete, but that's in the nature of a file-based format.
If one row in a file changes, the whole file is rewritten as a new version.
03/26/2023, 7:32 AM
Hi Nikhil, Gerhard is completely correct that rewriting the whole file is inherent to file-based data formats. You can, however, improve the performance of your merges by doing things like tuning the size of your files and grouping records that are likely to change at the same time into the same file, so a merge touches fewer files overall.
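A sketch of what that grouping can look like in Delta OSS, assuming Delta Lake 2.0+ (which supports `OPTIMIZE ... ZORDER BY`); the table and column names are illustrative:

```sql
-- Compact small files and co-locate rows by the merge key, so a merge
-- touching a narrow range of ids rewrites fewer files.
OPTIMIZE target ZORDER BY (id);

-- If rows that change together also share a column such as a date,
-- adding it to the merge condition lets Delta skip files that cannot
-- match (event_date is an assumed column for illustration).
MERGE INTO target t
USING updates u
ON t.id = u.id AND t.event_date = u.event_date
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```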
This talk on optimizing merges on Delta Lake from a past Data+AI Summit might be helpful as well. The slides are really concise too if you don't like watching videos. The speaker highlights the metrics he uses to tune a merge and the difference between optimizing a merge for reads versus writes. Hope it helps!