ShanmukhSaavinay Gajula

06/08/2023, 11:58 AM
Hi Team, we are using Delta Lake's RunMerge for handling upserts in our data, but found that this utility takes close to 40-60% of the whole extraction-load job's runtime. We are trying to optimise the job's execution time. Can someone help me here?
Extraction-Load Job Process
1) Read the table data from the Postgres server incrementally
2) Write the data to S3 in Parquet format
3) Write the data to AWS S3 in Delta format
4) Run the merge to handle upserts
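For context, step 4's MERGE is logically an upsert: source rows that match an existing key update the target row, and unmatched source rows are inserted. A minimal pure-Python sketch of those semantics (the key column `id` and the row contents are illustrative, not taken from the actual job):

```python
def merge_upsert(target, source, key="id"):
    """Upsert source rows into target, matching on `key`.

    Mirrors MERGE semantics: matched rows are updated,
    unmatched source rows are inserted, other target rows
    are left untouched.
    """
    by_key = {row[key]: row for row in target}
    for row in source:
        by_key[row[key]] = row  # update if matched, insert otherwise
    return list(by_key.values())

target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
source = [{"id": 2, "v": "b2"}, {"id": 3, "v": "c"}]
merged = merge_upsert(target, source)
```

In Delta Lake itself this corresponds to a `MERGE INTO` with `WHEN MATCHED THEN UPDATE` and `WHEN NOT MATCHED THEN INSERT` clauses; the sketch only shows the logical outcome, not how Delta executes it.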

For instance, if a job takes 3 minutes to finish, steps 3 & 4 take close to 1.6-1.7 minutes of that. Please find the screenshot for more information.
@Harshit Soni

Venkat Hari

06/08/2023, 1:37 PM
* Optimize the merge condition
  * Make sure you capture all the right columns in the match condition to do an update or insert.
  * If the table is partitioned and you are updating a particular partition, provide that in the WHERE clause of the merge condition.
  * From my recent experiments, the merge runtime is much lower when avoiding partitioning for tables that are not going to grow beyond 1TB.
* Deletion vectors
  * We know they improve deletes, but we are soon planning experiments to understand whether they improve I/O without Databricks Photon/Predictive I/O.
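The partition-pruning suggestion above amounts to AND-ing a partition filter into the match condition, so Delta can skip target files outside those partitions before the join. A small helper sketch for building such a condition string (the column names, aliases, and date value are assumptions for illustration):

```python
def build_merge_condition(key_cols, partition_col=None, partition_values=None,
                          target_alias="t", source_alias="s"):
    """Build a MERGE match condition, optionally narrowed to specific
    target partitions so the engine can prune files before joining."""
    clauses = [f"{target_alias}.{c} = {source_alias}.{c}" for c in key_cols]
    if partition_col and partition_values:
        vals = ", ".join(f"'{v}'" for v in partition_values)
        clauses.append(f"{target_alias}.{partition_col} IN ({vals})")
    return " AND ".join(clauses)

cond = build_merge_condition(["id"], "event_date", ["2023-06-08"])
# "t.id = s.id AND t.event_date IN ('2023-06-08')"
```

The resulting string would be passed as the condition of the merge (e.g. the condition argument of `DeltaTable.merge`); the narrower it is, the fewer target files the merge has to scan and rewrite.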

Harshit Soni

06/12/2023, 10:19 AM
@Venkat Hari we have all the required columns in the merge condition. The table is not partitioned, and there are no deletions in the records. From the logs, most of the time is spent in computing the snapshot and in writing the merged data across multiple job IDs. Can we reduce those?
@Venkat Hari ^^^^

Venkat Hari

06/26/2023, 1:41 PM
Not sure whether you can turn off the snapshot. But merge itself is a combination of multiple tasks: it has to scan the target table, join the target with the source, and then write the result.
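That scan-join-write shape is why narrowing the merge condition helps: only target files that could contain a matching key need to be rewritten, and the rest are carried over unchanged. A toy model of that file-selection step (illustrative only, not Delta's actual implementation; file names and key ranges are made up):

```python
def files_to_rewrite(target_files, source_keys):
    """Pick the target files a merge must rewrite: only files whose
    min/max key range could contain a source key are touched."""
    hit = []
    for f in target_files:
        lo, hi = f["min_key"], f["max_key"]
        if any(lo <= k <= hi for k in source_keys):
            hit.append(f["name"])
    return hit

files = [
    {"name": "part-0", "min_key": 1, "max_key": 100},
    {"name": "part-1", "min_key": 101, "max_key": 200},
    {"name": "part-2", "min_key": 201, "max_key": 300},
]
touched = files_to_rewrite(files, source_keys=[150, 160])
# only "part-1" needs rewriting; the other files are untouched
```

In the same spirit, if your incremental source only touches recent keys, anything that tightens the condition (partition filters, or min/max file statistics via data skipping) shrinks both the scan and the write.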