Patrik Ekman

05/04/2023, 7:49 AM
Hello! I stumbled upon some PySpark code where a MERGE into a Delta table is followed by an append to the same table, and the append writes the same data that the merge's insert clause refers to. I was told that Spark handles this in its execution plan: all rows but one are appended, and only one row is actually inserted during the merge, which makes the code run faster because most rows are appended without any further matching checks. Does this make sense? Does anyone have an explanation for this? Thank you so much!
.merge(source=df_stage.alias("updates"), condition=join_condition)
.whenMatchedUpdate(condition="...", set={...})
.whenNotMatchedInsert(values={...})
.execute()

df_load = df_stage.write.format("delta").mode("append") ...
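To make the claimed behavior concrete, here is a pure-Python sketch (no Spark, hypothetical data) of the split the pattern relies on: staged rows whose keys already exist in the target go through the MERGE's update path, while rows with new keys can be appended without any per-row match check. This is only an illustration of the logic, not the Delta Lake implementation.

```python
def upsert_split(target: dict, stage: dict) -> tuple[dict, dict]:
    """Partition staged rows into those MERGE must update (key already
    present in the target) and those that can simply be appended."""
    to_merge = {k: v for k, v in stage.items() if k in target}
    to_append = {k: v for k, v in stage.items() if k not in target}
    return to_merge, to_append

# Hypothetical contents of the Delta table and the staging DataFrame.
target = {1: "a", 2: "b"}
stage = {2: "b2", 3: "c", 4: "d"}

to_merge, to_append = upsert_split(target, stage)
target.update(to_merge)   # MERGE whenMatchedUpdate path (one matching row)
target.update(to_append)  # plain append path, no match check needed
print(sorted(target.items()))  # [(1, 'a'), (2, 'b2'), (3, 'c'), (4, 'd')]
```

Whether Spark actually rewrites the merge-plus-append into this split is exactly the open question; the sketch only shows why appending the unmatched rows avoids the per-row join that a full MERGE would perform.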