Abolfazl karimian

06/24/2023, 10:03 AM
Hi everyone. I'm using Delta on a cluster that receives a huge amount of data every day, so I need to delete files older than 5 days, and I use Delta VACUUM for that. But I have run into a big problem. VACUUM should delete only invalidated files, but it also deletes some valid files, which is really strange to me. It deletes files older than 5 days, and it also deletes files from the current day, whose data is still being written. I have also set the logRetentionDuration config to delete old logs. I would really appreciate a solution.
delta_table = DeltaTable.forPath(session, path)
delta_table.logRetentionDuration = "interval 30 days"
delta_table.deletes(col("date") < "specified_day")
session.sql("vacuum delta.path RETAIN 0 HOURS")
session.sql("vacuum delta.path RETAIN 0 HOURS")

JosephK (exDatabricks)

06/24/2023, 3:06 PM
0 hours != 5 days

Dominique Brezinski

06/24/2023, 7:02 PM
To be more specific, you pretty much never want to vacuum RETAIN 0 HOURS on a table that has active writes or optimizations happening. Retain should be no shorter than your deletion period.
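As a minimal sketch of the advice above, the statements can be built with a guard that refuses a retention window shorter than one day. The helper names (`vacuum_sql`, `log_retention_sql`), the `MIN_SAFE_RETAIN_HOURS` constant, and the path `/data/events` are hypothetical and only for illustration; `VACUUM ... RETAIN n HOURS` and the `delta.logRetentionDuration` table property are the Delta features being referred to. The one-day floor is the minimum suggested later in this thread, not a limit enforced by Delta.

```python
# Assumption: one day is the smallest retention window considered safe here.
MIN_SAFE_RETAIN_HOURS = 24


def vacuum_sql(table_path: str, retain_hours: int) -> str:
    """Build a VACUUM statement, rejecting windows shorter than the floor."""
    if retain_hours < MIN_SAFE_RETAIN_HOURS:
        raise ValueError(
            f"RETAIN {retain_hours} HOURS can remove files of in-flight writes; "
            f"use at least {MIN_SAFE_RETAIN_HOURS} hours"
        )
    return f"VACUUM delta.`{table_path}` RETAIN {retain_hours} HOURS"


def log_retention_sql(table_path: str, interval: str) -> str:
    """Build the statement that sets log retention as a table property."""
    return (
        f"ALTER TABLE delta.`{table_path}` "
        f"SET TBLPROPERTIES ('delta.logRetentionDuration' = '{interval}')"
    )


# Hypothetical table path, for illustration only.
vacuum_statement = vacuum_sql("/data/events", 24)
retention_statement = log_retention_sql("/data/events", "interval 30 days")
```

With a Spark session available, each string would be passed to `session.sql(...)`. Setting the retention as a table property replaces the attribute assignment on the `DeltaTable` object in the original post, which does not change the table's configuration.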

Abolfazl karimian

06/26/2023, 8:03 AM
@JosephK (exDatabricks) Actually, I don't get your point. Today I'm deleting partitions older than five days, and I want them to be VACUUMed (deleted from disk) NOW. What should I do? @Dominique Brezinski Your reasoning implies that if I delete partitions older than 5 days, then I should VACUUM with a RETAIN of at least 5 days. But as we know, RETAIN 5 days deletes only files that have been invalidated for more than 5 days. That means if I delete some partitions today, it is the VACUUM job running 5 days from now that will delete those files. Am I wrong?

Dominique Brezinski

06/26/2023, 2:16 PM
That is correct, but you are not accounting for the uncommitted writes happening on the table. Vacuum lists the table directories, cross-references the listing with the files referenced in the delta log, and deletes what isn’t referenced in the log. You need to provide a safety margin so that files that are part of an active transaction can be committed to the log; otherwise they will get deleted before they are committed. You don’t necessarily need to retain five days, but at the very minimum you should retain at least one day. 7 days is the default.
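The behaviour described above can be illustrated with a small, simplified model. This is not Delta's implementation: the `DataFile` type, the `files_vacuum_would_delete` function, and the file names are hypothetical, and the model assumes only that a file is removed when the log does not reference it and it is older than the retention window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DataFile:
    path: str
    modified: datetime  # last-modified time seen in the directory listing


def files_vacuum_would_delete(listing, referenced_in_log, now, retain):
    """Simplified model: remove files that the log does not reference
    and that are older than the retention window."""
    cutoff = now - retain
    return sorted(
        f.path
        for f in listing
        if f.path not in referenced_in_log and f.modified < cutoff
    )


now = datetime(2023, 6, 26, 12, 0)
listing = [
    # Old partition file, no longer referenced after a delete made days ago.
    DataFile("date=2023-06-20/part-0.parquet", now - timedelta(days=6)),
    # Today's file, already committed to the log.
    DataFile("date=2023-06-26/part-0.parquet", now - timedelta(hours=2)),
    # Today's file, written by an active transaction but not committed yet.
    DataFile("date=2023-06-26/part-1.parquet", now - timedelta(minutes=5)),
]
referenced_in_log = {"date=2023-06-26/part-0.parquet"}

retain_zero = files_vacuum_would_delete(
    listing, referenced_in_log, now, timedelta(hours=0)
)
retain_one_day = files_vacuum_would_delete(
    listing, referenced_in_log, now, timedelta(days=1)
)
```

In this model, `retain_zero` contains both the old partition file and the uncommitted file from the current day, which matches the symptom in the original question. `retain_one_day` contains only the old partition file, because the in-flight file falls inside the safety margin.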