Halvar Trøyel Nerbø

01/12/2023, 3:21 PM
I have a question regarding VACUUM and the ./_delta_log directory. It seems to me that running VACUUM does not directly affect any files in the _delta_log directory? For one of our tables, after running VACUUM we still have 600k files in the _delta_log directory taking up 10 GB of space. The table receives 500-1000 single-row inserts per day. After a deep clone, the data + log directory is only 120 MB. My questions are:
• Is this by design?
• Are there any intended ways of cleaning up the _delta_log directory, apart from doing a copy/clone and drop?
Using Databricks on Azure

Jon Stockham

01/12/2023, 3:25 PM
It will automatically clean them up after each checkpoint is created. By default it cleans up log files older than 30 days.
Configure with:
delta.logRetentionDuration = "interval <interval>"
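
For illustration, here is a minimal sketch of how that property could be set as a table property. The helper `retention_statement`, the table name `my_schema.my_table`, and the `7 days` interval are all hypothetical and are not taken from this thread. The sketch only builds the SQL string; running it assumes an active SparkSession named `spark`.

```python
def retention_statement(table_name, interval):
    """Build an ALTER TABLE statement that sets delta.logRetentionDuration.

    `table_name` and `interval` are placeholders supplied by the caller,
    e.g. interval="7 days".
    """
    return (
        f"ALTER TABLE {table_name} "
        f"SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval {interval}')"
    )


# Hypothetical table name and interval, for illustration only.
stmt = retention_statement("my_schema.my_table", "7 days")
print(stmt)

# On a cluster with an active SparkSession (assumed to be named `spark`):
# spark.sql(stmt)
```

Setting the property does not delete anything by itself. As described above, old log files are removed only when a later checkpoint is written.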

Halvar Trøyel Nerbø

01/12/2023, 3:44 PM
Thanks! Will try to set this as a table property

Scott Sandre (Delta Lake)

01/12/2023, 10:32 PM
Yup! VACUUM cleans up old data files that are no longer part of the latest table version (tombstones). Log cleanup happens every 10 commits and will remove old log files.

Halvar Trøyel Nerbø

01/13/2023, 7:36 AM
After setting logRetentionDuration explicitly on the table, there are still 600k files in the _delta_log directory. There have been around 3000 individual inserts since the property was set. I have also run OPTIMIZE and VACUUM, with no significant change. Looking at the number of files under the _delta_log directory by last modified date, the majority of the 600k files are older than 30 days. Inserts run on Databricks SQL "Databricks-Runtime/11.3.x-photon-scala2.12", and I also tested doing 15 individual inserts using a regular interactive cluster "Databricks-Runtime/11.2.x-scala2.12". Does anyone have any suggestions as to where the issue might be?
Table properties look like this:
I do see that the previous checkpoint files are removed in batches once in a while, but the .json and .crc files are not.
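
A breakdown like the one described above (file counts by type and age) could be produced with a small script along these lines. This is only a sketch: it assumes the _delta_log directory is reachable as a local filesystem path (for example through a mount), and the function name `summarize_log_dir` and the 30-day default are illustrative, not part of Delta Lake.

```python
import os
import time
from collections import Counter


def summarize_log_dir(log_dir, retention_days=30, now=None):
    """Count files in log_dir by (suffix, 'expired' | 'retained').

    A file counts as 'expired' when its last-modified time is older than
    `retention_days` relative to `now` (defaults to the current time).
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 24 * 60 * 60
    counts = Counter()
    with os.scandir(log_dir) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            # Last extension only, e.g. ".json", ".crc", ".parquet".
            suffix = os.path.splitext(entry.name)[1] or "(none)"
            age = "expired" if entry.stat().st_mtime < cutoff else "retained"
            counts[(suffix, age)] += 1
    return counts
```

Calling `summarize_log_dir("<path to table>/_delta_log")` returns a `Counter` keyed by pairs such as `(".json", "expired")`, which makes it easy to see whether the files older than the retention window are mostly .json and .crc files.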

Nick Karpov

01/25/2023, 5:43 PM
@Halvar Trøyel Nerbø curious, did you figure this out?

Halvar Trøyel Nerbø

01/25/2023, 7:47 PM
Not yet @Nick Karpov. The Databricks product team has been working on it. The usual Delta Lake cleanup procedures do not seem to remove the old log files when they are run.