
Marius Grama

02/21/2023, 8:51 AM
Can transaction log files be safely removed from a Delta table? https://github.com/delta-io/delta/blob/master/PROTOCOL.md#last-checkpoint-file
"The Delta transaction log will often contain many (e.g. 10,000+) files."

Gerhard Brueckl

02/21/2023, 10:07 AM
While I would definitely not recommend deleting them manually, it should technically work. But you will end up with orphaned data files and should run a `VACUUM` afterwards to free up the space those data files occupy as well. In your case, though, I would rather investigate why the logs are not cleaned up automatically (assuming they are older than your `delta.logRetentionDuration` and a checkpoint was recently created), or why you have so many logs at all. It could very well be that you have a streaming job that writes into your Delta table continuously, thereby producing a lot of log files.

Marius Grama

02/21/2023, 10:09 AM
I’m actually only wondering why this part of the process is not a bit more transparent in Delta. Is there an open source operation corresponding to this Databricks functionality?

Gerhard Brueckl

02/21/2023, 10:12 AM
I am quite sure this part is not proprietary to Databricks but is also in the OSS version. What Databricks functionality are you referring to?

Marius Grama

02/21/2023, 10:13 AM
I’m referring to the process that takes care of removing outdated transaction log files based on the `delta.logRetentionDuration` setting.

Gerhard Brueckl

02/21/2023, 1:23 PM
That should work exactly the same way in OSS as it does in Databricks.

Marius Grama

02/21/2023, 1:25 PM
Can you give me a bit more detail? So there is a routine in Delta Lake OSS that removes transaction log files when they are older than the specified `delta.logRetentionDuration` period?

Gerhard Brueckl

02/21/2023, 2:05 PM
Yes, otherwise all implementations/tools that use Delta OSS would never clean up log entries.

Marius Grama

02/21/2023, 2:33 PM
I did a test on Databricks by adding a few hundred transaction log files (I did insert, delete, vacuum, optimize) with `delta.logRetentionDuration` set to `2 minutes`, but no file has been removed. Moreover, the protocol does not mention that transaction log files should be removed, or how they are supposed to be removed.

Gerhard Brueckl

02/21/2023, 2:43 PM
Can you try setting it to `interval 2 minutes`?
Not sure though if minute intervals actually work.
Or try `interval 0 hours`.
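
For reference, a minimal sketch of how such a setting could be applied as a table property. The helper name `set_log_retention_sql` and the table name `my_table` are hypothetical; the code only builds the SQL statement, and running it assumes a Spark session with Delta Lake configured, which is not shown here.

```python
def set_log_retention_sql(table_name: str, interval: str) -> str:
    """Build an ALTER TABLE statement that sets delta.logRetentionDuration.

    table_name and interval are inserted verbatim, so only pass trusted values.
    """
    return (
        f"ALTER TABLE {table_name} "
        f"SET TBLPROPERTIES ('delta.logRetentionDuration' = '{interval}')"
    )


# `my_table` is a placeholder table name used for illustration only.
statement = set_log_retention_sql("my_table", "interval 0 hours")
print(statement)

# With an active Delta-enabled Spark session (assumed, not created here):
# spark.sql(statement)
```

Setting the property alone does not delete anything; as discussed further down in the thread, the cleanup is tied to checkpoint creation.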

Marius Grama

02/21/2023, 2:45 PM
Gerhard, thank you for your patience. I will try setting it to `interval 0 hours` to see what happens.

Gerhard Brueckl

02/21/2023, 2:45 PM
https://docs.delta.io/latest/delta-batch.html#data-retention (but I guess you came across this already)
👍 1

Marius Grama

02/21/2023, 2:47 PM
Did you see the functionality in action? Does the cleanup happen right after creating the checkpoint, or is it a managed service that asynchronously sweeps dangling transaction log entries from all of a customer's tables?

Gerhard Brueckl

02/21/2023, 2:48 PM
I don't know, to be honest, but again, if it did not work, a lot of people would have already complained, I guess 😄

Dominique Brezinski

02/21/2023, 7:40 PM
It is not a managed service.
/**
 * Returns the duration in millis for how long to keep around obsolete logs. We may keep logs
 * beyond this duration until the next calendar day to avoid constantly creating checkpoints.
 */

Marius Grama

02/21/2023, 7:43 PM
Thank you, Dominique.

Dominique Brezinski

02/21/2023, 7:43 PM
The code logs, at INFO level, the date it is expiring to and how many files it deleted.

Marius Grama

02/21/2023, 7:43 PM
I am looking at the code and see that the cleanup should be performed right after the checkpoint gets created.
However, I couldn’t reproduce this behavior on Databricks 11.3 LTS while using 0 hours, 1 minute or 1 hour intervals.

Dominique Brezinski

02/21/2023, 7:45 PM
Because it truncates the retention parameter to the day, so if you set a short one, it won't delete anything until the first checkpoint on the following day.
👍 1
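
The day truncation described above can be illustrated with a small sketch. This is a simplification and not the Delta Lake implementation: it assumes the cutoff is computed as the current time minus the retention, rounded down to the start of the day in UTC, and it ignores the additional requirement that a checkpoint must exist before older log files can go. The function names `log_cutoff` and `is_expired` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def log_cutoff(now: datetime, retention: timedelta) -> datetime:
    """Assumed cutoff: (now - retention), truncated to midnight of that day."""
    raw = now - retention
    return raw.replace(hour=0, minute=0, second=0, microsecond=0)


def is_expired(file_time: datetime, now: datetime, retention: timedelta) -> bool:
    """A log file counts as expired only if it is older than the truncated cutoff."""
    return file_time < log_cutoff(now, retention)


utc = timezone.utc
retention = timedelta(minutes=2)
written = datetime(2023, 2, 21, 14, 0, tzinfo=utc)  # a log file written that afternoon

same_day = datetime(2023, 2, 21, 14, 33, tzinfo=utc)
next_day = datetime(2023, 2, 22, 9, 0, tzinfo=utc)

print(is_expired(written, same_day, retention))  # False: cutoff is 2023-02-21 00:00
print(is_expired(written, next_day, retention))  # True: cutoff is 2023-02-22 00:00
```

Under these assumptions, even a retention of a few minutes cannot expire files written earlier on the same day, which matches the behavior observed in the test above.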
Found it.
I will try tomorrow on Databricks, then, to see whether it is effective.
Thank you for pointing me in the right direction.
One day later 🙂 the number of transaction log files went from `457` to `29`.
👍 1
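
A before/after count like the one above could be taken with a sketch along these lines. It assumes the file naming from the Delta protocol (a 20-digit zero-padded version followed by `.json` for commits, and `.checkpoint.parquet` or a multi-part variant for checkpoints); the helper name `summarize_log_files` and the example path are hypothetical.

```python
import re

# Commit files: 20-digit zero-padded version number plus ".json".
COMMIT_FILE = re.compile(r"^\d{20}\.json$")
# Checkpoints: single-file or multi-part ("<version>.checkpoint.<part>.<parts>.parquet").
CHECKPOINT_FILE = re.compile(r"^\d{20}\.checkpoint(\.\d+\.\d+)?\.parquet$")


def summarize_log_files(names):
    """Count commit and checkpoint files among the given _delta_log file names."""
    names = list(names)  # allow generators, since the names are scanned twice
    return {
        "commits": sum(1 for n in names if COMMIT_FILE.match(n)),
        "checkpoints": sum(1 for n in names if CHECKPOINT_FILE.match(n)),
    }


example = [
    "00000000000000000000.json",
    "00000000000000000001.json",
    "00000000000000000010.checkpoint.parquet",
    "_last_checkpoint",
]
print(summarize_log_files(example))
```

For a table on a local file system, the names could come from something like `os.listdir("/path/to/table/_delta_log")`, where the path is a placeholder; object stores would need their own listing call.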