
Naama Gal-Or

05/23/2023, 12:50 PM
Hi, is there a way to define a lifecycle policy on a delta lake table according to a partition column? Something like auto deletion of all the data older than X days according to a partition column?

rtyler

05/23/2023, 4:08 PM
Assuming you're using S3, you can set lifecycle policies based on the prefix of an object, e.g. for `table/day=Sunday/part-000.parquet` you could use "table/day=Sunday" as the prefix. If you're talking about doing something dependent on the partition value, that might be possible, but I'm not sure offhand if/how it could be implemented.
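For illustration, a prefix-based lifecycle rule like the one described could be sketched as follows (the rule ID, prefix, and 30-day window are all hypothetical; note the prefix is a literal string, so you'd need one rule per partition value):

```python
# Hypothetical S3 lifecycle configuration: expire objects under one
# partition prefix 30 days after creation.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-old-partition",          # made-up rule name
            "Filter": {"Prefix": "table/day=Sunday/"},
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        }
    ]
}

# With boto3 this would be applied roughly like:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_config)

print(lifecycle_config["Rules"][0]["Filter"]["Prefix"])
```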

Naama Gal-Or

05/23/2023, 4:10 PM
But how can we rely on S3 retention without making the Delta log aware of these changes? It sounds like this is going to break the table.

rtyler

05/23/2023, 4:12 PM
A great deal of cleverness is required 😆 You can use S3 Object Event Notifications to look for those Delete operations and have a Lambda make the corresponding transaction log removals in the `_delta_log/`.
🤔 this might be a good blog post for me to write 🤔
I don't know if that approach would cause any issues with `vacuum` though.
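For context, the "removal" such a Lambda would write is one JSON line per deleted file, using the `remove` action from the Delta transaction log protocol. A minimal sketch (the helper name and path are made up, and a real commit would also need the correct next version number and an atomic write of the commit file):

```python
import json
import time

def make_remove_action(relative_path: str) -> str:
    # Build the JSON line that records a file removal in a Delta commit.
    # Field names follow the Delta transaction log protocol's "remove" action.
    action = {
        "remove": {
            "path": relative_path,                       # path relative to the table root
            "deletionTimestamp": int(time.time() * 1000),  # milliseconds since epoch
            "dataChange": True,
        }
    }
    return json.dumps(action)

# Hypothetical key for a file S3 lifecycle just deleted:
line = make_remove_action("day=Sunday/part-000.parquet")
print(line)
```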

Naama Gal-Or

05/24/2023, 3:58 AM
Is this something that is supported by delta lake? How would you manipulate the delta log?

rtyler

05/24/2023, 4:46 AM
Bear in mind I am a contributor to #delta-rs, so I might be a bit more comfortable than most working directly with the transaction log 😉 Anyway, the transaction log doesn't know anything other than what you tell it; it's just JSON sitting in storage. My suggestion is to come along after an object has been deleted and add the corresponding "remove" transaction. This is relatively straightforward from the Rust libraries, but I'm not sure what is feasible from the Spark layer, unfortunately.
I started playing with this a bit last night, I'll see if I can put something together in the form of a lambda this weekend 🤞
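As a rough sketch of the "come along afterwards and add the remove transaction" idea: Delta commit files are named as 20-digit zero-padded versions, so the next commit path can be derived from the existing entries. This is simplified (it lists a local directory rather than S3, and a production version would need atomic, conflict-safe commit creation):

```python
import os
import tempfile

def next_commit_path(delta_log_dir: str) -> str:
    # Delta commit files are named like 00000000000000000003.json;
    # the next commit version is max(existing versions) + 1.
    versions = [
        int(name[:-5])
        for name in os.listdir(delta_log_dir)
        if name.endswith(".json") and name[:-5].isdigit()
    ]
    next_version = max(versions) + 1 if versions else 0
    return os.path.join(delta_log_dir, f"{next_version:020d}.json")

# Demo against a throwaway directory standing in for a real _delta_log/
log_dir = tempfile.mkdtemp()
open(os.path.join(log_dir, "00000000000000000000.json"), "w").close()
open(os.path.join(log_dir, "00000000000000000001.json"), "w").close()
print(os.path.basename(next_commit_path(log_dir)))  # 00000000000000000002.json
```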