
Lucas Zago

03/27/2023, 6:05 PM
Hi all, are there any best practices for how long to store data in Delta Lake? Given a table with updates, inserts, and deletes every day, is there any guidance on how to handle it?

rtyler

03/27/2023, 6:07 PM
There's a lot that is subjective. In our case we look at the cost of storage and the use case, and set up automatic `VACUUM` jobs that trim older deleted files every now and again (say, once a week or month)
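A minimal sketch of what such a scheduled job might look like, assuming PySpark with the Delta Lake library; the table path and the 30-day window are hypothetical, not from the thread:

```python
from datetime import timedelta

def retention_hours(time_travel_days: int) -> int:
    """Convert a desired time-travel window in days into the
    retention-hours argument that Delta's VACUUM expects."""
    return int(timedelta(days=time_travel_days).total_seconds() // 3600)

# Example: keep 30 days of time travel before files become
# eligible for physical deletion.
hours = retention_hours(30)  # 720

# Inside the scheduled job itself (requires a running SparkSession
# with the Delta Lake library on the classpath):
# from delta.tables import DeltaTable
# DeltaTable.forPath(spark, "s3://bucket/my_table").vacuum(hours)
```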

Matthew Powers

03/27/2023, 9:49 PM
It depends on your need for time travel versus storage costs. The more you vacuum, the more you save on storage, but the less far back you can time travel. So, as Tyler mentioned, it really depends on your specific circumstances (and they might be different for different tables). I wrote a blog post on vacuum that you might find useful: https://delta.io/blog/2023-01-03-delta-lake-vacuum-command/
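To make that tradeoff concrete, here is a back-of-the-envelope sketch (all numbers hypothetical): for a table with steady daily churn, the stale files that VACUUM has not yet removed grow roughly linearly with the retention window, so doubling the time-travel window roughly doubles the extra storage:

```python
def stale_storage_gb(daily_rewritten_gb: float, retention_days: int) -> float:
    """Rough estimate of extra storage held by files that updates/deletes
    have logically removed but VACUUM has not yet physically deleted.
    Assumes a steady daily churn rate (a hypothetical simplification)."""
    return daily_rewritten_gb * retention_days

# A table rewriting ~5 GB/day:
week = stale_storage_gb(5, 7)    # ~35 GB of stale files with 7-day retention
month = stale_storage_gb(5, 30)  # ~150 GB with 30-day retention
```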

Jim Hibbard

03/27/2023, 10:55 PM
Great question! Outside of any regulatory/legal obligations you may have to retain data, it's definitely a cost versus performance and flexibility consideration. Vitor Teixeira shared his flow chart on keeping Delta Lakes fast and clean that you may find useful. Hope it helps! https://towardsdatascience.com/delta-lake-keeping-it-fast-and-clean-3c9d4f9e2f5e

Lucas Zago

03/28/2023, 1:43 PM
In our process we have a landing zone where the .avro files from GoldenGate arrive, so we are thinking about a strategy for how to deal with them