
Martin Beaussart

04/19/2023, 9:01 AM
Hello guys! I have a Delta Lake database folder in an AWS S3 bucket. In it I have a Delta table on which I am continuously doing updates, inserts, and deletes. Some weeks ago its size was 6 TB; now it is 17 TB! So my question is: how can we reduce the size of tables where updates, inserts, and deletions are made continuously? I saw Delta tables have something called 'Compact files'. I guess I should go toward this solution, but I wanted to ask to be sure whether this is THE way to go when facing this problem, or if there is another way to do it. Thanks in advance :)
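For reference, file compaction from PySpark looks roughly like this (a minimal sketch, assuming Delta Lake 2.0+ with the delta-spark package and an existing SparkSession `spark`; the S3 path is a placeholder). Note that compaction rewrites many small files into fewer large ones but does not delete the replaced files, so by itself it will not shrink the table directory:

```python
from delta.tables import DeltaTable

# Placeholder path -- point this at your table's S3 location.
delta_table = DeltaTable.forPath(spark, "s3://my-bucket/my-delta-table")

# Bin-packing compaction: rewrites many small files into fewer large ones.
# The old files remain on S3 until they are vacuumed.
delta_table.optimize().executeCompaction()
```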

Martin Beaussart

04/19/2023, 10:02 AM
Ok, in the doc you provided I see:
VACUUM removes all files from the table directory that are not managed by Delta, as well as data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold.
So I think this will definitely solve my problem, because as I understand it, running VACUUM on a Delta table deletes the old data files that fall outside the retention threshold and keeps only the ones still referenced by the retained versions of the table. Am I correct? I am also using PySpark, and I saw in the docs that this can be done there as well. Here is the link for anyone interested: https://docs.delta.io/latest/delta-utility.html#language-python:~:text=table%20to%20Delta-,Remov[…]0behavior%2C%20see%20Data%20retention.,-Important
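Calling VACUUM from PySpark looks roughly like this (a minimal sketch, assuming the delta-spark package and an existing SparkSession `spark`; the path is a placeholder, and 168 hours is the default 7-day retention threshold):

```python
from delta.tables import DeltaTable

# Placeholder path -- point this at your table's S3 location.
delta_table = DeltaTable.forPath(spark, "s3://my-bucket/my-delta-table")

# Permanently delete data files no longer referenced by any table version
# within the retention window. This frees storage, but breaks time travel
# to versions older than the threshold.
delta_table.vacuum(168)  # retention threshold in hours
```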