
Javier Carbajo

06/29/2023, 7:58 AM
Hi! I’m quite a beginner with Delta but glad to be here! I have a little question; hopefully some of you can shed some light: I’m processing events and storing them in a Delta table, partitioned by event_name and date. So the longer I ingest data, the more files and the larger the table gets. Every night I run a compaction + vacuum process to keep the table in the best possible shape. Compaction (optimize) is perfect because I can filter the partitions I want to optimize, but the vacuum command doesn’t let me filter the partitions I want to vacuum and vacuums the whole table each time. As I only append data to the table, if this behavior continues, the vacuum will take too long to run. For the moment it’s not a problem, but I want to anticipate it. Any idea how I can vacuum/clean the table/old files while filtering the partitions I want to vacuum? Or any workaround to avoid overly large vacuums? Thanks a lot for your help! Have a nice Thursday! 🙂
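For context, my nightly job looks roughly like this (a minimal sketch; the table name and date value are just examples, not my real ones):
```
# Minimal sketch of the nightly maintenance job, assuming a Delta table
# named "events" and a Spark session with Delta Lake configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compaction: OPTIMIZE accepts a partition filter, so only recent
# partitions are touched.
spark.sql("OPTIMIZE events WHERE date = '2023-06-28'")

# Vacuum: no partition filter is accepted; it always covers the whole table.
spark.sql("VACUUM events")
```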

Gerhard Brueckl

06/29/2023, 8:30 AM
If the table is appendOnly, then VACUUM would not do anything and you can just omit it. To remove data from older dates, you would need to run a regular DELETE command instead of VACUUM.
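For example, something like this (just a sketch; the table name and cutoff date are made up):
```
# Example only: DELETE removes old rows logically; a later VACUUM then
# physically deletes the underlying files once they leave the retention window.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("DELETE FROM events WHERE date < '2023-01-01'")
```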

Javier Carbajo

06/29/2023, 10:34 AM
Hi @Gerhard Brueckl, thanks for your comment. I don’t understand “vacuum would not do anything and you can just omit it”. If I’m appending data and optimizing it to merge small files into larger ones, vacuum will delete the small (and old) files, no? Or am I missing something?

Gerhard Brueckl

06/29/2023, 10:39 AM
Oops, sorry, sure: VACUUM would then delete the small files that were merged into bigger files by OPTIMIZE. VACUUM consists of two steps:
1. find the outdated (or orphaned) files
2. delete the files identified in step 1
The first step can be distributed across the cluster and should run faster if you have a bigger cluster. The second step should not have much to do if you run VACUUM frequently.
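If you want to see what step 1 finds without deleting anything, you can do a dry run (table name is just an example):
```
# Lists the files VACUUM would delete, without actually deleting them.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("VACUUM events DRY RUN").show(truncate=False)
```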

Javier Carbajo

06/29/2023, 10:43 AM
And if I’m not wrong, the master node is the one in charge of step 1, right? So maybe just adding more resources to the master node would make it faster? It’s something I had already thought about, but I asked here hoping for a “magical filter” to improve the vacuum process even more 😅

Gerhard Brueckl

06/29/2023, 11:05 AM
As I said, step 1 is distributed across the cluster, so adding more nodes would help, and probably also scaling up the master and/or the worker nodes.

Dominique Brezinski

06/29/2023, 3:44 PM
The VACUUM implementation has already been optimized to work against tables that contain more than 20M files, so I’m not sure you need to be concerned about partition filters. In fact, partition filters make no sense in the context of VACUUM. The reason is that it has to compare all the files referenced in the retained version history in the Delta log against all the files in the storage system under the table path prefix (minus prefixes that start with an underscore), and delete the files not referenced in the retained log history. Since open write transactions on the table will write files to the storage system that are not yet referenced in the log, VACUUM needs to have a safety time window so it does not delete these new files.

So, if you only looked at the subset of files in the log restricted by partition values and compared that to the listing, you would delete the data files for the rest of the table. Conversely, the files in the storage path that are not in the transaction log cannot be associated with the partition they were once in: Delta supports random data file prefixes, so the partition values are not always in the path. Given that, partition filters are nonsensical with respect to VACUUM.
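To make that concrete, the candidate selection is essentially a set difference over the whole table path. A rough illustrative sketch (this is not Delta’s actual code, and the inputs are hypothetical):
```
# Illustrative sketch of VACUUM candidate selection, NOT Delta's real code.
# "referenced" would come from replaying the retained Delta log versions.
import time
from pathlib import Path

def vacuum_candidates(table_path: str, referenced: set, retention_s: float):
    """Yield files under the table path that are not referenced in the
    retained log history and are older than the safety window."""
    now = time.time()
    root = Path(table_path)
    for f in root.rglob("*"):
        if not f.is_file():
            continue
        rel = f.relative_to(root)
        # Skip the transaction log and anything under an underscore prefix.
        if any(part.startswith("_") for part in rel.parts):
            continue
        if str(rel) in referenced:
            continue
        if now - f.stat().st_mtime > retention_s:  # the safety time window
            yield f

# Restricting "referenced" to one partition's files would make every other
# partition's data look unreferenced, and it would be deleted.
```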

Sherlock Beard

06/29/2023, 4:19 PM
VACUUM is a non-blocking operation, which means it can be run at the same time as write operations after you have done compaction.

Dominique Brezinski

06/29/2023, 4:32 PM
@Sherlock Beard Yes, but only when your retention interval is longer than any open write transaction. The default retention interval is 7 days. Running VACUUM with RETAIN 0 HOURS, for example, is incompatible with active writes.
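For reference, the retention window can be set explicitly; going below the 7-day default requires disabling a safety check (table name is an example, and disabling the check is only safe when no reader or writer still needs the older files):
```
# Example: vacuum with an explicit retention window shorter than the default.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Delta refuses retention below the default unless this check is disabled.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM events RETAIN 24 HOURS")
```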
🙇‍♂️ 3

Javier Carbajo

06/30/2023, 9:10 AM
My thoughts about filtering the vacuum came from my table being append-only: it keeps getting bigger and bigger, so I assumed the vacuum would take longer and longer, and I asked hoping to find a workaround. But I totally understand what you are saying. Thank you all for the responses!

Dominique Brezinski

06/30/2023, 11:33 AM
Being appendOnly doesn’t make it worse. The only operations that take longer as the object count in the table increases are the underlying list operation against the storage system and the log traversal. Since you are optimizing recent partitions and vacuuming on a regular schedule, the number of objects that need to be deleted by each vacuum will only vary with your recent write volume (more volume == more objects to compact). The key is to just keep doing these table maintenance operations on a regular schedule, and it will all be fine ;)

Michael Nacey

07/05/2023, 4:56 PM
Ya, just vacuum often to keep the runtime down as others have said. If you wait for a year to vacuum, you will be sad: trust me.