Reginaldo Silva

01/28/2023, 9:56 PM
Hello everyone. I'm trying to estimate how much space VACUUM can free up. A DRY RUN on Databricks only returns the first 1,000 files, and only their names, without file sizes. I'm trying to understand how VACUUM lists these files. Using the last checkpoint file plus the latest commit JSONs is not working, so I'm trying to read the source code of Vacuum and LogStore, but it's a bit difficult for me because I don't know Scala. Does anyone know how VACUUM lists the files to be removed? Does it read the Delta log, or does it list all files in the table's storage and subtract the files in the current version? @Denny Lee, can you give a tip?

Gerhard Brueckl

01/30/2023, 8:08 AM
I once ran VACUUM DRY RUN in Scala, and it produced an additional output row that at least told me the total number of files to be deleted.
On Databricks, the output showing how many files are affected is missing in %python.

Reginaldo Silva

01/30/2023, 10:58 AM
Good morning. @Gerhard Brueckl amazing, thank you so much! How can I get this text result?

Gerhard Brueckl

01/30/2023, 11:36 AM
Well, as I said, you need to run your code via a Scala interpreter and not use Spark SQL or PySpark.

Reginaldo Silva

01/30/2023, 12:11 PM
@Gerhard Brueckl OK, but is it possible to get the message from the command? When I run it in Scala I receive a DataFrame with paths, similar to PySpark. Do you know how I can get this string?
Found 14 files (121640706 bytes) and directories in a total of 98 directories that are safe to delete.
val q = sql(s"VACUUM delta.`$path` RETAIN 0 HOURS DRY RUN")

Gerhard Brueckl

01/30/2023, 12:12 PM
are you on Databricks?

Reginaldo Silva

01/30/2023, 12:12 PM
Yes

Gerhard Brueckl

01/30/2023, 12:12 PM
Hmm, then it's weird. For me this just shows up as if it were a regular print statement.
Ah, I think I get your request: you want to read that value into a variable?

Reginaldo Silva

01/30/2023, 12:13 PM
Yeah, I want to put this string in a variable.

Gerhard Brueckl

01/30/2023, 12:14 PM
The only thing that comes to mind is to redirect the output stream, but tbh I am no expert in Scala or in output streams 😄
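If the summary line can be captured as text by some means (for example by redirecting the output stream, as suggested above), pulling the numbers out of it is the easy part. The sketch below is only an illustration in Python: the message format is copied from the line quoted earlier in this thread and may differ between Delta or Databricks versions, and `parse_dry_run_summary` is a hypothetical helper name, not part of any Delta API.

```python
import re

# Pattern built from the message quoted in this thread; treat the exact
# wording as an assumption, since other versions may phrase it differently.
DRY_RUN_PATTERN = re.compile(
    r"Found (?P<files>\d+) files \((?P<bytes>\d+) bytes\) and directories "
    r"in a total of (?P<dirs>\d+) directories"
)


def parse_dry_run_summary(message: str) -> dict:
    """Extract file count, byte count and directory count from the summary line."""
    match = DRY_RUN_PATTERN.search(message)
    if match is None:
        raise ValueError("message does not look like a VACUUM DRY RUN summary")
    return {name: int(value) for name, value in match.groupdict().items()}


message = (
    "Found 14 files (121640706 bytes) and directories in a total of "
    "98 directories that are safe to delete."
)
summary = parse_dry_run_summary(message)
print(summary)  # {'files': 14, 'bytes': 121640706, 'dirs': 98}
```

Raising on a non-matching message keeps a monitor from silently recording zeros if the wording ever changes.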

Reginaldo Silva

01/30/2023, 12:56 PM
Thanks @Gerhard Brueckl, that looks very complicated, haha. My initial goal is to estimate how much space VACUUM can free up and build a monitor on it. I do this by reading the storage, calculating how much space the table occupies, and then subtracting the size of the latest version. It works, but reading all the folders and files of each table is very slow. So I'm trying to understand how the VACUUM dry run does this, but it's not easy; I've tried reading the checkpoints and the JSON logs.
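The "list storage, then subtract what the table still references" approach described above can be sketched as follows. This is a minimal illustration and not the actual VACUUM implementation: `StoredFile` and `estimate_reclaimable` are hypothetical names, the listing and the set of referenced paths are assumed to have been collected elsewhere, and the only rules modelled are a retention cutoff on modification time (default 168 hours, matching Delta's documented 7-day default) and skipping the `_delta_log` directory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredFile:
    path: str         # path relative to the table root (hypothetical layout)
    size: int         # size in bytes
    modified_ms: int  # last-modified time in epoch milliseconds


def estimate_reclaimable(listing, referenced_paths, now_ms, retention_hours=168):
    """Return (file_count, total_bytes) of files that look safe to delete.

    A file counts as reclaimable here if it is not referenced by the table,
    is older than the retention cutoff, and is not inside _delta_log.
    """
    cutoff_ms = now_ms - retention_hours * 3600 * 1000
    candidates = [
        f for f in listing
        if f.path not in referenced_paths
        and f.modified_ms < cutoff_ms
        and not f.path.startswith("_delta_log/")
    ]
    return len(candidates), sum(f.size for f in candidates)


# Made-up example data: "now" is 10 days after epoch 0.
now_ms = 10 * 24 * 3600 * 1000
listing = [
    StoredFile("part-0001.parquet", 100, 0),                      # old, unreferenced
    StoredFile("part-0002.parquet", 250, 0),                      # old, still referenced
    StoredFile("part-0003.parquet", 400, now_ms - 3600 * 1000),   # too recent
    StoredFile("_delta_log/00000000000000000000.json", 50, 0),    # log file, skipped
]
referenced = {"part-0002.parquet"}

print(estimate_reclaimable(listing, referenced, now_ms))  # (1, 100)
```

The sketch does not make the storage listing itself any faster; it only separates the listing step from the subtraction step so that each can be measured or cached on its own.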

Gerhard Brueckl

01/30/2023, 12:57 PM
You can also find information about how much has changed in the Delta log. Wouldn't that be sufficient?

Reginaldo Silva

01/30/2023, 1:05 PM
I've tried summing and counting the "remove" column for files that are no longer part of the current version, but the result doesn't match the VACUUM DRY RUN output. I also tried looking at the last checkpoint.
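For reference, the log-based attempt described above can be sketched like this. It is an illustration only, with `replay_commits` as a hypothetical helper and made-up commit contents. It relies on two things the Delta protocol does specify: each line of a commit file is one JSON action, and the `size` field of a `remove` action is optional. Possible reasons for a mismatch with DRY RUN, offered as assumptions rather than confirmed causes: the log knows nothing about files left behind by failed writes, older commit files may already have been cleaned up, and VACUUM also applies a retention threshold.

```python
import json


def replay_commits(commit_files):
    """Replay commit file contents (strings, in version order).

    Returns (active, removed), both mapping path -> size in bytes.
    A removed file's size falls back to the size seen in its earlier
    "add" action when the "remove" action carries no size.
    """
    active = {}
    removed = {}
    for content in commit_files:
        for line in content.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                add = action["add"]
                active[add["path"]] = add.get("size")
                removed.pop(add["path"], None)
            elif "remove" in action:
                rem = action["remove"]
                known_size = active.pop(rem["path"], None)
                removed[rem["path"]] = rem.get("size", known_size)
    return active, removed


# Made-up commits: version 0 adds two files, version 1 rewrites one of them.
commit0 = "\n".join([
    json.dumps({"add": {"path": "a.parquet", "size": 100}}),
    json.dumps({"add": {"path": "b.parquet", "size": 200}}),
])
commit1 = "\n".join([
    json.dumps({"remove": {"path": "a.parquet", "deletionTimestamp": 1}}),
    json.dumps({"add": {"path": "c.parquet", "size": 300}}),
])

active, removed = replay_commits([commit0, commit1])
print(sum(size for size in removed.values() if size is not None))  # 100
```

Comparing the `removed` paths against an actual storage listing would show which files the log cannot account for.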