
Hendry Suwanda

01/27/2023, 2:49 PM
Hi all, I have a readStream job on a Delta table, but the job failed due to an anomaly in one partition that contains many files (around 5000). I ran compaction for that partition, and when I check with `describe detail` the total number of files has been reduced. However, when I run the readStream job, it still reads the old files (scanning more than 5000 files). I then tried running `vacuum(0)`, but the readStream still tries to read files that were deleted by the vacuum command. A plain `select`, on the other hand, runs without error. Is there any way to invalidate the Delta table? Why does the readStream still read files that have been deleted? Note: I use Delta version 1.0.0. Thanks
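(For context, a minimal PySpark sketch of the sequence described above, assuming an existing `spark` session with Delta configured; the table path is a placeholder, and `vacuum(0)` assumes the retention-duration safety check has been disabled.)

```python
from delta.tables import DeltaTable

table_path = "/data/events"  # placeholder path

# File count reported for the current table version after compaction
spark.sql(f"DESCRIBE DETAIL delta.`{table_path}`").select("numFiles").show()

# vacuum(0) deletes every file not referenced by the current version;
# it only runs if the retention-duration safety check is disabled
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
DeltaTable.forPath(spark, table_path).vacuum(0)

# The streaming read that still tries to scan the pre-compaction files
stream_df = spark.readStream.format("delta").load(table_path)
```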

Gerhard Brueckl

01/27/2023, 7:26 PM
compaction/optimize has no impact on your stream. It creates a new version in your delta log, but since compaction/optimize only changes the file layout and not the data itself, those commits are not considered for streaming. If they were, you would stream the very same data twice
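(A sketch of the compaction pattern being discussed, roughly as documented for Delta 1.x: the compaction commit is marked with dataChange=false so streams do not re-process it. The path, partition predicate, and file count below are placeholders.)

```python
table_path = "/data/events"           # placeholder path
partition = "date = '2023-01-01'"     # placeholder partition predicate
num_files = 16                        # target number of files after compaction

(spark.read
    .format("delta")
    .load(table_path)
    .where(partition)
    .repartition(num_files)
    .write
    .option("dataChange", "false")    # layout-only commit, skipped by streams
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", partition)
    .save(table_path))
```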

Hendry Suwanda

01/28/2023, 1:19 AM
hi @Gerhard Brueckl, thanks for your reply. I see. But after I run vacuum(0) and re-run the streaming job (the checkpoint is still empty), why is the stream still trying to read a file deleted by vacuum? I assumed that after vacuum(0) the Delta history would be clean.

Gerhard Brueckl

01/30/2023, 8:05 AM
yes, but your stream is broken because it is missing some data. Let's say your stream has read up to the 2nd of January and you get new data every day. If you run `vacuum` on the 5th and delete all changes before that, then restart the stream, it will complain that the files between the 2nd and the 5th are missing. You could use ``ignoreMissingFiles``, but obviously the changes between the 2nd and the 5th will then be missing from your stream: https://learn.microsoft.com/en-us/azure/databricks/kb/delta/file-transaction-log-not-found
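(A minimal sketch of that workaround, assuming it refers to the Spark setting `spark.sql.files.ignoreMissingFiles`; the data in the vacuumed files is skipped, not recovered, and the path is a placeholder.)

```python
# Skip data files that are referenced in the Delta log but were removed by vacuum
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

stream_df = (spark.readStream
    .format("delta")
    .load("/data/events"))  # placeholder path
```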

Dominique Brezinski

01/30/2023, 5:08 PM
Stream reads progress through the log, starting at the first version and moving forward. Any optimization/compaction sets dataChange=false in the version it commits; the stream reads the data from the earliest add action it encounters and skips later add actions with dataChange=false that were created by optimizations. Now that you have broken the version history by aggressively vacuuming, you need to restart the stream with both a new checkpoint and a starting version after the vacuum.
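(A sketch of that restart, assuming the `startingVersion` option supported by the Delta streaming source; the paths, the sink, and the way the version is picked are placeholders.)

```python
table_path = "/data/events"  # placeholder path

# Pick the current table version, i.e. a version after the vacuum
latest_version = (spark.sql(f"DESCRIBE HISTORY delta.`{table_path}`")
                  .selectExpr("max(version)")
                  .first()[0])

(spark.readStream
    .format("delta")
    .option("startingVersion", latest_version)  # do not replay the broken history
    .load(table_path)
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/events_v2")  # new checkpoint
    .start("/data/events_sink"))                              # placeholder sink
```

Note that with `startingVersion` only changes committed at or after that version are streamed; rows already present in the table at that version are not replayed.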

vijay

01/30/2023, 10:10 PM
Does z-order or optimize re-arrange data across files?

Dominique Brezinski

01/30/2023, 11:12 PM
Yes, unless you restrict it by partition predicate.
Of course, that "yes" really means it evaluates all files; it may not choose to rewrite a specific file for various reasons.
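(A sketch of restricting optimize to a single partition, assuming a runtime where the OPTIMIZE command is available, which is newer than the Delta 1.0.0 mentioned above; the path and column names are placeholders.)

```python
spark.sql("""
    OPTIMIZE delta.`/data/events`
    WHERE date = '2023-01-01'   -- only this partition is evaluated and rewritten
    ZORDER BY (user_id)
""")
```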