
Gal Stainfeld

08/21/2023, 8:40 PM
Hi all, I'm using Spark Structured Streaming to process events from a Delta table by defining a `foreachBatch` method that works on new events. The Delta table is partitioned at its highest level by the event's `date`, and within that by `normalized_at` date. Is there a way to tell the stream to only look for new data each interval from the `normalized_at` partitions that correspond to today and yesterday? I'm not sure exactly how the stream knows where on the table to pick up new events from since the last checkpoint, so maybe utilizing the table's partitions can help it. Thanks.
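(A minimal sketch of the setup being described, assuming a hypothetical table path `/data/events`, checkpoint location `/chk/events`, and output path `/tmp/events_out`:)

```python
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("events-stream").getOrCreate()

def process_batch(batch_df: DataFrame, batch_id: int) -> None:
    # Work on the new events delivered in this micro-batch
    # (placeholder: append them to a hypothetical sink).
    batch_df.write.format("delta").mode("append").save("/tmp/events_out")

query = (
    spark.readStream.format("delta")
    .load("/data/events")  # partitioned by date, then normalized_at
    .writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/chk/events")
    .start()
)
```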

Nick Karpov

08/22/2023, 4:47 PM
the stream works based on the transaction log and must read the whole thing sequentially, i.e. there is no way to a priori select only parts of it (assuming I've understood you correctly)... you'll have to filter out whatever is irrelevant in a given batch inside the `foreachBatch`
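(A sketch of what that filtering could look like inside the batch function, assuming the same hypothetical `process_batch` and sink path as above; the `normalized_at >= yesterday` predicate is one possible way to express "today and yesterday":)

```python
import datetime
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def process_batch(batch_df: DataFrame, batch_id: int) -> None:
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    # Drop everything outside the normalized_at partitions for
    # today and yesterday before doing the real work.
    relevant = batch_df.filter(F.col("normalized_at") >= F.lit(yesterday))
    relevant.write.format("delta").mode("append").save("/tmp/events_out")
```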

Gal Stainfeld

08/22/2023, 4:53 PM
Thanks for the answer @Nick Karpov. I just want to clarify - since I have the `normalized_at` date as a partition, I know the new events on the table must come from the partitions for today or yesterday. So the question is whether putting such a filter in the `foreachBatch` will help the stream in any way (because it would focus on reading only two partitions when getting the new data for the batch), or whether it won't matter at all.

Nick Karpov

08/22/2023, 5:03 PM
hmm, good question, I'm not sure how partition filtering (and nested partitions) will work in this context... I would try the variations and look at the Spark UI to see what actually ends up getting read
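(One hypothetical way to inspect this, assuming an active `spark` session and the same `/data/events` path as above: apply the same filter in a batch read and check whether it shows up under `PartitionFilters` in the physical plan, alongside watching the SQL tab in the Spark UI while the stream runs:)

```python
import datetime
from pyspark.sql import functions as F

yesterday = datetime.date.today() - datetime.timedelta(days=1)

# If pruning kicks in, the FileScan node in the physical plan lists the
# predicate under PartitionFilters and far fewer files are scanned.
df = (
    spark.read.format("delta")
    .load("/data/events")
    .filter(F.col("normalized_at") >= F.lit(yesterday))
)
df.explain(True)
```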