Jiri Vinarek

06/12/2023, 2:01 PM
Hi all, I'd like to do data deduplication on a streaming delta table and have the result also saved as a streaming delta table. I've found this guide in the Databricks documentation, but I don't understand this bit:
"In another streaming query, you can continuously read deduplicated data from this Delta table. This is possible because an insert-only merge only appends new data to the Delta table."
As I understand it, the target table will not be a streaming table. I can get around this by enabling change data feed and then consuming INSERT operations from the feed. Is this the proposed way? Or is there another way to get a deduplicated streaming table?

Gerhard Brueckl

06/13/2023, 6:56 AM
That code (as I read it) will simply not add duplicates if the same unique key was already added within the last 7 days. The target Delta table (the one you INSERT/MERGE INTO) will therefore not contain any duplicates and can be used as the source for another streaming operation; the logic above ensures it stays duplicate-free (at least within a 7-day timespan).
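(For reference, a minimal sketch of that insert-only merge pattern, applied per micro-batch with foreachBatch. The table and column names here, events_raw, events_dedup, uniqueId, eventTime, as well as the 7-day window, are illustrative assumptions, not taken from the thread; `spark` is the ambient SparkSession.)
```python
# Minimal sketch: deduplicate a streaming source into a Delta table with an
# insert-only MERGE, so the target only ever receives appends.
from delta.tables import DeltaTable

def upsert_deduped(micro_batch_df, batch_id):
    target = DeltaTable.forName(spark, "events_dedup")
    (target.alias("t")
        .merge(
            # Drop duplicates within the batch first, then only insert keys
            # not already present in the target (checked over a 7-day window).
            micro_batch_df.dropDuplicates(["uniqueId"]).alias("s"),
            "t.uniqueId = s.uniqueId AND "
            "t.eventTime >= current_timestamp() - INTERVAL 7 DAYS")
        .whenNotMatchedInsertAll()  # insert-only merge: no updates, no deletes
        .execute())

(spark.readStream.table("events_raw")
    .writeStream
    .foreachBatch(upsert_deduped)
    .option("checkpointLocation", "/tmp/checkpoints/events_dedup")
    .start())
```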

Jiri Vinarek

06/13/2023, 7:08 AM
Thank you Gerhard! I thought I needed to write the table with writeStream to have a "streaming" delta table. I think I got this part wrong. As I understand it, just appending to a delta table with any operation will work, and I can then use the table as a source for Structured Streaming.

Gerhard Brueckl

06/13/2023, 7:09 AM
Exactly, any Delta table can be used as a source or target for a stream. You need to take care of deduplication though, which is what the code above helps with.
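(For completeness, a sketch of consuming the deduplicated table as a second stream, using the same assumed table name as above:)
```python
# Any Delta table can be read as a streaming source. By default the Delta
# source requires append-only changes, which the insert-only merge guarantees.
dedup_stream = spark.readStream.table("events_dedup")

query = (dedup_stream.writeStream
    .format("console")  # or any sink; shown here just to start the stream
    .option("checkpointLocation", "/tmp/checkpoints/dedup_consumer")
    .start())
```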