
Vikas Kumar

08/02/2023, 10:22 AM
Hello everyone, I'm working with a streaming DataFrame that contains duplicate entries (entries get appended with updated data), and I'm performing aggregation operations on it. My goal is to eliminate the duplicates and retain only the most recent entry for each key. The issue I'm facing is that the default behavior of `dropDuplicates` keeps the first entry it sees, which is not what I need. I attempted to address this by sorting the DataFrame by date first, but I got an error stating that sorting is not supported on a streaming DataFrame. I would greatly appreciate any suggestions or insights you might have.
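To make the goal concrete: the asker wants to keep only the newest record per key, regardless of arrival order. A toy illustration in plain Python (not Spark code; the `id`/`ts` field names are made up):

```python
def keep_latest(records):
    """Keep only the most recent record per key, regardless of arrival order."""
    latest = {}
    for rec in records:
        key, ts = rec["id"], rec["ts"]
        # Replace the stored record only when the incoming one is newer.
        if key not in latest or ts > latest[key]["ts"]:
            latest[key] = rec
    return list(latest.values())

events = [
    {"id": 1, "ts": 1, "value": "old"},
    {"id": 1, "ts": 3, "value": "new"},
    {"id": 2, "ts": 2, "value": "only"},
    {"id": 1, "ts": 2, "value": "mid"},  # arrives late, but still older than ts=3
]
print(keep_latest(events))
# → [{'id': 1, 'ts': 3, 'value': 'new'}, {'id': 2, 'ts': 2, 'value': 'only'}]
```

This is exactly the comparison a streaming engine has to make without a global sort, which is why the Spark-side answers below all hinge on per-key state or a keyed merge rather than on sorting the stream.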

Dominique Brezinski

08/02/2023, 1:22 PM
You can either use a window function with a watermark; do the deduplication with a window in foreachBatch (but this only dedups within the micro-batch, so it is not a global dedup); use an arbitrary stateful operation (e.g. flatMapGroupsWithState); or do a local dedup in foreachBatch followed by a Merge operation into the table to get the global dedup.
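A minimal PySpark sketch of the last option (local dedup in foreachBatch, then a Delta MERGE for the global dedup). This is an untested outline, not a drop-in solution: the table name `events`, the key column `id`, the timestamp column `event_ts`, and the checkpoint path are all illustrative assumptions.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F, Window

def upsert_latest(batch_df, batch_id):
    # Local dedup: inside foreachBatch the micro-batch is a static
    # DataFrame, so a window sort is allowed; keep the newest row per key.
    w = Window.partitionBy("id").orderBy(F.col("event_ts").desc())
    latest = (batch_df
              .withColumn("rn", F.row_number().over(w))
              .where("rn = 1")
              .drop("rn"))

    # Global dedup: MERGE into the target table, overwriting a matched
    # row only when the incoming row is strictly newer.
    target = DeltaTable.forName(batch_df.sparkSession, "events")
    (target.alias("t")
           .merge(latest.alias("s"), "t.id = s.id")
           .whenMatchedUpdateAll("s.event_ts > t.event_ts")
           .whenNotMatchedInsertAll()
           .execute())

(stream_df.writeStream
    .foreachBatch(upsert_latest)
    .option("checkpointLocation", "/tmp/checkpoints/events")  # illustrative path
    .start())
```

The strict `s.event_ts > t.event_ts` condition also makes the sink tolerant of foreachBatch's at-least-once semantics: replaying a micro-batch just re-applies updates that are no longer newer, so the table converges to the latest row per key.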