
Dan Loftus

08/31/2023, 3:14 PM
Hello... I have a process streaming from an Azure Event Hub to Delta. I am seeing some records in the Delta dataset with the same partition and offset. Any ideas under what circumstances we can get duplicates? Given it's the same offset and partition, I'm assuming the duplicates are not in the stream but created during the sink to Delta.

rtyler

08/31/2023, 3:15 PM
I am not familiar with Azure Event Hubs, does it provide at-most-once delivery?

Dan Loftus

08/31/2023, 3:18 PM
It's basically a managed Kafka... I'm not sure of the exact semantics; I think if you used a consumer group it might provide at-most-once delivery, but isn't Spark Streaming with checkpointing handling the consumer offsets?
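(For context on how duplicates like this typically arise: under at-least-once delivery, a sink can write a batch and then crash before the consumer checkpoint is advanced, so the same (partition, offset) records are replayed on restart. A minimal plain-Python sketch of that failure mode, with hypothetical offsets and a stand-in list for the Delta table; this is not Event Hubs or Spark code:)

```python
# Simulate at-least-once replay: the sink wrote a batch, but the
# consumer checkpoint was not advanced before a crash, so the same
# (partition, offset) records are delivered again on restart.

stream = [
    {"partition": 0, "offset": 10, "body": "a"},
    {"partition": 0, "offset": 11, "body": "b"},
]

sink = []  # stands in for the Delta table

def write_batch(records):
    sink.extend(records)

write_batch(stream)  # first delivery succeeds...
write_batch(stream)  # ...checkpoint lost, the batch is replayed

# Duplicate rows share an identical (partition, offset) key
duplicates = len(sink) - len({(r["partition"], r["offset"]) for r in sink})
print(duplicates)  # -> 2

# An idempotent sink dedupes on that key instead of blindly appending
deduped = {}
for r in sink:
    deduped[(r["partition"], r["offset"])] = r
print(len(deduped))  # -> 2 unique records
```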

rtyler

08/31/2023, 3:19 PM
I cannot speak to what Spark may provide. I know that in the implementation of kafka-delta-ingest we did some extra bookkeeping to minimize the potential for duplicate records in a (topic, partition) tuple.

Dan Loftus

08/31/2023, 3:20 PM
Let me take a look at this... I am using the Azure Event Hubs library.

rtyler

08/31/2023, 3:22 PM
Unless you're fluent in Rust, and even then 😆, this code may not be helpful! https://github.com/delta-io/kafka-delta-ingest/blob/main/doc/DESIGN.md#end-to-end-transactions--kafka-offset-handling has some details which could be compelling. Basically it comes down to tracking reads from the Kafka topic+partition all the way through to the Delta transaction log. That said, I don't want to overcomplicate things; I would hope that whatever reads from Event Hubs can help handle some of the durability needs you're concerned with.
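(The idea in that DESIGN.md is that the writer records the highest offset it committed per partition inside the transaction log itself, so a replayed batch can be recognized and skipped. A rough Python sketch of that pattern, using hypothetical structures; this is not the actual kafka-delta-ingest implementation:)

```python
# Sketch of offset bookkeeping in the transaction log: each commit
# records the highest offset written per (app, partition), and a
# replayed batch whose offsets are already covered becomes a no-op.
# Hypothetical structures -- not the real kafka-delta-ingest code.

txn_log = {}  # (app_id, partition) -> highest committed offset

def commit_batch(app_id, partition, records):
    """Write only records that advance past the committed offset."""
    high = txn_log.get((app_id, partition), -1)
    fresh = [r for r in records if r["offset"] > high]
    if not fresh:
        return 0  # whole batch already committed; idempotent no-op
    txn_log[(app_id, partition)] = max(r["offset"] for r in fresh)
    return len(fresh)  # rows actually written in this transaction

batch = [{"offset": o, "body": "x"} for o in range(5)]
print(commit_batch("ingest", 0, batch))  # -> 5 rows written
print(commit_batch("ingest", 0, batch))  # -> 0, replay skipped
```

Delta's `txn` action (appId/version) gives a sink exactly this kind of durable marker to check before writing, which is how at-least-once delivery can still yield exactly-once results in the table.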

Dan Loftus

08/31/2023, 3:23 PM
I appreciate it... I guess I always assumed there was tracking of that transaction all the way through (without me knowing the details 🙂). I never saw this happen using Kinesis -> S3 Delta storage.