
Kilic Ali-Firat

04/18/2023, 11:40 AM
Hello team! I would like some clarification about storing the metadata on S3. Let's say that I'm consuming my input data either from S3 or from Kafka, and I write to a Delta Lake table that will be stored in S3 (metadata included). Could I face a consistency issue if I parallelize the writes to the same Delta Lake table inside the same component? It will be the only component that writes to the Delta Lake tables. Thanks 🙂

Leandro Rouberte

04/18/2023, 1:34 PM
Hi, Kilic. I think this answers your question: https://docs.delta.io/latest/delta-storage.html#multi-cluster-setup. If I understood correctly, you have two different Spark workloads, one consuming from S3 and the other from Kafka. If they are managed by different Spark driver processes, I believe you will be in the multi-cluster writer scenario, where you will need to configure a DynamoDB integration to guarantee write-after-write consistency. If both workloads are managed by the same Spark driver process, then you'll be in this scenario, https://docs.delta.io/latest/delta-storage.html#single-cluster-setup-default, and nothing needs to be done; you will already have write-after-write consistency. Also, I'm assuming that when you mentioned "Iceberg" you meant Delta; otherwise I didn't understand the question and you may disregard this answer 😅.
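For reference, a minimal sketch of the multi-cluster configuration from the linked docs (the bucket, DynamoDB table, and region below are placeholders, not anything from this thread):
```scala
import org.apache.spark.sql.SparkSession

// Sketch only: multi-cluster writes to S3 route commits through a DynamoDB-backed
// LogStore so concurrent writers from different driver processes get mutually
// exclusive commits. Bucket, table name, and region are placeholders.
val spark = SparkSession.builder()
  .appName("delta-s3-multi-cluster-writer")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .config("spark.delta.logStore.s3a.impl", "io.delta.storage.S3DynamoDBLogStore")
  .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
  .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
  .getOrCreate()

spark.range(10).write.format("delta").mode("append").save("s3a://my-bucket/events")
```
(This also needs the delta-storage-s3-dynamodb artifact on the classpath; in the default single-driver setup none of it is necessary.)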

Kilic Ali-Firat

04/18/2023, 1:52 PM
Hi @Leandro Rouberte Thanks for your answer. My use case is a bit simpler than that (sorry if it was not clear). I have:
1. a source (either Kafka or S3)
2. a data processing component that transforms the input data to the Delta Lake format
3. a single sink, which is a Delta Lake table
It's the component in step 2 (I plan to use Delta Standalone in plain Scala) that will do the insertions in Delta Lake. My question about this component: if I try to parallelize the insertions with N writers into the same table, could I face consistency issues?
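A minimal sketch of what a single write through Delta Standalone looks like in plain Scala (the table path, file metadata, and engine string are hypothetical; this illustrates the commit path, not a recommended way to parallelize):
```scala
import java.util.Collections
import org.apache.hadoop.conf.Configuration
import io.delta.standalone.{DeltaLog, Operation}
import io.delta.standalone.actions.AddFile

// Sketch only: commit one already-written Parquet file into the table's log.
// Assumes the table already exists (a first commit would also need a Metadata action).
val log = DeltaLog.forTable(new Configuration(), "s3a://my-bucket/my-delta-table")

val txn = log.startTransaction()
val addFile = new AddFile(
  "part-00000-example.snappy.parquet",     // file path relative to the table root (placeholder)
  Collections.emptyMap[String, String](),  // partition values
  1024L,                                   // size in bytes (placeholder)
  System.currentTimeMillis(),              // modification time
  true,                                    // dataChange
  null,                                    // stats
  null)                                    // tags

// commit() uses optimistic concurrency and detects conflicting commits, but on S3 the
// default LogStore does not make concurrent commits from separate processes mutually
// exclusive; that is what the multi-cluster/DynamoDB setup linked above addresses.
txn.commit(Collections.singletonList(addFile),
  new Operation(Operation.Name.WRITE), "my-scala-writer/0.1.0")
```
If the N parallel writers all run inside this single component/JVM, they share one LogStore, which is the single-cluster case from the link above.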