Kilic Ali-Firat

04/18/2023, 11:40 AM
Hello team! I would like some clarification about storing the metadata on S3. Let's say I'm consuming my input data either from S3 or from Kafka, and I write to a Delta Lake table that will be stored in S3 (metadata included). Could I face a consistency issue if I parallelize the writes to the same Delta Lake table inside the same component? It will be the only component that writes to the Delta Lake tables. Thanks 🙂

Leandro Rouberte

04/18/2023, 1:34 PM
Hi, Kilic. I think this answers your question. If I understood correctly, you have two different Spark workloads, one consuming from S3 and the other from Kafka. If they are managed by different Spark driver processes, I believe you will be in the multi-cluster writer scenario, where you will need to configure a DynamoDB integration to guarantee write-after-write consistency. If both workloads are managed by the same Spark driver process, then you'll be in the single-cluster scenario and nothing needs to be done; you will already have write-after-write consistency. Also, I'm assuming that where you wrote "Iceberg" you meant Delta; otherwise I didn't understand the question and you may disregard this answer 😅.
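For the multi-cluster case Leandro describes, a minimal configuration sketch, assuming a Spark job with delta-spark, hadoop-aws, and delta-storage-s3-dynamodb on the classpath; the bucket, DynamoDB table name, and region below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-s3-writer")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  // Route s3a:// Delta log commits through the DynamoDB-backed LogStore so
  // that writers running on different drivers get write-after-write consistency.
  .config("spark.delta.logStore.s3a.impl", "io.delta.storage.S3DynamoDBLogStore")
  .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
  .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
  .getOrCreate()

// Both workloads can then append to the same table path safely.
spark.range(10).write.format("delta").mode("append").save("s3a://my-bucket/events")
```

The point of the DynamoDB LogStore is that commits from different drivers are serialized through conditional writes to the DynamoDB table, instead of relying on S3 alone, which does not provide the atomic put-if-absent semantics the Delta log needs across processes.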

Kilic Ali-Firat

04/18/2023, 1:52 PM
Hi @Leandro Rouberte, thanks for your answer. My use case is a bit simpler than that (sorry if it was not clear). I have: 1. a source (either Kafka or S3); 2. a data-processing component that transforms the input data into the Delta Lake format; 3. a single sink, which is a Delta Lake table. It's the component in step 2 (I plan to use Delta Standalone in plain Scala) that will do the insertions into Delta Lake. My question about this component: if I parallelize the insertions with N writers to the same table, could I face consistency issues?
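For this single-component case, Delta Standalone uses optimistic concurrency control: each writer starts its own transaction, and a commit that conflicts with another writer's commit throws an exception and should be retried. A minimal sketch of that retry loop, assuming delta-standalone (plus the relevant Hadoop S3 jars) on the classpath; the table path, file metadata, and engine string are placeholders:

```scala
import io.delta.standalone.{DeltaLog, Operation}
import io.delta.standalone.actions.AddFile
import io.delta.standalone.exceptions.DeltaConcurrentModificationException
import org.apache.hadoop.conf.Configuration
import scala.collection.JavaConverters._

object ParallelDeltaWriter {
  val log: DeltaLog = DeltaLog.forTable(new Configuration(), "s3a://my-bucket/my-table")

  // Describes a Parquet data file that has already been written under the
  // table root; sizes and times here are placeholder values.
  def exampleAddFile(): AddFile =
    new AddFile(
      "part-00000.parquet",             // path relative to the table root
      Map.empty[String, String].asJava, // partition values (unpartitioned table)
      1024L,                            // file size in bytes
      System.currentTimeMillis(),       // modification time
      true,                             // dataChange
      null, null)                       // stats, tags

  // Commit one file to the Delta log, retrying on optimistic-concurrency
  // conflicts raised by other parallel writers. After maxRetries the
  // exception propagates to the caller.
  def commitFile(file: AddFile, maxRetries: Int = 5): Unit = {
    var attempt = 0
    var done = false
    while (!done) {
      try {
        // Each transaction reads the latest table state when it starts.
        val txn = log.startTransaction()
        txn.commit(Iterable(file).asJava, new Operation(Operation.Name.WRITE),
          "my-scala-writer/1.0.0")
        done = true
      } catch {
        case _: DeltaConcurrentModificationException if attempt < maxRetries =>
          attempt += 1 // a concurrent commit won; re-read state and retry
      }
    }
  }
}
```

As I understand the docs, within a single JVM the default S3 LogStore already serializes commits across threads, so for N parallel writers inside one component the practical concern is handling these conflict exceptions and retrying, not table corruption; the DynamoDB setup only becomes necessary once writers span multiple driver processes.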