
Jack

04/22/2023, 4:49 AM
Hello Delta community, what is the best and recommended way to ingest data from multiple sources (like many S3 buckets) into Delta using Spark Structured Streaming? spark.readStream......load()

Jim Hibbard

04/22/2023, 5:23 AM
"Best" is hard to pin down without knowing more about your needs / requirements. But assuming you mean the standard way to ingest data handled by multiple separate clusters into a single Delta Table sink in S3... then you might want to take a look at this blog. You can have each Spark cluster write to the same Delta Table in S3 as long as you have a service that handles maintaining atomicity of writes. There are standard DynamoDB implementations, delta docs here! Hope that helps 😁
And if your setup is one cluster with multiple S3 sources and a single Delta Table sink, the setup is even easier. Just start a separate stream for each source to your single sink 👍
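Rough PySpark sketch of that pattern (bucket paths, schema, and checkpoint locations are all placeholders):
```
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Placeholder schema; streaming file sources need one declared up front.
events_schema = StructType([
    StructField("id", StringType()),
    StructField("event_time", TimestampType()),
])

sources = ["s3a://bucket-a/events/", "s3a://bucket-b/events/"]  # placeholder buckets
target = "s3a://my-lake/delta/events"                           # placeholder table path

# One independent stream per source, all appending to the same Delta table.
for i, src in enumerate(sources):
    (spark.readStream
        .format("json")                    # whatever your source file format is
        .schema(events_schema)
        .load(src)
        .writeStream
        .format("delta")
        .option("checkpointLocation", f"{target}/_checkpoints/src_{i}")  # separate checkpoint per stream
        .start(target))
```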

Jack

04/22/2023, 7:10 AM
So, multiple DataFrames, one per data stream load(), then some unionDF = df1.union(df2), and then unionDF.writeStream.format("delta").options().start(). Does this look like a good choice, or should I just create separate Spark clusters per S3 source and write from each stream?
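Something like this sketch is what I mean (placeholder paths, reusing a placeholder schema like the one above):
```
from functools import reduce

paths = ["s3a://bucket-a/events/", "s3a://bucket-b/events/"]  # placeholder buckets
streams = [
    spark.readStream.format("json").schema(events_schema).load(p)
    for p in paths
]

# Union all the source streams and write them through a single sink.
union_df = reduce(lambda left, right: left.union(right), streams)

(union_df.writeStream
    .format("delta")
    .option("checkpointLocation", "s3a://my-lake/delta/events/_checkpoints/union")  # placeholder
    .start("s3a://my-lake/delta/events"))
```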
My case is a simple one: multiple S3 sources, just dumped into Delta.
Also, is .format("sqs") better than plain S3 listing? Is there any optimization done on the Spark or Databricks side to optimize reading directly from S3 instead of going through SNS/SQS?
@Jim Hibbard please see if you can give some inputs here