
Chandra

03/23/2023, 8:11 PM
Hi all, need guidance on Delta Lake. I am constantly getting data from DMS into S3 as Parquet. Sometimes I see 5 files arriving within a minute due to changes in the source RDBMS table. How can I capture the changes in the target Delta Lake efficiently? Can I make it a streaming architecture? My application is entirely in AWS.
It looks like I can create a Spark Streaming app polling S3, based on this URL: https://stackoverflow.com/questions/30994401/spark-streaming-on-a-s3-directory This is what I am thinking: DMS ==> S3 ==> Spark Streaming ==> Delta Merge. Do I need to run "OPTIMIZE deltatable ZORDER BY (X, Y)" after each MERGE command to make sure the data is reorganized for the next micro-batch run of the Spark streaming job?
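Roughly, the pipeline I have in mind would look like this (just a sketch; the bucket paths, schema, and key column are placeholders, not my real table):
```python
# Sketch of DMS ==> S3 ==> Spark Structured Streaming ==> Delta MERGE.
# Paths, schema, and key columns are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F, Window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from delta.tables import DeltaTable

spark = (SparkSession.builder
         .appName("dms-cdc-to-delta")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Schema of the DMS CDC parquet files (placeholder columns)
cdc_schema = StructType([
    StructField("Op", StringType()),          # I / U / D flag added by DMS CDC files
    StructField("id", StringType()),
    StructField("payload", StringType()),
    StructField("updated_at", TimestampType()),
])

source = (spark.readStream
          .schema(cdc_schema)
          .parquet("s3://my-bucket/dms/my_table/"))

def upsert_to_delta(batch_df, batch_id):
    # Keep only the latest change per key within the micro-batch, then MERGE.
    w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
    latest = (batch_df
              .withColumn("rn", F.row_number().over(w))
              .where("rn = 1")
              .drop("rn"))
    target = DeltaTable.forPath(spark, "s3://my-bucket/delta/my_table/")
    (target.alias("t")
     .merge(latest.alias("s"), "t.id = s.id")
     .whenMatchedDelete(condition="s.Op = 'D'")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

(source.writeStream
 .foreachBatch(upsert_to_delta)
 .option("checkpointLocation", "s3://my-bucket/checkpoints/my_table/")
 .start())
```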

Jim Hibbard

03/26/2023, 8:11 AM
Hi Chandra, depending on what version of Delta Lake you are using, the change data feed may be just what you're looking for. Hope this helps! https://docs.delta.io/latest/delta-change-data-feed.html
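As a rough sketch (the table name and version numbers below are just placeholders), enabling and reading the change data feed looks something like this:
```python
# Minimal sketch of enabling and reading the Change Data Feed (CDF).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable CDF on an existing Delta table
spark.sql("""
  ALTER TABLE my_table
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Batch read of the changes between two table versions
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 5)
           .option("endingVersion", 10)
           .table("my_table"))

# Or stream the changes to a downstream consumer
stream = (spark.readStream.format("delta")
          .option("readChangeFeed", "true")
          .option("startingVersion", 5)
          .table("my_table"))
```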

Kees Duvekot

04/07/2023, 6:29 AM
If you are using DMS, then I would suggest using the Kafka target endpoint, and seeing whether you can use kafka-delta-ingest to get the raw JSON from DMS into the Delta lake (bronze), and then doing the further processing of that DMS feed on the Delta Lake side.
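For the "further processing on the Delta Lake side" part, something roughly like this could turn the bronze rows into a typed silver table (just a sketch: it assumes the bronze table keeps the raw DMS JSON in a `value` column, and the paths, schema, and column names are placeholders):
```python
# Sketch of bronze (raw DMS JSON) -> silver (parsed, typed) processing on Delta.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Placeholder schema for the JSON payload coming out of DMS
payload_schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
    StructField("updated_at", TimestampType()),
])

# Bronze table written by kafka-delta-ingest (assumed raw JSON in a `value` column)
bronze = (spark.readStream.format("delta")
          .load("s3://my-bucket/delta/bronze/my_table/"))

silver = (bronze
          .withColumn("parsed", F.from_json("value", payload_schema))
          .select("parsed.*"))

(silver.writeStream
 .format("delta")
 .option("checkpointLocation", "s3://my-bucket/checkpoints/silver/my_table/")
 .start("s3://my-bucket/delta/silver/my_table/"))
```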