03/23/2023, 8:11 PM
Hi all, I need guidance on Delta Lake. I am constantly getting data from DMS to S3 in parquet. Sometimes I see 5 files arriving within a minute due to changes in the source RDBMS table. How can I capture changes in the target Delta lake efficiently? Can I make it a streaming architecture? My application is entirely in AWS.
Looks like I can create a Spark Streaming app polling S3, based on this URL. This is what I am thinking: DMS ==> S3 ==> Spark Streaming ==> Delta Merge. Do I need to run "OPTIMIZE deltatable ZORDER BY (X, Y)" after each MERGE command to make sure the data is reorganized for the next micro-batch run of Spark Streaming?
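(For anyone following along: inside the streaming job's foreachBatch, each micro-batch of DMS change records gets merged into the target table. Below is a minimal plain-Python sketch of those merge semantics, not actual Spark code; the "Op" and "id" column names follow the typical AWS DMS CDC output and are assumptions you would adjust to your schema.)

```python
# Conceptual sketch (plain Python, not Spark) of the upsert a Delta MERGE
# performs per micro-batch. "Op" is the DMS change-type column: I/U/D.

def apply_cdc_batch(target, changes):
    """Apply a micro-batch of DMS-style change records to a target table.

    target  -- dict mapping primary key -> row dict (stands in for the Delta table)
    changes -- list of row dicts, each with an "Op" column: "I", "U", or "D"
    """
    # Keep only the latest change per key: a merge source must be deduplicated,
    # since MERGE fails if two source rows match the same target row.
    latest = {}
    for row in changes:          # assumes changes arrive in commit order
        latest[row["id"]] = row

    for key, row in latest.items():
        if row["Op"] == "D":
            target.pop(key, None)   # WHEN MATCHED AND s.Op = 'D' THEN DELETE
        else:
            # WHEN MATCHED THEN UPDATE / WHEN NOT MATCHED THEN INSERT
            target[key] = {k: v for k, v in row.items() if k != "Op"}
    return target

# The equivalent Delta MERGE (sketch; table and column names are placeholders):
#   MERGE INTO target t
#   USING latest_changes s ON t.id = s.id
#   WHEN MATCHED AND s.Op = 'D' THEN DELETE
#   WHEN MATCHED THEN UPDATE SET *
#   WHEN NOT MATCHED AND s.Op != 'D' THEN INSERT *
```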

Jim Hibbard

03/26/2023, 8:11 AM
Hi Chandra, depending on what version of Delta Lake you are using, the change data feed may be just what you're looking for. Hope this helps!
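(For reference, the change data feed is enabled per table via a table property; a sketch, assuming a recent Delta Lake version and with `my_table` as a placeholder name:)

```sql
-- Enable CDF on an existing table (placeholder table name)
ALTER TABLE my_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Read the row-level changes between two table versions
SELECT * FROM table_changes('my_table', 1, 5);
```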

Kees Duvekot

04/07/2023, 6:29 AM
If you are using DMS .. then I would suggest using the Kafka target endpoint .. and see if you can use kafka-delta-ingest to get the raw JSON from DMS into the Delta lake (bronze) .. and then do the further processing of that DMS feed on the Delta lake side