Need guidance on Delta Lake.
I am continuously receiving data from DMS into S3 in Parquet format.
Sometimes I see 5 files arriving within a minute due to changes in the source RDBMS table.
How can I capture these changes in the target Delta Lake efficiently? Can I make this a streaming architecture?
My application is entirely on AWS.
It looks like I can create a Spark Streaming app polling S3, based on this URL:
https://stackoverflow.com/questions/30994401/spark-streaming-on-a-s3-directory
This is what I am thinking:
DMS ==> S3 ==> Spark Streaming ==> Delta Merge
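
A minimal sketch of that flow with Spark Structured Streaming, assuming things not stated in the post: the DMS output lands under s3://my-bucket/dms-output/, the target table lives at s3://my-bucket/delta/target/, the primary key column is `id`, and DMS emits its usual `Op` column (I/U/D):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("dms-cdc-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Streaming file sources need an explicit schema; trimmed to a few columns here.
schema = StructType([
    StructField("Op", StringType()),
    StructField("id", StringType()),
    StructField("name", StringType()),
])

cdc_stream = (
    spark.readStream
    .schema(schema)
    .parquet("s3://my-bucket/dms-output/")  # hypothetical DMS landing path
)

def upsert_to_delta(batch_df, batch_id):
    """MERGE each micro-batch of DMS changes into the target Delta table."""
    # If one micro-batch can hold several changes for the same key,
    # deduplicate to the latest change per key before merging.
    target = DeltaTable.forPath(spark, "s3://my-bucket/delta/target/")
    (
        target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedDelete(condition="s.Op = 'D'")
        .whenMatchedUpdateAll(condition="s.Op = 'U'")
        .whenNotMatchedInsertAll(condition="s.Op IN ('I', 'U')")
        .execute()
    )

(
    cdc_stream.writeStream
    .foreachBatch(upsert_to_delta)
    .option("checkpointLocation", "s3://my-bucket/checkpoints/dms-cdc/")
    .start()
)
```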
Do I need to run "OPTIMIZE deltatable ZORDER BY (X, Y)" after each MERGE command to make sure the data is reorganized for the next micro-batch run of Spark Streaming?
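
For reference, this is how that command could be issued from the same Spark session; the table path is a placeholder and X/Y are the placeholder columns from the question (SQL OPTIMIZE/ZORDER support depends on your Delta Lake version or platform):

```python
# Illustration only: the OPTIMIZE ... ZORDER BY command mentioned above.
spark.sql("""
    OPTIMIZE delta.`s3://my-bucket/delta/target/`
    ZORDER BY (X, Y)
""")
```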
If you are using DMS, then I would suggest using the Kafka target endpoint, and see if you can use kafka-delta-ingest to get the raw JSON from DMS into the Delta Lake (bronze), and then do the further processing of that DMS feed on the Delta Lake side.
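
Not what the answer above proposes, but for comparison: if kafka-delta-ingest doesn't fit, the same Kafka-to-bronze step can also be sketched with Spark Structured Streaming, appending the raw DMS JSON as-is. The topic name, broker address, and paths below are assumptions:

```python
# Sketch: land raw DMS JSON from the Kafka target endpoint into a bronze Delta table.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "dms-cdc-topic")                # hypothetical topic
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(value AS STRING) AS raw_json", "timestamp")
)

(
    raw.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/bronze/")
    .start("s3://my-bucket/delta/bronze/")
)
```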