Vinodh Thiagarajan

01/17/2023, 7:32 PM
Design Choice Question - We have a requirement of CDC from a popular RDBMS to Cloud. We have the CDC data flowing into topics, the data in the 1 topic represents data from 100+ tables and soon it might be 500+ tables streaming continuously. We chose the merge strategy of Delta and noticed that merging is a high compute op and a microbatch we are trying to merge deals with 50-70 tables and its going to grow. Given the SLA we observe that merging real time continuously using stream data, using a nominal cluster is not cutting it. So we are thinking about append approach and going to let the downstream apply RANK and choose the latest records instead. Are we going in the right direction ? anyone successful with continuous merging across 100s of delta table every few mins ?

Kashyap Bhatt

01/18/2023, 12:05 AM
1. Look at your latency requirements and make your micro-batches as big as possible. E.g. if your
is checking for new msgs every 2 minutes then consider making it every 10 minutes. Depending on other variables, you'll find an optimal trigger time somewhere between real-time-mode (say 1-second) and normal-batch-mode (say once/twice a day). 2. If your incoming data for 100+ tables is on a single topic, it already sounds like a bad idea. It forces consumers to create their own circus to read optimally and do parallel processing. One such circus would be to a. have one top level
that just reads from msg bus and dumps into a DeltaTable as pure append (no updates). Pick good partitioning column. b. have multiple
consumers that use this DeltaTable's change feed (or table itself) as source, apply appropriate filters (e.g.
table_name in (t1, t2, .. t10)
) and then processes only some tables outta the 100+. c. This should give you the ability to have multiple jobs running in parallel consuming from DeltaTable, possibly at different frequencies (depending on how frequently data for given tables arrives). d. Then you can also do more fine tuning of compute resources. 3. "let the downstream apply RANK and choose the latest records instead" --> If your incoming data is such that one table receives different/updated versions of a row very frequently and your downstream logic only cares for the latest one, then it's really a waste of resources to put every version of a row through all down stream transformations. In other words, you should execute downstream logic as infrequently as possible. Basically what's listed under #1 above.
🙇 2

Vinodh Thiagarajan

01/18/2023, 1:52 AM
thanks for detailed explanation @Kashyap Bhatt
👍🏽 1