Vinodh Thiagarajan
01/17/2023, 7:32 PMKashyap Bhatt
01/18/2023, 12:05 AMreadStream()
is checking for new msgs every 2 minutes then consider making it every 10 minutes. Depending on other variables, you'll find an optimal trigger time somewhere between real-time-mode (say 1-second) and normal-batch-mode (say once/twice a day).
2. If your incoming data for 100+ tables is on a single topic, it already sounds like a bad idea. It forces consumers to create their own circus to read optimally and do parallel processing. One such circus would be to
a. have one top level readStream()
that just reads from msg bus and dumps into a DeltaTable as pure append (no updates). Pick good partitioning column.
b. have multiple readStream()
consumers that use this DeltaTable's change feed (or table itself) as source, apply appropriate filters (e.g. table_name in (t1, t2, .. t10)
) and then processes only some tables outta the 100+.
c. This should give you the ability to have multiple jobs running in parallel consuming from DeltaTable, possibly at different frequencies (depending on how frequently data for given tables arrives).
d. Then you can also do more fine tuning of compute resources.
3. "let the downstream apply RANK and choose the latest records instead" --> If your incoming data is such that one table receives different/updated versions of a row very frequently and your downstream logic only cares for the latest one, then it's really a waste of resources to put every version of a row through all down stream transformations. In other words, you should execute downstream logic as infrequently as possible. Basically what's listed under #1 above.Vinodh Thiagarajan
01/18/2023, 1:52 AM