1. The producer writes data to Delta Lake using the delta-rs library.
2. The consumer reads the data, runs some transformations on it, and saves the result to Delta Lake in a different table space.
3. The consumer needs to consume only the incremental data written between its last run and the current run.
4. The consumer reads the data using the Python wrapper of delta-rs.
Is there any way to read only this incremental data?
06/25/2023, 7:39 AM
There are two answers to this - one applies now, and one applies soon-ish.
Right now, with the stack you mentioned, you would have to load the latest and previous versions of the table (or use only the added / deleted files) and work out the change manually. Some scenarios make this easier: in append-only tables, for example, each transaction only creates new files, and you can use the low-level APIs to get the add actions from the latest version.
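A rough sketch of the manual approach, in case it helps. It assumes the consumer stores the table version it last processed somewhere (that bookkeeping is not shown), and the names `diff_file_sets` and `changed_files` are made up for illustration. It relies on `DeltaTable(..., version=...)`, `file_uris()` and `version()` from the `deltalake` Python package.

```python
from typing import Iterable, Set, Tuple


def diff_file_sets(
    previous: Iterable[str], current: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed) data files between two table snapshots."""
    prev, curr = set(previous), set(current)
    return curr - prev, prev - curr


def changed_files(table_uri: str, last_version: int) -> Tuple[Set[str], Set[str], int]:
    """Compare the snapshot at last_version with the latest snapshot.

    Hypothetical helper: last_version is whatever the consumer recorded
    at the end of its previous run.
    """
    # Imported lazily so diff_file_sets stays usable without deltalake installed.
    from deltalake import DeltaTable

    previous = DeltaTable(table_uri, version=last_version)
    latest = DeltaTable(table_uri)
    added, removed = diff_file_sets(previous.file_uris(), latest.file_uris())
    # The caller should persist the returned version for the next run.
    return added, removed, latest.version()
```

In an append-only table `removed` should come back empty, and the files in `added` can be read directly as Parquet to get the new rows. If the table is ever compacted or rewritten, the diff will also contain rewritten files, so this shortcut would no longer map cleanly to "new rows only".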
The community is also working on extending the protocol support level for delta-rs. Once CDC is supported you can get the latest changes from the table directly. That said, CDC support will likely apply to the read path first, and support for writing CDC actions with delta-rs may take a little longer.