
Cole MacKenzie

06/28/2023, 12:07 AM
Is there a specific reason `DeltaTable` is not cloneable? Regarding API semantics here, the `JsonWriter` only needs to borrow the table, so it is possible to keep both a table and a writer, but `DeltaOps` wants to own the table, meaning I can't keep it around to call other methods / ops on it.
So the other question I have is: does `DeltaOps` really need to consume the table?

Will Jones

06/28/2023, 1:03 AM
The reason it's not cloneable is mostly that it's not cheap to clone, so we didn't want to encourage cloning. It owns a whole struct containing the current delta table state (think: all the file paths, partition values, and file statistics). The current model is that methods can mutate the state, so that's why `DeltaOps` consumes the table: it's mutating the inner state itself. I think if we tried harder we could make it take a mutable reference instead; IIRC the only reason we didn't is to avoid lifetime complications inside async functions.
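To make the two ownership styles concrete, here is a minimal sketch. `Table`, `ConsumingOps`, and `BorrowingOps` are hypothetical stand-ins invented for illustration, not the delta-rs types, and everything is synchronous, so it doesn't show the async lifetime complications mentioned above.

```rust
// Hypothetical stand-in for a table that owns its state; NOT the delta-rs type.
#[derive(Debug)]
struct Table {
    version: i64,
    files: Vec<String>,
}

// Consuming style: the op takes ownership of the table and hands back the updated one.
struct ConsumingOps(Table);

impl ConsumingOps {
    fn append(mut self, file: &str) -> Table {
        self.0.files.push(file.to_string());
        self.0.version += 1;
        self.0
    }
}

// Borrowing style: the op only holds `&mut Table`, so the caller keeps the table.
struct BorrowingOps<'a>(&'a mut Table);

impl<'a> BorrowingOps<'a> {
    fn append(self, file: &str) {
        let table = self.0;
        table.files.push(file.to_string());
        table.version += 1;
    }
}

fn ownership_demo() -> (i64, usize) {
    let table = Table { version: 0, files: Vec::new() };
    // `table` is moved into the op; we only get it back because `append` returns it.
    let mut table = ConsumingOps(table).append("part-0.parquet");
    // Here the table is merely borrowed for the duration of the op.
    BorrowingOps(&mut table).append("part-1.parquet");
    (table.version, table.files.len())
}
```

With the consuming style, the caller has to rebind the returned table to keep using it; with the borrowing style, the original binding stays usable after the op finishes.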
We had discussed earlier maybe putting the table state behind an `Arc` and making it copy-on-write. That would make it cheap to clone a `DeltaTable`.
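Roughly what the `Arc` + copy-on-write idea could look like, using `Arc::make_mut` from the standard library. `Table` and `TableState` here are hypothetical simplified types, not the actual delta-rs structs, so treat this as a sketch of the technique rather than a proposed implementation.

```rust
use std::sync::Arc;

// Hypothetical, simplified table state; the real state would also hold
// partition values, file statistics, and so on.
#[derive(Clone, Debug)]
struct TableState {
    version: i64,
    files: Vec<String>,
}

// Hypothetical table handle: cloning it only bumps the Arc reference count.
#[derive(Clone, Debug)]
struct Table {
    state: Arc<TableState>,
}

impl Table {
    fn new() -> Self {
        Table {
            state: Arc::new(TableState { version: 0, files: Vec::new() }),
        }
    }

    fn add_file(&mut self, path: &str) {
        // `Arc::make_mut` clones the inner state only if another handle
        // currently shares it; otherwise it mutates in place.
        let state = Arc::make_mut(&mut self.state);
        state.files.push(path.to_string());
        state.version += 1;
    }
}

fn cow_demo() -> (bool, usize, usize) {
    let mut a = Table::new();
    a.add_file("part-0.parquet");
    // Cheap clone: both handles point at the same state.
    let b = a.clone();
    let shared_before_write = Arc::ptr_eq(&a.state, &b.state);
    // This write copies the state for `a`; `b` keeps the old snapshot.
    a.add_file("part-1.parquet");
    (shared_before_write, a.state.files.len(), b.state.files.len())
}
```

The trade-off is that the first write after a clone pays for a full copy of the state, so this helps most when clones are mainly read or handed off to an op.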
But I think we've mostly held off because we are planning to refactor the core state loading logic so that it doesn't eagerly load it all into memory.
I guess we could make `DeltaOps` not provide the updated table state, and force the user to update the state themselves. Is that what `JsonWriter` does?

Cole MacKenzie

06/28/2023, 1:35 AM
Looks like the `DeltaWriter` trait does the updating. The `JsonWriter` only uses the schema + object store, it looks like. https://github.com/delta-io/delta-rs/blob/main/rust/src/writer/mod.rs#L133-L139
+1 for `Arc`ing the table state and going for CoW.
My use case: I have N tables and writers for those tables that act as a buffer. They flush periodically. Trying to run any `DeltaOps` means I have to recreate the table anyway (which, without `Clone`/CoW, could be expensive).