
Ryan Aston

09/04/2023, 5:29 PM
Does delta-rs currently have support for the following?
1. get the last checkpoint (I do see a function that’s public at the crate level but not outside the crate)
2. get the total size of the table at the current loaded version
3. get the size of the parquet file created from the json writer’s flush and commit

Will Jones

09/04/2023, 6:20 PM
> 1. get the last checkpoint
Do you just want to know the most recent version with a checkpoint? Or do you want to know the content of the file?
> 2. get the total size of the table at the current loaded version
You can compute this by summing the `size_bytes` column in `add_actions_table()`: https://docs.rs/deltalake/0.14.0/deltalake/table_state/struct.DeltaTableState.html#method.add_actions_table
> 3. get the size of the parquet file created from the json writer’s flush and commit
I don't think that's part of the API, but it seems like a reasonable request if you want to open a GitHub issue.
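As a rough illustration of the summing step in answer 2, the sketch below assumes the `size_bytes` values have already been pulled out of the add-actions table into a plain slice. The function name `total_size_bytes` is hypothetical and this is not the deltalake API itself; extracting the column from the returned record batch is left out.

```rust
/// Sums the per-file sizes of the active add actions.
/// `size_bytes` is assumed to hold one entry per data file, in bytes,
/// as read from the `size_bytes` column of the add-actions table.
fn total_size_bytes(size_bytes: &[i64]) -> i64 {
    size_bytes.iter().sum()
}
```

Calling the same helper on a snapshot taken before a commit and one taken after it, then subtracting, gives the bytes added by that commit.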

Ryan Aston

09/04/2023, 6:35 PM
On 1, I’d just need the version. Makes sense on 2, and I’ll put in a request for 3.
On 3, I did see that the writer has a `buffer_len` function to return the current size of the parquet buffers in bytes, but the value seems to always be returned as zero. I am successfully writing data, so I’m not sure why it never shows any bytes. I’m printing the value of `buffer_len()` after a `.write()` but before the `.flush_and_commit()`:
```rust
let mut wrtr: deltalake::writer::JsonWriter =
    deltalake::writer::json::JsonWriter::for_table(&table).unwrap();

wrtr.write(events_json.clone()).await.unwrap();
println!("size of buffers - {}", wrtr.buffer_len()); // always prints '0'
wrtr.flush_and_commit(&mut table).await.unwrap();
```

Will Jones

09/04/2023, 8:35 PM
IIRC, `buffer_len` is the current size of serialized Parquet data that hasn’t been written to disk. If you haven’t gotten enough rows to fill up a row group, it will be zero.
I think `buffered_record_batch_count` should be non-zero for you, is it?

Ryan Aston

09/05/2023, 2:17 AM
Yes. `events_json` is a `Vec<Value>` with 487 tuples. After the write above, `buffer_len()` returns 0 and `buffered_record_batch_count()` returns 1. The description for `write()` is “Writes the given values to internal parquet buffers for each represented partition” and the description for `buffer_len()` is “Returns the current byte length of the in memory buffer. This may be used by the caller to decide when to finalize the file write”, which implies it would be the size of the data waiting to be flushed to disk, but that doesn’t seem to be the case.
For now I’m working around it by calling the helper function I created for question 2 just before I flush and just after, and finding the difference. For 1, I’m working around it by basically creating my own helper function using inspiration from these two functions in the delta-rs source:
• https://github.com/delta-io/delta-rs/blob/787c13a63efa9ada96d303c10c093424215aaa80/rust/src/action/mod.rs#L910
• https://github.com/delta-io/delta-rs/blob/787c13a63efa9ada96d303c10c093424215aaa80/rust/tests/checkpoint_writer.rs#L59
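For question 1, a helper of this kind could look roughly like the sketch below. It relies on the Delta protocol's `_delta_log/_last_checkpoint` file, a small JSON document with a `version` field, rather than on any delta-rs API. The function name `last_checkpoint_version` is hypothetical, and the hand-rolled parsing is only there to keep the example dependency-free; a real helper would more likely use a JSON library such as `serde_json`.

```rust
/// Extracts the `version` field from the contents of `_delta_log/_last_checkpoint`.
/// Assumption: `contents` is the raw JSON text of that file, e.g. read with
/// `std::fs::read_to_string` for a table on a local filesystem.
/// Returns `None` if no numeric `version` field is found.
fn last_checkpoint_version(contents: &str) -> Option<u64> {
    let key = "\"version\"";
    // Skip past the key, then past the colon that separates it from the value.
    let after_key = &contents[contents.find(key)? + key.len()..];
    let after_colon = after_key.trim_start().strip_prefix(':')?.trim_start();
    // Collect the leading run of digits, which is the version number.
    let digits: String = after_colon
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}
```

If the file does not exist, the table has no checkpoint yet, so the caller would need to treat a missing file as "no checkpoint" rather than as an error.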
Hi Will, I had one more follow-up here. The add_actions approach is working well, but is there a way to get only the add actions that occurred as part of a particular version commit? Also, should I be scoping it to active adds only (would not doing so not take into account deletes)?

Will Jones

09/07/2023, 9:21 PM
> but is there a way to get only the add actions that occurred as part of a particular version commit?
No, I don't think there is.
> Also, should I be scoping it to active adds only (would not doing so not take into account deletes)?
The table only includes add actions that are active (not any that are removed).

Ryan Aston

09/07/2023, 10:25 PM
Even if there’s not a native way, are the building blocks there to build my own function to get the adds for a particular version?

Will Jones

09/07/2023, 10:54 PM
I think what you want is a function to get the logs for a particular log file; we don't expose anything that low level as far as I can find
👍 1
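One possible building block, going by the Delta protocol rather than anything delta-rs exposes: each commit is a newline-delimited JSON file under `_delta_log`, named with the version zero-padded to 20 digits, and each line holds a single action. The sketch below computes that file name and picks out lines that look like `add` actions. Both function names are hypothetical, and the string-prefix check assumes the writer serializes actions compactly as `{"add":...}`; a robust version would parse each line with a JSON library instead.

```rust
/// Relative path of the JSON commit file for `version`, following the Delta
/// protocol convention of a 20-digit, zero-padded version number.
fn commit_log_path(version: u64) -> String {
    format!("_delta_log/{:020}.json", version)
}

/// Returns the lines of a commit file that look like `add` actions.
/// Assumption: each line is one compactly serialized JSON action whose
/// single top-level key names the action type.
fn add_action_lines(commit_contents: &str) -> Vec<&str> {
    commit_contents
        .lines()
        .map(str::trim)
        .filter(|line| line.starts_with("{\"add\""))
        .collect()
}
```

Reading the file at `commit_log_path(version)` from the table root and passing its contents to `add_action_lines` would yield only the adds written by that commit, including any that a later version removes.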