Eero Lihavainen

07/19/2023, 1:38 PM
Hello! I've been testing the Python deltalake library in AWS Lambda & S3. It works nicely, but I noticed that query latency varies quite a lot depending on how many files there are in the commit log. If I read the delta-rs code correctly, the update_incremental function reads and processes all files in sequence, is that correct? And if so, would it be possible to make concurrent requests to the storage instead and buffer the results? https://github.com/delta-io/delta-rs/blob/a7389d581aceeec7187ccdfd1cbb80e7ac0747f9/rust/src/delta.rs#L472
To give some idea of the numbers, RawDeltaTable creation takes about 1.5-3 seconds for a table with ~100 files in the commit log, and 0.7-1.4 seconds for another table with 30 files in the log.
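A minimal sketch of the "concurrent requests, buffered results" idea, in Python rather than Rust and not taken from delta-rs. `read_commit` is a hypothetical stand-in for a single GET against object storage, and the worker count is an arbitrary assumption. The point it illustrates is that the fetches can overlap while the results are still handed back in version order, so the log can be replayed sequentially afterwards.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_commit(version: int) -> str:
    """Hypothetical stand-in for fetching one commit file from storage."""
    time.sleep(0.01)  # simulated network round trip
    return f"actions-for-{version:020d}.json"


def read_commits_sequentially(versions):
    """Baseline: one request at a time, total latency adds up."""
    return [read_commit(v) for v in versions]


def read_commits_concurrently(versions, max_workers: int = 16):
    """Issue the requests in parallel and buffer the results.

    Executor.map yields results in input order, so the caller can still
    apply the commits one after another in version order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_commit, versions))
```

With roughly 100 commit files, the sequential version pays for about 100 round trips back to back, while the concurrent one is bounded by the slowest batch of requests.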

Will Jones

07/19/2023, 2:42 PM
Ideally we should be making checkpoints, so that you are reading at most 10 files. But we could also read files concurrently.
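To make the "at most 10 files" figure concrete, here is a rough sketch of which log files a reader needs, given the last checkpoint version and the latest table version. It assumes single-part checkpoints and the zero-padded 20-digit file names used in `_delta_log`; the function name is hypothetical and this is not how delta-rs implements it.

```python
def log_files_to_read(last_checkpoint, latest_version):
    """List the log files needed to reconstruct state at latest_version.

    last_checkpoint is the version of the most recent checkpoint, or None
    if the table has no checkpoint yet (assumption: single-part checkpoint).
    """
    files = []
    if last_checkpoint is None:
        start = 0  # no checkpoint: replay every commit from the beginning
    else:
        files.append(f"{last_checkpoint:020d}.checkpoint.parquet")
        start = last_checkpoint + 1  # only commits after the checkpoint
    files.extend(f"{v:020d}.json" for v in range(start, latest_version + 1))
    return files
```

With a checkpoint every 10 commits, the worst case is one checkpoint file plus nine commit files. With 50 commits since the last checkpoint, the same reader has to fetch 51 files.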

Eero Lihavainen

07/19/2023, 2:44 PM
I'm only using this setup to read; Databricks with Unity Catalog is taking care of the checkpointing. And for some reason it seems to create 50 or even more files between checkpoints.
The use case is serving data for a dashboard, so the lower the latency the better.

Robert

07/19/2023, 3:43 PM
Spark / Databricks allows you to configure the interval at which checkpoints are created - if that can be altered, this is a powerful way to reduce latency. Less effective, but also useful, would be to disable loading tombstones - you don't need them for read-only scenarios.
Not sure how urgent that is, but improving log parsing and state handling in general is something that will likely see significant improvements over the rest of the year - so another strategy is to just wait... or contribute to delta-rs 🙂.
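As a sketch of the checkpoint interval setting: Delta Lake exposes a `delta.checkpointInterval` table property, which can be set with an `ALTER TABLE` statement. The helper below only builds that statement; the function name and table name are hypothetical, and whether Databricks with Unity Catalog honours the property for a given table is an assumption that was not confirmed in this thread.

```python
def set_checkpoint_interval_sql(table_name: str, interval: int) -> str:
    """Build an ALTER TABLE statement that sets delta.checkpointInterval.

    table_name is a hypothetical, already-trusted identifier; interval is
    the number of commits between checkpoints.
    """
    if interval < 1:
        raise ValueError("interval must be a positive number of commits")
    return (
        f"ALTER TABLE {table_name} "
        f"SET TBLPROPERTIES ('delta.checkpointInterval' = '{interval}')"
    )
```

The resulting string would typically be run from a Spark session on the writer side, for example `spark.sql(set_checkpoint_interval_sql("my_table", 10))`.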

Eero Lihavainen

07/19/2023, 3:47 PM
Thanks, that's good info. I couldn't find much information about checkpointing in Databricks, they just say it uses an "optimized frequency". I would love to contribute, I already played with the code a bit but ... Rust is easier to read as a complete beginner than to write 😅
But I will definitely give it a shot and see if I learn something
🦀 1

Robert

07/19/2023, 3:55 PM
We are always happy to help with any PR - giving feedback like here is also a great way to help/contribute.
🤝 1

Eero Lihavainen

07/20/2023, 12:12 PM
Is it OK if I open a WIP PR for this? I've got something that seems to work, but it could use some expert eyes.

rtyler

07/20/2023, 1:20 PM
@Eero Lihavainen yes of course!
👍 1

Eero Lihavainen

07/20/2023, 1:39 PM

rtyler

07/20/2023, 1:43 PM
I saw that come through, let me convert this to a draft and we can discuss more there

Eero Lihavainen

07/20/2023, 1:45 PM
Thanks!
Any thoughts on how this could be moved forward? https://github.com/delta-io/delta-rs/pull/1549
👀 1