Eero Lihavainen

07/23/2023, 5:32 AM
Is it theoretically possible that the transaction log gets updated faster than it's being read, resulting in the table being loaded for eternity? It seems to me that there's no way of knowing the max_version when starting to load the table. Tangentially related to https://github.com/delta-io/delta-rs/pull/1549 because there will necessarily be some unnecessary requests when doing concurrent reads to the log - which would not be the case if max_version was known
Looks like in the Spark implementation the files are listed before starting to load, which would guarantee this will not happen: https://github.com/delta-io/delta/blob/c78daeff2062ce3c9d8c159301add1848f25c809/spark/src/main/scala/org/apache/spark/sql/delta/DeltaLog.scala#L260 For S3 they use `start-after` to list files after a given key, which I assume is performant: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
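
A minimal sketch of the listing idea, assuming an in-memory `BTreeSet` as a stand-in for the object store. The helpers `commit_key`, `list_after`, `parse_version`, and `max_version_after` are hypothetical names for illustration, not part of delta-rs or object_store. It relies on Delta commit files being named with a 20-digit zero-padded version, so lexicographic key order matches version order and a single start-after style listing yields the newest version.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Key of a commit file: 20-digit zero-padded version plus ".json".
fn commit_key(version: u64) -> String {
    format!("_delta_log/{:020}.json", version)
}

/// Hypothetical stand-in for a start-after listing: returns every key
/// strictly greater than `offset`, in lexicographic order.
fn list_after(keys: &BTreeSet<String>, offset: &str) -> Vec<String> {
    keys.range::<str, _>((Bound::Excluded(offset), Bound::Unbounded))
        .cloned()
        .collect()
}

/// Extracts the version from a commit key; other files yield None.
fn parse_version(key: &str) -> Option<u64> {
    key.strip_prefix("_delta_log/")?
        .strip_suffix(".json")?
        .parse()
        .ok()
}

/// Newest commit version after `current`, found with one listing.
fn max_version_after(keys: &BTreeSet<String>, current: u64) -> Option<u64> {
    list_after(keys, &commit_key(current))
        .iter()
        .filter_map(|k| parse_version(k))
        .max()
}
```

With the target version known from one listing, the remaining commits could be fetched concurrently without probing for versions that do not exist.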

David Blajda

07/23/2023, 4:37 PM
Yeah, in theory the current implementation can get stuck in a loop fetching the next version if write speed exceeds read speed. The solution you identified will remove that infinite loop. In the object store, list_with_offset maps to the S3 param start-after that you mentioned: https://github.com/apache/arrow-rs/blob/d9719d1f2d21ed0c75ffd5c8e8c4fb15ee5ebe54/object_store/src/aws/client.rs#L586-L588
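
The difference between the two loading strategies can be sketched with a small simulation. Everything here is hypothetical (`SimLog`, `load_by_probing`, `load_with_snapshot`) and only models the behaviour discussed above; it is not the delta-rs implementation. The simulated writer commits one new version for every commit the reader loads, so probing for "the next version" never misses, while fixing the target version up front terminates.

```rust
/// Hypothetical in-memory log: `latest` is the newest committed version.
struct SimLog {
    latest: u64,
}

impl SimLog {
    /// Simulates reading one commit while a concurrent writer lands another.
    /// Returns false when `version` has not been committed yet.
    fn read_commit(&mut self, version: u64) -> bool {
        if version > self.latest {
            return false;
        }
        self.latest += 1; // the writer keeps pace with every read
        true
    }
}

/// Probe-next-version loading: read until a version is missing.
/// `budget` exists only so the simulation can stop; returns the number
/// of commits loaded, or None if the writer stayed ahead throughout.
fn load_by_probing(log: &mut SimLog, budget: u64) -> Option<u64> {
    let mut next = 0;
    while next < budget {
        if !log.read_commit(next) {
            return Some(next);
        }
        next += 1;
    }
    None
}

/// Snapshot loading: fix the target version first (as a listing would),
/// then read exactly up to it. Returns the number of commits loaded.
fn load_with_snapshot(log: &mut SimLog) -> u64 {
    let max_version = log.latest;
    for version in 0..=max_version {
        log.read_commit(version);
    }
    max_version + 1
}
```

Starting from `SimLog { latest: 5 }`, the probing loader exhausts any budget without finishing, whereas the snapshot loader stops after six commits even though the log has grown in the meantime.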

Eero Lihavainen

07/23/2023, 6:38 PM
Yeah, I noticed that, and added a quick and dirty implementation here in the PR: https://github.com/delta-io/delta-rs/pull/1549/files/b1873745a310117d967d435f229bd77780762e29#diff-3b5c80af9e5527fbeaa6[…]0ba0471c58f243a29dfa99R553 Unit tests run OK, but I haven't tried it against S3.

David Blajda

07/23/2023, 9:34 PM
Neat. We have macros to help with integration tests against the various object stores; see: https://github.com/delta-io/delta-rs/blob/main/rust/tests/integration_checkpoint.rs#L21 For benchmarking, this repo has `criterion` included, which is really great for that work.