Eero Lihavainen

09/14/2023, 12:47 PM
Using Python deltalake with PyArrow and S3 and trying to minimize query latency, I noticed that PyArrow makes HEAD requests for every file it reads. I believe this is to determine the file size so the Parquet footer can be read. However, I think the delta log already contains the file size, so in principle this could be communicated to Arrow to avoid the HEAD requests (see e.g. https://github.com/apache/arrow/pull/7547 ). Has anyone looked into something like this?
Or maybe a more relevant question is, is this kind of optimization done in the delta-rs S3 filesystem implementation?
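To illustrate the point about the delta log: each add action in a Delta commit file records the data file's size in bytes next to its path, so the sizes can be read without contacting S3. The following is a minimal sketch, not delta-rs code. The helper name sizes_from_commit and the example commit contents are hypothetical and exist only for illustration.

```python
import json
from typing import Dict


def sizes_from_commit(commit_text: str) -> Dict[str, int]:
    """Collect path -> size in bytes from the add actions of one commit file.

    A Delta commit file is newline-delimited JSON, one action per line.
    """
    sizes: Dict[str, int] = {}
    for line in commit_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        action = json.loads(line)
        add = action.get("add")
        if add is not None:
            sizes[add["path"]] = add["size"]
    return sizes


# Hypothetical commit contents, made up for this example.
example_commit = "\n".join(
    [
        json.dumps({"commitInfo": {"operation": "WRITE"}}),
        json.dumps({"add": {"path": "part-00000.parquet", "size": 1024,
                            "partitionValues": {}, "modificationTime": 0,
                            "dataChange": True}}),
        json.dumps({"add": {"path": "part-00001.parquet", "size": 2048,
                            "partitionValues": {}, "modificationTime": 0,
                            "dataChange": True}}),
    ]
)

print(sizes_from_commit(example_commit))
```

A real reader would also have to replay remove actions and checkpoints to get the current set of files; this sketch only shows where the size lives.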

Will Jones

09/14/2023, 1:03 PM
We don't do that optimization right now, no
Haven't looked into it since I plan on switching away from using PyArrow later anyways.
👍 1

Eero Lihavainen

09/14/2023, 1:07 PM
Is that the responsibility of Delta Kernel or something else?

Will Jones

09/14/2023, 3:00 PM
Yeah might be something we could incorporate into Delta Kernels. It's a good idea to pass that info down to the Parquet reader

Eero Lihavainen

09/15/2023, 2:42 PM
I played with this idea a bit, adding a size argument to the get_input_file file system method: https://github.com/delta-io/delta-rs/pull/1630 With this addition the DeltaFileSystemHandler can be extended on the Python side to pass the file sizes from the DeltaTable object. Do you think this would be a reasonable way to do it?

Will Jones

09/15/2023, 8:08 PM
Well the base class doesn't have that parameter (docs), so how would PyArrow know to pass the size down?
It's PyArrow, not delta-rs, that's calling open_input_file()

Eero Lihavainen

09/16/2023, 6:50 AM
Yeah sorry I wasn't clear, I didn't add it to the draft, but this parameter allows wrapping the method in the Python implementation of DeltaFileSystemHandler. So you can get file sizes using get_add_actions on the DeltaTable, and pass them based on the path that PyArrow calls the method with. Not sure if this makes sense though, or if it would be better to do on the Rust side? And are there cases when this HEAD request should not be skipped?

Will Jones

09/17/2023, 12:16 AM
Oh hmm. get_add_actions isn't the most efficient way to get that, but might be the only way that the add actions are exposed at the moment. 🤔
Yeah that makes sense. I guess DeltaStorageHandler could have a field known_sizes: Dict[str, int] that you can populate using get_add_actions here: https://github.com/delta-io/delta-rs/blob/ef17f9e87bf7028b6dd00df27549149c60282e02/python/deltalake/table.py#L525
And then alter the filesystem to try to use known_sizes instead of making a HEAD request
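The known_sizes idea described above could look roughly like the sketch below. It is pure Python and does not touch delta-rs or PyArrow: the class SizeLookup, its methods, and the head_request callable are hypothetical stand-ins. It also assumes the add actions arrive as a dict of columns named "path" and "size_bytes" (for example from converting the get_add_actions result to a Python dict); those column names are an assumption here, not something stated in the thread.

```python
from typing import Callable, Dict, List, Mapping


class SizeLookup:
    """Hypothetical helper: prefer sizes known from the Delta log, else HEAD."""

    def __init__(self, head_request: Callable[[str], int]) -> None:
        # Path -> size in bytes, filled from the table's add actions.
        self.known_sizes: Dict[str, int] = {}
        # Fallback that asks the object store for the size (one HEAD request).
        self._head_request = head_request

    def populate(self, add_actions: Mapping[str, List]) -> None:
        # Assumed column names: "path" and "size_bytes".
        for path, size in zip(add_actions["path"], add_actions["size_bytes"]):
            self.known_sizes[path] = size

    def size_of(self, path: str) -> int:
        size = self.known_sizes.get(path)
        if size is None:
            # Not in the log snapshot we loaded, so fall back to HEAD.
            size = self._head_request(path)
        return size


# Usage with a fake HEAD function that records every call it receives.
head_calls: List[str] = []


def fake_head(path: str) -> int:
    head_calls.append(path)
    return 999


lookup = SizeLookup(fake_head)
lookup.populate({"path": ["a.parquet", "b.parquet"], "size_bytes": [10, 20]})
print(lookup.size_of("a.parquet"), head_calls)
```

Keeping the HEAD fallback means a path that is missing from known_sizes still works, at the cost of the extra request the lookup was meant to avoid.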

Eero Lihavainen

09/18/2023, 6:40 AM
Opened PR with the complete implementation: https://github.com/delta-io/delta-rs/pull/1630
👍 1