Eero Lihavainen
09/14/2023, 12:47 PM
Using deltalake with Pyarrow and S3 and trying to minimize query latency, I noticed that Pyarrow makes HEAD requests for every file it reads. I believe this is to determine the file size so the parquet footer can be read. However, I think the delta log already contains the file size, so in principle this could be communicated to Arrow to avoid the HEAD requests (see e.g. https://github.com/apache/arrow/pull/7547). Has anyone looked into something like this?
delta-rs s3 filesystem implementation?
Will Jones
09/14/2023, 1:03 PM
Eero Lihavainen
09/14/2023, 1:07 PM
Will Jones
09/14/2023, 3:00 PM
Eero Lihavainen
09/15/2023, 2:42 PM
size argument to the get_input_file file system method: https://github.com/delta-io/delta-rs/pull/1630
With this addition the DeltaFileSystemHandler can be extended on the Python side to pass the file sizes from the DeltaTable object. Do you think this would be a reasonable way to do it?
Will Jones
09/15/2023, 8:08 PM
open_input_file()
Eero Lihavainen
09/16/2023, 6:50 AM
Will Jones
09/17/2023, 12:16 AM
get_add_actions isn't the most efficient way to get that, but might be the only way that the add actions are exposed at the moment.
known_sizes: Dict[str, int] that you can populate using get_add_actions here: https://github.com/delta-io/delta-rs/blob/ef17f9e87bf7028b6dd00df27549149c60282e02/python/deltalake/table.py#L525
known_sizes instead of making a HEAD request
Eero Lihavainen
09/18/2023, 6:40 AM
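
For reference, the known_sizes idea from the thread could look roughly like the sketch below. It is only an illustration, not the code from the PR: build_known_sizes and SizeLookup are hypothetical names, the head callback stands in for the real HEAD request, and the "path" / "size_bytes" column names in the comment are an assumption about what get_add_actions returns.

```python
from typing import Callable, Dict, Iterable, Optional


def build_known_sizes(paths: Iterable[str], sizes: Iterable[int]) -> Dict[str, int]:
    """Map each data file path to the size in bytes recorded in the Delta log."""
    return {path: int(size) for path, size in zip(paths, sizes)}


class SizeLookup:
    """Hypothetical helper: use known_sizes first, fall back to a remote lookup."""

    def __init__(self, known_sizes: Optional[Dict[str, int]], head: Callable[[str], int]):
        self.known_sizes = known_sizes or {}
        self.head = head  # stand-in for the HEAD request against object storage
        self.head_calls = 0  # counts how often the fallback was needed

    def size_of(self, path: str) -> int:
        # The size from the Delta log avoids a round trip to S3.
        if path in self.known_sizes:
            return self.known_sizes[path]
        self.head_calls += 1
        return self.head(path)


# With deltalake the inputs might come from something like this (assumption,
# column names not checked against a specific version):
#   actions = dt.get_add_actions().to_pydict()
#   known_sizes = build_known_sizes(actions["path"], actions["size_bytes"])
known_sizes = build_known_sizes(["part-0.parquet", "part-1.parquet"], [1024, 2048])
lookup = SizeLookup(known_sizes, head=lambda path: 4096)
```

A file system handler could consult such a lookup in open_input_file() and only issue the HEAD request when the path is missing from known_sizes.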