Eero Lihavainen 09/14/2023, 12:47 PM
with PyArrow and S3 and trying to minimize query latency, I noticed that PyArrow makes a HEAD request for every file it reads. I believe this is to determine the file size so that the Parquet footer can be read. However, I think the Delta log already contains the file size, so in principle this could be communicated to Arrow to avoid the HEAD requests (see e.g. https://github.com/apache/arrow/pull/7547). Has anyone looked into something like this?
s3 filesystem implementation?
Will Jones 09/14/2023, 1:03 PM
Eero Lihavainen 09/14/2023, 1:07 PM
Will Jones 09/14/2023, 3:00 PM
Eero Lihavainen 09/15/2023, 2:42 PM
argument to the
file system method: https://github.com/delta-io/delta-rs/pull/1630 With this addition, the DeltaFileSystemHandler can be extended on the Python side to pass the file sizes from the DeltaTable object. Do you think this would be a reasonable way to do it?
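For context, here is a rough sketch of where the file sizes could come from. It assumes the plain JSON commit files of the Delta log, where each line is a single action and `add` actions carry `path` and `size`. The helper name `sizes_from_commit_lines` and the sample entries are made up for illustration, and checkpoint files are ignored entirely, so this is not how delta-rs itself resolves table state.

```python
import json
from typing import Dict, Iterable


def sizes_from_commit_lines(lines: Iterable[str]) -> Dict[str, int]:
    """Collect {relative path: size in bytes} from Delta log commit lines.

    Hypothetical helper: it only looks at `add` and `remove` actions and
    does not read checkpoints.
    """
    sizes: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        if "add" in action:
            add = action["add"]
            sizes[add["path"]] = int(add["size"])
        elif "remove" in action:
            # A removed file is no longer part of the table.
            sizes.pop(action["remove"]["path"], None)
    return sizes


# Made-up commit entries, in the order they would be replayed.
commit_lines = [
    '{"commitInfo": {"operation": "WRITE"}}',
    '{"add": {"path": "part-0000.parquet", "size": 1024, "dataChange": true}}',
    '{"add": {"path": "part-0001.parquet", "size": 2048, "dataChange": true}}',
    '{"remove": {"path": "part-0000.parquet", "dataChange": true}}',
]

known_sizes = sizes_from_commit_lines(commit_lines)
print(known_sizes)
```

In practice the sizes would more likely be taken from the DeltaTable object, as suggested above, rather than by re-reading the log by hand.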
Will Jones 09/15/2023, 8:08 PM
Eero Lihavainen 09/16/2023, 6:50 AM
Will Jones 09/17/2023, 12:16 AM
isn’t the most efficient way to get that, but might be the only way that the add actions are exposed at the moment. 🤔
that you can populate using
known_sizes: Dict[str, int]
instead of making a HEAD request
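A minimal sketch of the `known_sizes` idea, assuming a lookup that answers from the dictionary first and only falls back to a HEAD-style call for unknown paths. `SizeResolver`, `fake_head`, and the paths are hypothetical stand-ins. This is not the actual DeltaFileSystemHandler or PyArrow API, just the caching logic in isolation.

```python
from typing import Callable, Dict


class SizeResolver:
    """Hypothetical helper: resolve file sizes without a network round trip
    whenever the size is already known (e.g. from the Delta log)."""

    def __init__(self, known_sizes: Dict[str, int], head: Callable[[str], int]):
        self.known_sizes = known_sizes
        self._head = head      # stand-in for an S3 HEAD request
        self.head_calls = 0    # counts how often the fallback was needed

    def size(self, path: str) -> int:
        cached = self.known_sizes.get(path)
        if cached is not None:
            return cached
        # Size not known up front: fall back to asking the object store.
        self.head_calls += 1
        return self._head(path)


def fake_head(path: str) -> int:
    # Pretend every unknown object is 99 bytes long.
    return 99


resolver = SizeResolver({"part-0001.parquet": 2048}, fake_head)
print(resolver.size("part-0001.parquet"), resolver.head_calls)
print(resolver.size("unknown.parquet"), resolver.head_calls)
```

The counter is only there to make it visible that files listed in `known_sizes` never trigger the fallback.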