Matthew Powers
09/05/2023, 10:48 PM
Paweł Romanowski
09/06/2023, 8:30 AM
Chris Nogradi
09/06/2023, 2:58 PM
Nick Dunn
09/07/2023, 1:53 PM
https://$ACCOUNT_NAME.blob.core.windows.net. It seems delta-rs makes this assumption, too, and I can't seem to find a way to change it. I'm not exactly proficient with Rust, but my amateurish review of the code seems to confirm it's not possible.
The problem is, there are technically multiple "registered clouds" with Azure. For instance:
❯ az cloud list | jq '.[].name'
"AzureCloud"
"AzureChinaCloud"
"AzureUSGovernment"
"AzureGermanCloud"
In these other "clouds", the URL for, say, Azure blob storage changes. Is there any way to indicate to delta-rs that the endpoint suffix should not be blob.core.windows.net, but instead something else? In the past I've been able to work around this by just passing a connection string that includes EndpointSuffix=$SUFFIX, but that doesn't really seem to work here, either (or I'm missing something). Like I said, I was planning to submit a GH issue, but I figured it was time to ask around first before I go that far.
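To make the endpoint difference concrete, here is a minimal Python sketch that builds the blob URL and an EndpointSuffix-style connection string for each of the clouds listed above. The suffix values are the commonly documented ones and should be confirmed against `az cloud show --name <cloud>`; the helper names `blob_endpoint` and `connection_string` are hypothetical. The sketch makes no delta-rs calls, since whether delta-rs honors either form is exactly the open question here.

```python
# Storage endpoint suffixes per registered Azure cloud.
# Assumption: these are the commonly documented values; verify them for
# your environment before relying on them.
STORAGE_SUFFIXES = {
    "AzureCloud": "core.windows.net",
    "AzureChinaCloud": "core.chinacloudapi.cn",
    "AzureUSGovernment": "core.usgovcloudapi.net",
    "AzureGermanCloud": "core.cloudapi.de",
}


def blob_endpoint(account_name: str, cloud: str = "AzureCloud") -> str:
    """Return the blob service URL for an account in the given cloud."""
    return f"https://{account_name}.blob.{STORAGE_SUFFIXES[cloud]}"


def connection_string(account_name: str, account_key: str,
                      cloud: str = "AzureCloud") -> str:
    """Build a storage connection string carrying an explicit EndpointSuffix."""
    return (
        "DefaultEndpointsProtocol=https;"
        f"AccountName={account_name};"
        f"AccountKey={account_key};"
        f"EndpointSuffix={STORAGE_SUFFIXES[cloud]}"
    )


if __name__ == "__main__":
    # "myaccount" and "PLACEHOLDER_KEY" are placeholder values.
    for cloud in STORAGE_SUFFIXES:
        print(cloud, blob_endpoint("myaccount", cloud))
    print(connection_string("myaccount", "PLACEHOLDER_KEY", "AzureUSGovernment"))
```

Only the suffix after `blob.` changes between clouds, which is why a hard-coded `blob.core.windows.net` breaks outside AzureCloud.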
EDIT: I'm trying to use delta-rs in Python. I should have said that to begin with.
Cole MacKenzie
09/08/2023, 6:36 PM
false "fixes" it. That would be in line with https://github.com/delta-io/delta-rs/blob/main/rust/src/delta_arrow.rs#L79 and https://github.com/apache/arrow-rs/issues/1697
Will Jones
09/10/2023, 2:41 AM
Will Jones
09/11/2023, 2:28 AM
Matthew Powers
09/13/2023, 3:58 PM
Cole MacKenzie
09/14/2023, 4:09 AM
Eero Lihavainen
09/14/2023, 12:47 PM
deltalake with PyArrow and S3 and trying to minimize query latency, I noticed that PyArrow makes HEAD requests for every file it reads. I believe this is to determine the file size so the parquet footer can be read. However, I think the delta log already contains the file size, so in principle this could be communicated to Arrow to avoid the HEAD requests (see e.g. https://github.com/apache/arrow/pull/7547). Has anyone looked into something like this?
John Darrington
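Regarding Eero Lihavainen's point above that the delta log already contains the file size: a minimal sketch, assuming commit files in the newline-delimited JSON format of the Delta protocol, where each `add` action carries a `path` and a `size` in bytes. The sample paths and sizes are made up, and `file_sizes` is a hypothetical helper. It ignores `remove` actions and checkpoints, and it does not show how the sizes would be handed to PyArrow, which is the open question in the message.

```python
import json

# Lines shaped like entries in a _delta_log/<version>.json commit file.
# Paths and sizes are made-up sample values.
SAMPLE_COMMIT = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"add": {"path": "part-00000.parquet", "size": 1024,
                        "partitionValues": {}, "modificationTime": 0,
                        "dataChange": True}}),
    json.dumps({"add": {"path": "part-00001.parquet", "size": 2048,
                        "partitionValues": {}, "modificationTime": 0,
                        "dataChange": True}}),
])


def file_sizes(commit_text: str) -> dict:
    """Map data file path -> size in bytes for every `add` action."""
    sizes = {}
    for line in commit_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        action = json.loads(line)
        add = action.get("add")
        if add is not None:
            sizes[add["path"]] = add["size"]
    return sizes


sizes = file_sizes(SAMPLE_COMMIT)
print(sizes)
```

With a mapping like this available, a reader could in principle compute the footer offset for each file without first asking S3 for the object length.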
09/14/2023, 11:36 PM
rtyler
09/15/2023, 4:10 PM
Denny Lee
09/16/2023, 3:31 AM
Kevin Lim
09/18/2023, 7:12 AM
Alex Wilcoxson
09/18/2023, 4:40 PM
Kevin Lim
09/18/2023, 9:03 PM
Ion
09/19/2023, 6:00 PM
Ion
09/19/2023, 6:12 PM
rtyler
09/19/2023, 6:14 PM
Ion
09/19/2023, 6:16 PM
rtyler
09/20/2023, 3:22 AM
rtyler
09/20/2023, 3:41 AM
Eero Lihavainen
09/20/2023, 11:15 AM
Matthew Powers
09/21/2023, 3:28 AM
Hey Matt - resurrecting this thread!
We now have the CI infrastructure set up to allow community members to run the full test suite. Additionally, we’ve taken the Delta Lake PR and rebased it for the author here: https://github.com/dagster-io/dagster/pull/16463
We’re just waiting to hear back from the author as there are a few more things that need to be fixed. If he doesn’t get to it we plan on having our engineers take it over the finish line at the end of October. However, if the author ends up finishing it we will ship it earlier.
Let me know if you still want to do comms on this because we definitely do!
Matthew Powers
09/21/2023, 3:28 AM
Eero Lihavainen
09/21/2023, 11:25 AM
Does list_with_offset with S3 currently push the offset down to the S3 API, or does it fall back to the ObjectStore default implementation that uses a simple list with post-filtering? I ask because list_with_offset is not defined here: https://github.com/delta-io/delta-rs/blob/a74589be7c39315360925049c716d1d70b906970/rust/src/storage/s3.rs#L470
rtyler
09/21/2023, 6:46 PM
main and one or two fixes in flight. I'm planning to pick up your modularization work after that has been released.
Tony Wang
09/21/2023, 7:33 PM
Tony Wang
09/21/2023, 7:33 PM
Ion
09/22/2023, 6:23 AM