n

Nick Dunn

09/07/2023, 1:53 PM
Hi, everyone. I was going to submit a new issue in GitHub, but decided to confirm something before I do. I would like to use delta-rs to read a delta table in Azure blob storage for a client. Usually the URL, then, is something like https://$ACCOUNT_NAME.blob.core.windows.net. It seems delta-rs makes this assumption, too, and I can't seem to find a way to change it. I'm not exactly proficient with Rust, but my amateurish review of the code seems to confirm it's not possible. The problem is, there are technically multiple "registered clouds" with Azure. For instance:
❯ az cloud list | jq .[].name
"AzureCloud"
"AzureChinaCloud"
"AzureUSGovernment"
"AzureGermanCloud"
In these other "clouds", the URL for, say, Azure blob storage changes. Is there any way to indicate to delta-rs that the endpoint suffix should not be blob.core.windows.net, but instead something else? With other libraries, I've been able to work around this in the past by just passing a connection string that includes EndpointSuffix=$SUFFIX, but that doesn't really seem to work with delta-rs, either (or I'm missing something). Like I said, I was planning to submit a GH issue, but I figured it was time to ask around first before I go that far. EDIT: I'm trying to use delta-rs in Python. I should have said that to begin with.
w

Will Jones

09/07/2023, 3:01 PM
I know very little about azure, but these are the available options: https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html#variants
So it might not be supported. Are there other libraries that you think solve this well?
(The sheer number of options there are is dizzying 😵 )
n

Nick Dunn

09/07/2023, 3:25 PM
Tell me about it 😆. I'm more of an AWS guy, but you know...sometimes you don't get to pick your cloud. Yeah, I've stared at the docs there for a while now and I've tried more combinations than I can remember, at this point. They almost all fail for the same reason, which boils down to something like:
error trying to connect: dns error: failed to lookup address information: nodename nor servname provided, or not known
...which makes sense, since the storage account doesn't exist on the "AzureCloud" cloud (i.e. blob.core.windows.net). One package that did work was delta-lake-reader, but my team would prefer something more actively maintained. Anyway, with that I was able to build an Azure Blob file system object. I only provided a connection string to that object, and then I was able to pass that object to delta-lake-reader. Something like:
fs = AzureBlobFileSystem(connection_string=f"DefaultEndpointsProtocol=https;AccountName={AZURE_ACCOUNT_NAME};SharedAccessSignature={AZURE_SAS_TOKEN};EndpointSuffix={AZURE_SUFFIX}")

myTable = DeltaTable(f"{CONTAINER_NAME}/{TABLE_PATH}", file_system=fs)
But I'm not really sure if something like that is feasible here? I didn't see it in the constructor API reference, at least.
Sorry, I should be more clear. I meant to say I don't see it as an option here: https://delta-io.github.io/delta-rs/python/api_reference.html
I probably should have led with "I'm trying to use this in Python". Sorry about that.
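For anyone hitting the same DNS error quoted above, a quick way to confirm it is a hostname problem rather than a credentials problem is to check whether the blob host resolves at all. This is only a minimal sketch using the Python standard library; the `resolves` helper and the example hostname are illustrative and not part of delta-rs.

```python
import socket


def resolves(hostname: str) -> bool:
    """Return True if the hostname resolves via DNS, False otherwise."""
    try:
        # Port 443 because Azure storage endpoints are served over HTTPS.
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        # Same class of failure as "failed to lookup address information".
        return False


# Hypothetical usage: compare the default-cloud host with the host in the
# cloud the storage account actually lives in.
# resolves("storage123456.blob.core.windows.net")
# resolves("storage123456.blob.core.cloudapi.de")
```

If the default-cloud host does not resolve but the host with the other suffix does, the library is most likely building the URL with the wrong endpoint suffix.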
w

Will Jones

09/07/2023, 3:33 PM
Are there other libraries that you think solve this well?
By this, I didn't mean delta lake readers specifically. Just anything that connects to the azure systems, whether it be an official CLI or sdk or some other library. Want to know what are the canonical options to handle this sort of thing
👍 1
n

Nick Dunn

09/07/2023, 5:19 PM
I see - my bad. Let me try again, then. While trying to get this to work, I've become familiar with the Azure storage Python client library (PyPI and their GitHub). With that library, while you're building a blob client you can specify the account URL (which can include a different suffix) or you can pass a connection string only. Now, I don't think this would really work since it seems delta-rs is designed to make some assumptions based on the protocol specified in the URL (e.g. az:// or s3://). There's also the Python file system spec library, which includes an Azure DL and blob interface called ADLFS (GitHub). It once again works a bit differently, but perhaps most importantly I can pass a connection string which follows this standard as detailed by Azure's own documentation. Among other things, you can provide a custom "EndpointSuffix", which would address my particular conundrum. In fact, this is one thing I tried to reproduce with delta-rs. However, Azure's connection string documentation allows you to provide the storage account name within. It seems delta-rs does not support this, but instead requires an account_name to be provided via the storage options. If I do that, it seems to ignore the "EndpointSuffix" within the connection string.
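To make the connection-string point concrete, here is a rough sketch of pulling the account name and endpoint suffix out of an Azure-style connection string by hand. The `parse_connection_string` helper is hypothetical (it is not a delta-rs or Azure SDK function), and the sample values are made up.

```python
from typing import Dict


def parse_connection_string(conn_str: str) -> Dict[str, str]:
    """Split a 'Key=Value;Key=Value' connection string into a dict."""
    parts: Dict[str, str] = {}
    for segment in conn_str.split(";"):
        if not segment.strip():
            continue  # tolerate a trailing semicolon
        # Split on the first '=' only, since SAS tokens and account keys
        # can themselves contain '=' characters.
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"malformed segment: {segment!r}")
        parts[key.strip()] = value
    return parts


# Made-up example values, for illustration only.
example = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=storage123456;"
    "SharedAccessSignature=sv=2022-11-02&sig=abc;"
    "EndpointSuffix=core.usgovcloudapi.net"
)
parsed = parse_connection_string(example)
print(parsed["AccountName"], parsed["EndpointSuffix"])
```

Whether delta-rs would honor a suffix extracted this way is exactly the open question in this thread; the sketch only shows that the information is present in the string.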
i

Ion

09/08/2023, 7:10 PM
I just use the ABFS protocol, so something like this: abfs://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>. Then the storage options are: storage_options = {"client_id": CLIENT_ID, "client_secret": CLIENT_SECRET, "tenant_id": TENANT_ID}
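A minimal sketch of the ABFS approach described above, assuming the Python deltalake package and service principal credentials. The helper names and the `endpoint_suffix` parameter are illustrative; as the rest of the thread shows, a non-default suffix may well be rejected, so treat that parameter as an experiment rather than a supported option.

```python
def abfs_uri(file_system, account_name, path, endpoint_suffix="core.windows.net"):
    """Build an abfs:// URI for a table path inside a file system (container)."""
    return f"abfs://{file_system}@{account_name}.dfs.{endpoint_suffix}/{path.lstrip('/')}"


def service_principal_options(client_id, client_secret, tenant_id):
    """Storage options using the same keys as in the message above."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "tenant_id": tenant_id,
    }


def open_table(uri, storage_options):
    """Open the Delta table; imported lazily so the helpers work without deltalake."""
    from deltalake import DeltaTable

    return DeltaTable(uri, storage_options=storage_options)


uri = abfs_uri("myfilesystem", "storage123456", "tables/events")
print(uri)
```

Calling open_table(uri, service_principal_options(...)) would then attempt the actual read.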
n

Nick Dunn

09/08/2023, 7:46 PM
I think I've tried this before, but did so again. So, following Azure docs on how to construct the URI, I gave it a try. I think the problem, though, is that Azure is expecting you to use dfs.core.windows.net, but the client is on one of Azure's registered "clouds" other than the default "cloud". So, Azure rejects it.
OSError: Generic MicrosoftAzure error: URL did not match any known pattern for scheme:
If I try to use dfs.core.windows.net, I just receive a DNS error since the storage account does not exist on the standard Azure "cloud".
i

Ion

09/08/2023, 7:49 PM
What do you mean with standard Azure cloud? Is your storage account behind a vnet?
n

Nick Dunn

09/08/2023, 7:51 PM
Client is using one of the other registered Azure clouds (and I hate that they call them "clouds"):
❯ az cloud list | jq .[].name
"AzureCloud"
"AzureChinaCloud"
"AzureUSGovernment"
"AzureGermanCloud"
A lot of the endpoints change when you use a different "cloud" from the default "AzureCloud".
i

Ion

09/08/2023, 7:52 PM
To my understanding from the error, the URI is valid, but you are not able to access it since your network connection is blocked, either incoming or outgoing.
n

Nick Dunn

09/08/2023, 7:55 PM
I'm able to access it via the portal, via the CLI, and via other libraries without issue. I believe my issue is that I can't seem to figure out how to pass the somewhat unique settings to delta-rs when using Python. In fact, I'm starting to think there's an assumption that any Azure blob storage will use blob.core.windows.net. But if you're using one of their other "clouds" for various government-related entities, that endpoint/suffix changes. For instance, with the az CLI:
❯ az cloud list | jq .[].suffixes.storageEndpoint
"core.windows.net"
"core.chinacloudapi.cn"
"core.usgovcloudapi.net"
"core.cloudapi.de"
So, for example, if you have blob storage in the AzureGermanCloud "cloud", your blob endpoint becomes something like storage123456.blob.core.cloudapi.de.
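The per-cloud endpoint rule can be sketched as a small lookup. This assumes the cloud names and storage suffixes printed by the two `az cloud list` commands in this thread line up in the same order; the `service_endpoint` helper is hypothetical.

```python
# Storage endpoint suffixes per registered cloud, paired up from the
# `az cloud list` output quoted in this thread (same order assumed).
STORAGE_SUFFIXES = {
    "AzureCloud": "core.windows.net",
    "AzureChinaCloud": "core.chinacloudapi.cn",
    "AzureUSGovernment": "core.usgovcloudapi.net",
    "AzureGermanCloud": "core.cloudapi.de",
}


def service_endpoint(account_name, cloud="AzureCloud", service="blob"):
    """Return the HTTPS endpoint for a storage service ("blob" or "dfs")."""
    try:
        suffix = STORAGE_SUFFIXES[cloud]
    except KeyError:
        raise ValueError(f"unknown cloud: {cloud!r}") from None
    return f"https://{account_name}.{service}.{suffix}"


print(service_endpoint("storage123456", "AzureGermanCloud"))
# prints: https://storage123456.blob.core.cloudapi.de
```

Passing service="dfs" gives the Data Lake style host that the abfs:// scheme expects.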
i

Ion

09/08/2023, 7:56 PM
Ahh got it.
So what's the url you are using from the az cli?
n

Nick Dunn

09/08/2023, 7:57 PM
Yeah. It's awful. On the flip side, the equivalent in AWS is to just change the region. Very easy.
i

Ion

09/08/2023, 7:58 PM
Are you able to use fsspec with adlfs?
Or does that also not work
n

Nick Dunn

09/08/2023, 8:01 PM
I am able to use fsspec with adlfs. I have to init an azure blob file system and pass in an entire connection string. You can build it where you include a custom endpoint suffix (Azure doc). So, if I init an azure blob file system with only a connection string that includes that custom endpoint suffix, it works. I was able to pass that to the delta-lake-reader that's out there, which is another library we've used. It wasn't clear to me if I could do something similar with delta-rs.