Ion

08/18/2023, 9:31 PM
Will this PR fix the issue that you also need storage-level access when accessing Unity Catalog tables? Currently in my org, a team has only granted a service principal access to tables through Unity Catalog, not to the storage layer. https://github.com/delta-io/delta-rs/pull/1338

Robert

08/18/2023, 10:30 PM
This one has been sitting there for a while as we figure some stuff out ... So the answer is maybe 🙂. I'd be very interested in your scenario, and hopefully we can support it. From the short description, I do not completely follow the auth scenario you are describing. Does this refer to service principal credentials that can be communicated via the Unity APIs, or does this refer to some form of managed identity in the environment you are working in? In the end, we need some form of credential to read the files from storage and perform list operations on the log ...

Ion

08/19/2023, 9:40 AM
So the scenario is as follows: the platform team created some Delta tables and only provides system-to-system access. So our team (data science) has one service principal for Dev/Test/Acc and one for Prod. Both of them have been granted access to those tables through Unity Catalog. The platform team claims they have not granted access directly on the storage, so according to them reading directly from the storage layer wouldn't work. They mentioned the data gets picked up by a Managed Service Identity (MSI). Now I am not sure how I would set up a connection with delta-rs using these service principals and access the tables through UC. I also cannot create a PAT with my own account, since as a user I have not been granted USE CATALOG or USE SCHEMA rights on their tables. Does this help provide more context?

Robert

08/19/2023, 10:34 AM
That does help a bit, but poses a challenge. In the end, some credential is needed to read from storage. So one question would be how you access the service principal credentials, and what your working environment is. It seems the service principal could be used to access Unity Catalog and read the tables' metadata. However, if that SP does not have storage access, it cannot be used to actually read the table. Within a Databricks environment, it may then be that some managed identity does the actual reading of the storage, and Databricks figures out via Unity Catalog whether you may read a table - so it makes sure that the "all powerful" credential (the managed identity) is only used to access tables your identity has access to ... While delta-rs can also use managed identity credentials, I doubt that we can access that identity within Databricks, since that would provide a very easy means to circumvent permissions set in Unity. Maybe you could ask the platform team if they envision scenarios to also read tables from any environment outside of Databricks? Of course this answer still contains a lot of assumptions ... In case they do not provide that right now, setting up Delta Sharing might be a way to get direct read access to the files and process data in external environments.

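To make the "some credential is needed to read from storage" point concrete, here is a minimal sketch of how a service principal that does have storage access could be handed to delta-rs from Python. It assumes the `deltalake` package and the Azure option keys understood by its underlying object store layer; the account name, tenant, container, and table path are hypothetical placeholders, not values from this thread.

```python
from typing import Dict


def service_principal_storage_options(
    account_name: str, tenant_id: str, client_id: str, client_secret: str
) -> Dict[str, str]:
    """Build the storage_options mapping for service principal auth on Azure storage."""
    return {
        "azure_storage_account_name": account_name,
        "azure_tenant_id": tenant_id,
        "azure_client_id": client_id,
        "azure_client_secret": client_secret,
    }


def open_table(table_uri: str, storage_options: Dict[str, str]):
    """Open a Delta table directly from storage, with no cluster or SQL warehouse involved."""
    # Imported lazily so the helper above can be used without deltalake installed.
    from deltalake import DeltaTable

    return DeltaTable(table_uri, storage_options=storage_options)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    options = service_principal_storage_options(
        "mystorageaccount", "<tenant-id>", "<client-id>", "<client-secret>"
    )
    table = open_table(
        "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/path/to/table",
        options,
    )
    print(table.version())
```

This only works if the service principal has been granted read and list permissions on the storage account itself; a Unity Catalog grant alone would not be enough for this path.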
Ion

08/19/2023, 4:03 PM
We have a resource group with a key vault in it. The key vault contains the secrets of our SP (clientid and clientsecret). The platform team claims to have made a unified way of access. They created a package that is simply a wrapper around the databricks-sql Python package; the SP credentials are automatically fetched from the key vault, and then you can access all the tables in that catalog. I've spoken with the team lead and I am trying to convince him to also allow direct storage access with delta-rs, without the use of any cluster or SQL warehouse in Databricks, which should therefore be cheaper in the end. The annoying thing here is that this team has set up this solution for only one source system in our org. Other types of source systems are managed by a different team, which has set up data access through UC and with ACLs on the storage layer. So there I can access the data with delta-rs, but those datasets are around 100 TB, so there I do need Spark with Delta instead.
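For reference, fetching the SP credentials from a key vault could look roughly like the sketch below. It is not the platform team's wrapper; it assumes the `azure-identity` and `azure-keyvault-secrets` packages, and the secret names `clientid` and `clientsecret` as well as the vault URL are hypothetical. The lookup is passed in as a plain callable so the logic can be tried without a real vault.

```python
from typing import Callable, Dict


def load_sp_credentials(
    get_secret: Callable[[str], str],
    id_name: str = "clientid",          # hypothetical secret name
    secret_name: str = "clientsecret",  # hypothetical secret name
) -> Dict[str, str]:
    """Look up the service principal's client id and secret by secret name."""
    return {
        "client_id": get_secret(id_name),
        "client_secret": get_secret(secret_name),
    }


def key_vault_getter(vault_url: str) -> Callable[[str], str]:
    """Return a lookup function backed by Azure Key Vault."""
    # Imported lazily so load_sp_credentials can be used without the Azure SDK.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
    return lambda name: client.get_secret(name).value


if __name__ == "__main__":
    # Hypothetical vault URL; requires an identity allowed to read its secrets.
    creds = load_sp_credentials(key_vault_getter("https://my-vault.vault.azure.net"))
    print(sorted(creds))
```

The resulting client id and secret could then be used either with a Databricks SQL connection or, if storage access is ever granted, directly as storage credentials for delta-rs.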