rtyler

06/02/2023, 4:37 AM
@Robert the support for unity catalog (https://github.com/delta-io/delta-rs/pull/1338) seems to be languishing a bit, that draft now has some conflicts unfortunately. Do you need some help bringing that support across the line? I can deploy some Databricks resources in AWS to validate and test if necessary

Robert

06/02/2023, 6:09 AM
I wanted to optimize a few things, specifically we can avoid a bunch of requests by just listing tables from the table summaries query. On top of that I was also contemplating some more improvements in how we create the tables, but that could be a follow up as well, as it may require some internal changes in the core. Specifically, if we lazy-load the log, as @Will Jones suggested, we can probably load the metadata from the catalog, without requiring touching the log at all .. Lastly - and this I am maybe most excited about 😄 - unity catalog also communicates credentials. As a bonus, once we can integrate object_store 0.6, we can likely get rid of some of the client code...
But if you can create an AWS instance to make sure it works, that would be awesome!

rtyler

06/02/2023, 6:10 AM
I thought about spinning up a PR to incorporate object_store 0.6 but stopped because of the AWS_PROFILE changes. Should I go ahead and just get that going?

Robert

06/02/2023, 6:12 AM
yeah, we'll have to address this! I think we are tied to 0.5* right now though, until datafusion upgrades...

rtyler

06/02/2023, 6:13 AM
ah you're right, I'm so used to datafusion and arrow coupling that I didn't realize datafusion also specifies a strict dependency with object_store
Specifically, if we lazy-load the log, as @Will Jones suggested, we can probably load the metadata from the catalog, without requiring touching the log at all ..
I'm curious why that is an improvement. Ultimately the log in object storage is the source of truth for all that metadata anyways, is it not?

Robert

06/02/2023, 6:20 AM
well, in this case the data in the log would definitely take precedence if there is a conflict, but I think the data you get from the catalog is not fetched from the log when you request it there. And today we still have to scan the entire log for metadata actions, so with the catalog the initial load of a table could be much cheaper, for us especially also in terms of memory consumed... in the end you are right though, if you want to do more than inspect the metadata, we'll have to load the log anyhow.
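
A rough sketch of the lazy-loading idea discussed above, in Python for brevity. This is not delta-rs code: `LazyTable`, `TableMetadata`, and the `load_log` callback are hypothetical names, and the log is modelled as a plain list of dicts using the Delta protocol's `metaData`, `add`, and `remove` action names. It only illustrates the trade-off: metadata can be served from the catalog without touching the log, the log is read once when file listings are needed, and once loaded the log's metadata takes precedence.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TableMetadata:
    name: str
    schema: str
    source: str  # "catalog" or "log", for illustration only


@dataclass
class LazyTable:
    # Metadata handed over by the catalog (assumed to be available up front).
    catalog_metadata: TableMetadata
    # Hypothetical callback that reads and parses the whole transaction log.
    load_log: Callable[[], List[dict]]
    _actions: Optional[List[dict]] = field(default=None, init=False)

    def _ensure_log(self) -> List[dict]:
        # Load the log at most once, and only when something needs it.
        if self._actions is None:
            self._actions = self.load_log()
        return self._actions

    def metadata(self) -> TableMetadata:
        # Before the log is loaded, answer from the catalog alone.
        if self._actions is None:
            return self.catalog_metadata
        # Once loaded, the latest metaData action in the log wins.
        latest = None
        for action in self._actions:
            if "metaData" in action:
                latest = action["metaData"]
        if latest is None:
            return self.catalog_metadata
        return TableMetadata(
            name=self.catalog_metadata.name,
            schema=latest["schemaString"],
            source="log",
        )

    def files(self) -> List[str]:
        # Anything beyond metadata inspection requires replaying the log.
        active = {}
        for action in self._ensure_log():
            if "add" in action:
                active[action["add"]["path"]] = True
            elif "remove" in action:
                active.pop(action["remove"]["path"], None)
        return list(active)
```

Calling `metadata()` on a fresh `LazyTable` never invokes `load_log`; the first call to `files()` does, and later `metadata()` calls then reflect the log rather than the catalog.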