Kevin Lim

07/23/2023, 1:00 AM
I see Delta Lake can be used without a data catalog, and that the only two catalog options are AWS Glue and Databricks Unity? Can someone explain exactly what a catalog provides and what the advantages/disadvantages of using one are? Does it improve query performance, or does it just improve discoverability of datasets/tables? Are there any open source/self-hosted catalogs one can use with Delta Lake? How does Delta Lake performance compare to just treating your data as an Arrow dataset stored as Parquet and partitioned appropriately?

Yuri Niitsuma

07/23/2023, 1:07 PM
Are there any open source/self-hosted catalogs one can use with Delta Lake?
For open source, you can use the Hive catalog. Example of a simple (not production-ready) Hive catalog server: https://github.com/ignitz/jibaro/blob/main/lake_lab/hive/Dockerfile

Kevin Lim

07/23/2023, 6:26 PM
Ah, can anyone answer the other questions I had? Looks like the Python/Rust library for Delta Lake only supports AWS Glue and Databricks Unity.

Jordan Fox

07/23/2023, 10:31 PM
Performance: Delta Lake is Parquet under the hood, so it won't be more performant. But Parquet files are immutable; Delta provides ACID transactions on top of them. Catalogs: they provide discoverability and potentially governance.
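To make the "immutable files plus transactions" point concrete, here is a toy sketch using only the Python standard library. It is not the Delta Lake implementation: the helper names `commit` and `live_files` are hypothetical, the data file names are placeholders, and no Parquet is actually written. It only mimics the general idea of an ordered commit log (zero-padded JSON files in a `_delta_log` directory, each holding `add`/`remove` actions) that decides which immutable data files currently belong to the table.

```python
import json
import os
import tempfile

LOG_DIR = "_delta_log"  # directory name borrowed from Delta Lake; the rest is a toy


def commit(table_path, version, actions):
    """Write one commit file; raises FileExistsError if that version is taken."""
    log_dir = os.path.join(table_path, LOG_DIR)
    os.makedirs(log_dir, exist_ok=True)
    commit_path = os.path.join(log_dir, f"{version:020d}.json")
    # O_EXCL makes creation fail if another writer already claimed this version.
    fd = os.open(commit_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def live_files(table_path):
    """Replay the log in version order to find the current set of data files."""
    log_dir = os.path.join(table_path, LOG_DIR)
    files = set()
    for name in sorted(os.listdir(log_dir)):  # zero-padded names sort by version
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files


table = tempfile.mkdtemp()
commit(table, 0, [{"add": {"path": "part-0.parquet"}}])
commit(table, 1, [{"add": {"path": "part-1.parquet"}}])
# An "overwrite" is one commit that removes the old files and adds a new one.
commit(table, 2, [
    {"remove": {"path": "part-0.parquet"}},
    {"remove": {"path": "part-1.parquet"}},
    {"add": {"path": "part-2.parquet"}},
])
print(sorted(live_files(table)))  # ['part-2.parquet']
```

The data files themselves are never modified in this sketch. Readers only ever see the files referenced by fully written commits, and two writers cannot both claim the same version number. That is why the log adds transactional guarantees without making the underlying Parquet reads any faster.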

Kevin Lim

07/24/2023, 12:07 AM
Am I correct in my understanding that the Python Delta implementation only supports AWS Glue and Databricks Unity?

Jordan Fox

07/24/2023, 12:30 AM
DeltaTable.from_data_catalog() calls RawDeltaTable.get_table_uri_from_data_catalog, and the DataCatalog class currently only has AWS='glue' and UNITY='unity'. The from-URI call shows:
#[classmethod]
fn get_table_uri_from_data_catalog(
    _cls: &PyType,
    data_catalog: &str,
    database_name: &str,
    table_name: &str,
    data_catalog_id: Option<String>,
) -> PyResult<String> {
    let data_catalog =
        deltalake::data_catalog::get_data_catalog(data_catalog).map_err(|_| {
            PyValueError::new_err(format!("Catalog '{}' not available.", data_catalog))
        })?;
    let table_uri = rt()?
        .block_on(data_catalog.get_table_storage_location(
            data_catalog_id,
            database_name,
            table_name,
        ))
        .map_err(|err| PyIOError::new_err(err.to_string()))?;

    Ok(table_uri)
}
Which errors out with a not-supported-yet message if the catalog isn't in the DataCatalog class. So, tl;dr: yes, only Glue and Unity if you want to load your table from a catalog. You can still load your table from literally anywhere else though.
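The lookup described above boils down to "turn a catalog name plus a database and table name into a storage URI, or fail if the catalog is unknown". A minimal Python sketch of that shape follows. Everything in it is hypothetical and for illustration only: `CatalogKind`, `FAKE_CATALOGS`, `get_table_uri`, and the `s3://example-bucket/...` location are made up here and are not part of the deltalake API. Only the two catalog names and the error text mirror what the thread shows.

```python
from enum import Enum


class CatalogKind(Enum):
    # Mirrors the two values mentioned in the thread; the class itself is a toy.
    AWS = "glue"
    UNITY = "unity"


# Hypothetical in-memory "catalogs": (database, table) -> storage location.
FAKE_CATALOGS = {
    CatalogKind.AWS: {("sales", "orders"): "s3://example-bucket/sales/orders"},
    CatalogKind.UNITY: {},
}


def get_table_uri(catalog_name, database_name, table_name):
    """Resolve a table name to its storage URI; raise if the catalog is unknown."""
    try:
        kind = CatalogKind(catalog_name)
    except ValueError:
        raise ValueError(f"Catalog '{catalog_name}' not available.") from None
    try:
        return FAKE_CATALOGS[kind][(database_name, table_name)]
    except KeyError:
        raise LookupError(f"Table {database_name}.{table_name} not found") from None


print(get_table_uri("glue", "sales", "orders"))  # s3://example-bucket/sales/orders
```

Seen this way, the catalog's only job in the load path is name resolution. Once the URI is known, the table is opened from that location like any other, which is why a table can still be loaded directly by path when no supported catalog is available.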

rtyler

07/24/2023, 12:36 AM
yeah, we basically haven't had anybody step into #delta-rs to provide the Hive catalog implementation 🤷

Jordan Fox

07/24/2023, 12:37 AM
I think we as a community are going to move away from Hive as a catalog source fairly soon, so I'm fine with that.

rtyler

07/24/2023, 12:47 AM
@Jordan Fox what do people use out there? We've been using Glue for a few years now

Jordan Fox

07/24/2023, 12:49 AM
I'm in private preview for replacing my Thrift servers with Unity. Also, the Nessie project is taking off. I'm surprised the Delta community hasn't already co-opted it.

Kevin Lim

07/24/2023, 1:40 AM
Also, has anyone noticed this error when using delta-rs's Python bindings?
terminate called without an active exception
caused by this:
import pandas as pd
from deltalake.writer import write_deltalake

df = pd.DataFrame({"a": [1,2,3,4,5], "b": [3,2,1,2,5]})
write_deltalake("dlake", df)

df = pd.DataFrame({"a": [1,2,3,4,5], "b": [3,2,1,2,5]})
write_deltalake("dlake", df, mode="append")
I noticed that the files seem to have been written successfully, though.
Has anyone run into this bug before?
I didn't see this bug in the GitHub issues. I'm using Python 3.10.9 and the latest version of deltalake.
bump
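For a report like this, exact versions tend to be more useful than "latest". "terminate called without an active exception" is a message printed by the C++ runtime rather than by Python itself, so the versions of native dependencies may matter too. The sketch below uses only the standard library to collect them. The package list (`deltalake`, `pyarrow`, `pandas`) is an assumption about what is relevant here, and `package_versions` is a hypothetical helper name.

```python
import platform
from importlib import metadata


def package_versions(names):
    """Return {package: version or None} without importing the packages."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


report = {
    "python": platform.python_version(),
    "platform": platform.platform(),
    **package_versions(["deltalake", "pyarrow", "pandas"]),
}
for key, value in report.items():
    print(f"{key}: {value}")
```

Pasting this output alongside the failing script gives others a chance to reproduce the crash with the same combination of versions.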

Matthew Powers

07/25/2023, 6:01 PM
I’ve never seen that issue before. Seems like a generic Python error and not related to Delta Lake.

Kevin Lim

07/25/2023, 6:17 PM
I'm literally just running that script though.
I don't get this issue with any other python code