
Matthew Powers

02/03/2023, 5:28 PM
Mehdi Ouazza, a popular data LinkedIn person, just joined MotherDuck and pinged me about whether we’d like to collab on the Delta Lake / DuckDB integration. Is this something we’d like to invest in now, or should we wait till the ADBC work is a bit further along?
👍 3

Will Jones

02/03/2023, 5:32 PM
I think we should definitely chat with them. Even if we end up waiting, be good to make sure we are aligned on a design

Robert

02/03/2023, 6:03 PM
Not only as a fan of DuckDB: I agree with Will 🙂

Matthew Powers

02/03/2023, 7:48 PM
Sweet, he just joined and is in Brussels meeting the team now. I’ll try to grab him in the coming weeks and we can all chat and get on the same page!
🙌 2

Zach

02/04/2023, 8:53 PM
Ohhh excited to see what comes out of this 🦆
👍 1

Kris Geusebroek

02/06/2023, 3:54 PM
I investigated what needs to be done to add support for the Delta format in DuckDB (Parquet is already supported). This was for a pet project and for me to learn DuckDB a bit. In my quest to see what was needed, I found that DuckDB thinks about tables differently. To read Parquet data you pass a file path or a glob. That expands to a list of files, which are read with the ParquetReader class. In Hadoop-based readers you can provide an upper-level directory, and all subdirectories are checked and added if valid. With the current glob-based approach you don't have access to the directory structure, so getting to the Delta JSON file is impossible, and there is no way to figure out the correct file list to read. Of course adding Delta support should be doable, but don't think too lightly about it ;-)
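
To make the "correct file list" point concrete, here is a minimal Python sketch of what a reader has to do before it can open any Parquet file: replay the JSON commits under `_delta_log` and track `add` / `remove` actions. The function name `active_files` is made up for illustration. It deliberately ignores checkpoints, protocol checks, and URL-encoded paths, so treat it as an outline of the idea rather than a conforming Delta reader.

```python
import json
from pathlib import Path


def active_files(table_root):
    """Replay the JSON commits in _delta_log and return the live data file paths.

    Simplified on purpose: checkpoint files and protocol/metadata actions are ignored.
    """
    log_dir = Path(table_root) / "_delta_log"
    live = set()
    # Commit names are zero-padded version numbers, so sorting by name sorts by version.
    commits = sorted(p for p in log_dir.glob("*.json") if p.stem.isdigit())
    for commit in commits:
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)
```

A plain glob over the table directory would also pick up files that a later commit removed, which is why the log has to be consulted first.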

Matthew Powers

02/06/2023, 5:01 PM
@Kris Geusebroek - thanks for chiming in. Here’s how I implemented the file readers for the Dask Delta reader if you’d like to take a look at the code: https://github.com/MrPowers/deltadask/blob/main/deltadask/__init__.py. Only 13 lines. I’m guessing the DuckDB readers will be relatively easy to add, but the writers will be tricky. Hopefully the ADBC stuff makes the writers easier to add.

Kris Geusebroek

02/07/2023, 7:36 AM
Hi @Matthew Powers bear in mind that the duckdb code is dependency free as I understood it. I guess that makes it more than 13 lines to code 😉

Robert

02/07/2023, 9:42 AM
From my current understanding of ADBC, we would likely aim for an integration on the native side, i.e. we create a .so library that DuckDB's C++ code can load (maybe as an extension / plugin), which allows DuckDB to read the tables. Ideally DuckDB would communicate any predicates it wants to push down, so we can leverage optimized reading from object stores etc. That said, there may be more alternatives, e.g. just communicating the files and metadata to DuckDB if it turns out to be advantageous for DuckDB to handle reading the files from storage. I guess we’ll have to see what comes out of the discussions 🙂
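
A rough Python sketch of the "just communicate the files and metadata" alternative, under some assumptions: the `add` actions have already been read from the Delta log, the pushed-down predicate is a simple equality on partition columns, and the resulting list is handed to DuckDB's `read_parquet`, which accepts a list of files. The helper names `prune_by_partition` and `read_parquet_sql` are hypothetical and not part of any existing integration.

```python
def prune_by_partition(add_actions, wanted):
    """Keep only add actions whose partitionValues equal every key/value in `wanted`."""
    return [
        a for a in add_actions
        if all(a.get("partitionValues", {}).get(k) == v for k, v in wanted.items())
    ]


def read_parquet_sql(table_root, add_actions):
    """Build a DuckDB query over an explicit, already-pruned file list."""
    root = table_root.rstrip("/")
    paths = [f"{root}/{a['path']}" for a in add_actions]
    # Escape single quotes so each path is a valid SQL string literal.
    quoted = ", ".join("'" + p.replace("'", "''") + "'" for p in paths)
    return f"SELECT * FROM read_parquet([{quoted}])"
```

In this split the Delta side only decides which files matter, and DuckDB keeps doing the Parquet reading itself. The native-library route described above would instead move the reading (and richer predicate pushdown) into the Delta side.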

Zach

02/08/2023, 12:50 AM
probably far too early to use, but there is some experimental work done to enable the writing of rust duckdb extensions: https://github.com/Mause/duckdb-extension-framework
And a rust extension for delta lake built with it