Matthew Powers

06/13/2023, 1:15 PM
The Dask community is making some exciting progress on their connector: https://github.com/dask-contrib/dask-deltatable

My high level thoughts on the existing code:
• They should just use the delta-rs history() command and not reimplement it
• They should use the delta-rs vacuum() command
• I’m not sure about the catalog logic, but my guess is that there should not be catalog logic in the Dask connector itself

See the thread for the POC reader I created.
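import dask.dataframe as dd
from deltalake import DeltaTable


def read_delta(
    table_uri,
    version=None,
    storage_options=None,
    without_files=False,
    columns=None,
    filters=None,
):
    # Load the table at the requested version (latest when version is None).
    dt = DeltaTable(
        table_uri,
        version=version,
        storage_options=storage_options,
        without_files=without_files,
    )
    # `filters` is passed to PyArrow as-is, so it must be a pyarrow.dataset Expression or None.
    fragments = dt.to_pyarrow_dataset().get_fragments(filter=filters)
    # The POC treats fragment paths as relative to the table root and prefixes the table URI.
    filenames = [f"{table_uri}/{fragment.path}" for fragment in fragments]
    return dd.read_parquet(filenames, engine="pyarrow", columns=columns)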
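
One small thing to watch in the POC: the file paths are built with plain string formatting, so a table_uri that ends in a slash would produce a double slash in every file name. A minimal sketch of a join helper, assuming fragment paths are relative to the table root (the name join_table_path is hypothetical and not part of delta-rs or Dask):

```python
def join_table_path(table_uri: str, relative_path: str) -> str:
    """Join a table root and a table-relative file path with exactly one slash.

    Hypothetical helper for illustration; not part of delta-rs or Dask.
    """
    # Drop trailing slashes from the root and leading slashes from the file
    # path so the result never contains "//" at the join point.
    return f"{table_uri.rstrip('/')}/{relative_path.lstrip('/')}"
```

In the reader, the list of file names could then be built with join_table_path(table_uri, fragment.path) instead of the f-string.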