One question, maybe not specific to delta-rs but about Delta in general: what is the main reason for the tight coupling with the Parquet format?
I was thinking that feature-wise Delta is great and has tons of external connectors, but one thing it imposes is migrating to Parquet, because the files store metadata/stats. But that can be done with other formats as well, like Avro or ORC.
Shouldn't Delta be a thin wrapper that adds its current features around "an existing data lake" rather than "a data lake made of Parquet only"? The file format implementation could be decoupled from the Delta protocol and exposed via an interface that connectors implement to support a file format.
This would also be more future-proof: what if tomorrow there's a shiny new file format that beats Parquet? Delta would need lots of refactoring to support it, but with the interfaces I mentioned above it should be relatively easy.
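To make the interface idea above a bit more concrete, here is a rough Python sketch. Everything in it is hypothetical: `FileFormat`, `JsonLinesFormat`, `collect_stats`, `register`, and `format_for` are names invented for illustration and are not part of Delta or delta-rs. The toy JSON-lines format only stands in for a real one like Parquet, Avro, or ORC.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Row = Dict[str, Any]


class FileFormat(ABC):
    """Hypothetical per-format interface; not part of Delta or delta-rs."""

    provider: str  # the name the table metadata would refer to

    @abstractmethod
    def write(self, rows: List[Row]) -> bytes:
        """Serialize rows into the bytes of one data file."""

    @abstractmethod
    def read(self, data: bytes) -> List[Row]:
        """Deserialize the bytes of one data file back into rows."""


class JsonLinesFormat(FileFormat):
    """Toy stand-in for a real format such as Parquet, Avro, or ORC."""

    provider = "jsonl"

    def write(self, rows: List[Row]) -> bytes:
        return "\n".join(json.dumps(r) for r in rows).encode("utf-8")

    def read(self, data: bytes) -> List[Row]:
        lines = data.decode("utf-8").splitlines()
        return [json.loads(line) for line in lines if line]


def collect_stats(rows: List[Row]) -> Dict[str, Any]:
    """Per-file statistics computed from the rows, independent of the format."""
    columns = sorted({key for row in rows for key in row})
    values = {c: [r[c] for r in rows if r.get(c) is not None] for c in columns}
    return {
        "numRecords": len(rows),
        "minValues": {c: min(v) for c, v in values.items() if v},
        "maxValues": {c: max(v) for c, v in values.items() if v},
    }


# Registry that a connector would populate, keyed by provider name.
REGISTRY: Dict[str, FileFormat] = {}


def register(fmt: FileFormat) -> None:
    REGISTRY[fmt.provider] = fmt


def format_for(provider: str) -> FileFormat:
    """Look up the connector for a provider name, failing loudly if missing."""
    try:
        return REGISTRY[provider]
    except KeyError:
        raise ValueError(f"no connector registered for format {provider!r}")


register(JsonLinesFormat())
```

The point of the sketch is only that stats collection and format lookup can live outside any one format; a real implementation would also have to cover things like column pruning and predicate pushdown, which is where most of the work would be.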
Just some thoughts here. What's your take on this?
05/21/2023, 9:18 AM
In principle there is nothing in the Delta protocol that ties it to Parquet. There is even a file format field defined in the metadata. In practice, though, there seems to have been no real traction around other formats. From the delta-rs perspective the dependency on Parquet itself is fairly isolated, and I vaguely remember a discussion about whether the Arrow format should be supported.
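For reference, the format field mentioned above sits in the `metaData` action of the transaction log. As far as I recall from the protocol spec, it is an object with a `provider` name and an `options` map. A minimal sketch of reading it with plain Python follows; the table id and schema string are made-up sample values, and `table_format` is a hypothetical helper, not a delta-rs function.

```python
import json
from typing import Optional

# One line of a Delta transaction log file holding a metaData action.
# The id and schemaString values are invented for illustration.
log_line = json.dumps({
    "metaData": {
        "id": "example-table-id",
        "format": {"provider": "parquet", "options": {}},
        "schemaString": "{\"type\":\"struct\",\"fields\":[]}",
        "partitionColumns": [],
        "configuration": {},
    }
})


def table_format(line: str) -> Optional[str]:
    """Return the format provider if the log line is a metaData action."""
    action = json.loads(line)
    meta = action.get("metaData")
    if meta is None:
        return None  # some other action, e.g. add or remove
    return meta["format"]["provider"]


print(table_format(log_line))
```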
05/22/2023, 4:36 PM
In principle, Delta Lake is file-format agnostic. However, the Spark implementation was developed around Parquet, so there is implicit coupling in that implementation. I don't think anybody in the project would reject efforts to decouple it better, but it would be a lot of work to achieve similar features and performance. I could be wrong, but that is my perception from conversations over the years.