I’ve been working on a branch that enables Arrow schemas with Dictionary columns to convert into a Delta schema, using the dictionary’s value type. It also changes how partition stringification works so it handles partition columns encoded as Dictionary in Arrow. This cleans up some ergonomics around the dictionary columns we’re dealing with and prevents failures during schema conversion.
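For context, the conversion essentially boils down to unwrapping the dictionary’s value type. Here’s a minimal sketch of the idea (not the actual branch code; the helper name is made up):

```rust
use arrow::datatypes::DataType;

// Hypothetical helper mirroring the branch's behavior: Delta has no
// dictionary concept, so a Dictionary column's Delta type is derived
// from the dictionary's value type.
fn delta_type_source(dt: &DataType) -> &DataType {
    match dt {
        DataType::Dictionary(_key_type, value_type) => value_type.as_ref(),
        other => other,
    }
}

fn main() {
    let dict = DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8));
    // The Delta schema conversion treats this column as a plain string.
    assert_eq!(delta_type_source(&dict), &DataType::Utf8);
}
```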
One thing we’d also like to leverage is preserving dictionary encoding when reading back from Parquet. Parquet stores the Arrow schema in its metadata when a batch is written out. However, all the delta-rs write paths derive the Arrow schema for writing from the Delta schema, which has no concept of Dictionary afaik.
So I’m coming up with ways to ensure the Arrow schema has Dictionary columns where we want them when it gets written out, so that Parquet will read them back as dictionary columns when querying.
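To illustrate the mechanism I’m relying on, here’s a rough round-trip sketch (assuming recent arrow, parquet, and bytes crates; exact APIs may vary by version): because ArrowWriter embeds the Arrow schema in the Parquet key-value metadata, a dictionary-typed field comes back dictionary-encoded on read.

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, DictionaryArray};
use arrow::datatypes::{DataType, Field, Schema, UInt16Type};
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A batch with a dictionary-encoded string column.
    let dict_type = DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8));
    let schema = Arc::new(Schema::new(vec![Field::new("part", dict_type, false)]));
    let array: DictionaryArray<UInt16Type> = vec!["a", "b", "a"].into_iter().collect();
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(array) as ArrayRef])?;

    // Write: ArrowWriter stores the Arrow schema in the Parquet metadata.
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // Read: the embedded schema hint restores the Dictionary type.
    let mut reader = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf))?.build()?;
    let read_back = reader.next().unwrap()?;
    assert!(matches!(read_back.column(0).data_type(), DataType::Dictionary(_, _)));
    Ok(())
}
```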
So far I have:
• when writing out, add an option to use the write execution plan’s schema instead of reading the schema from Delta and converting to Arrow; some validation would still occur
◦ don’t like this, as it hits more code paths and is a lower-level detail that’s hard to communicate upwards to users of the API
• when defining a Delta schema, add metadata to fields suggesting they should be dictionary-encoded. Then on Delta -> Arrow schema conversion, check for that metadata and, if present, wrap the Arrow type in Dictionary (see the sketch after this list)
◦ like this better, as it’s encoded into the Delta schema, in line with how the Parquet Arrow schema hint works
◦ is this too specific to the Rust/Arrow impl though? Is there precedent for options/metadata that only apply to one Delta implementation?
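For option 2, something along these lines is what I’m imagining. The metadata key and the fixed key width are made up here, just to show the shape of the conversion hook:

```rust
use std::collections::HashMap;
use arrow::datatypes::{DataType, Field};

// Hypothetical metadata key; not part of the Delta protocol today.
const DICT_HINT_KEY: &str = "delta.rs.dictionaryEncode";

// Sketch: after the normal Delta -> Arrow field conversion, check the Delta
// field's metadata and wrap the converted type in Dictionary when hinted.
fn apply_dictionary_hint(field: Field, delta_metadata: &HashMap<String, String>) -> Field {
    if delta_metadata.get(DICT_HINT_KEY).map(String::as_str) == Some("true") {
        let dict_type = DataType::Dictionary(
            Box::new(DataType::UInt16), // key width could itself be a metadata option
            Box::new(field.data_type().clone()), // converted type becomes the value type
        );
        Field::new(field.name(), dict_type, field.is_nullable())
    } else {
        field
    }
}
```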
Would love feedback and thoughts, thanks!