We are facing a dilemma around how to improve the layout on disk of data which is going to be queried by using lte/gte filters on a decimal value. This sounds like a great use case for Z-Ordering optimizations in Delta Lake, but we are also concerned that this might result in an extraordinary amount of IO operations or shuffling, and we considered to create our own “bin” column and partition by bin. We haven’t found much on Delta Lake documentation that compares Z-Orders with explicit binning, does anyone have some useful insight on the topic?
02/06/2023, 6:34 PM
can you explain more about your concerns with IO and shuffling?
and how would users use the bin column? would they need to explicitly provide a predicate against the bin column?
02/15/2023, 9:07 PM
That’s right, they would provide a predicate against the bin column
02/16/2023, 5:26 AM
when it comes to z-ordering, those predicates and that type should be supported. maybe trial both binning and z-ordering on a subset of the data and see what happens?
another thing to consider here is that you'll get a lot of IO and shuffling with partitioning, especially if the number of bins you have is considerably large (let's say 10k+)