Naama Gal-Or

01/17/2023, 8:34 AM
Is there a way to force `deltaTable.optimize().executeZOrderBy` to sort records within each Parquet file by the sort column? At the moment, the data within each file does not seem to be sorted by that column, and this affects our file sizes because similar records are not located next to each other.

JosephK (exDatabricks)

01/17/2023, 12:32 PM
What do you mean by sort column?

Naama Gal-Or

01/17/2023, 12:34 PM
Prior to using Z-ordering, we had our own compaction implementation, which ordered the data within each Parquet file (by sorting before the write). This brought similar records closer together in the file, which helped compaction and reduced Parquet file size. Now, switching to Delta 2, we started using Z-ordering instead of our own compaction, but we seem to have lost the ordering within the files themselves, resulting in larger Parquet files and a larger table overall.
Is there a way to keep data sorted within each file and still use Z-ordering, so we can get the best of both worlds?
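
To illustrate why ordering within a file can matter for size, here is a minimal sketch using only the Python standard library. It is not Parquet: zlib stands in as a rough proxy for Parquet's encodings and compression, and the record format and cardinality (10,000 rows, 50 distinct values) are made-up assumptions.

```python
import random
import zlib

random.seed(0)  # fixed seed so the comparison is repeatable

# Hypothetical records: 10,000 rows whose "sort column" takes 50 distinct values.
rows = [f"user_{random.randrange(50):04d}" for _ in range(10_000)]


def compressed_size(records):
    """Size in bytes after zlib-compressing the newline-joined records."""
    return len(zlib.compress("\n".join(records).encode("utf-8")))


unsorted_size = compressed_size(rows)         # rows in arrival order
sorted_size = compressed_size(sorted(rows))   # similar rows placed next to each other

print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

Sorting produces long runs of identical values, which general-purpose compressors and run-length style encodings handle well. How large the effect is on a real table depends on the data and on the Parquet writer settings.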

JosephK (exDatabricks)

01/17/2023, 12:44 PM
I tried looking around and found a talk. There is a sortWithinPartitions config that you can set to true, but I’m not certain that it’s used for Delta, or for Delta Z-ordering in particular.
The issue https://github.com/delta-io/delta/issues/447 is somewhat related, but I don’t know all the open issues.

Naama Gal-Or

01/17/2023, 5:59 PM
Thanks. Looking at the Z-ordering optimizer code, it seems to be doing `repartitionByRange`, so there is no sorting option, if I understand correctly: https://github.com/delta-io/delta/blob/master/core/src/main/scala/org/apache/spark/sql/delta/skipping/MultiDimClustering.scala#L50
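
The distinction between range partitioning and sorting can be sketched with a toy model in plain Python. This is not Delta's code: the values, the boundaries `[250, 500, 750]`, and the helper `range_partition` are hypothetical, and a single sort column is assumed. In Spark itself, `DataFrame.repartitionByRange` and `DataFrame.sortWithinPartitions` are the corresponding DataFrame methods; whether the optimizer can be made to apply the second one is not settled in this thread.

```python
import bisect
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical sort-column values arriving in arbitrary order.
values = [random.randrange(1000) for _ in range(20)]


def range_partition(rows, boundaries):
    """Assign each row to a bucket by range, keeping arrival order inside each bucket."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        buckets[bisect.bisect_right(boundaries, row)].append(row)
    return buckets


# Range partitioning alone: each "file" covers a disjoint value range,
# but nothing orders the rows inside it.
files = range_partition(values, boundaries=[250, 500, 750])

# The extra step a within-partition sort would add.
files_sorted = [sorted(f) for f in files]

for before, after in zip(files, files_sorted):
    print(before, "->", after)
```

Range partitioning is enough for file-level min/max statistics to be selective, which is what data skipping relies on. Row order inside each file is a separate property that only an explicit sort provides.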