
Sasidharan Sangeeth

09/11/2023, 8:49 AM
I have a question regarding the maximum file size created by the Z-Order OPTIMIZE command. I have a partitioned Delta Lake table that I populate from streaming data, using a MERGE inside a foreachBatch operation. I run the Z-Order optimization on every 100th batch, and I've noticed that each time it runs it creates a very large file. The current large file is 6.3 GB, and it keeps growing. I didn't set the 'spark.databricks.delta.optimize.maxFileSize' property, so it should be the default 1 GB, right? Do you have any insight into why it's creating such large files? I'm using Spark 3.4.1 and Delta Lake 2.4.0, and I'm not seeing this behaviour in non-partitioned tables, where the maximum file size stays at 1 GB.
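(For context, a minimal sketch of the setup described above, assuming a hypothetical table path /mnt/delta/events, merge key id, and partition column event_date; none of these names are from the thread.)
```
# Streaming MERGE via foreachBatch, with Z-Order compaction on every 100th micro-batch.
from delta.tables import DeltaTable

def upsert_batch(micro_df, batch_id):
    target = DeltaTable.forPath(spark, "/mnt/delta/events")
    (target.alias("t")
        .merge(micro_df.alias("s"), "t.id = s.id AND t.event_date = s.event_date")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
    # Periodic Z-Order optimization, as described in the question.
    if batch_id > 0 and batch_id % 100 == 0:
        spark.sql("OPTIMIZE delta.`/mnt/delta/events` ZORDER BY (id)")

(stream_df.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/mnt/delta/_checkpoints/events")
    .start())
```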

Sukumar Nataraj

09/11/2023, 10:03 AM
We have faced a similar problem; it is caused by the repartitionBeforeWrite property. https://delta-users.slack.com/archives/CJ70UCSHM/p1693990567464089 I think this is the right time for you to look at your partitioning: if you already have a partition column, check whether secondary partitions are possible. Reducing this file size will also improve your MERGE SQL performance.
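(A sketch of inspecting the configs mentioned in this thread; the full key for the repartition-before-write behaviour is shown below, but verify the defaults for your Delta version.)
```
# Current value of the MERGE repartition-before-write flag (usually true by default).
spark.conf.get("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")

# Target file size used by OPTIMIZE (defaults to roughly 1 GB when unset).
spark.conf.get("spark.databricks.delta.optimize.maxFileSize", str(1024 * 1024 * 1024))

# Disabling repartition-before-write; trade-off: MERGE may then write many small
# files, which rely on a later OPTIMIZE for compaction.
spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "false")
```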

Sangeeth Sasidharan

09/11/2023, 10:23 AM
@Sukumar Nataraj thanks for the response. I already have a partition on this column. So if I disable the repartitionBeforeWrite property and run OPTIMIZE ZORDER at a regular interval, can this issue be avoided without a secondary partition?
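(A sketch of the approach being asked about: turning the flag off and Z-Ordering on a schedule, optionally scoped to recent partitions. The path, partition column, and date literal are placeholders.)
```
# Disable repartition-before-write for MERGE.
spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "false")

# Periodically compact and Z-Order only the recently written partitions.
spark.sql("""
  OPTIMIZE delta.`/mnt/delta/events`
  WHERE event_date >= '2023-09-10'
  ZORDER BY (id)
""")
```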

Sukumar Nataraj

09/11/2023, 12:32 PM
Yes. Just remember that if you are ingesting every 15 minutes and only perform Z-Ordering on every 100th batch, it doesn't do much for file pruning: the pruning benefit applies only to queries right after the optimize (the 101st batch), but from batch 102 onwards the newly written files behave like non-Z-ordered data.
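(One way to sanity-check this: look at the table history to see how many MERGE commits have landed since the last OPTIMIZE. A sketch, with a hypothetical table path.)
```
from delta.tables import DeltaTable

# Recent OPTIMIZE and MERGE commits, newest first, with their operation metrics.
history = DeltaTable.forPath(spark, "/mnt/delta/events").history()
(history
    .select("version", "timestamp", "operation", "operationMetrics")
    .filter("operation IN ('OPTIMIZE', 'MERGE')")
    .orderBy("version", ascending=False)
    .show(20, truncate=False))
```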