Artsiom Yudovin

05/09/2023, 9:04 PM
Hi all, I've run into a case where I end up with one very large parquet file inside my Delta table after a merge operation. It looks like Delta writes one huge file instead of several smaller ones. Has anybody seen a similar case before? I use Apache Spark Structured Streaming with Delta Lake and merge each micro-batch into the Delta table.
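(For context, the merge-per-batch pattern being described is typically a foreachBatch upsert. A minimal sketch follows; the table path, join key, and rate source are illustrative assumptions, not details from the thread. The target table is assumed to already exist.)
```
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

TABLE_PATH = "/delta/events"  # assumed path; the table must already exist

def upsert_batch(batch_df, batch_id):
    # Upsert one micro-batch into the target Delta table.
    target = DeltaTable.forPath(spark, TABLE_PATH)
    (target.alias("t")
           .merge(batch_df.alias("s"), "t.id = s.id")  # assumed join key
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream
      .format("rate")                        # placeholder source for illustration
      .load()
      .selectExpr("value AS id", "timestamp")
      .writeStream
      .foreachBatch(upsert_batch)
      .option("checkpointLocation", "/delta/_checkpoints/events")
      .start())
```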

JosephK (exDatabricks)

05/09/2023, 9:05 PM
how large is the vast file?

Artsiom Yudovin

05/09/2023, 9:58 PM
It can be 1 GB. In general we have a list of files of ~64 MB each, but at some point we got one big file.

JosephK (exDatabricks)

05/09/2023, 11:03 PM
1 GB is a fine file size and not vast. You'll get 1 GB files when you do compaction/optimization. It's fine

Artsiom Yudovin

05/10/2023, 11:56 AM
When I run compaction, I calculate the number of partitions so that each file ends up around 64 MB or 128 MB. At some point before the next compaction, I get a single file of 1 GB. That looks like strange, unpredictable behavior to me, and I'm trying to understand why it happens
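(A minimal sketch of that manual compaction, assuming a path-based table and a 128 MB target, both assumptions. The rewrite itself follows the dataChange=false compaction pattern from the Delta documentation; the size estimate is a rough illustration and uses py4j internals.)
```
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

TABLE_PATH = "/delta/events"            # assumed path
TARGET_FILE_BYTES = 128 * 1024 * 1024   # 128 MB target

df = spark.read.format("delta").load(TABLE_PATH)

# Rough estimate of live data size: sum the sizes of the files the current
# snapshot reads (ignores _delta_log and tombstoned files). Uses py4j internals.
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
total_bytes = 0
for f in df.inputFiles():
    p = jvm.org.apache.hadoop.fs.Path(f)
    total_bytes += p.getFileSystem(conf).getFileStatus(p).getLen()

num_files = max(1, math.ceil(total_bytes / TARGET_FILE_BYTES))

# Rewrite the table into num_files files without changing its contents.
(df.repartition(num_files)
   .write
   .format("delta")
   .mode("overwrite")
   .option("dataChange", "false")  # marks the commit as a pure rewrite
   .save(TABLE_PATH))
```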

JosephK (exDatabricks)

05/10/2023, 12:24 PM
So the file size will vary based on table size. For the most part, don't worry about it, but 64 MB is definitely too small: https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size
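(Among other knobs, the linked page describes a delta.targetFileSize table property on Databricks runtimes; whether it is honored depends on the runtime/Delta version, so treat this as a hedged example rather than something available in every release.)
```
# Hedged example: delta.targetFileSize is documented for Databricks runtimes;
# older OSS Delta versions may ignore it. 134217728 bytes = 128 MB.
spark.sql("""
    ALTER TABLE delta.`/delta/events`
    SET TBLPROPERTIES ('delta.targetFileSize' = '134217728')
""")
```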

Artsiom Yudovin

05/10/2023, 2:53 PM
I use an old version of Delta Lake and don't have OPTIMIZE, so I run compaction myself and calculate the optimal number of partitions. Got you; I'm just trying to understand why it happens

JosephK (exDatabricks)

05/10/2023, 2:56 PM
There is no perfect number of partitions; it should match your queries. Each partition should be about 10-50 GB in size. Your optimal file size should match the table, but just using 1 GB is the easiest way to do it.

Artsiom Yudovin

05/10/2023, 3:02 PM
We may have a slight misunderstanding. My concern is that after some merges into my Delta table, I end up with a single parquet file under it. I don't run optimization or anything similar. That looks a bit strange to me, but if I understand you correctly, it's OK for a Delta table to create one parquet file and not split it up.

JosephK (exDatabricks)

05/10/2023, 3:47 PM
If you have 0-2 GB of data, then one file is fine

Artsiom Yudovin

05/11/2023, 10:33 AM
I read this data with Redshift Spectrum, and as far as I remember, the Redshift Spectrum recommendation is to use files smaller than 512 MB. Is it possible to limit the file size?

JosephK (exDatabricks)

05/11/2023, 11:45 AM
Spark has no way to limit the size, and Delta only has it when you use OPTIMIZE. Probably a good case for upgrading
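(One partial workaround, not mentioned in the thread: Spark's spark.sql.files.maxRecordsPerFile caps rows per output file rather than bytes, so with a known average row size it can approximate a byte limit. The row count below is an assumed example, not a recommendation.)
```
# Cap each written file at ~1M rows (assumed figure). This bounds file size
# only indirectly, via row count, and applies to Spark's file write path.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1_000_000)
```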

Artsiom Yudovin

05/11/2023, 4:15 PM
Yep, we're going to upgrade, but unfortunately we have no capacity right now, so we're trying to find a way to limit the file size or stop producing a single file. Everything works fine after we run compaction, but after some merges have been applied by Delta, we get one file.
@JosephK (exDatabricks), I think I've hit this issue: https://github.com/delta-io/delta/issues/500. Do you have any recommendations on how to handle it? I don't see any way to override repartitionIfNeeded.
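(For reference, the repartitionIfNeeded step mentioned here is driven by a Delta SQL conf in some releases; whether the flag exists depends on the Delta version in use, so this is an assumption to verify against your release.)
```
# Hedged sketch: in Delta releases that have it, this flag controls the
# repartition-before-write step behind repartitionIfNeeded. When disabled,
# merge output file counts follow the shuffle rather than the table partitions.
spark.conf.set(
    "spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "false")
# The number of output files then roughly tracks the shuffle partition count:
spark.conf.set("spark.sql.shuffle.partitions", 200)
```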

JosephK (exDatabricks)

05/17/2023, 12:24 PM
I am not much help here. I would say that repartition with a column name as an argument was a mistake to write in the first place, but that's not much help.