02/23/2023, 7:12 AM
Hi everyone, I'm looking for help with a specific requirement. We need to process parquet files and save them to a Delta table. Each file always belongs to a particular partition, so we process the files concurrently. But writing to the Delta table becomes slow with many concurrent writes, so we want to batch the files and write them together. I tried a union of all the DataFrames before writing to the Delta table, but this doesn't seem optimal: it takes longer than writing each file individually to a plain parquet table. Writing 25 files to the Delta table with the union approach took around 125 s, whereas parallel writes to the parquet table took around 97 s. In the same test with 50 files, Delta took 234 s versus 171 s for plain parquet. Is there a better way to write multiple DataFrames together in one transaction/commit?
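For reference, one common alternative to unioning every DataFrame is to group the incoming files into fixed-size batches and issue one Delta commit per batch. This is a minimal sketch, not the poster's actual code: the batching helper is plain Python, and the Spark/Delta calls (the `spark` session, the table path, the batch size of 25) are illustrative assumptions shown in comments.

```python
from itertools import islice

def batched(paths, batch_size):
    """Group file paths into fixed-size batches so each Delta commit
    covers several files instead of one commit per file."""
    it = iter(paths)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical usage with Spark (assumes a SparkSession `spark` with the
# Delta Lake package configured; names and paths are illustrative):
#
# for batch in batched(parquet_files, 25):
#     df = spark.read.parquet(*batch)          # one read spanning the batch
#     (df.write.format("delta")
#        .mode("append")
#        .save("/path/to/delta_table"))        # one commit per batch
```

Reading a whole batch with a single `spark.read.parquet(*batch)` avoids building a deep chain of `union` operations, while still amortizing the per-commit overhead of the Delta transaction log across many files.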

JosephK (exDatabricks)

02/23/2023, 12:16 PM
So much of this depends on your cluster size; you're going to be limited by your number of cores. Delta is going to take longer than plain parquet because it's creating a transaction log. Writing the table is a one-time cost, so don't worry about it: you'll get a much bigger benefit on every subsequent read.