
Tarun Sharma

03/14/2023, 12:26 PM
I was benchmarking open table formats against some parameters. I created two tables, one in Iceberg format (parquet) and another in Delta Lake, using Spark 3.3 (EMR 6.9). In the initial data load (insert) operation, the Iceberg output data volume was 102 GB against an input data volume of 161 GB, while the Delta Lake output data volume was 228 GB against the same 161 GB of input. Does anyone know the reason?
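
For context, a minimal sketch of what this kind of initial load might look like on Spark 3.3; the paths, database, and catalog names are placeholders, not the actual benchmark setup, and it assumes an Iceberg catalog is already configured:

    // Read the ~161 GB of snappy-compressed parquet input
    val input = spark.read.parquet("s3://bucket/source-data/")

    // Delta Lake: classic DataFrameWriter
    input.write.format("delta").save("s3://bucket/benchmark/delta-table/")

    // Iceberg: DataFrameWriterV2 into a configured Iceberg catalog
    input.writeTo("iceberg_catalog.db.benchmark_table").using("iceberg").create()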

JosephK (exDatabricks)

03/14/2023, 12:28 PM
What is the compression format for both?
Delta also uses parquet underneath so there shouldn’t be a drastic difference in table sizes

Tarun Sharma

03/14/2023, 12:33 PM
The input data is in snappy, the Iceberg dataset is in gzip, and the Delta Lake dataset is in snappy.

JosephK (exDatabricks)

03/14/2023, 12:34 PM
Well, there is your answer

Tarun Sharma

03/14/2023, 12:36 PM
But in Delta Lake, is it supposed to explode from 161 GB to 228 GB when both input and output use the same compression format?
I even ran vacuum queries on my Delta table. The table is still the same size.

JosephK (exDatabricks)

03/14/2023, 12:37 PM
Depends on the input file format
Compression can also depend on how the data is sorted, the data types, and the cardinality of the columns.
If the table has just been created, vacuum won’t do anything.

Tarun Sharma

03/14/2023, 12:40 PM
Exactly, the vacuum did nothing. Okay, I’ll give it a try and compare it with Iceberg using the same compression format.

JosephK (exDatabricks)

03/14/2023, 12:42 PM
Vacuum deletes old files that are no longer referenced in the table history. Your table’s history has only a single version (the initial write), so there is nothing to clean up.
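
For reference, a minimal sketch of vacuuming a Delta table; the path is a placeholder, and on a table with only one version there are no unreferenced files, so the call removes nothing:

    import io.delta.tables.DeltaTable

    // Load the Delta table by path (placeholder path)
    val deltaTable = DeltaTable.forPath(spark, "s3://bucket/benchmark/delta-table/")

    // Default retention: only files unreferenced for more than 7 days are eligible
    deltaTable.vacuum()

    // Equivalent SQL:
    // spark.sql("VACUUM delta.`s3://bucket/benchmark/delta-table/`")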

Tarun Sharma

03/14/2023, 12:42 PM
Although the two tables had a significant difference in data volume, queries against them were taking almost the same time to execute.

JosephK (exDatabricks)

03/14/2023, 12:44 PM
That’s good. Delta should outperform Iceberg; there are many third-party benchmarks that show this. https://databeans-blogs.medium.com/delta-vs-iceberg-vs-hudi-reassessing-performance-cb8157005eb0. Don’t believe anything by us, Tabular, or Onehouse.

Tarun Sharma

03/14/2023, 4:57 PM
I configured the compression to snappy for the Iceberg table. The output data volume is 184 GB against the input data volume of 161 GB. It is still better than Delta Lake (228 GB).
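
A sketch of how the Iceberg codec might be switched to snappy (table name is a placeholder); Iceberg reads the codec from the table property write.parquet.compression-codec, while Spark's own parquet writer, which Delta uses, already defaults to snappy via spark.sql.parquet.compression.codec:

    // Change the codec on an existing Iceberg table
    spark.sql(
      """ALTER TABLE iceberg_catalog.db.benchmark_table
         SET TBLPROPERTIES ('write.parquet.compression-codec' = 'snappy')""")

    // Or set it at create time with the V2 writer:
    // input.writeTo("iceberg_catalog.db.benchmark_table")
    //   .using("iceberg")
    //   .tableProperty("write.parquet.compression-codec", "snappy")
    //   .create()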

Dominique Brezinski

03/14/2023, 11:24 PM
This topic seems familiar...

Tarun Sharma

03/15/2023, 6:24 AM
The initial data load query execution time is also different: in Iceberg it took ~28 mins, while in Delta Lake it took ~55 mins.

JosephK (exDatabricks)

03/15/2023, 11:03 AM
What was the initial file format?

Tarun Sharma

03/15/2023, 11:04 AM
Parquet, snappy

JosephK (exDatabricks)

03/15/2023, 11:06 AM
So the sizes shouldn’t change that much if you’re reading snappy parquet and writing snappy parquet. All you’re adding is some transaction logs, which at most amount to tens of MBs.
Iceberg isn’t going to reduce your data size by 40%, and neither is Delta.

Tarun Sharma

03/15/2023, 11:08 AM
Yeah, so as I mentioned earlier, I was using the default Iceberg compression for parquet (i.e. gzip). I later changed it to snappy, and it is still better than Delta Lake, although the overall size increased for both the Iceberg and Delta tables.

JosephK (exDatabricks)

03/15/2023, 11:10 AM
Is the operation just spark.read.parquet(path).write.format(delta/iceberg).save(path2)?

Tarun Sharma

03/15/2023, 11:12 AM
For Iceberg I used the DataFrameWriter V2 APIs, so it was df.format(“iceberg”).writeTo(table). I don’t think this makes any difference.
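
For reference, a minimal sketch of the DataFrameWriterV2 call being described (table name is a placeholder); in Spark 3.3 the V2 path is reached through Dataset.writeTo(...).using(...) rather than through .format(...):

    // V2 writer: create the Iceberg table from a DataFrame on the initial load
    df.writeTo("iceberg_catalog.db.benchmark_table")
      .using("iceberg")
      .create()          // use .append() for subsequent inserts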

JosephK (exDatabricks)

03/15/2023, 11:16 AM
The V2 writer might affect things. Not sure what writer Delta uses on EMR.
It would affect speed, but not file size.