
Lucas Zago

03/08/2023, 12:55 PM
Hi guys, I'm having some issues persisting a table in Delta; my cluster can't process the data to do it. Are there any best practices for this?
(df.write.format("delta")
    .mode("overwrite")
    .partitionBy("year_partition")
    .option("overwriteSchema", "true")
    .save(path))  # .save() takes a filesystem path; .saveAsTable() expects a table name instead
I'm looking to convert a Parquet table to Delta before persisting; I don't know if that's a viable option.
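For reference, a minimal sketch of that conversion with the delta-spark Python API (the path and partition schema below are assumptions, not from the thread):

from delta.tables import DeltaTable

# Convert an existing partitioned Parquet directory to Delta in place;
# the partition schema must be supplied for partitioned tables.
DeltaTable.convertToDelta(
    spark,
    "parquet.`/path/to/parquet_table`",
    "year_partition INT",
)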

JosephK (exDatabricks)

03/08/2023, 1:34 PM
How big are the cluster and the data? Even with more data than RAM, Spark will just queue things up.

Lucas Zago

03/08/2023, 1:50 PM
Is there a way to optimize a Parquet file like a Delta table?

JosephK (exDatabricks)

03/08/2023, 4:30 PM
No, that's one reason not to use Parquet. A good way to think of Delta is that it fixes all the mistakes we made with Parquet before. Parquet has schema problems, bad stats, and no inserts/deletes.
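For instance, Delta supports the in-place deletes and updates that a plain Parquet directory can't; a minimal sketch with the delta-spark Python API (the table path and predicates are hypothetical):

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/path/to/delta_table")    # hypothetical path
dt.delete("year_partition < 2020")                        # rewrites only the affected files
dt.update(condition="id = 42", set={"status": "'done'"})  # SQL expression strings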

Lucas Zago

03/08/2023, 5:53 PM
I thought that after converting the Parquet table I could manipulate the Delta table in place, but that's not the case. To return to the original format ("parquet"), can I only do deltaTable.vacuum()?

JosephK (exDatabricks)

03/08/2023, 6:04 PM
Yes, you'd have to vacuum with 0 hours of retention and delete the log, but that's not recommended.
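For reference, that not-recommended path looks roughly like this (a sketch assuming the open-source delta-spark confs; the table path is hypothetical):

from delta.tables import DeltaTable

# Allow a retention window shorter than the 7-day default safety check.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Remove every data file not referenced by the current table version.
DeltaTable.forPath(spark, "/path/to/delta_table").vacuum(0)

# Finally, delete the table's _delta_log directory with your filesystem
# tooling; what remains is a plain Parquet directory.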

Lucas Zago

03/09/2023, 1:42 PM
Hi guys, I made some improvements to my process: I write all intermediate tables in Delta and optimize them, and I enabled optimizeWrite and autoCompact, but I'm still having issues writing the final df. Is there room for further improvement?
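For context, "enable optimizeWrite and autoCompact" can be done roughly like this (conf and property names per the Delta/Databricks docs; the table name is hypothetical):

# Session-level, applies to subsequent Delta writes:
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Or per table, as table properties:
spark.sql("""
    ALTER TABLE intermediate_table SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")

# Plus explicit compaction of the intermediate tables:
spark.sql("OPTIMIZE intermediate_table")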

JosephK (exDatabricks)

03/09/2023, 1:44 PM
It depends on the size of the data, what the operation is, and what the error is. You'll run into a simple problem if you have a special character in the schema, or it could be a harder problem.