Martin

01/19/2023, 5:55 PM
Hi @Matthew Powers, thanks for the great blog post on partitions: https://delta.io/blog/2023-01-18-add-remove-partition-delta-lake/ There is this code snippet in the post:
df.repartition(F.col("country")).write.partitionBy("country").format(
    "delta"
).saveAsTable("country_people")
I was wondering: is there a benefit of performing an explicit repartition on the partition column before writing into a partitioned table? Delta does the repartition implicitly anyhow, doesn't it?
👍 1

Matthew Powers

01/19/2023, 6:11 PM
Let me do some digging and report back.

Nick Karpov

01/19/2023, 8:10 PM
In that code block, the benefit of the repartition on the DataFrame prior to writing is to ensure that all rows of any given country are colocated in the same in-memory partition. partitionBy doesn't shuffle rows, so without the repartition call you are likely to have multiple tasks writing multiple files to a single table partition.
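To make the file-count argument concrete, here is a toy model in plain Python. It is not Spark code: the helper names `count_files` and `repartition_by_key` are made up for illustration, and the sketch assumes each task writes one file per distinct country it holds, which is a simplification of what a partitioned write does.

```python
from collections import defaultdict


def count_files(task_rows):
    """Count files per country, assuming one file per (task, country) pair."""
    files = defaultdict(int)
    for rows in task_rows:
        for country in set(rows):
            files[country] += 1
    return dict(files)


def repartition_by_key(task_rows, num_tasks):
    """Toy stand-in for repartition(col): equal keys land in the same task."""
    shuffled = [[] for _ in range(num_tasks)]
    for rows in task_rows:
        for country in rows:
            shuffled[hash(country) % num_tasks].append(country)
    return shuffled


# Three tasks, each holding a mix of countries (hypothetical sample data).
tasks = [["US", "DE", "US"], ["DE", "FR"], ["US", "FR", "DE"]]

before = count_files(tasks)
after = count_files(repartition_by_key(tasks, num_tasks=3))

print("files per country without repartition:", before)
print("files per country with repartition:   ", after)
```

Without the shuffle, DE ends up in three files and US and FR in two each; after grouping rows by key, every country is written by exactly one task and so gets one file.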

Matthew Powers

01/19/2023, 8:25 PM
Yea, I think that repartition(F.col("country")) was mainly just to output a single file per partition to make the example cleaner. But I should probably note that in the blog.

Martin

01/19/2023, 8:28 PM
thanks guys!

Gerhard Brueckl

01/20/2023, 8:15 AM
Due to the shuffling of repartition() I guess it can be slightly slower, but it ensures that you write the minimum number of files. Would be interesting if optimizeWrite would consolidate the files in a similar way as an explicit repartition 🤔

Nick Karpov

01/20/2023, 4:00 PM
that's exactly what it does 😄 will be exciting to see the PR merged

Gerhard Brueckl

01/20/2023, 5:21 PM
So is doing optimizeWrite just doing a repartition in the background?

Nick Karpov

01/20/2023, 6:09 PM
simply put yes, but a little fancier https://github.com/delta-io/delta/pull/1198

Gerhard Brueckl

01/20/2023, 6:17 PM
Nice!