https://delta.io logo
k

kamal kaur

05/05/2023, 7:48 PM
Hi all - Delta partitioning question - We are using open-source Delta on S3. Is it okay to partition delta table by an
id
with around 500k unique values. We are using merge operation in that table using
id
and we will be querying using Trino by that
id
. I know those are too many partitions but if the use case is just to query/merge that table by
id
, will this work?
c

chris fish

05/05/2023, 7:54 PM
you can partition by anything as long as you have a decent amount of data in each partition. read performance suffers as you have more and more objects to read
partitioning by
id
is usually never a good idea
t

Theo LEBRUN

05/05/2023, 8:02 PM
def don’t partition by
id
if you have 500k unique values! Everything will work but you gonna have perf issues
k

kamal kaur

05/05/2023, 8:07 PM
Thanks Chris and Theo. I agree with you both but if the use case is to update 5-10% of data every hour and ability to query this table by the id, will there be performance issues?
t

Theo LEBRUN

05/05/2023, 8:12 PM
I guess if you always query by
id
that’s “fine” but querying without using
id
will be very bad… You don’t have access to a
date
column that you can partition on? Also if it’s only 500k total records, I would not partition the data.
👍 1
c

chris fish

05/05/2023, 8:44 PM
a much better idea would be to partition by buckets of IDs
range partitioning
👍 1
t

Theo LEBRUN

05/05/2023, 8:46 PM
yes definitely!
👍 1
c

chris fish

05/05/2023, 9:02 PM

https://youtu.be/k8ERCB5ThrI?t=535

i give some high level steps for range partitioning in this talk
👍 1
k

kamal kaur

05/05/2023, 9:14 PM
Thanks Chris. Really appreciate it. I was also thinking about hash/range partition but just to make it easy to query from Trino , we were thinking of partitioning by id.
c

chris fish

05/05/2023, 9:24 PM
can you extend trino’s sql engine? then you could handle the query translation in the engine itself
otherwise yeah, you have to rely on users to query the right column
or use generated columns - those didn’t exist when i did this originally
👍 2
d

Dominique Brezinski

05/07/2023, 1:42 PM
Everything mentioned is spot on. One thing that wasn’t mentioned is high cardinality partitioning tends to impact write performance quite a bit. It can be really bad if there is a distribution skew in the partition values as well. You have to weigh the benefits. If your reads are always very selective wrt partition value, and read performance is much more important than write performance, then high cardinality partitioning might be OK. In general using a generated column that applies a range partition and a native delta lake reader that supports them will be the right overall balance.
👍 1
m

Madhumita Bharde

05/10/2023, 3:52 PM
thanks @Chris @dohar silalahi , yet to go through the video
high cardinality partitioning tends to impact write performance quite a bit.
I may be wrong- but if the writes are exclusively in overwrite/merge mode, wouldn’t they actually benefit from partitioning of output table by id that we merge on ?
c

chris fish

05/10/2023, 6:50 PM
with Deletion Vectors, the benefit is much smaller than it used to be, and even then it still depends on the overall distribution of your data. you want each Partition to have a decent amount of data in it - if you partition by ID, what do you expect the average size of a partition will be? 1MB? 1GB? my general rule of thumb is partitions should be at least 1GB bare minimum. you still have dataskipping by IDs as well. this is where you can definitely experiment and there probably is a global optimum to be found between partition structure, file size, range partitions, etc.
t

Theo LEBRUN

05/10/2023, 6:57 PM
If you write/read using only ID then maybe a key-value DB is more adapted 😆
m

Madhumita Bharde

05/10/2023, 7:13 PM
my general rule of thumb is partitions should be at least 1GB bare minimum. you still have dataskipping by IDs as well.
hmm
If you write/read using only ID then maybe a key-value DB is more adapted
🙂
d

Dominique Brezinski

05/10/2023, 7:17 PM
Truth is if you are only looking up by id and a small number at a time, a KV store is much better suited.
m

Madhumita Bharde

05/10/2023, 7:22 PM
thought we would test predicate pushdown and partition pruning to their absolute limits haha
2 Views