Randy Sims

04/11/2023, 3:15 PM
With partitioning, does a query have to match the partition value exactly? If I have a query using BETWEEN date1 AND date2, where both are full timestamps with the time part set to 00:00:00.000000, will it still use a partition like ingeat_date=2023-01-01 12:56:20.246314?
A link to a good document on this would be great.

Jim Hibbard

04/11/2023, 9:43 PM
In that case you wouldn't partition on the datetime column itself, but would create an additional `date` column with lower precision to partition on. It won't take much additional storage because Parquet uses run-length encoding to reduce the cost of storing repeated values.
I have a short post on optimizing query performance that might be helpful, it links to the docs for each method too which should be a good additional resource. Hope that helps!
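To make the lower-precision column idea concrete, here is a minimal sketch in plain Python. It is not Spark or Delta Lake API; `partition_dates_for_range` is a hypothetical helper, and the timestamps are just the ones from the question. It shows that a row's partition value is its timestamp truncated to a date, and that a timestamp range maps onto a small set of date partitions.

```python
from datetime import date, datetime, timedelta


def partition_dates_for_range(start: datetime, end: datetime) -> list[date]:
    """Return every calendar date touched by the inclusive range [start, end]."""
    if end < start:
        return []
    days = (end.date() - start.date()).days
    return [start.date() + timedelta(days=i) for i in range(days + 1)]


# A row's partition value is its full timestamp truncated to a date.
row_ts = datetime(2023, 1, 1, 12, 56, 20, 246314)
row_partition = row_ts.date()

# A BETWEEN query on full timestamps covers a handful of date partitions.
query_start = datetime(2023, 1, 1, 0, 0, 0)
query_end = datetime(2023, 1, 3, 0, 0, 0)
needed = partition_dates_for_range(query_start, query_end)

print(needed)
print(row_partition in needed)
```

Whether an engine derives the date filter from the timestamp filter on its own depends on the engine and table setup, so in practice it is safest to also filter on the date column explicitly.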

Randy Sims

04/12/2023, 12:38 AM
Thanks @Jim Hibbard, I saw that earlier. I'm asking because this is what someone else wants to do, and I've told them that a query like that isn't going to use the partition, but it will still use the statistics stored in the Delta log. I just want to make sure that's accurate.

Jim Hibbard

04/12/2023, 4:28 AM
Ah, gotcha! Yes, as long as the version of Delta Lake they're using supports file statistics for the column types in their query, their engine should use information like `min` and `max` values to prune files that can't match their query 👍
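The min/max pruning described above can be sketched in plain Python. This is an illustration of the idea only, not how any particular engine implements it; `FileStats`, `files_to_scan`, and the file names are hypothetical. A file is kept only if its [min, max] range overlaps the query range.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileStats:
    """Per-file column statistics, as a transaction log might record them."""
    path: str
    min_ts: datetime
    max_ts: datetime


def files_to_scan(files, start, end):
    """Keep only files whose [min_ts, max_ts] overlaps the inclusive range [start, end]."""
    return [f for f in files if f.max_ts >= start and f.min_ts <= end]


# Hypothetical files and statistics, for illustration only.
files = [
    FileStats("part-0.parquet", datetime(2022, 12, 30), datetime(2022, 12, 31, 23, 59)),
    FileStats("part-1.parquet", datetime(2023, 1, 1, 8), datetime(2023, 1, 1, 18)),
    FileStats("part-2.parquet", datetime(2023, 1, 2, 22), datetime(2023, 1, 4, 3)),
    FileStats("part-3.parquet", datetime(2023, 1, 5), datetime(2023, 1, 6)),
]

kept = files_to_scan(files, datetime(2023, 1, 1), datetime(2023, 1, 3))
print([f.path for f in kept])
```

Note that this works on any column with statistics, partitioned or not, but it prunes well only when each file covers a narrow range of values.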

Randy Sims

04/12/2023, 11:51 AM
@Jim Hibbard right, prune files but not use the partitions. I read somewhere that Delta Lake will probably work better without partitioning these days unless you're an expert, and that Delta and Spark will likely provide better performance without it. Is that true?

Jim Hibbard

04/12/2023, 2:36 PM
I wouldn't go that far. I think you can definitely use partitioning to great effect without being an expert by keeping those "rules of thumb" in mind, e.g. each partition should be > 1 GB and your table should be > 1 TB. Your data access pattern should also be aligned with the column you're partitioning on.
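As a rough way to apply those two thresholds, here is a small sketch in plain Python. `partitioning_check` is a hypothetical helper, it assumes binary units (1 GB = 1024^3 bytes), and it uses the average partition size as a stand-in for "each partition", which hides skew.

```python
GB = 1024 ** 3
TB = 1024 ** 4


def partitioning_check(table_bytes: int, partition_count: int) -> dict:
    """Compare a table against the > 1 TB table and > 1 GB per partition rules of thumb."""
    avg_partition_bytes = table_bytes / partition_count
    return {
        "table_ok": table_bytes > TB,
        "partition_ok": avg_partition_bytes > GB,
        "avg_partition_gb": avg_partition_bytes / GB,
    }


# Hypothetical tables, each with one partition per day for a year.
large = partitioning_check(2 * TB, 365)
small = partitioning_check(200 * GB, 365)

print(large)
print(small)
```

With these made-up numbers, the 2 TB table passes both checks, while the 200 GB table fails both and would likely be better left unpartitioned.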

Randy Sims

04/12/2023, 3:23 PM
@Jim Hibbard okay thanks! Just once more, I want to try to nail this down. Will partition pruning specifically work if the where clause doesn't match the partition value exactly?