Ran Razy

07/24/2023, 6:35 AM
Hi, we're using a Delta table on Azure Blob Storage that has a partition column representing a timestamp. When we try to get the min/max values of this column for the table, it seems that Spark (3.3.2) scans a lot of files and the operation takes a long time. Is there any way to optimize this? Can we rely on some metadata, or configure the table in any way, to help Spark compute this more efficiently? Thanks!
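A rough sketch of one common workaround for this kind of slowness: read the partition values straight out of the Delta transaction log instead of scanning data files. The table path and the partition column name `event_hour` below are assumptions, not from the thread:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Each JSON commit file in _delta_log contains "add" actions whose
# partitionValues map records the partition column value for each data file.
log = spark.read.json(
    "abfss://container@account.dfs.core.windows.net/events/_delta_log/*.json"
)

(log.where(F.col("add").isNotNull())
    .select(F.col("add.partitionValues.event_hour").alias("event_hour"))
    .agg(F.min("event_hour"), F.max("event_hour"))
    .show())

# Caveat: this naive version ignores "remove" actions and checkpoint files,
# so it can report a partition whose files have since been deleted. It is a
# sketch of the idea, not a production-ready query.
```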

Tom van Bussel

07/24/2023, 7:52 AM
How many distinct timestamps do you have? Partitioning only allows a single value of the partition column per file, so if you have a lot of unique values in that column, this will result in a very large number of files. For high-cardinality columns we recommend z-ordering (and Liquid Clustering in the future) instead of partitioning.
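A minimal sketch of the z-ordering suggestion, assuming delta-spark 2.0+ with an active `spark` session, and a hypothetical timestamp column `event_time` on an unpartitioned table:

```python
from delta.tables import DeltaTable

table = DeltaTable.forPath(
    spark, "abfss://container@account.dfs.core.windows.net/events"
)

# Rewrites files so that rows with nearby event_time values are co-located,
# letting Spark skip files via per-file min/max statistics instead of
# relying on one-value-per-file partitioning.
table.optimize().executeZOrderBy("event_time")
```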

Itai Yaffe

07/27/2023, 1:28 PM
I chatted with @Ran Razy:
1. @Felipe Pessoto's PR (here), once merged, should resolve the issue Ran raised.
2. As for the partition column, Ran mentioned it's truncated to 1-hour intervals, so I assume the cardinality is not very high.
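For illustration, a hypothetical write path showing the 1-hour truncation described above: the partition column holds the event timestamp truncated to the hour, so cardinality grows at roughly 24 partitions per day of data rather than one per event. Column and path names are assumptions:

```python
from pyspark.sql import functions as F

# date_trunc collapses every timestamp within an hour to the same partition
# value, which keeps the partition column's cardinality bounded.
(df.withColumn("event_hour", F.date_trunc("hour", F.col("event_time")))
   .write.format("delta")
   .mode("append")
   .partitionBy("event_hour")
   .save("abfss://container@account.dfs.core.windows.net/events"))
```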