
Ran Razy

07/24/2023, 6:35 AM
Hi, we're using a Delta table on Azure Blob Storage that has a partition column named `ts`, which stands for a timestamp. When we try to get `max(ts)` for this table, Spark (3.3.2) seems to scan a lot of files, and the operation takes a long time. Is there any way to optimize this? Can we rely on some metadata, or configure the table in a way that helps Spark calculate this more efficiently? Thanks!
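A minimal sketch of the kind of metadata-only workaround the question hints at: reading partition values straight from the table's `_delta_log` JSON commits instead of scanning data files. The table path is assumed, and the sketch ignores checkpoints and `remove` actions, so it is illustrative rather than robust:

```python
# Sketch: compute max(ts) from Delta log metadata instead of data files.
# Assumes all commits are still plain JSON (no checkpoint yet) and that
# no files carrying the max value were later removed.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
table_path = "abfss://container@account.dfs.core.windows.net/tables/events"  # assumed

# Every `add` action in the log records partitionValues: a map from
# partition column name to that file's (string) partition value.
log = spark.read.json(f"{table_path}/_delta_log/*.json")
max_ts = (
    log.where(F.col("add").isNotNull())
       .select(F.col("add.partitionValues.ts").alias("ts"))
       .agg(F.max("ts"))  # string comparison; fine for ISO-8601 timestamps
       .first()[0]
)
print(max_ts)
```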

Tom van Bussel

07/24/2023, 7:52 AM
How many distinct timestamps do you have? Partitioning allows only a single value of a partition column per file, so if you have a lot of unique values in `ts`, you will end up with a very large number of files. For high-cardinality columns we recommend Z-ordering (and Liquid Clustering in the future) instead of partitioning.
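A short sketch of the Z-ordering recommendation, using the Delta Lake Python API (`optimize().executeZOrderBy`, available since Delta Lake 2.0). The table path is an assumption; note that `ts` is kept as a regular data column rather than a partition column, so min/max file statistics do the pruning:

```python
# Rewrite the table's files clustered by ts, so data skipping can prune
# files on ts predicates without ts having to be a partition column.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table_path = "abfss://container@account.dfs.core.windows.net/tables/events"  # assumed

DeltaTable.forPath(spark, table_path).optimize().executeZOrderBy("ts")
```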

Itai Yaffe

07/27/2023, 1:28 PM
I chatted with @Ran Razy:
1. @Felipe Pessoto's PR (here), once merged, should resolve the issue Ran raised.
2. As for the `ts` partition column, Ran mentioned it's truncated to a 1-hour interval, so I assume the cardinality is not very high.
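For context on point 2, a sketch of what hour-truncated partitioning can look like at write time; the source path and the `event_time` column name are assumptions. Truncating to the hour caps the partition cardinality at 24 values per day:

```python
# Derive the partition column by truncating the event time to the hour,
# bounding the number of distinct ts partition values.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
table_path = "abfss://container@account.dfs.core.windows.net/tables/events"  # assumed

events = spark.read.format("delta").load("/tmp/raw_events")  # assumed source
(events
  .withColumn("ts", F.date_trunc("hour", F.col("event_time")))
  .write.format("delta")
  .mode("append")
  .partitionBy("ts")
  .save(table_path))
```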