
Ahmad Dorri

05/27/2023, 4:39 PM
Hello, I have a Delta table that is defined by a path and partitioned by time. Currently, when I use the Spark reader to count all the records in the table, it scans the file system and retrieves files in order to perform the count. However, I believe the information needed to determine the record count already exists in the checkpoint.parquet file. Is there a way to leverage the existing checkpoints and retrieve the count more efficiently, without scanning the file system or retrieving additional metadata (i.e., just use the computed snapshot)?

Nick Karpov

05/27/2023, 6:22 PM
What version/connector are you using? This has been supported in the Spark connector since 2.2 (https://github.com/delta-io/delta/releases/tag/v2.2.0):
Aggregate pushdown into Delta scan for SELECT COUNT(*). Aggregation queries such as SELECT COUNT(*) on Delta tables are satisfied using file-level row counts in Delta table metadata rather than counting rows in the underlying data files. This significantly reduces the query time as the query just needs to read the table metadata and could make full table count queries faster by 10-100x.
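For reference, a minimal sketch of the pattern being discussed, assuming Delta Lake >= 2.2 with the Delta extensions enabled; the table path is a placeholder, not from the thread:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# With Delta >= 2.2, this count should be answered from file-level row counts
# in the table metadata rather than by scanning the underlying data files.
# "/data/events" is a hypothetical path.
count = spark.read.format("delta").load("/data/events").count()
print(count)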

Matthew Powers

05/27/2023, 10:50 PM
Yep, @Scott Sandre (Delta Lake) and I wrote a blog post discussing this with some benchmarks: https://delta.io/blog/2023-04-19-faster-aggregations-metadata/
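One way to check whether the pushdown actually applied (a sketch; the exact plan shape can vary by Spark/Delta version) is to inspect the query plan for the aggregation: when the count is served from metadata, the optimized plan collapses to a LocalRelation instead of a file scan.

# "/data/events" is a hypothetical path. df.count() itself returns an int,
# so build the aggregation as a DataFrame in order to inspect its plan.
df = spark.read.format("delta").load("/data/events")
df.groupBy().count().explain(extended=True)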

Ahmad Dorri

05/28/2023, 10:26 AM
I'm currently working with Delta Lake version 2.0.2, but about 6 months ago I tested the dataframe.count() operation and noticed that it returned fast results.

JosephK (exDatabricks)

05/29/2023, 8:40 PM
So technically you still need to scan the Delta log to read the count values, so there is no avoiding the file system entirely.
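For illustration only (this is not the connector's actual code path): the per-file row counts Joseph mentions live in the stats JSON of the add actions in the log, so a checkpoint can be read directly with plain Parquet. This assumes the table was written with statistics enabled, uses a hypothetical path and checkpoint file name, and ignores any JSON commits written after the checkpoint, which would still need to be applied on top.

from pyspark.sql import functions as F

# Hypothetical checkpoint file under a hypothetical table path.
checkpoint = spark.read.parquet(
    "/data/events/_delta_log/00000000000000000010.checkpoint.parquet"
)

total = (
    checkpoint
    .where(F.col("add").isNotNull())  # keep only live file entries
    .select(
        F.get_json_object(F.col("add.stats"), "$.numRecords")
        .cast("long")
        .alias("numRecords")
    )
    .agg(F.sum("numRecords").alias("total"))
    .first()["total"]
)
print(total)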