
Adishesh Kishore

03/10/2023, 9:23 AM
Hi folks, I was wondering if there is a way to collect column statistics from Delta tables using Spark? I am specifically interested in distinct-value counts. My current Presto cluster depends on those stats for optimizations.

JosephK (exDatabricks)

03/10/2023, 11:44 AM
Delta is just files on disk, so you can read those files with Spark and compute statistics yourself. There is a count_distinct() function in Spark SQL. The Delta log doesn't store distinct-value stats because those don't help with file skipping.
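
A minimal sketch of what that could look like in PySpark (the table path is a hypothetical placeholder; count_distinct was added in Spark 3.2, older versions use countDistinct):

from pyspark.sql import SparkSession
from pyspark.sql.functions import count_distinct

spark = SparkSession.builder.getOrCreate()

# Reading through the Delta format resolves the transaction log,
# so only the table's active files are scanned.
df = spark.read.format("delta").load("/path/to/delta/table")

# One count_distinct aggregate per column, computed in a single pass.
distinct_counts = df.agg(
    *[count_distinct(c).alias(f"{c}_distinct") for c in df.columns]
)
distinct_counts.show()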

Adishesh Kishore

03/13/2023, 11:17 AM
@JosephK (exDatabricks) is there a specific connector for Spark that will let me read only the active files while computing stats with the ANALYZE TABLE command? I need these stats populated in my metastore for the query optimizer.

JosephK (exDatabricks)

03/13/2023, 11:45 AM
You can use Spark to read a table without needing any particular connector:

spark.read.format("delta").load(path)

and

df.inputFiles()

will return the active files of the table.
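
Putting the two snippets together with the ANALYZE TABLE route, as a minimal sketch; the table name my_db.events is a hypothetical example, and ANALYZE TABLE support for Delta tables depends on the Delta Lake and Spark versions in use:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("delta").load("/path/to/delta/table")  # hypothetical path
print(df.inputFiles())  # lists only the active data files backing the table

# If the table is registered in the metastore, write column-level stats
# (including distinct counts) there for the query optimizer to use.
spark.sql("ANALYZE TABLE my_db.events COMPUTE STATISTICS FOR ALL COLUMNS")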