
Akshay kumar

01/08/2023, 7:40 PM
Hi folks, I can see that in Databricks, count(*)-style queries are pretty fast. Attached is the plan information. Although we have seen a significant improvement in Delta 2.2 on this front, it is still multifold slower than Databricks. Can someone shed light on the reason, and is there a plan to improve this in the community?

Yousry Mohamed

01/08/2023, 7:56 PM
The attached plan looks like a plan for querying the transaction-log JSON files. I cannot see how this relates to the question about count(*) performance.

JosephK (exDatabricks)

01/08/2023, 8:18 PM
COUNT(*) is a metadata operation, so it should take less than a second because it only requires reading the transaction log, which contains the row count.
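As an illustration of that metadata path (a minimal sketch under assumptions, not Delta's actual reader): each commit file in `_delta_log` records per-file statistics, and the `numRecords` values in the `add` actions can be summed without touching any Parquet data. The field layout follows the Delta transaction-log protocol; the helper name and demo directory are made up.

```python
import json
import os
import tempfile

def count_rows_from_log(log_dir):
    """Sum numRecords from the stats of every 'add' action in a Delta log.

    Simplified sketch: it ignores 'remove' actions, checkpoints, and
    multi-version history, all of which a real log reader must handle.
    """
    total = 0
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name)) as f:
            for line in f:  # each line in a commit file is one JSON action
                action = json.loads(line)
                if "add" in action:
                    stats = json.loads(action["add"]["stats"])
                    total += stats["numRecords"]
    return total

# Build a tiny fake _delta_log with two add actions for demonstration.
tmp = tempfile.mkdtemp()
commit = "\n".join([
    json.dumps({"add": {"path": "part-0.parquet",
                        "stats": json.dumps({"numRecords": 100})}}),
    json.dumps({"add": {"path": "part-1.parquet",
                        "stats": json.dumps({"numRecords": 42})}}),
])
with open(os.path.join(tmp, "00000000000000000000.json"), "w") as f:
    f.write(commit)

print(count_rows_from_log(tmp))  # 142
```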

Yousry Mohamed

01/08/2023, 8:39 PM
Makes sense. Then it could be useful to check the Spark Web UI SQL tab, as there should be metrics about how many files are read from cloud storage and how long that takes. Maybe there is a small-files problem.
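A quick way to spot a small-files problem outside the UI is to scan the table directory directly. This is a hedged local-filesystem sketch; on ADLS you would list the path through the storage SDK instead, and the helper name is made up.

```python
import os
import tempfile

def parquet_file_stats(table_dir):
    """Return (file_count, average_size_bytes) for Parquet files under a path."""
    sizes = [
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(table_dir)
        for name in names
        if name.endswith(".parquet")
    ]
    if not sizes:
        return 0, 0
    return len(sizes), sum(sizes) // len(sizes)

# Demo with throwaway files standing in for table data files.
tmp = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmp, f"part-{i}.parquet"), "wb") as f:
        f.write(b"x" * 1024)  # 1 KiB each: far too small for a real table

count, avg = parquet_file_stats(tmp)
print(count, avg)  # 3 1024
```

Many tiny files with a small average size would point at a compaction (OPTIMIZE) opportunity rather than a query-planning issue.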

Akshay kumar

01/09/2023, 4:31 AM
Hi @Yousry Mohamed, the point I am trying to make is: I am querying the same underlying Delta table from Databricks and from open-source Delta + Spark. On Databricks it took 3 seconds, but on OSS it took 1.1 minutes. The attached plan was from Databricks, which was very quick.
Now attaching the ones from OS delta:

Yousry Mohamed

01/09/2023, 10:22 AM
Could you share the SQL query or Python statement? It seems the OSS plan reads the whole table into memory and then counts the records for some reason. One of the screenshots shows reading ~200 GB of data and converting it from columnar to row format.

Akshay kumar

01/09/2023, 5:50 PM
@Yousry Mohamed The query, as I mentioned, is:
select count(*) from table

Yousry Mohamed

01/09/2023, 6:59 PM
Can you try the below: ``SELECT COUNT(*) FROM delta.`<storage-location>` `` I wonder how the table is registered on Databricks vs OSS. And what type of Spark + Delta OSS stack do you have (AWS EMR/HDInsight/etc.)?

Akshay kumar

01/10/2023, 5:10 AM
@Yousry Mohamed In both cases the results are the same. Anyway, we are using the Hive metastore to keep the table info. The OSS stack is Spark on k8s with Azure VMs and Azure Storage (ADLS Gen2).

Yousry Mohamed

01/10/2023, 9:54 AM
On k8s, can you try `DESC HISTORY <table>`? I want to confirm that the table is registered correctly as a Delta table in Hive; that's also why I proposed trying ``SELECT COUNT(*) FROM delta.`<storage-location>` ``. On a Databricks cluster, I tried to count records of a Delta table but read it as plain Parquet, and I got an execution plan similar to the one you shared for k8s:
spark.read.parquet("dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta").count()
Meaning the plan on k8s seems to be due to the table being registered as a Parquet table #random-guess
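One way to sanity-check that guess without a Spark session is to look for the `_delta_log` directory that every Delta table carries at its root (a minimal local-filesystem sketch with a made-up helper name; on ADLS you would list the path via the storage SDK instead). A table whose location lacks `_delta_log` can only ever be read as plain Parquet.

```python
import os
import tempfile

def looks_like_delta_table(path):
    """Every Delta table has a _delta_log directory at its root."""
    return os.path.isdir(os.path.join(path, "_delta_log"))

# Demo: a bare directory of data files vs. one with a transaction log.
plain = tempfile.mkdtemp()
delta = tempfile.mkdtemp()
os.makedirs(os.path.join(delta, "_delta_log"))

print(looks_like_delta_table(plain), looks_like_delta_table(delta))  # False True
```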

Akshay kumar

01/11/2023, 6:00 AM
Got it. The format is delta. I have run it with the path as well, with the same results.