Caoduy Vonguyen

05/19/2023, 7:27 AM
(two screenshots attached: image.png, image.png)

Yousry Mohamed

05/19/2023, 7:55 AM
The screenshots are not equivalent. Are you talking about the Spark Scala vs Python difference? The left screen's description is a count while the other is a showString. Actually,
df = spark.read.format("delta").load("path").count()
means df holds the count, not the DataFrame itself. Also, the number of tasks in both screens is really massive. Can you check your delta folder and see how many files are there and how big they are? It could be a tiny-file problem.
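A minimal PySpark sketch of that difference, with a placeholder path and variable names rather than anything taken from the screenshots:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# count() is an action: it returns a Python int, not a DataFrame
row_count = spark.read.format("delta").load("/path/to/table").count()

# keep a reference to the DataFrame if it needs to be reused afterwards
df = spark.read.format("delta").load("/path/to/table")
print(df.count())  # appears as a "count" job in the Spark UI
df.show()          # a separate action, described as "showString" in the UI
```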

Caoduy Vonguyen

05/19/2023, 8:11 AM
I have 12k files, and they are not too small: from 50MB to 512MB each. I have already run OPTIMIZE table ZORDER BY (column).
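For anyone following along, the compaction run mentioned above would look roughly like this; the table path and Z-order column are placeholders, and the syntax needs a Delta Lake version that supports OPTIMIZE:
```python
# sketch only: path and column are placeholders, not from this thread
spark.sql("OPTIMIZE delta.`/path/to/table` ZORDER BY (some_column)")

# plain compaction without Z-ordering is also an option
spark.sql("OPTIMIZE delta.`/path/to/table`")
```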

Yousry Mohamed

05/19/2023, 8:21 AM
A count operation should only scan the transaction log, since the record count per parquet file is stored as part of the metadata in the transaction log, and the transaction log data volume should be really minimal.
If we assume the average parquet file size in your delta folder is 250MB, that means your table size is around 3TB (assuming there are no tombstone files).
To get more details about each query, there is a line called Associated SQL Query on the job details page with a hyperlink. Click the hyperlink and it will take you to a page with more details, like how many files were scanned, the total number of cloud API calls, whether there is a shuffle or not, etc.
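A quick way to cross-check what the transaction log reports for file count and table size is DESCRIBE DETAIL; a sketch with a placeholder path:
```python
# sketch: the path is a placeholder, run against a Delta-enabled Spark session
detail = spark.sql("DESCRIBE DETAIL delta.`/path/to/table`")
detail.select("numFiles", "sizeInBytes").show()
```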

Caoduy Vonguyen

05/19/2023, 8:51 AM
with Scala
with PySpark
with the same resources and the same function, count()

Yousry Mohamed

05/19/2023, 9:17 AM
Performance-wise, they are not too far from each other. It still feels like the whole table is read, not just the metadata. What environment do you have? Can you show the table history with something like:
spark.sql(f"DESC HISTORY delta.`{<your path variable or hard code it here>}`").show()
I mis-read the durations: one is 14 sec and the other is 1 min.
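The same history can also be pulled through the DeltaTable API if the SQL string is awkward to build; a sketch assuming the delta-spark package is on the cluster and with a placeholder path:
```python
from delta.tables import DeltaTable

# sketch: the path is a placeholder
dt = DeltaTable.forPath(spark, "/path/to/table")
dt.history().select("version", "operation", "operationMetrics").show(truncate=False)
```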

Caoduy Vonguyen

05/19/2023, 9:42 AM
desc history