https://delta.io logo
a

Alberto Rguez

08/29/2023, 3:20 PM
I have probably a stupid question but maybe someone can quickly help me. I am running a simple query select max(column) from a table. They column is within the first 32 columns but it does need to scan a table. It is not supposed to get max and min values from metadata and that is why stats are gathered?
t

Tom van Bussel

08/29/2023, 3:21 PM
What version of Delta Lake are you using (Open Source/Fabric/Databricks/etc)? With which system are you using Delta Lake (Spark?) and what is the type of the column?
a

Alberto Rguez

08/29/2023, 3:24 PM
Using Databricks SQL warehouse and the column is a timestamp but I have tried in integers as well
let me check the delta version but I would say it must have the last version
t

Tom van Bussel

08/29/2023, 3:26 PM
No need, I think I know what's going on.
a

Alberto Rguez

08/29/2023, 3:27 PM
well that is fantastic because I am lost
t

Tom van Bussel

08/29/2023, 3:27 PM
The issue is that Delta only stores timestamps at a millisecond level in the statistics (long standing issue). This means that the min/max statistics cannot be used to answer queries using only the metadata. The min/max statistics for timestamps can only be used for data skipping.
a

Alberto Rguez

08/29/2023, 3:29 PM
and why it does not work for integer. It scans the whole table
t

Tom van Bussel

08/29/2023, 3:30 PM
Hmmm, for integers it should definitely work, as long as we have statistics for all files.
a

Alberto Rguez

08/29/2023, 3:31 PM
Well I thought what the only thing needed was to put the column between the 32 first columns
t

Tom van Bussel

08/29/2023, 3:31 PM
Yeah, in that case Delta should gather statistics for the column.
a

Alberto Rguez

08/29/2023, 3:32 PM
the explain plan suggest the contrary but if do elect count from the table it does look at the stats and not scan
t

Tom van Bussel

08/29/2023, 3:32 PM
What does the query look like?
a

Alberto Rguez

08/29/2023, 3:33 PM
Copy code
explain formatted
select count(*) from d_bronze.goco.home_contents
t

Tom van Bussel

08/29/2023, 3:33 PM
Sorry, I have should specified more. I meant the query with the max on the integer column.
a

Alberto Rguez

08/29/2023, 3:33 PM
Copy code
explain formatted
select max(quoteid) from d_bronze.goco.home_contents
yes I was gonna paste both sorry
Copy code
= Physical Plan ==
AdaptiveSparkPlan (9)
+- ColumnarToRow (8)
   +- PhotonResultStage (7)
      +- PhotonAgg (6)
         +- PhotonShuffleExchangeSource (5)
            +- PhotonShuffleMapStage (4)
               +- PhotonShuffleExchangeSink (3)
                  +- PhotonAgg (2)
                     +- PhotonScan parquet d_bronze.goco.home_contents (1)
explain plan for the second one
t

Tom van Bussel

08/29/2023, 3:34 PM
Hmmm, that's odd.
a

Alberto Rguez

08/29/2023, 3:34 PM
Copy code
== Physical Plan ==
LocalTableScan (1)
explain plan for the first one
indeed
I am very surprised as well
t

Tom van Bussel

08/29/2023, 3:35 PM
Since you're using Databricks I'd recommend contacting Databricks Support to help debug the issue.
a

Alberto Rguez

08/29/2023, 3:35 PM
yes, will do that. Thanks for your help, anyway. I was thinking maybe I am doing something wrong
t

Tom van Bussel

08/29/2023, 3:36 PM
One final question: Was the column present at the time the table was created, or was it added at a later time?
a

Alberto Rguez

08/29/2023, 3:37 PM
present
I think I have some colleagues changing the "delta.dataSkippingNumIndexedCols"
do you know if that changes how to make sure stats are gathered?
d

Dominique Brezinski

08/29/2023, 5:20 PM
Yes changing “delta.dataSkippingNumIndexedCols” affects stats generation I believe. Also note that column counts include struct fields etc.
i

Itai Yaffe

09/04/2023, 10:02 AM
Can this be related to the fact that
SELECT COUNT(*)
relies on metadata since version 2.2.0, whereas
SELECT MAX(col)
still does not (see this PR and this thread)? Also, @Tom van Bussel - are you saying that, as long as a column is an integer and is within the
delta.dataSkippingNumIndexedCols
boundary, a query such as
MAX(col)
should use the metadata (and not scan the entire table)? If that's the case, I guess it solves most of the issues that drove @Felipe Pessoto to open the aforementioned PR to begin with, right?