https://delta.io logo
l

Louis Frolio

03/30/2023, 7:35 PM
I just ran a simple test. 1. Create Delta table (myTable), load it up approx 65k rows 2. Three parquet files were created 3. One of the columns is called InvoiceID 4. Ran query: select max(InvoiceID) from myTable 5. I took a look at the query plan and noticed that the table was scanned to get the max(InvoiceID). I would have thought that Delta could leverage file stats to determine this? It seems that it would be a much more efficient process. Any and all feedback would be appreciated. Cheers, Louis.
w

Will Jones

03/30/2023, 7:37 PM
A max value in statistics means all values are less than or equal to it, but not necessarily that the particular value is literally in there
l

Louis Frolio

03/30/2023, 7:38 PM
That doesn't make sense? Why list a max value unless it is in the file.
w

Will Jones

03/30/2023, 7:38 PM
For example, if a column has [‘a’, ‘b’, ‘c’] the max could be ‘d’ since all values are less than it.
This is used in some parquet writers for more compact stats. If a string column has values 1000 characters long, it can just store a prefix that is 10 characters long is is larger/smaller than all values
👍 1
l

Louis Frolio

03/30/2023, 7:40 PM
So counter intuitive
but it is what it is
thanks
w

Will Jones

03/30/2023, 7:40 PM
There is a tight-bounds setting that allows what you are talking about
So there’s probably some implementations that can do that
But it’s not guaranteed
l

Louis Frolio

03/30/2023, 7:41 PM
Got it.
j

Jim Hibbard

03/30/2023, 7:41 PM
That's really interesting, I'd always assumed the bounds were tight.
w

Will Jones

03/30/2023, 7:42 PM
Well I bet many folks did until that clarification was added. That must have been a fun bug report 🙂
l

Louis Frolio

03/30/2023, 7:42 PM
Is tight the same as "end points inclusive?"
w

Will Jones

03/30/2023, 7:43 PM
Oh actually my warning still might apply even for tight bounds
for string and timestamp columns. See the footnote at the bottom of the protocol
j

Jim Hibbard

03/30/2023, 7:49 PM
OK Will, you've inspired me to read the full protocol for more gems like this. Goodbye my weekend, I hardly knew thee.
🎉 1
Also, we need to have a Delta Lake protocol trivia night at Data+AI Summit.
😆 1
🐿️ 1
g

Gerhard Brueckl

03/31/2023, 7:03 AM
when was this introduced? the default is still "tight", right?
2 Views