Yousry Mohamed
02/01/2023, 9:19 PM
SELECT count(*), not SELECT *. Anyway, bloom filters did not produce the same file-pruning behaviour as column-level stats. I wonder if anyone has used them effectively before and can explain how to verify they are really making a difference.

JosephK (exDatabricks)
02/01/2023, 9:43 PM
DataFilters
You're correct that SELECT count(*) is not a good way to benchmark because it's a metadata operation for Delta Lake and won't even scan. I'm going to pass this message to the docs team.

Yousry Mohamed
02/01/2023, 11:27 PM
"Scan parquet with Bloom Filters" is just evidence of the existence of the _delta_index folder. And in both cases, there is no DataFilters entry in the execution plan. I also verified that the index has non-empty data.
== Physical Plan ==
CollectLimit (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- Scan parquet with Bloom Filters spark_catalog.default.bloom_test (1)

(1) Scan parquet with Bloom Filters spark_catalog.default.bloom_test
Output [7]: [id#496L, str1#497, sha#498, sha1#499, sha2_256#500, row_hash_too_big#501, row_hash#502]
Batched: true
Location: PreparedDeltaFileIndex [dbfs:/tmp/bloom_test]
PushedFilters: [EqualTo(sha,b6589fc6ab0dc82cf12099d1c2d40ab994e8410c)]
ReadSchema: struct<id:bigint,str1:string,sha:string,sha1:string,sha2_256:string,row_hash_too_big:string,row_hash:string>

(2) ColumnarToRow [codegen id : 1]
Input [7]: [id#496L, str1#497, sha#498, sha1#499, sha2_256#500, row_hash_too_big#501, row_hash#502]

(3) Filter [codegen id : 1]
Input [7]: [id#496L, str1#497, sha#498, sha1#499, sha2_256#500, row_hash_too_big#501, row_hash#502]
Condition : (sha#498 = b6589fc6ab0dc82cf12099d1c2d40ab994e8410c)

(4) CollectLimit
Input [7]: [id#496L, str1#497, sha#498, sha1#499, sha2_256#500, row_hash_too_big#501, row_hash#502]
Arguments: 1001
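[Editor's note] The pruning trade-off discussed in this thread can be made concrete with a toy model in plain Python. This is a sketch under invented assumptions, not Delta Lake's or Databricks' implementation: the bit-array size, hash count, and file layout are made up for illustration. It shows why min/max column stats almost never skip files for a point lookup on a high-cardinality hash column like sha (every file's [min, max] range spans nearly the whole hex keyspace), while a bloom filter can skip files, because an unset bit means "definitely absent" and only a fully set probe means "maybe present".

```python
# Toy sketch of file skipping on a hash column (NOT Delta's implementation;
# NUM_BITS, NUM_HASHES and the file layout are illustrative assumptions).
import hashlib
import random

NUM_BITS = 8192
NUM_HASHES = 3

def bloom_positions(value: str):
    # Derive k bit positions from k salted hashes of the value.
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
        yield int(digest, 16) % NUM_BITS

def build_file(n_rows, rng):
    # One "data file": random sha1 hex rows plus both skipping structures.
    rows = [hashlib.sha1(str(rng.random()).encode()).hexdigest()
            for _ in range(n_rows)]
    bloom = [False] * NUM_BITS
    for v in rows:
        for pos in bloom_positions(v):
            bloom[pos] = True
    return {"min": min(rows), "max": max(rows), "bloom": bloom}

rng = random.Random(42)
files = [build_file(1000, rng) for _ in range(8)]
target = "b6589fc6ab0dc82cf12099d1c2d40ab994e8410c"  # the sha from the plan

# min/max pruning: a file survives if the target falls inside its range.
# Random hex digests cover nearly the whole keyspace in every file, so a
# point lookup overlaps every range and nothing is skipped.
kept_by_minmax = [f for f in files if f["min"] <= target <= f["max"]]

# Bloom pruning: a file survives only if every probed bit is set ("maybe
# present"); any unset bit means "definitely absent", so the file is skipped.
kept_by_bloom = [f for f in files
                 if all(f["bloom"][p] for p in bloom_positions(target))]

print(len(files), len(kept_by_minmax), len(kept_by_bloom))
```

Under these assumptions, min/max stats keep all eight files while the bloom filter skips the files that definitely lack the value (modulo the filter's false-positive rate). This matches the complaint at the top of the thread: on a hash column, only the bloom filter has a real chance to prune, and verifying it requires a query that actually scans, not a metadata-only count(*).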