
Yousry Mohamed

02/01/2023, 9:19 PM
Hi folks, Bloom filter indexes are supposed to be a data-skipping method, like the column-level statistics embedded in the transaction log files. When I issue a query that can benefit from column-level stats, the Spark SQL UI for that query shows some files being pruned and not read at all, which makes the whole query faster. That is not what I have seen with Bloom filter indexes, at least with the notebook provided on the docs page https://docs.databricks.com/optimizations/bloom-filters.html. That notebook does not follow the classic benchmarking practice of disabling the disk cache, so its results could be misleading, and it uses SELECT count(*) rather than SELECT *. In any case, Bloom filters did not produce the same file-pruning behaviour as column-level stats. I wonder if anyone has used them effectively before and can explain how to verify they are really making a difference.
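For anyone reproducing this, here is a minimal sketch of the kind of benchmark described above. It is not the docs notebook itself; the table and column names (bloom_test, sha) are assumptions taken from the plan shown later in this thread.

# Disable the Databricks disk cache so repeated runs read from storage rather than
# local SSD, which would otherwise mask any file-skipping benefit.
spark.conf.set("spark.databricks.io.cache.enabled", "false")

# Use SELECT * with a selective predicate so the query has to scan data files,
# rather than a bare SELECT count(*), which Delta can answer from metadata alone.
df = spark.sql("""
    SELECT *
    FROM bloom_test
    WHERE sha = 'b6589fc6ab0dc82cf12099d1c2d40ab994e8410c'
""")
df.collect()  # materialise the result, then compare 'number of files read' on the scan node in the SQL UI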

JosephK (exDatabricks)

02/01/2023, 9:43 PM
In the physical plan, look for DataFilters. You're correct that SELECT count(*) is not a good way to benchmark because it's a metadata operation for Delta Lake and won't even scan. I'm going to pass this message to the docs team.
Ah, it does look like this one will scan because of the filter and not just retrieve metadata.
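A minimal way to pull up the formatted plan and check the scan node for DataFilters / PushedFilters (the table name and predicate here are assumptions carried over from the rest of the thread):

plan_df = spark.sql("""
    EXPLAIN FORMATTED
    SELECT *
    FROM bloom_test
    WHERE sha = 'b6589fc6ab0dc82cf12099d1c2d40ab994e8410c'
""")
print(plan_df.collect()[0][0])

# Or, equivalently, on a DataFrame:
spark.table("bloom_test") \
    .where("sha = 'b6589fc6ab0dc82cf12099d1c2d40ab994e8410c'") \
    .explain("formatted")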

Yousry Mohamed

02/01/2023, 11:27 PM
Whether I filter on an indexed column or not, the execution plan shows the operator Scan parquet with Bloom Filters, which is just evidence that the _delta_index folder exists. And in both cases there is no DataFilters entry in the execution plan. I also verified that the index has non-empty data (see the listing sketch after the plan below).
== Physical Plan ==
CollectLimit (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- Scan parquet with Bloom Filters spark_catalog.default.bloom_test (1)


(1) Scan parquet with Bloom Filters spark_catalog.default.bloom_test
Output [7]: [id#496L, str1#497, sha#498, sha1#499, sha2_256#500, row_hash_too_big#501, row_hash#502]
Batched: true
Location: PreparedDeltaFileIndex [dbfs:/tmp/bloom_test]
PushedFilters: [EqualTo(sha,b6589fc6ab0dc82cf12099d1c2d40ab994e8410c)]
ReadSchema: struct<id:bigint,str1:string,sha:string,sha1:string,sha2_256:string,row_hash_too_big:string,row_hash:string>

(2) ColumnarToRow [codegen id : 1]
Input [7]: [id#496L, str1#497, sha#498, sha1#499, sha2_256#500, row_hash_too_big#501, row_hash#502]

(3) Filter [codegen id : 1]
Input [7]: [id#496L, str1#497, sha#498, sha1#499, sha2_256#500, row_hash_too_big#501, row_hash#502]
Condition : (sha#498 = b6589fc6ab0dc82cf12099d1c2d40ab994e8410c)

(4) CollectLimit
Input [7]: [id#496L, str1#497, sha#498, sha1#499, sha2_256#500, row_hash_too_big#501, row_hash#502]
Arguments: 1001
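For the non-empty index check mentioned above, a rough sketch; the exact layout of _delta_index under the table location is an assumption, the table path is the one shown in the plan:

# List the Bloom filter index files under the table location and confirm they are non-empty.
for f in dbutils.fs.ls("dbfs:/tmp/bloom_test/_delta_index/"):
    print(f.path, f.size)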