Cole MacKenzie

05/23/2023, 9:33 PM
Are there any guarantees or tests around the "correctness" of a query using DataFusion on a Delta Lake table (using delta-rs)? Basically, how do we know we can trust the result? I was looking at sqlite and see they have a tool (https://www.sqlite.org/sqllogictest/doc/trunk/about.wiki) to validate the result against other DB engines.
Looks like duckdb uses the same tool https://github.com/duckdb/duckdb/tree/master/test/sql
It is also unclear to me how much of the "correctness" falls to datafusion or delta-rs in this scenario

Will Jones

05/23/2023, 9:50 PM
The query point is interesting. Naively, one could say all we really have to test are scans on various tables. But there are definitely queries where we might perform certain optimizations, and those need to be tested. My initial reaction is that we are probably served well enough by integration tests (which we could always enhance), while assuming to a certain extent that DataFusion is handling queries correctly.

Robert

05/23/2023, 10:40 PM
DataFusion is of course heavily tested, and they have been investing heavily in sqllogic tests over there, which also validate results against a reference Postgres instance. So I guess the main burden of SQL correctness lies over there. Of course we do have something to prove as well and can always get better. For some of the core Delta logic, like commits, we are replicating the test suites from Spark. All in all I think we are in an OK state, but as you all previously said, we can always do more 🙂.
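
The idea discussed in this thread, validating query results against a reference engine, can be sketched roughly as below. This is only a minimal illustration and not how DataFusion or delta-rs run their tests: both "engines" are in-memory SQLite connections standing in for a reference database and an engine under test, and the helper names (`make_engine`, `run_query`, `compare_engines`), the table `t`, and the queries are all hypothetical.

```python
import sqlite3


def make_engine(rows):
    """Build an in-memory SQLite database holding the sample table.

    Assumption: SQLite stands in for both the reference engine and the
    engine under test; a real setup would use two different engines.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, value TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return conn


def run_query(conn, sql):
    """Run a query and sort the rows so that row order does not matter."""
    return sorted(conn.execute(sql).fetchall())


def compare_engines(reference, candidate, queries):
    """Return the queries whose results differ between the two engines."""
    return [
        sql
        for sql in queries
        if run_query(reference, sql) != run_query(candidate, sql)
    ]


# Hypothetical sample data and queries, chosen only for illustration.
ROWS = [(1, "a"), (2, "b"), (3, "b")]
QUERIES = [
    "SELECT COUNT(*) FROM t",
    "SELECT value, COUNT(*) FROM t GROUP BY value",
    "SELECT id FROM t WHERE value = 'b'",
]

reference = make_engine(ROWS)
candidate = make_engine(ROWS)  # stand-in for the engine under test
mismatches = compare_engines(reference, candidate, QUERIES)
print(mismatches)
```

An empty list means both engines agreed on every query. Sorting the rows before comparing reflects that SQL results without an `ORDER BY` have no guaranteed order, so two correct engines may return the same rows in different sequences.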