John Darrington

06/09/2023, 4:53 PM
This might be a dumb question - but for DataFusion to run a query against a Parquet/CSV file, does it have to load the entire file into memory? And how does it query a full Delta table spread across multiple files?

Will Jones

06/09/2023, 4:56 PM
"but for DataFusion to run a query against a parquet/csv file, does it have to load entire file into memory?"
No, it will try to only load the parts of the files it needs to. There's a good blog post diving into how it can do this for Parquet: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
For Delta Lake tables, it's similar, except we pass more file-level statistics and info, so it can be a little smarter (not sure how much we have implemented yet, though).
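As a rough illustration of the idea (a toy model in plain Python, not DataFusion's or delta-rs's actual code): if each file comes with min/max statistics for a column, a query with a filter on that column can skip any file whose range cannot match. The `FileStats` class, the `files_to_scan` helper, and the file names below are all hypothetical, made up for this sketch. The same reasoning applies one level down to row groups inside a single Parquet file.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file metadata: a path plus min/max of an `id` column."""
    path: str
    min_id: int
    max_id: int


def files_to_scan(files, lo, hi):
    """Keep only files whose [min_id, max_id] range overlaps the query range [lo, hi]."""
    return [f.path for f in files if f.max_id >= lo and f.min_id <= hi]


# A made-up table spread across three files.
table = [
    FileStats("part-0.parquet", 0, 999),
    FileStats("part-1.parquet", 1000, 1999),
    FileStats("part-2.parquet", 2000, 2999),
]

# Query: WHERE id BETWEEN 1500 AND 1600
selected = files_to_scan(table, 1500, 1600)
print(selected)  # ['part-1.parquet']
```

Only one of the three files needs to be opened for this filter; the other two are ruled out from metadata alone, without reading their contents.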

John Darrington

06/09/2023, 5:30 PM
thank you very much