Matthew Powers
04/26/2023, 9:40 PM
SELECT id1, sum(v1) AS v1 FROM the_table GROUP BY id1. Here are my results:
• DuckDB: query errors out, even on the 1e8 dataset (notebook)
• DataFusion: 9.8 seconds 1e8 / 99 seconds 1e9 (notebook)
• Polars: 7.5 seconds 1e8 / 229 seconds 1e9 (notebook)
• Pandas 2.0 with usecols: 19.3 seconds 1e8 / 204 seconds 1e9 (notebook)
I tried a bunch of tweaks to get DuckDB working, but it only worked on the tiny 1e7 dataset, so I'm not sure what I’m doing wrong.
I was really impressed with DataFusion’s performance on the 1e9 dataset. Polars surprised me on the 1e9 dataset: it was the fastest on 1e8, but at 229 seconds it was slower than both DataFusion and Pandas 2.0 on 1e9. And Pandas 2.0 just seems way better than what it used to be.
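For anyone following along, here is a minimal sketch of what the query computes and how a single run could be timed. It is not the benchmark code from the notebooks: it swaps in Python's built-in sqlite3 module as the engine, and the sample rows, the run_groupby helper, and the added ORDER BY (only there to make the output deterministic) are all assumptions for illustration.

```python
import sqlite3
import time

# The query from the message above.
QUERY = "SELECT id1, sum(v1) AS v1 FROM the_table GROUP BY id1"


def run_groupby(rows):
    """Load (id1, v1) rows into an in-memory SQLite table, run QUERY, and time it.

    Hypothetical helper: SQLite stands in for the engines being compared.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE the_table (id1 TEXT, v1 INTEGER)")
        conn.executemany("INSERT INTO the_table VALUES (?, ?)", rows)
        start = time.perf_counter()
        # ORDER BY is added only so the result order is stable for display.
        result = conn.execute(QUERY + " ORDER BY id1").fetchall()
        elapsed = time.perf_counter() - start
    finally:
        conn.close()
    return result, elapsed


# Tiny made-up sample, nothing like the 1e8 / 1e9 row datasets.
sample_rows = [
    ("id001", 1),
    ("id002", 5),
    ("id001", 3),
    ("id003", 2),
    ("id002", 4),
]

result, elapsed = run_groupby(sample_rows)
print(result)
print(f"{elapsed:.6f} seconds")
```

Each distinct id1 gets one output row with its v1 values summed. The timing only covers query execution, not loading, so it is not comparable to the numbers above.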
I am going to use this dataset and structure some queries that show how Delta Lake can allow for faster query runtimes.
Will Jones
04/26/2023, 9:58 PM
Matthew Powers
04/26/2023, 9:59 PM