Matthew Powers

04/26/2023, 9:40 PM
I am running one query from the h2o benchmarks that were recently updated by duckdb, see here. I ran the first groupby benchmark locally (M1 Mac, 64 GB of RAM) with different query engines. The 100 million row CSV file (1e8) is 5GB on disk and the 1 billion row file (1e9) is 50GB on disk. I am counting data loading time + query runtime. It’s a 9-column dataset and here’s the query:
SELECT id1, sum(v1) AS v1 from the_table GROUP BY id1
Here are my results:
• DuckDB: query errors out, even on the 1e8 dataset (notebook)
• DataFusion: 9.8 seconds 1e8 / 99 seconds 1e9 (notebook)
• Polars: 7.5 seconds 1e8 / 229 seconds 1e9 (notebook)
• Pandas 2.0 with usecols: 19.3 seconds 1e8 / 204 seconds 1e9 (notebook)
I tried a bunch of tweaks to get DuckDB working, but it only worked on the tiny 1e7 dataset, so not sure what I’m doing wrong. I was really impressed with DataFusion’s performance on the 1e9 dataset. The Polars performance surprised me on the 1e9 dataset. And Pandas 2.0 just seems way better than what it used to be. I am going to use this dataset and structure some queries that show how Delta Lake can allow for faster query runtimes.
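For anyone who wants to see what "data loading time + query runtime" covers, here is a rough sketch using only the Python standard library. It is not the code from any of the linked notebooks. It reads just the two columns the query needs (the same idea as `usecols` in pandas) and times the load and the aggregation together. The helper names `groupby_sum` and `timed` are made up for illustration, and it assumes `v1` holds integers.

```python
import csv
import time
from collections import defaultdict


def groupby_sum(csv_path, key="id1", value="v1"):
    """Rough equivalent of:
    SELECT id1, sum(v1) AS v1 FROM the_table GROUP BY id1
    Assumption: the value column contains integers.
    """
    totals = defaultdict(int)
    with open(csv_path, newline="") as f:
        # DictReader parses every column, but only two are used here.
        for row in csv.DictReader(f):
            totals[row[key]] += int(row[value])
    return dict(totals)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds).

    Because groupby_sum opens and reads the file itself, the elapsed
    time covers loading plus aggregation, as in the numbers above.
    """
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Usage would look like `result, seconds = timed(groupby_sum, "groupby_1e8.csv")`, where the file name is a placeholder. A pure-Python loop like this will be far slower than any of the engines listed; it only shows what is being measured.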

Will Jones

04/26/2023, 9:58 PM
For DuckDB, it may be worth looking at how they wrote it. I noticed some interesting config here:

Matthew Powers

04/26/2023, 9:59 PM
I tried both of those DuckDB flags and still couldn’t get it working. I also tried manually setting a memory max, and that didn’t work either.