
Matthew Powers

04/29/2023, 3:59 PM
I created a benchmark that shows how to run a pandas query 20 times faster with a Z ORDERed Delta Lake compared with a CSV file (for a 100 million row / 5GB dataset). I think this is going to be one of the most compelling reasons we can give the pandas community to start using Delta Lake.
🚀 3
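The benchmark query itself isn't shown in the thread; a minimal pandas sketch of the kind of filtered aggregation that benefits from a Z ORDER on `id1` might look like this (synthetic data; the column names follow the `id1`/`v1` naming mentioned later in the thread, and the exact query is an assumption):

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the 100M-row dataset.
rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    "id1": rng.choice([f"id{i:03d}" for i in range(10)], n),
    "v1": rng.integers(1, 6, n),
})

# Against a plain CSV, pandas must parse every row before this filter runs.
# Against a Delta table Z ORDERed on id1, per-file min/max statistics let
# the reader skip files whose id1 range can't match the predicate.
result = df.loc[df["id1"] == "id001", "v1"].sum()
```

The speedup comes from file skipping at read time, not from the aggregation itself.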
t

Theo LEBRUN

04/29/2023, 4:22 PM
very cool! Can you share which column you z-ordered on? `id1`, `id2` and `v1`?
Matthew Powers

04/29/2023, 8:27 PM
@Theo LEBRUN - The Z ORDER was on `id1`. I am going to make more examples to show the types of queries that benefit from multiple Z ORDER columns.
Kees Duvekot

04/30/2023, 2:21 PM
What happens when the CSV file is also ordered on that column? (Don't get me wrong ... still impressive, and I am a big fan of Delta, but these are the questions others will also ask.)
Matthew Powers

04/30/2023, 2:23 PM
If the CSV file was ordered, all the data would still need to be read and the exact same code would have to be executed, so I don’t think that would improve query time at all. I will check to confirm - appreciate the comment.
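The point about ordered CSVs can be illustrated with a toy sketch: even a perfectly sorted CSV carries no min/max statistics, so pandas parses every row and the filter only runs afterwards in memory (the data below is made up for illustration):

```python
import io
import pandas as pd

# A tiny CSV already sorted by id1. Sorting doesn't help pandas:
# read_csv has no file statistics to consult, so it parses all rows.
csv_data = io.StringIO(
    "id1,v1\n"
    "id001,1\n"
    "id001,2\n"
    "id002,3\n"
    "id003,4\n"
)
df = pd.read_csv(csv_data)           # full scan: all 4 rows parsed
subset = df[df["id1"] == "id001"]    # filter happens after the read
print(len(df), len(subset))          # prints: 4 2
```

A Z ORDERed Delta table, by contrast, stores per-file min/max values for the clustered column, so files that cannot contain `id001` are never read at all.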
Kees Duvekot

04/30/2023, 2:23 PM
Then a second question: how long does it take to create that Delta table from the CSV file?
Can we get the full conversion and calculation within the same 20 secs?
Matthew Powers

04/30/2023, 2:25 PM
I converted from 5GB CSV => Delta Lake using PySpark locally. It didn’t take too long, but that’s not what PySpark is optimized for. Hopefully we can provide some Polars/DataFusion interface that lets us perform this computation faster.
We also need to add Z ORDER to delta-rs: https://github.com/delta-io/delta-rs/issues/1127
But I’m not arguing that users should switch to Delta Lake for a one-off query. They shouldn’t convert just to make one query faster; they should convert for all their downstream queries.
Kees Duvekot

04/30/2023, 2:27 PM
Yes .. subsequent queries on the same dataset are always quicker, yes ..
But wouldn't that also be the case with the pandas dataframe for that CSV? I.e., does pandas keep that dataframe in memory?
If it's about subsequent queries, you might be able to extend the example by running the same calculation for all possible values of id1 (so loop across all distinct values of id1).
Then the "overhead" of creating the Delta table goes away entirely.
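Kees's suggestion could be sketched in plain pandas like this (hypothetical toy data; the real benchmark would run the per-`id1` query against the Delta table so each iteration benefits from file skipping):

```python
import pandas as pd

# Toy stand-in dataset.
df = pd.DataFrame({
    "id1": ["id001", "id001", "id002", "id003"],
    "v1": [1, 2, 3, 4],
})

# Run the same aggregation once per distinct id1 value, so the one-time
# cost of converting CSV -> Delta is amortized across many queries.
totals = {key: grp["v1"].sum() for key, grp in df.groupby("id1")}
```

On a Z ORDERed Delta table, each per-`id1` query pushes the predicate down and skips non-matching files, so the amortized cost per query drops as the loop grows.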