Matthew Powers

06/05/2023, 2:30 PM
I am giving a talk at the Data + AI Summit in a few weeks on why Delta Lake is great for pandas. Here’s the high level flow of the talk:
• use the 1e9 h2o groupby dataset (it’s 50 GB when stored in an uncompressed CSV file)
• Run a pandas query on the CSV file and show the runtime
• Run the same query on a Parquet file & show how that helps
• Run the same query on a Z Ordered Delta table and see how much faster that makes the query
• Then discuss all the quality of life improvements Delta Lake offers pandas users (versioned data, time travel, etc.)
Here are a few things that could make my presentation better:
• translating this R Script to Rust (the R Script is so slow and hard to run)
• adding the delta_table.optimize.z_order(["col1", "col2"]) interface to delta-rs (I am currently Z Ordering with PySpark)
• Making a release so I can access the to_pandas filters syntax that @shingo added in this PR, so the code snippet looks nicer
• I will be highlighting time travel with Delta Lake. Any of the other interfaces outlined in this doc would also be a good way to show the quality of life improvements deltalake gives pandas users compared with Parquet data lakes.
👍 4

Will Jones

06/05/2023, 2:37 PM
I am adding z order and making a release this week :)
🦀 4

Matthew Powers

06/05/2023, 2:39 PM
Wow, that is so incredibly exciting!!!