I am giving a talk at the Data + AI Summit in a few weeks on why Delta Lake is great for pandas. Here’s the high-level flow of the talk:
• Use the 1e9-row h2o groupby dataset (it’s 50 GB when stored as an uncompressed CSV file)
• Run a pandas query on the CSV file and show the runtime
• Run the same query on a Parquet file & show how that helps
• Run the same query on a Z-Ordered Delta table and show how much faster that makes the query
• Then discuss all the quality-of-life improvements Delta Lake offers pandas users (versioned data, time travel, etc.)
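The benchmark query in the flow above can be sketched with a tiny in-memory stand-in. This is a minimal sketch, assuming the first h2o groupby question (sum of v1 grouped by id1); the column names follow the benchmark, the sample values are made up, and the real run would read the 50 GB CSV with pd.read_csv instead:

```python
import pandas as pd

# Tiny stand-in for the 1e9-row h2o groupby dataset
# (the real benchmark reads the 50 GB CSV from disk).
df = pd.DataFrame({
    "id1": ["id016", "id016", "id054", "id054", "id089"],
    "v1": [4, 2, 7, 1, 5],
})

# h2o groupby question 1: sum of v1 grouped by id1.
result = df.groupby("id1", as_index=False).agg({"v1": "sum"})
print(result)
```

The same query is then re-run against the Parquet file and the Z-Ordered Delta table; only the read path changes, so the runtime differences come from the storage format.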
Here are a few things that could make my presentation better:
• Translating this R script to Rust (the R script is so slow and hard to run)
• Adding the `delta_table.optimize.z_order(["col1", "col2"])` interface to delta-rs (I am currently Z-Ordering with PySpark)
• Making a release so I can access the to_pandas filters syntax that @shingo added in this PR, so the code snippet looks nicer
• I will be highlighting time travel with Delta Lake. Any of the other interfaces outlined in this doc would also be a good way to show the quality-of-life improvements deltalake gives pandas users compared with Parquet data lakes.