See my comment on the "Pandas 2.0 and the Arrow revolution" thread. Lots of users are really excited about using Polars / DuckDB to filter datasets and then convert the results to pandas DataFrames to avoid memory errors.
I think we have a great opportunity to educate the pandas community that Delta Lake allows for predicate pushdown filtering and can help solve their memory problems.
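For example, here's a minimal sketch of the kind of query I mean (the table path, partition column, and column names are just placeholders, and the exact API may differ a bit by deltalake version). Only the files/rows matching the predicate get read, so the pandas DataFrame you end up with stays small even when the full Delta table wouldn't fit in memory:

```python
import pyarrow.dataset as ds
from deltalake import DeltaTable

# Hypothetical Delta table with `date`, `customer_id`, and `amount` columns
dt = DeltaTable("./data/events")

pandas_df = (
    dt.to_pyarrow_dataset()
      .to_table(
          columns=["date", "customer_id", "amount"],  # column pruning
          filter=ds.field("amount") > 1_000,          # predicate pushdown
      )
      .to_pandas()
)
```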
I have an idea for benchmarks we can prepare for the pandas community demonstrating how to solve their out-of-memory errors and make their analyses run faster. Would anyone like to collab with me on these benchmarks? It's probably only ~2 hours of work because I already have the dataset and problem loaded in my head. I'd be able to generate some really compelling content for the pandas community with these results.