Matthew Powers

02/28/2023, 3:29 PM
See my comment on the “Pandas 2.0 and the Arrow revolution” thread. Lots of users are really excited about using Polars / DuckDB to filter datasets and then convert the results to pandas DataFrames to avoid memory errors. I think we have a great opportunity to show the pandas community that Delta Lake supports predicate pushdown filtering and can help solve their memory problems. I have an idea for benchmarks we can prepare for the pandas community demonstrating how we can solve their out-of-memory errors / make their analyses run faster. Would anyone like to collab with me on these benchmarks? It’s probably like 2 hours of work because I already have the dataset and problem loaded in my head. I’d be able to generate some really compelling content for the pandas community with these benchmarks.