Matt Moen

06/24/2023, 4:40 PM
I'm trying to upsert data from a kafka stream into a large-ish delta lake dataset (about 35B rows) using spark structured streaming. Upserted data is relatively small per batch, maybe 1-2m records. The data is partitioned, I'm using a partition field condition in the merge query, and it only affects modest subset of the partitions, but I still see pyspark/delta scan the entire dataset. The table is also zordered by the columns I'm matching on for the upsert. Does delta lake oss upsert use partitions and zordered columns for data skipping? Is it expected that delta needs to scan the entire table to upsert? I'd expect partition pruning at least to make this more efficient but I'm not seeing it. Edit: I've removed streaming from the equation and am just trying to get a series of parquet files uploaded efficiently and seeing the same behavior as above