Eyal Perry

08/20/2023, 4:02 PM
Hi! A question about handling corrupt or mismatched records in Spark/Scala + Delta. Suppose I am consuming a Delta table that contains a record which is missing a column, or whose type is mismatched. At runtime I use `.as[MyCaseClass]` to cast the DataFrame to a case class matching the table schema. Is there some configuration, like the csv/json reader's `mode` option (e.g. `DROPMALFORMED`), that can natively skip the offending record instead of throwing an error that terminates my application? Ideally I would emit metrics whenever a corrupt record is encountered and be able to filter those records out (e.g. by returning null), which would make the application more robust. I could always code this with Try + flatMap + transform, but I would rather use a more native, performant solution. P.S. I am not running on Databricks and that is not going to change, so I would appreciate a native Spark solution.
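For reference, here is a minimal sketch of the Try + flatMap fallback mentioned above, assuming a hypothetical `Record` case class with `id` and `name` fields (these names, and the `toRecords` helper, are illustrative, not part of any Spark API). Failed rows are counted with a `LongAccumulator` for metrics and dropped from the result:

```scala
import scala.util.Try
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical target schema for the Delta table.
case class Record(id: Long, name: String)

// Parse each Row defensively: a missing column or a type mismatch
// throws inside Try, increments the accumulator, and yields no row.
def toRecords(df: DataFrame)(implicit spark: SparkSession): Dataset[Record] = {
  import spark.implicits._
  val corrupt = spark.sparkContext.longAccumulator("corruptRecords")
  df.flatMap { row =>
    val parsed = Try(Record(row.getAs[Long]("id"), row.getAs[String]("name")))
    if (parsed.isFailure) corrupt.add(1L)
    parsed.toOption // None drops the corrupt record
  }
}
```

The accumulator value can be read on the driver after an action runs and exported to whatever metrics system is in use. The per-row `Try` does add overhead compared to a native reader option, which is exactly the trade-off the question is about.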