06/01/2023, 8:20 AM
Hello, I am reading AWS Kinesis data into a Delta table. One column, named "data", is nested JSON that holds the information I'm actually interested in. So far I have been explicitly applying a schema to this column to get the desired output columns. My question: can we generate the schema dynamically, so that any new fields added to the "data" column are also added to the Delta table?

Gerhard Brueckl

06/01/2023, 10:14 AM
If you have schema evolution enabled and ingest data with additional columns/fields, this should happen automatically.
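
For reference, a minimal sketch of what "schema evolution enabled" could look like on a streaming write, assuming PySpark 3.1+ with Delta Lake and using the `mergeSchema` writer option. The function name, `parsed_df`, the table name, and the checkpoint path are hypothetical placeholders, not something from this thread.

```python
def start_evolving_stream(parsed_df, table_name, checkpoint_path):
    """Start a streaming append into a Delta table with schema evolution on.

    `parsed_df` is assumed to be a streaming DataFrame whose columns have
    already been extracted from the JSON payload (hypothetical input).
    """
    return (
        parsed_df.writeStream
        .format("delta")
        .outputMode("append")
        # Lets the Delta writer add columns that are new in the DataFrame.
        .option("mergeSchema", "true")
        .option("checkpointLocation", checkpoint_path)
        .toTable(table_name)
    )
```

Note that this only helps when the new fields actually show up as columns of the DataFrame. If the JSON is parsed with a fixed, explicit schema, fields missing from that schema are dropped before the writer ever sees them.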

Dominique Brezinski

06/01/2023, 1:59 PM
The dynamic JSON schema discovery is a little wonky. I believe that to accomplish what you want, you have to move to the foreachBatch API, run JSON schema inference on the DataFrame passed to foreachBatch, and then append to a table that has schema evolution enabled. Unless something changed in the last few versions of Spark, JSON schema inference didn’t work on DataFrames that have `isStreaming` set to true. I think there is a newer SQL function that will produce the schema for a single JSON string value, but that doesn’t really help your use case. You could use it to produce an extra column with the schema of each specific record, then diff your extract schema against the record schema to produce a column that flags records with more fields than were extracted. You would use that flag to manually update your extract schema and restart. I say all this off the top of my head while making coffee, so I could be wrong 🤣
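
A rough sketch of the two ideas above, assuming PySpark 3.3+ (for `DataFrame.sparkSession` and `DataFrame.isEmpty`), Delta Lake, and a cluster where the RDD API is available. All function names, the `events` table name, and the dotted-path representation of the extract schema are hypothetical; the diff is done in plain Python here rather than with a Spark SQL function.

```python
import json


def field_paths(value, prefix=""):
    """Return the set of dotted field paths present in a decoded JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= field_paths(child, path)
    elif isinstance(value, list):
        # Array elements share the path of the array itself.
        for item in value:
            paths |= field_paths(item, prefix)
    return paths


def unextracted_fields(json_text, extracted_paths):
    """List fields found in one record that the extract schema does not cover."""
    return sorted(field_paths(json.loads(json_text)) - set(extracted_paths))


def append_with_inferred_schema(batch_df, batch_id, table_name="events"):
    """foreachBatch handler: infer the JSON schema per micro-batch, then append.

    Inside foreachBatch the DataFrame is a static batch, so batch JSON
    schema inference can be applied to it.
    """
    if batch_df.isEmpty():
        return
    spark = batch_df.sparkSession
    # Kinesis payloads usually arrive as binary, so cast to string first.
    json_rdd = (
        batch_df.selectExpr("CAST(data AS STRING) AS data")
        .where("data IS NOT NULL")
        .rdd.map(lambda row: row.data)
    )
    parsed = spark.read.json(json_rdd)  # schema inferred from this batch only
    (
        parsed.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")  # let new fields become new columns
        .saveAsTable(table_name)
    )
```

The handler would be wired up with something like `raw_stream.writeStream.foreachBatch(append_with_inferred_schema).option("checkpointLocation", "<path>").start()`, where `raw_stream` is the hypothetical Kinesis source DataFrame. `unextracted_fields` could instead be wrapped in a UDF to flag records carrying fields the fixed extract schema misses. One trade-off to keep in mind: the schema is inferred per batch, so a field's inferred type may differ between batches, and a conflicting type will still fail the append.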