
Andrea Campolonghi

08/04/2023, 8:19 PM
Not 100% a Delta question. I read a stream from parquet files in S3; these are then appended to a Delta table. The writer uses the mergeSchema option. My issue is that the reader does not update its schema after the stream has started.
df = (spark
      .readStream
      .format("parquet")
      .schema(schema)
      .load("s3a://****/public/sales/"))
I tried setting spark.sql.streaming.schemaInference, but it does not change the result. Once the stream has started, the schema does not change even if the newest parquet files have one extra column. Any suggestions?
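For context, the write side looks roughly like this (a sketch only; the checkpoint and table paths are placeholders, redacted like above):

# Rough sketch of the append side (paths are placeholders):
query = (df.writeStream
         .format("delta")
         .option("mergeSchema", "true")                                # evolve the Delta table schema on write
         .option("checkpointLocation", "s3a://****/checkpoints/sales/")
         .outputMode("append")
         .start("s3a://****/delta/sales/"))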

Dominique Brezinski

08/04/2023, 9:11 PM
This is the nature of streams: because they can have schema-dependent operations and are checkpointed for recovery, the schema is fixed when the query starts. If the upstream schema changes, the stream needs to be restarted with a new checkpoint. That is quite challenging if the source is not versioned or has no other mechanism to start at a certain point.
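Roughly, the restart pattern looks like this (a sketch only; the extra column, paths, and checkpoint location are placeholders):

# Sketch: stop the query, then restart with the updated schema
# and a NEW checkpoint location so the stream re-plans from scratch.
query.stop()

new_schema = schema.add("extra_col", "string")   # placeholder for the evolved schema

df = (spark
      .readStream
      .format("parquet")
      .schema(new_schema)
      .load("s3a://****/public/sales/"))

query = (df.writeStream
         .format("delta")
         .option("mergeSchema", "true")
         .option("checkpointLocation", "s3a://****/checkpoints/sales_v2/")  # fresh checkpoint
         .outputMode("append")
         .start("s3a://****/delta/sales/"))

Note that with a fresh checkpoint the parquet file source will reprocess every file it can see, which is why a versioned source makes this much easier.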