dhia Gharsallaoui

08/12/2023, 11:44 AM
Hello mates 👋, we currently use Delta as our data lake, with Spark Structured Streaming applications reading from and writing to its tables. Everything is deployed in a Kubernetes cluster, and we persist Spark checkpoint data to handle crashes, version updates, and so on. We are implementing a system to manage schema changes: column deletions, additions, type changes, and more. My question is about handling those schema changes in Spark streaming. If we delete the checkpoints and the schema change doesn't also delete all the data, the stream will reprocess all the data from the beginning. On the other hand, if we keep the checkpoints, the app could crash due to a change in the version of the Delta table. I've considered adding a custom column that stores the version and instructing the Spark app to read only data from that particular version, which would make it safe to delete the checkpoints. However, I'm looking for an idiomatic way to handle this scenario within Spark and Delta. Any suggestions or insights would be greatly appreciated. Thank you!
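For reference, the "only read from this particular version" idea lines up with the Delta streaming source's `startingVersion` reader option, which may remove the need for a custom column. Below is a minimal sketch, not a definitive implementation: the helper names `delta_stream_options` and `read_delta_stream` and the table path are hypothetical, and it assumes the query starts with a fresh checkpoint location, since the option only applies when a new streaming query is started.

```python
from typing import Dict, Optional


def delta_stream_options(starting_version: Optional[int] = None) -> Dict[str, str]:
    """Build reader options for a Delta streaming source (hypothetical helper)."""
    options: Dict[str, str] = {}
    if starting_version is not None:
        if starting_version < 0:
            raise ValueError("starting_version must be non-negative")
        # `startingVersion` is a Delta streaming source option: the stream
        # reads changes beginning at this table version (inclusive).
        options["startingVersion"] = str(starting_version)
    return options


def read_delta_stream(spark, table_path: str, starting_version: Optional[int] = None):
    """Open a streaming read on a Delta table (hypothetical helper).

    `spark` is an existing SparkSession configured with the Delta Lake package.
    """
    reader = spark.readStream.format("delta")
    reader = reader.options(**delta_stream_options(starting_version))
    return reader.load(table_path)


# Usage (assumed path and version; requires pyspark with Delta Lake configured):
# df = read_delta_stream(spark, "/data/events", starting_version=42)
```

The version to pass would be the table version at which the schema change was committed, so that deleting the old checkpoint does not trigger a full reprocess.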