
Simon Thelin

07/20/2023, 1:19 PM
In the docs it is stated that `processAllAvailable` is for testing. However, we have a scenario where:
• A user starts the stream job (due to a concurrent insert to a source table)
• The job runs and processes these events
◦ while this runs, multiple users could add in more data
• Once the stream goes idle, we want to shut it down.
We have periods of bursts of data and want to run it as a stream, then keep it shut down for a longer period of time. A setting like `processAllAvailable` with `processingTime="1 second"` does exactly what we want. But I was afraid `processAllAvailable` was not deemed safe? https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.streaming.StreamingQuery.processAllAvailable.html Over time this might move into a 24/7 stream, but that is not certain. We essentially want to handle concurrent events from the source in a streaming way, but also handle events that are added after the initial stream has started. Trigger once, and `availableNow=True`, do not pick up changes as the stream runs.