Simon Thelin
07/20/2023, 1:19 PM
`processAllAvailable` is for testing.
However we have a scenario where
• A user starts the stream job (due to a concurrent insert to a source table)
• The job runs and processes these events
◦ while this runs, multiple users could add more data
• Once the stream goes idle, we want to shut it down.
We have periods of bursty data and want to run the job as a stream, then keep it shut down for a longer period of time.
A setup like:
`processAllAvailable`
`processingTime="1 second"`
does exactly what we want.
But I was worried that `processAllAvailable` was not deemed safe?
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.streaming.StreamingQuery.processAllAvailable.html
Over time this might move to a 24/7 stream, but that is not certain.
We essentially want to handle concurrent events from the source in a streaming way, but also handle events added after the initial stream has started. Trigger once and availableNow=True
do not pick up changes while the stream runs.
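For reference, a minimal sketch of the pattern described above (run as a stream during a burst, then shut down once the source goes idle). The paths, format, and source table are hypothetical placeholders, and this assumes a working Spark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("burst-stream").getOrCreate()

# Hypothetical source and sink paths -- substitute your own.
query = (
    spark.readStream.format("delta")
    .load("/path/to/source")
    .writeStream.format("delta")
    .option("checkpointLocation", "/path/to/checkpoint")
    # Keep polling frequently so rows inserted mid-run are picked up,
    # unlike Trigger.Once / availableNow=True.
    .trigger(processingTime="1 second")
    .start("/path/to/sink")
)

# Block until the source is idle: processAllAvailable() returns only
# after all data available at the time of the call has been processed,
# including rows added while the stream was already running.
query.processAllAvailable()

# Once idle, shut the stream down until the next burst.
query.stop()
```

The caveat in the linked docs is that `processAllAvailable()` can block forever if data keeps arriving, which is why it is documented as intended for testing; for a bursty workload that is expected to go idle, that blocking behavior is exactly the shutdown signal being used here.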