Hi team, I'm working on a Spark Streaming application and I notice that the job seems to list all the transaction files in '_delta_log' before it starts computing the snapshot. Is that the expected behavior, or did I miss something? In my mind, to build a snapshot it only needs to find the most recent checkpoint file via the '_last_checkpoint' file, read it, and then apply the few *.json files above it, am I right?
👀 2
Gerhard Brueckl
03/16/2023, 10:19 AM
the stream will start from the first available version and process the transactions in sequence
It does not just start with the latest version of the source table
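A minimal PySpark sketch of how a new stream's starting point can be controlled with Delta's documented `startingVersion` / `startingTimestamp` reader options; the S3 path is purely illustrative, and these options only take effect when the stream starts without an existing streaming checkpoint.
```python
# Hedged sketch: start a new stream from the latest version instead of replaying history.
# The table path is hypothetical; startingVersion/startingTimestamp are real Delta options.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream_df = (
    spark.readStream
    .format("delta")
    .option("startingVersion", "latest")          # only process changes arriving after stream start
    # .option("startingTimestamp", "2023-03-01")  # alternative: begin at a given timestamp
    .load("s3://my-bucket/source-table")          # hypothetical table location
)
```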
Morgan
03/16/2023, 10:20 AM
Reading the most recent checkpoint file (*.parquet) and the few more recent *.json files isn't enough to build the current state?
Gerhard Brueckl
03/16/2023, 10:23 AM
It's enough to get the current state, but as you are streaming from the table it will stream all the transactions individually, starting with the first one.
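A rough PySpark illustration of that point: a plain batch snapshot only needs the checkpoint referenced by `_last_checkpoint` plus the newer commit JSONs. This just peeks at the log manually, it is not Delta's internal code; the paths are hypothetical and a single-part checkpoint file is assumed.
```python
# Hedged sketch: inspect _delta_log the way a snapshot read would use it.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

log_dir = "s3://my-bucket/source-table/_delta_log"  # hypothetical location

# _last_checkpoint is a one-line JSON file pointing at the latest checkpoint version
last_cp = json.loads(spark.read.text(f"{log_dir}/_last_checkpoint").first()["value"])
cp_version = last_cp["version"]

# The checkpoint parquet carries the table state up to cp_version (single-part checkpoint assumed) ...
checkpoint_state = spark.read.parquet(f"{log_dir}/{cp_version:020d}.checkpoint.parquet")

# ... and only commit JSONs with versions greater than cp_version have to be replayed on top.
newer_commits = spark.read.json(f"{log_dir}/*.json")  # in practice, filter to versions > cp_version
```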
Morgan
03/16/2023, 10:37 AM
Thanks for your answers 🙂
I thought I had a problem with my config, but it seems this is normal behavior ^^
Currently I have a streaming job that reads from one Delta table and writes to another (both stored on S3). The job is often interrupted, and each time it restarts it takes a long time before it begins processing...
Currently my attempt to reduce this latency is to lower the log retention policy to 7 days and compact the files into one each week to avoid losing data.
Is there another way, or does this seem like a good practice to you?
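A hedged sketch of the retention/compaction route described above, using real Delta table properties (`delta.logRetentionDuration`, `delta.checkpointInterval`); the table path is hypothetical, and OPTIMIZE requires a Delta Lake version that ships it.
```python
# Hedged sketch: tune log retention and checkpoint frequency, then compact small files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# delta.logRetentionDuration: how long old commit JSONs are kept before log cleanup
# delta.checkpointInterval: write a log checkpoint every N commits (default is 10)
spark.sql("""
    ALTER TABLE delta.`s3://my-bucket/source-table`
    SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 7 days',
        'delta.checkpointInterval'   = '10'
    )
""")

# Weekly compaction of small data files (if your Delta Lake version supports OPTIMIZE)
spark.sql("OPTIMIZE delta.`s3://my-bucket/source-table`")
```
On the write side, keeping a stable `checkpointLocation` on the streaming query is what lets a restart resume from its last processed offsets instead of re-reading the source table from the beginning.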