
Rahul Sharma

04/24/2023, 3:46 PM
Hi team, I have left the raw log and file retention durations at their defaults, but I am getting the error below. What would be the best solution? If I change the checkpoint location, the job will start dumping data into the refine layer again, so please give me an optimal way to resume my application from the point where it failed.

Gerhard Brueckl

04/25/2023, 6:41 AM
That's the very same question you asked about two months ago: https://delta-users.slack.com/archives/CJ70UCSHM/p1675833135458149. Nothing has changed since then.

Rahul Sharma

04/25/2023, 6:41 AM
Yes, I checked that message again, but I am not sure why Delta Lake gave me this error for only one job. The same error should appear for the other jobs as well, so why doesn't it?

Gerhard Brueckl

04/25/2023, 6:43 AM
You might have set up different retention policies for the source tables, or one of your streams did not run for a certain time and now fails to catch up because the files it needs have already been deleted.

Rahul Sharma

04/25/2023, 6:44 AM
But my streaming job is up and running.

Gerhard Brueckl

04/25/2023, 6:46 AM
Maybe your write operations were not successful, so the checkpoint never advanced. In any case, the message clearly says you are missing some versions between what has already been processed and what is currently available.
r

Rahul Sharma

04/25/2023, 6:48 AM
Can we have a call? We have been running in production for about four months and this is the only issue we have faced, and I also hit it earlier, so it would be better to sort it out on a call.
CC: @Denny Lee
@Gerhard Brueckl, if I change the readVersion, will it impact my data?

Gerhard Brueckl

04/26/2023, 6:28 AM
You mean specifying
.option("startingVersion", "5")
? Well, it will start reading the changes from there, but the individual changes in the missing versions are lost, because the originating files have already been deleted, as the error message says.
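For illustration, a minimal sketch of what that option looks like on a Delta streaming source (the spark session, the path and the version number are illustrative assumptions, not values from this thread):

# Hedged sketch: read a Delta table as a stream starting from an explicit version.
# "/data/raw_table" and version 5 are placeholders.
df = (spark.readStream
        .format("delta")
        .option("startingVersion", "5")   # first table version to read changes from
        .load("/data/raw_table"))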

Rahul Sharma

04/26/2023, 6:29 AM
Then what should I do?

Gerhard Brueckl

04/26/2023, 6:30 AM
I would probably delete the checkpoint and start the stream from the beginning, so you get all of the current data plus future changes.
Historic changes are not reproducible anymore, though.
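A rough sketch of that approach, assuming hypothetical local paths (on S3/HDFS the corresponding client would be used to delete the checkpoint). Removing the checkpoint makes the stream reprocess the source table from its current snapshot:

# Hedged sketch: drop the old checkpoint so the stream restarts from the
# current snapshot of the source table. All paths are placeholders.
import shutil
shutil.rmtree("/checkpoints/refine_job")   # local path only; use the matching
                                           # client for S3/HDFS checkpoints

query = (spark.readStream
           .format("delta")
           .load("/data/raw_table")
           .writeStream
           .format("delta")
           .option("checkpointLocation", "/checkpoints/refine_job")
           .outputMode("append")
           .start("/data/refine_table"))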

Rahul Sharma

04/26/2023, 6:31 AM
I have one doubt: why is the Delta data file retention duration 7 days when the log files are kept for 30 days? If the data files are deleted, what is the use of the log files?

Gerhard Brueckl

04/26/2023, 6:32 AM
To see what operations happened in the last 30 days.

Rahul Sharma

04/26/2023, 6:33 AM
But our data files are deleted within 7 days, so why are the logs required for 30 days?

Gerhard Brueckl

04/26/2023, 6:34 AM
They are not required, but keeping 30 days of logs consumes far less storage than keeping 30 days of data, and you might be interested in the operations happening on the Delta table for longer than that.
If you are not happy with the defaults, you can override them using table properties.
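For example, assuming a table named raw_table (the name and the intervals are illustrative), the retention defaults can be overridden with table properties like these:

# Hedged sketch: override the retention defaults via Delta table properties.
spark.sql("""
    ALTER TABLE raw_table SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")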

Rahul Sharma

04/26/2023, 6:34 AM
But when I did that, I faced the issue above.

Gerhard Brueckl

04/26/2023, 6:37 AM
You faced the above issue because the data was deleted before it was processed by the stream.

Rahul Sharma

04/26/2023, 6:40 AM
But my streaming job was running continuously.

Gerhard Brueckl

04/26/2023, 6:44 AM
Well, somehow your stream failed to read data for 7 days (if you used the default settings). This can also happen if your write operation fails and therefore does not update your checkpoint; if that goes on for 7 days, you end up in the situation you described.

Rahul Sharma

04/26/2023, 6:46 AM
Then what would be the optimal solution?

Gerhard Brueckl

04/26/2023, 6:50 AM
First you need to find out what actually went wrong. Did your stream run successfully for the last 7 days? Did it successfully write to the output during those 7 days?

Rahul Sharma

04/26/2023, 6:50 AM
Yes, because I was monitoring it daily.

Gerhard Brueckl

04/26/2023, 6:54 AM
Did you manually check the checkpoint file? It's just JSON.
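A small sketch of how one might inspect the most recent offsets file, assuming the checkpoint sits on a filesystem that can be listed locally (for S3/HDFS the equivalent client calls apply); the path is a placeholder:

# Hedged sketch: read the newest offsets file written by Structured Streaming.
# "/checkpoints/refine_job" is an illustrative placeholder path.
import os, json

offsets_dir = "/checkpoints/refine_job/offsets"
batches = sorted(int(f) for f in os.listdir(offsets_dir) if f.isdigit())
with open(os.path.join(offsets_dir, str(batches[-1]))) as fh:
    lines = fh.read().splitlines()

print(lines[0])               # version header, e.g. "v1"
print(json.loads(lines[-1]))  # per-source offset, incl. reservoirVersion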

Rahul Sharma

04/26/2023, 6:54 AM
sure

Gerhard Brueckl

04/26/2023, 6:54 AM
And what does it say? When was it last updated?

Rahul Sharma

04/26/2023, 6:56 AM
Screenshot 2023-04-26 at 12.25.02 PM.png (checking the checkpoint file, commit checkpointing)
The latest data I have in the checkpoint is from 17 April.

Gerhard Brueckl

04/26/2023, 7:00 AM
Well, so the last checkpoint was written more than 7 days ago, meaning your stream has not successfully processed data since 17 April.

Rahul Sharma

04/26/2023, 7:00 AM
v1
{"batchWatermarkMs":0,"batchTimestampMs":1681755960109,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.join.stateFormatVersion":"2","spark.sql.streaming.stateStore.compression.codec":"lz4","spark.sql.streaming.stateStore.rocksdb.formatVersion":"5","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}
{"sourceVersion":1,"reservoirId":"934278ed-eb49-4019-8504-cf360922fd5c","reservoirVersion":179,"index":-1,"isStartingVersion":false}

Gerhard Brueckl

04/26/2023, 7:04 AM
You can also check the history of the target table; I would assume you will not see any new data in there either.
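For instance, something like this shows the most recent operations and their timestamps (the table name is a placeholder):

# Hedged sketch: check when the target table was last written to.
# "refine_table" is an illustrative placeholder name.
spark.sql("DESCRIBE HISTORY refine_table") \
     .select("version", "timestamp", "operation") \
     .show(truncate=False)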

Rahul Sharma

04/26/2023, 7:06 AM
Yes, you are right. The target table's latest date is 17 April.

Gerhard Brueckl

04/26/2023, 7:08 AM
The target table, you mean?

Rahul Sharma

04/26/2023, 7:08 AM
Yes.
So what would be the best approach to monitor these jobs? Is there any way? I have also faced an issue on EMR where the streaming job sometimes gets stuck on its own.

Gerhard Brueckl

04/26/2023, 7:12 AM
Sorry, I don't know your setup, architecture, code, or how your jobs are triggered, so it's impossible to say.

Rahul Sharma

04/26/2023, 7:13 AM
Can you please share a sample architecture that I can get some ideas from?

Gerhard Brueckl

04/26/2023, 7:16 AM
Just check the code/tool that triggered the streaming job; you should hopefully find an error message there.

Rahul Sharma

04/26/2023, 7:17 AM
OK, sure, thanks.
I have a couple of questions; can I ask them here?
1. If I set a custom retention duration, do I have to pass the retention to the vacuum command, or will it delete automatically according to the retention duration? For example, do I need deltaTable_raw.vacuum(1 hour), or will deltaTable_raw.vacuum() also work when I set the retention duration to 1 hour?
2. If I set autoCompaction and autoOptimization, will Delta read the files from the directory and then do the compaction?

Gerhard Brueckl

04/26/2023, 7:44 AM
Regarding 1: I think the value specified directly in
vacuum(x hours)
would override the value from the table property, so I would advise setting the table property correctly and simply using
vacuum()
without parameters.
Regarding 2: that happens when writing new files; autoCompaction will merge existing files and newly written files into a common file.
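A short sketch of both points, assuming the retention table property has already been set as shown earlier; the table name is a placeholder, and the autoOptimize properties may not be available on every Delta Lake version/platform:

# Hedged sketch: vacuum() with no arguments honours the table's
# delta.deletedFileRetentionDuration property; an explicit vacuum(1)
# would pass 1 hour directly and take precedence over it.
from delta.tables import DeltaTable

deltaTable_raw = DeltaTable.forName(spark, "raw_table")   # placeholder name
deltaTable_raw.vacuum()

# Auto compaction / optimized writes are enabled via table properties
# (availability depends on the Delta Lake version/platform):
spark.sql("""
    ALTER TABLE raw_table SET TBLPROPERTIES (
        'delta.autoOptimize.autoCompact' = 'true',
        'delta.autoOptimize.optimizeWrite' = 'true'
    )
""")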

Rahul Sharma

04/26/2023, 7:46 AM
Will autoCompaction and autoOptimization have any impact on the merge/refine streaming job?

Gerhard Brueckl

04/26/2023, 8:18 AM
Hard to say without knowing the data and the update patterns; I would suggest testing it yourself.