Martin04/27/2023, 4:33 PM
Immediately, we realized our mistake and we tried "roll-back" the operation by restoring a previous version of the Delta Table. In the
df.write .format('delta') .partitionBy(partition_col) .mode('append') .option('userMetadata', json.dumps(commit_info)) .option('mergeSchema', 'true') .option('overwriteSchema', 'true') .option("path", path) .saveAsTable(tablename)
there were JSONs for versions 745, 746, 747, 748, 749, 750 present at that time. We tried to restore version 749 (
). The operation failed, saying that only version(s) [750, 750] are available. Also, reading a version prior to 750 did not work with the same error message. We did not perform any vacuums in the meantime. Does anyone have an explanation what the reason for this might be? I tried to recreate the situation with dummy data, but was not able to. In my experiments, I was always able to restore a previous version - even after a schema change.
Gerhard Brueckl04/27/2023, 5:24 PM
Martin04/27/2023, 6:06 PM
version timestamp 754 2023-04-26T08:02:25Z 753 2023-04-26T07:55:14Z 752 2023-04-26T07:54:48Z 751 2023-04-25T16:35:57Z 750 2023-04-25T12:22:13Z 749 2023-04-25T09:14:36Z 748 2023-04-05T19:40:14Z 747 2023-03-30T09:32:30Z
are not listed when I run
. That means the default values are effective, correct? We did not delete any data files manually. Can I check if they do exist programatically or do I need to look into a version JSON and look up the files manually?
Gerhard Brueckl04/27/2023, 6:10 PM
VERSION AS OF 750` work?
Martin04/27/2023, 6:44 PM
SELECT * FROM table VERSION AS OF 750 LIMIT 10;
does not work We'd like to restore 749
SELECT * FROM table VERSION AS OF 749 LIMIT 10;
Gerhard Brueckl04/27/2023, 6:45 PM
Martin04/27/2023, 6:45 PM
Error: Cannot time travel Delta table to version 749. Available versions: [750, 754].
Gerhard Brueckl04/27/2023, 6:46 PM
Martin04/27/2023, 6:48 PM
Gerhard Brueckl04/27/2023, 6:49 PM
Martin04/27/2023, 6:50 PM
Gerhard Brueckl04/27/2023, 7:39 PM
Martin04/27/2023, 7:45 PM
Gerhard Brueckl04/28/2023, 6:47 AM
Martin04/28/2023, 9:30 AM
folder (for version 750). The checkpoint for 740 (and version JSONs 740 - 746) were deleted since they were older than default
= 30 days. However, this also indirectly renders version 747-749 unreadable since prior checkpoint and the versions in between are required to assemble the version. This makes total sense, but I'm asking myself now if the deletion of checkpoints and version JSON is working properly or if there is a design flaw.
Gerhard Brueckl04/28/2023, 9:45 AM
Scott Sandre (Delta Lake)04/28/2023, 5:27 PM
Due to log entry cleanup, instances can arise where you cannot time travel to a version that is less than the retention interval. Delta Lake requires all consecutive log entries since the previous checkpoint to time travel to a particular version. For example, with a table initially consisting of log entries for versions [0, 19] and a checkpoint at verison 10, if the log entry for version 0 is cleaned up, then you cannot time travel to versions [1, 9]. Increasing the table property delta.logRetentionDuration can help avoid these situations.