abhijeet_naib
04/27/2023, 2:06 PM
Martin
04/27/2023, 4:33 PM
(
    df.write
        .format('delta')
        .partitionBy(partition_col)
        .mode('append')
        .option('userMetadata', json.dumps(commit_info))
        .option('mergeSchema', 'true')
        .option('overwriteSchema', 'true')
        .option('path', path)
        .saveAsTable(tablename)
)
Immediately, we realized our mistake and tried to roll back the operation by restoring a previous version of the Delta table.
In the _delta_log directory, JSON files for versions 745, 746, 747, 748, 749, and 750 were present at the time.
We tried to restore version 749 (deltaTable.restoreToVersion(749)). The operation failed, saying that only version(s) [750, 750] are available. Reading a version prior to 750 also failed with the same error message.
We did not perform any vacuums in the meantime.
Does anyone have an explanation what the reason for this might be?
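Not an answer, but a way to narrow it down: the range Delta reports as restorable comes from what it can reconstruct out of _delta_log. Here is a minimal stand-alone sketch (plain Python against a fake log directory, no Spark needed; it assumes only the standard 20-digit zero-padded commit file names) of how the commit versions are discovered:

```python
import os
import re
import tempfile

def available_versions(delta_log_dir):
    """List commit versions present as zero-padded NNN...N.json files in _delta_log."""
    pattern = re.compile(r"^(\d{20})\.json$")
    return [int(m.group(1))
            for name in sorted(os.listdir(delta_log_dir))
            if (m := pattern.match(name))]

# Demo against a fake _delta_log holding the versions from the question.
demo_dir = tempfile.mkdtemp()
for v in (745, 746, 747, 748, 749, 750):
    open(os.path.join(demo_dir, f"{v:020d}.json"), "w").close()

print(available_versions(demo_dir))  # [745, 746, 747, 748, 749, 750]
```

This only checks file presence; it says nothing about whether each version's full state is reconstructable, which is what the restore actually needs, so if the JSONs really are there the next place to look is their contents.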
I tried to recreate the situation with dummy data, but was not able to. In my experiments, I was always able to restore a previous version, even after a schema change.
Shazeb Fahmi
04/27/2023, 5:40 PM
Slackbot
04/27/2023, 9:01 PM
Jordan Fox
04/28/2023, 4:10 AM
Sukumar Nataraj
04/28/2023, 4:12 AM
Minh-Duc NGUYEN
04/28/2023, 4:21 AM
Payra
04/28/2023, 5:07 AM
Rahul Sharma
04/28/2023, 10:59 AM
Rahul Sharma
04/28/2023, 11:14 AM
%%sql
CREATE TABLE nm.delta_test (
    gender STRING,
    salary INT
) USING DELTA
LOCATION 's3://nm-dev/nm/test';
Please let me know if it is possible to show the schema in Glue as well.
Antony Raj M
04/28/2023, 10:07 PM
Rajesh Tomar
04/29/2023, 7:54 AM
geo1/
  year1/
    month1/
      date1/
        hour1/
          min1
geo2/
geo3/
geo4/
In one of the applications (let's say geo4), I'm observing an issue: I'm seeing far too many consecutive jobs titled:
Delta: Compute snapshot for version: <snapshot version>.
There are usually 20-30 such snapshot computation jobs, each taking about a minute. This ends up making a single write take about 30 ± 10 minutes, while the actual processing and saving within the job takes ~60s. The reason there are multiple snapshot computation jobs is that while the geo4 application is computing the snapshot for version v1, the other applications write more versions (v2, v3, v4, ...). geo4 then has to catch up by computing snapshots for v2, v3, v4, during which time the other applications may write more new versions.
Interestingly, the snapshot computation used to take much less time.
I tried running a multi-cluster simulation where such snapshot computation jobs only take about a second.
I'm curious why the SnapshotState computation takes so much time. Has anyone seen something similar before? My suspicion is that the delta log is now too big for my table, with more than 330k entries.
Theo LEBRUN
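One possible angle on the slow snapshots above (a guess from the symptoms, not a confirmed diagnosis): snapshot reconstruction replays the commit JSONs written after the most recent checkpoint, so if checkpointing has fallen behind the write rate, every "Compute snapshot for version" job gets more expensive. A stand-alone sketch of that replay count, using a fake _delta_log layout and the standard _last_checkpoint pointer file:

```python
import json
import os
import tempfile

def commits_since_last_checkpoint(delta_log_dir):
    """How many commit JSONs a reader must replay on top of the newest checkpoint."""
    checkpoint_version = -1  # no checkpoint yet -> replay everything
    pointer = os.path.join(delta_log_dir, "_last_checkpoint")
    if os.path.exists(pointer):
        with open(pointer) as f:
            checkpoint_version = json.load(f)["version"]
    commits = [int(name[:20]) for name in os.listdir(delta_log_dir)
               if name.endswith(".json") and name[:20].isdigit()]
    return sum(1 for v in commits if v > checkpoint_version)

# Demo: checkpoint at v100, commits present up to v130 -> 30 deltas to replay.
d = tempfile.mkdtemp()
with open(os.path.join(d, "_last_checkpoint"), "w") as f:
    json.dump({"version": 100}, f)
for v in range(95, 131):
    open(os.path.join(d, f"{v:020d}.json"), "w").close()

print(commits_since_last_checkpoint(d))  # 30
```

In a real log the checkpoints themselves are version-numbered Parquet files, and the table property delta.checkpointInterval (10 by default in OSS Delta, as far as I know) controls how often they are written, so that count is worth checking against the actual _delta_log.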
04/30/2023, 11:55 PM
Vignesh Chandramohan
05/01/2023, 4:34 PM
Pip Mk
05/01/2023, 5:39 PM
I updated the table properties (minWriterVersion, minReaderVersion, columnMapping) to support the DROP COLUMN command. It worked: when I query the table with Athena, the dropped column no longer shows up. But it still appears in the Glue Catalog. If I try to run a Glue crawler on that table, it throws an error ("Internal Service Exception", no further information). I assume this is because Glue crawlers don't support the DROP COLUMN command, and that I need to revert the config changes.
So my question is: is there any other non-manual way of updating the Glue catalog if I want to use the `DROP COLUMN` command? Or perhaps some table property I can add in the Glue catalog so that it works?
Jordan Fox
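One scriptable (if not crawler-based) route worth sketching for the Glue question above, as an assumption rather than anything confirmed in this thread: edit the Glue schema directly through the Glue update_table API, removing the column from the stored column list. The database, table, and column names below are hypothetical, the boto3 calls are shown only as comments, and just the pure transformation runs:

```python
import copy

def drop_column(table_input, column_name):
    """Return a copy of a Glue TableInput with one column removed from the schema."""
    out = copy.deepcopy(table_input)
    cols = out["StorageDescriptor"]["Columns"]
    out["StorageDescriptor"]["Columns"] = [c for c in cols if c["Name"] != column_name]
    return out

# The surrounding boto3 calls would look roughly like this (not executed here):
#   glue = boto3.client("glue")
#   table = glue.get_table(DatabaseName="my_db", Name="my_table")["Table"]
#   table_input = {k: table[k] for k in ("Name", "StorageDescriptor", "PartitionKeys",
#                                        "TableType", "Parameters") if k in table}
#   glue.update_table(DatabaseName="my_db", TableInput=drop_column(table_input, "old_col"))

demo = {"Name": "t", "StorageDescriptor": {"Columns": [
    {"Name": "gender", "Type": "string"}, {"Name": "old_col", "Type": "int"}]}}
print(drop_column(demo, "old_col")["StorageDescriptor"]["Columns"])
# [{'Name': 'gender', 'Type': 'string'}]
```

The copy keeps the original TableInput untouched, so a failed update_table call leaves nothing half-modified on the client side.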
05/01/2023, 8:53 PM
Payra
05/01/2023, 10:20 PM
Rajesh Tomar
05/02/2023, 7:44 AM
GapyNi
05/02/2023, 7:44 AM
spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .option("mergeSchema", "true")
    .option("startingVersion", starting_version)
What could be the problem?
Thanks for the help,
Regards,
Gapy
Gautam Venugopal
05/02/2023, 4:09 PM
Jordan Fox
05/02/2023, 10:28 PM
Dorian Zheng
05/03/2023, 12:46 AM
Saurabh
05/03/2023, 3:50 AM
Jakub Filipek
05/03/2023, 5:33 AM
With delta-rs, when appending to a table it seems to go through every commit in history before writing (see thread for log). Is there a way to tell the update/append operation to start from a specific commit? Also, will this have a performance impact if there are potentially tens of thousands of commits (assuming no debug logs are printed)?
John
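I can't say what delta-rs actually does internally here, but conceptually an append only needs the table's latest version, and that lookup doesn't require walking the whole history: the _last_checkpoint pointer bounds the scan to the commits written after the newest checkpoint. A stand-alone sketch of that idea against a fake log layout (plain Python, no delta-rs):

```python
import json
import os
import tempfile

def latest_version(delta_log_dir):
    """Find the newest commit version without replaying the whole history:
    start from the _last_checkpoint pointer, then only look at later JSONs."""
    start = -1  # no checkpoint yet -> consider every commit
    pointer = os.path.join(delta_log_dir, "_last_checkpoint")
    if os.path.exists(pointer):
        with open(pointer) as f:
            start = json.load(f)["version"]
    newer = [int(n[:20]) for n in os.listdir(delta_log_dir)
             if n.endswith(".json") and n[:20].isdigit() and int(n[:20]) > start]
    return max(newer, default=start)

# Demo: checkpoint at v40000, a handful of commits after it.
d = tempfile.mkdtemp()
with open(os.path.join(d, "_last_checkpoint"), "w") as f:
    json.dump({"version": 40000}, f)
for v in (40001, 40002, 40003):
    open(os.path.join(d, f"{v:020d}.json"), "w").close()

print(latest_version(d))  # 40003
```

So with tens of thousands of commits, what should matter for the writer is the listing cost plus the commits since the last checkpoint, not the total history length, assuming checkpoints exist and are used.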
05/03/2023, 7:10 AM
Sukumar Nataraj
05/03/2023, 8:20 AM
Rajesh Tomar
05/03/2023, 8:39 AM
Patrik Ekman
05/04/2023, 7:49 AM
(
    dt_target.alias("target")
        .merge(source=df_stage.alias("updates"), condition=join_condition)
        .whenMatchedUpdate(condition="...", set={...})
        .whenNotMatchedInsert(values={...})
        .execute()
)

df_load = df_stage.write.format("delta").mode("append") ...
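For anyone skimming the two snippets above, the distinction in plain terms: merge upserts on a key, append just adds rows. A tiny pure-Python model of the two merge clauses (hypothetical id/salary columns; it ignores the extra whenMatchedUpdate condition):

```python
def merge_upsert(target, updates, key):
    """Plain-Python sketch of MERGE semantics:
    whenMatchedUpdate   -> overwrite the fields of the existing row,
    whenNotMatchedInsert -> append the new row.
    `target` and `updates` are lists of dicts; `key` is the join column."""
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in by_key:
            by_key[row[key]].update(row)   # whenMatchedUpdate
        else:
            by_key[row[key]] = dict(row)   # whenNotMatchedInsert
    return list(by_key.values())

target = [{"id": 1, "salary": 100}, {"id": 2, "salary": 200}]
updates = [{"id": 2, "salary": 250}, {"id": 3, "salary": 300}]
print(merge_upsert(target, updates, "id"))
# [{'id': 1, 'salary': 100}, {'id': 2, 'salary': 250}, {'id': 3, 'salary': 300}]
```

A plain append of the same updates would instead leave two rows with id 2, which is exactly why the staging write and the merge above behave differently.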
sharath
05/04/2023, 11:47 AM
Jordan Cuevas
05/04/2023, 2:39 PM