bharat chaudhury
06/26/2023, 2:15 PM
bharat chaudhury
06/26/2023, 4:24 PM
Tabrez alam
06/26/2023, 4:36 PM
Kyu
06/27/2023, 6:48 AM
KNN
06/27/2023, 7:24 AM
Thanhtan Le
06/27/2023, 9:06 AM
Roel Knitel
06/27/2023, 9:43 AM
Marcelo Trevisani
06/27/2023, 10:31 AM
Michael Nacey
06/27/2023, 7:44 PM
user class threw exception: org.apache.spark.sql.delta.DeltaIllegalStateException: State of the checkpoint doesn't match that of the snapshot.
I found the source code that throws this error, but it doesn't seem possible for it to happen. Clearly it is possible, and the numberOfFiles check is in there for a reason. Could concurrent inserts cause this? Thanks.
Andres Urrego
06/27/2023, 9:04 PM
Simon Thelin
06/28/2023, 9:46 AM
which I do not own or created
which is under a strange filepath like:
data///delta_table/
Currently Spark does not recognise this. Is there any way to work around this without having to change the S3 file path?
Zhenlong Huang
06/28/2023, 10:22 AM
Szymon Sikora
06/28/2023, 12:07 PM
kv = {c: f"source.{c}" for c in df2.columns}
target_df.alias("target").merge(
    df2.alias("source"), "source.Name = target.Name"
).whenNotMatchedInsert(
    values={
        "uuid": "uuid()",
        **kv
    }
).whenNotMatchedBySourceDelete().execute()
Is there any way to do that, or is such an action not supported?
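For later readers, a self-contained sketch of the pattern in the snippet above, assuming delta-spark 2.3+ (for whenNotMatchedBySourceDelete), a SparkSession in spark, and a target table whose schema already contains the uuid column; the /tmp/people path and the sample batch are made up for illustration. In the Python API, string values passed to whenNotMatchedInsert are parsed as SQL expressions, so uuid() should produce a fresh value per inserted row:

from delta.tables import DeltaTable

# Hypothetical target table and incoming batch, for illustration only.
target_df = DeltaTable.forPath(spark, "/tmp/people")
df2 = spark.createDataFrame([("Alice", 30)], ["Name", "Age"])

kv = {c: f"source.{c}" for c in df2.columns}

(
    target_df.alias("target")
    .merge(df2.alias("source"), "source.Name = target.Name")
    # "uuid()" is a SQL expression string, evaluated for each inserted row.
    .whenNotMatchedInsert(values={"uuid": "uuid()", **kv})
    # Target rows with no match in the source are deleted.
    .whenNotMatchedBySourceDelete()
    .execute()
)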
orsagiv
06/28/2023, 1:53 PM
Javier Carbajo
06/29/2023, 7:58 AM
The table is partitioned by event_name and date. So the more time I’m ingesting data, the more files and size the table has.
Every night I’m running a compaction + vacuum process in order to maintain the table in the best possible way. Compaction (optimize) is perfect because I’m able to filter the partitions I want to optimize, but the vacuum command doesn’t allow me to filter the partitions I want to vacuum and vacuums the whole table each time. As I only append data to the table, if this behaviour continues, the vacuum will spend too much time running. For the moment it’s not a problem but I want to anticipate it.
Any idea on how I can vacuum/clean the table/old files, filtering the partitions I want to vacuum? Or any workaround to avoid too-large vacuums?
Thanks a lot for your help! Have a nice Thursday! 🙂
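For later readers, a minimal sketch of the nightly maintenance described above, assuming a SparkSession in spark with the Delta SQL extensions and a table named events partitioned by event_name and date (table name, filter values and retention are illustrative). OPTIMIZE accepts a WHERE clause on partition columns, whereas VACUUM only takes an optional RETAIN interval and always evaluates the whole table:

# Compact only the partitions touched recently; values are hypothetical.
spark.sql("""
    OPTIMIZE events
    WHERE date = '2023-06-28'
""")

# VACUUM cannot be limited to partitions: it scans the whole table and
# removes files older than the retention threshold (default 7 days).
spark.sql("VACUUM events RETAIN 168 HOURS")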
Alfonso
06/29/2023, 10:36 AM
Shay Palachy-Affek
06/29/2023, 12:25 PM
delta.enableChangeDataFeed = true enabled at the silver level, retaining change history.
Now, obviously we will have per-column (for the chosen columns) change history from this point onwards.
However, for select, very important, columns we do have a way of reconstructing all past value changes from raw data kept at some third data store location (some object storage).
We would like to utilize this to get all past changes into said DeltaTables as if the tables were created a few years ago (instead of right now) and were fed all these changes over time, so we can - from this point forwards - get both old and new changes from a single source and using one API.
Is this possible?
Can we directly inject history into the CDF (Change Data Feed) of a table? The CDF having a schema with a _commit_timestamp field points to this possibility (if we can just write rows into the CDF directly).
Alternatively, can we write changes to the table while overriding the timestamp of the change instead of it being automatically now()? If we can feed the change history from the past up to the present, we will end up with the same result.
Thank you for reading this,
Shay
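For context on the schema mentioned above, a minimal sketch of how the change feed of a CDF-enabled table is usually read, assuming a SparkSession in spark (the table name silver.events is made up). _commit_timestamp, together with _change_type and _commit_version, is added by the reader and comes from the commits recorded in the table's transaction log:

# Read the Change Data Feed from the first version onward.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("silver.events")
)

# Metadata columns returned alongside the table's data columns.
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()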
orsagiv
06/29/2023, 12:27 PM
By default Delta Lake on Databricks collects statistics on the first 32 columns defined in your table schema
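The quote above refers to the data-skipping statistics limit, which is controlled by the delta.dataSkippingNumIndexedCols table property; a minimal sketch of changing it, assuming a SparkSession in spark (the table name and value are illustrative):

# Collect file statistics on the first 40 columns instead of the default 32;
# "events" is a hypothetical table name.
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40')
""")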
Andy Grove
06/29/2023, 6:56 PM
"operationMetrics": {
    "numDeletionVectorsAdded": "1",
    "numDeletionVectorsRemoved": "0",
I cannot seem to find any references to numDeletionVectorsAdded in the source code, so I am wondering where this is defined?
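For anyone tracing the same metrics: operationMetrics is part of the commitInfo action that each operation writes to the transaction log, and it can be inspected through the table history without opening the log files. A minimal sketch, assuming a SparkSession in spark (the table path is made up):

from delta.tables import DeltaTable

# Hypothetical table path, for illustration only.
dt = DeltaTable.forPath(spark, "/tmp/my_table")

# One row per commit; operationMetrics is a map carrying entries such as
# numDeletionVectorsAdded for DELETE commits on tables with deletion vectors.
dt.history().select("version", "operation", "operationMetrics").show(truncate=False)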
Andy Grove
06/29/2023, 7:00 PM
DeleteCommandMetrics
Andy Grove
06/29/2023, 7:30 PM
sudo
06/30/2023, 1:58 PM
Hemant Kumar
06/30/2023, 3:03 PM
Chetan Joshi
06/30/2023, 8:14 PM
Hemant Kumar
07/01/2023, 2:06 PM
Abolfazl karimian
07/02/2023, 8:57 AM
First:
delta_table = DeltaTable.forPath(session, path)
delta_table.logRetentionDuration = "interval 10 days"
delta_table.delete(col("date") < "specified_day")
session.sql("vacuum delta.path RETAIN 48 HOURS")
session.sql("vacuum delta.path RETAIN 48 HOURS")
Second:
delta_table = DeltaTable.forPath(session, path)
delta_table.logRetentionDuration = "interval 10 days"
delta.deletedFileRetentionDuration = "interval 2 days"
delta_table.delete(col("date") < "specified_day")
session.sql("vacuum delta.path")
session.sql("vacuum delta.path")
Are these 2 ways different?!
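For later readers: these retention windows are table properties, so the second variant is usually expressed by setting them with ALTER TABLE ... SET TBLPROPERTIES and then running VACUUM. The sketch below reuses the session and path names from the question and the intervals shown above. VACUUM without RETAIN honours delta.deletedFileRetentionDuration, while an explicit RETAIN 48 HOURS overrides it and, being below the default 7-day safety threshold, also requires spark.databricks.delta.retentionDurationCheck.enabled to be set to false:

from delta.tables import DeltaTable
from pyspark.sql.functions import col

delta_table = DeltaTable.forPath(session, path)

# Retention is configured on the table itself, not on the Python object.
session.sql(f"""
    ALTER TABLE delta.`{path}` SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 10 days',
        'delta.deletedFileRetentionDuration' = 'interval 2 days'
    )
""")

delta_table.delete(col("date") < "specified_day")

# Without RETAIN, VACUUM keeps files for delta.deletedFileRetentionDuration
# (2 days here) and removes anything older.
session.sql(f"VACUUM delta.`{path}`")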
Ahmad Dorri
07/02/2023, 5:24 PM
Vishal Kadam
07/02/2023, 5:37 PM
Sadiq Kavungal
07/03/2023, 8:14 AM
Jacek
07/03/2023, 10:39 AM