Daniel Bariudin
01/15/2023, 12:38 PM
Is there a way to set TBLPROPERTIES as part of the write itself? Something like:
df.write.format("delta") \
    .mode("overwrite") \
    .option("TBLPROPERTIES", "key1=value1, key2=value2") \   <-- HERE
    .save("./")
I know there is a way to update TBLPROPERTIES with the following code:
alter_table_query = f"ALTER TABLE delta.`hdfs:///{table_path}` SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 1 days')"
spark.sql(alter_table_query)
but I want to set this as part of the streaming write.

Lennart Skogmo
01/15/2023, 2:02 PM
df.write.format("delta") \
    .option("delta.autoOptimize.optimizeWrite", "true") \
    .option("delta.autoOptimize.autoCompact", "true") \
    .saveAsTable(table)
Daniel Bariudin
01/15/2023, 3:44 PM
Does this also work for delta.logRetentionDuration?

Omkar
01/17/2023, 8:28 AM

Daniel Bariudin
01/18/2023, 9:24 AM
I tried using .option(). When I try to initialize the properties like this:
df.write.format("delta") \
    .option("delta.autoOptimize.optimizeWrite", "true") \   <--- This way
    .saveAsTable(table)
the TBLPROPERTIES don't change.

Omkar
01/18/2023, 9:44 AM
delta.autoOptimize.optimizeWrite is available in Databricks Delta and not available in Delta Lake (open source). You'll have to check which one you're using and plan accordingly.
For Delta Lake table properties (delta.logRetentionDuration, for example), you can try setting them in your Spark Session config as follows:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("...") \
.master("...") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.config("spark.databricks.delta.properties.defaults.logRetentionDuration", "interval 7 days") \
.getOrCreate()
The benefit of setting these properties in the Spark Session config is that you set them once and they apply to all Spark DataFrames that use that Spark Session. This should also work with streaming writes, since the properties are set at the Spark Session level.
For more info and other ways of setting these properties, you can refer to this section: https://docs.delta.io/latest/delta-batch.html#table-properties
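A minimal sketch of the approach Omkar describes, assuming open-source Delta Lake and a hypothetical table path /tmp/events_delta; the session-level default is only picked up by tables created after the config is set:
from pyspark.sql import SparkSession

# Session-level default: applied to Delta tables created from this session onwards.
spark = SparkSession.builder \
    .appName("tblproperties-sketch") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.databricks.delta.properties.defaults.logRetentionDuration", "interval 7 days") \
    .getOrCreate()

# Create a brand-new Delta table at the hypothetical path.
spark.range(10).write.format("delta").mode("overwrite").save("/tmp/events_delta")

# The property should now appear, because the table was created after the default was set.
spark.sql("SHOW TBLPROPERTIES delta.`/tmp/events_delta`").show(truncate=False)
The same should hold for a streaming write that creates the table, since the default lives at the Spark Session level.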
Daniel Bariudin
01/18/2023, 10:25 AM
When I run SHOW TBLPROPERTIES I don't see the property ☹️
More details of what I have done:
1. Initialize the Spark session the same way you showed above
2. Read an existing Delta table
3. Write the data back to the same table
4. Run SHOW TBLPROPERTIES to check for the property - it isn't there
Am I missing something? Do you have any other ideas?
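For reference, a sketch of those four steps, assuming the spark session from the sketch above and a hypothetical pre-existing Delta table at /tmp/existing_delta; this reproduces what Daniel sees, because the session-level default only applies to tables created after the config is set:
# Assumes the `spark` session configured with the defaults.logRetentionDuration config above.

# 2. Read an existing Delta table (hypothetical path).
df = spark.read.format("delta").load("/tmp/existing_delta")

# 3. Write the data back to the same table.
df.write.format("delta").mode("overwrite").save("/tmp/existing_delta")

# 4. Only delta.minReaderVersion / delta.minWriterVersion show up, because
#    spark.databricks.delta.properties.defaults.* settings apply only when a table is created.
spark.sql("SHOW TBLPROPERTIES delta.`/tmp/existing_delta`").show(truncate=False)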
Omkar
01/18/2023, 10:29 AM
Could you paste the output of SHOW TBLPROPERTIES here? Let's check it out.

Daniel Bariudin
01/18/2023, 10:39 AM
spark = SparkSession.builder.appName("test")\
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")\
.config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog")\
.config("spark.databricks.delta.properties.defaults.logRetentionDuration", "interval 7 days")\
.getOrCreate()
and after writing to the Delta table the output is:
+----------------------+-----+
|key |value|
+----------------------+-----+
|delta.minReaderVersion|1 |
|delta.minWriterVersion|2 |
+----------------------+-----+
Omkar
01/18/2023, 10:45 AM
• Run ALTER TABLE tablename SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days')
• Confirm the property is set with SHOW TBLPROPERTIES tablename
• Then you can run your streaming writes on the tables.
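A sketch of that sequence, assuming a hypothetical table named events registered in the metastore and hypothetical source/checkpoint paths:
# 1. Set the property on the existing table (run this while no other writes are in flight).
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days')")

# 2. Confirm the property is set.
spark.sql("SHOW TBLPROPERTIES events").show(truncate=False)

# 3. Start the streaming write against the same table.
query = (
    spark.readStream.format("delta").load("/tmp/events_source")
        .writeStream.format("delta")
        .option("checkpointLocation", "/tmp/checkpoints/events")
        .toTable("events")
)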
Daniel Bariudin
01/18/2023, 10:57 AM
So just to confirm my assumption: new tables created from that Spark Session will get the properties I set in the session config, and I can still change them later?

Omkar
01/18/2023, 11:03 AM
Note:
• Modifying a Delta table property is a write operation that will conflict with other concurrent write operations, causing them to fail. We recommend that you modify a table property only when there are no concurrent write operations on the table.
Yes, your assumption is correct: the new tables created will have the table properties that you set in the Spark Session config. You can change them at any time in the future with an
ALTER TABLE tablename SET TBLPROPERTIES (...)
query - only do this when no other write operation is being performed on that particular Delta table.
Hope this answers all your questions!
Daniel Bariudin
01/18/2023, 11:05 AM