Oliver Angelil
07/26/2023, 6:25 PM
df.write.mode("append").format("delta").save("/mytable")
There is a TimeStamp column (always increasing) from which Year, Month, and Day columns are created and used as the partition columns. Each partition holds one day of data and is around 150 MB.
The issue is that, because this batch process has been running daily for many years, there are now 2000+ partitions, whereas Databricks recommends that each partition be at least 1 GB. Would Ingestion Time Clustering or the new Liquid Clustering be suitable for this use case?
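For reference, a minimal sketch of the write pattern described above (PySpark). The DataFrame name daily_df is hypothetical; the Year/Month/Day columns are assumed to be derived from TimeStamp, and the Delta table at /mytable is assumed to have been created with those partition columns:

from pyspark.sql import functions as F

# daily_df is a hypothetical DataFrame holding one day's batch of records.
df = (
    daily_df
    .withColumn("Year", F.year("TimeStamp"))
    .withColumn("Month", F.month("TimeStamp"))
    .withColumn("Day", F.dayofmonth("TimeStamp"))
)

# Daily append; the table is assumed to already be partitioned by
# Year/Month/Day, so each day (~150 MB) lands in its own partition directory.
df.write.mode("append").format("delta").save("/mytable")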

Dominique Brezinski
07/26/2023, 7:12 PM

Oliver Angelil
07/26/2023, 7:14 PM

Dominique Brezinski
07/26/2023, 7:18 PM

Oliver Angelil
07/26/2023, 7:20 PM
> I don't understand the question. You just append the data with the appropriate month value column that the table is partitioned on.
I think I get it. So newly appended data will simply be added to the already existing partition for that month (except on the 1st of a month, when a new partition and its first parquet file would be created).

> But yes, you should really try ingestion time clustering for your use case.
Well, the docs say that Ingestion Time Clustering is enabled by default on runtime 11.2, so I'm not sure I need to actively do anything.

> I don't understand the question. You just append the data with the appropriate month value column that the table is partitioned on.
What I find confusing is that this can be done without Ingestion Time Clustering... I don't even understand what Ingestion Time Clustering would be doing if I'm already partitioning by month... (the docs are very vague).
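To make the contrast concrete, a sketch of what the Ingestion Time Clustering route could look like, assuming Databricks Runtime 11.2+ and a hypothetical unpartitioned table at /mytable_itc (df and spark are the daily batch and active session from the sketch above). No Year/Month/Day partition columns are needed; the default write keeps each batch clustered by ingestion time, and per-file min/max statistics on TimeStamp still allow file skipping:

# Append to an unpartitioned Delta table; on runtime 11.2+ Ingestion Time
# Clustering applies by default, so each daily batch stays clustered on disk.
df.write.mode("append").format("delta").save("/mytable_itc")

# Time-range queries can still skip files via per-file TimeStamp statistics.
july = (
    spark.read.format("delta").load("/mytable_itc")
    .filter("TimeStamp >= '2023-07-01' AND TimeStamp < '2023-08-01'")
)
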
Dominique Brezinski
07/26/2023, 7:42 PMOliver Angelil

Oliver Angelil
07/27/2023, 7:15 PM
df.repartition(100) - but would the ingestion times in the Delta .json log files then no longer align with the data in each of the files? Asking because I already have 5 years of an incremental table stored in a hive-style subdirectory structure (partition columns being Year/Month/Day)... wondering if I can go from that to using Ingestion Time Clustering.
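A sketch of one way such a migration could be approached, under assumptions not stated in the thread: the history is rewritten into a new, unpartitioned table at a hypothetical /mytable_itc path, one year at a time in chronological order, so that the order of commits in the _delta_log (and hence the effective ingestion order) still follows TimeStamp:

from pyspark.sql import functions as F

# Existing table, hive-style partitioned by Year/Month/Day.
old = spark.read.format("delta").load("/mytable")

# Append the old data in chronological chunks, oldest first, so that file
# creation order in the new unpartitioned table tracks the TimeStamp column.
for yr in range(2018, 2024):   # assumed 5-year history
    (
        old.filter(F.col("Year") == yr)
           .sort("TimeStamp")
           .write
           .mode("append")
           .format("delta")
           .save("/mytable_itc")
    )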

Dominique Brezinski
07/27/2023, 7:39 PM