
Oliver Angelil

07/26/2023, 6:25 PM
I have an incremental table which I am appending to daily with:
df.write.mode("append").format("delta").save("/mytable")
There is a TimeStamp column (always increasing) from which year, month, and day columns are derived and used as the partition columns. Each partition holds one day of data and is around 150 MB. The issue is that, since this batch process has been running daily for many years, there are now 2000+ partitions, while Databricks recommends that a partition should be at least 1 GB. Would Ingestion Time Clustering or the new Liquid Clustering be suitable for this use case?
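For reference, a minimal sketch of that daily append, assuming the new day's rows are in df and the year/month/day partition columns are derived from TimeStamp roughly as follows (the derivation itself is an assumption, not code from the thread):

from pyspark.sql import functions as F

# Derive the partition columns from the event timestamp (assumed layout),
# then append the one-day batch into its year/month/day partition.
daily = (df
    .withColumn("year", F.year("TimeStamp"))
    .withColumn("month", F.month("TimeStamp"))
    .withColumn("day", F.dayofmonth("TimeStamp")))

(daily.write
    .mode("append")
    .format("delta")
    .partitionBy("year", "month", "day")   # one ~150 MB partition per day
    .save("/mytable"))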

Dominique Brezinski

07/26/2023, 7:12 PM
So your table is under 300 GB of data? It should be partitioned by month at most. Liquid Clustering may or may not be the right solution; it is too new for the community to make recommendations about. Is the data you are appending daily also events from that day? If so, ingestion time clustering is probably the right solution, especially because your table is relatively small.

Oliver Angelil

07/26/2023, 7:14 PM
@Dominique Brezinski the table is currently 354 GB and growing... Yes, the data being appended daily is the last 24 hours of data that have been written to a SQL DB.
If I partition by month, then each partition will be around 2 GB. However, since one day of data is being appended every day, how can I partition by month?

Dominique Brezinski

07/26/2023, 7:18 PM
I don't understand the question. You just append the data with the appropriate month value column that the table is partitioned on.
But yes, you should really try ingestion time clustering for your use case.
👍 1
Any scans of the full table (or of a large number of partitions) on your existing table are likely much slower than they need to be due to highly fragmented files.
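As a sketch of what Dominique describes, assuming the table were repartitioned by month: derive a single month value from TimeStamp and append each daily batch into whichever month partition it falls in. The column name event_month and the path /mytable_by_month are illustrative, not from the thread:

from pyspark.sql import functions as F

# Hypothetical monthly layout: one partition value per calendar month.
monthly = df.withColumn("event_month", F.date_format("TimeStamp", "yyyy-MM"))

(monthly.write
    .mode("append")
    .format("delta")
    .partitionBy("event_month")   # ~2 GB per month at the stated data volume
    .save("/mytable_by_month"))

A daily append in mid-July simply adds files to the existing event_month=2023-07 partition; the first append in August creates the 2023-08 partition.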

Oliver Angelil

07/26/2023, 7:20 PM
> I don't understand the question. You just append the data with the appropriate month value column that the table is partitioned on.
I think I get it. So newly appended data will simply be added into the already-existing partition for that month (except on the 1st of a month, when a new partition is created).
👍 1
> But yes, you should really try ingestion time clustering for your use case.
Well, the docs say that Ingestion Time Clustering is enabled by default on Databricks Runtime 11.2+, so I'm not sure I need to actively do anything.
> I don't understand the question. You just append the data with the appropriate month value column that the table is partitioned on.
What I find confusing is that this can be done without Ingestion Time Clustering... I don't even understand what Ingestion Time Clustering would be doing if I'm already partitioning by month (the docs are very vague).

Dominique Brezinski

07/26/2023, 7:42 PM
Yes, you have to create a new, un-partitioned table that you start appending to using runtime >= 11.2. Here is the key information from the Ingestion Time Clustering blog: "All unpartitioned tables will automatically benefit from ingestion time clustering when new data is ingested. We recommend customers to not partition tables under 1TB in size on date/timestamp columns and let ingestion time clustering automatically take effect."
Ingestion time clustering is essentially just ordering the data by ingestion time at write time, so the min/max stats on the timestamp column naturally fall into narrower bands per parquet file. When you execute queries with predicates on the timestamp column, file skipping happens based on the time range, and only the files whose min/max satisfy the constraints are read. That is done without having to explicitly include predicates on partitions.
Also, by using an un-partitioned table, optimized writes will create larger parquet files that will increase throughput in processing. I suspect you will see greatly improved query times in many cases.
🙌 1
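A sketch of what that suggested setup could look like, under the assumption of a new un-partitioned table written on Databricks Runtime 11.2+ (the path, the spark session variable, and the example date range are illustrative, not from the thread):

# New, un-partitioned Delta table: no partitionBy. Because the daily batches
# arrive roughly in TimeStamp order, each parquet file ends up with a narrow
# TimeStamp min/max range (this is what ingestion time clustering relies on).
(df.write
    .mode("append")
    .format("delta")
    .save("/mytable_unpartitioned"))

# A query with a predicate on TimeStamp can then skip any file whose
# min/max range does not overlap the filter - no partition predicate needed.
recent = (spark.read.format("delta")
    .load("/mytable_unpartitioned")
    .where("TimeStamp >= '2023-07-01' AND TimeStamp < '2023-08-01'"))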

Oliver Angelil

07/27/2023, 7:15 PM
@Dominique Brezinski do you think repartitioning would preserve Ingestion Time Clustering? Say I want to remove the existing partition columns, I can run e.g. df.repartition(100) - but would the ingestion times in the Delta .json log files no longer align with the data in each of the files? Asking because I already have 5 years of an incremental table stored in a Hive-style subdirectory structure (partition columns being Year/Month/Day)... wondering if I can go from that to using Ingestion Time Clustering.

Dominique Brezinski

07/27/2023, 7:39 PM
That is probably a question best answered by Databricks support.
👍 1