Dmitry Volodin

01/17/2023, 1:25 PM
Hi, all. Quick question: as far as I can tell, there's no way to create hierarchical partitioning with Delta right now. Correct? Maybe someone knows a workaround or a replacement. Here's an example of how it is done in some other technologies: `PARTITION BY trunc(some_ts, 'HH')` defines small partitions, but then you can add a `GROUP BY expr` clause, where `expr` defines a partition group. All partitions of one group can be stored in one file. This is typically used to build datasets with small partitions for fresh data and bigger files for relatively "cold" data.
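To make the pattern concrete, here is an illustrative pseudo-SQL sketch of the kind of DDL being described; the GROUP BY partition-grouping clause is hypothetical and is not valid Delta Lake or Spark SQL:

-- Hypothetical pseudo-SQL only; Delta Lake does not support this syntax.
-- It illustrates the partition-grouping pattern described above.
CREATE TABLE events (
  some_ts TIMESTAMP,
  payload STRING
)
PARTITION BY trunc(some_ts, 'HH')  -- small hourly partitions for fresh data
GROUP BY trunc(some_ts, 'MM')      -- hypothetical: cold hourly partitions of
                                   -- the same month stored in one file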

Gerhard Brueckl

01/17/2023, 1:31 PM
You would probably use generated columns for this, which can also be used for partitioning: https://docs.databricks.com/delta/generated-columns.html
CREATE TABLE events(
  eventId BIGINT,
  data STRING,
  eventType STRING,
  eventTime TIMESTAMP,
  year INT GENERATED ALWAYS AS (YEAR(eventTime)),
  month INT GENERATED ALWAYS AS (MONTH(eventTime)),
  day INT GENERATED ALWAYS AS (DAY(eventTime))
)
USING DELTA  -- required on OSS Spark; implicit on Databricks
PARTITIONED BY (eventType, year, month, day)
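A nice property of this approach, per the Databricks docs linked above: for supported expressions such as YEAR/MONTH/DAY, Delta can derive the matching partition filters from a predicate on eventTime alone, so a query like the following sketch should still prune partitions:

-- Filter only on eventTime; Delta derives the year/month/day partition
-- filters from the generated-column definitions.
SELECT * FROM events
WHERE eventTime >= '2023-01-01 00:00:00'
  AND eventTime <  '2023-01-02 00:00:00'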

Dmitry Volodin

01/17/2023, 1:42 PM
@Gerhard Brueckl but GENERATED works once, on write. That means it's static, or I need to regenerate the value periodically. Actually, this should work even without generated attributes, but I would need to re-run the calculation of the `expr` attribute that defines the partition. I was thinking of a more elegant way, where it is accounted for automatically during VACUUM/OPTIMIZE, so unchanged partitions would not be rebuilt.
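A minimal sketch of what that periodic recalculation could look like, assuming a plain (non-generated, since generated columns cannot be updated directly) partition column; the names p_bucket and ts and the 7-day threshold are assumptions. Updating a partition column makes Delta rewrite the affected files, which is the point here:

-- Hypothetical maintenance job: coarsen the partition bucket for cold rows.
UPDATE events
SET p_bucket = date_format(ts, 'yyyy-MM-dd')        -- daily bucket for cold data
WHERE ts < current_timestamp() - INTERVAL 7 DAYS
  AND p_bucket <> date_format(ts, 'yyyy-MM-dd')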

Gerhard Brueckl

01/17/2023, 1:55 PM
I don't understand why you would want to recalculate your partition value, or what the purpose would be. Usually partition columns should not change over time; otherwise it does not make much sense to partition the data, no?

Dmitry Volodin

01/17/2023, 2:44 PM
@Gerhard Brueckl I don't want the values to change, but I want the partitioning mechanism to allow "old" and small partitions to be united into single files. I agree 100% that partition values should not change over time, and in other technologies they usually don't; instead, a mechanism such as partition grouping is introduced. Here's an example. Assume I have a timestamp attribute `ts`. `PARTITION BY trunc(ts, 'HH')` will create hourly partitions. They will be small enough to prevent large DML overhead if I expect the data to change. Such a design has a drawback: if I store everything with hourly partitions, it will no longer be efficient for analytical load, and storage expenses are also expected to be greater due to potentially lower compression efficiency. So assume I have these partitions:
2023-01-01 09:00:00
2023-01-01 10:00:00
2022-12-01 09:00:00
2022-12-01 10:00:00
I expect the December partitions to be stale, and therefore I would like vacuum/optimize runs to store them in bigger chunks, potentially uniting 2022-12-01 09:00:00 and 2022-12-01 10:00:00 into one file. I hope that explains my intentions. To sum up: there is no such mechanism so far, and the only way is to conditionally update partitioning attributes, which definitely does NOT look like best practice.
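For what it's worth, OPTIMIZE with a partition predicate can compact the small files inside each stale partition, but it never merges files across partition boundaries, so it doesn't achieve the grouping described above (p_hh is an assumed partition column name):

-- Compacts small files within each matching partition; the two December
-- hourly partitions remain separate and are not united into one file.
OPTIMIZE events WHERE p_hh < '2023-01-01 00:00:00'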

Gerhard Brueckl

01/17/2023, 3:07 PM
OK, got your point, and I think it makes a lot of sense, but I'm not sure how we could do this in Delta Lake. I could think of introducing an artificial hour -1: once the data is stale, you take everything and move it to this artificial partition. In addition, you would need a second column holding the actual hour value; then in all queries you would need to use `p_hh = 2 OR (p_hh = -1 AND hh = 2)`. Just an idea that could work, not sure if it would be any faster though. Delta Lake is usually very efficient in partition pruning and file pruning.
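A rough sketch of that idea in SQL, with assumed names throughout: p_hh as the partition column, hh holding the real hour for relocated rows, and event_ts plus the 30-day staleness threshold as illustrative choices:

-- Periodically move stale rows into the artificial partition -1,
-- preserving the real hour in hh (this rewrites the affected files):
UPDATE events
SET hh = p_hh, p_hh = -1
WHERE p_hh <> -1
  AND event_ts < current_timestamp() - INTERVAL 30 DAYS

-- Every query must then check both the live and the artificial partition:
SELECT * FROM events
WHERE p_hh = 2 OR (p_hh = -1 AND hh = 2)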