Dmitry Volodin
01/17/2023, 1:25 PM

PARTITION BY trunc(some_ts, 'HH')

defines small partitions, but then you could additionally be allowed to GROUP BY some expr that defines a partition group. All partitions of one group could be stored in one file. Basically, such a mechanism is usually used to create datasets with smaller partitions for fresh data and bigger files for relatively "cold" data.
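A purely hypothetical sketch of what that syntax might look like; no such PARTITION BY ... GROUP BY clause exists in Delta Lake today, and all names here are invented:

-- Hypothetical DDL: physical partitions are hourly, but partitions
-- sharing the same day form one group whose files could be stored
-- (or later compacted) together. This clause does not exist today.
CREATE TABLE events_proposed (
  some_ts TIMESTAMP,
  data    STRING
)
PARTITION BY trunc(some_ts, 'HH')
GROUP BY trunc(some_ts, 'DD')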
Gerhard Brueckl
01/17/2023, 1:31 PM

CREATE TABLE events (
  eventId BIGINT,
  data STRING,
  eventType STRING,
  eventTime TIMESTAMP,
  year INT GENERATED ALWAYS AS (YEAR(eventTime)),
  month INT GENERATED ALWAYS AS (MONTH(eventTime)),
  day INT GENERATED ALWAYS AS (DAY(eventTime))
)
PARTITIONED BY (eventType, year, month, day)
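A nice property of these generated partition columns: Delta Lake can derive the matching partition filters from a predicate on eventTime alone, so queries do not have to reference year/month/day explicitly. A small illustration against the table above:

-- Delta should translate this eventTime range into partition filters
-- on year/month/day automatically, pruning to 2023-01-01 only:
SELECT eventType, count(*) AS cnt
FROM events
WHERE eventTime >= '2023-01-01 00:00:00'
  AND eventTime <  '2023-01-02 00:00:00'
GROUP BY eventType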
Dmitry Volodin
01/17/2023, 1:42 PM

That is still a conditional update of the expr which defines the partition. I was thinking about a more elegant way, where it is just automatically taken into account during VACUUM/OPTIMIZE, so unchanged partitions would not actually be rebuilt.
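For comparison, the closest existing mechanism is probably OPTIMIZE with a partition predicate: it compacts small files within each selected partition, but it never merges files across partition boundaries, which is exactly the missing piece described here. Using the events table from the previous example:

-- Compact only the cold December partitions; files are merged within
-- each (eventType, year, month, day) partition, not across them:
OPTIMIZE events WHERE year = 2022 AND month = 12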
Gerhard Brueckl
01/17/2023, 1:55 PM

Dmitry Volodin
01/17/2023, 2:44 PM

Say I have a table with a ts attribute.

PARTITION BY trunc(ts, 'HH')

will create hourly partitions. They will be small enough, and that will help prevent large DML overhead if I expect this data to change.
Such a design has a drawback: if I want to store everything with hourly partitions, it will no longer be efficient for analytical load. Storage expenses are also expected to be greater due to potentially lower compression efficiency.
So assume I have these partitions:
2023-01-01 09:00:00
2023-01-01 10:00:00
2022-12-01 09:00:00
2022-12-01 10:00:00
I expect the December partitions to be stale, and therefore I would like to allow VACUUM/OPTIMIZE runs to store them in bigger chunks, potentially uniting
2022-12-01 09:00:00
2022-12-01 10:00:00
into one file.
Hope that explains my intentions. To sum up: there is no such mechanism so far, and the only way is to conditionally update the partitioning attributes, which definitely does NOT look like best practice.
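For concreteness, a sketch of that workaround under assumed names. Delta Lake partitions on columns rather than on expressions, so the truncated timestamp has to be materialized as an ordinary column (it cannot be a generated column here, because generated columns cannot be updated):

-- Hypothetical table: p_ts holds date_trunc('HOUR', ts) for hot rows.
CREATE TABLE events_hourly (
  ts   TIMESTAMP,
  data STRING,
  p_ts TIMESTAMP  -- partition key, set to the truncated hour on insert
)
USING DELTA
PARTITIONED BY (p_ts);

-- Coarsen the partition key for stale rows: the UPDATE rewrites the
-- December data into far fewer (daily) partitions:
UPDATE events_hourly
SET p_ts = date_trunc('DAY', ts)
WHERE ts < '2023-01-01';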
01/17/2023, 3:07 PMp_hh = 2 OR (p_hh = -1 AND hh = 2)
just an idea that could work -not sure if it will be any faster though.
DeltaLake is usually very efficient in partition pruning and file pruning
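A sketch of how that sentinel idea could look end to end; the schema and all names are assumed, with -1 marking rows demoted into a single cold partition:

-- Hypothetical schema: p_hh is hour(ts) for hot data and -1 for cold
-- data; hh keeps the real hour so cold rows stay filterable.
CREATE TABLE events_sentinel (
  ts   TIMESTAMP,
  data STRING,
  hh   INT,   -- hour(ts), a plain data column
  p_hh INT    -- partition key: hour(ts) while hot, -1 once cold
)
USING DELTA
PARTITIONED BY (p_hh);

-- Demote stale rows into the single cold partition:
UPDATE events_sentinel SET p_hh = -1 WHERE ts < '2023-01-01';

-- Read hour 2 across both layouts; the OR predicate prunes the scan
-- to the p_hh = 2 and p_hh = -1 partitions only:
SELECT * FROM events_sentinel
WHERE p_hh = 2 OR (p_hh = -1 AND hh = 2);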