
orsagiv

06/29/2023, 12:27 PM
Hey, a question about data skipping: does statistics collection happen when writing to Delta, or during the optimize job? Attaching a quote from the Databricks Delta documentation:
By default Delta Lake on Databricks collects statistics on the first 32 columns defined in your table schema

JosephK (exDatabricks)

06/29/2023, 12:27 PM
stats are collected when writing data
🙏 1

orsagiv

06/29/2023, 12:33 PM
@JosephK (exDatabricks) so if I got it right, it also means that if I reduce the value of delta.dataSkippingNumIndexedCols for an hour and then increase it back again, the columns that fell outside that range during that hour will never get their statistics back, right?

JosephK (exDatabricks)

06/29/2023, 12:35 PM
First, don't do that. Files written during that hour won't have stats for those columns, but if you optimize, that will write new files and you'll get stats at write time.

orsagiv

06/29/2023, 12:37 PM
Not sure I understand. You mean that after increasing the value of this property back, the optimize job will write the (missing) stats?

JosephK (exDatabricks)

06/29/2023, 12:42 PM
If you change it back, the optimize will create newer stats

orsagiv

06/29/2023, 12:43 PM
OK, so stats can also be collected during optimize, not only when writing the data

JosephK (exDatabricks)

06/29/2023, 3:14 PM
Optimize writes files, so it collects stats like any other write
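
To see which files ended up with stats for which columns, one option is to read the `add` actions in the table's `_delta_log` commit files, where each added data file carries its own `stats` JSON string. The sketch below is not from the thread: the function name `stats_columns_per_file` and the sample commit contents are hypothetical, and it only looks at top-level columns. It assumes the commit file has already been read into a list of JSON lines.

```python
import json


def stats_columns_per_file(log_lines):
    """Map each added data file to the top-level columns that have min/max stats.

    `log_lines` is an iterable of JSON lines, as found in a Delta
    `_delta_log/<version>.json` commit file.
    """
    result = {}
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        add = action.get("add")
        if add is None:
            continue  # not an add action (e.g. commitInfo, metaData, remove)
        raw_stats = add.get("stats")
        # `stats` is itself a JSON-encoded string and may be absent.
        stats = json.loads(raw_stats) if raw_stats else {}
        columns = set(stats.get("minValues", {})) | set(stats.get("maxValues", {}))
        result[add["path"]] = sorted(columns)
    return result


# Hypothetical commit contents: the second file was written while fewer
# columns were being indexed, so it only carries stats for `id`.
sample_log = [
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"add": {"path": "part-0000.parquet", "stats": json.dumps({
        "numRecords": 3,
        "minValues": {"id": 1, "ts": "2023-06-29"},
        "maxValues": {"id": 3, "ts": "2023-06-29"},
        "nullCount": {"id": 0, "ts": 0},
    })}}),
    json.dumps({"add": {"path": "part-0001.parquet", "stats": json.dumps({
        "numRecords": 2,
        "minValues": {"id": 4},
        "maxValues": {"id": 5},
        "nullCount": {"id": 0},
    })}}),
]

print(stats_columns_per_file(sample_log))
```

Because stats live with each file entry, a file written while the property was lowered keeps its reduced stats until something rewrites that file, which is what optimize does.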