Jeremy Jordan

05/15/2023, 7:17 PM
I seem to remember that delta lake only tracks metadata for the first 30 columns (and thus column ordering matters for improving query performance) but I can't seem to find this in the docs, can someone point me in the right direction?
By default Delta Lake on Databricks collects statistics on the first 32 columns defined in your table schema. You can change this value using the table property delta.dataSkippingNumIndexedCols. Adding more columns to collect statistics would add more overhead as you write files.

Collecting statistics on long strings is an expensive operation. To avoid collecting statistics on long strings, you can either configure the table property delta.dataSkippingNumIndexedCols to avoid columns containing long strings or move columns containing long strings to a column position greater than delta.dataSkippingNumIndexedCols using ALTER TABLE ALTER COLUMN. See ALTER TABLE.
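
A rough sketch of what the quoted passage implies for column ordering. It only builds SQL strings and does no Spark work, so it runs anywhere. The helper names, the table name `events`, and the column names are all hypothetical, and it assumes a flat schema (nested fields are not modeled). Treating a negative value as "index every column" follows my reading of the table property reference and is not stated in the quote above.

```python
from typing import List, Sequence, Tuple

# Default quoted in the docs snippet above.
DEFAULT_NUM_INDEXED_COLS = 32


def split_by_stats_coverage(
    columns: Sequence[str],
    num_indexed_cols: int = DEFAULT_NUM_INDEXED_COLS,
) -> Tuple[List[str], List[str]]:
    """Split a flat column list into (has stats, no stats) by position."""
    if num_indexed_cols < 0:
        # Assumption: a negative value means statistics for every column.
        return list(columns), []
    return list(columns[:num_indexed_cols]), list(columns[num_indexed_cols:])


def set_num_indexed_cols_sql(table: str, num_indexed_cols: int) -> str:
    """Build the statement that changes the table property."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.dataSkippingNumIndexedCols' = '{num_indexed_cols}')"
    )


def move_column_after_sql(table: str, column: str, after: str) -> str:
    """Build the statement that moves a column behind another column."""
    return f"ALTER TABLE {table} ALTER COLUMN {column} AFTER {after}"


# Hypothetical 40-column table: c0 .. c39.
columns = [f"c{i}" for i in range(40)]
with_stats, without_stats = split_by_stats_coverage(columns)

print(len(with_stats), len(without_stats))
print(set_num_indexed_cols_sql("events", 40))
print(move_column_after_sql("events", "c3", "c39"))
```

With the default, filters on `c32` through `c39` in this made-up table would not benefit from file-level statistics, which is why the position of frequently filtered columns matters. The generated statements could be passed to `spark.sql(...)` in a session with Delta Lake configured.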

Lennart Skogmo

05/15/2023, 7:21 PM
found something called delta.dataSkippingNumIndexedCols here

chris fish

05/15/2023, 7:22 PM
just fyi, our docs search is a little messed up right now, so using google and
will usually help you find the thing you’re looking for
❤️ 1

Jeremy Jordan

05/15/2023, 7:23 PM
ahh perfect, thank you!!
i'm assuming this applies to the OSS delta lake as well? found the relevant section here