
Ajex

01/31/2023, 10:39 AM
Hi guys. I have a table with 60 columns, and I know that Delta gathers statistics on the first 32 columns by default. So I used this code:
spark.sql("ALTER TABLE delta.`/user/fplay/temp/testraw/raw_logs.delta` CHANGE COLUMN user_id user_id STRING FIRST")
to move the column I need to Z-order on to index 0 (the first column). But even after that, the newest Delta log version still has no statistics for the column I need. Any help, please!
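One way to check which columns actually have statistics is to read the latest commit file in the table's `_delta_log` directory: each `add` action carries a `stats` field, which is itself a JSON string with `minValues`, `maxValues`, and `nullCount` entries. The sketch below is a minimal illustration that assumes the table lives on a local filesystem (the path in this thread looks like HDFS, which would need a different file reader); the function name `stats_columns_by_file` is made up for this example.

```python
import json
from pathlib import Path


def stats_columns_by_file(table_path):
    """Map each file added in the latest JSON commit to its columns with min-value stats."""
    log_dir = Path(table_path) / "_delta_log"
    # Commit files are named with zero-padded version numbers,
    # so sorting by name gives version order.
    commits = sorted(log_dir.glob("*.json"))
    if not commits:
        return {}
    result = {}
    for line in commits[-1].read_text().splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        add = action.get("add")
        if add is None:
            continue  # commitInfo, metaData, remove, etc.
        # "stats" is a JSON-encoded string and may be missing entirely.
        stats = json.loads(add["stats"]) if add.get("stats") else {}
        # Only top-level column names are listed; nested columns appear
        # under their parent's name.
        result[add["path"]] = sorted(stats.get("minValues", {}))
    return result
```

If `user_id` is missing from the list for freshly written files but present for files produced by Z-order, that matches the behavior discussed in this thread.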

Ryan Zhu

01/31/2023, 4:23 PM
what’s the output of
spark.read.format("delta").load("/user/fplay/temp/testraw/raw_logs.delta").schema
? Is
user_id
the first column? Also, which Delta version are you using? There is a bug where Delta collects stats on the first 32 columns of the schema of the data being ingested rather than of the table schema. 2.2.0 fixed this issue (https://github.com/delta-io/delta/commit/67bf022d3e4d8fc7c17c12be7875b855283b6996).
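The difference between the two behaviors can be illustrated with a toy model. This is not Delta's implementation, only a sketch of the rule as described in this thread; the function `stats_columns`, the column names `col_0` to `col_58`, and the assumption that the writer still uses the old column order are all invented for the example.

```python
NUM_INDEXED_COLS = 32  # number of leading columns that get stats by default, per the thread


def stats_columns(table_columns, write_columns, use_table_schema):
    """Return the columns that would get stats for one write in this toy model.

    use_table_schema=False models the pre-2.2.0 behavior (leading columns of
    the incoming data); use_table_schema=True models the fixed behavior
    (leading columns of the table schema).
    """
    source = table_columns if use_table_schema else write_columns
    return source[:NUM_INDEXED_COLS]


# Hypothetical 60-column table: user_id was originally last, then moved FIRST.
other_columns = [f"col_{i}" for i in range(59)]
table_columns = ["user_id"] + other_columns
write_columns = other_columns + ["user_id"]  # the writer still uses the old order

old = stats_columns(table_columns, write_columns, use_table_schema=False)
new = stats_columns(table_columns, write_columns, use_table_schema=True)
print("user_id" in old)  # False
print("user_id" in new)  # True
```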

Ajex

02/01/2023, 12:10 AM
It’s working, thank you so much
@Ryan Zhu However, I'm running another test. Using Delta 2.1, I ran this code:
spark.sql("ALTER TABLE delta.`/user/fplay/temp/testraw/raw_logs.delta` CHANGE COLUMN user_id user_id STRING FIRST")
to move the column I need to Z-order on to index 0 (the first column). Then I ran Z-order and it worked: the Delta log has statistics for "user_id". But when new data (not yet Z-ordered) is written to that location, the Delta log has no statistics for "user_id". In other words, only the data I ran Z-order on has statistics for "user_id", while the new data written to
/user/fplay/temp/testraw/raw_logs.delta
has no statistics for "user_id".

Ryan Zhu

02/01/2023, 3:45 AM
What’s the schema of the new data? As I mentioned above, 2.1 will not pick up the user_id column if it’s not among the first 32 columns of the new data's schema.

Ajex

02/01/2023, 4:07 AM
You mean it's based on the schema at write time, not the schema printed by this code?
spark.read.format("delta").load("/user/fplay/temp/testraw/raw_logs.delta").schema
The code above prints user_id in first place.

Ryan Zhu

02/01/2023, 4:18 AM
Right, the schema on write. 2.2 fixed this bug
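For anyone who has to stay on 2.1, one possible workaround that follows from this explanation (an untested inference, not something confirmed in the thread; upgrading to 2.2 is the fix that was actually named) is to reorder the columns of the data being written so they match the table's column order. The helper `align_to_table_order` below is hypothetical and works on plain lists of column names.

```python
def align_to_table_order(table_columns, write_columns):
    """Return write_columns reordered to follow the order of table_columns."""
    missing = [c for c in write_columns if c not in table_columns]
    if missing:
        raise ValueError(f"columns not in table schema: {missing}")
    present = set(write_columns)
    return [c for c in table_columns if c in present]


# With PySpark this might be used roughly as follows (not run here):
#   table_cols = spark.read.format("delta").load(path).schema.fieldNames()
#   ordered = align_to_table_order(table_cols, df.columns)
#   df.select(*ordered).write.format("delta").mode("append").save(path)
```

Under this assumption, user_id would again fall within the leading columns of the write schema, which is what 2.1 looks at when collecting stats.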

Ajex

02/01/2023, 4:22 AM
But on 2.1 I can still run Z-order after moving the user_id column to position 0. So will that Z-order not be effective?

Ryan Zhu

02/01/2023, 4:35 AM
Zorder will read the table. So the data to be written has the same schema as the table.

Ajex

02/01/2023, 4:45 AM
So you mean that on 2.1, after moving the column into the first 32, Z-order still works, even though new data written to the table will not have statistics for the moved column in the _delta_log?

Ryan Zhu

02/01/2023, 4:46 AM
yep, zorder doesn’t have the issue because for zorder, the schema of data to be written happens to be the same as the table schema

Ajex

02/01/2023, 5:00 AM
So each time I run Z-order, Delta cannot reuse the statistics in the _delta_log (because user_id has no statistics there) and recomputes statistics on user_id for Z-order. With the new version, Delta can reuse the statistics in the _delta_log after I change the user_id position. Am I right?