rtyler  04/13/2023, 2:51 PM
[…] primitive("timestamp") column in the schema, because the ParquetReader gives Timestamp(nanosecond) from the int96 column, but the Arrow schema that delta-rs thinks it should be using is Timestamp(microsecond), which I believe to be the correct interpretation here.
My gut feel is that this will be the case for parquet files written by Delta/Spark, but I'm not sure where the best place to introduce the necessary conversion will be. Putting it in our writer feels correct, but I think we would have other readers not doing the right thing on this type either.
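For illustration, a minimal sketch (not delta-rs code) of how the mismatch shows up when a Spark-written file is opened with the Rust parquet crate's Arrow reader; the file path and column name here are hypothetical:

```
use std::fs::File;

use arrow_schema::{DataType, Field, TimeUnit};
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical data file written by Spark with legacy int96 timestamps.
    let file = File::open("part-00000-c000.snappy.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // The parquet reader typically infers Timestamp(Nanosecond, ...) for an int96 column...
    let inferred = builder.schema().field_with_name("timestamp")?.data_type().clone();
    println!("parquet reader inferred: {inferred:?}");

    // ...while the Delta "timestamp" primitive maps to Timestamp(Microsecond, None) in Arrow.
    let expected = Field::new("timestamp", DataType::Timestamp(TimeUnit::Microsecond, None), true);
    println!("delta schema expects:    {:?}", expected.data_type());

    Ok(())
}
```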
Will Jones  04/13/2023, 3:04 PM
rtyler  04/13/2023, 3:05 PM
Will Jones  04/13/2023, 3:12 PM
rtyler  04/13/2023, 3:16 PM
Will Jones  04/13/2023, 4:00 PM
rtyler  04/13/2023, 7:04 PM
Will Jones  04/13/2023, 7:11 PM
• Since Spark 3.0, parquet logical type TIMESTAMP_MICROS is used by default while saving TIMESTAMP columns. In Spark version 2.4 and earlier, TIMESTAMP columns are saved as INT96 in parquet files. To set spark.sql.parquet.outputTimestampType to INT96 restores the previous behavior.
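For context, a small sketch (assuming the Rust parquet crate; the file path is hypothetical) that checks whether a data file actually stores timestamps as int96, i.e. whether it came from Spark 2.4 or from a writer with spark.sql.parquet.outputTimestampType set to INT96:

```
use std::fs::File;

use parquet::basic::Type as PhysicalType;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical data file path.
    let file = File::open("part-00000-c000.snappy.parquet")?;
    let reader = SerializedFileReader::new(file)?;

    // Walk the parquet schema descriptor and flag columns using the legacy int96 physical type.
    let schema = reader.metadata().file_metadata().schema_descr();
    for column in schema.columns() {
        if column.physical_type() == PhysicalType::INT96 {
            println!("{} is stored as int96 (legacy Spark timestamp)", column.path().string());
        }
    }
    Ok(())
}
```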
rtyler  04/13/2023, 7:35 PM
Robert  04/13/2023, 7:38 PM
rtyler  04/13/2023, 8:13 PM
Will Jones  04/13/2023, 8:25 PM
I don't think with_schema will work; it looks like it only handles metadata and will fail if data types change. I think we'll need to write our own function that casts record batches by traversing the schema and applying functions from https://docs.rs/arrow-cast/37.0.0/arrow_cast/cast/fn.cast.html
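A rough sketch of the kind of helper being described, assuming the arrow-array / arrow-cast / arrow-schema crates and a flat schema (nested struct and list fields would need a recursive walk); this is not the eventual delta-rs implementation:

```
use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch};
use arrow_cast::cast::cast;
use arrow_schema::{ArrowError, SchemaRef};

// Cast every column of `batch` to the type of the matching field in `target_schema`,
// e.g. Timestamp(Nanosecond) columns read from int96 into Timestamp(Microsecond).
fn cast_record_batch(batch: &RecordBatch, target_schema: SchemaRef) -> Result<RecordBatch, ArrowError> {
    let columns = batch
        .columns()
        .iter()
        .zip(target_schema.fields().iter())
        .map(|(column, field)| {
            if column.data_type() == field.data_type() {
                // Already the expected type: reuse the array without copying.
                Ok(Arc::clone(column))
            } else {
                // arrow-cast handles Timestamp(ns) -> Timestamp(us) among many other conversions.
                cast(column.as_ref(), field.data_type())
            }
        })
        .collect::<Result<Vec<ArrayRef>, ArrowError>>()?;

    RecordBatch::try_new(target_schema, columns)
}
```

Casting nanoseconds down to microseconds drops sub-microsecond precision, but for Spark-written data that should be lossless, since Spark's timestamp type is microsecond precision to begin with.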
rtyler  04/13/2023, 8:27 PM