
Noah Prince

06/25/2023, 10:59 PM
In delta-rs, when trying to write a checkpoint, I’m getting
Error: Failed to convert into Arrow schema: Json error: Binary is not supported by JSON
Any idea how to fix?
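For anyone trying to reproduce this, here is a minimal sketch, assuming a mid-2023 deltalake crate; the table path is hypothetical and stands in for any table whose schema has a Binary column with collected stats:

```rust
use deltalake::checkpoints;

// Repro sketch: open a table whose schema contains a Binary column
// (with statistics collected) and ask for a checkpoint.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical path; any table with a Binary column should do.
    let table = deltalake::open_table("./data/example_table").await?;

    // Fails with:
    // "Failed to convert into Arrow schema: Json error: Binary is not supported by JSON"
    checkpoints::create_checkpoint(&table).await?;
    Ok(())
}
```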

Will Jones

06/26/2023, 3:31 PM
Thanks for making the issue. We’ll take a look soon!

Noah Prince

06/28/2023, 12:28 AM
Any update on this?

Will Jones

06/28/2023, 1:07 AM
Nothing yet. Keep in mind we are a volunteer-run project, so soon might mean a week or so 🙂

Noah Prince

06/28/2023, 1:12 AM
Got a good idea of where to look? Happy to contribute a fix but the surface area is a bit large haha

Will Jones

06/28/2023

ReaderBuilder might not like that it's getting a schema with a binary column. So we might need to insert some dbg!() calls to figure out why there is a binary column, and where.
Do you have binary data? It might be coming from either partition values or statistics.
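For context, a minimal sketch of the failure mode being described here, assuming a 2023-era arrow crate (exactly where the error surfaces varies by version):

```rust
use std::io::Cursor;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::ReaderBuilder;

fn main() {
    // A stats-like schema that contains a Binary column.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, true),
        Field::new("payload", DataType::Binary, true),
    ]));

    // arrow-json has no decoder for Binary, so reading JSON against
    // this schema fails with "Binary is not supported by JSON" (or a
    // similar message, depending on the arrow version).
    let result = ReaderBuilder::new(schema).build(Cursor::new(r#"{"id": 1}"#));
    println!("{:?}", result.err());
}
```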

Noah Prince

06/28/2023, 1:55 AM
I do have binary data yeah, almost certainly coming from the statistics.
Partitioning is by date, so wouldn’t expect that to be the problem
Oh I can see exactly where it’s exploding. Thanks for the direction!
🙌 1
https://github.com/delta-io/delta-rs/pull/1498 This work? I think it fixes the issue. Dunno if it would have any downstream consequences.
Lol, of course there's another error now:
Parser error: can't parse the string value 5.5714976858712145e23 to decimal
Guess this checkpoint writer hasn't gotten much use 😞

Michael Nacey

06/28/2023, 12:44 PM
It might help to show your JSON and table schema

Will Jones

06/28/2023, 2:52 PM
I think it’s just that the checkpoint writer wasn’t tested with a wide enough range of schemas (something I often have to nag contributors about ;) )
I recently upgraded the statistics collection logic so we collect them for more data types, but didn’t get as far as testing them with the checkpoints apparently
I think we need to disable decimal stats for checkpoints too; like binary, they have their own complications, since they don't have a native JSON representation the way strings or numbers do.
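A hypothetical sketch of that kind of guard (not necessarily what the linked PR does): drop Binary and Decimal columns when deriving the Arrow schema used to read checkpoint stats, so nothing unserializable reaches the JSON reader:

```rust
use arrow::datatypes::{DataType, Field, Schema};

/// Hypothetical helper: filter out columns whose stats can't
/// round-trip through JSON before handing the schema to the reader.
fn stats_schema_for_checkpoint(schema: &Schema) -> Schema {
    let fields: Vec<Field> = schema
        .fields()
        .iter()
        .filter(|f| {
            !matches!(
                f.data_type(),
                DataType::Binary
                    | DataType::LargeBinary
                    | DataType::Decimal128(_, _)
                    | DataType::Decimal256(_, _)
            )
        })
        .map(|f| Field::new(f.name(), f.data_type().clone(), f.is_nullable()))
        .collect();
    Schema::new(fields)
}
```

The trade-off is losing data skipping on the dropped columns, which is what the rest of the thread weighs up.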

Noah Prince

06/28/2023, 2:54 PM
Yeah, I'm more just saying I'm surprised I'm the first to hit these issues. Seems like large decimals and binary blowing up would have happened by now. But maybe people just aren't using the checkpointing feature?
I'm looking at the decimal stuff right now. It's representing them as a float, which is bad.

Will Jones

06/28/2023, 2:54 PM
Right now decimals are serialized out as floats, but that’s clearly not working well here
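To see why floats are a problem here, a quick self-contained illustration using the value from the error above; and when the float is later rendered in scientific notation, a strict decimal parser can reject it outright, which matches that error:

```rust
fn main() {
    // An f64 carries only ~15-17 significant decimal digits, so a
    // high-precision decimal stat serialized as a float is at best an
    // approximation of the original value.
    let v: f64 = "5.5714976858712145e23".parse().unwrap();

    // The expanded form prints the nearest representable f64, not the
    // exact decimal, so the digits won't match the original.
    println!("{v:.0}");
}
```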

Noah Prince

06/28/2023, 2:55 PM
Could cause data skipping where it shouldn’t
Do you know off hand how the Scala impl handles this?
Does it skip decimals or serialize them as a raw string?

Will Jones

06/28/2023, 2:56 PM
I think they have a special non-standard JSON serialization where they are unquoted strings
Such as 10.000

Noah Prince

06/28/2023, 2:57 PM
Oof. Skipping may be best then. These are our bronze tables so not the end of the world to not have stats

Will Jones

06/28/2023, 2:57 PM
I wonder, if we serialized them as strings, whether Spark would be able to read those fine 🤔

Noah Prince

06/28/2023, 2:57 PM
I think it might
That’s what I was going to try
Okay it does work with decimal strings
pushed that up
So I pushed another fix; I think stats were actually being encoded incorrectly for decimal types.
👍 1
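For the record, a sketch of what string-encoded decimal stats could look like; the column name and values here are made up, and the PR itself is the source of truth for the actual encoding:

```rust
use serde_json::json;

fn main() {
    // Carrying decimal min/max as JSON strings keeps every digit and
    // the scale intact, so a reader can parse them back losslessly.
    let stats = json!({
        "numRecords": 3,
        "minValues": { "price": "10.000" },
        "maxValues": { "price": "99.990" }
    });
    println!("{stats}");
}
```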