Ryan Johnson

01/13/2023, 7:57 PM
Hi all, I'm trying to use pyarrow to parse Delta's AddFile.stats field (a JSON string), but there doesn't seem to be any way to parse a StringArray full of JSON into an Array or Table of nested data. This surprises me, because pyarrow.json.read_json does exactly the right thing... but only for line-delimited JSON files. At least, I didn't see anything, e.g. in pyarrow.compute, and a Google search came up empty. Am I missing something obvious here?

Will Jones

01/13/2023, 8:08 PM
Yeah there’s no compute function to parse JSON. I think the best thing you can do is join the string buffers with newlines and then run read_json on that buffer:
import pyarrow as pa
from pyarrow.json import read_json

arr = pa.array(['{"x": 1, "y": "a"}', '{"x": 2, "y": null}'])

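# Join the strings into one newline-delimited JSON buffer (this copies the data).
json_data = pa.py_buffer("\n".join(x.as_py() for x in arr).encode())

# read_json infers the schema and returns a pyarrow.Table.
read_json(json_data)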
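
One caveat with the join above: a null entry makes as_py() return None, which "\n".join rejects, and pretty-printed JSON containing raw newlines would be split across several rows. Below is a minimal sketch of a helper that guards against both, using only the standard library. The name to_ndjson is hypothetical (it is not a pyarrow function), and mapping nulls to {} is an assumption about the desired result.

```python
import json
from typing import Iterable, Optional


def to_ndjson(values: Iterable[Optional[str]]) -> bytes:
    """Join JSON strings into one newline-delimited payload (hypothetical helper)."""
    lines = []
    for value in values:
        if value is None:
            # Assumption: keep row alignment by emitting an empty object,
            # so a null input still produces exactly one line.
            lines.append("{}")
        elif "\n" in value or "\r" in value:
            # Raw newlines can only be whitespace between JSON tokens,
            # so re-serializing onto a single line preserves the document.
            lines.append(json.dumps(json.loads(value)))
        else:
            # Common case: the string is already a single line; use it as-is.
            lines.append(value)
    return "\n".join(lines).encode("utf-8")


payload = to_ndjson(['{"x": 1, "y": "a"}', None, '{\n  "x": 2,\n  "y": null\n}'])
```

The returned bytes could then be wrapped, for example with pa.BufferReader(payload), and passed to read_json as in the snippet above. Only strings that actually contain newlines are re-serialized, so the common case stays a plain join.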

Ryan Johnson

01/13/2023, 8:11 PM
I guess that works because buffers are "file like"? But super annoying given that it requires copying the data...

Will Jones

01/13/2023, 8:17 PM
True, but it works 🤷
Not sure if you saw, but the next version of the Python delta-rs bindings will include a function that parses this for you: https://github.com/delta-io/delta-rs/pull/1033#issuecomment-1370487499

Ryan Johnson

01/13/2023, 8:59 PM
Skimming the C++ sources, the code is a bit convoluted, but there might be a core that could be wrapped up into a proper compute function. If I read correctly:
• TableReaderImpl::Read
• calls TableReaderImpl::ParseAndInsert
• calls ParseBlock
• which uses a HandlerBase instance that implements BlockParser
• whose doParse method uses a rapidjson Reader to do the heavy lifting.
I think the only difference for parsing a StringArray (instead of a file) is we'd need a block parser variant that presents each string to be parsed as a different buffer, maybe as a std::spanstream, so that EOF becomes the "delimiter" instead of newline?
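
To illustrate the semantics such a variant would have (this is not the Arrow C++ code, only a pure-Python sketch), each element can be treated as its own JSON document, so the end of the string terminates it rather than a newline. The helper name parse_each is hypothetical.

```python
import json
from typing import Any, Iterable, List, Optional


def parse_each(values: Iterable[Optional[str]]) -> List[Optional[Any]]:
    """Parse every element as its own JSON document (hypothetical helper)."""
    # End of string, not newline, terminates each document,
    # so embedded newlines are harmless. Nulls pass through unchanged.
    return [None if v is None else json.loads(v) for v in values]


rows = parse_each(['{"x": 1, "y": "a"}', None, '{\n  "x": 2\n}'])
```

If pyarrow is available, pa.array(rows) should infer a struct type from the resulting dictionaries, but this goes through Python objects and would be far slower than a native compute function.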

Will Jones

01/13/2023, 9:22 PM
I think you have the right idea. Do you want to open an issue at https://github.com/apache/arrow/issues and we can discuss further there?

Ryan Johnson

01/13/2023, 10:23 PM