Ryan Johnson
01/13/2023, 7:57 PM
StringArray full of json into an Array or Table of nested data. This surprises me, because pyarrow.json.read_json does exactly the right thing... but only for line-delimited json files. At least, I didn't see anything, e.g. in pyarrow.compute, and a Google search came up empty. Am I missing something obvious here?
Will Jones
01/13/2023, 8:08 PM
parse_json on that buffer:
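import pyarrow as pa
from pyarrow.json import read_json

# A StringArray in which each element is one JSON document.
arr = pa.array(['{"x": 1, "y": "a"}', '{"x": 2, "y": null}'])
# Join the elements with newlines to form one line-delimited JSON buffer.
json_data = pa.py_buffer("\n".join(x.as_py() for x in arr).encode())
# Parse the buffer into a Table with columns x and y.
read_json(json_data)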
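The join above treats the newline as the document delimiter, so it assumes every element is a single-line JSON document and that no element is null. A minimal sketch of one way to normalize the strings first, using only the standard library, might look like the following. The helper name `to_ndjson` is hypothetical, and substituting an empty object for a null element is an assumption that should be checked against how the reader treats it.

```python
import json
from typing import Iterable, Optional


def to_ndjson(docs: Iterable[Optional[str]]) -> bytes:
    """Re-serialize each JSON string onto one line and join with newlines."""
    lines = []
    for doc in docs:
        if doc is None:
            # Assumption: an empty object stands in for a null element so
            # that row positions are preserved; verify the reader's behavior.
            lines.append("{}")
        else:
            # json.dumps without indent never emits raw newlines, so a
            # pretty-printed document collapses onto a single line.
            lines.append(json.dumps(json.loads(doc)))
    return "\n".join(lines).encode()


# The first document spans several lines and would break a plain join.
pretty = '{\n  "x": 1,\n  "y": "a"\n}'
payload = to_ndjson([pretty, '{"x": 2, "y": null}', None])
print(payload.decode())
```

The resulting bytes could then be wrapped with pa.py_buffer(payload) and passed to read_json as before, with the input strings taken from arr.to_pylist(). This costs an extra parse and re-serialize in Python for every element, so it is a workaround rather than a substitute for parsing each string directly.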
Ryan Johnson
01/13/2023, 8:11 PM
Will Jones
01/13/2023, 8:17 PM
Ryan Johnson
01/13/2023, 8:59 PM
BlockParser
• whose doParse method uses a rapidjson Reader to do the heavy lifting.
I think the only difference for parsing a StringArray (instead of a file) is we'd need a block parser variant that presents each string to be parsed as a different buffer (maybe as a std::spanstream) so that EOF becomes the "delimiter" instead of newline?
Will Jones
01/13/2023, 9:22 PM
Ryan Johnson
01/13/2023, 10:23 PM