
John Darrington

05/29/2023, 8:50 PM
best practice for converting an array of strings, let's say from a CSV reader, into an arrow-array Int32Array or something similar?

Will Jones

05/29/2023, 8:55 PM
Assuming the array of strings is an Arrow array, probably the cast function in arrow-cast: https://docs.rs/arrow-cast/40.0.0/arrow_cast/cast/index.html
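For example, a minimal sketch using arrow-cast 40 (assuming the matching arrow-array and arrow-schema crates; the sample values are made up):

```rust
use std::sync::Arc;

use arrow_array::{cast::AsArray, types::Int32Type, ArrayRef, StringArray};
use arrow_cast::cast;
use arrow_schema::DataType;

fn main() {
    // A string column as it might come out of a CSV reader.
    let strings: ArrayRef = Arc::new(StringArray::from(vec!["1", "2", "3"]));

    // With the default (safe) cast options, values that fail to parse
    // become null instead of returning an error.
    let ints = cast(&strings, &DataType::Int32).unwrap();
    let ints = ints.as_primitive::<Int32Type>();
    assert_eq!(ints.value(0), 1);
}
```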

John Darrington

05/29/2023, 8:56 PM
how did I miss that in the docs 🤦‍♂️ thank you!
while I've got you here though, I have an architecture question that might change how I approach the problem anyway; could use another pair of eyes.
right now we ingest CSVs and load them into Postgres with COPY FROM, and it's fairly fast. My original thought was to concurrently write to Delta tables as I was loading this data from the CSV into the copier, but now I'm wondering if I'm going to face a high casting cost
and whether it would be better to send to Postgres first, then batch-pull from Postgres down into the Delta tables. But then I'm doubling my bandwidth, sending the same data an extra time, though this time I might not have to do much casting work since Postgres would have done it already?
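For concreteness, a rough sketch of the dual-write idea, where copy_to_postgres is a hypothetical stand-in for the existing COPY FROM path (assumes the deltalake and tokio crates; the table URI is made up):

```rust
use deltalake::arrow::record_batch::RecordBatch;
use deltalake::operations::DeltaOps;
use deltalake::DeltaTableError;

// Hypothetical stand-in for the existing CSV -> COPY FROM ingestion path.
async fn copy_to_postgres(_batches: &[RecordBatch]) -> Result<(), DeltaTableError> {
    Ok(())
}

async fn dual_write(batches: Vec<RecordBatch>) -> Result<(), DeltaTableError> {
    let delta = async {
        // Append the same batches to an existing Delta table.
        DeltaOps::try_from_uri("./my-delta-table")
            .await?
            .write(batches.clone())
            .await?;
        Ok::<_, DeltaTableError>(())
    };
    // Drive both sinks concurrently; either error fails the ingest.
    let (delta_res, pg_res) = tokio::join!(delta, copy_to_postgres(&batches));
    delta_res?;
    pg_res?;
    Ok(())
}
```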

Will Jones

05/29/2023, 9:01 PM
Yeah there’s lots of ways to go about that, and I’m not sure of the tradeoffs.
One way is to ingest CSVs to Postgres, then extract to Arrow via pgeon (that’s the most performant postgres -> Arrow reader I know of atm, though eventually ADBC might match it), and write to delta from there.
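Sketched end to end, that pipeline might look like the following; postgres_to_arrow is a hypothetical placeholder for whichever Postgres-to-Arrow reader ends up being used (pgeon, ADBC, ...), and the query and table path are made up:

```rust
use deltalake::arrow::record_batch::RecordBatch;
use deltalake::operations::DeltaOps;
use deltalake::DeltaTableError;

// Hypothetical placeholder: whatever Postgres -> Arrow reader you land on
// (pgeon, ADBC, ...) ultimately hands back Arrow record batches.
fn postgres_to_arrow(_query: &str) -> Vec<RecordBatch> {
    Vec::new()
}

async fn pull_into_delta() -> Result<(), DeltaTableError> {
    let batches = postgres_to_arrow("SELECT * FROM readings");
    // Append the batches; their Arrow schema has to line up with the
    // Delta table's schema.
    DeltaOps::try_from_uri("./my-delta-table")
        .await?
        .write(batches)
        .await?;
    Ok(())
}
```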

John Darrington

05/30/2023, 12:21 AM
Hmm, not a bad idea. I wonder how nicely that plays with TimescaleDB.
I think for right now I'll play with writing to both sinks concurrently and see how performance is.
If that's not acceptable or too brittle, then I'll move to the batch pull from Postgres with that reader, if it plays nice with Timescale.
Thanks for responding btw!

John Darrington

06/01/2023, 2:09 PM
@Will Jones just a quick follow-up: concurrently writing out to Delta Lake from our CSV-to-Postgres ingestion pathway didn't add a significant amount of time on top of the operation, meaning we could follow that route if we wanted to. The biggest pain in the a** though was dealing with the three different datatypes and the conversion between them: between the Postgres types, Arrow types, and delta-rs types it was a blast of conversion processes, especially for timestamps 😄 I'm seriously considering that, for our CSV route at least, it would be easier to have a batch process using that Postgres reader you mentioned earlier, and have our Parquet ingestion pathway go straight to Delta instead of even hitting Postgres.
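To give a flavor of the timestamp glue involved: going from a chrono value (as a Postgres timestamp typically parses to) into an Arrow array means picking a unit and converting by hand. A minimal sketch, assuming chrono 0.4 and arrow-array:

```rust
use arrow_array::TimestampMicrosecondArray;
use chrono::NaiveDateTime;

fn main() {
    // A value as it might come back from Postgres via chrono.
    let ts = NaiveDateTime::parse_from_str("2023-06-01 14:09:00", "%Y-%m-%d %H:%M:%S")
        .unwrap();

    // Arrow stores timestamps as integers in a declared unit, so convert
    // to microseconds since the Unix epoch before building the array.
    let arr = TimestampMicrosecondArray::from(vec![ts.timestamp_micros()]);
    assert_eq!(arr.value(0), ts.timestamp_micros());
}
```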

Will Jones

06/01/2023, 2:18 PM
Hmm, I would have thought that the conversion from Arrow to delta-rs was handled well by us.
Which parts did you have to write manually? We might want to open issues for this in the repo.

John Darrington

06/01/2023, 2:37 PM
Most of the conversion stuff I had to write was from the Postgres types, so not really applicable. I think you guys did a good job.
I might be using it wrong.

Will Jones

06/01/2023, 2:41 PM
Okay, well, we have that whole delta_arrow module for the conversions, at least at the schema level.
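For reference, a sketch of what those built-in conversions buy you at the schema level (assuming a 2023-era deltalake crate; the table path is made up):

```rust
use deltalake::arrow::datatypes::Schema as ArrowSchema;
use deltalake::DeltaTableError;

async fn print_arrow_schema() -> Result<(), DeltaTableError> {
    let table = deltalake::open_table("./my-delta-table").await?;

    // The delta_arrow module provides TryFrom impls between Delta and
    // Arrow schemas, so no hand-rolled mapping is needed here.
    let arrow_schema = ArrowSchema::try_from(table.get_schema()?)?;
    println!("{arrow_schema:?}");
    Ok(())
}
```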

John Darrington

06/01/2023, 2:45 PM
lmaooooooo I forgot about it
I ended up reimplementing a basic version of it to make my use case work
🤦‍♂️