Matthew Powers

06/13/2023, 7:35 PM
Thanks @Will Jones. Will try that and report back.
Will Jones

06/13/2023, 7:36 PM
I think part of the trick is specifying the schema up front so the reader doesn't have to try inferring it.
But inference is usually limited to the first 10,000 rows or so, so I'm not sure that's the issue
Kees Duvekot

06/13/2023, 10:55 PM
Can't you "just" read that csv into a spark dataframe and then save that dataframe as a parquet file?
Matthew Powers

06/14/2023, 1:43 AM
@Kees Duvekot - yeah, but that makes multiple Parquet files. I'm looking for a single Parquet file in this case.
Kees Duvekot

06/14/2023, 3:59 AM
Can't you do a coalesce before?
So
coalesce(1)
That is also what we did when we wanted the csv writer in spark to write a single csv
It would require a node with enough memory 😁
Matthew Powers

06/15/2023, 3:09 AM
@Will Jones - this worked to convert a 50 GB CSV => a single Parquet file BTW:
import pyarrow as pa
import pyarrow.csv
import pyarrow.parquet as pq

writer = None
with pyarrow.csv.open_csv(in_path) as reader:
    for next_chunk in reader:
        if next_chunk is None:
            break
        if writer is None:
            # Create the writer lazily, once the first chunk's schema is known
            writer = pq.ParquetWriter(out_path, next_chunk.schema)
        next_table = pa.Table.from_batches([next_chunk])
        writer.write_table(next_table)
writer.close()
Thanks for pointing me in the right direction. The resulting Parquet file had 50,476 row groups 😲 The notebook in case you’re interested.