
Zhang David

09/03/2023, 2:03 PM
Hi guys, I have two questions: 1. If I have an existing set of parquet files, how do I convert them to a Delta table through delta-rs Python (without loading everything into memory first)? 2. If I have created two Delta tables at two different local fs locations, how can I merge them into one via delta-rs Python? Really appreciate your help!

Will Jones

09/03/2023, 4:55 PM
You can use the PyArrow datasets module to load the parquet files, and then pass the stream of data to the deltalake writer. Will look roughly like:
from deltalake import write_deltalake
import pyarrow.dataset as ds

dataset = ds.dataset("path/to/parquet", format="parquet")  # lazily discover the parquet files
reader = dataset.scanner().to_reader()                      # stream record batches instead of loading everything
write_deltalake("path/to/deltatable", reader)               # write the stream out as a Delta table
If I have created two delta tables, at two different local fs locations, how can I merge them into one via delta-rs python?
What do you mean by merge? Would it be acceptable to read them both and write a new third one? Or are you saying you want to create a new Delta Table that references data files at those two existing locations?
If it's the latter, we can't do that right now, since we only support relative paths for tables, not absolute ones.
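For the former, a rough sketch that streams both tables into a new third one without materializing everything in memory (assuming hypothetical local paths and the deltalake DeltaTable / pyarrow.dataset APIs):

from deltalake import DeltaTable, write_deltalake
import pyarrow.dataset as ds

# Expose both existing Delta tables as PyArrow datasets (lazy, nothing loaded yet)
left = DeltaTable("path/to/table_a").to_pyarrow_dataset()
right = DeltaTable("path/to/table_b").to_pyarrow_dataset()

# Union the two datasets (schemas must be compatible) and stream the batches
# into a new Delta table at a third location
combined = ds.dataset([left, right])
write_deltalake("path/to/combined_table", combined.scanner().to_reader())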

Zhang David

09/03/2023, 9:51 PM
I see. For my 2nd question, I meant to ask if there's a way to avoid loading everything in and writing it out again.
My current setup is that I have distributed jobs writing parquet files to different locations, and in the end there's an aggregator that combines them. If I'm using delta-rs in Python, can I have all those processes writing in parallel/concurrently to one table? I understand a Delta table is ACID, but I'm not sure if the Rust-based Python binding also guarantees ACIDity.

Will Jones

09/03/2023, 11:36 PM
What's the storage backend? S3 or something else?
Anything but S3 (and R2 and MinIO) can be appended to concurrently in delta-rs.
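For example, with a local filesystem backend each worker can simply append on its own; a minimal sketch, assuming each worker already holds a PyArrow RecordBatchReader named reader for its slice of the data:

from deltalake import write_deltalake

# Each worker commits its own append; the Delta log resolves the concurrent commits
write_deltalake("path/to/deltatable", reader, mode="append")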

Zhang David

09/04/2023, 12:54 AM
I see. Is it because of this open ticket: https://github.com/delta-io/delta-rs/issues/883 ?
Unfortunately, AWS S3 is the current backend when I run this on a remote cluster.
Do you know what the behavior will be if there's an intermediate "fs" layer, whether s3fs-fuse or FSx for Lustre? So basically, on a remote EC2 node, I'm concurrently writing to a location that is a mount of the same S3 prefix, managed either by AWS FSx for Lustre or mounted via s3fs-fuse.

Will Jones

09/04/2023, 3:19 AM
Is it because of this open ticket
No, it's just because S3 doesn't support an atomic "replace if not exists" or similar operation. For S3, to support concurrent writes you need some sort of external locking mechanism. delta-rs ships with one implemented using DynamoDB. You should be able to set these environment variables to configure it: https://github.com/delta-incubator/dynamodb-lock-rs/blob/f4e21a81d0a39fc2c20d868479045062131e5aba/src/lib.rs#L66-L82
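A rough sketch of that configuration, assuming the variable names from the linked lib.rs plus AWS_S3_LOCKING_PROVIDER to point the S3 backend at the DynamoDB lock (the table name below is just a hypothetical example):

import os

# Assumed variable names; see the linked source for the authoritative list
os.environ["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"
os.environ["DYNAMO_LOCK_TABLE_NAME"] = "delta_rs_lock"            # hypothetical DynamoDB table
os.environ["DYNAMO_LOCK_PARTITION_KEY_VALUE"] = "my_delta_table"  # one value per Delta table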

Zhang David

09/04/2023, 4:40 AM
Oh nice! So I guess I should set both env vars above in each worker w/ the same values?

Will Jones

09/04/2023, 5:51 AM
Yeah, it's important that each worker is configured with the same locking mechanism.

Zhang David

09/04/2023, 11:27 AM
got it thanks!