S Thelin

04/19/2023, 12:46 PM
Hello. Is there any known issue if you append concurrently from, let's say, dask to a Delta table with delta-rs? Will it be able to handle the log properly, or will it fail silently?

rtyler

04/19/2023, 4:00 PM
There are no known issues. If you're using S3, you must use the DynamoDB lock provider to make concurrent writes safe.
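As a rough illustration of the lock provider setup mentioned above, here is a minimal sketch of the storage options one might pass to delta-rs from Python. The option keys follow the deltalake documentation as I understand it, but key names have changed between releases, so check them against the version you run. The lock table name, region, and bucket path are hypothetical.

```python
def s3_locking_storage_options(lock_table: str, region: str) -> dict:
    """Build storage options asking delta-rs to coordinate S3 commits via DynamoDB.

    Key names are assumptions based on the deltalake docs; older releases
    used different names for the lock table setting.
    """
    return {
        "AWS_REGION": region,
        "AWS_S3_LOCKING_PROVIDER": "dynamodb",
        "DELTA_DYNAMO_TABLE_NAME": lock_table,
    }


# "delta_log" and "us-east-1" are placeholder values, not taken from the thread.
options = s3_locking_storage_options("delta_log", "us-east-1")

# With the deltalake package installed and AWS credentials configured, every
# writer would pass the same options, for example:
#   from deltalake import write_deltalake
#   write_deltalake("s3://my-bucket/my-table", data, mode="append",
#                   storage_options=options)
```

Every concurrent writer, including each dask worker, would need to use the same lock table for the coordination to be effective.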

Jim Hibbard

04/19/2023, 8:15 PM
You may also be interested in the deltadask library. This is independent of the concurrency question, I'd just be echoing rtyler, but if you're working with dask and delta lake a lot it might be handy. Hope that helps! 😄

S Thelin

04/24/2023, 9:31 PM
@rtyler yeah, I just found out about it. However, in my case I don't want to use DynamoDB; it feels quite expensive for a basic use case. I think I will fall back to producing to Kafka topics and then consume them from Spark to write to Delta. We want to use Kafka for CDC either way, so it is another good use case for us. I can then expose the producer via FastAPI for my client users, so that we don't have to set up locking via DynamoDB.
@Jim Hibbard I assume deltadask also has issues with concurrent writes, since there is nothing that can lock this unless you set up the DynamoDB lock. As expected, it seems like it only supports reads, which is essentially the same as what I do today via dask and my own client with delta-rs and Trino.

rtyler

04/24/2023, 9:37 PM
I wouldn't shy away from this approach because of cost. I just double checked a production dashboard and for data pipelines written in Rust which are ingesting over 50k messages/sec our monthly DynamoDB costs are under $10 🤣
It's $1.25 per million write request units and $0.25 per million read request units. Using DynamoDB for locking is nothing like using DynamoDB for something like a CRUD application FWIW
👍 1
👀 1
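To make the pricing above concrete, here is a back-of-the-envelope estimate using the two per-million prices quoted in the message. The commit rate and the number of request units per commit are hypothetical assumptions, not figures from the thread; the sketch also assumes lock traffic scales with commits rather than with individual messages, since many messages are typically batched into one commit.

```python
WRITE_USD_PER_MILLION = 1.25  # per million write request units (quoted in the thread)
READ_USD_PER_MILLION = 0.25   # per million read request units (quoted in the thread)


def monthly_lock_cost(commits_per_second, writes_per_commit, reads_per_commit, days=30):
    """Estimate monthly DynamoDB request cost for commit locking, in USD."""
    commits = commits_per_second * 60 * 60 * 24 * days
    write_cost = commits * writes_per_commit / 1_000_000 * WRITE_USD_PER_MILLION
    read_cost = commits * reads_per_commit / 1_000_000 * READ_USD_PER_MILLION
    return write_cost + read_cost


# Hypothetical workload: one commit per second, two write units and one read
# unit per commit. These inputs are illustrative assumptions only.
cost = monthly_lock_cost(commits_per_second=1, writes_per_commit=2, reads_per_commit=1)
print(f"${cost:.2f} per month")  # about $7.13 under these assumptions
```

Under these assumed inputs the estimate lands in the same range as the "under $10" figure reported above, though the real number depends on how often a pipeline commits.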

Jim Hibbard

04/24/2023, 10:14 PM
That's a really good point @rtyler. The amount of work DynamoDB is doing is minuscule!
@S Thelin Yes, any concurrent write solution on S3 will currently need some locking mechanism like DynamoDB. The cost is quite low since not much is happening, but concurrent write workloads need some coordination mechanism because S3 doesn't have a "put if empty" API like Azure's or GCP's cloud storage.
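The "put if empty" primitive described above can be illustrated on a local filesystem, where `O_CREAT | O_EXCL` gives an atomic create-if-absent. This is only a sketch of the idea, not how delta-rs is implemented: the `commit` helper is hypothetical, and a real Delta writer would also re-read the log and check for conflicts before retrying.

```python
import os
import tempfile
import threading


def commit(log_dir: str, payload: bytes) -> int:
    """Write payload as the next free log version and return that version."""
    version = 0
    while True:
        path = os.path.join(log_dir, f"{version:020d}.json")
        try:
            # O_CREAT | O_EXCL fails if the file already exists, which makes
            # this an atomic "put if absent" on a local filesystem.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
        except FileExistsError:
            version += 1  # another writer claimed this version; try the next
            continue
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        return version


log_dir = tempfile.mkdtemp()
claimed = []
claimed_guard = threading.Lock()  # protects the results list only, not the commits


def writer(i: int) -> None:
    version = commit(log_dir, f'{{"writer": {i}}}'.encode())
    with claimed_guard:
        claimed.append(version)


threads = [threading.Thread(target=writer, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(claimed))  # each writer ends up with its own version number
```

Because the create either succeeds or fails atomically, no two writers can claim the same version. Without such a primitive, an external lock has to provide the same guarantee.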

S Thelin

04/25/2023, 7:28 AM
So I mostly run fully open source and avoid cloud products as much as possible, so for us it makes more sense for now to produce to topics and have Spark pick them up and write to Delta tables, where the risk of collision is lower. I don't like the idea of using DynamoDB for this very specific edge case. I am pushing to use log-based CDC, so this is just another good reason to switch to that. The only things I use on AWS at the moment are EKS and S3; the rest we run in K8s, hence I can stay cloud agnostic. I have written another interface which can communicate with Delta, create tables on the fly, perform merges, etc., using delta-rs and Trino together. I spoke with @Matthew Powers about it and think I will try to open source it. The locking issue persists there as well, although with the client I have written you can enable the DynamoDB lock yourself.

Jim Hibbard

04/25/2023, 7:29 AM
That sounds like a great contribution. Yes, the thing missing with S3 is atomicity. This isn't an issue on Azure's or GCP's flavor of object storage, just S3.
👀 1

S Thelin

04/25/2023, 7:29 AM
However, @Nick Karpov is working on some interesting stuff regarding locking across frameworks like Spark, delta-rs, etc., which will probably come in handy, so you are not forced to use DynamoDB.
👍 1

Jim Hibbard

04/25/2023, 7:30 AM
I'll have to check with Nick, but I'm pretty sure it was to harmonize the DynamoDB approaches.

S Thelin

04/25/2023, 7:30 AM
@Jim Hibbard I assume MinIO has the same issue as S3 then.

Jim Hibbard

04/25/2023, 7:31 AM
You could definitely do something without DynamoDB; you just need to fill the same niche it currently fills. I'm unsure about MinIO.
I know they replicate the S3 API, but I'm not sure if it's identical or a superset of features, etc. So I'm not 100% sure off the top of my head, but I assume the same issue would be there. You should double check though 🙂

S Thelin

04/25/2023, 7:43 AM
Thanks for the replies @Jim Hibbard and @rtyler, appreciate it.

Jim Hibbard

04/25/2023, 7:45 AM
Of course! And don't hesitate to reach out to any of us if you want help w/ your delta-rs/trino features.

S Thelin

04/25/2023, 11:30 PM
@Jim Hibbard it seems like MinIO might have some locking: https://min.io/docs/minio/linux/administration/object-management/object-retention.html#id6 I will test it.

Jim Hibbard

04/26/2023, 12:48 AM
Oh sweet, definitely let me know. That would be really interesting 😁

rtyler

04/26/2023, 6:03 AM
I don't believe that does what you may believe it to do. S3 also has object locking. The purpose of the DynamoDB locking for delta-rs is to ensure against eventually consistent bucket/object operations in S3. If MinIO is truly S3 compatible, they will likely have replicated the same eventually consistent behavior around a "put if absent" type of operation, which makes concurrent writers unsafe.
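The unsafe case described above can be replayed with a small in-memory simulation. `PlainStore` is a hypothetical stand-in for an object store with no conditional put, and `threading.Lock` stands in for an external lock such as the DynamoDB one; neither reflects delta-rs internals. The first half replays a check-then-put race step by step, the second half shows the same two commits serialized by a lock.

```python
import threading


class PlainStore:
    """In-memory stand-in for an object store with no conditional put."""

    def __init__(self):
        self.objects = {}

    def exists(self, key):
        return key in self.objects

    def put(self, key, value):
        self.objects[key] = value  # silently overwrites any existing object


def log_key(version: int) -> str:
    return f"_delta_log/{version:020d}.json"


# Unsafe interleaving: both writers check before either one writes.
unsafe = PlainStore()
a_sees_free = not unsafe.exists(log_key(0))
b_sees_free = not unsafe.exists(log_key(0))
if a_sees_free:
    unsafe.put(log_key(0), "commit from A")
if b_sees_free:
    unsafe.put(log_key(0), "commit from B")
survivor = unsafe.objects[log_key(0)]  # A's commit was overwritten

# Safe variant: the check and the put happen while holding one shared lock.
commit_lock = threading.Lock()  # stand-in for an external lock service


def locked_commit(store: PlainStore, value: str) -> int:
    with commit_lock:
        version = 0
        while store.exists(log_key(version)):
            version += 1
        store.put(log_key(version), value)
        return version


safe = PlainStore()
version_a = locked_commit(safe, "commit from A")
version_b = locked_commit(safe, "commit from B")
```

In the unsafe run one commit is lost without any error, which is the kind of silent failure asked about at the start of the thread. Object retention locks address a different problem (preventing deletion or modification of existing objects) and do not serialize the check and the write.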