j

John Darrington

05/25/2023, 9:21 PM
Would there be any heartache if I worked on a way to do locking with a Postgres instance vs dynamodb on s3? Use case is we're using MinIO (S3 Compliant) cloud storage on premises and can't use Dynamo for locking
👀 1
r

rtyler

05/25/2023, 9:22 PM
there would be no heartache
j

John Darrington

05/25/2023, 9:23 PM
cool - shouldn't be too difficult honestly
you guys seem to have done a great job with the APIs between the packages, makes it easier
🦀 1
r

Robert

05/25/2023, 9:30 PM
was said before, but no heartache at all 😄. It may be worthwhile to consider, though, which way we want to go with this.. i.e. there have been some discussions around offering alternative locking mechanisms, to be able to use Databricks and delta-rs for the same table. While Postgres would not offer that, there may be some inspiration for a simpler approach rather than using locks… https://github.com/delta-io/delta-rs/issues
r

rtyler

05/25/2023, 9:31 PM
using the Databricks S3 commit service would require them to post an actual API, which I've given them grief about since before the delta-rs project was founded
r

Robert

05/25/2023, 9:32 PM
there is a design document linked in the issue as well.. all in all that is going to be a larger effort, and we will be supporting the lock-based approach going forward, so implementing a lock client should also be fine
my understanding is that Databricks has an internal way, but also offers a DynamoDB-based one (maybe part of OSS Delta?). the internal one is … internal 😄.. but using the DynamoDB-based one, there could be interop between Rust and Spark writers.
r

rtyler

05/25/2023, 9:34 PM
I think you meant to link a specific issue @Robert but not everything
🤣 1
j

John Darrington

05/25/2023, 9:34 PM
I found it 🙂
Correct 😄
j

John Darrington

05/25/2023, 9:35 PM
Honestly it seems like the quicker stopgap for my situation would be to implement a new locking client for now, then work on getting rid of the need in the future. I guess I could also run it in unsafe mode; I didn't see that one before
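For readers following along, here is a minimal sketch of what a pluggable locking client could look like. Everything in it is an assumption made for illustration: the `LockClient` trait, the `Lease` record and the in-memory backend are hypothetical and are not the interface delta-rs actually defines. A Postgres-backed version would presumably replace the `HashMap` with a lease row in a table or with a Postgres advisory lock such as `pg_try_advisory_lock`; that part is not shown here.

```rust
use std::collections::HashMap;

/// Hypothetical lease record: who holds the lock and when the lease expires.
#[derive(Debug, Clone, PartialEq)]
struct Lease {
    owner: String,
    /// Expiry in seconds. The clock is supplied by the caller so the
    /// sketch stays deterministic.
    expires_at: u64,
}

/// Hypothetical interface for a lock backend (illustration only).
trait LockClient {
    /// Returns true if `owner` now holds the lock for `key`.
    fn try_acquire(&mut self, key: &str, owner: &str, now: u64, ttl: u64) -> bool;
    /// Returns true if `owner` held the lock and it was released.
    fn release(&mut self, key: &str, owner: &str) -> bool;
}

/// In-memory stand-in for the external store (DynamoDB, Postgres, ...).
#[derive(Default)]
struct InMemoryLock {
    leases: HashMap<String, Lease>,
}

impl LockClient for InMemoryLock {
    fn try_acquire(&mut self, key: &str, owner: &str, now: u64, ttl: u64) -> bool {
        // A live lease held by a different owner blocks the request.
        let blocked = self
            .leases
            .get(key)
            .map_or(false, |l| l.expires_at > now && l.owner != owner);
        if blocked {
            return false;
        }
        // Free, expired, or already ours: take (or renew) the lease.
        self.leases.insert(
            key.to_string(),
            Lease { owner: owner.to_string(), expires_at: now + ttl },
        );
        true
    }

    fn release(&mut self, key: &str, owner: &str) -> bool {
        // Only the current holder may release the lock.
        let held = self.leases.get(key).map_or(false, |l| l.owner == owner);
        if held {
            self.leases.remove(key);
        }
        held
    }
}
```

The lease expiry is there so that a writer that crashes while holding the lock does not block the table forever; a real backend would have to make the check-and-insert step atomic on the server side, which a single-threaded `HashMap` gets for free.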
r

rtyler

05/25/2023, 9:36 PM
Databricks has the S3 Commit service, which serializes writes to Delta tables in a given workspace. It's not the same as the "dynamodb one" you might be referring to: that is the DynamoDB Log Store implementation, which is a very different beast compared to just locking
👍 1
j

John Darrington

05/25/2023, 9:36 PM
I can't do databricks, they have no on premises solution
this is for a fully on-prem airgapped delta lake backed setup
r

rtyler

05/25/2023, 9:36 PM
The big problem I have with the dynamodb log store is that it's an all or nothing proposition, a table has to move its log store to DynamoDB and it just lives there, rather than as JSON in object storage. That means all clients have to understand how to interact with the log store in dynamo
j

John Darrington

05/25/2023, 9:41 PM
wait, is that how it is currently, or was that the proposed route?
I've only just now cloned the repo and am looking through code
r

Robert

05/25/2023, 9:42 PM
hmm if i read this document right, it does put a line for each commit into dynamo, but the actual files still reside in the object store. it's just about not allowing duplicate entries on a table, so the write to the db table will fail if the commit has been made before. https://docs.google.com/document/d/1Gs4ZsTH19lMxth4BSdwlWjUNR-XhKHicDvBjd2RqNd8/edit#heading=h.xgcsc4h146hh
👀 1
but in any case… if implementing the lock client is easier this is the way to go for now 🙂
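The "no duplicate entries" behaviour described above is essentially put-if-absent on a (table, version) key. A rough sketch of that idea follows, with an in-memory map standing in for the external table. The names `CommitRegistry`, `try_claim` and `CommitError` are hypothetical and are not taken from the design document or from any Delta implementation.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical stand-in for the external key-value table.
/// Key: (table path, commit version). Value: id of the writer that won.
#[derive(Default)]
struct CommitRegistry {
    entries: HashMap<(String, u64), String>,
}

#[derive(Debug, PartialEq)]
enum CommitError {
    /// Another writer already registered this version.
    VersionTaken { version: u64, winner: String },
}

impl CommitRegistry {
    /// Put-if-absent: only the first writer to register `version` succeeds.
    /// A later writer gets an error and has to retry with the next version.
    fn try_claim(&mut self, table: &str, version: u64, writer: &str) -> Result<(), CommitError> {
        match self.entries.entry((table.to_string(), version)) {
            Entry::Vacant(slot) => {
                slot.insert(writer.to_string());
                Ok(())
            }
            Entry::Occupied(existing) => Err(CommitError::VersionTaken {
                version,
                winner: existing.get().clone(),
            }),
        }
    }
}
```

In this sketch the data files and the commit JSON would still live in the object store; the registry only decides which writer is allowed to publish a given version.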
j

John Darrington

05/25/2023, 9:45 PM
yeah for just one person and needing to get something out the door soon - this is the way
👍 1
r

rtyler

05/25/2023, 9:50 PM
@Robert this is what I was thinking of with its warning, which honestly is not much different than our limitation, so it is perhaps just around write operations that the dynamodb log store comes into play. I cannot make heads or tails of the actual java behind it
👍 1
r

Robert

05/25/2023, 9:53 PM
My hope is anyhow, that S3 will eventually join the party and offer atomic writes.. all others do, and IIRC R2 will offer that on top of S3 🫰
j

John Darrington

05/25/2023, 9:59 PM
Considering that this postgres lock would be its own crate - any issues with me using some helper packages like thiserror?
r

rtyler

05/25/2023, 10:04 PM
we already have it 😉
n

Nick Karpov

05/25/2023, 10:26 PM
ya the java land dynamodb store intercepts write and list calls, & is therefore indeed heavy in that sense (but as Robert linked, it does not move the log entirely to dynamo) (cc @shingo you may find this thread helpful)
👍 1
s

shingo

05/26/2023, 1:42 AM
Thanks, @Nick Karpov!
j

John Darrington

05/26/2023, 2:54 PM
Does someone familiar with the current dynamodb locking paradigm have some time for a quick huddle? I wanted to make sure I understood it correctly before attempting to implement a postgres lock
I'm free all day after about an hour or so
r

rtyler

05/26/2023, 4:36 PM
@John Darrington what timezone are you in? I have the day off but am floating around some today
j

John Darrington

05/26/2023, 4:51 PM
I’m in the mountain timezone @rtyler , USA
👍 1
r

rtyler

05/26/2023, 6:41 PM
aight, I'm PST and in the office for a couple hours here
j

John Darrington

05/26/2023, 6:58 PM
I'm free now
r

Ryan Johnson

06/30/2023, 9:23 PM
Late to the party here, but:
"The big problem I have with the dynamodb log store is that it's an all or nothing proposition, a table has to move its log store to DynamoDB and it just lives there, rather than as JSON in object storage. That means all clients have to understand how to interact with the log store in dynamo"
AFAIK, the DDB client writes back the .json file to its usual place in S3 after the winning commit has been confirmed, and the reader doesn't talk to DDB at all. I don't know that code tho, so I'd have to double check to be 100% certain.
The all or nothing bit is that all writers to a table must agree about which endpoint/protocol to use for coordinating their commits. If one writer uses DDB and the other uses postgres, then they can bypass each other and a commit will likely be lost.
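A toy illustration of that failure mode, using a hypothetical `Coordinator` type that grants each version number at most once. When both writers ask the same coordinator, exactly one wins. When they ask different ones, both are told they won the same version, and one commit can overwrite the other.

```rust
use std::collections::HashSet;

/// Hypothetical commit coordinator: grants each version number at most once.
#[derive(Default)]
struct Coordinator {
    granted: HashSet<u64>,
}

impl Coordinator {
    /// Returns true only the first time `version` is requested.
    fn grant(&mut self, version: u64) -> bool {
        self.granted.insert(version)
    }
}

/// Each entry in `assignment` is one writer, given as the index of the
/// coordinator that writer talks to. Returns how many writers were allowed
/// to commit `version`.
fn winners(version: u64, coordinators: &mut [Coordinator], assignment: &[usize]) -> usize {
    assignment
        .iter()
        .filter(|&&c| coordinators[c].grant(version))
        .count()
}
```

With one shared coordinator and the assignment `[0, 0]`, `winners` returns 1; with two coordinators and the assignment `[0, 1]`, it returns 2, which is the lost-commit case.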
t

TD

06/30/2023, 9:26 PM
exactly correct @Ryan Johnson. everything is synced back to s3 and the JSON files are in their usual location.
we also designed the code for this DynamoDB log store such that there is a subinterface called BaseExternalLogStore which can be extended for other non-DDB KV stores/databases.