r

rtyler

05/27/2023, 10:14 PM
@Robert I saw your comments on @John Darrington’s pull request. Would you and @Will Jones be up for me moving dynamodb_lock into a repo in delta-incubator? I think that crate needs to come out especially if we're to use a LockingObjectStore implementation
👀 1
w

Will Jones

05/27/2023, 11:08 PM
Is the idea:
• dynamodb_lock -> `delta-incubator/dynamodb-lock`
• `pub trait LockingObjectStore` lives in delta-rs
• dynamodb_lock is used to implement `LockingObjectStore` for S3 in delta-rs?
r

rtyler

05/27/2023, 11:11 PM
There's a bunch of ways that I think we can approach this actually. I think my preference would be to create a repo locking_object_store in incubator and have that depend on dynamodb_lock releases. The only reason dynamodb_lock lives inside the delta-rs repo is because we had nowhere else to sensibly put it at the time
w

Will Jones

05/27/2023, 11:17 PM
That seems reasonable too. I feel like there is a lot of refactoring to do here and I haven't wrapped my head around it all yet.
r

rtyler

05/28/2023, 6:26 AM
I dropped a quick sub-tree cut out of the crate into an incubator repo https://github.com/delta-incubator/dynamodb-lock-rs
👍 1
r

Robert

05/28/2023, 6:36 AM
So in the end we would end up with a repo that contains `LockingObjectStore`, which exposes the `LockClient` trait? Then the dynamodb-lock-rs repo would implement that for Dynamo, and there would be a new repo for the ongoing Postgres client?
Or should the individual lock clients be crates in one repo, together with the `LockingObjectStore`?
Also noticed there is a lambdas repo in delta-incubator, should we then remove the `aws/heckpoint` from the delta-rs repo?
r

rtyler

05/28/2023, 6:55 AM
@Robert that was my intention, I was going to drop a couple other lambdas I have written in there when I found the time 🙂
💪 1
@Robert I was thinking that a single repo would have `LockingObjectStore`, and that repo could contain feature-gated locking implementations which would pull in dynamodb_lock, sqlx, etc. I'm not sure if the `LockClient` is a useful abstraction tbh if we have LockingObjectStore
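A minimal sketch of the layout discussed above, under stated assumptions: the `LockClient` method names, the `InMemoryLock` stand-in, and the `put_if_absent` method are hypothetical and are not the actual delta-rs or dynamodb_lock API. It only illustrates a store wrapper that is generic over a lock backend, so that DynamoDB- or Postgres-backed clients could be swapped in (for example behind Cargo features).

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

/// Hypothetical lock abstraction; the real trait may look quite different.
pub trait LockClient {
    /// Try to take the lock for `key`; returns true on success.
    fn try_acquire(&self, key: &str) -> bool;
    /// Release the lock for `key`.
    fn release(&self, key: &str);
}

/// In-memory stand-in for a DynamoDB- or Postgres-backed client.
/// A real crate could gate each backend behind a Cargo feature,
/// e.g. `#[cfg(feature = "dynamodb")]` (feature name is an assumption).
#[derive(Default)]
pub struct InMemoryLock {
    held: Mutex<HashSet<String>>,
}

impl LockClient for InMemoryLock {
    fn try_acquire(&self, key: &str) -> bool {
        // `insert` returns false when the key is already present, i.e. held.
        self.held.lock().unwrap().insert(key.to_string())
    }

    fn release(&self, key: &str) {
        self.held.lock().unwrap().remove(key);
    }
}

/// Hypothetical store wrapper: every "put if absent" runs under the lock.
pub struct LockingObjectStore<L: LockClient> {
    lock: L,
    objects: Mutex<HashMap<String, Vec<u8>>>,
}

impl<L: LockClient> LockingObjectStore<L> {
    pub fn new(lock: L) -> Self {
        Self { lock, objects: Mutex::new(HashMap::new()) }
    }

    /// Ok(true) if written, Ok(false) if the object already existed,
    /// Err if the lock is currently held by someone else.
    pub fn put_if_absent(&self, path: &str, data: &[u8]) -> Result<bool, String> {
        if !self.lock.try_acquire(path) {
            return Err(format!("lock for {path} is held elsewhere"));
        }
        let written = {
            let mut objects = self.objects.lock().unwrap();
            if objects.contains_key(path) {
                false
            } else {
                objects.insert(path.to_string(), data.to_vec());
                true
            }
        };
        self.lock.release(path);
        Ok(written)
    }
}
```

With this shape, callers only see the store; which lock backend sits behind it is a construction-time choice, which is roughly the argument for dropping a separately exposed `LockClient`.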
r

Robert

05/28/2023, 7:04 AM
makes sense to me 👍. Was just wondering, since we created the dynamodb-lock repo: what would be contained in that, or is that just temporary?
r

rtyler

05/28/2023, 7:07 AM
Since dynamodb-lock-rs is basically some standalone crate which implements a simple DynamoDB lock that others may be dependent on, I wanted to drop that elsewhere so that maybe others would contribute to it, or at the very least we can set it off to the side and forget about it 😛
r

Robert

05/28/2023, 7:08 AM
makes sense 😆
j

John Darrington

05/28/2023, 2:56 PM
So what should be my course of action at this point regarding where the code lives and where I should put the Postgres stuff? @rtyler @Robert
r

Robert

05/28/2023, 3:04 PM
My suggestion would be to do it in the delta-rs repo for now, as there will be changes to the core codebase and it's probably easier for development until things are sorted out. Depending on how you like to work, we could also build it in a separate repo, and integrate with delta-rs once we have a somewhat clean state. Publishing whatever new crates exist would then be a pre-requisite though. Also, I cannot create repos in delta-incubator, so @rtyler would have to help out there. In short, I'm fine either way...
j

John Darrington

05/28/2023, 3:08 PM
Ok I think it makes the most sense to pull out into the incubator or as a separate crate as well.
👍 1
Then the crate rename wouldn’t be a big issue
imo
r

Robert

05/28/2023, 3:15 PM
absolutely. IIUC, we landed on having the whole `LockedObjectStore` in that crate, so a name along those lines would be great 🙂 . The first implementation would then be the postgres backed one... As this would also pull in object_store, which by now exposes AWS credentials, I hope to get that rusoto-free some time ...
One word of caution. object_store 0.6 has been released, but it is not yet integrated with datafusion. I guess this will happen in the next release or the release after. So if you want to integrate before that, make sure to use 0.5.* for now...
j

John Darrington

05/28/2023, 3:18 PM
Oh so you don’t want me to pull the existing dynamodb stuff over wholesale?
r

Robert

05/28/2023, 3:19 PM
well - if I understood @rtyler correctly, he wants to publish dynamo db lock as a standalone crate. So the traits etc. defined over there would live in the new crate, but the actual implementation of the dynamo lock I guess not...
j

John Darrington

05/28/2023, 3:20 PM
Ah
r

Robert

05/28/2023, 3:21 PM
Personally I believe for object store, we can implement the dynamo db lock without any SDKs, which would be great in terms of dependencies, and also compatibility with how the S3 object store itself is configured.
In that case though it would make more sense to have that in the lock store crate as well...
Personally, I think it's fine to have some duplication for now, as we figure this out.
j

John Darrington

05/28/2023, 3:23 PM
Ok. My goal is to get the Postgres backed lock up this week so I can start experimenting with delta on our data warehouse. I’m just glad work's paying for my time on this :D
🤣 1
r

Robert

05/28/2023, 3:25 PM
We are happy to help wherever we can as well - I think many users may appreciate an alternative to DynamoDB, as we see many using non AWS S3 stores...
So extremely grateful for your effort!
j

John Darrington

05/28/2023, 3:25 PM
Yeah we’re using Minio here
👍 1
So just a little background: I work for Idaho National Laboratory as an architect. We built an open source ontological data warehouse that I’m looking to incorporate delta tables with for our longer term timeseries data
r

Robert

05/28/2023, 3:28 PM
Nice! Do you have a link, so I can browse a bit once I find some time? 🙂
r

Robert

05/28/2023, 3:29 PM
Thanks!
j

John Darrington

05/28/2023, 3:29 PM
Yep. I’m very excited for delta tables. No one wants to manage a petabyte sized Postgres cluster on prem lol
🤣 1
r

Robert

05/28/2023, 3:29 PM
indeed not. 😄
I guess doing petabyte scale is never easy ... but delta is a great choice to make it manageable. Depending on your workloads, you may want to consider distributing a table across several buckets. The discussions around separating the LogStore etc. are also setting us up to support this. Right now delta-rs assumes that the data lives where the delta log is, which is not a requirement imposed by the protocol.
but we are looking to relax that constraint
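To make the point about data location concrete: the Delta protocol describes the `path` of an add action as either relative to the table root or an absolute URI, so data files do not have to sit next to the log. A minimal sketch, assuming a hypothetical helper `resolve_data_path` (not a delta-rs function) and assuming that any path containing `://` is absolute:

```rust
/// Resolve the `path` of an add action against the table root.
/// Assumption: a path containing "://" is an absolute URI and is used as-is;
/// anything else is treated as relative to `table_root`.
pub fn resolve_data_path(table_root: &str, path: &str) -> String {
    if path.contains("://") {
        path.to_string()
    } else {
        format!(
            "{}/{}",
            table_root.trim_end_matches('/'),
            path.trim_start_matches('/')
        )
    }
}
```

Under these assumptions, a table rooted in one bucket could reference a file in another bucket simply by recording its absolute URI.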
j

John Darrington

05/28/2023, 3:35 PM
Good to know.
I’m just glad I can interact and manage it without having to stand up a spark instance
r

Robert

05/28/2023, 3:36 PM
^^ same reason that got me started 😄
j

John Darrington

05/28/2023, 3:36 PM
And this works with data that’s been ingested by data bricks?
r

Robert

05/28/2023, 3:38 PM
as usual, it depends ... databricks already supports more features, e.g. deletion vectors. So if a table is written with unsupported features we cannot (reliably) read it. That said, there is some stuff coming up that will make following new feature releases much easier!
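A minimal sketch of the kind of check this implies, assuming the table's protocol action lists the reader features it requires (such as `deletionVectors`) and the client keeps a list of the features it implements. The function name and signature are hypothetical, not the delta-rs API:

```rust
/// Returns the reader features a table requires that this client lacks.
/// An empty result means the table can be read; otherwise the reader
/// should refuse rather than risk returning wrong data.
pub fn unsupported_features<'a>(required: &[&'a str], supported: &[&str]) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|feature| !supported.iter().any(|s| *s == *feature))
        .collect()
}
```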
j

John Darrington

05/28/2023, 3:38 PM
Cool
r

Robert

05/28/2023, 3:39 PM
concurrent writes from databricks and delta-rs are also not possible right now with S3, as a shared locking mechanism between both environments is required.
j

John Darrington

05/28/2023, 3:40 PM
That’s fine. I’m looking at leveraging delta live tables for ingestion but on prem storage for long term keeping
👍 1
So just wanted to make sure the delta sharing server I write would mostly work if I can configure it right
r

Robert

05/28/2023, 3:42 PM
if you are interested in delta sharing, @shingo has a working rust implementation...
j

John Darrington

05/28/2023, 3:43 PM
Awesome!
s

shingo

05/28/2023, 3:47 PM
@Robert Thank you for the call-out to my project here! @John Darrington I would be happy to collaborate with you!
j

John Darrington

05/28/2023, 3:51 PM
Cool. I’m a ways off of sharing but I’ll let you know.
👍 1
j

John Darrington

05/28/2023, 5:08 PM
maybe I'm just blind - but where is the incubator repo so I can fork it?
oh nvmd, it's an org , I see
r

rtyler

05/28/2023, 5:14 PM
@Robert I'll bring you into the incubator now
j

John Darrington

05/28/2023, 5:18 PM
ok so just to clarify, we're pulling out dynamodb-lock into its own repo on incubator, then basically not touching it. Then I'll be writing `locked-object-store` (or should we name it `distributed-lock`?) as its own thing, with Postgres being the first backer, which will also live in the incubator
r

rtyler

05/28/2023, 5:26 PM
@John Darrington what's your github ID? I can just create a locked-object-store repo right now while I'm here
j

John Darrington

05/28/2023, 5:27 PM
DnOberon
r

rtyler

05/28/2023, 5:28 PM
added to https://github.com/delta-incubator/locked-object-store, @Robert you and @Will Jones also have invites in your inbox to join the incubator org, your absence was an oversight
👍 2
j

John Darrington

05/28/2023, 5:29 PM
I'm in, thanks
I'm debating between following the pattern in the dynamodb locking and utilizing Postgres's row locking. If we're fine with interior mutability on the `LockClient`, then I can hold a transaction open with a `SELECT FOR UPDATE` on a given row for as long as the lock is needed. `release`ing the lock would cause an update, and the transaction closing would free up the row for another `SELECT FOR UPDATE`. My gut says interior mutability is probably fine if what we gain is the use of an existing lock system vs. rolling our own completely with record information.
hmm, though I guess if you're running a lot of concurrent operations that would mean a lot of open transactions. Could still use session-level advisory locks though. Sorry, just thinking out loud
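A minimal sketch of the advisory-lock variant, under stated assumptions. `pg_try_advisory_lock` and `pg_advisory_unlock` are real Postgres functions that take a `bigint` key; the idea of deriving that key by hashing the table URI, and the names `advisory_lock_key`, `TRY_LOCK_SQL`, and `UNLOCK_SQL`, are assumptions for illustration. The hash here is FNV-1a, chosen only because it is stable across processes and needs no dependencies. Executing the statements through sqlx or another client is left out.

```rust
/// Session-level advisory lock statements; `$1` is the bigint lock key.
pub const TRY_LOCK_SQL: &str = "SELECT pg_try_advisory_lock($1)";
pub const UNLOCK_SQL: &str = "SELECT pg_advisory_unlock($1)";

/// Map a table URI to the signed 64-bit key that advisory locks take.
/// Uses 64-bit FNV-1a so every process derives the same key for the same URI.
pub fn advisory_lock_key(table_uri: &str) -> i64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV-1a offset basis
    for byte in table_uri.bytes() {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV-1a prime
    }
    // Reinterpret the bits as i64; Postgres bigint is signed.
    hash as i64
}
```

Compared with holding a `SELECT FOR UPDATE` transaction open, a session-level advisory lock is tied to the connection rather than to a transaction, so it has to be released explicitly or by closing the session. Two different URIs can in principle hash to the same key, which would only cause unnecessary contention, not a missed lock.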