Hugo Saavedra

07/28/2023, 5:52 PM
I have a sort of weird question -- I'm trying to understand this person's criticism. Is it that sequential versioning doesn't work in a setting with multiple nodes writing to their own copy of a file, since you'll quickly end up with version number collisions? Meaning there needs to be some node that acts as a single source of truth or otherwise some way of locking files so that only one node can write at once?
I guess my counter would be that in most use cases/data warehouses there aren't multiple copies of individual tables distributed across several nodes

Parthiban Jaganathan

07/28/2023, 6:03 PM
when there are parallel writes, a metastore setup like DynamoDB is required to eliminate concurrency issues and conflicts

rtyler

07/28/2023, 6:18 PM
@Hugo Saavedra I think your hunch is largely correct. The concern about sequential identifiers is not as much of a concern IMHO because we're not talking about a distributed data store like a clustered relational database. @Parthiban Jaganathan the utility of a DynamoDB-type helper is really just for S3, which is still eventually consistent; that's not required for ADLSv2 or GCS
👍 2

Parthiban Jaganathan

07/28/2023, 6:20 PM
@rtyler S3 is no longer eventually consistent since last year. So for concurrent modifications on a table, would a DynamoDB setup still be required?

rtyler

07/28/2023, 6:21 PM
@Parthiban Jaganathan yes indeed, in a Databricks environment they operate a commit coordination service that is not open source, so Delta users on S3 should consider providing their own lock coordination (as we do in #delta-rs) or adopting the S3DynamoLogStore implementation
👍 3
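
To make the "lock coordination" idea above concrete, here is a minimal in-process sketch. It is not how delta-rs or S3DynamoLogStore are implemented: `PlainObjectStore` and `LockedCommitter` are hypothetical names, and a `threading.Lock` stands in for whatever external lock service a real deployment would use. The only point is that when the store cannot reject an overwrite by itself, the check and the write have to happen under one lock.

```python
import threading


class PlainObjectStore:
    """Hypothetical store with separate exists/put calls and no put-if-absent."""

    def __init__(self):
        self.objects = {}

    def exists(self, key):
        return key in self.objects

    def put(self, key, body):
        # A plain put silently overwrites whatever was stored under the key.
        self.objects[key] = body


class LockedCommitter:
    """Serializes commits; the threading.Lock is a stand-in for an external lock."""

    def __init__(self, store):
        self.store = store
        self.lock = threading.Lock()

    def commit(self, version, body):
        key = f"_delta_log/{version:020d}.json"
        # Check and write happen under the same lock, so only one writer
        # can ever create a given version.
        with self.lock:
            if self.store.exists(key):
                return False
            self.store.put(key, body)
            return True


store = PlainObjectStore()
committer = LockedCommitter(store)
results = []


def writer(name):
    # Every writer races to claim the same version number.
    results.append((name, committer.commit(1, name)))


threads = [threading.Thread(target=writer, args=(f"writer-{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

winners = [name for name, ok in results if ok]
print(winners)
```

Exactly one of the eight writers gets `True`; the rest would have to re-read the log and try the next version.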

Hugo Saavedra

07/28/2023, 6:33 PM
thanks @rtyler and @Parthiban Jaganathan -- appreciate your thoughts on this

rtyler

07/28/2023, 6:36 PM
@Hugo Saavedra it's important to not take comments like that on HN too seriously, they let anybody post replies there! 😉 The problem with monotonically increasing version numbers only arises when you have writes happening to discrete and distributed data systems; the Subversion vs. Git comparison is a good one. Git needs hashes of objects for versions because it has to merge state that was committed to distributed versions of the same data store (your local, versus my local, versus GitHub). Data lakes are very much like Subversion in that they are inherently centralized, on purpose! In a centralized data system, monotonically increasing versions are not a bug, but a feature!
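
A toy illustration of the Subversion vs. Git point, under loud assumptions: the logs are plain dicts, and `next_version` and `content_id` are hypothetical helpers, not anything from Delta or Git. It shows that sequential numbers give one total order when everyone appends to the same log, and collide only once independent copies diverge, which is the case content hashes are designed for.

```python
import hashlib
import json


def next_version(log):
    """Next sequential version for a log given as {version: commit_body}."""
    return max(log, default=-1) + 1


def content_id(parent_id, body):
    """Git-style identifier: a hash over the parent id plus the commit body."""
    payload = json.dumps({"parent": parent_id, "body": body}, sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()


# Centralized: every writer appends to the same log, so versions form one order.
central = {0: "create table"}
central[next_version(central)] = "writer A: append"
central[next_version(central)] = "writer B: append"

# Distributed: two copies diverge from the same starting point.
replica_a = {0: "create table"}
replica_b = {0: "create table"}
replica_a[next_version(replica_a)] = "writer A: append"
replica_b[next_version(replica_b)] = "writer B: append"

# Both copies now claim version 1 for different commits.
collision = replica_a[1] != replica_b[1]

# Content-derived ids keep the two divergent commits distinguishable.
root = content_id(None, "create table")
id_a = content_id(root, replica_a[1])
id_b = content_id(root, replica_b[1])
```

In the centralized case the "which commit is version 2?" question always has one answer, which is what makes sequential versions convenient for things like time travel.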

Hugo Saavedra

07/28/2023, 7:00 PM
haha, yes that makes sense about HN 🙂 I mainly was digging through the archives there because I'm interested in learning more about delta internals and occasionally there's insightful commentary amid all the smugness and vitriol, and sometimes there are learning opportunities in thinking through some of the flippant criticisms. I figured that monotonic version numbers over hashes was probably a deliberate design decision but I wasn't sure about the reasoning -- it makes sense to me now that it's largely a product of data lakes being centralized by design

Dominique Brezinski

07/28/2023, 9:50 PM
It is not that S3 is eventually consistent (it isn't anymore), rather it is that it doesn't have an atomic test-and-set
henrydark is confused lol
🙂 1
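
For readers unfamiliar with the "atomic test/set" mentioned above, a rough sketch of what that primitive buys a commit protocol. A local temporary directory stands in for the object store here (an assumption for illustration only), and `try_commit` and `commit_file_name` are hypothetical helpers. Python's exclusive-create mode `"x"` performs the existence check and the creation as one step, so two writers can never both create the same version file.

```python
import os
import tempfile


def commit_file_name(version):
    # Delta log entries use a zero-padded 20-digit version, e.g. 00000000000000000003.json
    return f"{version:020d}.json"


def try_commit(log_dir, version, body):
    """Create the commit file for `version` only if it does not exist yet.

    Returns True if this writer claimed the version, False if another
    writer already had.
    """
    path = os.path.join(log_dir, commit_file_name(version))
    try:
        # Mode "x" fails with FileExistsError instead of overwriting.
        with open(path, "x") as f:
            f.write(body)
        return True
    except FileExistsError:
        return False


with tempfile.TemporaryDirectory() as log_dir:
    first = try_commit(log_dir, 0, '{"writer": "A"}')
    second = try_commit(log_dir, 0, '{"writer": "B"}')
    with open(os.path.join(log_dir, commit_file_name(0))) as f:
        surviving = f.read()
```

Without such a primitive, a writer has to do "check, then write" as two separate calls, and another writer can slip in between them; that gap is what external lock coordination papers over.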