Tornike Gurgenidze

06/10/2023, 8:36 AM
Hi all, I've been experimenting with delta-sharing a bit and I have a question regarding the scope and goals of the project. I would like to build a more "generic" data server where a client can send arbitrary JSON as a request and get a response in delta-sharing's API Response Format after some amount of computation on the server side. It seems that several design decisions around delta-sharing make this kind of use case impossible:
1) Delta Sharing enforces somewhat SQL-like (maybe database-like would be a better term) semantics for a request (share/schema/table structure and the concept of a predicate).
2) More importantly, the `query` endpoint expects the data response to be immediately available. While this might have made sense for the original use case, I think it's still unnecessarily limiting. Some mechanism by which a server would optionally return a token and allow the client to poll for results until they're available would free the protocol up for various use cases.
What I'm probably actually looking for is a "lower-level" sister protocol similar to delta-sharing, which defines the format of the dataset metadata being returned but makes no other assumptions about the service. A good analogy would be the difference between Arrow Flight and Arrow Flight SQL from the Arrow project, where the latter is much more opinionated. Would you consider something like this outside of delta-sharing's scope?
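The token-and-poll mechanism proposed above could look roughly like the following. This is a minimal in-memory sketch, not part of the Delta Sharing protocol: `AsyncQueryServer`, `submit`, `poll`, `fetch`, the `status` values, and the `example.com` URL are all hypothetical names chosen for illustration.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


class AsyncQueryServer:
    """Hypothetical stand-in for a server that splits a query into submit and poll."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._jobs = {}  # token -> Future holding the eventual file list

    def submit(self, request):
        """Accept arbitrary JSON, start the work, and return a token right away."""
        token = uuid.uuid4().hex
        self._jobs[token] = self._pool.submit(self._materialize, request)
        return token

    def poll(self, token):
        """Report 'pending' until the job finishes, then return its file list."""
        job = self._jobs[token]
        if not job.done():
            return {"status": "pending"}
        return {"status": "ready", "files": job.result()}

    @staticmethod
    def _materialize(request):
        # Stands in for a long-running job that writes Parquet files.
        time.sleep(0.05)
        # Placeholder URL; a real server would hand back pre-signed URLs here.
        return [{"url": f"https://example.com/{request['dataset']}/part-0.parquet"}]


def fetch(server, request, interval=0.01, timeout=5.0):
    """Client side: submit once, then poll until the result is ready."""
    token = server.submit(request)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        reply = server.poll(token)
        if reply["status"] == "ready":
            return reply["files"]
        time.sleep(interval)
    raise TimeoutError(f"job {token} did not finish within {timeout} seconds")
```

The point of the split is that `submit` never blocks on the computation, so the HTTP call that starts the job can return before any files exist.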

shingo

06/10/2023, 10:03 AM
I'm not sure I'm the right person to answer this question, or that I understand you correctly, but here are my thoughts on your proposal:
> 1) Delta Sharing enforces somewhat sql-like (maybe database-like would be a better term) semantics for a request (share/schema/table structure and the concept of a predicate).
Delta Sharing is a simple REST protocol, so I would say the semantics are quite common. Regarding the arbitrariness of the request JSON, the protocol specifies the bare minimum that Delta Sharing should have. This means you can still implement your own Delta Sharing server as long as it accepts arbitrary requests that satisfy the minimum requirements. The `predicate` parameter is used for Delta table pushdown filtering, which is one of the advantages of using Delta tables as backends.
> 2) More importantly, `query` endpoint expects data response to be immediately available. While this might have made sense for the original use case, I think it's still unnecessarily limiting. Some mechanism by which a server would optionally return a token and allow the client to poll for results until they're available would free the protocol up for various use cases.
When a server accepts a request at the `query` endpoint from a Delta Sharing client, the server tries to create short-lived pre-signed URLs, which essentially aligns with the concept you described as `tokens`. I would say the resulting footprint of the response is generally smaller than that of an HTML file for an ordinary static web page. Therefore, the expectation of immediate availability should not cause any performance issues. In other words, the server will free up the protocol immediately.
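To make the size of such a response concrete, here is a small sketch of reading one. It assumes the `query` response is newline-delimited JSON in which each line wraps a `protocol`, `metaData`, or `file` object, which is how I read the protocol description; every field value in `SAMPLE_RESPONSE` is made up, and `parse_query_response` is a hypothetical helper, not a function of any client library.

```python
import json

# Made-up sample: one protocol line, one metadata line, one file line.
SAMPLE_RESPONSE = "\n".join([
    json.dumps({"protocol": {"minReaderVersion": 1}}),
    json.dumps({"metaData": {
        "id": "example-table-id",
        "format": {"provider": "parquet"},
        "schemaString": "{}",
        "partitionColumns": [],
    }}),
    json.dumps({"file": {
        "url": "https://example.com/part-0.parquet?sig=placeholder",
        "id": "f0",
        "partitionValues": {},
        "size": 1024,
    }}),
])


def parse_query_response(body):
    """Group the lines of a response body by the kind of object they wrap."""
    parsed = {"protocol": None, "metaData": None, "files": []}
    for line in body.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        wrapper = json.loads(line)
        if "file" in wrapper:
            parsed["files"].append(wrapper["file"])
        elif "protocol" in wrapper:
            parsed["protocol"] = wrapper["protocol"]
        elif "metaData" in wrapper:
            parsed["metaData"] = wrapper["metaData"]
    return parsed


result = parse_query_response(SAMPLE_RESPONSE)
urls = [f["url"] for f in result["files"]]
```

The response carries only metadata and URLs; the Parquet data itself is fetched separately from each `url`.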

Tornike Gurgenidze

06/10/2023, 10:35 AM
Thanks for the answer.
1) Of course it can be extended beyond the minimum requirements, but the exclusion from the protocol means that available clients probably won't make it convenient to make such calls. For example, the Python client's `load_as_pandas` method expects share.schema.table as an input. One can write their own client or perhaps somehow cram the whole JSON into one of the share/schema/table arguments, but neither of those solutions seems ideal to me.
2) I may have miscommunicated my intention here. The fact that the server has to generate pre-signed URLs right away means that the Parquet files need to be already in place in (for example) an S3 bucket, doesn't it? My goal is to have a server that receives a request and internally spawns off another process (a Spark job, maybe) that materializes the Parquet files in S3 first and returns the response with pre-signed URLs afterwards. As data materialization may take some time, this probably should not happen in a single HTTP call.
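For illustration, the "cram the JSON into the table argument" workaround from point 1 could be sketched as below. It relies on the Python client addressing a table as `<profile-file>#<share>.<schema>.<table>`; the helpers `table_url`, `encode_request`, and `decode_request` are hypothetical, and whether a given server or client would accept such a table name is an untested assumption.

```python
import base64
import json


def table_url(profile_path, share, schema, table):
    """Build the '<profile-file>#<share>.<schema>.<table>' string the client expects."""
    return f"{profile_path}#{share}.{schema}.{table}"


def encode_request(request):
    """Pack a JSON request into a dot-free token usable as a table name."""
    raw = json.dumps(request, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # URL-safe base64 uses only letters, digits, '-' and '_', so no '.' appears.
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_request(name):
    """Server side: recover the JSON request from the table name."""
    padded = name + "=" * (-len(name) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


request = {"op": "aggregate", "by": ["region"]}
url = table_url("config.share", "share1", "default", encode_request(request))
# The client call would then be: delta_sharing.load_as_pandas(url)
```

This keeps existing clients usable, but it hides the real request inside an opaque name, which is the awkwardness the message describes.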

shingo

06/10/2023, 1:11 PM
Thank you for the clarification, and I apologize for misunderstanding your point. Personally, I prefer following the KISS principle. Having a separate job scheduling/runner system responsible for creating Delta tables would seem nice to me. However, if there are other useful use cases to consider, it might be worth exploring those as well. I am interested in hearing what other people think about this.