Andrew Plunk

05/05/2023, 6:51 PM
Hello all. I'm planning to use delta-rs with datafusion for a streaming sql processing use case. Because my query nodes are going to frequently re-process queries over a time window, a local filesystem cache is a must. I would be interested in implementing https://github.com/delta-io/delta-rs/issues/769 but could use guidance on the design. The current questions that I have are:
• Does the scala implementation implement a local cache feature?
• Would it be safe to design a system that memoizes calls to get and get_range in the ObjectStore trait, from a delegate ObjectStore instance, to a local cache directory?
• Should the cache have a configurable expiry? If so is it size based, or time based, or both?

Will Jones

05/05/2023, 6:54 PM
I don’t think we are certain on what the best approach is, but my suggestion is to create a wrapping object store that caches the get and get_range calls like you suggest, and writes them to a local temp directory. I’d recommend reading through the limit store implementation to see how to create a wrapper that is generic over the inner store. https://github.com/apache/arrow-rs/blob/master/object_store/src/limit.rs
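
A minimal sketch of the wrapper idea described above, assuming the objects behind a given path never change after they are written. `ByteStore` is a hypothetical, synchronous stand-in for the get_range part of the ObjectStore trait (the real trait is async and has more methods), and `DiskCachedStore` and its cache file naming are illustrative only, not part of delta-rs or object_store.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::ops::Range;
use std::path::PathBuf;

/// Hypothetical stand-in for the `get_range` part of an object store.
pub trait ByteStore {
    fn get_range(&self, location: &str, range: Range<usize>) -> io::Result<Vec<u8>>;
}

/// Wrapper that is generic over the inner store and memoizes ranges on disk.
pub struct DiskCachedStore<S: ByteStore> {
    inner: S,
    cache_dir: PathBuf,
}

impl<S: ByteStore> DiskCachedStore<S> {
    pub fn new(inner: S, cache_dir: PathBuf) -> io::Result<Self> {
        fs::create_dir_all(&cache_dir)?;
        Ok(Self { inner, cache_dir })
    }

    /// Access to the wrapped store, e.g. for inspection in tests.
    pub fn inner(&self) -> &S {
        &self.inner
    }

    /// One cache file per (location, range). The location is hashed so that
    /// slashes in object paths never reach the local filesystem. A real
    /// implementation would need a collision-safe, version-stable key.
    fn cache_path(&self, location: &str, range: &Range<usize>) -> PathBuf {
        let mut hasher = DefaultHasher::new();
        location.hash(&mut hasher);
        let name = format!("{:016x}-{}-{}", hasher.finish(), range.start, range.end);
        self.cache_dir.join(name)
    }
}

impl<S: ByteStore> ByteStore for DiskCachedStore<S> {
    fn get_range(&self, location: &str, range: Range<usize>) -> io::Result<Vec<u8>> {
        let path = self.cache_path(location, &range);
        // Cache hit: serve the bytes from the local file.
        if let Ok(bytes) = fs::read(&path) {
            return Ok(bytes);
        }
        // Cache miss: delegate to the inner store, then persist the result.
        let bytes = self.inner.get_range(location, range)?;
        fs::write(&path, &bytes)?;
        Ok(bytes)
    }
}
```

Because the wrapper implements the same trait as the store it wraps, callers do not need to know whether a cache is present. This sketch has no eviction and no handling of overlapping ranges; both would need design decisions in a real implementation.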

Andrew Plunk

05/05/2023, 6:56 PM
That is very helpful, thank you for the link @Will Jones

Will Jones

05/05/2023, 6:56 PM
As far as expiry, I think we would definitely want some bound on the size. For time, many use cases would be fine with it living in a temporary directory whose lifetime is associated with the process (but you could also provide an option to have the cache persist beyond that)
👍 1
I’m not sure about the scala implementation. I think it’s likely their object store interaction is structured quite differently than ours, but it still may be instructive
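
One way the size bound mentioned above could look, as a rough in-memory sketch: a byte budget with least-recently-used eviction. The `SizeBoundedCache` type and its methods are hypothetical names for illustration, and LRU is an assumed policy, since the thread does not settle on one. A disk-backed cache would track file sizes the same way.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical cache bounded by total bytes, evicting least recently used.
pub struct SizeBoundedCache {
    max_bytes: usize,
    used_bytes: usize,
    entries: HashMap<String, Vec<u8>>,
    order: VecDeque<String>, // front = least recently used
}

impl SizeBoundedCache {
    pub fn new(max_bytes: usize) -> Self {
        Self {
            max_bytes,
            used_bytes: 0,
            entries: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    pub fn used_bytes(&self) -> usize {
        self.used_bytes
    }

    /// Look up an entry and mark it as most recently used.
    pub fn get(&mut self, key: &str) -> Option<&[u8]> {
        if self.entries.contains_key(key) {
            // O(n) reordering; fine for a sketch, too slow for large caches.
            self.order.retain(|k| k.as_str() != key);
            self.order.push_back(key.to_string());
        }
        self.entries.get(key).map(|v| v.as_slice())
    }

    /// Insert an entry, evicting old ones until the byte budget is met.
    pub fn put(&mut self, key: &str, value: Vec<u8>) {
        if value.len() > self.max_bytes {
            return; // larger than the whole budget: do not cache
        }
        self.remove(key);
        while self.used_bytes + value.len() > self.max_bytes {
            match self.order.pop_front() {
                Some(oldest) => {
                    if let Some(old) = self.entries.remove(&oldest) {
                        self.used_bytes -= old.len();
                    }
                }
                None => break,
            }
        }
        self.used_bytes += value.len();
        self.order.push_back(key.to_string());
        self.entries.insert(key.to_string(), value);
    }

    fn remove(&mut self, key: &str) {
        if let Some(old) = self.entries.remove(key) {
            self.used_bytes -= old.len();
            self.order.retain(|k| k.as_str() != key);
        }
    }
}
```

A time-based expiry could be layered on by storing an insertion timestamp per entry, but for a cache whose lifetime is tied to the process, the size bound alone may be enough.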

Andrew Plunk

05/05/2023, 6:59 PM
Is the best path forward to read through LimitStore and produce a small RFC document w/ the proposed implementation before starting on code?

Will Jones

05/05/2023, 7:03 PM
If you want more input on the design, sure. But I think we’d be just as happy seeing a rough prototype with some benchmarks as well
Depends how many open questions you think there are.

Andrew Plunk

05/05/2023, 7:06 PM
Ok, I will start with a rough prototype then. Thanks again!
🙌 1

rtyler

05/05/2023, 7:27 PM
As a user of delta cache I have... thoughts about it that are not necessarily positive 😆 I was exploring some code last week which was basically doing caching and mmapping of object store data. Nothing useful there unfortunately. The big gripe that I have with delta cache is that there are no metrics coming out of it, so if we're to add a cache, we gotta have it instrumented to see what the hit rate is
👍 2
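
The instrumentation asked for above could start as small as a pair of counters. This is a sketch only: `CacheMetrics` and its method names are hypothetical, and a real integration would presumably export the numbers through whatever metrics system the application already uses.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical hit/miss counters that a cache wrapper could update.
#[derive(Default)]
pub struct CacheMetrics {
    hits: AtomicU64,
    misses: AtomicU64,
}

impl CacheMetrics {
    /// Call when a request is served from the cache.
    pub fn record_hit(&self) {
        self.hits.fetch_add(1, Ordering::Relaxed);
    }

    /// Call when a request falls through to the inner store.
    pub fn record_miss(&self) {
        self.misses.fetch_add(1, Ordering::Relaxed);
    }

    /// Fraction of lookups served from the cache, or None before any lookup.
    pub fn hit_rate(&self) -> Option<f64> {
        let hits = self.hits.load(Ordering::Relaxed);
        let misses = self.misses.load(Ordering::Relaxed);
        let total = hits + misses;
        if total == 0 {
            None
        } else {
            Some(hits as f64 / total as f64)
        }
    }
}
```

Atomics let the counters be shared across threads behind a plain shared reference, so the cache's read path does not need a lock just to record a hit.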

Will Jones

05/05/2023, 7:41 PM
Yeah, for certain queries it might not help much. The better we are at filter pushdown, the fewer cache hits you are likely to get if your queries vary by their filters. But if they vary higher up in the operation chain then maybe it’s more worth it