Xiaohan Zhang

08/29/2023, 4:49 PM
Hi, I hope this is the right place for my question. In Mosaic, we need to design a random-access algorithm to fetch a record from a Databricks Delta table based on its ID (any unique ID, like a row number). Since it's used in LLM training, we need the query latency minimized to the order of 10^-1 record/thread. My current approach is to build a lookup table beforehand that maps ID --> (parquet file path, row group id). Then, given an ID, I can read just the specific row group of that parquet file. However, this is still slow, since I still need to read the whole column chunk into a pyarrow table and then slice it to get the row I queried. I wonder if delta-rs has any functions that allow fast single-record fetching, ideally with O(1) complexity, or a way to access page-level metadata so I can build the lookup table with even finer granularity. E.g., before LLM training, I could build a mapping ID -> (parquet file, row group id, page offset); then I would only need to read a page instead of the whole column chunk. Any suggestion is appreciated~
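For readers following along, here is a minimal sketch of the ID --> (parquet file path, row group id) lookup described above. It is an illustration under stated assumptions, not the code actually used in Mosaic: the names `build_index`, `locate`, and `RowGroupRef` are hypothetical, IDs are assumed to be consecutive row numbers in table order, and the per-row-group row counts are assumed to have been collected beforehand (for example from each file's parquet footer metadata). The lookup is a binary search, so it is O(log n) in the number of row groups rather than O(1).

```python
from bisect import bisect_right
from itertools import accumulate
from typing import NamedTuple


class RowGroupRef(NamedTuple):
    """Hypothetical result type: where a record lives."""
    path: str
    row_group: int
    offset: int  # position of the record inside its row group


def build_index(row_counts):
    """Build the lookup table.

    row_counts: list of (path, [rows in each row group]) in table order.
    Assumed to be gathered ahead of time, e.g. from parquet footers.
    """
    refs, counts = [], []
    for path, groups in row_counts:
        for rg, n in enumerate(groups):
            refs.append((path, rg))
            counts.append(n)
    ends = list(accumulate(counts))  # exclusive end ID of each row group
    return refs, ends


def locate(index, record_id):
    """Map a global row number to (path, row group, offset in row group)."""
    refs, ends = index
    total = ends[-1] if ends else 0
    if not 0 <= record_id < total:
        raise IndexError(record_id)
    pos = bisect_right(ends, record_id)  # first row group whose end > id
    start = ends[pos - 1] if pos else 0
    path, rg = refs[pos]
    return RowGroupRef(path, rg, record_id - start)


index = build_index([("a.parquet", [3, 2]), ("b.parquet", [4])])
print(locate(index, 4))
```

With the result in hand, one would read only that row group and slice at `offset`, which is exactly the step the question identifies as the remaining bottleneck.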

Will Jones

08/29/2023, 4:55 PM
I don't think delta-rs exposes this, but you might be able to accomplish this with some of the low-level Rust parquet APIs. RowSelection is what's used to prune pages: https://docs.rs/parquet/46.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html It can be passed to the reader here: https://docs.rs/parquet/46.0.0/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_selection
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/ describes the page pruning they are able to accomplish with this
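To make the idea behind a row selection concrete, here is a small Python sketch. It is not the Rust parquet API and not a binding to it: `to_selectors` is a hypothetical helper that only illustrates the concept of describing the wanted rows as alternating skip/select runs, which is the shape the Rust `RowSelection` takes as a list of `RowSelector` values. The trailing skip is emitted only for explicitness.

```python
def to_selectors(row_ids, total_rows):
    """Turn row indices into alternating ("skip", n) / ("select", n) runs.

    Illustration only: mirrors the skip/select structure of a row
    selection, without touching any parquet file.
    """
    selectors = []
    cursor = 0  # next row not yet covered by a run
    for rid in sorted(set(row_ids)):
        if not 0 <= rid < total_rows:
            raise IndexError(rid)
        gap = rid - cursor
        if gap:
            selectors.append(("skip", gap))
        if selectors and selectors[-1][0] == "select":
            # Adjacent wanted rows extend the current select run.
            selectors[-1] = ("select", selectors[-1][1] + 1)
        else:
            selectors.append(("select", 1))
        cursor = rid + 1
    if cursor < total_rows:
        selectors.append(("skip", total_rows - cursor))
    return selectors


print(to_selectors([5], 10))
```

For a single-record fetch, the selection is one skip, one select of length 1, and a final skip, which is what lets a reader with page-level indexes avoid decoding pages that contain no selected rows.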

Xiaohan Zhang

08/29/2023, 7:20 PM
Oh wow, exactly what I was looking for! Thanks @Will Jones. A follow-up question: the LLM training and data streaming are implemented in Python. What is the best way to call the Rust function (with_row_selection()) from our Python library? Is there a Python wrapper available?

Will Jones

08/29/2023, 8:03 PM
There isn't one available right now; I think you'd have to write your own Python wrapper

Xiaohan Zhang

08/29/2023, 8:14 PM
@Will Jones I assume writing the wrapper requires using pyo3? Is it the de facto approach? I haven't used Rust before, but I won't mind onboarding with a new language ~~

Will Jones

08/29/2023, 8:15 PM
"Is it the de facto approach?"
Yep. See the maturin project for how to set it up.