shingo
07/25/2023, 5:38 AM
# Who am I?
- I donated a Rust implementation of the Delta Sharing Protocol, a data sharing
protocol, to the Delta Lake community.
- Since then, I have been actively involved with developers worldwide in
implementing the Delta Lake Transaction Protocol in Rust.
# What is delta-rs?
- Delta Lake is an open-source storage framework (open table format) that
enables building a Lakehouse architecture with compute engines including
Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust,
Ruby, and Python.
- delta-rs provides low-level access to Delta tables in Rust, which can be
used with data processing frameworks like datafusion, ballista, polars, vega, etc.
It also provides bindings for higher-level languages such as Python.
# What is a Table Format?
- Table formats are a way to organize data files.
- They try to bring database-like features to the data lake.
  - Serializable ACID writes
  - Snapshot isolation for reads
  - Scalability to billions of partitions or files
  - Self-describing
  - Support for incremental processing
- Delta Lake offers some interesting/unique features.
  - Delta Sharing Protocol
  - Commit versioning / time travel
  - Checkpoints
  - Change data capture (CDC) feature
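
Commit versioning, time travel, and checkpoints can be illustrated with a small sketch. It assumes a heavily simplified commit log: each version is a list of JSON lines holding only `add` and `remove` actions, whereas a real Delta log carries more action types and metadata. `COMMITS`, `apply_commits`, `snapshot`, and `snapshot_from_checkpoint` are hypothetical names for illustration, not part of delta-rs or any Delta Lake API.

```python
import json

# Hypothetical, simplified commit log: one list of JSON action lines per version.
# A real Delta table keeps these under _delta_log/ and has more action types.
COMMITS = [
    ['{"add": {"path": "part-0.parquet"}}'],
    ['{"add": {"path": "part-1.parquet"}}'],
    ['{"remove": {"path": "part-0.parquet"}}',
     '{"add": {"path": "part-2.parquet"}}'],
]


def apply_commits(live, commits):
    """Apply add/remove actions in order to the set of live data files."""
    for actions in commits:
        for line in actions:
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


def snapshot(commits, version):
    """Time travel: replay the log from version 0 up to `version` (inclusive)."""
    if not 0 <= version < len(commits):
        raise ValueError(f"unknown version {version}")
    return apply_commits(set(), commits[: version + 1])


def snapshot_from_checkpoint(checkpoint_files, checkpoint_version, commits, version):
    """Start from a saved file list and replay only the later commits."""
    if not checkpoint_version <= version < len(commits):
        raise ValueError(f"version {version} is not reachable from the checkpoint")
    later = commits[checkpoint_version + 1 : version + 1]
    return apply_commits(set(checkpoint_files), later)


if __name__ == "__main__":
    for v in range(len(COMMITS)):
        print(v, sorted(snapshot(COMMITS, v)))
```

Reading an old version is just replaying fewer commits, and a checkpoint is a shortcut that avoids replaying the log from the beginning.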
# Are all these table format features only for "big" data?
- No!
- These features are useful for "not-so-big" data as well.
- But we need to reduce the "surge impedance" between "big" and "not-so-big" data.
- Rust!
# Why Rust is gaining attention in the field of data engineering
- Tools are chosen according to the scale of the data, ranging from the
Spark-centered ecosystem running on clusters to pandas running on a laptop.
- So what kind of ecosystem suits data of intermediate size, say several terabytes?
- One of the end goals of the Rust data engineering ecosystem is to essentially
be a drop-in replacement for the worker nodes of Spark jobs.
- Right now, Spark worker nodes all run JVM bytecode.
- But imagine if your Scala, Python, C#, or SQL code instead produced a native binary
with better performance and deterministic memory management.
- No more random heap allocation issues, no more "oh just try it again".
# Demo / kafka-delta-ingest + delta-rs + datafusion (or pandas?)
# Demo / Delta Sharing
# Some other interesting Rust projects
- ROAPI
- delta-collaboration (if possible)
# Demo / Delta Collaboration (if possible)
# Rust as a core language for inter-language communication
- Rust is loved by lots of developers!
- Performance and low memory footprint.
  - Lots of your workloads are simple and short-lived.
  - Imagine deploying your workloads on AWS/Fargate.
- Memory safety is justice.
  - I found that some famous OSS projects related to data streaming still have memory leaks.
- Interoperability with other languages.
  - Python has a way bigger ecosystem.
  - Introducing PyO3/maturin.
- Low electricity consumption.
# Delta Kernel and Table/Memory Format Interoperability
- The Delta Kernel will abstract out all of the Delta protocol details.
- It provides a narrow and stable set of Table APIs for rapid connector development
and easier maintenance for connector authors.
- I started the Delta Sharing Protocol server in Rust just for my own learning purposes.
- Now, implementing delta-sharing-rs on top of the native Rust implementation instead
of the Python binding would be a more convincing choice.
- Besides the Delta Kernel's universal approach for the Rust ecosystem, the interoperable
nature of supporting technologies such as Apache Arrow and Parquet will unlock the
true potential of the Rust data engineering ecosystem.
# References
- [D3L2: How BASF achieves global sustainability with Delta Lake w/ Robert Pack](<https://www.youtube.com/watch?v=yJshLUcbHGk&pp=ygUEZDNsMg%3D%3D>)
- [D3L2: Discussing Rust, Ballista, Ray SQL, Data Fusion with Andy Grove](<https://www.youtube.com/watch?v=NEL6DluUxgw&t=1612s&pp=ygUEZDNsMg%3D%3D>)
- [D3L2: The Inception of Delta Rust](<https://www.youtube.com/watch?v=2jgfpJD5D6U&pp=ygUEZDNsMg%3D%3D>)
- [D3L2: The Genesis of Delta Rust with QP Hou](<https://www.youtube.com/watch?v=ZQdEdifcBh8&t=2084s&pp=ygUEZDNsMg%3D%3D>)
- [D3L2: delta-rs at Back Market: Python and Rust, the best of both worlds](<https://www.youtube.com/watch?v=7mPbgJajHR0&pp=ygUEZDNsMg%3D%3D>)
- [Python Devs, It's Time To Get On The Rust Bandwagon!](<https://www.youtube.com/watch?v=j_1uUbxDWjY&t=231s&pp=ygUPZGVsdGEgbGFrZSBydXN0>)
- [Why Delta Lake is the Best Storage Format for Pandas Analyses](<https://www.databricks.com/dataaisummit/session/why-delta-lake-best-storage-format-pandas-analyses/>)
- [Delta-rs, Apache Arrow, Polars, WASM: Is Rust the Future of Analytics?](<https://www.databricks.com/dataaisummit/session/delta-rs-apache-arrow-polars-wasm-rust-future-analytics/>)
- [Focusing on the Developer and Partnering with the Linux Foundation Helps..- Denny Lee & Carly Akerly](<https://www.youtube.com/watch?v=IC2oOhIjM8g&t=721s&pp=ygUTZGFpczIwMjMgZGF0YWJyaWNrcw%3D%3D>)