shingo

07/25/2023, 5:38 AM
# Who am I?
 - I donated a Rust implementation of the Delta Sharing Protocol, a protocol
   for sharing data, to the Delta Lake community.
 - Since then, I have been working with developers worldwide on
   implementing the Delta Lake Transaction Protocol in Rust.

# What is delta-rs?
 - Delta Lake is an open-source storage framework (open table format) that
   enables building a Lakehouse architecture with compute engines including
   Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust,
   Ruby, and Python.
 - delta-rs provides low-level access to Delta tables in Rust, which can be
   used with data processing frameworks like datafusion, ballista, polars, vega, etc.
   It also provides bindings for higher-level languages such as Python.

# What is a Table Format?
 - Table formats are a way to organize data files.
 - They try to bring database-like features to the data lake.
   - Serializable ACID Writes
   - Snapshot Isolation for Reads
   - Scalability to billions of partitions or files
   - Self-describing
   - Support for incremental processing
 - Delta Lake offers some interesting/unique features.
   - Delta Sharing Protocol
   - Commit versioning / Time travel
   - Checkpoints
   - Change data capture (CDC)
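
Commit versioning and time travel can be illustrated with a toy transaction log. This is a minimal sketch using only the Python standard library, not delta-rs itself. It follows the Delta convention of one newline-delimited JSON commit file per version under `_delta_log/`, named with a zero-padded 20-digit version number. It keeps only the `path` field of the `add` and `remove` actions, whereas real actions carry more metadata. The helper names `commit` and `files_at` and the table name `events` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def commit(table: Path, version: int, actions: list[dict]) -> None:
    """Write one commit as newline-delimited JSON actions."""
    log_dir = table / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    # Commit files are named by a zero-padded 20-digit version number.
    commit_file = log_dir / f"{version:020d}.json"
    # Mode "x" fails if the file already exists, so each version is written once.
    with commit_file.open("x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def files_at(table: Path, version: int) -> set[str]:
    """Replay commits 0..version and return the data files live at that version."""
    live: set[str] = set()
    for v in range(version + 1):
        commit_file = table / "_delta_log" / f"{v:020d}.json"
        for line in commit_file.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Hypothetical table with three commits; no Parquet files are actually written.
table = Path(tempfile.mkdtemp()) / "events"
commit(table, 0, [{"add": {"path": "part-0.parquet"}}])
commit(table, 1, [{"add": {"path": "part-1.parquet"}}])
commit(table, 2, [{"remove": {"path": "part-0.parquet"}},
                  {"add": {"path": "part-2.parquet"}}])

print(sorted(files_at(table, 1)))  # time travel to version 1
print(sorted(files_at(table, 2)))  # latest version
```

Reading an older version is just replaying fewer commits, which is why time travel comes almost for free. In a real table, checkpoints periodically store the replayed state so that a reader does not have to start from version 0.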

# Are all these table format features only for "big" data?
 - No!
 - These features are useful for "not-so-big" data as well.
 - But we need to reduce the "surge impedance" between "big" and "not-so-big" data.
 - Rust!

# Why Rust is gaining attention in data engineering
 - Tools are chosen to match the scale of the data, ranging from the Spark-centered
   ecosystem running on clusters to pandas running on a laptop.
 - So what kind of ecosystem suits data of intermediate size, say several terabytes?
 - One of the end goals of the Rust data engineering ecosystem is to essentially
   be a drop-in replacement for the worker nodes of Spark jobs.
 - Right now, Spark worker nodes all run JVM bytecode.
 - But imagine if your Scala, Python, C#, or SQL code instead compiled to a native binary
   with better performance and deterministic memory management.
 - No more random heap allocation issues, no more "oh just try it again".

# Demo / kafka-delta-ingest + delta-rs + datafusion (or pandas?)

# Demo / Delta Sharing
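
Outside the live demo, the client side of the protocol can be sketched as follows. This is a minimal illustration, not the delta-sharing-rs implementation. It assumes the profile file fields (`shareCredentialsVersion`, `endpoint`, `bearerToken`), the `<profile-path>#<share>.<schema>.<table>` table URL form, and the table query endpoint from the Delta Sharing protocol. The endpoint host, token, and the share, schema, and table names are made up, and the helpers `parse_table_url` and `query_request` are hypothetical. No network call is made; the code only builds the request description.

```python
import json
from urllib.parse import quote


def parse_table_url(url: str) -> tuple[str, str, str, str]:
    """Split '<profile-path>#<share>.<schema>.<table>' into its parts."""
    profile_path, _, fragment = url.rpartition("#")
    share, schema, table = fragment.split(".")
    return profile_path, share, schema, table


def query_request(profile: dict, share: str, schema: str, table: str) -> dict:
    """Describe the HTTP request that asks the server for a table's data files."""
    base = profile["endpoint"].rstrip("/")
    path = (f"/shares/{quote(share)}/schemas/{quote(schema)}"
            f"/tables/{quote(table)}/query")
    return {
        "method": "POST",
        "url": base + path,
        # The bearer token from the profile authenticates the recipient.
        "headers": {"Authorization": f"Bearer {profile['bearerToken']}"},
    }


# Hypothetical profile; a real one is issued by the data provider.
profile = json.loads("""
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<token>"
}
""")

_, share, schema, table = parse_table_url("config.share#sales.public.orders")
request = query_request(profile, share, schema, table)
print(request["method"], request["url"])
```

The server answers such a query with short-lived URLs to the underlying Parquet files, so the recipient needs neither Spark nor direct access to the provider's storage account.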

# Some other interesting Rust projects
 - ROAPI
 - delta-collaboration (if possible)

# Demo / Delta Collaboration (if possible)

# Rust as a core language for inter-language communication
 - Rust is loved by a great many developers!
 - Performance and low memory footprint.
   - Many workloads are simple and short-lived.
   - Imagine deploying your workloads on AWS Fargate.
 - Memory safety is justice.
   - I found that some well-known OSS data streaming projects still had memory leaks.
 - Interoperability with other languages.
   - Python has a much bigger ecosystem.
   - Introducing PyO3/maturin.
 - Low electricity consumption.

# Delta Kernel and Table/Memory Format Interoperability
 - The Delta Kernel will abstract out all of the Delta protocol details.
 - It provides a narrow and stable set of Table APIs for rapid connector development
   and easier maintenance for connector authors.
 - I started the Delta Sharing Protocol server in Rust just for my own learning purposes.
   - Now, implementing delta-sharing-rs on top of the native Rust implementation instead
     of the Python binding would be a more convincing choice.
 - Besides the Delta Kernel's universal approach for the Rust ecosystem, the interoperable
   nature of supporting technologies such as Apache Arrow and Parquet will unlock the
   true potential of the Rust data engineering ecosystem.

# References
 - [D3L2: How BASF achieves global sustainability with Delta Lake w/ Robert Pack](<https://www.youtube.com/watch?v=yJshLUcbHGk&pp=ygUEZDNsMg%3D%3D>)
 - [D3L2: Discussing Rust, Ballista, Ray SQL, Data Fusion with Andy Grove](<https://www.youtube.com/watch?v=NEL6DluUxgw&t=1612s&pp=ygUEZDNsMg%3D%3D>)
 - [D3L2: The Inception of Delta Rust](<https://www.youtube.com/watch?v=2jgfpJD5D6U&pp=ygUEZDNsMg%3D%3D>)
 - [D3L2: The Genesis of Delta Rust with QP Hou](<https://www.youtube.com/watch?v=ZQdEdifcBh8&t=2084s&pp=ygUEZDNsMg%3D%3D>)
 - [D3L2: delta-rs at Back Market: Python and Rust, the best of both worlds](<https://www.youtube.com/watch?v=7mPbgJajHR0&pp=ygUEZDNsMg%3D%3D>)
 - [Python Devs, It's Time To Get On The Rust Bandwagon!](<https://www.youtube.com/watch?v=j_1uUbxDWjY&t=231s&pp=ygUPZGVsdGEgbGFrZSBydXN0>)
 - [Why Delta Lake is the Best Storage Format for Pandas Analyses](<https://www.databricks.com/dataaisummit/session/why-delta-lake-best-storage-format-pandas-analyses/>)
 - [Delta-rs, Apache Arrow, Polars, WASM: Is Rust the Future of Analytics?](<https://www.databricks.com/dataaisummit/session/delta-rs-apache-arrow-polars-wasm-rust-future-analytics/>)
 - [Focusing on the Developer and Partnering with the Linux Foundation Helps..- Denny Lee & Carly Akerly](<https://www.youtube.com/watch?v=IC2oOhIjM8g&t=721s&pp=ygUTZGFpczIwMjMgZGF0YWJyaWNrcw%3D%3D>)