
ldacey

08/10/2023, 10:10 PM
not directly related to delta lake, but I have seen several libraries use rust lately and I am using a few of them myself (polars, datafusion, pyarrow has a rust version, delta-rs and an iceberg library, etc). is rust good for data related tasks in general? I have been using python for years and wanted to learn a new language but I want to actually use it day to day. I saw some memes about "write it in rust" so I was not sure how seriously I should look into it

Will Jones

08/10/2023, 10:15 PM
I think if you were previously considering using C++/Cython for some task, I'd definitely recommend Rust over those. But I don't think I'd recommend using Rust over just composing existing libraries in Python.

rtyler

08/14/2023, 2:44 PM
I would say that I'm a little more zealous about "write it in Rust" 😆 For me it comes down to a question of requirements. @Will Jones is absolutely right that for many use cases in the data world you can compose with existing Python tools quite well.
👍 1
I default to Rust because I'm usually chasing high performance. I did some data ingestion work on Friday (experimenting) and my first draft using tokio was able to ingest data maximizing concurrency of the machine and reading data at just about line-speed (15Gbps). That said, getting comfy with Rust took me a long time, so if your time/effort is limited, Python FTW.
💯 1

chris fish

08/14/2023, 7:06 PM
i would say rust is becoming more and more accessible for data related tasks, but it is still missing a ton of higher level libraries compared to java/python. if you want to use a library like Pandas, it does not exist in Rust in a complete/stable way. the overall ecosystem of higher level libraries in python/java is quite complete. the benefit of Rust is that it's inherently built for concurrency/parallel processing, and as a native language with direct memory control, can be extremely efficient compared to Java, but with all the benefits of strong type safety. Rust is an excellent backend language and probably the first genuine potential successor to C++, and is starting to ramp up the hype curve. that's why you see all these memes of "just write it in rust!"
@Kurt Fehlhauer writes a good blog - https://kurtfehlhauer.com/posts/exploring-rust-for-data-engineering-part-1/ and i think you can see both some of the benefits of using Rust, but also how the ecosystem just isn't very complete yet. instead of using a standard high level JSON parsing library, you have to do a fair bit of low level coding to be able to process JSON data. is it terribly complicated? No. but do i want to write 100 lines of code just to load 1 json file with a really simple schema? also No. in python, pandas, or even Spark, the equivalent is like 5 lines of code.
i am very excited for Spark written in Rust, Pandas written in Rust, but this is mainly for the backend improvements, better concurrency, better memory efficiency. i actually don't care if the user-facing API is in Rust, Python, Scala, Java, SQL.
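To make the concurrency point above concrete, here is a minimal sketch using only the Rust standard library. The `parallel_sum` function and its chunking scheme are hypothetical, written for illustration and not taken from the blog post or any library mentioned in this thread. It assumes Rust 1.63 or newer, where scoped threads are available.

```rust
use std::thread;

/// Sum a slice by splitting it into chunks and summing each chunk on its own
/// thread. `parallel_sum` is a hypothetical helper for this sketch.
fn parallel_sum(values: &[u64], workers: usize) -> u64 {
    // Guard against zero workers and zero-length chunks (`chunks(0)` panics).
    let workers = workers.max(1);
    let chunk_len = ((values.len() + workers - 1) / workers).max(1);

    // Scoped threads may borrow `values` directly: the compiler checks that
    // every thread is joined before the borrow ends, so no copy or Arc is needed.
    thread::scope(|scope| {
        let handles: Vec<_> = values
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().sum::<u64>()))
            .collect();

        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1_000).collect();
    println!("total = {}", parallel_sum(&data, 4));
}
```

The part that is specific to Rust is the borrow: sharing `values` across threads without synchronization is accepted only because the borrow checker can prove the threads never outlive the data.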

Will Jones

08/14/2023, 7:21 PM
"instead of using a standard high level JSON parsing library, you have to do a fair bit of low level coding to be able to process JSON data. is it terribly complicated?"
TBH, the blog post makes it look a little more complicated than it needs to be. Both DataFusion and Polars have the ability to read JSON (while inferring the schema) directly into Arrow data. So you could write equivalent code in Rust.
IMO the main difference between Rust and Python will be that Rust doesn't have any REPL. So if you want to do any interactive exploration or literate programming (notebooks), not the best language. That being said, there's plenty of data engineering tasks that don't require that, and where performance is more important.
I'd also add that if you are interested in exploring Rust and currently using Python, a good place to start is contributing to open source Python libraries that are written in Rust 😉

Jordan Fox

08/14/2023, 9:12 PM
Yeah was reading through the thread and was like, pretty sure both datafusion and polars do that. I'd suggest personally working on one of those projects to get a sense of how rust works with data.

Hugo Saavedra

08/15/2023, 4:35 AM
(being somewhat new to Rust) I'm surprised to not see the added guardrails from the type system as a possible benefit of working in Rust -- it feels like so many of the issues I find myself debugging when working on Python pipelines and data systems reduce to weird typing issues that a language with more guardrails would have immediately caught
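As a rough illustration of the guardrails point above, here is a small std-only sketch. The `Event` struct, the `parse_event` function, and the "user_id,action" line format are all hypothetical, invented for this example. The idea is that a malformed value is rejected where the row is parsed, and the caller has to handle that case explicitly.

```rust
/// A typed record: `user_id` must be an unsigned integer, so a malformed
/// value cannot travel further down the pipeline disguised as a string.
#[derive(Debug, PartialEq)]
struct Event {
    user_id: u64,
    action: String,
}

/// Parse one "user_id,action" line (a hypothetical format for this sketch).
/// The `Result` return type forces callers to deal with bad rows.
fn parse_event(line: &str) -> Result<Event, String> {
    let (id, action) = line
        .split_once(',')
        .ok_or_else(|| format!("expected 'user_id,action', got {line:?}"))?;

    let user_id = id
        .trim()
        .parse::<u64>()
        .map_err(|e| format!("bad user_id {id:?}: {e}"))?;

    Ok(Event {
        user_id,
        action: action.trim().to_string(),
    })
}

fn main() {
    for line in ["42,login", "forty-two,login"] {
        match parse_event(line) {
            Ok(event) => println!("ok: {event:?}"),
            Err(err) => println!("rejected: {err}"),
        }
    }
}
```

In an untyped pipeline the second row would typically load as a string and fail later, possibly far from where the bad value came in.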

ldacey

08/15/2023, 3:19 PM
yeah - I was able to swap from pandas to polars for data transformation and I like it a lot. I use pyarrow heavily but there is a rust version (and datafusion). so in the case of reading JSON data I am 100% using rust originated code already

Hugo Saavedra

08/15/2023, 7:07 PM
came across this the other day https://datawithrust.com/ -- hopefully the author continues work on it. maybe getting a few additional subscribers would nudge them? 🙂
👍 1