
Matthew Powers

02/17/2023, 2:19 PM
It seems like most of our users are using the Python bindings. What do we want to call this project when reaching out to the Python community? “Delta Rust” seems off-putting for most Python programmers. They’ll probably think “well, that’s not the project for me”. Should we just call this project “deltalake” when chatting with the Python community? That’s the PyPI package name and seems like a good option. Thoughts?
👍 2

Ian Joiner

02/17/2023, 2:27 PM
Yup though it needs to be differentiated from https://pypi.org/project/delta-spark/.
I agree that we shouldn't mention Rust, as most Python developers do not really care whether what they are using has C++ or Rust underneath Python. NumPy doesn't refer to itself as num-c++ either.
What about deltapy/pydelta ?

Matthew Powers

02/17/2023, 2:32 PM
I like pydelta a lot. For higher-level context for other folks: I am going to be reaching out to pandas users via Reddit, blog posts, and conference talks. It’d be weird for me to say to pandas users “you should try Delta Rust”. Telling pandas users that they should try pydelta would be a much easier sell…
👍 1

Ian Joiner

02/17/2023, 2:34 PM
https://freedelta.sourceforge.net/pydelta/ There is this package though. I guess it is sufficiently different from us that there won't be any confusion.
😑 1

Will Jones

02/17/2023, 4:02 PM
Would calling it “the Python deltalake” / “the native Python deltalake” package help? I mostly differentiate it by saying it doesn’t require a JVM or Spark.
👍 2
I agree that in our documentation, we should probably move away from “the Python bindings to delta-rs”
👍 1

Matthew Powers

02/17/2023, 4:18 PM
I like “Python deltalake” 🤓
👍 3

Ian Joiner

02/17/2023, 4:20 PM
Same. That doesn't seem to have been occupied.
pydeltalake is already occupied as a pure Python package though.

rtyler

02/17/2023, 4:57 PM
I would shy away from pydelta since it bears a lot of resemblance to pyspark. Is "deltalake for python" too simple? 😛
😀 1

Venkat Viswanathan

02/17/2023, 6:43 PM
How about pyrustdl or pyrustdeltalake or pyrustdelta?
🤣 2

Ian Joiner

02/17/2023, 6:49 PM
Let’s leave Rust out of it. Most Python developers do not know either Rust or C++. I mean, there is a reason why most C++ libraries don’t have “ASM” in their names even though manually written assembly is sometimes there.
👍 1
In fact, one disadvantage we have compared with C++/Cython/Python libraries is precisely that the code behind our Python package is actually written in Rust, unlike Cython, which looks mostly like Python and hence is a lot more readable to Python developers. We need to make the Python docs really good, since folks are not going to be able to read the source code at all without learning a new language with a pretty steep learning curve, which in practice rarely happens.
🙌 1

Matthew Powers

02/17/2023, 6:56 PM
Yep, I agree @Ian Joiner. I hope that most users are blissfully unaware that the code is even written in Rust. That’s an implementation detail they shouldn’t have to worry about.
👍 1

Venkat Viswanathan

02/17/2023, 6:57 PM
Another option is to check whether pydeltalake can be extended with the Rust libraries?

Ian Joiner

02/17/2023, 6:57 PM
Having plenty of Python examples will also be great.

Will Jones

02/17/2023, 7:26 PM
I’ll do another revision of the Python docs soon and will change out the “Python bindings for delta-rs” language 👍
👍 2

Robert

02/17/2023, 10:00 PM
While I have no strong feelings about this and in principle agree that the fact it's written in Rust does not need to be front and center, my perception is a bit different. I think the community is well aware that if you want performance, you need native libraries, and that this is the core of NumPy, or rather of the entire data domain, and even more so the ML domain, in Python. Also, there are some extremely popular libraries out there, like polars, ruff, and pydantic (core), that advertise being implemented in Rust. People seem to associate this with performance and reliability, as they should 🙂.
🤔 2
Another angle we may want to take when looking at the target audience is: who has the problems we are solving? I.e., if I am working with a static and reasonably small dataset, pandas might be exactly the right tool if I am used to the APIs etc. On the other hand, if the dataset is large and I can leverage file skipping, then deltalake is the way to go. Also, if my dataset is evolving or I aim to operationalise my data, then Delta is the way to go. Some examples that come to mind are replacing tools like DVC with Delta, or leveraging data versioning together with MLflow recipes ...