Matthew Powers

07/23/2023, 2:38 PM
Anyone know how to read a Delta table into a DataFusion DataFrame with the Python API? Here’s what I tried:
import pathlib

from datafusion import SessionContext
from deltalake import DeltaTable

table = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
ctx = SessionContext()
ctx.create_dataframe(table.to_pyarrow_dataset())
Here’s the error I got:
TypeError: argument 'partitions': 'FileSystemDataset' object cannot be converted to 'PyList'
More info in the thread…
This works with DataFusion:
import pyarrow

batch = pyarrow.RecordBatch.from_arrays(
    [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
    names=["a", "b"],
)
df = ctx.create_dataframe([[batch]])
Is there any way to go from table.to_pyarrow_dataset() => [[batch]]?

Jordan Fox

07/23/2023, 4:59 PM
Have you tried register_table(name, table)?
import pathlib

from datafusion import SessionContext
from deltalake import DeltaTable

table = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
ctx = SessionContext()
ctx.register_table('my_table', table)
Another option would be register_dataset() where you could pass table.to_pyarrow_dataset()
This should also work:
import pathlib

from datafusion import SessionContext
from deltalake import DeltaTable

table = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
ctx = SessionContext()
ctx.create_dataframe(table.to_pyarrow_table().to_batches())
Or with the dataset instead of the table:
import pathlib

from datafusion import SessionContext
from deltalake import DeltaTable

table = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
ctx = SessionContext()
ctx.create_dataframe(table.to_pyarrow_dataset().to_batches())

Matthew Powers

07/23/2023, 7:04 PM
Thanks @Jordan Fox. This worked:
ctx.register_dataset("my_dataset", table.to_pyarrow_dataset())
ctx.sql("select * from my_dataset where v2 > 5")
This didn’t work:
ctx.register_table("my_table", table.to_pyarrow_table())
This didn’t work either:
ctx.create_dataframe(table.to_pyarrow_table().to_batches())
Here’s the notebook if you’d like to take a look: https://github.com/delta-io/delta-examples/blob/master/notebooks/python-deltalake/datafusion-read-delta.ipynb
But one solution is working, so this is great!
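
A minimal sketch of the register_dataset route reported as working above. It assumes the datafusion and pyarrow packages are installed. A small in-memory pyarrow dataset stands in for table.to_pyarrow_dataset(), so the example does not need the Delta table on disk. The helper names query_dataset and demo, the id1 column, and the sample values are hypothetical; only the v2 column and the query shape come from the thread. Imports are deferred into the functions so the sketch can be loaded even where the packages are missing.

```python
def query_dataset(dataset, min_v2: int):
    """Register a pyarrow dataset with DataFusion and filter it with SQL."""
    from datafusion import SessionContext  # deferred import; assumes datafusion is installed

    ctx = SessionContext()
    # register_dataset takes a pyarrow.dataset.Dataset, which is the kind of
    # object the thread passes in via table.to_pyarrow_dataset().
    ctx.register_dataset("my_dataset", dataset)
    df = ctx.sql(f"select * from my_dataset where v2 > {int(min_v2)}")
    return df.collect()  # list of pyarrow.RecordBatch


def demo():
    """Run the query against a small in-memory stand-in for the Delta table."""
    import pyarrow as pa
    import pyarrow.dataset as ds

    # Hypothetical sample data; only the column name v2 appears in the thread.
    stand_in = ds.dataset(pa.table({"id1": ["a", "b", "c"], "v2": [3, 7, 9]}))
    batches = query_dataset(stand_in, min_v2=5)
    return pa.Table.from_batches(batches).to_pydict()
```

With a real Delta table, query_dataset(table.to_pyarrow_dataset(), 5) would take the place of the stand-in.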

Jordan Fox

07/23/2023, 7:12 PM
Interesting that to_pyarrow_table().to_batches() doesn't work. It's how I started working with delta/duckdb/datafusion together. https://duckdb.org/docs/guides/python/datafusion.html
Just realized what I wrote. It's:
[table.to_pyarrow_table().to_batches()]
I'm on mobile; I'll confirm on your notebook.

Matthew Powers

07/24/2023, 12:50 PM
@Jordan Fox - this works: [table.to_pyarrow_table().to_batches()]. Thank you!!
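
Why the extra brackets matter, as a hedged sketch: the TypeError at the top of the thread complains about the `partitions` argument, and the example that worked passed [[batch]], which suggests create_dataframe expects a list of partitions, each of which is a list of record batches. to_batches() returns a flat list of batches, so wrapping it in one more list gives a single partition. The helper names as_single_partition and read_delta_into_datafusion are hypothetical, and the second function assumes datafusion and deltalake are installed.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def as_single_partition(batches: Iterable[T]) -> List[List[T]]:
    """Wrap a flat sequence of record batches as one partition: [[b1, b2, ...]]."""
    return [list(batches)]


def read_delta_into_datafusion(delta_path: str):
    """Load a Delta table into a DataFusion DataFrame (hypothetical helper)."""
    # Deferred imports so as_single_partition stays usable without these packages.
    from datafusion import SessionContext
    from deltalake import DeltaTable

    table = DeltaTable(delta_path)
    ctx = SessionContext()
    # Note: to_pyarrow_table() reads the whole table into memory first.
    batches = table.to_pyarrow_table().to_batches()
    return ctx.create_dataframe(as_single_partition(batches))
```

Because to_pyarrow_table() materializes the full table in memory, the register_dataset approach may be the better fit for large tables such as the one in this thread.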