Add support for a custom format to PySpark using Python
I want to read data from an on-prem Oracle schema (from multiple joined tables) as a Spark stream, similar to what is asked in:
How to create a custom streaming data source?
My question is: is it possible to write this in Python, or does it have to be Java? Is Scala an option too?
As I understand it, I need to create a new implementation of DataSourceV2 that reads from Oracle and stores checkpoint information (so that I can invoke it on a flexible schedule and it will know where to resume the stream from), and then have my PySpark code use the new format and stay clean, like:
# "custom_oracle" is my custom format
streaming_oracle_df = spark.readStream \
    .format("custom_oracle") \
    .option("oracle_jdbc_str", "jdbc:...") \
    .option("custom_option1", 123) \
    .load()
streaming_oracle_df.writeStream \
    .trigger(availableNow=True) \
    .format('delta') \
    .option('checkpointLocation', '<s3://bucket/checkpoint/dim_customer>') \
    .start('<s3://bucket/tables/dim_customer>')
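
For concreteness, below is a rough sketch of the kind of Python implementation I have in mind. It is written against the Python Data Source API (pyspark.sql.datasource) that Spark 4.0 added; the class names, the change_id column, and the two Oracle helper stubs (query_max_change_id, fetch_joined_rows) are hypothetical placeholders of mine, not a working Oracle connector:

from pyspark.sql.datasource import DataSource, DataSourceStreamReader, InputPartition


def query_max_change_id(jdbc_str):
    # Hypothetical stub: would ask Oracle for the current high-water mark,
    # e.g. SELECT MAX(change_id) over the joined tables. Returns 0 here so
    # the sketch is self-contained.
    return 0


def fetch_joined_rows(jdbc_str, start, end):
    # Hypothetical stub: would run the multi-table join filtered to the
    # (start, end] change range and yield each row as a tuple.
    return iter([])


class RangePartition(InputPartition):
    # One partition covering a (start, end] range of change ids.
    def __init__(self, start, end):
        self.start = start
        self.end = end


class OracleStreamReader(DataSourceStreamReader):
    def __init__(self, schema, options):
        self.jdbc_str = options.get("oracle_jdbc_str")

    def initialOffset(self):
        # Starting point for a brand-new stream. Spark persists committed
        # offsets in the checkpoint, so a restarted query resumes from the
        # last committed offset rather than calling this again.
        return {"change_id": 0}

    def latestOffset(self):
        return {"change_id": query_max_change_id(self.jdbc_str)}

    def partitions(self, start, end):
        # A single partition per micro-batch, for simplicity.
        return [RangePartition(start["change_id"], end["change_id"])]

    def read(self, partition):
        # Yield each row in this partition's change range as a tuple.
        yield from fetch_joined_rows(self.jdbc_str, partition.start, partition.end)


class OracleDataSource(DataSource):
    @classmethod
    def name(cls):
        return "custom_oracle"  # the string passed to .format(...)

    def schema(self):
        # Schema of the joined tables, hard-coded here for brevity.
        return "customer_id INT, customer_name STRING, updated_at TIMESTAMP"

    def streamReader(self, schema):
        return OracleStreamReader(schema, self.options)


# Registering the class would then make the format name available:
# spark.dataSource.register(OracleDataSource)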