
Amit Singh Hora

06/07/2023, 5:17 AM
Hi All, I am new here, so my question will sound basic until I ramp up. Looking for support from you all. I am trying to use Spark to write a Delta table to S3 and store the metadata in a Hive metastore, which can later be queried by Trino. This is how I am starting the pyspark shell -
./pyspark --packages io.delta:delta-core_2.12:2.4.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --conf "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain" \
  --conf "spark.sql.hive.metastore.version=3.1.3" \
  --conf "spark.sql.hive.metastore.jars=maven"
This is my spark configuration -
spark = SparkSession.builder \
  .appName("DeltaTableExample") \
  .master("local[*]") \
  .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:postgresql://localhost:5435/hive_metastore2") \
  .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver") \
  .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "username") \
  .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "password") \
  .config("spark.sql.warehouse.dir", "s3a://location/hivewarehouse") \
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
  .enableHiveSupport() \
  .getOrCreate()
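(A minimal sketch for context, not from the original message: a small DataFrame write of the kind described just below, using a placeholder DataFrame and bucket path.)

# Hypothetical example: write a small DataFrame as a Delta table to S3.
# The path is a placeholder for the real bucket location.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df.write.format("delta").mode("overwrite").save("s3a://location/deltatable")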
I am able to write the Delta table to S3, but the moment I try to create the table with Spark SQL so that its entry is available in the Hive metastore
delta_table_path = "s3a://location/deltatable"
# Register Delta table in Hive Metastore
spark.sql(f"CREATE TABLE IF NOT EXISTS my_table USING DELTA LOCATION '{delta_table_path}'")
I start getting these errors - I don’t understand where it is picking up this /user/hive/warehouse/my_table hive warehouse location from
23/06/07 00:47:57 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `spark_catalog`.`default`.`my_table` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
00:48:01.034 [Thread-3] ERROR org.apache.hadoop.hive.metastore.RetryingHMSHandler - MetaException(message:file:/user/hive/warehouse/my_table-__PLACEHOLDER__ is not a directory or unable to create one)
Please note I am running the Hive metastore standalone, and the hive metastore init worked without any errors.
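(A hedged sketch, not part of the original question: file:/user/hive/warehouse is the Hive metastore's default warehouse directory, so one thing worth checking is whether the session is actually pointed at the standalone metastore and its S3 warehouse path. The thrift URI and bucket paths below are assumptions, not values from the thread.)

from pyspark.sql import SparkSession

# Sketch: point the session at the standalone metastore so table directories
# are created under the S3 warehouse path instead of file:/user/hive/warehouse.
spark = (
    SparkSession.builder
    .appName("DeltaTableExample")
    .master("local[*]")
    .config("spark.hadoop.hive.metastore.uris", "thrift://localhost:9083")  # assumed metastore URI
    .config("spark.sql.warehouse.dir", "s3a://location/hivewarehouse")
    .config("spark.hadoop.hive.metastore.warehouse.dir", "s3a://location/hivewarehouse")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .enableHiveSupport()
    .getOrCreate()
)

# Registering the table with an explicit LOCATION keeps the metastore entry
# pointing at the S3 path rather than the default warehouse directory.
spark.sql("CREATE TABLE IF NOT EXISTS my_table USING DELTA LOCATION 's3a://location/deltatable'")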

Matthew Powers

06/07/2023, 8:57 AM
It’s easier to install Delta with PyPI or Docker. Here are a bunch of example notebooks that I run using conda environments: https://github.com/delta-io/delta-examples
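(For reference, a minimal sketch of the PyPI route, assuming placeholder versions and a local path; configure_spark_with_delta_pip is the helper shipped in the delta-spark package.)

# pip install pyspark==3.4.1 delta-spark==2.4.0   (placeholder versions; keep them matched)
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("DeltaPyPiExample")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# configure_spark_with_delta_pip adds the matching Delta jars to the session.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta_demo")
spark.read.format("delta").load("/tmp/delta_demo").show()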