Grainne B
04/06/2023, 4:00 AM
23/04/06 01:55:48 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `spark_docker_testing`.`aws_glue_docker` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/glue_user/spark/python/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/home/glue_user/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3://grainne-data-testingaws_glue_docker-__PLACEHOLDER__
>>> spark.sql("create table spark_docker_testing.aws_glue_docker using delta location '<s3a://grainne-data-testing/workforce_good>'")
23/04/06 01:55:59 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `spark_docker_testing`.`aws_glue_docker` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/glue_user/spark/python/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/home/glue_user/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3://data-testingaws_glue_docker-__PLACEHOLDER__
Even when I use EMR (which works), I notice that when writing a Delta table to S3 there is always a __PLACEHOLDER__ suffix appended to the S3 path. I can't find any relevant documentation for this online.
Is there a way to create the Delta table pointing to the generated manifest path instead?
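(By "the generated manifest" I mean the symlink manifest Delta writes under the table path. A minimal sketch of how I generate it, assuming the bucket/path from the repro below:)
# Hypothetical sketch: generate the symlink manifest for an existing Delta path
from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, "s3://data-testing/test_data")
dt.generate("symlink_format_manifest")  # writes _symlink_format_manifest/ under the table path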
Code to replicate
# Run docker image
docker run -it -v ~/.aws:/home/glue_user/.aws -e AWS_PROFILE=saml -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_4.0.0_image_01
# Start pyspark shell
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.2,io.delta:delta-core_2.12:1.2.1
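# Note (my assumption, taken from the Delta Lake quickstart, not verified in this repro):
# the pyspark shell may also need the Delta SQL extension configs, e.g.
#   --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
#   --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog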
# Create table
spark.sql("create table spark_docker_testing.aws_glue_docker using delta location '<s3://data-testing/test_data>'")