
Grainne B

04/06/2023, 4:00 AM
Hi all, I'm having an issue registering a Delta table in the AWS Glue catalog. I'm using AWS's Glue Docker image `amazon/aws-glue-libs:glue_libs_4.0.0_image_01`. When I try to create a Delta table, I get this error:
23/04/06 01:55:48 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `spark_docker_testing`.`aws_glue_docker` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/glue_user/spark/python/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/home/glue_user/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3://grainne-data-testingaws_glue_docker-__PLACEHOLDER__
>>> spark.sql("create table spark_docker_testing.aws_glue_docker using delta location 's3a://grainne-data-testing/workforce_good'")
23/04/06 01:55:59 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `spark_docker_testing`.`aws_glue_docker` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/glue_user/spark/python/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/home/glue_user/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3://data-testingaws_glue_docker-__PLACEHOLDER__
Even when I use EMR (which works), I notice that writing a Delta table to S3 always appends a `__PLACEHOLDER__` suffix to the S3 path. I can't find any relevant documentation on this online. Is there a way to create the Delta table pointing to the generated manifest path instead? Code to replicate:
# Run docker image 
docker run -it -v ~/.aws:/home/glue_user/.aws -e AWS_PROFILE=saml -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_4.0.0_image_01

# Start pyspark shell 
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.2,io.delta:delta-core_2.12:1.2.1
# Create table 
spark.sql("create table spark_docker_testing.aws_glue_docker using delta location 's3://data-testing/test_data'")
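One thing that stands out in the repro: the pyspark shell is launched with only the `--packages` flag, without the Delta SQL extension and catalog configuration that the Delta Lake docs list for SQL support. A possible variant of the launch command, assuming the bucket names from the repro and that the Glue image honors the standard Hive metastore client factory conf (not verified in this image):

```shell
# Sketch, not a confirmed fix: same packages as the repro, plus the Delta
# Spark-SQL confs from the Delta Lake quickstart, so that
# `CREATE TABLE ... USING delta` is handled by Delta's catalog instead of
# falling through to the Hive SerDe path that produces the warning above.
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.2,io.delta:delta-core_2.12:1.2.1 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
```

The last conf points the Hive metastore client at the AWS Glue Data Catalog; on EMR this is typically set for you, which might explain why the same `CREATE TABLE` behaves differently there.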