
Sekhar Sahu

02/08/2023, 9:58 PM
Hi folks, I am trying out Delta Lake on the AWS EMR 6.9.0 release label, following the documentation listed here. When I run the following, I get the error below. Can anyone please help debug this?
23/02/08 21:53:36 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `default`.`delta_table` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.IllegalArgumentException: Can not create a Path from an empty string
Code:
# Create a DataFrame
data = spark.createDataFrame([("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
                              ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
                              ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
                              ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")],
                             ["id", "creation_date", "last_update_time"])

# Write the DataFrame as a Delta Lake dataset to the S3 location
spark.sql("""CREATE TABLE IF NOT EXISTS delta_table (id string, creation_date string,
last_update_time string)
USING delta LOCATION 's3://DOC-EXAMPLE-BUCKET/example-prefix/db/delta_table'""")

data.writeTo("delta_table").append()
PySpark command used:
pyspark --master yarn --deploy-mode client \
  --repositories http://repo.hortonworks.com/content/groups/public/,https://repos.spark-packages.org/,https://oss.sonatype.org/content/repositories/snapshots \
  --conf spark.sql.adaptive.coalescePartitions.initialPartitionNum=5000 \
  --conf spark.databricks.delta.optimize.maxFileSize=250000 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.executor.extraJavaOptions=-XX:+UseG1GC \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.databricks.delta.optimize.repartition.enabled=true \
  --conf spark.databricks.delta.autoOptimize=true \
  --packages io.delta:delta-core_2.12:2.1.0
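One thing worth checking (a guess, not a confirmed diagnosis): the AWS docs use s3://DOC-EXAMPLE-BUCKET as a placeholder, so the LOCATION has to point at a real, writable path before the CREATE TABLE runs. A minimal sketch of the same flow with an explicit path, plus a purely path-based write that avoids the metastore; the bucket name here is hypothetical:

# Sketch only -- "your-bucket" is a hypothetical bucket you can write to
delta_path = "s3://your-bucket/example-prefix/db/delta_table"

spark.sql(f"""CREATE TABLE IF NOT EXISTS delta_table
              (id string, creation_date string, last_update_time string)
              USING delta LOCATION '{delta_path}'""")
data.writeTo("delta_table").append()

# Or bypass the metastore and write/read by path instead:
data.write.format("delta").mode("append").save(delta_path)
spark.read.format("delta").load(delta_path).show()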

Grainne B

04/06/2023, 3:54 AM
Hi Sekhar, did you end up resolving this? I am facing a similar issue using the AWS Glue libs Docker image
amazon/aws-glue-libs:glue_libs_4.0.0_image_01
Run the container:
docker run -it -v ~/.aws:/home/glue_user/.aws -e AWS_PROFILE=saml -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_4.0.0_image_01
Create a PySpark shell:
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.2,io.delta:delta-core_2.12:1.2.1
Code used:
spark.sql("create table spark_docker_testing.aws_glue_docker using delta location '<s3://data-testing/test_file>'")
Error received:
23/04/06 03:41:34 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `spark_docker_testing`.`aws_glue_docker_delta` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/glue_user/spark/python/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/home/glue_user/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3://data-testingaws_glue_docker_delta-__PLACEHOLDER__
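Not an answer, but two mismatches stand out in this setup, so here is a sketch of what I would try (the versions and the s3a:// scheme are assumptions to verify, not confirmed fixes). Glue 4.0 ships Spark 3.3, while delta-core_2.12:1.2.1 was built for Spark 3.2, so a 2.1.x Delta build may be needed; and the shell above is not started with Delta's SQL extension and catalog, which the Delta docs require for CREATE TABLE ... USING delta:

pyspark --packages org.apache.hadoop:hadoop-aws:3.2.2,io.delta:delta-core_2.12:2.1.0 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

# hadoop-aws implements the s3a:// scheme; whether the Glue image also
# wires up plain s3:// is an assumption worth checking, and the hadoop-aws
# version should match the Hadoop bundled in the image.
spark.sql("""create table spark_docker_testing.aws_glue_docker
             using delta location 's3a://data-testing/test_file'""")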