Chetan Joshi
03/30/2023, 1:10 AM
We need the logRetentionDuration for our delta table. Could someone advise on fetching this property using Python or Java? Thanks!
We tried the following, but the detail() method doesn't seem to be available in 2.0.1:
DeltaTable deltaTable = DeltaTable.forPath(spark, pathToTable);
Dataset<Row> detailDF = deltaTable.detail();  // in Java, detail() returns Dataset<Row>, not DataFrame
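For what it's worth, a minimal sketch of one way to read that property without detail(), via Spark SQL from Python. The table name test_schema.test_table is hypothetical, and SHOW TBLPROPERTIES only lists delta.logRetentionDuration if it was set explicitly (Delta otherwise falls back to its 30-day default):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# SHOW TBLPROPERTIES returns (key, value) rows for explicitly-set properties.
rows = spark.sql("SHOW TBLPROPERTIES test_schema.test_table").collect()
props = {row.key: row.value for row in rows}

# None here means the property was never overridden on this table.
print(props.get("delta.logRetentionDuration"))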
Martin Beaussart
03/30/2023, 1:41 PM
Alber Tadrous
03/30/2023, 2:55 PM
spark.read.format("csv").load("s3://<file-path>") works, but when I use pandas it throws an error.
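Without seeing the error it's a guess, but a common cause: pandas resolves s3:// URLs through fsspec/s3fs, which must be installed separately (pip install s3fs), while Spark goes through Hadoop's S3 connector and doesn't need it. A sketch assuming that is the issue (bucket and key are hypothetical):

import pandas as pd

# Requires the s3fs package; credentials come from the usual AWS chain
# (environment variables, ~/.aws/credentials, or an instance profile).
df = pd.read_csv("s3://my-bucket/path/to/file.csv")
print(df.head())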
Alber Tadrous
03/30/2023, 2:57 PM
Yan Zhao
03/30/2023, 3:55 PM
rtyler
03/30/2023, 4:46 PM
Louis Frolio
03/30/2023, 7:35 PM
Josué Castañeda Landa
03/31/2023, 6:32 AM
Josué Castañeda Landa
03/31/2023, 6:32 AM
Apple iPhone14,7 Luis Alejandro -vcbc6bf9169a54d92a4ae7783581cf7c0 16.3.1 iOS Direct 2023-03-30 15:39:50.000 hotmail.com
Josué Castañeda Landa
03/31/2023, 6:33 AM
Christina
03/31/2023, 4:56 PM
Yan Zhao
04/01/2023, 3:50 PM
def checkConflicts(): Unit = {
  checkProtocolCompatibility()
  checkNoMetadataUpdates()
  checkForAddedFilesThatShouldHaveBeenReadByCurrentTxn()
  checkForDeletedFilesAgainstCurrentTxnReadFiles()
  checkForDeletedFilesAgainstCurrentTxnDeletedFiles()
  checkForUpdatedApplicationTransactionIdsThatCurrentTxnDependsOn()
  reportMetrics()
}
I want to know how checkNoMetadataUpdates() works. Here is a case to illustrate it:
I have two writers, both of which will update the metadata. Before the update, the snapshot version is 1.
Both of them commit an updateSchema action to the table, and the target metadata structure is the same.
The first writer's commit of 2.json succeeds and the second writer's fails, so the second writer retries the commit and checks for conflicts.
In this case, will the second writer's checkNoMetadataUpdates() throw MetadataChangedException?
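For context: checkNoMetadataUpdates() fails the retry if any winning commit contains a metaData action at all; it does not compare that metadata with what the retrying transaction wanted to write. So in this example, yes, the second writer's retry throws MetadataChangedException even though the schemas match. An illustrative Python sketch of the rule (the names here are made up, not Delta's actual API):

# Illustrative sketch: the check asks "did metadata change?", not
# "did it change to something different from what I wanted?".
class MetadataChangedException(Exception):
    pass

def check_no_metadata_updates(winning_commit_actions):
    # winning_commit_actions: actions committed between our read version
    # and our retry (writer 1's 2.json in the example above).
    if any(action["type"] == "metaData" for action in winning_commit_actions):
        raise MetadataChangedException(
            "The metadata of the Delta table has been changed by a concurrent update."
        )

try:
    check_no_metadata_updates([{"type": "metaData"}, {"type": "add"}])
except MetadataChangedException as e:
    print("retry fails:", e)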
Iwan Aucamp
04/03/2023, 1:31 PM
Matt Dalesio
04/03/2023, 8:24 PM
Gnanasoundari Soundarajan
04/04/2023, 7:56 AM
swest
04/04/2023, 3:28 PM
With write_deltalake(), it seems like we can only specify schemas with pyarrow, which unfortunately doesn't preserve the metadata dictionary in the conversion to a deltalake.schema.
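For anyone following along, a minimal sketch reproducing that: the metadata dict lives on the pyarrow schema, and it is what gets dropped in the conversion (path and metadata values are made up):

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# Attach schema-level metadata on the pyarrow side.
schema = pa.schema([pa.field("id", pa.int64())], metadata={"owner": "data-team"})
table = pa.table({"id": [1, 2, 3]}, schema=schema)

write_deltalake("/tmp/metadata_demo", table)

# The schema read back from the Delta table no longer carries the dict.
print(DeltaTable("/tmp/metadata_demo").schema())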
Robert Thompson
04/04/2023, 4:01 PM
Nermin Yehia
04/04/2023, 10:46 PM
org.apache.spark.sql.AnalysisException: No query found for dataset 'dlt_table_123' in class 'com.databricks.pipelines.execution.core.languages.PythonPipeline'
Starkadur Hrobjartsson
04/05/2023, 11:47 AM
CREATE TABLE IF NOT EXISTS test_schema.test_table
(
timestamp TIMESTAMP,
year INT GENERATED ALWAYS AS (YEAR(timestamp))
)
USING DELTA LOCATION '/tmp/delta-lake/delta-table'
TBLPROPERTIES(delta.minWriterVersion = '4')
I get the following error: Syntax error at or near 'GENERATED'
I can create the same table in Databricks with the appropriate location, catalog, and schema.
Is this feature exclusive to Databricks, or am I missing something?
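In case it helps: as far as I know the CREATE TABLE ... GENERATED ALWAYS AS syntax needs newer Spark/Delta parser support than many OSS setups ship, but OSS Delta has long supported generated columns through the DeltaTable builder API. A sketch of the same table via that route, assuming delta-spark's Python API and an existing spark session:

from delta.tables import DeltaTable

DeltaTable.createIfNotExists(spark) \
    .tableName("test_schema.test_table") \
    .addColumn("timestamp", "TIMESTAMP") \
    .addColumn("year", "INT", generatedAlwaysAs="YEAR(timestamp)") \
    .location("/tmp/delta-lake/delta-table") \
    .execute()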
junhui huang
04/06/2023, 2:54 AM
Grainne B
04/06/2023, 4:00 AM
23/04/06 01:55:48 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `spark_docker_testing`.`aws_glue_docker` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/glue_user/spark/python/pyspark/sql/session.py", line 1034, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self)
File "/home/glue_user/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 196, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3://grainne-data-testingaws_glue_docker-__PLACEHOLDER__
>>> spark.sql("create table spark_docker_testing.aws_glue_docker using delta location 's3a://grainne-data-testing/workforce_good'")
23/04/06 01:55:59 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `spark_docker_testing`.`aws_glue_docker` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/glue_user/spark/python/pyspark/sql/session.py", line 1034, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self)
File "/home/glue_user/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 196, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3://data-testingaws_glue_docker-__PLACEHOLDER__
Even when I use EMR (which works), I notice that when writing a Delta table to S3 there is always a __PLACEHOLDER__ appended to the S3 path; I can't find any relevant documentation for this online.
Is there a way to create the delta table pointing to the generated manifest path?
Code to replicate
# Run docker image
docker run -it -v ~/.aws:/home/glue_user/.aws -e AWS_PROFILE=saml -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_4.0.0_image_01
# Start pyspark shell
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.2,io.delta:delta-core_2.12:1.2.1
# Create table
spark.sql("create table spark_docker_testing.aws_glue_docker using delta location '<s3://data-testing/test_data>'")
Nagendra Darla
04/06/2023, 5:44 AM
SparkSession spark = SparkSession.builder()
        // Route s3:// URIs through the Hadoop S3A filesystem
        .config("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .config("fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A")
        .config("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
        // Enable Delta Lake SQL support
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        // Use the DynamoDB-backed LogStore for safe concurrent writes to S3
        .config("spark.delta.logStore.s3.impl", "io.delta.storage.S3DynamoDBLogStore")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.credentials.provider",
                "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
        .getOrCreate();
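(Side note, if commits still fail: my understanding is that S3DynamoDBLogStore also needs the io.delta:delta-storage-s3-dynamodb artifact on the classpath alongside delta-core; worth double-checking if the LogStore class fails to load.)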
Ian
04/06/2023, 1:13 PM
{
"message": "Error while creating the delta table=<s3a://test/15f2a198-6302-410f-afc7-6f6aa5dd0f27/description_field_table/> Error=Failed to load checkpoint: Failed to read checkpoint content: Generic S3 error: response error \"<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\"\"<http://www.w3.org/TR/html4/strict.dtd>\">\r\n<HTML><HEAD><TITLE>Length Required</TITLE>\r\n<META HTTP-EQUIV=\"Content-Type\" Content=\"text/html; charset=us-ascii\"></HEAD>\r\n<BODY><h2>Length Required</h2>\r\n<hr><p>HTTP Error 411. The request must be chunked or have a content length.</p>\r\n</BODY></HTML>\r\n\", after 0 retries: HTTP status client error (411 Length Required) for url (<http://169.254.169.254/latest/api/token>)."
}
Satyam Singh
04/06/2023, 4:27 PM
Lucas Zago
04/06/2023, 5:27 PM
sabari dass
04/06/2023, 8:16 PM
Kenny Ma
04/07/2023, 12:00 AM
junhui huang
04/07/2023, 1:20 AM
Ian
04/07/2023, 3:04 AM
Jalil Alchy
04/07/2023, 8:45 PM