Alessandro Mangone
01/25/2023, 4:00 PM
I'm trying to delete duplicate rows from a Delta table using the row_number() window function.
When I use a row_number() > 1 condition, I get an error that windows and aggregations are not supported by DeltaTable's delete.
I tried rewriting it as a subquery, but then I get an error that subqueries are not supported by delete either.
Is collecting the unwanted values and using Column.isin(unwantedValues) my only option? Are there any alternatives I might consider?
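A minimal sketch of both routes, assuming an active SparkSession named spark, a table at /tmp/events, a business key id, an ordering column updated_at and a column event_id that uniquely identifies each physical row (all of these names are assumptions, not from the question):

from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

dt = DeltaTable.forPath(spark, "/tmp/events")
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())

# Rows to drop: every row after the first one per business key.
unwanted = (dt.toDF()
              .withColumn("rn", F.row_number().over(w))
              .filter("rn > 1")
              .select("event_id"))

# Option 1: the isin workaround from the question (collects the keys to the driver).
unwanted_ids = [r["event_id"] for r in unwanted.collect()]
dt.delete(F.col("event_id").isin(unwanted_ids))

# Option 2 (avoids the collect): rewrite the table with only the rows to keep.
# deduped = dt.toDF().withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")
# deduped.write.format("delta").mode("overwrite").save("/tmp/events")

The overwrite route relies on Delta's snapshot isolation to read and replace the same table in one job, so it is worth testing on a copy first.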
Anil Puliyeril
01/26/2023, 12:00 PM

Martin
01/26/2023, 1:55 PM
I want to move a Delta Lake table that is registered in the Hive metastore to a new physical location. There is the ALTER TABLE x SET LOCATION statement (https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html#set-location-and-set-file-format). But if I run it after I have changed the physical location, it fails, since at that point the table in the Hive metastore is already corrupt: there is no Delta table to be found any more at the stored location.
I can work around this by creating a dummy Delta table (without registering it in the Hive metastore) at the original location after I have moved the Delta Lake table folder to its destination, running the ALTER TABLE x SET LOCATION statement, and then simply deleting the dummy Delta table at the original location.
Another way to move unmanaged tables would be to move the Delta Lake table folder using the OS, run DROP TABLE x and then CREATE TABLE x USING DELTA LOCATION (https://docs.delta.io/latest/delta-batch.html#control-data-location). The downside of this approach is that we lose table properties and metadata.
Both approaches feel very clumsy to me. In my view, this should be a built-in feature. Am I missing something?
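A sketch of the DROP/CREATE route driven from Spark SQL; the table name db.events, the new path and the idea of snapshotting metastore-only metadata first are assumptions:

# Assumes an external (unmanaged) table db.events whose folder has already been moved
# to /data/new/events with OS or cloud tooling.
old_props = spark.sql("SHOW TBLPROPERTIES db.events").collect()   # snapshot metastore-side metadata
old_ddl = spark.sql("SHOW CREATE TABLE db.events").collect()

spark.sql("DROP TABLE db.events")   # for an external table this only removes the metastore entry
spark.sql("CREATE TABLE db.events USING DELTA LOCATION '/data/new/events'")
# Properties and schema that live in the Delta log itself travel with the _delta_log folder,
# so they normally survive the move; anything stored only in the metastore can be re-applied
# from the snapshot taken above.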
Andres Urrego
01/26/2023, 4:30 PM

Hendry Suwanda
01/27/2023, 2:49 PM
According to describe detail, the total number of files in my Delta table has been reduced.
But when I run the readStream job, it still reads the old files (scanning more than 5,000 files). I then ran vacuum(0), but the readStream still reads the files that were deleted by the vacuum command.
When I run a select command, however, there is no error.
Is there any way to invalidate the Delta table? Why does the readStream still read files that have been deleted?
Note: I use Delta version 1.0.0.
Thanks
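One possible explanation, assuming the readStream resumes from an existing checkpoint: the source keeps working through the file list it recorded before the table was rewritten, so it can still reference files that the rewrite made obsolete and vacuum(0) then physically removed. A sketch of restarting the stream cleanly (the paths and the version number are assumptions):

# Start the stream with a fresh checkpoint so it plans against the current snapshot.
stream = (spark.readStream
               .format("delta")
               .option("startingVersion", 123)   # hypothetical: a recent version, to avoid re-reading history
               .load("/data/my_table"))

(stream.writeStream
       .format("delta")
       .option("checkpointLocation", "/checkpoints/my_table_fresh")   # new checkpoint location
       .start("/data/my_table_sink"))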
Ans Fida
01/27/2023, 6:53 PM

Haijia Zhou
01/27/2023, 10:01 PM

Ans Fida
01/27/2023, 10:17 PM

Soukaina
01/28/2023, 1:51 PM

Reginaldo Silva
01/28/2023, 9:56 PM

Ketki Bukkawar
01/29/2023, 3:58 AM
Exception in thread "main" java.lang.IllegalStateException: Detected incompatible schema change:
old schema: root
|-- id: integer (nullable = true) (metadata = {})
|-- name: string (nullable = true) (metadata = {})
|-- new_col: string (nullable = true) (metadata = {})
new schema: root
|-- id: integer (nullable = true) (metadata = {})
|-- name: string (nullable = true) (metadata = {})
Thank you in advance!
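This error typically means a reader (often a streaming read) captured a schema that still contained new_col, and the column has since been removed from the table. A small diagnostic sketch, assuming a hypothetical table path of /data/my_table:

# Find the version where the schema changed and compare it with the current schema.
spark.sql("DESCRIBE HISTORY delta.`/data/my_table`") \
     .select("version", "timestamp", "operation", "operationParameters") \
     .show(truncate=False)

spark.read.format("delta").load("/data/my_table").printSchema()   # current schema, without new_col
# If the drop was intentional and the error comes from a stream, the stream generally has to be
# restarted with a new checkpoint location so it binds to the current schema.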
Ketki Bukkawar
01/29/2023, 9:57 AM
How does a Delta table written with the delta-standalone library get stored in ADLS? Is it possible to query those tables using an Apache Spark pool in SQL?
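A table written by delta-standalone is an ordinary Delta table (Parquet files plus a _delta_log folder) in the ADLS container, so a Spark pool with Delta support can read it by path. A sketch with a hypothetical abfss path:

path = "abfss://mycontainer@myaccount.dfs.core.windows.net/tables/events"   # hypothetical

df = spark.read.format("delta").load(path)
df.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()

# Or query the path directly:
spark.sql(f"SELECT * FROM delta.`{path}` LIMIT 10").show()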
Roberto
01/30/2023, 11:33 AM

Jan Schutte
01/30/2023, 12:59 PM
I'm running dt.optimize().where("partition == 'asd'").executeCompaction() on a partition of about 200GB, but my jobs often run out of memory. I've allocated "spark.executor.memory": "45G" and "spark.executor.memoryOverhead": "5G", which I feel should be enough. Any ideas on how to optimize this operation for memory efficiency? I am not in a position to allocate more memory. I'm on Delta OSS 2.0.0 and Spark 3.2.2.
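Two knobs that, as far as I can tell, exist in Delta OSS 2.x and may reduce per-executor pressure during compaction: the target file size and the number of bins compacted in parallel. The values below are only assumptions to experiment with:

# Smaller target files and fewer parallel bins mean each task handles less data at once.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", "134217728")   # 128 MB instead of the 1 GB default
spark.conf.set("spark.databricks.delta.optimize.maxThreads", "4")            # fewer concurrent compaction jobs

from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, "/data/my_table")   # path is an assumption
dt.optimize().where("partition == 'asd'").executeCompaction()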
Kashyap Bhatt
01/30/2023, 3:22 PM
Where do I find which package versions are compatible with each other for --packages when running pyspark locally?
E.g. delta-core_2.12:2.2.0 works fine:
pyspark --packages io.delta:delta-core_2.12:2.2.0,org.apache.hadoop:hadoop-aws:3.3.4 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
But delta-core_2.12:2.1.0 throws a NullPointerException at start, which I assume is due to some version incompatibility:
pyspark --packages io.delta:delta-core_2.12:2.1.0,org.apache.hadoop:hadoop-aws:3.3.4 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
The question is: where do I find the compatible versions of delta-core, io.delta and hadoop?
My source was https://docs.databricks.com/release-notes/runtime/12.1.html (I use Databricks Runtime), but I do not see delta-core there, so when I change from 2.2.0 to 2.1.0 I don't know what to change 2.12 to.
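For what it's worth: io.delta is the Maven group, delta-core is the artifact, the _2.12 suffix is the Scala build of your Spark installation (not a Delta version), and the final number is the Delta release, which has to match your local Spark minor version; the compatibility matrix is on the Delta documentation's releases page rather than in the Databricks runtime notes. A small check of what the local PySpark actually is, assuming the pip-installed pyspark is what launches:

# The Delta release must target this Spark line (roughly: Delta 2.1.x/2.2.x -> Spark 3.3.x,
# Delta 2.0.x -> Spark 3.2.x), while "_2.12" only needs to match Spark's Scala build,
# which is 2.12 for the standard PyPI pyspark wheels.
import pyspark
print(pyspark.__version__)

So when moving from 2.2.0 to 2.1.0 the 2.12 part stays the same: io.delta:delta-core_2.12:2.1.0.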
Ans Fida
01/30/2023, 6:57 PM

Christian Pfarr
01/30/2023, 7:46 PM
I tried setting spark.sql.parquet.int96AsTimestamp to false so I wouldn't have to set this Drill setting, but it seems that the Delta table always writes the data as INT96.
Is it possible to configure this via a special Delta config, or should this setting work and I did something wrong while setting up my Spark session?
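One thing worth checking: spark.sql.parquet.int96AsTimestamp only affects how INT96 values are read back; the write side is governed by a different standard Spark conf. A sketch, assuming an existing SparkSession named spark:

# outputTimestampType controls what Spark's Parquet writer emits (default INT96).
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
# Parquet files written by Delta after this point should store timestamps as INT64 micros;
# files written earlier keep their INT96 encoding until they are rewritten.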
Ian
01/31/2023, 10:26 AM
delta_table_loc = "s3a://test-bucket/test-table"
storage_options = {"AWS_ACCESS_KEY_ID": "<accesskey>", "AWS_SECRET_ACCESS_KEY": "<secretkey>",
                   "AWS_ENDPOINT_URL": "https://0.0.0.0:9001"}
write_deltalake(table_or_uri=delta_table_loc, data=df, mode='append', schema=SA_data_schema, partition_by=["asset_id"],
                storage_options=storage_options)
When AWS_DEFAULT_REGION is not set, this throws "region missing"; when it is set, it throws "Failed to load checkpoint: Failed to read checkpoint content: Generic S3 error: Error performing get request".
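A hedged sketch of the delta-rs side: the underlying S3 client wants a region even for MinIO-style endpoints (any value such as "us-east-1" should do), so adding it to storage_options is usually the first step; df and SA_data_schema come from the snippet above, and the remaining "get request" error may then be an endpoint/TLS issue rather than a region one:

from deltalake import write_deltalake

storage_options = {
    "AWS_ACCESS_KEY_ID": "<accesskey>",
    "AWS_SECRET_ACCESS_KEY": "<secretkey>",
    "AWS_ENDPOINT_URL": "https://0.0.0.0:9001",
    "AWS_REGION": "us-east-1",   # MinIO ignores the value, but the client requires one
}
write_deltalake(
    table_or_uri="s3a://test-bucket/test-table",
    data=df,
    mode="append",
    schema=SA_data_schema,
    partition_by=["asset_id"],
    storage_options=storage_options,
)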
Ajex
01/31/2023, 10:39 AM
I ran spark.sql("ALTER TABLE delta.`/user/fplay/temp/testraw/raw_logs.delta` CHANGE COLUMN user_id user_id STRING FIRST") to move the column I need to Z-order by to index 0 (the first column).
After all that, the newest Delta log version still doesn't have any statistics for the column I need.
Any help please!
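One thing that may explain it: Delta only collects column statistics when files are written, and only for the first delta.dataSkippingNumIndexedCols columns (32 by default), so reordering the column changes nothing for files that already exist. A sketch, where the property value 40 is just an assumption:

# Raise the number of indexed columns (or keep the reorder) and then rewrite the files,
# e.g. via new appends or OPTIMIZE/Z-ORDER, so the new files carry stats for user_id.
spark.sql("""
  ALTER TABLE delta.`/user/fplay/temp/testraw/raw_logs.delta`
  SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40')
""")
# Existing files keep their old (missing) statistics until they are rewritten.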
João Pinto
01/31/2023, 2:47 PM

De Guy
01/31/2023, 2:50 PM

Sanjeeb Dey
01/31/2023, 7:36 PM

Sanjeeb Dey
01/31/2023, 7:36 PM

Naama Gal-Or
02/01/2023, 7:28 AM

Naama Gal-Or
02/01/2023, 7:42 AM

orsagiv
02/01/2023, 4:13 PM

Ben Temple
02/01/2023, 6:31 PM
I'm getting the error Total size of serialized results of 462 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB. I have checked the usual suspects for this issue, such as collect statements, really high numbers of partitions and large broadcast joins, none of which occur in the job. Digging further, I believe this could be caused by the Delta writer: the stage that returns this error shows the line com.databricks.sql.transaction.tahoe.commands.WriteIntoDeltaCommand.run(WriteIntoDeltaCommand.scala:70). I do not have access to the source code, but I wondered if anybody had any insight into why this line could be returning fairly large amounts of data to the driver (I assume metrics and information on file locations for updating the Delta log). It seems that each task returns a lot of information, and for the total number of tasks in this stage (2787, which is equal to the current number of partitions) that would require a much larger result size than currently configured. I would rather work out whether this result size can be reduced than keep increasing the Spark configuration and driver size.
Can anybody please give any insight into what information is returned to the driver in the write function mentioned above, and whether there are any particular types of data that could cause each result set to be so large?
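Not Databricks internals, just the general mechanics: each write task sends its result (metadata and metrics for the files it created) back to the driver for the commit, so the result volume scales with the number of write tasks/files. A sketch of attacking that instead of raising maxResultSize; df, the column and the path are assumptions:

# Fewer output tasks -> fewer per-task results returned to the driver and fewer, larger files.
(df
   .repartition(400, "event_date")          # well below the current 2787 tasks; tune the number
   .write.format("delta")
   .mode("append")
   .save("/mnt/lake/my_table"))
# Raising spark.driver.maxResultSize also works, but reducing the task count addresses the cause.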
Yousry Mohamed
02/01/2023, 9:19 PM
The query is a SELECT count(*), not a SELECT *. Anyway, bloom filters did not produce the same file-pruning behaviour as column-level stats.
I wonder if anyone has used them effectively before and can explain how to verify they are really making a difference.
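One hedged way to verify: bloom filters should only kick in for point-lookup style predicates (equality or IN on the indexed column), so compare how many files the same selective query scans with and without the index, using the scan node's metrics in the Spark UI. The table, column and index-drop statement below are assumptions:

q = "SELECT count(*) FROM events WHERE device_id = 'abc-123'"   # hypothetical selective lookup

spark.sql(q).show()    # with the bloom filter index in place; note the files read in the scan metrics

# spark.sql("DROP BLOOMFILTER INDEX ON TABLE events FOR COLUMNS(device_id)")   # Databricks syntax, as I recall
# spark.sql(q).show()  # repeat without the index and compare the files-read figure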
Reetika Agrawal
02/02/2023, 5:08 AM

sabari dass
02/02/2023, 8:04 AM