Georg Heiler
02/09/2023, 6:01 PM
Roberto
02/09/2023, 6:52 PM
spark.databricks.delta.schema.autoMerge.enabled=true
spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite
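These can also be applied when constructing the session; a minimal sketch in Scala (the app name is hypothetical, the property keys are the ones quoted above):
import org.apache.spark.sql.SparkSession

// Sketch: set Delta schema auto-merge and (on Databricks) optimized
// writes as session-level defaults.
val spark = SparkSession.builder()
  .appName("delta-demo")
  .config("spark.databricks.delta.schema.autoMerge.enabled", "true")
  .config("spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite", "true")
  .getOrCreate()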
sabari dass
02/09/2023, 7:15 PM
vijay
02/09/2023, 10:15 PM
rtyler
02/09/2023, 10:24 PM
SK
02/09/2023, 10:49 PM
Matt Fysh
02/10/2023, 3:20 AM
Matt Fysh
02/10/2023, 3:21 AM
Matt Fysh
02/10/2023, 3:22 AM
Matt Fysh
02/10/2023, 3:31 AM
Danyil Butkovskyi
02/10/2023, 3:55 AM
Selva Nachimuthu
02/10/2023, 4:01 AM
gopinath s
02/10/2023, 6:09 AM
Chanukya Pekala
02/10/2023, 11:05 AM
Lucas Zago
02/10/2023, 12:33 PM
Christian Pfarr
02/10/2023, 6:34 PM
Matt Fysh
02/11/2023, 3:23 AM
Tram
02/11/2023, 6:35 PM
Steve Quan
02/13/2023, 4:07 AM
Grainne B
02/13/2023, 11:15 PM
Stefano Lori
02/14/2023, 7:31 AM
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.12
version = 3.3.1
The application runs without problems in local mode, but when distributed over a K8s cluster it breaks with the exception
Caused by: java.lang.ClassNotFoundException: delta.DefaultSource
Anybody experienced something like this? thx
Logs here:
2023-02-13 16:40:37.990Z INFO org.apache.spark.storage.BlockManagerMaster:61 - BlockManagerMaster stopped
2023-02-13 16:40:37.996Z INFO org.apache.spark.scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:61 - OutputCommitCoordinator stopped!
Exception in thread "main" java.lang.ClassNotFoundException:
Failed to find data source: delta. Please find packages at
https://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:587)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:675)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:864)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at ....
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at com.izicap.delta.loader.DeltaLoaderTableLauncher$.run(DeltaLoaderTableLauncher.scala:155)
at com.izicap.delta.loader.DeltaLoaderTableLauncher$.main(DeltaLoaderTableLauncher.scala:178)
at com.izicap.delta.loader.DeltaLoaderTableLauncher.main(DeltaLoaderTableLauncher.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: delta.DefaultSource
at java.base/java.net.URLClassLoader.findClass(Unknown Source)
at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:661)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:661)
at scala.util.Failure.orElse(Try.scala:224)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:661)
... 32 more
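A frequent cause of this error is that delta-core resolves on the local driver classpath but is not shipped to the cluster inside the application jar. As a hedged sketch, the missing artifact in the same coordinate style as above (the version is an assumption; pick the release matching your Spark 3.3.x build):
groupId = io.delta
artifactId = delta-core_2.12
version = 2.2.0
Either bundle this into the application's fat jar or pass it at submit time with spark-submit --packages io.delta:delta-core_2.12:2.2.0.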
Shira Bodenstein
02/14/2023, 12:40 PM
DeltaTable.forPath(sparkSession, hdfsPath)
.alias("oldData")
.merge(data.alias("newData"), deltaMergeQuery)
.whenMatched()
.updateAll()
.whenNotMatched()
.insertAll()
.execute();
Deletion will be done in a similar way.
Now to the question: since the table is not partitioned on the fields used in the merge query, does that mean Spark will have to open and read all the files?
Thanks in advance!
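If the merge condition only references non-partition columns, Delta can still skip files using per-file min/max statistics on those columns, but in the worst case it will open and scan every file to find matches. A hedged sketch in Scala, assuming the table were partitioned by a hypothetical event_date column (id and event_date are illustrative; the other names are from the snippet above):
import io.delta.tables.DeltaTable

// Folding the partition column into the merge condition lets Delta
// prune untouched partitions instead of opening every file.
DeltaTable.forPath(sparkSession, hdfsPath)
  .alias("oldData")
  .merge(
    data.alias("newData"),
    "oldData.id = newData.id AND oldData.event_date = newData.event_date")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()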
Kevin Bromber
02/14/2023, 4:52 PM
Lucas Zago
02/14/2023, 6:37 PM
rtyler
02/14/2023, 7:09 PM
Tram
02/15/2023, 2:04 AM
ERevanthSai
02/15/2023, 6:01 AM
Bas Beelen
02/15/2023, 9:40 AM
change-data-feed is enabled, and it has been working fine.
Today though, for one of the tables we get a weird error message when we try to use table_changes("the_table", <commit>), namely:
No file found in the directory: gs://bucket/raw/.../_delta_log
1. The _delta_log is at the location
2. I can query the table itself: select * from the_table
3. I can load previous versions: select * from the_table@v100
4. But this fails with the above-mentioned error: select * from table_changes("the_table", 100)
Nothing in our logic has changed for the specific table where this is happening. Other tables that run on the same ingestion logic don't have this specific issue.
Anyone have an idea what could be going on?
We use Databricks on GCP
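To narrow this down, the change feed can also be read through the DataFrame API; a sketch using the documented CDF read options (the table name and version are the ones from the message above, spark is the active session):
// Read the change feed starting at version 100 via the DataFrame reader.
spark.read.format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 100)
  .table("the_table")
  .show()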
Yatharth Maheshwari
02/15/2023, 10:26 AM
Gerhard Brueckl
02/15/2023, 11:50 AM
VACUUM deletes orphaned and outdated files. Has anyone ever had a case where:
• a long UPDATE/MERGE/DELETE operation was running that creates data files (parquet)
• VACUUM is run at about the same time and cleans up those temporary files, orphaned in the sense that the log file had not yet been written because the concurrent UPDATE/MERGE/DELETE operation was still running?
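For reference, VACUUM only deletes files older than the retention threshold (7 days by default, guarded by a safety check), which is intended to protect exactly such in-flight files. A minimal sketch with an explicit retention window (the path is hypothetical):
import io.delta.tables.DeltaTable

// Keep the retention window well above the runtime of any concurrent
// UPDATE/MERGE/DELETE so files from still-uncommitted transactions survive.
DeltaTable.forPath(spark, "/delta/events").vacuum(168) // retention in hours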