John Darrington
06/06/2023, 8:27 PMAmit Singh Hora
06/07/2023, 5:17 AM./pyspark --packages io.delta:delta-core_2.12:2.4.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
--conf "fs.s3a.aws.credentials.provide=com.amazonaws.auth.DefaultAWSCredentialsProviderChain" \
--conf "spark.sql.hive.metastore.version=3.1.3" \
--conf "spark.sql.hive.metastore.jars=maven"
This is my spark configuration -
spark = SparkSession.builder \
.appName("DeltaTableExample") \
.master("local[*]") \
.config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:<postgresql://localhost:5435/hive_metastore2>") \
.config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver") \
.config("spark.hadoop.javax.jdo.option.ConnectionUserName", "username") \
.config("spark.hadoop.javax.jdo.option.ConnectionPassword", "password") \
.config("spark.sql.warehouse.dir","<s3a://location/hivewarehouse>") \
.config("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.enableHiveSupport() \
.getOrCreate()
I am able to write the Delta table to S3, but the moment I try to create the table with Spark SQL to have its entry available in the Hive metastore,
delta_table_path = "<s3a://location/deltatable>"
# Register Delta table in Hive Metastore
spark.sql(f"CREATE TABLE IF NOT EXISTS my_table USING DELTA LOCATION '{delta_table_path}'")
I start getting these errors. I don't understand where it is picking up this /user/hive/warehouse/my_table Hive warehouse location from:
23/06/07 00:47:57 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `spark_catalog`.`default`.`my_table` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
00:48:01.034 [Thread-3] ERROR org.apache.hadoop.hive.metastore.RetryingHMSHandler - MetaException(message:file:/user/hive/warehouse/my_table-__PLACEHOLDER__ is not a directory or unable to create one)
Please note I am running the Hive metastore standalone, and the Hive metastore init worked without any errors.
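One hedged guess at the cause: a standalone Hive metastore resolves the new table's directory from its own hive.metastore.warehouse.dir (which defaults to /user/hive/warehouse), and spark.sql.warehouse.dir alone may not reach it. A minimal sketch of also setting the Hive-side property from Spark (the S3 location is the one from the config above; treat the property forwarding as an assumption to verify):
from pyspark.sql import SparkSession

# Sketch only: point both the Spark SQL warehouse and the Hive metastore warehouse
# at the same S3 location, so table creation does not fall back to /user/hive/warehouse.
spark = (
    SparkSession.builder
        .config("spark.sql.warehouse.dir", "s3a://location/hivewarehouse")
        .config("spark.hadoop.hive.metastore.warehouse.dir", "s3a://location/hivewarehouse")
        .enableHiveSupport()
        .getOrCreate()
)
Rahul Sharma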
06/07/2023, 7:42 AMRahul Sharma
06/07/2023, 11:02 AMYuya Ebihara
06/07/2023, 11:45 AM'delta.minWriterVersion'='6'
table property on Spark 3.4.0 & Delta 2.4.0. Is it a known limitation? Spark 3.3.2 & Delta 2.3.0 had no such failure.
CREATE TABLE a.b (col int) USING DELTA LOCATION 's3://xxx/xxx' TBLPROPERTIES ('delta.minWriterVersion'='6')
-> delta.minWriterVersion needs to be one of 1, 2, 3, 4, 5, 7.
Dipak Krishnan
06/07/2023, 7:51 PMShanmukhSaavinay Gajula
06/08/2023, 11:58 AMExtraction-Load Job Process
1) Read the table data from the Postgres server in an incremental fashion
2) Write the data to S3 in Parquet format
3) Write the data to AWS S3 in Delta format
4) Run the merge to handle upserts
For instance:
if a job takes 3 minutes to finish,
steps 3 & 4 take close to 1.6-1.7 minutes of that.
Please find the screenshot for more information.
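For reference, a minimal sketch of what the step 4 merge/upsert might look like (the target path, key column "id", and updates_df are hypothetical placeholders, not taken from the job above):
from delta.tables import DeltaTable

# Sketch only: upsert the incremental batch into the Delta table on S3,
# matching on a hypothetical primary-key column "id".
updates_df = ...  # incremental batch produced in step 1 (placeholder)
target = DeltaTable.forPath(spark, "s3a://bucket/path/delta_table")
(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
Marius Grama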
06/08/2023, 1:43 PMPatrik Ekman
06/08/2023, 3:19 PMBenny Elgazar
06/08/2023, 5:40 PMVishal Kadam
06/09/2023, 4:41 AMYatharth Maheshwari
06/09/2023, 5:43 AM|-- someColumn: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- value1: long (nullable = true)
| | |-- value2: string (nullable = true)
Michael Angerman
06/11/2023, 2:28 PMCompiling deltalake v0.13.0 (/Users/ma/j/tmp09/delta-rs/rust)
error[E0700]: hidden type for `impl futures::Future<Output = Result<Vec<action::Add>, errors::DeltaTableError>>` captures lifetime that does not appear in bounds
--> rust/src/operations/write.rs:231:28
|
231 | ) -> DeltaResult<Vec<Add>> {
| ____________________________^
232 | | let invariants = snapshot
233 | | .current_metadata()
234 | | .and_then(|meta| meta.schema.get_invariants().ok())
... |
283 | | .collect::<Vec<_>>())
284 | | }
| |_^
|
= note: hidden type `[async fn body@rust/src/operations/write.rs:231:28: 284:2]` captures lifetime '_#21r
Michael Angerman
06/11/2023, 2:30 PMUpdating <http://crates.io|crates.io> index
error: failed to select a version for `chrono`.
... required by package `arrow-array v39.0.0`
... which satisfies dependency `arrow-array = "^39.0.0"` of package `arrow v39.0.0`
... which satisfies dependency `arrow = "^39"` of package `deltalake v0.13.0 (/Users/ma/j/tmp09/delta-rs/rust)`
... which satisfies path dependency `deltalake` of package `delta-inspect v0.8.0 (/Users/ma/j/tmp09/delta-rs/delta-inspect)`
versions that meet the requirements `^0.4.24` are: 0.4.26, 0.4.25, 0.4.24
all possible versions conflict with previously selected packages.
previously selected package `chrono v0.4.23`
... which satisfies dependency `chrono = "^0.4.23"` (locked to 0.4.23) of package `arrow-arith v39.0.0`
... which satisfies dependency `arrow-arith = "^39.0.0"` of package `arrow v39.0.0`
... which satisfies dependency `arrow = "^39"` of package `deltalake v0.13.0 (/Users/ma/j/tmp09/delta-rs/rust)`
... which satisfies path dependency `deltalake` of package `delta-inspect v0.8.0 (/Users/ma/j/tmp09/delta-rs/delta-inspect)`
failed to select a version for `chrono` which could resolve this conflict
keshav
06/12/2023, 5:56 AM23/06/12 04:57:04 WARN DeltaHistoryManager: Found Delta commit 12687 with a timestamp 1686481866687 which is greater than the next commit timestamp 1686306634000.
23/06/12 04:57:04 WARN DeltaHistoryManager: Found Delta commit 12688 with a timestamp 1686481866688 which is greater than the next commit timestamp 1686306757000.
23/06/12 04:57:04 WARN DeltaHistoryManager: Found Delta commit 12689 with a timestamp 1686481866689 which is greater than the next commit timestamp 1686306881000.
1686481866689 - GMT: Sunday, 11 June 2023 11:11:06.689
1686306881000 - GMT: Friday, 9 June 2023 10:34:41
Any suggestions on how I can resolve these warnings?Jiri Vinarek
06/12/2023, 2:01 PMIn another streaming query, you can continuously read deduplicated data from this Delta table. This is possible because an insert-only merge only appends new data to the Delta table.The target table will not be a streaming table, as I understand. I can get around this by enabling change data feed and then consuming INSERT operations from the feed. Is this the proposed way? Or is there another way how to get deduplicated streaming table?
Ram
06/13/2023, 7:36 AMdeltaTable = ... # DeltaTable with schema (key, value)
changesDF = spark.table("changes")
latestChangeForEachKey = changesDF \
.selectExpr("key", "struct(time, newValue, deleted) as otherCols") \
.groupBy("key") \
.agg(max("otherCols").alias("latest")) \
.select("key", "latest.*") \
deltaTable.alias("t").merge(
latestChangeForEachKey.alias("s"),
"s.key = t.key") \
.whenMatchedDelete(condition = "s.deleted = true").execute()
vs
deltaTable.alias("t").merge(
latestChangeForEachKey.alias("s"),
"s.key = t.key") \
.whenMatchedDelete(None).execute()
Alessandro Biagi
06/13/2023, 9:31 AMAmool Gupta
06/13/2023, 5:07 PMLi Sun
06/13/2023, 9:27 PMPaweł Mitruś
06/14/2023, 7:07 AMYi Wang
06/15/2023, 8:51 AMBrayan Jules
06/15/2023, 5:17 PMJohn Darrington
06/15/2023, 6:54 PMbharat chaudhury
06/16/2023, 9:49 AM/home/bchaudhu/python3_9_env/bin/python3.9 -m pip install delta-spark==2.3.0
Collecting delta-spark==2.3.0
Using cached delta_spark-2.3.0-py3-none-any.whl (20 kB)
Collecting pyspark<3.4.0,>=3.3.0 (from delta-spark==2.3.0)
Using cached pyspark-3.3.2.tar.gz (281.4 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... error
error: subprocess-exited-with-error
× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [80 lines of output]
/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/dist.py:755: SetuptoolsDeprecationWarning: Invalid dash-separated options
!!
********************************************************************************
Usage of dash-separated 'description-file' will not be supported in future
versions. Please use the underscore name 'description_file' instead.
By 2023-Sep-26, you need to update your project and remove deprecated calls
or your builds will no longer be supported.
See https://setuptools.pypa.io/en/latest/userguide/declarative_config.html for details.
********************************************************************************
!!
opt = self.warn_dash_deprecation(opt, section)
running dist_info
creating /tmp/pip-modern-metadata-axjhbnal/pyspark.egg-info
writing /tmp/pip-modern-metadata-axjhbnal/pyspark.egg-info/PKG-INFO
writing dependency_links to /tmp/pip-modern-metadata-axjhbnal/pyspark.egg-info/dependency_links.txt
writing requirements to /tmp/pip-modern-metadata-axjhbnal/pyspark.egg-info/requires.txt
writing top-level names to /tmp/pip-modern-metadata-axjhbnal/pyspark.egg-info/top_level.txt
writing manifest file '/tmp/pip-modern-metadata-axjhbnal/pyspark.egg-info/SOURCES.txt'
reading manifest file '/tmp/pip-modern-metadata-axjhbnal/pyspark.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching '*.py[cod]' found anywhere in distribution
warning: no previously-included files matching '__pycache__' found anywhere in distribution
warning: no previously-included files matching '.DS_Store' found anywhere in distribution
writing manifest file '/tmp/pip-modern-metadata-axjhbnal/pyspark.egg-info/SOURCES.txt'
creating '/tmp/pip-modern-metadata-axjhbnal/pyspark-3.3.2.dist-info'
Traceback (most recent call last):
File "/home/bchaudhu/python3_9_env/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/home/bchaudhu/python3_9_env/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/home/bchaudhu/python3_9_env/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 149, in prepare_metadata_for_build_wheel
return hook(metadata_directory, config_settings)
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 380, in prepare_metadata_for_build_wheel
self.run_setup()
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 487, in run_setup
super(_BuildMetaLegacyBackend,
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 338, in run_setup
exec(code, locals())
File "<string>", line 199, in <module>
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/__init__.py", line 107, in setup
return distutils.core.setup(**attrs)
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/dist.py", line 1244, in run_command
super().run_command(command)
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/command/dist_info.py", line 104, in run
bdist_wheel = self.get_finalized_command('bdist_wheel')
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/_distutils/cmd.py", line 304, in get_finalized_command
cmd_obj = self.distribution.get_command_obj(command, create)
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 860, in get_command_obj
klass = self.get_command_class(command)
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/dist.py", line 989, in get_command_class
self.cmdclass[command] = cmdclass = ep.load()
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/setuptools/_vendor/importlib_metadata/__init__.py", line 208, in load
module = import_module(match.group('module'))
File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/wheel/bdist_wheel.py", line 28, in <module>
from .macosx_libfile import calculate_macosx_platform_tag
File "/tmp/pip-build-env-d4kbutzi/overlay/lib/python3.9/site-packages/wheel/macosx_libfile.py", line 43, in <module>
import ctypes
File "/usr/local/lib/python3.9/ctypes/__init__.py", line 8, in <module>
from _ctypes import Union, Structure, Array
ModuleNotFoundError: No module named '_ctypes'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
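The final ModuleNotFoundError points at the custom Python 3.9 build rather than pip or delta-spark: the interpreter appears to have been compiled without the _ctypes extension (typically because the libffi development headers were missing at build time), so the wheel tooling fails while generating pyspark's metadata. A quick sanity check, as a sketch:
# Sketch only: if this import fails with "No module named '_ctypes'", the interpreter
# was built without libffi support and needs to be rebuilt with the libffi dev headers present.
import ctypes
print("ctypes is available:", ctypes.sizeof(ctypes.c_int))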
Rodrigo Witzel
06/16/2023, 10:04 AMAlejandra H Segura
06/17/2023, 7:12 AMNaveen Kumar Vadlamudi
06/17/2023, 6:03 PMNaveen Kumar Vadlamudi
06/18/2023, 4:19 PMJared Grove
06/18/2023, 9:03 PMspark-submit
on my local host I have no errors. However, I wanted to test this Spark application in Docker, which I guess is still technically my local host, but I want to submit the program from the Docker container. I have the following containers: spark-master, two spark-workers, spark-history-server, and a spark-driver. All containers are on the same Docker network. Inside the spark-driver container is where I launch spark-submit --properties-file ./src/spark/spark-defaults.conf ./src/start_pipeline.py
I receive the following error: An error occurred while calling o515.load.
: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 4 times, most recent failure: Lost task 0.3 in stage 11.0 (TID 156) (172.23.0.3 executor 3): org.apache.spark.SparkFileNotFoundException: File file:/opt/ufo-lakehouse/lakehouse/ufo/bronze/_delta_log/00000000000000000000.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
This file does exist; the Spark program created it! I first thought it might be a permissions issue, so I set all my folders/files to permission 777, but I still have the same error. Any help or guidance would be much appreciated. Thank you!
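One hedged reading of the error: a file:/ path is resolved on each container's own filesystem, so the _delta_log the driver wrote under /opt/ufo-lakehouse is invisible to executors in the spark-worker containers unless the same volume is mounted at the same path in every container (or the table lives on shared storage such as S3/HDFS). A minimal sketch under that assumption (the mount point below is hypothetical):
# Sketch only: the Delta table must sit on storage visible to the driver AND every worker,
# e.g. a Docker volume mounted at an identical path in all containers, or an s3a:// URI.
shared_path = "/shared/lakehouse/ufo/bronze"  # hypothetical shared mount, not the original path
bronze_df = spark.read.format("delta").load(shared_path)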