Al Nick Ortiz
05/23/2023, 1:29 PM

Al Nick Ortiz
05/23/2023, 1:30 PM

Al Nick Ortiz
05/23/2023, 1:31 PM
liab
05/23/2023, 5:27 PM

GapyNi
05/23/2023, 5:45 PM
We read the bronze table as a stream with readStream.option("startingVersion", lastMaxVersionFromAppend), where we use foreachBatch in the next step to merge the batches into the silver layer.
Now the question: as I looked into the metrics, it seems the process read all of the numAddedFiles, but numOutputRows (from the bronze layer) does not exactly match numSourceRows (from the silver layer) when performing the foreachBatch merge command. Do you know what the reason could be?
Thanks and regards,
GapyNi
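For reference, a minimal sketch of the pipeline described above (the paths and the "id" merge key are assumptions, not the actual setup):
```python
# Sketch only: paths and the merge key are placeholders; assumes an existing
# SparkSession `spark` and the bronze version number `lastMaxVersionFromAppend`.
from delta.tables import DeltaTable

def merge_to_silver(batch_df, batch_id):
    # Upsert each micro-batch read from bronze into the silver table.
    silver = DeltaTable.forPath(spark, "/mnt/silver/table")
    (silver.alias("s")
        .merge(batch_df.alias("b"), "s.id = b.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.format("delta")
    .option("startingVersion", lastMaxVersionFromAppend)
    .load("/mnt/bronze/table")
    .writeStream
    .foreachBatch(merge_to_silver)
    .option("checkpointLocation", "/mnt/checkpoints/silver")
    .start())
```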
Daniel Bariudin
05/24/2023, 8:17 AM
DeltaTable.isDeltaTable(spark, "/path/to/table")
For a given path "/path/to/table": if the parent path "/path/to" contains a _delta_log directory, the function returns True.
The documentation provided is:
"classmethod `isDeltaTable`(sparkSession: pyspark.sql.session.SparkSession, identifier: str) → bool
Check if the provided identifier string, in this case a file path, is the root of a Delta table using the given SparkSession."
(https://docs.delta.io/latest/api/python/index.html)
So the question is: if the path I provided doesn't contain a Delta table, but a parent path does and isDeltaTable returns True, is this not a bug? Or am I missing something?
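A stricter check with exact-path semantics is easy to sketch (the helper name is hypothetical, and spark._jvm / spark._jsc are private PySpark handles): it looks for _delta_log directly under the given path instead of searching upward.
```python
# Hypothetical helper: returns True only if _delta_log sits directly under
# `path`; parent directories are never consulted.
def is_delta_table_root(spark, path):
    hadoop_conf = spark._jsc.hadoopConfiguration()
    log_path = spark._jvm.org.apache.hadoop.fs.Path(path + "/_delta_log")
    fs = log_path.getFileSystem(hadoop_conf)
    return fs.exists(log_path)
```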
Simon Thelin
05/24/2023, 9:30 AM

Soukaina
05/24/2023, 11:10 AM

ERIC HAMMEL
05/24/2023, 12:41 PM

Ainesh Pandey
05/24/2023, 4:12 PM
How do I save a DataFrame (with a geometry column) as a Delta table in Databricks?
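Delta (and the underlying Parquet) has no native geometry type, so the usual approach is to serialize the column first. A sketch, assuming Apache Sedona is on the cluster and WKT is an acceptable encoding (df and the table name are placeholders):
```python
# Sketch: serialize the geometry column to WKT text before writing; assumes
# Apache Sedona provides ST_AsText. WKB via ST_AsBinary is the other option.
from pyspark.sql.functions import expr

df_wkt = df.withColumn("geometry", expr("ST_AsText(geometry)"))
df_wkt.write.format("delta").saveAsTable("geo_table")  # hypothetical table name
```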
Suraj Malthumkar
05/25/2023, 7:22 AM
1st way:
```java
System.out.println("Delta Read");
// Read the Delta table through Spark's Delta data source.
spark.read().format("delta")
    .load("s3a://delta-laketest/spark_table");
```
Also, when I run this (the above code), it does not read the parquet files, even though the logs and checkpoint files point to the registered parquet files.
How do I read it using the delta connector?
2nd way:
Reading Parquet data (single-JVM):
```java
// Load the table's log via the standalone library and scan the current snapshot.
DeltaLog log = DeltaLog.forTable(conf, "s3a://delta-laketest/my_table");
CloseableIterator<RowRecord> rowItr = log.snapshot().open(); // or update().open();
```
The code above was able to read the parquet data files after they were committed to the Delta table via the standalone library.
Is this the correct way to read parquet data files using the delta connector when the data files were committed to the Delta table via the standalone library?
I would like to understand the difference between the two ways.
I am pretty new to Delta, please guide me through this. Thank you for your help! :)

bharat chaudhury
05/25/2023, 8:17 AM
pip3 install --user instead.
```
Collecting delta-spark==2.3.0
Cache entry deserialization failed, entry ignored
Using cached https://files.pythonhosted.org/packages/34/9e/c06f3b701de4746defc240fe7a2cc973f7bbfaa8fa17d57e045868c16925/delta_spark-2.3.0-py3-none-any.whl
Collecting pyspark<3.4.0,>=3.3.0 (from delta-spark==2.3.0)
Cache entry deserialization failed, entry ignored
Could not find a version that satisfies the requirement pyspark<3.4.0,>=3.3.0 (from delta-spark==2.3.0) (from versions: 2.1.2, 2.1.3, 2.2.0.post0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 2.4.8, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.1, 3.1.2, 3.1.3, 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4)
No matching distribution found for pyspark<3.4.0,>=3.3.0 (from delta-spark==2.3.0)
```
According to the docs.
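The resolver output above lists nothing newer than pyspark 3.2.4, which usually means the Python or pip behind pip3 is too old for the pyspark 3.3.x wheels that delta-spark 2.3.0 requires (the "Cache entry deserialization failed" lines also point to a very old pip). A quick diagnostic sketch, assuming that is the cause:
```python
# Check which interpreter pip3 is installing into; pyspark 3.3.x (required
# by delta-spark 2.3.0) needs Python 3.7 or newer.
import sys

print(sys.executable)
print(sys.version_info)
print("OK for pyspark 3.3.x:", sys.version_info >= (3, 7))
```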
Alber Tadrous
05/25/2023, 4:17 PM

Albert Wong
05/25/2023, 6:46 PM

Vigneshraja Palaniraj
05/25/2023, 8:56 PM

Kashyap Bhatt
05/25/2023, 9:03 PM
Does anyone have a working Dockerfile?
Our use case: we're running some e2e tests (written using pytest) in a Jenkins pipeline that create some Delta tables and perform merges etc. The Jenkins agent is a docker image with alpine as base (full Dockerfile attached). The problem is that snappy doesn't like the environment and we get the following error (more complete stack trace attached):
```
py4j.protocol.Py4JJavaError: An error occurred while calling o162.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 13) (k-ci-fgao-2ffxcdata-10-2franking-report-by-session-type-1-wx95c executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:642)
Caused by: java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:275)
at org.xerial.snappy.Snappy.compress(Snappy.java:156)
at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:78)
... 19 more
```
I've tried many workarounds (to make snappy use some other writable tmp folder), but none work. So I'm hoping to create a docker image where I can run simple Python code that creates a Delta table and writes to it.
Thank you!
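One workaround sketch, under the assumption that the failure is snappy's native library misbehaving on the musl-based alpine image: write with a different Parquet codec so the Snappy JNI path is never exercised.
```python
# Sketch: switch the Parquet codec away from snappy so org.xerial.snappy is
# never loaded. This works around the alpine/musl environment rather than
# fixing it; codec choice trades compression speed for compatibility.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-e2e-tests")
    .config("spark.sql.parquet.compression.codec", "gzip")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```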
Suraj Malthumkar
05/26/2023, 1:52 AM

Adishesh Kishore
05/26/2023, 6:12 AM
No delta log found for the Delta table at
Is there some way for me to specify the minReader and minWriter versions?
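The protocol versions can be set explicitly through table properties. A sketch, assuming a Spark session and an existing table (the path and version numbers are placeholders):
```python
# Sketch: pin delta.minReaderVersion / delta.minWriterVersion explicitly.
spark.sql("""
    ALTER TABLE delta.`/path/to/table` SET TBLPROPERTIES (
      'delta.minReaderVersion' = '2',
      'delta.minWriterVersion' = '5'
    )
""")
```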
Bryce Bartmann
05/26/2023, 6:47 AM

Roshan Punnoose
05/26/2023, 10:56 AM

Sagar Singh Rawal(TECH-BLR)
05/26/2023, 2:23 PM

Rosmery Valle Ortiz
05/26/2023, 2:37 PM

Afonso de Paula Feliciano
05/26/2023, 6:30 PM
Delta doesn't accept NullTypes in the schema for streaming writes.
Is there a way to write to Delta using a null type?
I did some searching but haven't found anything so far.
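A common workaround sketch (one option among several): cast any NullType columns to a concrete type before the streaming write. StringType below is an arbitrary choice.
```python
# Sketch: replace every NullType column with a still-all-null string column
# so the schema becomes acceptable to Delta.
from pyspark.sql.functions import col
from pyspark.sql.types import NullType, StringType

def cast_null_type_columns(df):
    for field in df.schema.fields:
        if isinstance(field.dataType, NullType):
            df = df.withColumn(field.name, col(field.name).cast(StringType()))
    return df
```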
Tuan Nguyen
05/27/2023, 12:12 AM
array<string>. Has anyone had this problem before? Running `SHOW COLUMNS IN table_name` in Athena shows a list of column names as expected.
Ahmad Dorri
05/27/2023, 4:39 PM

Suraj Malthumkar
05/30/2023, 1:37 AM

Hana BOUACILA
05/30/2023, 1:58 PM

ritwik singh
05/31/2023, 3:11 AM

Divyansh Jain
05/31/2023, 5:40 AM

Suraj Malthumkar
05/31/2023, 8:29 AM