Kashyap Bhatt

05/25/2023, 9:03 PM
Hello! Does anyone have a working setup where code running inside a Docker image creates a Delta table? If yes, what image do you use and can you share your Dockerfile? Our use case: we're running some e2e tests (written using ) in a Jenkins pipeline that create Delta tables, perform merges, etc. The Jenkins agent is a Docker image with  as base (full Dockerfile attached). The problem is that snappy doesn't like the environment and we get the following error (a more complete stack trace is attached):
py4j.protocol.Py4JJavaError: An error occurred while calling o162.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
   at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 13) (k-ci-fgao-2ffxcdata-10-2franking-report-by-session-type-1-wx95c executor driver): org.apache.spark.SparkException: Task failed while writing rows.
   at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:642)
Caused by: java.lang.IllegalArgumentException
   at java.nio.Buffer.limit(
   at org.xerial.snappy.Snappy.compress(
   at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(
   ... 19 more
I've tried many workarounds (to make snappy use some other writable-tmp-folder), but none work. So I'm hoping to create a docker image where I can write a simple python code that creates a delta table and writes to it. Thank you!
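One hedged workaround worth trying (a sketch, not verified in this environment): since the failure happens inside the snappy codec, switching the Parquet compression codec sidesteps snappy-java entirely. `spark.sql.parquet.compression.codec` is a standard Spark SQL config; `gzip` and `zstd` are among its supported values.

```
# spark-defaults.conf fragment -- or pass the same key via .config(...) on the SparkSession builder
spark.sql.parquet.compression.codec  gzip
```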


05/25/2023, 9:09 PM
if it is just about writing to the table, and not doing merges etc., you could use the deltalake package

Kashyap Bhatt

05/25/2023, 10:57 PM
Perhaps I explained the problem inadequately. When I run the following code in
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    # These two configs are the standard values from the Delta Lake docs;
    # the originals were stripped from the message by the chat export.
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame(data=[(1, 2)])
df.write.saveAsTable(name='some_table', format='delta')
with something like  in Jenkins on a Docker agent (whose Dockerfile is attached), I get the exception related to snappy/parquet. Exception details are attached as well.
I understand the basics of what delta is and so on...
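In case it helps, a sketch of a minimal image for this. The versions and base image are assumptions, not tested here; the main point is a glibc-based base (musl-based images like alpine are a common source of native-library trouble for snappy-java) plus a writable temp dir for the extracted native library:

```dockerfile
# Assumed versions: Spark 3.4.x pairs with delta-spark 2.4.x; adjust to match your stack.
FROM python:3.10-slim-bullseye

# Spark needs a JVM; default-jre-headless is enough for local-mode tests.
RUN apt-get update \
    && apt-get install -y --no-install-recommends default-jre-headless \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir pyspark==3.4.0 delta-spark==2.4.0

# snappy-java extracts its native library to a temp dir at runtime;
# point it somewhere guaranteed to be writable by the Jenkins user.
ENV JAVA_TOOL_OPTIONS="-Dorg.xerial.snappy.tempdir=/tmp"
```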