
Kashyap Bhatt

05/25/2023, 9:03 PM
Hello! Does anyone have a working setup where code running inside a Docker image creates a Delta table? If yes, what image do you use, and can you share your `Dockerfile`? Our use case: we run some e2e tests (written with `pytest`) in a Jenkins pipeline that create Delta tables, perform merges, etc. The Jenkins agent is a Docker image with `alpine` as its base (full Dockerfile attached). The problem is that snappy doesn't like the environment and we get the following error (fuller stack trace attached):
py4j.protocol.Py4JJavaError: An error occurred while calling o162.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
   at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 13) (k-ci-fgao-2ffxcdata-10-2franking-report-by-session-type-1-wx95c executor driver): org.apache.spark.SparkException: Task failed while writing rows.
   at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:642)
Caused by: java.lang.IllegalArgumentException
   at java.nio.Buffer.limit(Buffer.java:275)
   at org.xerial.snappy.Snappy.compress(Snappy.java:156)
   at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:78)
   ... 19 more
I've tried many workarounds (to make snappy use some other writable-tmp-folder), but none work. So I'm hoping to create a docker image where I can write a simple python code that creates a delta table and writes to it. Thank you!
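For what it's worth, one way around the snappy failure is to sidestep alpine entirely: snappy-java ships a native `.so` built against glibc, while alpine uses musl. A minimal Dockerfile sketch based on a Debian image (tags and package versions here are assumptions, not tested against this pipeline):

```dockerfile
# Sketch: Debian-based image so snappy-java's bundled glibc native library
# loads cleanly (alpine uses musl, which that .so is not built against).
FROM python:3.10-slim-bullseye

# PySpark needs a JRE; headless is enough for batch jobs.
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-17-jre-headless \
    && rm -rf /var/lib/apt/lists/*

# Versions are illustrative -- pin pyspark/delta-spark to a compatible pair.
RUN pip install --no-cache-dir pyspark==3.4.1 delta-spark==2.4.0 pytest

WORKDIR /app
COPY . /app
```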

Robert

05/25/2023, 9:09 PM
If it is just about writing to the table, and not doing merges etc., you could use the deltalake package: https://pypi.org/project/deltalake/

Kashyap Bhatt

05/25/2023, 10:57 PM
Perhaps I explained the problem inadequately. When I run the following code in `test.py`:
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName('LocalDevSparkSession')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame(data=[(1, 2)])
df.write.saveAsTable(name='some_table', format='delta')
with something like `spark-submit` in Jenkins on a Docker agent (whose `Dockerfile` is attached), I get the snappy/parquet exception. Exception details are attached as well.
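For reference, a sketch of the kind of invocation meant here. The Delta artifact and version are assumptions and must match the Spark/Scala build in the image; note that `configure_spark_with_delta_pip` can also resolve the jars at session start, in which case `--packages` is redundant:

```
# delta-core version must match the pyspark version installed in the image
spark-submit \
  --packages io.delta:delta-core_2.12:2.4.0 \
  test.py
```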
I understand the basics of what delta is and so on...