Kashyap Bhatt
05/25/2023, 9:03 PM
(attachment: Dockerfile)
Our use case: we're running some e2e tests (written using pytest) in a Jenkins pipeline that create some Delta tables, perform merges, etc. The Jenkins agent is a Docker image with Alpine as the base (full Dockerfile attached). The problem is that snappy doesn't like this environment, and we get the following error (more complete stack trace attached):
py4j.protocol.Py4JJavaError: An error occurred while calling o162.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 13) (k-ci-fgao-2ffxcdata-10-2franking-report-by-session-type-1-wx95c executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:642)
Caused by: java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:275)
at org.xerial.snappy.Snappy.compress(Snappy.java:156)
at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:78)
... 19 more
I've tried many workarounds (e.g. making snappy use some other writable temp folder), but none work. So I'm hoping to create a Docker image in which I can write a simple Python script that creates a Delta table and writes to it.
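The temp-folder workarounds mentioned above can be sketched as follows. `-Dorg.xerial.snappy.tempdir` is a real snappy-java system property that relocates native-library extraction, and the two `extraJavaOptions` keys are standard Spark confs; the directory path is an assumption, and this is not confirmed to resolve this particular stack trace:

```python
# Sketch: steer snappy-java's native-library extraction to a directory
# that is writable (and not mounted noexec). The path below is an
# assumed-writable location on the Jenkins agent.
snappy_opts = "-Dorg.xerial.snappy.tempdir=/home/jenkins/snappy-tmp"

# Standard Spark confs for passing JVM options to driver and executors;
# each entry could be applied with builder.config(key, value).
spark_conf = {
    "spark.driver.extraJavaOptions": snappy_opts,
    "spark.executor.extraJavaOptions": snappy_opts,
}

for key, value in spark_conf.items():
    print(f"--conf {key}={value}")
```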
Thank you!
Robert
05/25/2023, 9:09 PM
Kashyap Bhatt
05/25/2023, 10:57 PM
(attachment: test.py)
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName('LocalDevSparkSession')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame(data=[(1, 2)])
df.write.saveAsTable(name='some_table', format='delta')
When I run the above with something like spark-submit in Jenkins on a Docker agent (whose Dockerfile is attached), I get the exception related to snappy/parquet. Exception details attached as well.
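A hypothetical way to sidestep snappy entirely while debugging is to switch the Parquet writer to a different codec via the standard `spark.sql.parquet.compression.codec` Spark SQL conf; `gzip` is one valid value. A minimal sketch, applied the same way as the other builder configs in test.py:

```python
# Sketch: write Parquet with gzip instead of snappy, so the native
# snappy-java code path is never exercised during the test run.
codec_conf = ("spark.sql.parquet.compression.codec", "gzip")

# In test.py this would become another builder line:
#   .config('spark.sql.parquet.compression.codec', 'gzip')
print(f"--conf {codec_conf[0]}={codec_conf[1]}")
```

This only avoids the failure rather than fixing the environment, but it can confirm whether snappy is the sole culprit.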