
Álvaro José Baranoski

02/06/2023, 2:33 PM
Hello everyone, I am running some tests on my local machine with a Spark mini cluster (4 Docker containers: 3 workers and 1 master). I'm trying to create a simple Delta table by sending commands to the cluster with spark-submit. The table I'm trying to create is the very first one from the Getting Started section, like so:
data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
However, when executing the command, the log shows the following error:
[2023-02-06, 11:09:36 -03] {spark_submit.py:495} INFO - 23/02/06 11:09:36 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2) (172.19.0.3 executor 2): java.io.FileNotFoundException:
[2023-02-06, 11:09:36 -03] {spark_submit.py:495} INFO - File file:/home/alvaro/airflow/tmp/delta-table/_delta_log/00000000000000000000.json does not exist
[2023-02-06, 11:09:36 -03] {spark_submit.py:495} INFO - 
[2023-02-06, 11:09:36 -03] {spark_submit.py:495} INFO - It is possible the underlying files have been updated. You can explicitly invalidate
[2023-02-06, 11:09:36 -03] {spark_submit.py:495} INFO - the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
[2023-02-06, 11:09:36 -03] {spark_submit.py:495} INFO - recreating the Dataset/DataFrame involved.
What should I do to get the Delta table created on my local machine? Is it possible to do so with this kind of Spark cluster? If not, what is the best way to run Spark + Delta Lake + Airflow on my local machine? Thanks in advance!
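A minimal sketch of what the submitted script might look like, assuming the Delta configs are set on the session and the output path points at storage that every container can reach; the /shared/delta mount and the exact package coordinate are assumptions for illustration, not something stated in the thread. With several worker containers, a plain local path such as /tmp/delta-table exists separately inside each container, so the executors can end up looking for a _delta_log that was only ever written on the driver's filesystem.

# Sketch only: assumes all containers mount a common volume at /shared and that
# the Delta package coordinate matches the Spark/Scala build in use.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-getting-started")
    # Pull the Delta Lake jars onto the driver and executors.
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0")
    # Enable Delta's SQL extension and catalog.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

data = spark.range(0, 5)
# Write to storage every container can reach (shared volume, HDFS, S3, ...),
# not a path that exists only inside one container.
data.write.format("delta").save("/shared/delta/delta-table")

The same settings can also be passed on the spark-submit command line with --packages and --conf instead of in the builder.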

Sherlock Beard

02/06/2023, 2:37 PM
Can you try "/tmp/delta-table2"?
Random thought: your old table might be corrupted.

Álvaro José Baranoski

02/06/2023, 2:38 PM
Sure, I'll try it right now
Now that I've posted this question, I also thought it could be because there is more than one worker, so maybe some of them are trying to access files that don't exist yet.
I tried changing the name, but unfortunately the error persisted =/ I also forgot to mention: after the process fails, when I open my local files, there is some data inside the folder (this also happened with the original path name).
Setting the number of workers to 1 also didn't work =/
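One way to narrow this down, sketched under the assumption that the driver ran on the host where the files appeared: list the table directory and its _delta_log on that machine (the path below is the one from the FileNotFoundException above) and compare with what the worker containers see at the same path.

# Rough check on the driver host. If the parquet part files are present but
# _delta_log/ has no 00000000000000000000.json, the transaction log was never
# committed where the executors are looking for it.
import os

table_path = "/home/alvaro/airflow/tmp/delta-table"   # path from the stack trace
log_dir = os.path.join(table_path, "_delta_log")

print(sorted(os.listdir(table_path)))
print(sorted(os.listdir(log_dir)) if os.path.isdir(log_dir) else "_delta_log missing")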

Sherlock Beard

02/06/2023, 4:21 PM
Not sure. Can you try restarting Spark, or doing the REFRESH TABLE tableName? Also, can you try without Airflow? And what versions of Spark and Delta Lake are you using?
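The refresh mentioned in the warning is Spark's REFRESH TABLE command, which invalidates cached metadata and file listings for a catalog table; a minimal sketch is below, with my_table as a placeholder name. Since the table in this thread is path-based rather than registered in the catalog, it may not apply directly here.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # the already-configured session

# SQL form of the refresh suggested by the warning (my_table is a placeholder):
spark.sql("REFRESH TABLE my_table")
# Equivalent call through the catalog API:
spark.catalog.refreshTable("my_table")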

Álvaro José Baranoski

02/06/2023, 5:03 PM
I'll try the three suggestions you mentioned. As for the versions, I'm using Spark 3.3.1 and delta-core_2.12:2.1.0.
I tried suggestions 1 and 3, but neither worked.
It seems like it stops writing to the folder out of nowhere
What is this refresh tableName that you mentioned?
Looks like the problem is with the cluster I created. I tried saving the DataFrame in JSON format, both locally (output8) and via the cluster with spark-submit (output9). I'll investigate this further, but so far, thanks for the help!
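A minimal local-mode sanity check could confirm that conclusion: run it directly with python, without Airflow and without the standalone cluster, so that driver and executors share one filesystem. Sketch only; the output path /tmp/delta-table-local is arbitrary, and the package coordinate is the delta-core_2.12:2.1.0 mentioned above.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")   # single JVM, no standalone cluster involved
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write and immediately read back; if this works, Delta itself is fine and the
# remaining suspect is that the cluster's executors cannot see the driver's path.
spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta-table-local")
spark.read.format("delta").load("/tmp/delta-table-local").show()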