Robin Moffatt

05/12/2023, 3:06 PM
This may be more of a Spark/Notebooks question… but hopefully some bright folk here can help me anyway 😉 (I've been told flattery always works)
I'm using a single Jupyter notebook to test out a series of steps for loading a table, whilst mucking about with the underlying files in the background. If I do the following pseudo-code, I get an error (which I expect and understand why):
1. In a Jupyter notebook, create a SparkSession
2. Read CSV file, write it to Delta table on S3
3. Read the Delta table - works fine
4. Append more data to the Delta table
5. Read the Delta table - works fine
6. Reset the underlying files to their state after step #2
7. Read the Delta table - as expected, this throws an error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 14.0 failed 1 times, most recent failure: Lost task 2.0 in stage 14.0 (TID 66) (1e1dc78cf259 executor driver): java.io.FileNotFoundException: 
No such file or directory: <s3a://example/data_load/raw/soil/_delta_log/00000000000000000001.json>

It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
recreating the Dataset/DataFrame involved.
So, on to my question: I've found that if I restart the Jupyter kernel I can re-query the delta table successfully. However, I also thought I'd be able to re-create the spark session:
spark.stop()
spark = SparkSession.builder.appName […]
to the same effect - but doing this I still get the FileNotFound exception, suggesting that something's cached somewhere. I've tried the suggestion in the error message, but this also throws the FileNotFound exception:
refresh table <delta.s3a://example/data_load/my/table>
Is there a way to programmatically flush whatever's been cached, without the "turn it off and turn it back on again" approach of restarting the kernel?

Sumanth Bo3

05/13/2023, 7:07 PM
spark.catalog.clearCache()
I remember reading something like this, but I don't have the actual source. Try this out.
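A minimal sketch of how that suggestion might be tried from the notebook, without restarting the kernel. The helper name `reread_delta` is hypothetical, and pairing `clearCache()` with `spark.catalog.refreshByPath()` is an assumption rather than something confirmed in this thread; neither call is guaranteed to clear whatever state is causing the FileNotFound error.

```python
def reread_delta(spark, path):
    """Drop Spark's cached data for `path`, then read the Delta table again.

    `spark` is an existing SparkSession; `path` is the table location.
    Hypothetical helper: whether this avoids the FileNotFoundException
    in the scenario above is an assumption, not a confirmed fix.
    """
    # Remove all cached tables/DataFrames from Spark's in-memory cache.
    spark.catalog.clearCache()
    # Invalidate cached data and metadata for DataFrames that use this path.
    spark.catalog.refreshByPath(path)
    # Build a fresh DataFrame instead of reusing one created before the reset.
    return spark.read.format("delta").load(path)


# Usage inside the notebook (requires a live SparkSession named `spark`):
# df = reread_delta(spark, "s3a://example/data_load/raw/soil")
# df.show()
```

If the error persists after this, restarting the kernel remains the only workaround reported in the thread.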

Robin Moffatt

05/18/2023, 3:27 PM
thanks, I'll try it