Martin
07/07/2023, 4:28 PM
df.cache() produces dirty reads on Delta
I noticed some (to me) strange behavior and I'd like to ask whether this is by design or a bug:

location = 'some arbitrary location'
# create delta table at location containing *10* rows
spark.range(10).toDF("id").write.format("delta").mode("overwrite").option("path", location).save()
# reading from delta table and caching the dataframe
df_init = spark.read.format('delta').load(location)
df_init.cache()
df_init.count()
>>> 10
# overwriting the delta table with *100* rows
spark.range(100).toDF("id").write.format("delta").mode("overwrite").option("path", location).save()
# now checking the number of rows again - with a brand-new DataFrame, not the cached df_init
spark.read.format('delta').load(location).count()
>>> 10
Is this by design? Is this because Spark "remembers" that the result of the execution plan for spark.read.format('delta').load(location) has been cached and uses it instead of re-executing the plan - and therefore does not notice that the underlying Delta table has changed?
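
One way to check that hypothesis (a sketch reusing the variables from the snippet above; nothing here goes beyond stock PySpark): explain() on a fresh read shows the cached relation being substituted into the plan.

# fresh read of the same path - its analyzed plan matches the cached one
spark.read.format('delta').load(location).explain()
# the physical plan contains an InMemoryTableScan node: Spark swapped in the
# cached relation during planning, so the scan never re-reads the table data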

Dominique Brezinski
07/07/2023, 4:42 PM
.cache() is insidious and should rarely be used--it keys the cache by query plan, and the value is the cached result. This means anybody on the same cluster who invokes the same query plan will receive the cached results, likely unaware that this is what they are getting. The cache lookup happens before the source is evaluated in any other way. Most users are not aware of this, and more often than not it causes more problems than it solves. DataFrame caching is working as designed, the behavior is not specific to Delta, and this is a brilliant example of why many of us would prefer our users never use .cache()/.persist(). I have heard Michael Armbrust say his goal is that users never have to think about caching--that the system just performs optimally without it. There are a few cases where it is still warranted, but only experts should be making that decision.
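
For anyone who needs fresh results after hitting this, a minimal sketch of the usual escape hatches, continuing Martin's snippet above (all three calls are standard PySpark APIs):

# any one of these invalidates the stale cache entry:
df_init.unpersist()                    # drop this DataFrame's cached data
spark.catalog.clearCache()             # or: drop everything cached in the application
spark.catalog.refreshByPath(location)  # or: invalidate cached entries under this path

# a fresh read now re-executes the plan against the current table state
spark.read.format('delta').load(location).count()
>>> 100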

Martin
07/07/2023, 4:50 PM

JosephK (exDatabricks)
07/07/2023, 7:07 PM

Jacek
07/22/2023, 7:33 PM
SparkContext = RDDs are tracked by ID, and no other user can access blocks of your own SC. I must be missing something obvious.

Dominique Brezinski
07/22/2023, 7:44 PM

JosephK (exDatabricks)
07/24/2023, 11:14 AM

Jacek
07/25/2023, 11:00 AM
SparkContext-bound, so unless it's shared, there's an isolated temporary directory for data blocks. I'll have to refresh my memory on this and be back with more concrete examples next time asked. /me Back to studying MERGE INTO 🙂
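
Both points can be true at once: RDD blocks are indeed scoped to a single SparkContext, but Spark SQL's plan cache (the CacheManager) lives in the SharedState that every SparkSession on that context reuses - including sessions belonging to different users of a shared cluster. A sketch to check this, assuming a stock PySpark session named spark and the location from the thread above:

# newSession() gives a fresh SparkSession on the same SparkContext,
# sharing the same SharedState and therefore the same CacheManager
other = spark.newSession()

df1 = spark.read.format('delta').load(location)
df1.cache()
df1.count()  # materializes the cache entry, keyed by the canonicalized plan

# `other` never called .cache(), yet its identical plan hits the shared cache:
other.read.format('delta').load(location).explain()
# the physical plan shows InMemoryTableScan instead of a fresh Delta scan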