Martin 07/07/2023, 4:28 PM
Caching a DataFrame produces dirty reads on Delta.
I noticed some (to me) strange behavior and I'd like to ask whether this is by design or a bug:
```python
# assumes an existing SparkSession named `spark` (e.g. in a notebook)
location = 'some arbitrary location'

# create delta table at location containing *10* rows
spark.range(10).toDF("id").write.format("delta").mode("overwrite").option("path", location).save()

# reading from delta table and caching the dataframe
df_init = spark.read.format('delta').load(location)
df_init.cache()
df_init.count()
# >>> 10

# overwriting the delta table with *100* rows
spark.range(100).toDF("id").write.format("delta").mode("overwrite").option("path", location).save()

# now checking the number of rows again
spark.read.format('delta').load(location).count()
# >>> 10
```
Is this by design? Is this because Spark "remembers" that the result of the execution plan for this read has been cached and uses it instead of re-executing the plan, and therefore does not notice that the underlying Delta table has changed?
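One way to check whether a cached plan is responsible for the stale count is to drop the cache entry and read again. The following is only a sketch: `count_without_cache` is a hypothetical helper name, `spark` is assumed to be an existing SparkSession, and `cached_df` is assumed to be the DataFrame that was cached earlier. It relies on `DataFrame.unpersist()` and `spark.catalog.clearCache()` from PySpark.

```python
def count_without_cache(spark, location, cached_df=None):
    """Count rows at `location` after dropping cached data.

    Hypothetical helper for illustration. `spark` is assumed to be an
    existing SparkSession; `cached_df` is the DataFrame cached earlier,
    if it is still at hand.
    """
    if cached_df is not None:
        # Drop only the cache entry that belongs to this DataFrame.
        cached_df.unpersist()
    else:
        # No handle on the cached DataFrame: drop every cached entry
        # in this session.
        spark.catalog.clearCache()
    # With the cache entry gone, this read has to go back to the source.
    return spark.read.format("delta").load(location).count()
```

Calling `count_without_cache(spark, location, df_init)` after the overwrite should then reflect the current contents of the table rather than the cached result.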
Dominique Brezinski 07/07/2023, 4:42 PM
DataFrame caching is insidious and should rarely be used: it keys the cache by query plan, and the value is the cached result. This means anybody on the same cluster who invokes the same query plan will receive the cached results, likely unaware that this is what they are getting. The cache lookup happens before the source is evaluated in any other way. Most users are not aware of this, and more often than not it causes more problems than it solves. DataFrame caching is working as designed, the behavior is not specific to Delta, and this is a brilliant example of why many of us would prefer our users never use it. I have heard Michael Armbrust say his goal is that users never have to think about caching, that the system just performs optimally without it. There are a few cases where it is still warranted, but only experts should be making that decision.
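The mechanism described above (a cache keyed by plan, consulted before the source is evaluated) can be illustrated with a toy model in plain Python. This is not Spark's cache manager; `ToyCluster` and all of its methods are made up for illustration, and a "plan" is reduced to the tuple `("read", location)`.

```python
# Toy model only: this is NOT Spark's implementation, just the idea of a
# shared cache keyed by query plan.

class ToyCluster:
    def __init__(self):
        self.tables = {}      # location -> list of rows (the "source")
        self.plan_cache = {}  # plan key -> cached result

    def write(self, location, rows):
        """Overwrite the table; the plan cache is left untouched."""
        self.tables[location] = list(rows)

    def cache(self, location):
        """Materialize the current result under the plan key."""
        self.plan_cache[("read", location)] = list(self.tables[location])

    def uncache(self, location):
        """Drop the cached result for this plan, if any."""
        self.plan_cache.pop(("read", location), None)

    def read(self, location):
        """Return the rows for the plan 'read <location>'."""
        plan = ("read", location)
        # The cache lookup happens before the source is looked at.
        if plan in self.plan_cache:
            return self.plan_cache[plan]
        return self.tables[location]


cluster = ToyCluster()
cluster.write("/tmp/t", range(10))
cluster.cache("/tmp/t")
count_before = len(cluster.read("/tmp/t"))  # 10

cluster.write("/tmp/t", range(100))
count_stale = len(cluster.read("/tmp/t"))   # still 10, served from the cache

cluster.uncache("/tmp/t")
count_fresh = len(cluster.read("/tmp/t"))   # 100, read from the source
```

Any caller that issues the same `read` against the same `ToyCluster` gets the cached rows, whether or not that caller was the one who cached them, which is the surprise described in the message above.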
Martin 07/07/2023, 4:50 PM
JosephK (exDatabricks) 07/07/2023, 7:07 PM
Jacek 07/22/2023, 7:33 PM
= RDDs are tracked by ID and no other user can access blocks of your own SC. I must be missing something obvious.
Dominique Brezinski 07/22/2023, 7:44 PM
JosephK (exDatabricks) 07/24/2023, 11:14 AM
Jacek 07/25/2023, 11:00 AM
-bound so unless it’s shared there’s an isolated temporary directory for data blocks. I’ll have to refresh my memory on this and be back with more concrete examples next time asked. /me Back to studying MERGE INTO 😉