Hello everyone!
I have a question about how PySpark works with a Spark cluster when loading a Delta table:
I have a 6 TB Delta table that I'm loading into a DataFrame for analysis.
How is it possible for Spark to load a 6 TB Delta table into a DataFrame? It can't fit 6 TB in memory at once, can it? I'm guessing it partitions the data somehow and moves it through memory in a kind of queue.
Can you help me understand clearly how Spark manages this?
Big thanks in advance!
JosephK (exDatabricks)
03/30/2023, 3:28 PM
Depending on the query, Spark may not need to read the entire table. Especially with Delta, it uses per-file statistics to skip files that can't match your query. And if your query drops a column (or just doesn't select it), Spark won't scan that column or read it into memory.
If you really do need a full scan of all 6 TB, Spark just breaks the table into parts and processes them a chunk at a time rather than holding everything in memory at once.
👍 1
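A minimal PySpark sketch of what JosephK describes; the table path, column names, and filter values are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Defining the read is lazy -- nothing is pulled into memory yet.
df = spark.read.format("delta").load("/data/events")  # hypothetical 6 TB table

# Column pruning: only the selected columns are read from the underlying Parquet files.
# File skipping: Delta's per-file min/max statistics let Spark skip files whose
# 'event_date' range cannot match the predicate.
result = (
    df.select("event_date", "user_id", "amount")
      .where(F.col("event_date") == "2023-03-01")
      .groupBy("user_id")
      .agg(F.sum("amount").alias("total_amount"))
)

# Only an action like this triggers execution; each task then processes
# one partition of the input at a time.
result.show()
```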
Jim Hibbard
03/30/2023, 3:43 PM
This is one of the strengths of Spark + Delta Lake! You're able to analyze your dataset almost as though it were one big in-memory DataFrame. Behind the scenes, Spark uses Delta Lake's metadata to scan only the files your operation actually needs, and it selectively brings data in and out of memory as required. Essentially, if your cluster is sized appropriately for your workload, this is something you don't have to think about during your analysis.
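To see the lazy, metadata-driven behavior Jim mentions, you can inspect the plan Spark builds before it reads anything; a small sketch, with paths and columns again hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("delta").load("/data/events")  # hypothetical table

# Transformations only build a query plan; no data moves yet.
filtered = df.select("user_id", "amount").where("amount > 100")

# The physical plan shows the pruned read schema and the filters pushed down
# to the scan, before any partition is actually read.
filtered.explain(mode="formatted")

# When an action finally runs, each executor task handles one partition of the
# scan at a time, spilling to disk if a stage needs more memory than it has.
filtered.count()
```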
Matthew Powers
03/30/2023, 8:30 PM
Yep, another beautiful part of Delta Lake is that it contains file-level metadata statistics that Spark can use to sometimes skip entire files. Spark will try to load the minimal amount of data when running queries.
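If you're curious where those statistics live, they're plain JSON in the table's transaction log. A quick sketch (the table path is hypothetical) that surfaces the per-file stats Spark uses for skipping:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The Delta transaction log sits under _delta_log/ as JSON commit files.
# Each 'add' action records a data file's path plus a 'stats' string containing
# numRecords and per-column min/max values -- the inputs to file skipping.
log = spark.read.json("/data/events/_delta_log/*.json")  # hypothetical table path
(log.filter("add IS NOT NULL")
    .select("add.path", "add.stats")
    .show(truncate=False))
```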
Martin Beaussart
03/31/2023, 7:36 AM
Whoa, that's really interesting, and now I understand the power of Spark much better. Really impressive!
Thank you very much for the answers!