
Sumanth Bo3

05/10/2023, 6:35 AM
Hi all, I just started to explore Delta Lake and its capabilities recently. It looks very interesting and promising. My question: below is how I currently query my Delta table
df = spark.read.format("delta").load("/Users/saiteja/Downloads/new/")

df.createOrReplaceTempView("my_city_view")

result = spark.sql("SELECT LEFT(name, 1) AS letter, COUNT(*) AS num_cities FROM my_city_view GROUP BY letter ORDER BY letter ASC")
— Is there a more efficient and faster way to query Delta tables? — Currently we read the entire Delta table into df and create a temporary view on top of it.
• Are the df and the temp view stored in memory during this phase?
• If the number of rows in the Delta table is very high, wouldn't reading the table and creating a view on top of it cause high memory usage?
Another thing I observed: with
SELECT LEFT(name, 1) AS letter, COUNT(*) AS num_cities FROM my_city_view GROUP BY letter ORDER BY letter ASC
my first query takes a relatively long time to execute, whereas subsequent queries on the same
my_city_view
execute very fast. Here is an example:
No. of records: 150710
Time taken to run query 1: 135ms
Time taken to run query 2: 13ms
Time taken to run query 3: 15ms
Is it caching the records in memory or something?
• I tried the same with different datasets, but every time only the first query takes more time, whereas queries executed after that are 4 or 5 times faster.
Another question, not related to Delta Lake: the query result in the above code is stored in
result
which is a Spark DataFrame. When I try
result.show()
it takes a good 40-50 ms. Is there a better, faster way to export the results?

Joydeep Banik Roy

05/10/2023, 7:58 AM
Can you share an example of the other queries, which take 13 ms and 15 ms?

Sumanth Bo3

05/10/2023, 7:59 AM
Query 2:
SELECT country_name, COUNT(*) FROM my_city_view GROUP BY country_name
Query 3:
SELECT LEFT(state_name, 1) AS letter, COUNT(*) AS num_cities FROM my_city_view GROUP BY letter ORDER BY letter ASC
The dataset has the city name in a column called
name
, plus
state_name
(the state the city belongs to) and the country_name, for almost all cities in the world.
Here is a peek at the Parquet file used to create the Delta table; let me know if you need the entire dataset.

Joydeep Banik Roy

05/10/2023, 8:25 AM
So I tried it out: not the exact same query and dataset, but similar operations on a much larger dataset.
Here is a view of the DAG.
I ran only two queries.
Queries 1 and 2: there was not much difference in the runtime;
in fact the second one took more time.
The actual operations happen when you invoke show;
before that it is only a metadata operation.
Delta Lake reads all the file info and caches just the DeltaLog; you will find it under the Storage tab in the Spark UI.
Apart from that, nothing is cached.

Sumanth Bo3

05/10/2023, 10:39 AM
the actual operations happen when you invoke show
Oh, this is something I didn't know. So the query is executed only when we operate on the result Spark DataFrame, like
result.show()
Then does
result = spark.sql("SELECT LEFT(name, 1) AS letter, COUNT(*) AS num_cities FROM
just store the SQL in the variable result?
I checked: only after running
result.show()
do the jobs get created, like
Delta: Filtering files for query
and
showString at NativeMethodAccessorImpl.java:0
One more question @Joydeep Banik Roy: in the screenshot below we can see that
Delta: Filtering files for query
takes only 0.1 s, but converting the results to a string or to pandas takes much longer. Is there a better way to bring the conversion time down below the
Delta: Filtering files for query
time?
Also, does the time shown for
Delta: Filtering files for query
cover query execution, or just constructing the query?

Joydeep Banik Roy

05/11/2023, 2:00 AM
It just stores the query plan of the DataFrame.
Filtering files is still filtering on metadata, not executing the query.