Sumanth Bo3
05/10/2023, 6:35 AM
df = spark.read.format("delta").load("/Users/saiteja/Downloads/new/")
df.createOrReplaceTempView("my_city_view")
result = spark.sql("SELECT LEFT(name, 1) AS letter, COUNT(*) AS num_cities FROM my_city_view GROUP BY letter ORDER BY letter ASC")
— Is there a more efficient and faster way to query Delta tables?
— Currently we read the entire Delta table into df and create a temporary view on top of it.
• Are the df and the temp view stored in memory during this phase?
• If the number of rows in the Delta table is very high, wouldn't reading the table and creating a view on top of it cause high memory usage?
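A point worth noting here: neither the load nor the view materializes data, because Spark transformations are lazy, and you can also skip the temp view entirely by addressing the Delta path in SQL. A minimal sketch, assuming `spark` is an active SparkSession with delta-spark configured (the path is the one from the snippet above):

```python
# Path taken from the snippet above; adjust for your environment.
DELTA_PATH = "/Users/saiteja/Downloads/new/"

# Spark SQL can address a Delta table in place via delta.`<path>`,
# so no DataFrame variable or temp view is required.
query = (
    "SELECT LEFT(name, 1) AS letter, COUNT(*) AS num_cities "
    f"FROM delta.`{DELTA_PATH}` "
    "GROUP BY letter ORDER BY letter ASC"
)

# spark.sql(query) is lazy: it only builds a logical plan and reads no data.
# Files are scanned only when an action such as .show() or .collect() runs,
# so the full table is never pulled into driver memory just by "reading" it.
# result = spark.sql(query)
# result.show()
```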
Another thing I observed: with
SELECT LEFT(name, 1) AS letter, COUNT(*) AS num_cities FROM my_city_view GROUP BY letter ORDER BY letter ASC
my first query takes relatively long to execute, whereas subsequent queries on the same my_city_view run very fast.
Here is an example:
No. of records: 150710
Time taken to run query 1: 135ms
Time taken to run query 2: 13ms
Time taken to run query 3: 15ms
Is it caching the records in memory or something?
• I tried the same for different datasets, but every time only the first query takes more time, while queries executed after it are 4 or 5 times faster.
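A likely explanation for this pattern (stated as an assumption, not confirmed in the thread): on the first query Spark has to list files and build a snapshot of the Delta transaction log, and the JVM is still warming up; the snapshot is cached per table, so later queries on the same view skip that work. Spark does not cache the rows themselves unless you opt in with df.cache(). A small timing helper (the name `timed` is hypothetical) that would reproduce the per-query numbers above:

```python
import time

def timed(label, fn):
    """Run fn once and print wall-clock time in ms (hypothetical helper)."""
    start = time.perf_counter()
    out = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"Time taken to run {label}: {elapsed_ms:.0f}ms")
    return out

# Sketch of usage against the view above ('spark' and QUERY assumed in scope):
# timed("query 1", lambda: spark.sql(QUERY).collect())  # cold: log snapshot + JVM warm-up
# timed("query 2", lambda: spark.sql(QUERY).collect())  # warm: snapshot reused
# To cache the rows themselves (not automatic), opt in explicitly:
# df.cache(); df.count()  # materializes the data into executor memory
```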
Another question, not related to Delta Lake:
— The Delta Lake query result in the above code is stored in result, which is a Spark DataFrame. When I try result.show(), it takes a good 40-50 ms. Is there a better, faster way to export the results?

Joydeep Banik Roy
05/10/2023, 7:58 AM

Sumanth Bo3
05/10/2023, 7:59 AM
query 2:
select country_name, count(*) FROM my_city_view GROUP BY country_name
query 3:
SELECT LEFT(state_name, 1) AS letter, COUNT(*) AS num_cities FROM my_city_view GROUP BY letter ORDER BY letter ASC
The table has columns name, state_name (the state to which the city belongs), and country_name, for almost all cities in the world.

Joydeep Banik Roy
05/10/2023, 8:25 AM

Sumanth Bo3
05/10/2023, 10:39 AM
"the actual operations happen when you invoke show"
Oh, this is something I didn't know. So the query is executed only when we operate on the result Spark df, like result.show()? Then does
result = spark.sql("SELECT LEFT(name, 1) AS letter, COUNT(*) AS num_cities FROM my_city_view GROUP BY letter ORDER BY letter ASC")
just store the SQL in the variable result? When result.show() runs, the jobs that get created are
Delta: Filtering files for query
and showString at NativeMethodAccessorImpl.java:0.
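The lazy/eager split described above can be made concrete. A plain Python generator behaves the same way: building it does no work, and the "scan" only happens when you consume it. This is a sketch by analogy; the Spark equivalents are in the comments, and the tuple values are made up:

```python
events = []

def scan_delta_files():
    # Stand-in for Spark's file scan: nothing in this body runs until
    # iteration starts, just as spark.sql() runs nothing until an action.
    events.append("Delta: Filtering files for query")
    yield ("A", 42)

plan = scan_delta_files()   # lazy: generator created, no scan yet, events == []
rows = list(plan)           # the "action": only now does the scan run

# Spark equivalent ('spark' and the view assumed from the thread):
# result = spark.sql("SELECT ...")  # builds a logical plan only; no jobs yet
# result.explain()                  # prints the plan; still no I/O
# result.show()                     # action: the 'Delta: Filtering files for
#                                   # query' and 'showString' jobs appear here
```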
One more question @Joydeep Banik Roy: in the screenshot below, we can see that Delta: Filtering files for query takes only 0.1 s, but converting the results to a string or to pandas takes much longer. Is there a better way to get the conversion time below that of Delta: Filtering files for query? Also, is Delta: Filtering files for query the actual query execution, or just query construction?

Joydeep Banik Roy
05/11/2023, 2:00 AM
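On the conversion cost raised above: show() pays for query execution plus rendering rows into a text table (the showString job), and the default toPandas() serializes row by row unless Arrow is enabled. A hedged sketch of alternatives, assuming `result` and `spark` are the objects from the thread (the Arrow config key exists in PySpark 3.x):

```python
# show() pays twice: query execution plus rendering rows as a text table.
# If only the values are needed, skip the rendering step.
rows = [("A", 42), ("B", 7)]                 # stand-in for result.collect()
letters = [letter for letter, _count in rows]

# With the real DataFrame from the thread ('result', 'spark' in scope):
# rows = result.collect()                    # raw Row objects, no table rendering
# head = result.limit(20).collect()          # cap what crosses to the driver
# spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# pdf = result.toPandas()                    # Arrow batches cut serialization cost
```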