04/28/2023, 5:07 AM
I have a lot of data files (~PB scale) in compressed CSV format. The data files are organized by year/mm/dd/hh/mm/. I am trying to create an external table to query these data using the syntax below, and it is taking a long time to execute. Any help to make it more efficient? The reason I am doing this is that I want to create a table and share it with the management team so that they can query those CSVs using Spark SQL. Thank you.

CREATE EXTERNAL TABLE Test (
  TIMESTAMP String,
  Message String,
  EventName String,
  Device String
)
USING CSV
OPTIONS('path'='/2023/')
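[Editor's note: one likely cause of the slowness is that Spark has to list every file under the path up front, and a plain /yyyy/mm/dd/ layout is not hive-style (year=2023/month=04/...), so Spark cannot discover partitions automatically. A rough sketch of registering the directories as explicit partitions instead, assuming Spark SQL DDL; the partition column names here are hypothetical:]

CREATE EXTERNAL TABLE Test (
  TIMESTAMP String,
  Message String,
  EventName String,
  Device String,
  year INT,
  month INT,
  day INT
)
USING CSV
PARTITIONED BY (year, month, day)
LOCATION '/2023/';

-- Because the directories are plain /yyyy/mm/dd/ rather than year=yyyy/month=mm/dd=dd/,
-- each partition location has to be registered explicitly, e.g.:
ALTER TABLE Test ADD PARTITION (year=2023, month=4, day=28)
  LOCATION '/2023/04/28/';

With partitions registered, a query that filters on year/month/day only lists and reads the matching directories instead of scanning the whole tree.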

Samat Kurmanov

04/28/2023, 10:12 AM
is it on AWS or what?

JosephK (exDatabricks)

04/28/2023, 11:24 AM
First of all, this has nothing to do with delta io. Second, do not do this. Using CSVs is going to be a mess, and you'll be much better served using Delta or even Parquet.
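[Editor's note: a minimal sketch of the one-time conversion this advice suggests, in Spark SQL. The table names, the `recursiveFileLookup` option, and the path are assumptions for illustration, not from the original thread:]

-- Register the raw CSVs once, with an explicit schema so Spark skips schema inference:
CREATE TABLE events_csv_raw (
  TIMESTAMP String,
  Message String,
  EventName String,
  Device String
)
USING CSV
OPTIONS('path'='/2023/', 'recursiveFileLookup'='true');

-- Materialize them as a Delta table that the management team can query directly:
CREATE TABLE events_delta
USING DELTA
AS SELECT * FROM events_csv_raw;

After the conversion, queries hit compacted columnar files with statistics instead of re-parsing compressed CSVs on every read.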

Winston LAI

04/28/2023, 12:57 PM
Hi Payra! Do you have any senior data engineers or experienced staff at your company who can help you? You may need some guidance or self-learning to pick up knowledge of SQL, database management, Delta Lake, partitioning, and a basic understanding of cloud services. The story won't end here after you manage to create the table. Later you may encounter performance issues if the table is not properly defined, and then you may have to start over again. I suggest you seek help from senior staff or learn more on your own before rushing to create the table.
💯 1