
Satyam Singh

03/18/2023, 2:30 PM
In Databricks, I am using PySpark to read a CSV. I need to skip the first 2 rows; my CSV header is at the 3rd row. I could not find a way to do it in PySpark (I can only see this skiprows option in Pandas). Really appreciate it if anybody could help. Thanks.

Mike M

03/18/2023, 2:36 PM
Do the first 2 rows match the schema or start with a specific character? You can set the `comment` option to the starting character. You can also manually define the schema and set the `mode` option to `DROPMALFORMED`.

Satyam Singh

03/18/2023, 3:08 PM
Thanks for your reply. The first 2 rows don't match the schema, and they also don't start with a specific character 😞. `DROPMALFORMED` ignores whole corrupted records, and this mode is unsupported in the CSV built-in functions.

Chris

03/18/2023, 4:50 PM
Doesn’t sound like you’re reading a CSV.. 😬

Mike M

03/18/2023, 5:32 PM
If you define the schema manually, DROPMALFORMED should remove the record in question. You can also specify a different character as the separator and load each row into a single column. Afterwards, filter out what you don't want and manually split the columns.
👍 1

JosephK (exDatabricks)

03/18/2023, 6:58 PM
not a delta lake question

Jim Hibbard

03/26/2023, 8:24 AM
It's a little questionable, but you could read the CSVs as just lines of plain text and remove the first two lines from each file before parsing the columns in a second pass.