in Big Data Hadoop & Spark by (11.5k points)

How do you drop rows from an RDD in PySpark? Particularly the first row, since that tends to contain column names in my datasets. From perusing the API, I can't seem to find an easy way to do this. Of course I could do this via Bash / HDFS, but I just want to know if it can be done from within PySpark.

1 Answer

by (32.5k points)

As far as I know, there's no 'easy' way to do this.

But the following approach should do the trick:

header = data.first()

rows = data.filter(lambda line: line != header)
