Suppose I pass three file paths to a Spark context, and each file has a header (schema) line in its first row. How can we skip these header lines?

val rdd=sc.textFile("file1,file2,file3")


Now, how can we skip header lines from this rdd?

1 Answer

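A simple way is to filter out the header lines when you read the files, based on what your header looks like:

val rdd = sc.textFile(X).filter(!_.startsWith("beginningOfYourHeader")).cache()

Because the filter runs on every line, it removes the header of each file, but it also drops any data line that happens to start with the same text.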
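If all three files share exactly the same header line (an assumption, the question does not say so), comparing each line against the full header is a little safer than a prefix match. The sketch below uses a plain Python list as a stand-in for the lines Spark would read; `make_header_filter` is a hypothetical helper name, and the comments show where the PySpark calls would go.

```python
def make_header_filter(header):
    """Return a predicate that is False only for lines identical to the header."""
    def keep(line):
        return line != header
    return keep

# Local stand-in for the lines of three files, each starting with the same header.
lines = ["id,name", "1,a", "id,name", "2,b", "id,name", "3,c"]

header = lines[0]  # with PySpark: header = rdd.first()
data = list(filter(make_header_filter(header), lines))
# with PySpark: data = rdd.filter(make_header_filter(header))
```

A data row that is byte-for-byte identical to the header would still be dropped, so this only fits data where that cannot happen.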

For Spark 2.0 and later, you can use SparkSession to do this in one line:

val spark = SparkSession.builder.config(conf).getOrCreate()

val dataFrame = spark.read.format("CSV").option("header","true").load(csvfilePath)
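For the three-file case in the question, a PySpark version of the same idea might look like the sketch below. The function name `read_csv_files` and the file names are hypothetical; `spark` is assumed to be an existing SparkSession, and the files are assumed to be CSV with a header line in each one.

```python
def read_csv_files(spark, paths):
    """Read several CSV files into one DataFrame, treating the first line as a header.

    `spark` is assumed to be an existing SparkSession; `paths` is any
    iterable of path strings.
    """
    # DataFrameReader.csv accepts either a single path or a list of paths.
    return spark.read.option("header", "true").csv(list(paths))

# Usage, given a running session:
# df = read_csv_files(spark, ["file1", "file2", "file3"])
```

The result is a DataFrame rather than an RDD of strings; `df.rdd` gives access to the underlying rows if an RDD is needed.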

I hope this answers your question!

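In PySpark, another approach is to drop the first line of the first partition. Note that this removes only the line at the start of partition 0, so when several files are read into one RDD, the headers of the other files remain:

from itertools import islice

rdd.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it
)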
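To see what the partition-index approach does with several files, here is a small local simulation. It assumes each file ends up in its own partition (small files usually do, but that is an assumption), and it uses plain Python lists in place of an RDD; `skip_first_of_partition_zero` is just a named version of the lambda.

```python
from itertools import islice

def skip_first_of_partition_zero(idx, it):
    """Drop the first element of partition 0; pass other partitions through."""
    return islice(it, 1, None) if idx == 0 else it

# Stand-in for an RDD built from three files, one partition per file.
partitions = [
    ["id,name", "1,a"],
    ["id,name", "2,b"],
    ["id,name", "3,c"],
]

# Apply the function the way mapPartitionsWithIndex would, then flatten.
result = [
    line
    for idx, part in enumerate(partitions)
    for line in skip_first_of_partition_zero(idx, iter(part))
]
# result == ["1,a", "id,name", "2,b", "id,name", "3,c"]
```

Only the first header is gone. If every file shares an identical header line, filtering on equality with that line removes all of them.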
