in Big Data Hadoop & Spark by (11.5k points)

Suppose I pass three file paths to a Spark context, and each file has its schema in the first row. How can I skip these header lines?

val rdd = sc.textFile("file1,file2,file3")


Now, how can I skip the header lines in this RDD?

1 Answer


A simple way is to filter the initial read based on what your header looks like:

val rdd = sc.textFile(X).filter(!_.startsWith("beginningOfYourHeader")).cache()
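The same idea can be sketched in Python on plain lists, so it can be tried without a cluster. This is only an illustration: the name `is_header`, the prefix `id,name` and the sample lines are made up, and it assumes no data line starts with the same text as the header. With a real PySpark RDD the predicate would be passed as `rdd.filter(lambda line: not is_header(line))`.

```python
# Hypothetical header prefix; replace it with the start of your own header row.
HEADER_PREFIX = "id,name"


def is_header(line, prefix=HEADER_PREFIX):
    """Return True when a line looks like a header row."""
    return line.startswith(prefix)


# Lines as they might look after reading three files that each begin
# with the same header row (sample data, invented for this sketch).
lines = [
    "id,name", "1,alice",
    "id,name", "2,bob",
    "id,name", "3,carol",
]

# Keep every line that is not a header, whichever file it came from.
data = [line for line in lines if not is_header(line)]
print(data)
```

Because the check is applied to every line, it removes the header of each file, not just the first one.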

From Spark 2.0 onwards, you can use SparkSession and let the CSV reader handle the header for you:

import org.apache.spark.sql.SparkSession

// conf is assumed to be an existing SparkConf
val spark = SparkSession.builder.config(conf).getOrCreate()

val dataFrame = spark.read.format("csv").option("header", "true").load(csvfilePath)

I hope this answers your question!

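Another approach, in Python (PySpark), drops the first line of the first partition. Note that this removes only the first file's header; when several files are read, the headers of the other files stay in the RDD, so the filter above or the CSV reader is the better fit in that case:

from itertools import islice

# Partition 0 starts with the first file's header, so skip its first line.
data = rdd.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it
)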
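To see what the `idx == 0` check does when several files are read, here is a minimal plain-Python model. It is not Spark: `map_partitions_with_index` is a hypothetical stand-in that only imitates the shape of `mapPartitionsWithIndex`, and the sketch assumes each file ends up in its own partition, with made-up sample rows.

```python
from itertools import islice


def map_partitions_with_index(partitions, func):
    """Plain-Python stand-in (not the Spark API): apply func(index, iterator)
    to each partition and collect the results as lists."""
    return [list(func(idx, iter(part))) for idx, part in enumerate(partitions)]


def drop_first_line_of_first_partition(idx, it):
    # Same logic as the lambda: skip one line, but only in partition 0.
    return islice(it, 1, None) if idx == 0 else it


# Assumption for this sketch: one partition per input file.
partitions = [
    ["id,name", "1,alice"],  # file1
    ["id,name", "2,bob"],    # file2
    ["id,name", "3,carol"],  # file3
]

result = map_partitions_with_index(partitions, drop_first_line_of_first_partition)
print(result)
```

In this model only the first partition loses its header, which is why the index-based approach suits a single file better than a list of files.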
