In Scala with Spark, the first row alone is skipped as a header using .option("header", true). To skip more than one row, for example the first three, you might use something like .option("skipRows", 3) with the Spark DataFrame API. However, skipRows may not be available in all versions of Spark.
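If your Spark build does support such an option, the read might look roughly like the sketch below. Note that skipRows is carried over from the idea above as an assumption; it is not a documented CSV option in most Spark releases, so check your version before relying on it.

val df = spark.read
  .option("skipRows", 3)   // hypothetical option: may not exist in your Spark version
  .option("header", true)  // after skipping, treat the next row as the header
  .csv(path)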
A more portable alternative is to load the entire CSV and then filter out the first three rows after loading:
val df = spark.read                  // spark.sqlContext.read is deprecated; use spark.read
  .schema(Myschema)
  .option("header", true)            // consumes the first row as the header
  .option("delimiter", "|")
  .csv(path)                         // already returns a DataFrame, so no .toDF() needed

// Filter out the first 3 data rows (the header was already consumed above).
// RDD[Row] has no .toDF, so rebuild the DataFrame with the original schema.
val filteredDF = spark.createDataFrame(
  df.rdd.zipWithIndex().filter(_._2 >= 3).map(_._1),
  df.schema
)
Here’s how this works:
.zipWithIndex() pairs each row with a unique, ordered Long index, starting at 0.
.filter(_._2 >= 3) excludes rows with an index less than 3, i.e. the first three rows.
.map(_._1) drops the index and keeps just the rows, which spark.createDataFrame then turns back into a DataFrame using the original schema.
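To make the chain concrete, here is a comment-only sketch of the intermediate type at each step:

// df.rdd                 : RDD[Row]          -- the DataFrame's rows as an RDD
// .zipWithIndex()        : RDD[(Row, Long)]  -- each row paired with its 0-based position
// .filter(_._2 >= 3)     : RDD[(Row, Long)]  -- only pairs with index 3 or greater remain
// .map(_._1)             : RDD[Row]          -- indices dropped, rows restored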
This approach should skip the first three data rows. It relies on the underlying RDD preserving file order, which holds for a freshly read CSV but is not guaranteed after shuffles or repartitioning.
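For completeness, below is a minimal, self-contained sketch of the whole technique. The schema, column names, and file path are invented for illustration; substitute your own.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

val spark = SparkSession.builder()
  .appName("SkipFirstRowsExample")
  .master("local[*]")              // local mode, for illustration only
  .getOrCreate()

// Hypothetical schema for a pipe-delimited file with two columns
val mySchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)
))

val df = spark.read
  .schema(mySchema)
  .option("header", true)
  .option("delimiter", "|")
  .csv("/path/to/data.csv")        // placeholder path

val filteredDF = spark.createDataFrame(
  df.rdd.zipWithIndex().filter(_._2 >= 3).map(_._1),
  df.schema
)

// Sanity check: exactly three data rows should have been dropped
assert(filteredDF.count() == df.count() - 3)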