+4 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

I am loading my CSV file into a DataFrame and that works, but I need to skip the first three lines of the file.

I tried the .option() command, setting header to true, but it only ignores the first line.

val df = spark.sqlContext.read
    .schema(Myschema)
    .option("header",true)
    .option("delimiter", "|")
    .csv(path)
I thought of specifying the header as 3 lines, but I couldn't find a way to do that.

Alternative thought: skip those 3 lines from the DataFrame after loading.

4 Answers

+3 votes
by (32.3k points)
edited by

Try making changes with respect to your schema. This is the safer approach:

import org.apache.spark.sql.Row

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Read the CSV file as a plain text RDD
val fi = sc.textFile("csvfilelocation")

// Remove the first 3 lines (they sit in the first partition, index 0)
val da = fi.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(3) else iter }

// Create a RowRDD by splitting each line on the delimiter and mapping it to the required fields
val rowRdd = da.map(_.split("\\|")).map(x => Row(x(0), x(1)))

// Create the DataFrame by calling sqlContext.createDataFrame with rowRdd and your schema
val df = sqlContext.createDataFrame(rowRdd, schema)
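The schema referenced above is not shown in the answer; as a minimal sketch, assuming two string columns (the column names below are placeholders, not from the original question), it could be defined like this:

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Hypothetical two-column schema; replace the names and types with your own
val schema = StructType(Seq(
  StructField("Col1", StringType, nullable = true),
  StructField("Col2", StringType, nullable = true)
))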


 

Another approach:

You can add an index using monotonically_increasing_id:

import org.apache.spark.sql.functions.monotonically_increasing_id

df.withColumn("Index", monotonically_increasing_id())
  .filter('Index > 2)
  .drop("Index")

But this is not a safe approach: it only works if the first 3 rows are in the first partition, and the code may break with future versions of Spark.


+4 votes
by (108k points)

A generic way to handle your problem is to index the DataFrame and keep only the rows whose index is greater than 2.

Straightforward approach:

As suggested in another answer, you may try adding an index with monotonically_increasing_id.

df.withColumn("Index",monotonically_increasing_id)

  .filter('Index > 2)

  .drop("Index")

Yet, that is only going to work if the first 3 rows are in the first partition. Moreover, as mentioned in the comments, this happens to be the case today, but the code may break completely with future versions of Spark, and that would be very hard to debug. Indeed, the contract in the API is just "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive". It is therefore not very safe to assume that the IDs will always start from zero. There may even be other cases in the current version where this does not work (I'm not sure, though).

To illustrate my first concern, have a look at this:

scala> spark.range(4).withColumn("Index",monotonically_increasing_id()).show()

+---+----------+

| id|     Index|

+---+----------+

|  0|         0|

|  1|         1|

|  2|8589934592|

|  3|8589934593|

+---+----------+

With this data, filtering on Index > 2 would only remove two rows instead of the intended three.
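If you still want this approach, one workaround (an assumption about the current implementation, and it forces all data onto a single partition, which does not scale) is to coalesce the DataFrame to one partition first so that the generated ids are consecutive and start at 0:

import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

// Sketch only: coalesce(1) puts everything in partition 0, where the generated ids are 0, 1, 2, ...
val cleaned = df.coalesce(1)
  .withColumn("Index", monotonically_increasing_id())
  .filter(col("Index") > 2)
  .drop("Index")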

by (62.9k points)
It worked for me!
by (33.1k points)
Well explained answer.
Thanks!
by (19.9k points)
Thank you for this great explanation.
by (32.1k points)
I had the same problem. This solution worked for me!
Thanks
by (29.5k points)
great solution this works!!!
+2 votes
by (44.4k points)

zipWithIndex from the RDD API can be used to get consecutive indices.

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{StructField, LongType}

def zipWithIndex(df: DataFrame, name: String): DataFrame = {
  // Append each row's consecutive index as a new Long column
  val rdd = df.rdd.zipWithIndex
    .map { case (row, i) => Row.fromSeq(row.toSeq :+ i) }
  val newSchema = df.schema
    .add(StructField(name, LongType, false))
  df.sparkSession.createDataFrame(rdd, newSchema)
}

zipWithIndex(df, "index").where('index > 2).drop("index")

Also, this is safer than other methods and you can check it like this:

scala> zipWithIndex(spark.range(4).toDF("id"), "index").show()

+---+-----+

| id|index|

+---+-----+

|  0|   0|

|  1|   1|

|  2|   2|

|  3|   3|

+---+-----+
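Applied back to the original question, a minimal sketch of wiring this together could look like the following; it assumes the three unwanted lines either parse under Myschema or end up as rows you are happy to drop by index:

import org.apache.spark.sql.functions.col

// Read the CSV without treating any line as a header, then drop the first 3 rows by index
val raw = spark.read
  .schema(Myschema)
  .option("delimiter", "|")
  .csv(path)

val cleaned = zipWithIndex(raw, "index").where(col("index") > 2).drop("index")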

by (40.7k points)
Thanks for explaining it well.
by (29.3k points)
well explained!
by (47.2k points)
Thank You kodee, it worked for me
0 votes
by (106k points)

You can use the code below while loading a CSV file in PySpark; rather than dropping a fixed number of rows, it parses each partition with Python's csv module and filters out the unwanted header lines by content:

import csv
from pyspark.sql.types import StructType, StructField, StringType

customSchema = StructType([
    StructField("Col1", StringType(), True),
    StructField("Col2", StringType(), True)])

df = sc.textFile("file.csv") \
    .mapPartitions(lambda partition: csv.reader(
        [line.replace('\0', '') for line in partition],
        delimiter=',', quotechar='"')) \
    .filter(lambda line: len(line) > 2 and line[0] != 'Col1') \
    .toDF(customSchema)
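For the Scala reader, the same idea of filtering unwanted lines by their content rather than by their position could look roughly like the sketch below; the column name and header literal "Col1" are assumptions carried over from the snippet above:

import org.apache.spark.sql.functions.col

// Sketch: drop any row whose first column still holds the header text.
// Assumes the unwanted lines are identifiable by content rather than position.
val cleaned = df.filter(col("Col1") =!= "Col1")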
