in Big Data Hadoop & Spark by (11.5k points)

Is it possible, and what would be the most efficient and neat method, to add a column to a DataFrame?

More specifically, the column may serve as row IDs for the existing DataFrame.

In a simplified case, reading from a file without tokenizing it, I can think of something like the following (in Scala), but it fails with errors (at line 3), and in any case it doesn't look like the best route possible:

var dataDF = sc.textFile("path/file").toDF()
val rowDF = sc.parallelize(1 to dataDF.count().toInt).toDF("ID")
dataDF = dataDF.withColumn("ID", rowDF("ID"))

1 Answer

by (31.4k points)

The original task was to append a column of row identifiers (essentially a sequence 1 to numRows) to any given data frame, so that row order/presence can be tracked (e.g. when you sample). This can be achieved along these lines:

import org.apache.spark.sql.Row

sc.textFile(file).
  zipWithIndex().
  map { case (d, i) => i.toString + delimiter + d }.
  map(_.split(delimiter)).
  map(s => Row.fromSeq(s.toSeq))
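A complete, runnable sketch of the same idea, applied directly to a DataFrame's underlying RDD rather than to a text file. The local SparkSession and the sample data here are assumptions for illustration only:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField}

// Assumed local session and toy data, purely for illustration.
val spark = SparkSession.builder().master("local[*]").appName("row-ids").getOrCreate()
import spark.implicits._

val dataDF = Seq("a", "b", "c").toDF("value")

// zipWithIndex assigns consecutive 0-based indices; append the index
// to each Row, then rebuild the DataFrame with an extended schema.
val withIdRdd = dataDF.rdd.zipWithIndex().map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val withIdDF = spark.createDataFrame(
  withIdRdd,
  dataDF.schema.add(StructField("ID", LongType, nullable = false))
)
withIdDF.show()
```

Because zipWithIndex works per partition without a shuffle, this scales to large DataFrames while still producing consecutive IDs.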

The closest approaches to this functionality in the Spark API are withColumn, withColumnRenamed and join.

You can also use row_number with a Window function, as below, to get a distinct ID for each row in a DataFrame. Alternatively, you can use monotonically_increasing_id, which produces unique (though not necessarily consecutive) IDs.

  • df.withColumn("ID", row_number().over(Window.orderBy("any column name in the dataframe")))

  • df.withColumn("ID", monotonically_increasing_id())
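A runnable sketch of both options, showing the imports each function needs; the sample DataFrame and local SparkSession are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Assumed local session and toy data, purely for illustration.
val spark = SparkSession.builder().master("local[*]").appName("ids").getOrCreate()
import spark.implicits._

val df = Seq("x", "y", "z").toDF("name")

// row_number gives consecutive 1-based IDs, but a Window with no
// partitionBy clause moves all rows into a single partition.
val withRowNum = df.withColumn("ID", row_number().over(Window.orderBy("name")))

// monotonically_increasing_id stays fully distributed, but the IDs are
// only guaranteed unique and increasing, not consecutive.
val withMonoId = df.withColumn("ID", monotonically_increasing_id())

withRowNum.show()
withMonoId.show()
```

The trade-off: row_number is the choice when IDs must be 1, 2, 3, ... in a defined order; monotonically_increasing_id is the choice when you only need uniqueness and want to avoid collecting data onto one partition.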


