in Big Data Hadoop & Spark by (11.5k points)

Is it possible, and what would be the most efficient and neat method, to add a column to a DataFrame?

More specifically, the column may serve as row IDs for the existing DataFrame.

In a simplified case, reading from a file without tokenizing it, I can think of something like the following (in Scala), but it fails with errors (at line 3), and in any case it doesn't look like the best route possible:

var dataDF = sc.textFile("path/file").toDF()
val rowDF = sc.parallelize(1 to dataDF.count().toInt).toDF("ID")
dataDF = dataDF.withColumn("ID", rowDF("ID"))

1 Answer

by (31.4k points)

The original task was to append a column of row identifiers (essentially a sequence 1 to numRows) to any given data frame, so that row order/presence can be tracked (e.g. when you sample). This can be achieved along these lines:

import org.apache.spark.sql.Row

sc.textFile(file).
  zipWithIndex().
  map { case (d, i) => i.toString + delimiter + d }.
  map(_.split(delimiter)).
  map(s => Row.fromSeq(s.toSeq))
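A complete, runnable sketch of the same idea, applied directly to a DataFrame's underlying RDD rather than to a text file. The local SparkSession and the sample data here are assumptions for illustration only:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField}

// Assumed local session and toy data, purely for illustration.
val spark = SparkSession.builder().master("local[*]").appName("row-ids").getOrCreate()
import spark.implicits._

val dataDF = Seq("a", "b", "c").toDF("value")

// zipWithIndex assigns consecutive 0-based indices; append the index
// to each Row, then rebuild the DataFrame with an extended schema.
val withIdRdd = dataDF.rdd.zipWithIndex().map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val withIdDF = spark.createDataFrame(
  withIdRdd,
  dataDF.schema.add(StructField("ID", LongType, nullable = false))
)
withIdDF.show()
```

Because zipWithIndex works per partition without a shuffle, this scales to large DataFrames while still producing consecutive IDs.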

The closest approaches to this functionality in the Spark API are withColumn, withColumnRenamed and join.

You can also use row_number with a Window function, as below, to get a distinct ID for each row in a DataFrame. Alternatively, you can use monotonically_increasing_id, which produces unique (though not necessarily consecutive) IDs.

  • df.withColumn("ID", row_number().over(Window.orderBy("any column name in the dataframe")))

  • df.withColumn("ID", monotonically_increasing_id())
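A runnable sketch of both options, showing the imports each function needs; the sample DataFrame and local SparkSession are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Assumed local session and toy data, purely for illustration.
val spark = SparkSession.builder().master("local[*]").appName("ids").getOrCreate()
import spark.implicits._

val df = Seq("x", "y", "z").toDF("name")

// row_number gives consecutive 1-based IDs, but a Window with no
// partitionBy clause moves all rows into a single partition.
val withRowNum = df.withColumn("ID", row_number().over(Window.orderBy("name")))

// monotonically_increasing_id stays fully distributed, but the IDs are
// only guaranteed unique and increasing, not consecutive.
val withMonoId = df.withColumn("ID", monotonically_increasing_id())

withRowNum.show()
withMonoId.show()
```

The trade-off: row_number is the choice when IDs must be 1, 2, 3, ... in a defined order; monotonically_increasing_id is the choice when you only need uniqueness and want to avoid collecting data onto one partition.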


