
I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column.

I've tried the following without any success:

type(randomed_hours)  # => list

# Create in Python and transform to RDD
new_col = pd.DataFrame(randomed_hours, columns=['new_col'])
spark_new_col = sqlContext.createDataFrame(new_col)
my_df_spark.withColumn("hours", spark_new_col["new_col"])


Also got an error using this:

my_df_spark.withColumn("hours",  sc.parallelize(randomed_hours))


So how do I add a new column (based on a Python vector) to an existing DataFrame with PySpark?

1 Answer


To add a column using a UDF:

df = sqlContext.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def valueToCategory(value):
    if value == 1:
        return 'cat1'
    elif value == 2:
        return 'cat2'
    # ... add more categories as needed
    else:
        return 'n/a'

# NOTE: udf() must be called after the SparkContext has been created
udfValueToCategory = udf(valueToCategory, StringType())
df_with_cat = df.withColumn("category", udfValueToCategory("x1"))
df_with_cat.show()

## +---+---+-----+--------+
## | x1| x2|   x3|category|
## +---+---+-----+--------+
## |  1|  a| 23.0|    cat1|
## |  3|  B|-23.0|     n/a|
## +---+---+-----+--------+

Another way to add a new column is with a constant value, using lit():

from pyspark.sql.functions import lit

df = sqlContext.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))
df_with_x4 = df.withColumn("x4", lit(0))
df_with_x4.show()

## +---+---+-----+---+
## | x1| x2|   x3| x4|
## +---+---+-----+---+
## |  1|  a| 23.0|  0|
## |  3|  B|-23.0|  0|
## +---+---+-----+---+
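Neither snippet above attaches the values of an existing Python list to the DataFrame, which is what the question (the randomed_hours case) asks for. A minimal sketch, assuming the list has exactly one value per row and its order is meant to follow the DataFrame's current row order, is to index both sides with zipWithIndex and join on the index; the helper name add_column_from_list is hypothetical:

```python
def add_column_from_list(sqlContext, df, values, col_name):
    # Pair each row and each list element with its positional index.
    indexed_rows = df.rdd.zipWithIndex().map(lambda ri: (ri[1], ri[0]))
    indexed_vals = (df.rdd.context.parallelize(values)
                    .zipWithIndex().map(lambda vi: (vi[1], vi[0])))
    # Join on the index, restore the original order, append the value.
    combined = (indexed_rows.join(indexed_vals).sortByKey()
                .map(lambda kv: tuple(kv[1][0]) + (kv[1][1],)))
    return sqlContext.createDataFrame(combined, df.columns + [col_name])

# Hypothetical usage with the question's variables:
# my_df_spark = add_column_from_list(sqlContext, my_df_spark,
#                                    randomed_hours, "hours")
```

This avoids the failed attempts in the question, which pass a column from a different DataFrame (or a raw RDD) to withColumn; withColumn only accepts a Column derived from the same DataFrame or a literal.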

