
I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns.

Suppose my dataframe had columns "a", "b", and "c". I know I can do this:

df.withColumn('total_col', df.a + df.b + df.c)


The problem is that I don't want to type out each column individually and add them, especially if I have a lot of columns. I want to be able to do this automatically or by specifying a list of column names that I want to add. Is there another way to do this?

1 Answer


You can build the sum expression programmatically with reduce over df.columns. Note that in Python 3, reduce must be imported from functools:

$ pyspark

>>> from functools import reduce  # not needed in Python 2, where reduce is a builtin
>>> df = sc.parallelize([{'a': 1, 'b': 2, 'c': 3}, {'a': 8, 'b': 5, 'c': 6}, {'a': 3, 'b': 1, 'c': 0}]).toDF().cache()
>>> df
DataFrame[a: bigint, b: bigint, c: bigint]
>>> df.columns
['a', 'b', 'c']
>>> def column_add(a, b):
...     return a + b  # Column overloads +, so there is no need to call __add__ directly
...
>>> df.withColumn('total', reduce(column_add, (df[col] for col in df.columns))).collect()
[Row(a=1, b=2, c=3, total=6), Row(a=8, b=5, c=6, total=19), Row(a=3, b=1, c=0, total=4)]



