
I found that PySpark has a method called drop, but it seems it can only drop one column at a time. Any ideas on how to drop multiple columns at the same time?

df.drop(['col1','col2'])

TypeError                                 Traceback (most recent call last)
<ipython-input-96-653b0465e457> in <module>()
----> 1 selectedMachineView = machineView.drop([['GpuName','GPU1_TwoPartHwID']])

/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.pyc in drop(self, col)
   1257             jdf = self._jdf.drop(col._jc)
   1258         else:
-> 1259             raise TypeError("col should be a string or a Column")
   1260         return DataFrame(jdf, self.sql_ctx)
   1261

TypeError: col should be a string or a Column

1 Answer


In your case, you can simply use select to resolve the problem:

df.select([c for c in df.columns if c not in {'GpuName', 'GPU1_TwoPartHwID'}])
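The core of this approach is the list comprehension that filters the column names. A minimal pure-Python sketch of that filtering step, using hypothetical column names and no Spark dependency:

```python
# Columns of a hypothetical DataFrame and the set of names to remove.
columns = ['GpuName', 'GPU1_TwoPartHwID', 'Machine', 'Cores']
to_drop = {'GpuName', 'GPU1_TwoPartHwID'}

# Keep every column whose name is not in the drop set; this is the
# same list that gets passed to df.select(...) above.
kept = [c for c in columns if c not in to_drop]
# kept == ['Machine', 'Cores']
```

Using a set for the names to drop keeps each membership test O(1), which matters only for very wide DataFrames but costs nothing here.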

or, if you want to stick with drop, then reduce should do the trick:

from functools import reduce
from pyspark.sql import DataFrame

reduce(DataFrame.drop, ['GpuName', 'GPU1_TwoPartHwID'], df)
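Here reduce simply chains one drop call per column name: reduce(f, ['a', 'b'], df) evaluates to f(f(df, 'a'), 'b'). A runnable sketch of that chaining, mimicking DataFrame.drop with a plain dict so it runs without a Spark cluster (the column names are the hypothetical ones from the question):

```python
from functools import reduce

def drop(d, col):
    # Stand-in for DataFrame.drop: return a copy without the given column.
    return {k: v for k, v in d.items() if k != col}

row = {'GpuName': 'X', 'GPU1_TwoPartHwID': 'Y', 'Machine': 'Z'}

# Equivalent to drop(drop(row, 'GpuName'), 'GPU1_TwoPartHwID').
result = reduce(drop, ['GpuName', 'GPU1_TwoPartHwID'], row)
# result == {'Machine': 'Z'}
```

Note that, unlike this dict stand-in, each real DataFrame.drop call only builds a new logical plan, so chaining drops this way is cheap in Spark.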


Note: Spark 2.x+ supports passing multiple columns to drop directly, e.g. df.drop('GpuName', 'GPU1_TwoPartHwID').


