in Big Data Hadoop & Spark by (11.4k points)

I would like to transform a DataFrame that contains lists of words into a DataFrame with each word in its own row.

How do I apply explode to a column in a DataFrame?

Here is an example with some of my attempts. Uncomment each code line in turn to get the error listed in the comment that follows it. I use PySpark in Python 2.7 with Spark 1.6.1.

from pyspark.sql.functions import split, explode
DF = sqlContext.createDataFrame([('cat \n\n elephant rat \n rat cat', )], ['word'])
print 'Dataset:'
DF.show()
print '\n\n Trying to do explode: \n'
DFsplit_explode = (
 DF
 .select(split(DF['word'], ' '))
#  .select(explode(DF['word']))  # AnalysisException: u"cannot resolve 'explode(word)' due to data type mismatch: input to function explode should be array or map type, not StringType;"
#   .map(explode)  # AttributeError: 'PipelinedRDD' object has no attribute 'show'
#   .explode()  # AttributeError: 'DataFrame' object has no attribute 'explode'
).show()

# Trying without split
print '\n\n Only explode: \n'

DFsplit_explode = (
 DF
 .select(explode(DF['word']))  # AnalysisException: u"cannot resolve 'explode(word)' due to data type mismatch: input to function explode should be array or map type, not StringType;"
).show()


Please advise.

1 Answer

by (32.3k points)

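The explode function takes an array or a map column as input and outputs each element of the array (or each key-value pair of the map) as a separate row.

Note that explode and split are both SQL functions, and both operate on a SQL Column. That is why they are passed to select rather than called as DataFrame methods, and why explode needs the array produced by split rather than the original string column.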
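As a rough illustration of those semantics, the plain-Python sketch below mimics what explode does to an array column, without needing a Spark session. The helper name explode_rows and the dict-based row representation are assumptions made for this example; they are not part of the PySpark API.

```python
def explode_rows(rows, column):
    """Yield one output row per element of the list stored in `column`.

    Hypothetical helper that only mimics the row-per-element behaviour of
    Spark's explode; rows are plain dicts here, not Spark Row objects.
    """
    for row in rows:
        for element in row[column]:
            new_row = dict(row)          # copy the other columns unchanged
            new_row[column] = element    # replace the list with one element
            yield new_row


rows = [{"id": 1, "word": ["cat", "elephant"]}, {"id": 2, "word": ["rat"]}]
exploded = list(explode_rows(rows, "word"))
for row in exploded:
    print(row)
```

The other columns are repeated for every element, and a row whose list is empty produces no output rows in this sketch.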

To split the data on arbitrary whitespace, you'll need something like this:

from pyspark.sql.functions import col, explode, split

df = sqlContext.createDataFrame(
    [('cat \n\n elephant rat \n rat cat', )], ['word']
)

# Split on runs of whitespace, then emit one row per array element
df.select(explode(split(col("word"), r"\s+")).alias("word")).show()

## +--------+
## |    word|
## +--------+
## |     cat|
## |elephant|
## |     rat|
## |     rat|
## |     cat|
## +--------+
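
The pattern is what separates this from the split(DF['word'], ' ') attempt in the question. The sketch below uses Python's re.split as a stand-in for Spark's split, which is an assumption made for illustration: Spark evaluates the pattern as a Java regular expression, and the two can differ in edge cases, though these two simple patterns should behave alike on this input.

```python
import re

text = 'cat \n\n elephant rat \n rat cat'

# Splitting on a single space leaves the newline runs behind as tokens.
by_space = re.split(' ', text)

# Splitting on runs of any whitespace yields only the words.
by_whitespace = re.split(r'\s+', text)

print(by_space)
print(by_whitespace)
```

With a single-space pattern, exploding the result would produce rows containing only newlines, which is why the answer splits on \s+ instead.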
