
I'm trying to dynamically build a row in PySpark 1.6.1, then turn it into a DataFrame. The general idea is to extend the results of describe() to include, for example, skew and kurtosis. Here's what I thought should work:

from pyspark.sql import Row

row_dict = {'C0': -1.1990072635132698,
            'C3': 0.12605772684660232,
            'C4': 0.5760856026559944,
            'C5': 0.1951877800894315,
            'C6': 24.72378589441825,
            'summary': 'kurtosis'}

new_row = Row(row_dict)

But this raises "TypeError: sequence item 0: expected string, dict found", which is a fairly clear error. Then I found that if I defined the Row fields first, I could use a dict:

r = Row('summary', 'C0', 'C3', 'C4', 'C5', 'C6')
r(row_dict)
> Row(summary={'summary': 'kurtosis', 'C3': 0.12605772684660232, 'C0': -1.1990072635132698, 'C6': 24.72378589441825, 'C5': 0.1951877800894315, 'C4': 0.5760856026559944})

That would be a fine step, except it doesn't seem like I can dynamically specify the fields in Row. I need this to work for an unknown number of fields with unknown names. According to the documentation you can actually go the other way:

>>> Row(name="Alice", age=11).asDict() == {'name': 'Alice', 'age': 11}

So it seems like I should be able to do this. It also appears there may be some deprecated features from older versions that allowed this, for example here. Is there a more current equivalent I'm missing?

1 Answer

You can use keyword argument unpacking as follows:

Row(**row_dict)

## Row(C0=-1.1990072635132698, C3=0.12605772684660232, C4=0.5760856026559944,
##     C5=0.1951877800894315, C6=24.72378589441825, summary='kurtosis')

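Note that when a Row is built from keyword arguments it sorts the fields by name internally, which is why the output above is in alphabetical order rather than the order of the dict. This was done because older Python versions do not preserve the order of keyword arguments. (Spark 3.0 and later no longer sort the fields.)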
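If you have several such dicts and want them all to share one column order (for example summary first, the way describe() lays it out), a different route from the keyword unpacking above is to create a Row "factory" from the field names and call it with positional values. The following is only a sketch under some assumptions: field_order and dicts_to_rows are hypothetical helper names, the second dict holds placeholder values, and sqlContext is assumed to be the one provided by the 1.6 shell.

```python
def field_order(dicts, first=("summary",)):
    """Collect every key seen in the dicts; `first` keys lead, the rest are sorted."""
    seen = set()
    for d in dicts:
        seen.update(d)  # adds the dict's keys
    lead = [k for k in first if k in seen]
    return lead + sorted(seen - set(lead))


def dicts_to_rows(dicts, fields):
    """Build one Row per dict, all sharing the same field order."""
    # Imported here so field_order can be tried without Spark installed.
    from pyspark.sql import Row

    make_row = Row(*fields)  # a Row "class" that carries only the field names
    # Keys missing from a dict become None in that row.
    return [make_row(*[d.get(f) for f in fields]) for d in dicts]


stats = [
    {'C0': -1.1990072635132698, 'C3': 0.12605772684660232,
     'C4': 0.5760856026559944, 'C5': 0.1951877800894315,
     'C6': 24.72378589441825, 'summary': 'kurtosis'},
    # Placeholder values, for illustration only.
    {'C0': 0.0, 'C3': 0.0, 'C4': 0.0, 'C5': 0.0, 'C6': 0.0,
     'summary': 'skewness'},
]

fields = field_order(stats)

# With a running Spark context (assumed, not created here):
# rows = dicts_to_rows(stats, fields)
# df = sqlContext.createDataFrame(rows)
```

Because the values are passed positionally, the column order comes from fields rather than from any sorting inside Row, so every row lines up even when the dicts were built in different key orders.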
