How to loop through each row of a DataFrame in PySpark

Question

1 Answer

Amit Rawat · Answer 1 · 2019-07-15T14:21:23+0000

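Using a list comprehension in Python, you can collect an entire column of values into a list in just two lines. Keep in mind that collect() brings every row to the driver, so this is best suited to results that fit in memory: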

df = sqlContext.sql("show tables in default")
tableList = [x["tableName"] for x in df.rdd.collect()]

The example above returns a list of the tables in the database 'default', but the same pattern works for any query: replace the query passed to sql() and the column name used in the comprehension.

Or, more compactly, in a single line:

tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]
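If only one column is needed, it may be worth selecting it before collecting, so that only that column is sent to the driver. A minimal sketch, assuming df is a pyspark.sql.DataFrame; the helper name column_values is made up for illustration and is not part of PySpark:

```python
def column_values(df, column):
    """Return every value of one column as a plain Python list.

    `df` is assumed to be a pyspark.sql.DataFrame. select() narrows the
    data to the requested column before collect() pulls the rows to the
    driver, and each returned Row is indexed by column name.
    """
    return [row[column] for row in df.select(column).collect()]
```

With the query from the answer, this could be called as column_values(sqlContext.sql("show tables in default"), "tableName").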

And for your example of three columns, we can create a list of dictionaries, and then iterate through them in a for loop.

sql_text = "select name, age, city from user"
# Dictionary keys must be quoted strings; bare names would be looked up as variables.
tupleList = [{"name": x["name"], "age": x["age"], "city": x["city"]}
             for x in sqlContext.sql(sql_text).rdd.collect()]
for row in tupleList:
    print("{} is a {} year old from {}".format(
        row["name"],
        row["age"],
        row["city"]))
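For larger results, one possible variation is to stream the rows instead of collecting them all at once. The sketch below assumes df is a pyspark.sql.DataFrame with the columns name, age and city; it uses DataFrame.toLocalIterator() and Row.asDict(), while the helper names iter_row_dicts and describe_users are hypothetical:

```python
def iter_row_dicts(df):
    """Yield each row of `df` as a plain dict, one row at a time.

    `df` is assumed to be a pyspark.sql.DataFrame. toLocalIterator()
    fetches rows to the driver partition by partition rather than
    materializing the whole result the way collect() does, and
    Row.asDict() converts each Row into a dict keyed by column name.
    """
    for row in df.toLocalIterator():
        yield row.asDict()


def describe_users(df):
    """Build one sentence per row; assumes columns name, age and city."""
    return ["{name} is a {age} year old from {city}".format(**d)
            for d in iter_row_dicts(df)]
```

It could then be used as: for line in describe_users(sqlContext.sql(sql_text)): print(line). The loop still runs on the driver, so this only reduces peak memory; it does not distribute the work.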


Note on the last block of code: the dictionary keys need to be quoted strings. Written bare (for example name:x["name"]), Python looks up name, age and city as variables and raises a NameError. With quoted keys the block reads:

sql_text = "select name, age, city from user"
tupleList = [{"name": x["name"], "age": x["age"], "city": x["city"]}
             for x in sqlContext.sql(sql_text).rdd.collect()]

for row in tupleList:
    print("{} is a {} year old from {}".format(
        row["name"],
        row["age"],
        row["city"]))

anonymous, Jun 24, 2021
