The RDD.toLocalIterator method is an efficient way to do the job. It uses runJob to evaluate only a single partition at each step.
toLocalIterator collects the data from an RDD scattered across your cluster onto a single node, the one running the driver program, so that you can process all of it in one place. It is similar to the collect method, but instead of returning a list it returns an iterator, so the driver only needs enough memory for the largest partition rather than for the whole dataset.
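To make the difference from collect concrete, here is a minimal plain-Python sketch of the idea. It does not use Spark at all: the partitions list, run_job, collect_all and to_local_iterator are hypothetical stand-ins chosen for illustration, not Spark's actual implementation.

```python
from typing import Iterator, List

# Hypothetical stand-in for an RDD: a list of partitions, each a list of rows.
partitions: List[List[int]] = [[1, 2], [3, 4], [5]]

# Records which partitions have been computed so far.
evaluated: List[int] = []


def run_job(index: int) -> List[int]:
    """Pretend to compute one partition and ship its rows to the driver."""
    evaluated.append(index)
    return partitions[index]


def collect_all() -> List[int]:
    """Like collect: compute every partition up front and return one list."""
    rows: List[int] = []
    for i in range(len(partitions)):
        rows.extend(run_job(i))
    return rows


def to_local_iterator() -> Iterator[int]:
    """Like toLocalIterator: compute a partition only when its rows are needed."""
    for i in range(len(partitions)):
        yield from run_job(i)


it = to_local_iterator()
first = next(it)                         # evaluates partition 0 only
evaluated_after_first = list(evaluated)  # snapshot: [0]
remaining = list(it)                     # pulls the other partitions one by one
```

After the first call to next, only partition 0 has been computed; the rest are evaluated lazily as the iterator is consumed, whereas collect_all touches every partition before returning anything.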
PySpark “toLocalIterator” Example
# Create DataFrame
sample_df = sqlContext.sql("select * from sample_tab1")
# Create iterator
iter_var = sample_df.rdd.toLocalIterator()
You can use the built-in ‘next’ function to get data out of the PySpark iterator. Each call to ‘next’ returns a single Row object.
>>> next(iter_var)
Row(id=1, name=u'AAA')
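You can access an individual value by qualifying the Row object with a column name. Attribute access (row.id) and key access (row['id']) both return the value for the given column.
Note that every call to ‘next’ advances the iterator, so the call below reads the id of the second row.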
>>> next(iter_var).id
2
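In practice you will usually want to walk through the whole iterator rather than call ‘next’ by hand. The sketch below shows that pattern in plain Python so it runs without a cluster. It makes two assumptions: a namedtuple stands in for pyspark.sql.Row, and a small hand-written list stands in for the result of sample_df.rdd.toLocalIterator(). The names 'BBB' and 'CCC' and the third row are made up for illustration.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row so the sketch runs without Spark (an assumption).
Row = namedtuple("Row", ["id", "name"])

# Stand-in for sample_df.rdd.toLocalIterator(); rows beyond the first are made up.
iter_var = iter([Row(1, "AAA"), Row(2, "BBB"), Row(3, "CCC")])

first = next(iter_var)         # consumes row 1
second_id = next(iter_var).id  # consumes row 2 and keeps only its id

# A for loop (or comprehension) continues from where next() left off
# and stops cleanly when the iterator is exhausted.
remaining_names = [row.name for row in iter_var]

# On an exhausted iterator, next() raises StopIteration unless a default is given.
after_end = next(iter_var, None)
```

Because the iterator is consumed as you go, rows already read with ‘next’ are not seen again by the loop; create a fresh iterator if you need to start over.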