
I have a large RDD (1 GB) on a YARN cluster. The local machine that uses this cluster has only 512 MB of memory. I'd like to iterate over the values in the RDD on my local machine. I can't use collect(), because it would create an array locally that is larger than my heap. I need some iterative approach. There is an iterator() method, but it requires additional information that I can't provide.

1 Answer


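The RDD.toLocalIterator method is an efficient way to do the job. It uses runJob to evaluate only a single partition at each step, so the local machine needs enough memory for the largest partition rather than for the whole RDD.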

toLocalIterator brings the data of an RDD, which is scattered across your cluster, to a single node (the one from which the program is running) so that you can process all of it there. It is similar to the collect method, but instead of returning a list it returns an iterator.
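To make the "one partition per step" behaviour concrete, here is a minimal pure-Python sketch of the idea. It does not use Spark and is not Spark's actual implementation; local_iterator, fetch and the partitions list are hypothetical names invented for illustration, with fetch standing in for the job that retrieves one partition from the cluster.

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


def local_iterator(num_partitions: int,
                   fetch_partition: Callable[[int], List[T]]) -> Iterator[T]:
    """Yield elements partition by partition, fetching each one lazily."""
    for index in range(num_partitions):
        # Only this one partition is held in local memory at a time.
        for item in fetch_partition(index):
            yield item


# Hypothetical stand-in for a cluster: three "partitions" of one dataset.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
fetched = []  # records which partitions were actually requested


def fetch(index: int) -> List[int]:
    fetched.append(index)
    return partitions[index]


it = local_iterator(len(partitions), fetch)
first = next(it)   # triggers the fetch of partition 0 only
print(first, fetched)
```

In this sketch, asking for the first element fetches only the first partition; the later partitions are requested only once the loop reaches them, which is why memory use stays bounded by the largest partition.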

PySpark “toLocalIterator” example

# Create DataFrame

sample_df = sqlContext.sql("select * from sample_tab1")

# Create iterator

iter_var = sample_df.rdd.toLocalIterator()

You can use ‘next’ to get data out of the PySpark iterator. Each call to ‘next’ returns a single Row object and advances the iterator by one row.

>>> next(iter_var)

Row(id=1, name=u'AAA')

You can access an individual value by qualifying the Row object with a column name. For example, to get the id column of the next row:

>>> next(iter_var).id

2
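Calling ‘next’ by hand is fine for a quick look, but to walk over every row you would normally use a for loop, which stops cleanly when the iterator is exhausted. The following is a minimal sketch of that pattern: sum_ids is a hypothetical helper, and FakeRow with its sample values is a stand-in for PySpark's Row so that the snippet runs without a cluster. With Spark, you would pass sample_df.rdd.toLocalIterator() instead of the stand-in iterator.

```python
from collections import namedtuple


def sum_ids(rows):
    """Consume a row iterator one row at a time, keeping only a running total."""
    total = 0
    for row in rows:
        # Each row is used and then discarded, so the full
        # dataset is never materialised as a local list.
        total += row.id
    return total


# Stand-in for pyspark.sql.Row (assumption: rows expose an `id` attribute).
FakeRow = namedtuple("FakeRow", ["id", "name"])

# Hypothetical sample data; with Spark this would be
# sample_df.rdd.toLocalIterator().
rows = iter([FakeRow(1, "AAA"), FakeRow(2, "BBB")])

total = sum_ids(rows)
print(total)
```

Because the loop keeps only the running total, local memory use does not grow with the number of rows.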
