I am running a simple app in PySpark:

from operator import add

f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)


I want to view the RDD contents using the foreach action:

wc.foreach(print)


This throws a syntax error:

SyntaxError: invalid syntax


What am I missing?

1 Answer


You are encountering this error because in Python 2 (including 2.6) print is a statement, not a function, so it cannot be passed as an argument to foreach.

You can either import print_function from the __future__ module so that print is treated as a function:

>>> from __future__ import print_function
>>> wc.foreach(print)

Or define a helper function that performs the print:

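>>> from operator import add
>>> f = sc.textFile("README.md")
>>> wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
>>> def g(x):
...     print x  # Python 2 print statement, wrapped in a function
...
>>> wc.foreach(g)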
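
Both fixes work for the same reason: foreach only needs a callable that accepts one element. Here is a minimal plain-Python 3 sketch of that idea. The `foreach` function and the sample `pairs` list are hypothetical stand-ins for illustration, not Spark's API or real output:

```python
def foreach(items, fn):
    """Hypothetical stand-in for RDD.foreach: apply fn to each element."""
    for item in items:
        fn(item)


def g(x):
    # Named helper that wraps the print call.
    print(x)


# Made-up sample data in the same (word, count) shape as wc.
pairs = [("spark", 2), ("python", 1)]

foreach(pairs, print)  # fine on Python 3, where print is a function
foreach(pairs, g)      # a named helper is accepted the same way

# Any one-argument callable works, e.g. list.append to gather elements.
seen = []
foreach(pairs, seen.append)
```

On Python 2 without the __future__ import, the first call is the one that fails to parse, because print is a keyword there rather than a name.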

Note: foreach executes on the worker nodes, so the output may not appear in your driver/shell (it probably will in local mode, but not when running on a cluster).

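Therefore, it is better to use collect() to bring the RDD contents back to the driver and print them there (for a large RDD, take(n) fetches only the first n elements instead of loading everything into driver memory):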

>>> for x in wc.collect():
...     print x
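
As a rough analogy only (this is plain Python 3, not Spark), the difference between printing on a worker and collecting to the driver resembles printing in a child process whose stdout is not attached to your console. The `worker_code` string and the sample pair are made up for illustration:

```python
import subprocess
import sys

# Code run in a separate Python process, standing in for a worker.
# It prints one made-up (word, count) pair.
worker_code = "print(('spark', 2))"

# The child's stdout goes to a pipe, not to this process's console,
# loosely like worker-side print output on a cluster.
result = subprocess.run(
    [sys.executable, "-c", worker_code],
    capture_output=True,
    text=True,
)

# The parent sees the value only after reading the data back,
# which is the role collect() plays for an RDD.
returned = result.stdout.strip()
print(returned)
```

Nothing from the child reaches the parent's console until the parent reads `result.stdout` and prints it itself.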
