0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

I'm running some operations in PySpark, and recently increased the number of nodes in my configuration (which is on Amazon EMR). However, even though I tripled the number of nodes (from 4 to 12), performance seems not to have changed. As such, I'd like to see if the new nodes are visible to Spark.

I'm calling the following function:

>>> sc.defaultParallelism
2


But I think this is telling me the number of tasks distributed to each node, not the total number of cores that Spark can see.

1 Answer

0 votes
by (32.3k points)

sc.defaultParallelism is just a hint; depending on the configuration, it may bear no relation to the number of nodes. It is the number of partitions used when you execute an operation that takes a partition-count argument but you don't provide one. For example, sc.parallelize creates a new RDD from a list, and its second argument tells Spark how many partitions to create in that RDD; the default value for that argument is sc.defaultParallelism.
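To make this concrete, here is a minimal sketch (assuming an active SparkContext named sc, as in the question) showing that sc.parallelize falls back to sc.defaultParallelism when no partition count is given:

# Minimal sketch, assuming an active SparkContext named sc.
data = list(range(100))

# No partition count given: Spark falls back to sc.defaultParallelism.
rdd_default = sc.parallelize(data)
print(rdd_default.getNumPartitions())   # same value as sc.defaultParallelism

# An explicit second argument overrides that default.
rdd_explicit = sc.parallelize(data, 8)
print(rdd_explicit.getNumPartitions())  # 8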

You can get the number of executors with sc.getExecutorMemoryStatus in the Scala API, but this is not exposed in the Python API.

In general, it is recommended to have around 4 times as many partitions in an RDD as you have executors. This is a good rule of thumb because, if there is variance in how long tasks take, the extra partitions even it out: one executor might process 5 fast tasks while another processes 3 slow ones, for example.

You don't need to be very accurate here; a rough estimate is enough. For example, if you know you have fewer than 200 CPU cores, 500 partitions will be fine.

So now you should try to create RDDs with this number of partitions:

rdd = sc.parallelize(data, 500)     # when distributing local data

rdd = sc.textFile('file.csv', 500)  # when loading data from a file

Or if you don't control the creation of the RDD, just repartition it before the computation: 

rdd = rdd.repartition(500)

You can check the number of partitions in an RDD with rdd.getNumPartitions()
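For example, a quick check (reusing the hypothetical file.csv from above):

# Quick check of partition counts before and after repartitioning.
rdd = sc.textFile('file.csv')
print(rdd.getNumPartitions())   # whatever Spark chose by default

rdd = rdd.repartition(500)
print(rdd.getNumPartitions())   # 500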

However, in PySpark you can still call the Scala getExecutorMemoryStatus API through PySpark's Py4J bridge:

sc._jsc.sc().getExecutorMemoryStatus().size()
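Putting it together, here is a small sketch (assumptions: an active SparkContext named sc on a cluster, and that getExecutorMemoryStatus reports one entry per executor plus one for the driver) that estimates the executor count and applies the 4x rule of thumb from above:

# Small sketch, assuming an active SparkContext named sc running on a cluster.
# getExecutorMemoryStatus returns one entry per block manager, which usually
# means one per executor plus one for the driver, so subtract 1 (assumption).
num_entries = sc._jsc.sc().getExecutorMemoryStatus().size()
num_executors = max(num_entries - 1, 1)

# Roughly 4 partitions per executor, per the rule of thumb above.
num_partitions = 4 * num_executors

rdd = sc.textFile('file.csv')
rdd = rdd.repartition(num_partitions)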
