Top 5 Mistakes to Avoid When Writing Apache Spark Applications

Apache Spark is one of the most popular big data engines today. Much of that popularity comes from its ability to process streaming data in near real time. Here are some of its key features:

It can run ten to a hundred times faster than Hadoop MapReduce
It ships with machine learning capabilities
It supports multiple languages
It can perform advanced analytics operations

Because of these features, Spark has replaced Hadoop MapReduce for many workloads. It does not replace the whole Hadoop stack, though: Apache Spark can run in standalone mode as well as on top of the Hadoop layer. What keeps it ahead of Hadoop? Read on:

Hadoop	Spark
Stores data on local disk	Stores data in memory
Slow speed	Faster speed
Suitable for batch processing	Suitable for real-time processing
External schedulers required	Schedules tasks itself
High latency	Low latency

However, despite these capabilities, we often get stuck in situations caused by inefficient application code. These situations and their solutions are discussed below.

Be careful in managing the DAG

People often make mistakes in managing the DAG. To avoid them, we should do the following:

Prefer reduceByKey over groupByKey: Both can produce the same result, but groupByKey sends every value for a key across the network, while reduceByKey combines the values within each partition before the shuffle. Hence, use reduceByKey wherever the operation allows it.
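To make the difference concrete, here is a minimal pure-Python sketch (no Spark required) that counts how many records would cross the shuffle in each case. The two-partition input and the function names are made up for illustration; real Spark does the pre-combining inside reduceByKey itself.

```python
from collections import defaultdict

def shuffle_group_by_key(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle unchanged."""
    shuffled = [record for part in partitions for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

def shuffle_reduce_by_key(partitions):
    """reduceByKey-style: each partition pre-combines per key before the shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # map-side combine
        shuffled.extend(local.items())   # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

# Hypothetical input: two partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("b", 1)],
]
grouped_totals, grouped_records = shuffle_group_by_key(partitions)
reduced_totals, reduced_records = shuffle_reduce_by_key(partitions)

print(grouped_totals == reduced_totals)   # True: same answer either way
print(grouped_records, reduced_records)   # 8 4: half as many records shuffled
```

The saving grows with the number of repeated keys per partition, which is why the advice matters most for aggregations such as counts and sums.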

Stay away from shuffles as much as possible:

Do as much of the reduction as possible on the map side
Plan partitioning ahead of time so that data is not repartitioned needlessly
Do as much work as possible within a single shuffle
Watch out for skewed keys and unevenly sized partitions

Prefer treeReduce over reduce: Use treeReduce instead of reduce where possible, because treeReduce does much more of the merging work on the executors, whereas reduce leaves the driver to merge the result of every partition.
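The following pure-Python sketch illustrates the idea behind that last tip. It is a simplified model, not Spark's actual algorithm: `fan_in` is a hypothetical knob invented here (Spark's treeReduce takes a `depth` argument instead), and the "executor" merging is simulated in a plain loop.

```python
from functools import reduce as fold

def flat_reduce(partials, op):
    """reduce-style: the driver merges every partition's partial result itself."""
    return fold(op, partials), len(partials)

def tree_reduce(partials, op, fan_in=4):
    """treeReduce-style: partial results are merged in groups, level by level
    (work that would run on executors), until few enough remain for the driver."""
    level = list(partials)
    while len(level) > fan_in:
        level = [fold(op, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return fold(op, level), len(level)

def add(x, y):
    return x + y

# Hypothetical partial results from 64 partitions.
partials = list(range(64))
flat_result, flat_driver_merges = flat_reduce(partials, add)
tree_result, tree_driver_merges = tree_reduce(partials, add)

print(flat_result == tree_result)               # True
print(flat_driver_merges, tree_driver_merges)   # 64 4
```

In this model the driver handles 4 inputs instead of 64, at the cost of extra intermediate rounds on the executors.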

Maintain the required size of the shuffle blocks

In a shuffle operation, the task that emits the data on the source executor is the “mapper”, the task that consumes the data on the target executor is the “reducer”, and what happens between them is the “shuffle”.

The chunk of data that one mapper writes for one reducer is called a shuffle block. Spark applications often fail because a shuffle block grows larger than 2 GB. Shuffles commonly run with about 200 partitions (the default for spark.sql.shuffle.partitions), which is often too few for large datasets, so each shuffle block becomes large, and once one exceeds 2 GB the application fails. Increasing the number of partitions keeps the blocks small and also reduces the impact of data skew.

As a rule of thumb, aim for about 128 MB per partition. Partitions that are far too small also hurt, because the tasks spend most of their time on overhead and the job runs slowly. Finally, watch the 2000-partition mark: Spark tracks shuffle output differently above 2000 partitions, so if the computed partition count is below 2000 but close to it, the usual advice is to round it up to just above 2000 rather than leave it just under.
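As a rough back-of-the-envelope aid, the 128 MB rule of thumb can be turned into a small calculation. This is only a sketch: it assumes keys are evenly distributed, uses the average partition size as a stand-in for shuffle block size, and the 1 TB shuffle size is a made-up example.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3
TARGET_PARTITION_BYTES = 128 * MB   # rule-of-thumb partition size from the text
MAX_SHUFFLE_BLOCK_BYTES = 2 * GB    # the limit discussed in the text

def suggest_partitions(shuffle_bytes):
    """Partition count that gives roughly 128 MB per partition."""
    return max(1, math.ceil(shuffle_bytes / TARGET_PARTITION_BYTES))

def average_partition_bytes(shuffle_bytes, num_partitions):
    """Average partition size, assuming perfectly even key distribution."""
    return shuffle_bytes / num_partitions

# Hypothetical job that shuffles 1 TB of data.
shuffle_bytes = 1024 * GB
with_default = average_partition_bytes(shuffle_bytes, 200)
suggested = suggest_partitions(shuffle_bytes)

print(with_default > MAX_SHUFFLE_BLOCK_BYTES)   # True: about 5 GB per partition
print(suggested)                                # 8192
```

In a real job the resulting number would typically be applied through spark.sql.shuffle.partitions for DataFrame shuffles, or by passing a partition count to repartition or to the shuffle operation on an RDD.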

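Do not let the jobs slow down

A job that is quick as a plain map can take far longer, around 4 long hours, once it involves a shuffle. The usual cause is skew: a few keys carry most of the data, so a few tasks do most of the work and slow the whole system down. One remedy is to salt the keys, that is, to append a small random suffix to each key so that the heavy keys are spread over several tasks.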

With salting, the aggregation runs in two stages:

an aggregation on the salted keys
an aggregation on the unsalted keys, which combines the partial results from the first stage

Another option is to isolate the few heavily skewed keys, aggregate them separately, and run the normal shuffle only on the remaining data. This saves a huge amount of data from being shuffled.
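The two-stage aggregation on salted and unsalted keys can be sketched in plain Python as follows. The data, the number of salts, and the function name are assumptions made for illustration; in Spark each stage would be its own aggregation over a distributed dataset.

```python
import random
from collections import defaultdict

def salted_sum(records, num_salts=4, seed=0):
    """Two-stage sum: first on (key, salt), then on the original key."""
    rng = random.Random(seed)

    # Stage 1: aggregate on salted keys, so a hot key is split across
    # up to num_salts independent groups.
    stage1 = defaultdict(int)
    for key, value in records:
        salt = rng.randrange(num_salts)
        stage1[(key, salt)] += value

    # Stage 2: drop the salt and combine the partial sums per original key.
    stage2 = defaultdict(int)
    for (key, _salt), partial in stage1.items():
        stage2[key] += partial

    return dict(stage1), dict(stage2)

# Hypothetical skewed input: one hot key and one rare key.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
partials, totals = salted_sum(records)

print(totals["hot"], totals["cold"])   # 1000 10
print(len(partials))                   # number of salted groups, at most 8 here
```

The second stage only sees one partial result per salted key, so the final shuffle is small even when the original key distribution is heavily skewed.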

Perform shading operations to avoid errors

When writing an Apache Spark application, we can hit errors even though Guava is already included in the application's Maven dependencies. The errors occur when the application's Guava version does not match the Guava version that Spark uses. To resolve the conflict, we shade (relocate) the conflicting package. The snippet below relocates com.google.protobuf; the same pattern applies to Guava's com.google.common package:

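<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.2</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>com.company.my.protobuf</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>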

So always shade conflicting dependencies, or else classpath conflicts will undo all your efforts.

Avoid wrongly sized executors

In a Spark job, executors are the worker processes responsible for running the individual tasks of the job. They also provide in-memory storage for RDDs that user programs cache through the Block Manager. Executors are launched at the start of a Spark application and run for its entire lifetime. After running their tasks, they deliver the results to the driver. A common mistake when writing a Spark application is choosing the wrong executor size. The settings that typically go wrong are:

Number of Executors
Cores of each executor
Memory for each executor

Consider a typical cluster of 6 nodes, each with 16 cores and 64 GB of RAM. There are three tempting but wrong ways to size executors on it.

Most granular: with one core per executor, each node runs 16 executors, so the cluster runs 6 x 16 = 96 executors with 64/16 = 4 GB each. Such tiny executors give up the advantage of running multiple tasks in the same Java virtual machine.

Least granular: with one executor per node using all 16 cores and all 64 GB, no memory or CPU remains free for the OS and Hadoop daemons.

Leaving one core and 1 GB per node for that overhead and giving each executor 15 cores and 63 GB is still a problem, because that many cores per executor results in bad throughput.

A better choice is 5 cores per executor. The cluster has 6 x 15 = 90 usable cores, so the number of executors is 90/5 = 18. Leaving one executor for the application master (AM) gives 17. Each node then hosts 3 executors, so the RAM per executor is 63/3 = 21 GB, and after setting aside about 7% for memory overhead, 21 x (1 - 0.07) ≈ 19 GB. Therefore, a well-sized application on this cluster uses 17 executors with 5 cores and 19 GB of RAM each.
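The arithmetic above is easy to wrap in a small helper so it can be repeated for other cluster shapes. This is a sketch under the same assumptions as the text: the 6 x 16-core x 64 GB figures describe cluster nodes, one core and 1 GB per node are reserved for the OS and Hadoop daemons, one executor slot goes to the AM, and about 7% of executor memory is set aside as overhead. The function and parameter names are made up here.

```python
def size_executors(nodes, cores_per_node, ram_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.07):
    """Return (num_executors, cores_per_executor, memory_gb_per_executor)."""
    # Reserve one core and 1 GB per node for the OS and Hadoop daemons.
    usable_cores = cores_per_node - 1
    usable_ram_gb = ram_gb_per_node - 1

    # Executors that fit in the whole cluster, minus one slot for the AM.
    total_slots = (nodes * usable_cores) // cores_per_executor
    num_executors = total_slots - 1

    # Memory is split among the executors that share a node,
    # then reduced by the overhead fraction and rounded down.
    executors_per_node = usable_cores // cores_per_executor
    ram_per_executor_gb = usable_ram_gb / executors_per_node
    memory_gb = int(ram_per_executor_gb * (1 - overhead_fraction))

    return num_executors, cores_per_executor, memory_gb

# The cluster from the text: 6 nodes, 16 cores and 64 GB of RAM each.
print(size_executors(6, 16, 64))   # (17, 5, 19)
```

With spark-submit, these three numbers correspond to the --num-executors, --executor-cores, and --executor-memory options.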

Following the tips above will help you avoid these mistakes when developing an Apache Spark application.