Top 5 Mistakes to Avoid When Writing Apache Spark Applications

Apache Spark is a fast big data processing engine that lets firms process streaming data. However, developers often make mistakes that degrade Spark's performance. Read on to avoid these errors during Spark application development!

7th Jun, 2019

According to market analysts, Spark is one of the most popular big data engines on the charts. The reason behind this popularity is its ability to process real-time streaming data. Let’s take a look at its key features:

  • It runs ten to a hundred times faster than Hadoop MapReduce
  • It is equipped with machine learning capabilities
  • It supports multiple languages
  • It can perform advanced analytics operations

Because of these features, Spark has largely displaced Hadoop MapReduce; it can run in standalone mode as well as on top of the Hadoop layer. What keeps it ahead of Hadoop? See the comparison below:

Hadoop                             Spark
Stores data on the local disk      Stores data in memory
Slower speed                       Faster speed
Suitable for batch processing      Suitable for real-time processing
Requires external schedulers       Schedules tasks itself
High latency                       Low latency

However, despite these capabilities, we often get stuck in situations that arise from inefficiently written application code. These situations and their solutions are discussed below.


Be careful when managing the DAG

People often make mistakes when managing the DAG. To avoid them, do the following:

Prefer reduceByKey over groupByKey: reduceByKey and groupByKey can produce similar results, but groupByKey shuffles the entire dataset across the network, while reduceByKey combines values within each partition before shuffling. Use reduceByKey wherever possible.
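The difference can be sketched in plain Python (no Spark cluster needed; the partition data here is a made-up word-count example). groupByKey ships every record across the network, while reduceByKey performs a map-side combine first, so far fewer records are shuffled:

```python
from collections import defaultdict

# Two toy partitions of (word, 1) pairs, as a map stage might produce them.
partitions = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

def shuffled_records_groupbykey(parts):
    # Every (key, value) pair crosses the network untouched.
    return sum(len(p) for p in parts)

def shuffled_records_reducebykey(parts):
    # Values are pre-aggregated per key within each partition first,
    # so only one record per distinct key per partition is shuffled.
    count = 0
    for p in parts:
        combined = defaultdict(int)
        for k, v in p:
            combined[k] += v
        count += len(combined)
    return count

print(shuffled_records_groupbykey(partitions))   # 7 records shuffled
print(shuffled_records_reducebykey(partitions))  # 4 records shuffled
```

The gap grows with the number of duplicate keys per partition, which is exactly why groupByKey hurts on large datasets.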

Stay away from shuffles as much as possible:

  • Minimize the map-side output as much as possible
  • Do not waste time on excessive partitioning
  • Shuffle as little data as possible
  • Avoid data skews and skewed partitions

Prefer treeReduce over reduce: treeReduce pushes more of the combining work onto the executors, whereas reduce pulls every partition’s result back to the driver and combines them there.
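A plain-Python sketch of the difference (the function names here are illustrative, not Spark’s API): with a plain reduce, the driver performs all the combine steps itself; with a tree reduce, partial results are merged pairwise in rounds, as Spark does on the executors, so the driver only sees the final merges.

```python
from functools import reduce

# One partial result per partition, as the executors might report them.
partition_sums = [10, 20, 30, 40, 50, 60, 70, 80]

def plain_reduce(values, f):
    # Driver-side: len(values) - 1 sequential combines on one machine.
    return reduce(f, values)

def tree_reduce(values, f):
    # Tree-shaped: pairwise merges each round, log2(n) rounds in total.
    while len(values) > 1:
        values = [f(values[i], values[i + 1]) if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
    return values[0]

print(plain_reduce(partition_sums, lambda a, b: a + b))  # 360
print(tree_reduce(partition_sums, lambda a, b: a + b))   # 360
```

Both produce the same answer; the tree shape simply distributes the combining work instead of serializing it on the driver.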


Maintain the required size of the shuffle blocks

In a shuffle operation, the task that emits data on the source executor is the “mapper”, the task that consumes that data on the target executor is the “reducer”, and what happens between them is the “shuffle”.

A block of shuffled data is called a shuffle block. Spark applications often fail because a shuffle block grows beyond 2 GB. People typically shuffle with around 200 partitions (Spark’s default), which is usually too few, so the shuffle blocks grow in size; once a block exceeds 2 GB, the application fails. Increasing the number of partitions also helps remove data skew. As a rule of thumb, aim for about 128 MB of data per partition. If the partition size is too small, however, tasks become very slow because of scheduling overhead. Hence, to avoid failures while keeping the application fast, keep the number of partitions close to, but not exactly at, 2,000.
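The rule-of-thumb arithmetic can be written down directly. The 128 MB target and 2 GB block limit come from the text above; the 100 GB dataset size is a made-up example:

```python
TARGET_PARTITION_MB = 128        # rule-of-thumb partition size
SHUFFLE_BLOCK_LIMIT_MB = 2048    # hard 2 GB shuffle-block limit

def suggested_partitions(total_data_mb):
    # Ceiling division: enough partitions to keep each one near 128 MB.
    return max(1, -(-total_data_mb // TARGET_PARTITION_MB))

data_mb = 100 * 1024                  # a hypothetical 100 GB shuffle
default_block_mb = data_mb / 200      # 512 MB per block with the 200 default:
                                      # fine here, but a ~400 GB+ shuffle
                                      # would push blocks past the 2 GB limit
print(suggested_partitions(data_mb))  # 800 partitions at ~128 MB each
```

Tuning `spark.sql.shuffle.partitions` (or passing `numPartitions` to the wide transformation) to a value in this ballpark keeps blocks well under the limit.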


Do not let the jobs slow down

When an application shuffles a skewed dataset, a job can take far longer, around 4 long hours, to run. This slows the whole system down.

The fix is salting the keys and aggregating in two stages:

  • aggregation on the salted keys
  • aggregation on the unsalted keys

So we salt the skewed keys, aggregate on the salted keys first, and then aggregate again on the original (unsalted) keys. This spreads the hot keys across many partitions, shrinking the data at each step, so a huge amount of information is saved from being shuffled.
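The two-stage trick can be sketched in plain Python (the record set and salt count here are illustrative, assuming one heavily skewed key). Stage 1 spreads the hot key over several salted buckets; stage 2 strips the salt and merges the already-small partial results:

```python
import random
from collections import Counter

# A word count with one hot key that would otherwise land on one reducer.
records = [("hot", 1)] * 1000 + [("rare", 1)] * 5
NUM_SALTS = 4
random.seed(0)

# Stage 1: aggregate on salted keys, so "hot" is split across NUM_SALTS buckets.
stage1 = Counter()
for key, value in records:
    salted = (key, random.randrange(NUM_SALTS))
    stage1[salted] += value

# Stage 2: strip the salt and aggregate on the original, unsalted keys.
stage2 = Counter()
for (key, _salt), value in stage1.items():
    stage2[key] += value

print(dict(stage2))  # {'hot': 1000, 'rare': 5}
print(len(stage1))   # at most NUM_SALTS partial records per distinct key
```

In Spark the same idea means mapping each key to `(key, randint)`, running reduceByKey, then mapping the salt away and reducing again.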


Perform shading operations to avoid errors

When writing an Apache Spark application, errors can occur even though Guava is already included in the application’s Maven dependencies, because the application’s Guava version does not match the Guava version Spark ships with. To resolve such conflicts, shade (relocate) the dependency as follows:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.2</version>
  <configuration>
    <relocations>
      <relocation>
        <!-- This example relocates protobuf; for a Guava conflict,
             relocate com.google.common the same way -->
        <pattern>com.google.protobuf</pattern>
        <shadedPattern>com.company.my.protobuf</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>

So always perform shading; otherwise, classpath conflicts will undo all your efforts.


Avoid wrongly sized executors

In any Spark job, executors are the worker processes responsible for running the individual tasks of the job. They provide in-memory storage for RDDs that user programs cache through the Block Manager. Executors are created at the very start of a Spark application and stay alive for the whole application’s lifespan. After processing their tasks, they deliver the results to the driver. A common mistake when writing Spark applications is choosing the wrong executor size, i.e. getting the following wrong:

  • Number of Executors
  • Cores of each executor
  • Memory for each executor

Suppose a cluster has 6 nodes, each with 16 cores and 64 GB of RAM.

With 16 cores per node, the cluster has 96 cores in total. In the most granular setup, one core per executor, we get 16 tiny executors per node with 64/16 = 4 GB each, and we lose the advantage of running multiple tasks in the same Java virtual machine. In the least granular setup, one huge executor per node taking all 16 cores and 64 GB, no memory or cores are left as overhead for the OS and Hadoop daemons; even using 15 of the 16 cores in a single executor results in bad HDFS I/O throughput. The sweet spot is 5 cores per executor. Leaving 1 core per node for the OS and Hadoop daemons gives 6 × 15 = 90 usable cores, so the number of executors is 90/5 = 18. Reserving one executor for the YARN ApplicationMaster leaves 17, i.e. 3 executors per node. Memory per executor is then 63/3 = 21 GB, and after subtracting roughly 7% for off-heap overhead, 21 × (1 − 0.07) ≈ 19 GB. Therefore, the correct configuration for this cluster is 17 executors, 5 cores each, and 19 GB of RAM each.
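The sizing arithmetic above can be packaged as a small helper. The 5-cores-per-executor choice, the 1 core and 1 GB reserved per node, and the ~7% memory overhead all come from the rule of thumb in the text; the function name is illustrative:

```python
def size_executors(nodes, cores_per_node, ram_gb_per_node,
                   cores_per_executor=5, overhead_frac=0.07):
    # Reserve 1 core per node for the OS and Hadoop daemons.
    usable_cores = (cores_per_node - 1) * nodes
    # Total executors, minus one slot for the YARN ApplicationMaster.
    executors = usable_cores // cores_per_executor - 1
    # Executors packed per node (counting the AM slot).
    executors_per_node = (executors + 1) // nodes
    # Reserve 1 GB per node, split the rest, then subtract ~7% overhead.
    mem_gb = (ram_gb_per_node - 1) // executors_per_node
    mem_after_overhead = int(mem_gb * (1 - overhead_frac))
    return executors, cores_per_executor, mem_after_overhead

print(size_executors(6, 16, 64))  # (17, 5, 19)
```

Running it on the 6-node, 16-core, 64 GB example reproduces the 17 executors × 5 cores × 19 GB figure derived above.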

Follow the tips above to avoid common mistakes during Apache Spark application development.

Get in touch with Intellipaat for an industry-recognized Online Spark and Scala Certification Training Course!

 

 
