What Is Apache Spark?
If you want to learn Apache Spark, check out this link that gives a thorough introduction to Apache Spark.
Understanding the Need for Apache Spark
Hadoop still rules the Big Data world and it is the primary choice for Big Data Analytics. It is however not optimized for specialized kind of workloads. Reasons behind this are given below:
- Hadoop doesn’t have iteration support. It doesn’t support cyclic data flow where the output of a former stage is the input to the subsequent stage.
- On disk, there is persisting intermediate data, and this is the reason for high latency in Hadoop. MapReduce framework is relatively slower as it provides support for various structures, formats, and volumes of data. Time required to perform the Map and the Reduce tasks by MapReduce is therefore relatively very high when the time taken by Spark is considered.
In spite of this, Hadoop’s processing paradigm is good for dealing with large data. Why do you think PayPal is using it massively? Spark has improved over Hadoop by including the strengths of Hadoop, along with making up for its weaknesses. It, therefore, provides highly efficient batch processing of Hadoop, and furthermore the latency involved is also less. Spark has thus fulfilled the need for parallel execution requirements of Analytics professionals and it is an irreplaceable tool in the Big Data community.
Check out the video on PySpark Course to learn more about its basics:
How Does Spark’s Parallel Processing Work Like a Charm?
There is a driver program within the Spark cluster where the application logic execution is stored. Here, data is processed in parallel with multiple workers. This kind of data processing is not an ideal practice, but this is how it typically happens. Among workers, data is placed side by side and partitioned within the cluster across the same set of machines. The driver program during execution passes the code into the worker machines where processing will be conducted of the corresponding partition of data. To prevent data shuffling across machines, the data will undergo different steps of transformation, all the while staying in the same partition. At the worker machines, actions are executed and the result is returned to the driver program.
Resilient Distributed Dataset (RDD) is the trump card of Spark technology which is an important distributed data structure. Across multiple machines, it is physically partitioned inside a cluster and is a centralized entity when logically considered. Within a cluster, inter-machine data shuffling can be lowered by controlling how various RDDs are co-partitioned. There is a ‘partition-by’ operator which by redistributing data in the original RDD creates a new RDD across machines in the cluster.
Fast access is the obvious benefit when RDD is optimally cached in RAM. Currently in the Analytics world, caching granularity is done at the RDD level. It is like all or none. Either the entire RDD is cached or it is not cached. If sufficient memory is available in the cluster, Spark will try to cache the RDD. This is done based on the Least Recently Used (LRU) eviction algorithm. Expression of the application logic as a sequence of transformations is possible through an abstract data structure which is provided by the RDD. This process can happen regardless of the underlying distributed nature of data.
As said earlier, application logics are usually expressed in transformation and action. The processing dependency, DAG among RDDs is what ‘transformation’ specifies. The kind of output is specified by ‘action’. To find out the execution sequence of DAG, the scheduler performs a topology sort which traces way back to the source nodes. This node represents a cached RDD.
Subsets of Spark Dependencies
Dependencies of Spark will be of two types, and they are narrow dependency and wide dependency.
Narrow Dependency |
Wide Dependency |
Narrow dependency is where a single child RDD consumes all partitions of the parent RDD. |
In wide dependency, different child RDDs will get various splits of the parent RDD based on the child RDDs’ keys. |
As an example, narrow dependency can be compared to a family where for a parent (RDD) there is only a single child (RDD). In this case, all wealth and inheritance (partitions) will go to the single child (RDD). |
Wide dependency can be compared to another family where a single parent (RDD) has multiple children (RDD). Here, the parent’s wealth (partition) would get split among all the children (RDD). |
The key partitioning between parent and child RDDs is preserved when RDDs with narrow dependencies are used. RDDs can hence be co-partitioned with the same keys meaning that the child key range is the superset of the parent key range. Due to the virtue of the above process, the process of creating a child RDD from a parent RDD can be done across the network with no data within a machine. Data shuffling happens with wide dependencies. The type of dependencies will be examined by the scheduler and it will group the narrow dependency RDD into a stage which is a unit of processing. Across consecutive stages within the execution, wide dependencies will span. This process requires the number of child RDDs to be explicitly specified.
How Does Parallel Processing Actually Happen?
The parallel processing execution sequence in Spark is as follows:
- RDD is usually created from external data sources like local file or HDFS.
- RDD undergoes a series of parallel transformations like filter, map, groupBy, and join where each transformation provides a different RDD which gets fed to the next transformation.
- The last stage is of action where RDD is exported as an output to external data sources.
The above three stages of processing are something similar to the topological sort of DAG. Immutability is the key here where an RDD after being processed this way can’t be changed back or tampered with in anyway. If the RDD is not used as a cache, then typically it is used to feed the subsequent transformation to produce the next RDD which is used then to produce some action output.
You might already know how fault tolerance happens in the Cloud and Big Data systems, where a dataset is replicated across multiple data centers in the case of Cloud systems or nodes in the case of Big Data systems. In the event of any natural disasters or any untoward incident happening to a dataset in a particular data center or node, the dataset from another data center or node can be retrieved and used.
Spark’s Fault Resilience
Spark has a varied approach in fault resilience. Spark is essentially a highly efficient and large compute cluster, and it doesn’t have a storage capability like the way Hadoop has HDFS. Spark takes as obvious two assumptions of the workloads which come to its door for being processed:
- Spark expects that the processing time is finite. Obviously, the cost of recovery is higher when the processing time is high.
- Spark assumes that external data sources are responsible for data persistence in the parallel processing of data. Therefore, the responsibility of stabilizing the data during the processing falls on them.
Spark re-executes the previous steps to recover the lost data to compensate for the same during the execution. Not all executions need to be done from the beginning. Only those partitions in the parent RDD which were responsible for the faulty partitions need to be re-executed. In narrow dependencies, this process resolves to the same machine.
You can imagine the re-execution of the lost partition as something similar to the DAG lazy execution. The lazy evaluation starts all the way from the leaf node tracing through the parent nodes and finally reaching the source node in such traversal. Compared to lazy evaluation, here, an extra piece of information is required, i.e., the partition to find out which parent RDD is needed.
Re-execution of wide dependencies in this fashion will result in the re-execution of everything as it can touch on a lot of parent RDDs across various machines. How Spark overcomes this problem is worth taking a note. Spark persists the output intermediate data from a mapper function and it sends the data to various machines after shuffling it. Note that Spark performs such operations on many mapper functions in parallel. You may ask why Spark persists that intermediate data. This is because if a machine crashes, re-execution will just take that persisted intermediate mapper data again for re-execution from another machine where this data is replicated. Spark provides a checkpoint API which supports this re-execution process of persisted data. The checkpoint API is named appropriately to what it does.
Use Cases
We have dedicated a whole blog post on the use cases of Spark. Do go through it.
Conclusion
In providing low-latency highly parallel processing for Big Data Analytics, Spark has kept its promise. One can run action and transformation operations in widely used programming languages such as Java, Scala, and Python. Spark also provides a striking balance between the latency of recovery and check-pointing based on the statistical result. Spark will increasingly be used and will be synonymous to the real-time analytics framework in the near future. In fact, it has already begun wearing that garb.