Explore Courses Blog Tutorials Interview Questions
0 votes
in Big Data Hadoop & Spark by (11.4k points)

Apache Beam supports multiple runner backends, including Apache Spark and Flink. I'm familiar with Spark/Flink and I'm trying to see the pros/cons of Beam for batch processing.

Looking at the Beam word count example, it feels it is very similar to the native Spark/Flink equivalents, maybe with a slightly more verbose syntax.

1 Answer

0 votes
by (32.3k points)

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open sources Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

There's a few things that Beam adds over many of the existing engines.

  • Fusing batch and streaming: Many systems can handle both batch and streaming, but they often do so via separate APIs. But in Beam, batch and streaming are just two points on a spectrum of latency, completeness, and cost. There's no learning/rewriting cliff from batch to streaming. So if you write a batch pipeline today but tomorrow your latency needs change, it's incredibly easy to adjust. You can see this kind of journey in the Mobile Gaming examples.

  • APIs that raise the level of abstraction: Beam's APIs focus on capturing properties of your data and your logic, instead of letting details of the underlying runtime leak through. This is both key for portability (see next paragraph) and can also give runtimes a lot of flexibility in how they execute. Something like ParDo fusion (aka function composition) is a pretty basic optimization that the vast majority of runners already do. Other optimizations are still being implemented for some runners. For example, Beam's Source APIs are specifically built to avoid overspecification the sharding within a pipeline. Instead, they give runners the right hooks to dynamically rebalance work across available machines. This can make a huge difference in performance by essentially eliminating straggler shards. In general, the more smarts we can build into the runners, the better off we'll be. Even the most careful hand tuning will fail as data, code, and environments shift.

Portability across runtimes: Because data shapes and runtime requirements are neatly separated, the same pipeline can be run in multiple ways. And that means that you don't end up rewriting code when you have to move from on-prem to the cloud or from a tried and true system to something on the cutting edge. You can very easily compare options to find the mix of environment and performance that works best for your current needs. And that might be a mix of things -- processing sensitive data on premise with an open source runner and processing other data on a managed service in the cloud.

Browse Categories