
I am evaluating Google Dataflow and Apache Spark to decide which is the more suitable solution for our big data analysis needs.

I found that the Spark platform includes Spark SQL and MLlib for structured data queries and machine learning.

Are there any corresponding solutions in the Google Dataflow platform?

1 Answer


Here are some key architectural points to consider when comparing Google Cloud Dataflow with Spark.

Resource management: Cloud Dataflow is a fully on-demand execution environment. When you run a job in Dataflow, resources are allocated on demand for that job only, so there is no sharing of, or contention for, resources across jobs. With a Spark or MapReduce cluster, by contrast, you would typically deploy a cluster of X nodes, submit jobs to it, and then tune the node resources across jobs. Of course, you can build up and tear down these clusters, but the Dataflow model is geared towards hands-free DevOps for resource management.

Interactivity: Currently Cloud Dataflow does not provide an interactive mode. Spark can be a better model if you want to load data into the cluster as in-memory RDDs and then execute queries dynamically. The challenge is that as your data sizes and query complexity increase, you will have to handle the DevOps yourself.

Programming Model: Dataflow's programming model leans functional compared with a classic MapReduce model. There are many similarities between Spark and Dataflow in terms of API primitives.
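To give a feel for what "functional" primitives look like, here is a minimal sketch in plain Python. It does not use the Spark or Dataflow SDKs. The helper names (flat_map, group_by_key, combine_per_key, word_count) are made up for illustration and only loosely mirror the shape of operations such as Spark's flatMap/reduceByKey or Dataflow's ParDo/GroupByKey/Combine.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def flat_map(fn, items):
    """Apply fn to each item and flatten the resulting iterables into one list."""
    return list(chain.from_iterable(fn(x) for x in items))


def group_by_key(pairs):
    """Collect the values of (key, value) pairs into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def combine_per_key(fn, groups):
    """Reduce each key's list of values to a single value with fn."""
    return {key: reduce(fn, values) for key, values in groups.items()}


def word_count(lines):
    """Word count expressed as a chain of the primitives above."""
    words = flat_map(str.split, lines)
    pairs = [(word.lower(), 1) for word in words]
    return combine_per_key(lambda a, b: a + b, group_by_key(pairs))


if __name__ == "__main__":
    print(word_count(["a b a", "B c"]))
```

In a real engine each of these steps would run distributed across workers, but the pipeline is still written as a chain of transformations rather than as explicit map and reduce phases.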

Streaming & Windowing: Dataflow (building on top of the unified programming model) was architected to be a highly reliable, durable, and scalable execution environment for streaming. One of the key differences between Dataflow and Spark is that Dataflow lets you easily process data in terms of its true event time, rather than solely at its arrival time into the graph. You can window data into fixed, sliding, session, or custom windows based on event time or arrival time. Dataflow also provides the ability to upgrade your streaming jobs while they are in flight.
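As a rough illustration of the window types mentioned above, here is a plain-Python sketch of how an event timestamp could be assigned to fixed, sliding, and session windows. This is not the Dataflow SDK. The function names are hypothetical, timestamps are plain integers (seconds), windows are half-open intervals (start, end), and the session rule (merge events that fall within the gap of each other) is an assumption about the usual semantics.

```python
def fixed_window(ts, size):
    """Return the single fixed window [start, end) that contains ts."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts, size, period):
    """Return every window of length `size`, starting on a multiple of
    `period`, that contains ts (several windows when period < size)."""
    windows = []
    start = ts - (ts % period)      # latest window start that is <= ts
    while start > ts - size:        # window still covers ts
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps, gap):
    """Group timestamps into sessions: an event closer than `gap` to the
    previous event of a session extends that session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1][1] = ts            # extend the current session
        else:
            sessions.append([ts, ts])       # start a new session
    # Each session ends `gap` after its last event.
    return [(first, last + gap) for first, last in sessions]


if __name__ == "__main__":
    print(fixed_window(125, 60))
    print(sliding_windows(125, 60, 30))
    print(session_windows([10, 12, 50, 55, 11], 10))
```

The point is that the timestamps passed in are event times carried by the data itself. An event that arrives late (like the 11 in the last call) still lands in the window it belongs to, which is the difference from windowing on arrival time.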

If you need to run your ETL and/or MapReduce-style jobs over streams, Dataflow is a solid choice.

If you need interactivity, Spark is a solid choice.
