
Correct me if I'm wrong, but my understanding is that Hadoop does not use MPI for communication between different nodes.

What are the technical reasons for this?

I could hazard a few guesses, but I do not know enough of how MPI is implemented "under the hood" to know whether or not I'm right.

Come to think of it, I'm not entirely familiar with Hadoop's internals either. I understand the framework at a conceptual level (map/combine/shuffle/reduce and how that works at a high level), but I don't know the nitty-gritty implementation details. I've always assumed Hadoop was transmitting serialized data structures (perhaps GPBs) over a TCP connection, e.g. during the shuffle phase. Let me know if that's not true.

1 Answer


MPI stands for Message Passing Interface. As the name suggests, the model is built around passing messages between processes and has no built-in notion of data locality: you send the data to another node for it to be computed on. As a result, MPI tends to be network-bound in terms of performance when working with large data.

One of the big features of Hadoop/MapReduce is fault tolerance. Most current MPI implementations do not support fault tolerance, which is why implementing Hadoop on top of MPI is not done in practice.
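To make the fault-tolerance point concrete, here is a minimal sketch of the idea of re-running a failed task on its input split instead of aborting the whole job. It is not Hadoop's actual code: `run_with_retries`, `flaky_square`, and the `max_attempts` value are hypothetical names and numbers chosen for illustration, and the "node failure" is simulated with an exception.

```python
def run_with_retries(task, splits, max_attempts=4):
    """Run `task` on every split, retrying a split when an attempt fails.

    Illustrative only: a real framework would reschedule the attempt on a
    different node; here a retry is just another call to `task`.
    """
    results = {}
    attempts_used = {}
    for split in splits:
        for attempt in range(1, max_attempts + 1):
            try:
                results[split] = task(split, attempt)
                attempts_used[split] = attempt
                break  # this split is done, move on to the next one
            except RuntimeError:
                continue  # simulated failure: try the same split again
        else:
            # Every attempt failed, so the job as a whole fails.
            raise RuntimeError(f"split {split!r} failed {max_attempts} times")
    return results, attempts_used


def flaky_square(split, attempt):
    """Hypothetical task: squares its input, failing once for split 2."""
    if split == 2 and attempt == 1:
        raise RuntimeError("simulated node failure")
    return split * split


results, attempts_used = run_with_retries(flaky_square, [1, 2, 3])
print(results)        # {1: 1, 2: 4, 3: 9}
print(attempts_used)  # {1: 1, 2: 2, 3: 1}
```

Only the failed split is recomputed, which works because each task depends only on its own input split and not on state held by other tasks.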

MapReduce, by contrast, runs on top of the Hadoop Distributed File System, which replicates data across nodes so that you can do your compute against local storage, streaming off the disk and straight to the processor.
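The locality idea can be sketched roughly as follows: when assigning a task for a block, prefer a node that already stores a replica of that block, and fall back to a remote node only if no such node has capacity. This is a simplified illustration, not Hadoop's scheduler; `assign_tasks`, the block ids, and the node names are all hypothetical.

```python
def assign_tasks(block_replicas, free_slots):
    """Assign each block to a node, preferring nodes that hold a replica.

    block_replicas: dict mapping block id -> list of nodes storing a replica
    free_slots:     dict mapping node name -> number of tasks it can accept
    Returns a dict mapping block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment = {}
    for block, replicas in block_replicas.items():
        # First choice: a node that stores a replica and has a free slot.
        local = [node for node in replicas if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fallback: any node with a free slot; the block must then be
            # read over the network.
            remote = [node for node, free in slots.items() if free > 0]
            if not remote:
                raise RuntimeError("no free slots left")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


blocks = {
    "blk_1": ["node_a", "node_b"],
    "blk_2": ["node_a", "node_c"],
    "blk_3": ["node_a", "node_b"],
}
plan = assign_tasks(blocks, {"node_a": 1, "node_b": 1, "node_c": 1})
for block, (node, is_local) in plan.items():
    print(block, node, "local" if is_local else "remote")
```

Because each block has more than one replica, the scheduler in this sketch has several candidate nodes per block, which is what lets all three tasks read from local disk here.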

Adding fault tolerance, which would make implementing Hadoop on MPI more practical, is reportedly being considered for future versions of Open MPI.
