Top Hadoop Interview Questions And Answers

Here are the top 21 objective-type sample Hadoop interview questions, each followed by its answer. The questions were framed by experts from Intellipaat who train for Learn Hadoop Online, to give you an idea of the type of questions that may be asked in an interview. We have taken full care to give correct answers for all the questions. Do comment with your thoughts. Happy job hunting!


Top Answers to Hadoop Interview Questions

1. Compare Hadoop & Spark

Criteria             Hadoop                    Spark
Dedicated storage    HDFS                      None
Speed of processing  Average                   Excellent
Libraries            Separate tools available  Spark Core, SQL, Streaming, MLlib, GraphX

2. What are real-time industry applications of Hadoop?

Apache Hadoop is an open-source software platform for scalable, distributed computing over large volumes of data. It provides fast, high-performance and cost-effective analysis of structured and unstructured data generated on digital platforms and within the enterprise, and it is used across a wide range of departments and sectors today. Some instances where Hadoop is used:

  • Managing traffic on streets.
  • Stream processing.
  • Content Management and Archiving Emails.
  • Processing Rat Brain Neuronal Signals using a Hadoop Computing Cluster.
  • Fraud detection and Prevention.
  • Ad-targeting platforms use Hadoop to capture and analyze clickstream, transaction, video and social media data.
  • Managing content, posts, images and videos on social media platforms.
  • Analyzing customer data in real-time for improving business performance.
  • Public sector fields such as intelligence, defense, cyber security and scientific research.
  • Financial agencies are using Big Data Hadoop to reduce risk, analyze fraud patterns, identify rogue traders, more precisely target their marketing campaigns based on customer segmentation, and improve customer satisfaction.
  • Getting access to unstructured data like output from medical devices, doctor’s notes, lab results, imaging reports, medical correspondence, clinical data, and financial data.

3. How is Hadoop different from other parallel computing systems?

Hadoop provides a distributed file system, HDFS, which lets you store and handle massive amounts of data on a cluster of machines while handling data redundancy. The primary benefit is that, since the data is stored across several nodes, it is better to process it in a distributed manner: each node can process the data stored on it instead of spending time moving it over the network.

In contrast, a relational database system lets you query data in real time, but storing data in tables, records and columns is not efficient when the data is huge.

Hadoop also provides a way to build a column-oriented database with HBase, for real-time queries on rows.

4. In which modes can Hadoop run?

Hadoop can run in three modes:

  • Standalone Mode: The default mode of Hadoop. It uses the local file system for input and output operations. This mode is mainly used for debugging, and it does not support the use of HDFS. No custom configuration is required in the mapred-site.xml, core-site.xml and hdfs-site.xml files. It is much faster than the other modes.
  • Pseudo-Distributed Mode (Single-Node Cluster): You need to configure all three files mentioned above. All daemons run on one node, so the Master and Slave node are the same.
  • Fully Distributed Mode (Multi-Node Cluster): This is the production mode of Hadoop (what Hadoop is known for), where data is distributed and used across several nodes of a Hadoop cluster. Separate nodes are allotted as Master and Slave.

5. Explain the major difference between HDFS block and InputSplit.

In simple terms, a block is the physical representation of data, while a split is the logical representation of the data present in the blocks. The split acts as an intermediary between block and mapper.
Suppose we have two blocks:
Block 1: Intell
Block 2: ipaat

A mapper will read the first block from "I" to "ll", but on its own it does not know that the record continues into the second block. This is where the split comes into play: it forms a logical group of Block 1 and Block 2 that is treated as a single unit.

The split is then turned into key-value pairs by the InputFormat and its RecordReader, and these are sent to the map for further processing. With InputSplit, if you have limited resources, you can increase the split size to limit the number of maps. For instance, if a 640MB file is stored as 10 blocks of 64MB each and resources are limited, you can set the split size to 128MB. Each split then forms a logical group of 128MB, so only 5 maps are needed instead of 10.

However, if the input is marked as not splittable (for example, when the InputFormat's isSplitable() method returns false), the whole file forms one InputSplit and is processed by a single map, which takes more time when the file is big.
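
The split arithmetic in the example above (640MB of data, 64MB blocks, 128MB splits) can be checked with a small sketch. It assumes a simplified rule, number of splits = ceil(file size / split size), and ignores the finer details of Hadoop's own split calculation. SplitMath and numSplits are made-up names for illustration, not part of the Hadoop API.

```java
// Simplified sketch: how many map tasks a file yields for a given split size.
// SplitMath and numSplits are hypothetical names, not part of the Hadoop API.
final class SplitMath {
    static final long MB = 1024L * 1024L;

    // Number of input splits = ceil(fileSize / splitSize), using integer math.
    static long numSplits(long fileSizeBytes, long splitSizeBytes) {
        if (splitSizeBytes <= 0) {
            throw new IllegalArgumentException("split size must be positive");
        }
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long fileSize = 640 * MB;
        // One split per 64MB block.
        System.out.println(numSplits(fileSize, 64 * MB));
        // Two blocks grouped into each 128MB split.
        System.out.println(numSplits(fileSize, 128 * MB));
    }
}
```

Doubling the split size halves the number of map tasks, which is the trade-off the answer describes for clusters with limited resources.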

6. What is distributed cache and what are its benefits?

Distributed Cache, in Hadoop, is a service provided by the MapReduce framework to cache files when they are needed. Once a file is cached for a specific job, Hadoop makes it available on the local file system of each data node where map and reduce tasks are executing. Later, you can easily access and read the cached file and populate any collection (like an array or hashmap) in your code.

Benefits of using distributed cache are:

  • It distributes simple, read-only text or data files as well as more complex types such as jars and archives. Archives are un-archived on the slave nodes.
  • It tracks the modification timestamps of the cached files, which means the files should not be modified while a job is executing.

7. Explain the difference between NameNode, Checkpoint NameNode and BackupNode.
  • NameNode is the core of HDFS and manages the metadata: which file maps to which block locations, and which blocks are stored on which DataNode. In simple terms, it is the data about the data being stored. NameNode maintains a directory-tree structure of all the files present in HDFS on a Hadoop cluster. It uses the following files for the namespace:
    fsimage file: keeps track of the latest checkpoint of the namespace.
    edits file: a log of the changes made to the namespace since the last checkpoint.
  • Checkpoint NameNode has the same directory structure as NameNode. It creates checkpoints of the namespace at regular intervals by downloading the fsimage and edits files and merging them in its local directory. The new image produced by the merge is then uploaded to NameNode.
    There is a similar node, commonly known as the Secondary Node, but it does not support the ‘upload to NameNode’ functionality.
  • Backup Node provides similar functionality to the Checkpoint node while staying synchronized with NameNode. It maintains an up-to-date in-memory copy of the file system namespace, so it does not need to fetch changes at regular intervals. To create a new checkpoint, it only needs to save its current in-memory state to an image file.
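
The checkpoint step described above, merging the edits log into the last fsimage, can be pictured with a small conceptual sketch. It is only an illustration under simplifying assumptions: the namespace is modelled as a map from path to block IDs, and the edits log holds just two operation types. CheckpointSketch, Edit and merge are hypothetical names; real HDFS uses its own on-disk formats and many more operations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch of a checkpoint: replay the edits log on top of the last
// fsimage to produce a new image. All names here are hypothetical.
final class CheckpointSketch {

    // One logged namespace change.
    static final class Edit {
        final String op;      // "ADD" or "DELETE"
        final String path;
        final String blocks;  // block IDs for ADD, ignored for DELETE

        Edit(String op, String path, String blocks) {
            this.op = op;
            this.path = path;
            this.blocks = blocks;
        }
    }

    // Apply every edit, in order, to a copy of the old image.
    static Map<String, String> merge(Map<String, String> fsimage, List<Edit> edits) {
        Map<String, String> merged = new TreeMap<>(fsimage);
        for (Edit e : edits) {
            if (e.op.equals("ADD")) {
                merged.put(e.path, e.blocks);
            } else if (e.op.equals("DELETE")) {
                merged.remove(e.path);
            } else {
                throw new IllegalArgumentException("unknown op: " + e.op);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> fsimage = new TreeMap<>();
        fsimage.put("/data/a.txt", "blk_1,blk_2");
        fsimage.put("/data/b.txt", "blk_3");

        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit("ADD", "/data/c.txt", "blk_4"));
        edits.add(new Edit("DELETE", "/data/b.txt", null));

        // The merged map plays the role of the new fsimage.
        System.out.println(merge(fsimage, edits));
    }
}
```

After the merge, the new image already contains every logged change, so the edits log can start again from empty.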

8. What are the most common Input Formats in Hadoop?

The three most common input formats in Hadoop are:

  • Text Input Format: the default input format in Hadoop.
  • Key Value Input Format: used for plain text files where the files are broken into lines.
  • Sequence File Input Format: used for reading sequence files (binary key/value files).

9. What is a DataNode, and how does NameNode tackle DataNode failures?

DataNode stores data in HDFS; it is the node where the actual data resides in the file system. Each DataNode sends a heartbeat message to signal that it is alive. If the NameNode does not receive a message from a DataNode for 10 minutes, it considers that DataNode dead or out of service and starts replicating the blocks that were hosted on it so that they are hosted on other DataNodes. A BlockReport contains the list of all blocks on a DataNode, so the NameNode knows which blocks need to be replicated.

The NameNode manages the replication of data blocks from one DataNode to another. During this process, the replicated data is transferred directly between DataNodes and never passes through the NameNode.
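
The heartbeat rule can be sketched in a few lines. This is a conceptual model only: HeartbeatMonitor is a hypothetical class, time is passed in explicitly as milliseconds, and the 10-minute timeout is simply the figure quoted in the answer above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of heartbeat-based failure detection. HeartbeatMonitor is a
// hypothetical class; the 10-minute timeout is the figure quoted in the text.
final class HeartbeatMonitor {
    static final long TIMEOUT_MS = 10L * 60L * 1000L;

    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

    // Record that a DataNode reported in at the given time.
    void heartbeat(String dataNode, long nowMs) {
        lastHeartbeatMs.put(dataNode, nowMs);
    }

    // DataNodes whose last heartbeat is older than the timeout.
    List<String> deadNodes(long nowMs) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            if (nowMs - e.getValue() > TIMEOUT_MS) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor();
        monitor.heartbeat("dn1", 0L);
        monitor.heartbeat("dn2", 0L);
        // dn1 keeps reporting at the 9-minute mark; dn2 goes silent.
        monitor.heartbeat("dn1", 9L * 60L * 1000L);

        // At 11 minutes only dn2 has exceeded the timeout, so its blocks
        // would be scheduled for re-replication on other DataNodes.
        System.out.println(monitor.deadNodes(11L * 60L * 1000L));
    }
}
```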

10. What are the core methods of a Reducer?

The three core methods of a Reducer are:

  1. setup(): used for configuring various parameters such as input data size and distributed cache. It is called once at the start of the task.
    protected void setup(Context context)
  2. reduce(): the heart of the reducer, called once per key with the values associated with that key.
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
  3. cleanup(): called only once, at the end of the task, to clean up temporary files.
    protected void cleanup(Context context)
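
The calling order of the three methods can be seen in a plain-Java imitation. This sketch deliberately does not use the Hadoop Reducer class, so that it runs without any Hadoop libraries; ReducerLifecycleSketch and its methods are hypothetical stand-ins, and the reduce step is a simple word-count style sum.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the Reducer life cycle (not the Hadoop API):
// setup() once, reduce() once per key, cleanup() once at the end.
final class ReducerLifecycleSketch {
    final List<String> log = new ArrayList<>();

    void setup() {
        log.add("setup");
    }

    // Sum the values for one key, as a word-count reducer would.
    void reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        log.add(key + "=" + sum);
    }

    void cleanup() {
        log.add("cleanup");
    }

    // Drive the three methods in the order the framework would.
    List<String> run(Map<String, List<Integer>> groupedInput) {
        setup();
        for (Map.Entry<String, List<Integer>> e : groupedInput.entrySet()) {
            reduce(e.getKey(), e.getValue());
        }
        cleanup();
        return log;
    }

    public static void main(String[] args) {
        // Input already grouped by key, as it would be after sort and shuffle.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("hadoop", Arrays.asList(1, 1, 1));
        grouped.put("spark", Arrays.asList(1, 1));
        System.out.println(new ReducerLifecycleSketch().run(grouped));
    }
}
```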
11. What is SequenceFile in Hadoop?

Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key/value pairs. The map outputs are stored as SequenceFile internally. It provides Reader, Writer and Sorter classes. The three SequenceFile formats are:

  1. Uncompressed key/value records.
  2. Record compressed key/value records – only ‘values’ are compressed here.
  3. Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable.
12. What is the role of Job Tracker in Hadoop?

Job Tracker’s primary functions are resource management (managing the Task Trackers), tracking resource availability, and task life cycle management (tracking task progress and fault tolerance).

  • It is a process that usually runs on a separate node, not on a DataNode.
  • Job Tracker communicates with the NameNode to identify data location.
  • It finds the best Task Tracker nodes to execute tasks on.
  • It monitors individual Task Trackers and reports the overall job status back to the client.
  • It tracks the execution of MapReduce workloads local to the slave node.
13. What is the use of RecordReader in Hadoop?

Since Hadoop splits data into blocks, RecordReader is used to read the split data into single records. For instance, if our input data is split like:
Row1: Welcome to
Row2: Intellipaat
it will be read as “Welcome to Intellipaat” using RecordReader.
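
A line-oriented reader has to cope with a record that straddles a split boundary. The sketch below shows one way to do that, under the assumption that every split except the first skips its leading (possibly partial) line, while each split reads past its end to finish the last line it started. SplitLineReader is a hypothetical class that works on an in-memory string; it is not Hadoop's own reader.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how a line-oriented record reader can treat split boundaries so
// that a line straddling two splits is returned exactly once.
// SplitLineReader is a hypothetical class, not part of Hadoop.
final class SplitLineReader {

    // Return the lines owned by the split covering [start, end) of data.
    static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Not the first split: skip the (possibly partial) first line,
            // because the previous split reads past its end to finish it.
            int nl = data.indexOf('\n', start);
            pos = (nl < 0) ? data.length() : nl + 1;
        }
        // Keep reading while a line starts at or before the split end.
        while (pos <= end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl < 0) ? data.length() : nl;
            records.add(data.substring(pos, lineEnd));
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "Welcome to Intellipaat\nHadoop\n";
        // The boundary at offset 10 falls in the middle of the first line.
        System.out.println(readSplit(data, 0, 10));
        System.out.println(readSplit(data, 10, data.length()));
    }
}
```

With the boundary in the middle of the first line, the first split still returns the whole line and the second split starts at the next one, so no record is lost or duplicated.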

14. What is Speculative Execution in Hadoop?

One limitation of Hadoop is that, because tasks are distributed across several nodes, a few slow nodes can hold back the rest of the program. There are various reasons for tasks to be slow, and they are sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks, Hadoop tries to detect when a task runs slower than expected and then launches an equivalent task as a backup. This backup mechanism in Hadoop is Speculative Execution.

It creates a duplicate task on another node, so the same input can be processed multiple times in parallel. When most tasks in a job have completed, the speculative execution mechanism schedules duplicate copies of the remaining (slower) tasks on nodes that are currently free. When one of these tasks finishes, the JobTracker is notified. If other copies are still executing speculatively, Hadoop tells the TaskTrackers to quit those tasks and reject their output.

Speculative execution is enabled by default in Hadoop. To disable it, set the speculative-execution JobConf options, such as mapred.reduce.tasks.speculative.execution, to false.

15. What happens if you try to run a Hadoop job with an output directory that is already present?

It will throw an exception saying that the output directory already exists.

To run the MapReduce job, you need to ensure that the output directory does not already exist in HDFS.

To delete the directory before running the job, you can use the shell:

    hadoop fs -rmr /path/to/your/output/

Or via the Java API:

    FileSystem.get(conf).delete(outputDir, true);

16. How can you debug Hadoop code?

First, check the list of MapReduce jobs currently running. Next, make sure that there are no orphaned jobs running; if there are, you need to determine the location of the ResourceManager (RM) logs.

  1. Run: “ps -ef | grep -i ResourceManager”
    and look for the log directory in the displayed result. Find the job ID in the displayed list and check whether there is any error message associated with that job.
  2. On the basis of the RM logs, identify the worker node that was involved in the execution of the task.
  3. Now log in to that node and run: “ps -ef | grep -i NodeManager”
  4. Examine the NodeManager log. The majority of errors come from the user-level logs for each MapReduce job.
17. How to configure Replication Factor in HDFS?

hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml changes the default replication factor for all files subsequently placed in HDFS.
You can also modify the replication factor on a per-file basis using the Hadoop FS shell:

    [training@localhost ~]$ hadoop fs -setrep -w 3 /my/file

Likewise, you can change the replication factor of all the files under a directory:

    [training@localhost ~]$ hadoop fs -setrep -R -w 3 /my/dir

18. How to compress mapper output but not the reducer output?

To achieve this compression, you should set:

conf.setBoolean("", true);
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);
19. What is the difference between Map Side join and Reduce Side Join?

A map-side join is performed before the data reaches the map function, and it requires strictly structured input. A reduce-side join (repartitioned join), on the other hand, is simpler because the input datasets need not be structured. However, it is less efficient, as the data has to go through the sort and shuffle phases, which adds network overhead.
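
The difference in data flow can be sketched with in-memory collections. The map-side method below uses one common variant, a replicated join in which the smaller data set is assumed to fit in memory; the reduce-side method tags records by source and groups them by key, which stands in for sort and shuffle. JoinSketch, its methods, and the sample employee and department records are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch of the two join styles; names and data are hypothetical.
final class JoinSketch {

    // Map-side (replicated) join: look up each employee's department in an
    // in-memory table while reading the record; no shuffle is needed.
    static List<String> mapSideJoin(Map<String, String> depts, List<String[]> employees) {
        List<String> out = new ArrayList<>();
        for (String[] emp : employees) {           // emp = {name, deptId}
            String deptName = depts.get(emp[1]);
            if (deptName != null) {                // inner join: drop unmatched
                out.add(emp[0] + "," + deptName);
            }
        }
        return out;
    }

    // Reduce-side join: tag each record with its source, group by join key
    // (the role of sort and shuffle), then pair the records in each group.
    static List<String> reduceSideJoin(Map<String, String> depts, List<String[]> employees) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, String> d : depts.entrySet()) {
            groups.computeIfAbsent(d.getKey(), k -> new ArrayList<>()).add("D:" + d.getValue());
        }
        for (String[] emp : employees) {
            groups.computeIfAbsent(emp[1], k -> new ArrayList<>()).add("E:" + emp[0]);
        }
        List<String> out = new ArrayList<>();
        for (List<String> group : groups.values()) {
            for (String left : group) {
                if (!left.startsWith("E:")) {
                    continue;
                }
                for (String right : group) {
                    if (right.startsWith("D:")) {
                        out.add(left.substring(2) + "," + right.substring(2));
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> depts = new HashMap<>();
        depts.put("10", "Sales");
        depts.put("20", "Engineering");
        List<String[]> employees = new ArrayList<>();
        employees.add(new String[] {"asha", "20"});
        employees.add(new String[] {"ravi", "10"});
        employees.add(new String[] {"li", "30"});   // no matching department
        System.out.println(mapSideJoin(depts, employees));
        System.out.println(reduceSideJoin(depts, employees));
    }
}
```

Both methods produce the same joined pairs, possibly in a different order; the reduce-side version simply has to regroup every record by key first, which is where the extra network cost comes from on a real cluster.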

20. How can you transfer data from Hive to HDFS?

By writing the query:

hive> insert overwrite directory '/' select * from emp;

You can write the query for whatever data you want to export from Hive to HDFS. The output will be stored in part files in the specified HDFS path.

21. Which companies use Hadoop?

Yahoo! (the biggest contributor to the creation of Hadoop; the Yahoo search engine uses Hadoop), Facebook (developed Hive for analysis), Amazon, Netflix, Adobe, eBay, Spotify and Twitter.


