Big Data Hadoop Cheat Sheet
Over the last decade, the volume of data being generated has grown enormously, and organizations have looked for ways to put it to use. Analyzing and learning from this data has opened up many opportunities, which is how Big Data became a buzzword in the IT industry. A range of technologies and platforms then emerged for learning from the enormous amounts of data collected from all kinds of sources. That raises the question, “How do we process Big Data?” Apache Hadoop filled that gap and has become one of the most popular open-source software projects.
Download a Printable PDF of this Cheat Sheet
This Big Data cheat sheet covers the basics of Hadoop and its most important commands. It is meant for new learners as well as for anyone who wants a quick look at the important topics of Big Data Hadoop. For an illustrated version of this topic, refer to our tutorial blog on Big Data Hadoop.
For a better understanding of Big Data Hadoop, we recommend completing our project-based Data Science Course.
Big Data: Big data refers to datasets that cannot be processed using traditional computing techniques because of their huge volume, high velocity, and wide variety.
Hadoop: Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
Hadoop Common: These are the Java libraries and utilities required by the other Hadoop modules. They contain the scripts and files needed to start Hadoop.
Hadoop YARN: YARN is a framework for job scheduling and cluster resource management.
Hadoop Distributed File System: HDFS is a Java-based distributed file system that provides scalable, reliable data storage and high-throughput access to application data.
Hadoop MapReduce: MapReduce is a software framework for writing applications that process large amounts of data in parallel on large clusters.
Apache Hive: Hive is a data warehousing infrastructure for Hadoop.
Apache Oozie: Oozie is a Java application responsible for scheduling Hadoop jobs.
Apache Pig: Pig is a data flow platform whose scripts are compiled into and executed as MapReduce jobs.
Apache Spark: Spark is an open-source framework used for cluster computing.
Flume: Flume is an open-source aggregation service responsible for collecting data and transporting it from source to destination.
HBase: Apache HBase is a column-oriented database for Hadoop that stores big data in a scalable way.
Sqoop: Sqoop is a command-line interface application used to transfer data between Hadoop and relational databases.
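To make the MapReduce entry above more concrete, here is a minimal, single-process sketch of the map, shuffle, and reduce steps applied to a word count. It is plain Python and does not use Hadoop at all. The function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are illustrative assumptions, and a real Hadoop job would distribute these steps across the cluster rather than run them in memory.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory dataset."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines read from HDFS.
    sample = ["big data on hadoop", "hadoop processes big data"]
    print(word_count(sample))
```

In this sketch each phase is a separate function so that the data passed between them (key-value pairs, then grouped values) is easy to inspect.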
Hadoop Ecosystem:
The Hadoop ecosystem comprises the various Apache software components that work alongside Hadoop. Typically, it can be divided into the following categories:
- Top-Level Interface
- Top-Level Abstraction
- Distributed Data Processing
- Self-Healing Clustered Storage System
Hadoop file automation commands:
cat: This command is used to copy the contents of the source paths to the standard output
hdfs dfs -cat URI [URI ...]
chgrp: This command is used to change the group of the files
hdfs dfs -chgrp [-R] GROUP URI [URI ...]
chmod: This command is used to change the permissions of the files
hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
chown: This command is used to change the owner of the files
hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
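cp: This command is used to copy one or more files from the source to the destination path
hdfs dfs -cp URI [URI ...] <dest>
count: This command is used to count the directories, files, and bytes under the given paths
hdfs dfs -count [-q] <paths>
du: It is used to display the size of directories or files
hdfs dfs -du [-s] [-h] URI [URI ...]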
get: This command is used to copy files to the local file system
hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>
ls: It is used to list files and directories along with their statistics
hdfs dfs -ls <args>
mkdir: This command is used to create one or more directories
hdfs dfs -mkdir <paths>
mv: It is used to move one or more files from one location to another
hdfs dfs -mv URI [URI ...] <dest>
put: This command is used to copy one or more files from the local file system to the destination file system
hdfs dfs -put <localsrc> ... <dest>
rm: This command is used to delete one or more files; add -r to delete directories recursively
hdfs dfs -rm [-r] [-skipTrash] URI [URI ...]
stat: It is used to display information about a specific path
hdfs dfs -stat URI [URI ...]
help: It is used to display the usage information of a command
hdfs dfs -help <cmd-name>
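The file commands above all share the same shape: hdfs dfs, then a dash-prefixed operation, then its arguments. As a rough illustration of scripting them, the sketch below builds those argument lists in Python and runs them only if an hdfs executable is found on the PATH. The helper names (hdfs_dfs_command, run_if_available) and all paths are hypothetical and are not part of Hadoop.

```python
import shlex
import shutil
import subprocess
from typing import List, Sequence


def hdfs_dfs_command(operation: str, *args: str) -> List[str]:
    """Build the argument list for `hdfs dfs -<operation> <args...>`."""
    if not operation or operation.startswith("-"):
        raise ValueError("pass the operation name without a leading dash")
    return ["hdfs", "dfs", f"-{operation}", *args]


def run_if_available(command: Sequence[str]) -> int:
    """Run the command when its executable is on PATH.

    Returns the process exit code, or -1 when the executable is missing
    (for example, on a machine without Hadoop installed).
    """
    if shutil.which(command[0]) is None:
        print("skipped (executable not found):", shlex.join(command))
        return -1
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    steps = [
        hdfs_dfs_command("mkdir", "/tmp/demo"),
        hdfs_dfs_command("put", "local.txt", "/tmp/demo"),
        hdfs_dfs_command("chmod", "-R", "644", "/tmp/demo"),
        hdfs_dfs_command("ls", "/tmp/demo"),
    ]
    for step in steps:
        run_if_available(step)
```

Passing the command as a list rather than a single string avoids shell quoting problems when a path contains spaces.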
Hadoop Administration commands:
The commands below can be used only by Hadoop administrators. Each is listed with the operation it performs.
balancer: To run the cluster balancing utility
daemonlog: To get or set the log level of each daemon
dfsadmin: To run HDFS administrative operations
datanode: To run the HDFS datanode service
mradmin: To run MapReduce administrative operations
jobtracker: To run the MapReduce job tracker
namenode: To run the namenode
tasktracker: To run a MapReduce task tracker node
secondarynamenode: To run the secondary namenode
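As a rough illustration of what the balancer entry above is checking for, the sketch below flags datanodes whose disk utilization differs from the overall cluster utilization by more than a threshold percentage. This is a simplified model written for this cheat sheet, not the balancer's actual code. The function name, the node names, and the figures are hypothetical, and the default threshold of 10 percent is an assumption based on the balancer's documented default.

```python
from typing import Dict, List


def unbalanced_datanodes(
    used_bytes: Dict[str, int],
    capacity_bytes: Dict[str, int],
    threshold_pct: float = 10.0,  # assumed default, see the note above
) -> List[str]:
    """Return the nodes whose utilization is too far from the cluster's."""
    # Cluster utilization: total used space over total capacity, in percent.
    cluster_util = 100.0 * sum(used_bytes.values()) / sum(capacity_bytes.values())
    flagged = []
    for node in sorted(used_bytes):
        # Node utilization: used space over capacity for this node, in percent.
        node_util = 100.0 * used_bytes[node] / capacity_bytes[node]
        if abs(node_util - cluster_util) > threshold_pct:
            flagged.append(node)
    return flagged


if __name__ == "__main__":
    # Hypothetical figures: three datanodes with equal capacity.
    used = {"dn1": 90, "dn2": 55, "dn3": 40}
    capacity = {"dn1": 100, "dn2": 100, "dn3": 100}
    print(unbalanced_datanodes(used, capacity))
```

With these sample figures the cluster utilization is about 61.7 percent, so dn1 and dn3 fall outside the 10-point band while dn2 stays inside it.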
With this, we come to the end of the Big Data Hadoop Cheat Sheet. Prepare for your interview with our free material on Hadoop Interview Questions. To gain in-depth knowledge, check out our interactive, live-online Intellipaat Big Data Hadoop Certification Training, which comes with 24/7 support to guide you throughout your learning period. Intellipaat’s Big Data certification training course combines training in Hadoop development, Hadoop administration, Hadoop testing, and analytics with Apache Spark. This Cloudera Hadoop & Spark training will prepare you to clear the Cloudera CCA 175 big data certification.