Big Data Hadoop Cheat Sheet
Over the last decade, the amount of data being generated has grown enormously, and organizations have looked for ways to put it to use. Analyzing and learning from this data has opened up many opportunities, which is how Big Data became a buzzword in the IT industry.
A range of technologies and platforms has since emerged for learning from the enormous amounts of data collected from all kinds of sources. That raises the question, “How do we process Big Data?” Apache Hadoop fills this gap and has become one of the most popular open-source software projects.
This Big Data cheat sheet walks through the basics of Hadoop and its important commands. It is meant for new learners as well as for anyone who wants a quick refresher on the key topics of Big Data Hadoop.
Big Data: Big data comprises large datasets that cannot be processed using traditional computing techniques because of their huge volume, high velocity, and wide variety.
Hadoop: Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
Hadoop Common: These are the Java libraries and utilities required by the other Hadoop modules, including the scripts and files needed to start Hadoop
Hadoop YARN: YARN is a framework used for job scheduling and managing cluster resources
Hadoop Distributed File System: HDFS is a Java-based file system that provides scalable, reliable data storage and high-throughput access to application data
Hadoop MapReduce: It is a software framework for writing applications that process large amounts of data in parallel on large clusters
Apache Hive: It is a data warehousing infrastructure for Hadoop
Apache Oozie: It is a Java application responsible for scheduling Hadoop jobs
Apache Pig: It is a data flow platform whose scripts are executed as MapReduce jobs
Apache Spark: It is an open-source framework used for cluster computing
Flume: Flume is an open-source aggregation service responsible for the collection and transport of data from source to destination
HBase: Apache HBase is a column-oriented database for Hadoop that stores big data in a scalable way
Sqoop: Sqoop is an interface application used to transfer data between Hadoop and relational databases through commands
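To make the MapReduce entry above more concrete, the sketch below imitates its three phases (map, shuffle/sort, reduce) with ordinary Unix tools on a single machine. It is only an analogy, assuming the job is a word count. It does not use Hadoop, and the function names map_words, reduce_counts, and word_count are made up for illustration.

```shell
#!/usr/bin/env bash
# Local analogy of a MapReduce word count (no Hadoop involved).

# Map phase: emit one lowercase word per line for every word in the input.
map_words() {
  tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | grep -v '^$'
}

# Reduce phase: the input arrives sorted by key, so equal words are adjacent.
# Count each run of identical words and print "word<TAB>count".
reduce_counts() {
  uniq -c | awk '{ print $2 "\t" $1 }'
}

# Full pipeline: map, then sort (standing in for the shuffle), then reduce.
word_count() {
  map_words | sort | reduce_counts
}

# Example input: two short lines of text.
printf 'Hadoop stores data\nHadoop processes data\n' | word_count
```

In a real cluster, the map and reduce steps run on many nodes at once and the framework performs the sort between them. Here a single pipe plays all three roles.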
Hadoop Ecosystem:
The Hadoop ecosystem consists of the various Apache software components that work alongside Hadoop. Typically, they can be divided into the following categories.
- Top Level Interface
- Top Level Abstraction
- Distributed Data Processing
- Self Healing Clustered Storage System
Hadoop file automation commands:
cat: This command is used to copy the contents of the source paths to the standard output
hdfs dfs -cat URI [URI ...]
chgrp: This command is used to change the group of the files
hdfs dfs -chgrp [-R] GROUP URI [URI ...]
chmod: This command is used to change the permissions of the files
hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
chown: This command is used to change the owner of the files
hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
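cp: This command can be used to copy one or more files from the source to the destination path
hdfs dfs -cp URI [URI ...] <dest>
count: This command is used to count the directories, files, and bytes under the given paths
hdfs dfs -count [-q] <paths>
du: It is used to display the size of directories or files
hdfs dfs -du [-s] [-h] URI [URI ...]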
get: This command can be used to copy files from HDFS to the local file system
hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>
ls: It is used to list the contents of a directory or the details of a file
hdfs dfs -ls <args>
mkdir: This command is used to create one or more directories
hdfs dfs -mkdir [-p] <paths>
mv: It is used to move one or more files from one location to another
hdfs dfs -mv URI [URI ...] <dest>
put: This command is used to copy one or more files from the local file system to the destination file system
hdfs dfs -put <localsrc> ... <dest>
rm: This command is used to delete one or more files; add -r to delete directories recursively
hdfs dfs -rm [-r] [-skipTrash] URI [URI ...]
stat: It is used to display information about a specific path
hdfs dfs -stat URI [URI ...]
help: It is used to display the usage information of a command
hdfs dfs -help <cmd-name>
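As a rough illustration of how the file commands above fit together, the sketch below walks one local file through a typical sequence: create a directory, upload, inspect, copy, rename, restrict permissions, download, and clean up. It assumes a running cluster and an hdfs client on the PATH. The function name hdfs_walkthrough, its two arguments, and the .bak, .old, and .copy suffixes are hypothetical choices for the example.

```shell
#!/usr/bin/env bash
# Walk one local file through common HDFS shell commands.
# Assumption: a reachable cluster and an `hdfs` client on PATH.
# Usage: hdfs_walkthrough <local-file> <hdfs-dir>   (hypothetical helper)
hdfs_walkthrough() {
  local src="$1" dir="$2"
  local name
  name="$(basename "$src")"

  # Create the target directory, including missing parents.
  hdfs dfs -mkdir -p "$dir"
  # Upload the local file into that directory.
  hdfs dfs -put "$src" "$dir/"
  # List the directory, then print the uploaded file to standard output.
  hdfs dfs -ls "$dir"
  hdfs dfs -cat "$dir/$name"
  # Copy the file inside HDFS, then rename the copy.
  hdfs dfs -cp "$dir/$name" "$dir/$name.bak"
  hdfs dfs -mv "$dir/$name.bak" "$dir/$name.old"
  # Owner read/write, group read, no access for others.
  hdfs dfs -chmod 640 "$dir/$name"
  # Report sizes, object counts, and basic file information.
  hdfs dfs -du -h "$dir"
  hdfs dfs -count "$dir"
  hdfs dfs -stat "$dir/$name"
  # Download a copy next to the original local file.
  hdfs dfs -get "$dir/$name" "$src.copy"
  # Delete the renamed copy permanently, bypassing the trash.
  hdfs dfs -rm -skipTrash "$dir/$name.old"
}
```

A call might look like hdfs_walkthrough ./sample.txt /user/alice/demo, where both paths are placeholders. The function does not stop on the first failure, so check the output of each step when trying it against a real cluster.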
Hadoop Administration commands:
The commands below can be used only by Hadoop administrators and are listed with the operations they perform. The mradmin, jobtracker, and tasktracker commands belong to the classic MapReduce (MRv1) runtime of Hadoop 1.x, which YARN replaced in Hadoop 2.
balancer: To run the cluster balancing utility
daemonlog: To get or set the log level of each daemon
dfsadmin: To run HDFS administrative operations
datanode: To run the HDFS DataNode service
mradmin: To run MapReduce administrative operations
jobtracker: To run the MapReduce JobTracker
namenode: To run the NameNode
tasktracker: To run a MapReduce TaskTracker node
secondarynamenode: To run the Secondary NameNode
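A few of these administrative commands can be sketched as small helper functions. This assumes Hadoop 2 or later, where the HDFS subcommands are invoked through the hdfs launcher and daemonlog through the hadoop launcher, and it assumes the caller has administrator rights. The function names cluster_health, rebalance, and daemon_log_level are hypothetical, as are the example host, port, and class name in the usage note.

```shell
#!/usr/bin/env bash
# Hypothetical helpers around a few Hadoop administration commands.
# Assumption: `hdfs` and `hadoop` clients on PATH, run by an administrator.

# Print the capacity and DataNode report, then the current safe mode state.
cluster_health() {
  hdfs dfsadmin -report
  hdfs dfsadmin -safemode get
}

# Run the balancer. The threshold is the allowed deviation, in percent,
# of each DataNode's disk usage from the cluster average (10 if omitted).
rebalance() {
  local threshold="${1:-10}"
  hdfs balancer -threshold "$threshold"
}

# Show the log level of one class in a running daemon.
# $1 = host:port of the daemon's HTTP endpoint, $2 = fully qualified class name.
daemon_log_level() {
  hadoop daemonlog -getlevel "$1" "$2"
}
```

For example, rebalance 5 asks for a tighter balance than the default, and daemon_log_level namenode.example.com:9870 org.apache.hadoop.hdfs.server.namenode.NameNode queries a NameNode, where the host and port are placeholders for your own cluster.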
With this, we come to the end of the Big Data Hadoop Cheat Sheet. To get in-depth knowledge, check out Intellipaat’s interactive, live-online Big Data Hadoop Certification Training, which comes with 24/7 support to guide you throughout your learning period. Intellipaat’s Big Data certification training course is a combination of training courses in Hadoop development, Hadoop administration, Hadoop testing, and analytics with Apache Spark. This Cloudera Hadoop and Spark training will prepare you to clear the Cloudera CCA 175 big data certification.