Big Data Hadoop Cheat Sheet
Over the last decade, the volume of data being generated has grown enormously, and organizations began looking for ways to put it to use. Analyzing and learning from this data has opened up many opportunities, which is how Big Data became a buzzword in the IT industry. A range of technologies and platforms then emerged for learning from the enormous amounts of data collected from all kinds of sources. That raises the question, "How do we process Big Data?" Apache Hadoop filled this gap and has become one of the most popular open-source software projects.
This Big Data cheat sheet will guide you through the basics of Hadoop and its important commands. It is useful for new learners as well as for anyone who wants a quick look at the key topics of Big Data Hadoop.
Further, if you want to see the illustrated version of this topic you can refer to our tutorial blog on Big Data Hadoop.
For a deeper understanding of Big Data Hadoop, consider completing our project-based Data Science Course.
Big Data: Big data comprises large datasets that cannot be processed using traditional computing techniques. It is characterized by huge volume, high velocity, and a wide variety of data.
Hadoop: Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
Hadoop Common: The Java libraries and utilities required by the other Hadoop modules, including the scripts and files needed to start Hadoop.
Hadoop YARN: YARN is a framework used for job scheduling and for managing cluster resources.
Hadoop Distributed File System: HDFS is a Java-based file system that provides scalable, reliable data storage and high-throughput access to application data.
Hadoop MapReduce: A software framework for writing applications that process large amounts of data in parallel on large clusters.
Apache Hive: A data warehousing infrastructure for Hadoop.
Apache Oozie: A Java application responsible for scheduling Hadoop jobs.
Apache Pig: A data flow platform whose scripts are translated into MapReduce jobs for execution.
Apache Spark: An open-source framework used for cluster computing.
Flume: An open-source aggregation service responsible for the collection and transport of data from source to destination.
HBase: Apache HBase is a column-oriented database for Hadoop that stores big data in a scalable way.
Sqoop: Sqoop is a command-line interface application used to transfer data between Hadoop and relational databases.
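The MapReduce entry above describes processing data in parallel through a simple programming model. A minimal sketch of that model, written in plain Python rather than against the Hadoop API, is the classic word count. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration; a real Hadoop job would run the map and reduce steps on different nodes of the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key into one result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines of a file in HDFS.
    sample = ["big data needs big clusters", "Hadoop processes big data"]
    print(word_count(sample))
```

Each map call only looks at one line and each reduce call only looks at one key, which is what lets a framework spread the work across many machines.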
The Hadoop ecosystem is made up of various components of the Apache software stack. Typically, it can be divided into the following categories.
- Top-Level Interface
- Top-Level Abstraction
- Distributed Data Processing
- Self-Healing Clustered Storage System
Hadoop file automation commands:
cat: This command is used to copy the contents of the source paths to the standard output.
chgrp: This command is used to change the group of the files.
chmod: This command is used to change the permissions of the files.
chown: This command is used to change the owner of the files.
cp: This command is used to copy one or more files from the source to the destination path.
du: It is used to display the size of directories or files.
get: This command is used to copy files from HDFS to the local file system.
ls: It is used to list the contents of a directory, or to display the details of a file.
mkdir: This command is used to create one or more directories.
mv: It is used to move one or more files from one location to another.
put: This command is used to copy one or more files from the local file system to the destination file system.
rm: This command is used to delete one or more files.
stat: It is used to display statistics about a specific path.
help: It is used to display the usage information of a command.
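The file commands above are typically invoked as `hadoop fs -<command>`. The sketch below only builds and prints such command lines for a typical upload, inspect, and clean-up sequence, so it runs without a Hadoop installation. The helper `hadoop_fs`, the file name `sales.csv`, and the directory `/user/demo/input` are hypothetical names chosen for illustration.

```python
import shlex


def hadoop_fs(command, *args):
    """Return the argument list for `hadoop fs -<command> <args...>`."""
    return ["hadoop", "fs", f"-{command}", *args]


# Hypothetical paths, used only for illustration.
LOCAL_FILE = "sales.csv"
HDFS_DIR = "/user/demo/input"
HDFS_FILE = f"{HDFS_DIR}/{LOCAL_FILE}"

workflow = [
    hadoop_fs("mkdir", "-p", HDFS_DIR),            # create the target directory
    hadoop_fs("put", LOCAL_FILE, HDFS_DIR),        # upload the local file
    hadoop_fs("ls", HDFS_DIR),                     # list the directory
    hadoop_fs("cat", HDFS_FILE),                   # print the file contents
    hadoop_fs("chmod", "644", HDFS_FILE),          # change the permissions
    hadoop_fs("du", "-h", HDFS_DIR),               # show sizes in readable units
    hadoop_fs("get", HDFS_FILE, "sales_copy.csv"), # download a copy
    hadoop_fs("rm", "-r", HDFS_DIR),               # remove the directory tree
]

if __name__ == "__main__":
    for argv in workflow:
        # shlex.join quotes arguments so the line is safe to paste into a shell.
        print(shlex.join(argv))
```

On a machine where Hadoop is installed and configured, each list could instead be passed to `subprocess.run(argv, check=True)` to execute the command.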
Hadoop Administration commands:
The commands below can be used only by Hadoop administrators. Each is listed with the operation it performs.
balancer: To run the cluster balancing utility
daemonlog: To get or set the log level of each daemon
dfsadmin: To run HDFS administrative operations
datanode: To run the HDFS DataNode service
mradmin: To run MapReduce administrative operations (MapReduce v1, Hadoop 1.x)
jobtracker: To run the MapReduce JobTracker (MapReduce v1, Hadoop 1.x)
namenode: To run the NameNode
tasktracker: To run a MapReduce TaskTracker node (MapReduce v1, Hadoop 1.x)
secondarynamenode: To run the Secondary NameNode
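To give a feel for what the balancer in the list above checks, here is a simplified sketch in plain Python. It assumes the commonly documented rule that a DataNode counts as balanced when its utilization (used space divided by capacity) is within a threshold, in percentage points, of the overall cluster utilization, with 10 as the default threshold. The function `classify_datanodes` and the node names and sizes are hypothetical. The real balancer uses finer-grained categories and also decides which blocks to move.

```python
def classify_datanodes(nodes, threshold=10.0):
    """Classify DataNodes relative to the cluster-wide utilization.

    nodes maps a node name to a (used, capacity) pair in the same unit.
    Returns (cluster_utilization_percent, {name: label}).
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    cluster_util = 100.0 * total_used / total_capacity

    labels = {}
    for name, (used, capacity) in nodes.items():
        node_util = 100.0 * used / capacity
        if node_util > cluster_util + threshold:
            labels[name] = "over-utilized"
        elif node_util < cluster_util - threshold:
            labels[name] = "under-utilized"
        else:
            labels[name] = "balanced"
    return cluster_util, labels


if __name__ == "__main__":
    # Hypothetical cluster: (used, capacity) per node, e.g. in gigabytes.
    cluster = {"dn1": (90, 100), "dn2": (50, 100), "dn3": (10, 100)}
    utilization, labels = classify_datanodes(cluster)
    print(f"cluster utilization: {utilization:.1f}%")
    for name, label in labels.items():
        print(name, label)
```

In this sketch, rebalancing would mean moving data from the over-utilized nodes to the under-utilized ones until every node falls inside the threshold band.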
With this, we come to the end of the Big Data Hadoop Cheat Sheet. To gain in-depth knowledge, check out our interactive, live-online Intellipaat Big Data Hadoop Certification Training, which comes with 24/7 support to guide you throughout your learning period. Intellipaat’s Big Data certification training course combines the training courses for Hadoop developers, Hadoop administrators, Hadoop testing, and analytics with Apache Spark. This Cloudera Hadoop & Spark training will prepare you to clear the Cloudera CCA 175 big data certification.