Hive cheat sheet

Introduction:

All the industries deal with the Big data that is a large amount of data and Hive is a tool that is used for the analysis of this Big Data. Apache Hive is a tool where the data is stored for analysis and querying. This cheat sheet guides you through the basic concepts and commands required to start with it.

You can also download the printable PDF of this Apache Hive cheat sheet

This Apache Hive cheat sheet will guide you to the basics of Hive which will be helpful for beginners and also for those who want to take a quick look at the important topics of Hive
Further, if you want to learn Apache Hive in-depth, you can refer to the tutorial blog on Hive.

Apache Hive: It is a data warehouse infrastructure based on Hadoop framework which is perfectly suitable for data summarization, analysis, and querying. It uses an SQL like language called HQL (Hive Query Language)
HQL: It is a query language used to write the custom map-reduce framework in Hive to perform more sophisticated analysis of the data
Table: The table in the hive is a table that contains logically stored data

Components of Hive:

Meta store: Meta store is where the schemas of the Hive tables are stored, it stores the information about the tables and partitions that are in the warehouse.
SerDe: Serializer, Deserializer which gives instructions to hive on how to process records

Hive interfaces:

Hive interfaces include WEB UI
Hive command line
HD insight (windows server)

Hive Function Meta commands:

Show functions: Lists Hive functions and operators
Describe function [function name]: Displays short description of the particular function
Describe function extended [function name]: Displays extended description of the particular function

Grab high-paying Big Data jobs with these Big Data Engineer Interview Questions!

Types of Hive Functions:

UDF (User-defined Functions): It is a function that fetches one or more columns from a row as arguments and returns a single value
UDTF (User-defined Tabular Functions): This function is used to produce multiple columns or rows of output by taking zero or more inputs
Macros: It is a function that uses other Hive functions

User defined aggregate functions: A user defined function that takes multiple rows or columns and returns the aggregation of the data
User defined table generating functions: A function that takes a column from a single record and splitting it into multiple rows
Indexes: Indexes are created to the speedy access to columns in the database

<br>
Syntax: Create index <INDEX_NAME> on table <TABLE_NAME><br>

Get 100% Hike!

Master Most in Demand Skills Now!

Thrift: A thrift service is used to provide remote access from other processors
Meta store: This is a service that stores the metadata information such as table schemas
Hcatalog: It is a metadata and table management system for the Hadoop platform which enables the storage of data in any format.

Hive SELECT statement syntax using HQL:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...<br>
FROM table_reference<br>
[WHERE where_condition]<br>
[GROUP BY col_list]<br>
[HAVING having_condition]<br>
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]<br>
[LIMIT number];

Select: Select is a projection operator in HiveQL, which scans the table specified by the FROM clause
Where: Where is a condition which specifies what to filter
Group by: It uses the list of columns, which specifies how to aggregate the records
Cluster by, distribute by, Sort by: Specifies the algorithm to sort, distribute and create a cluster, and the order for sorting
Limit: This specifies how many records are to be retrieved

Partitioner: Partitioner controls the partitioning of keys of the intermediate map outputs, typically by a hash function which is the same as the number of reduced tasks for a job
Partitioning: It is used for distributing load horizontally. It is a way of dividing the tables into related parts based on values such as date, city, departments, etc.
Bucketing: It is a technique to decompose the datasets into more manageable parts

Hive commands in HQL:

Data Definition Language (DDL):

It is used to build or modify tables and objects stored in a database
Some of the DDL commands are as follows:

To create a database in Hive: create database<data base name>
To list out the databases created in a Hive warehouse: show databases
To use the database created: USE <data base name>
To describe the associated database in metadata: describe<data base name>
To alter the database created: alter<data base name>

Data Manipulation Language (DML):

These statements are used to retrieve, store, modify, delete, insert and update data in a database

Inserting data in a database: The load function is used to move the data into a particular Hive table.

<br>
LOAD data <LOCAL> inpath <file path> into table [tablename]<br>

Drop table: The drop table statements delete the data and metadata from the table:

<br>
drop table<table name><br>

Aggregation: It is used to count different categories from the table :

<br>
Select count (DISTINCT category) from tablename;<br>

Grouping: Group command is used to group the result set, where the result of one table is stored in the other:

<br>
Select <category>, sum( amount) from <txt records> group by <category><br>

To exit from the Hive shell: Use the command quit

Hive Architecture

Hive data types:

Integral data types:
- Tinyint
- Smallint
- Int
- Bigint
String types:
- VARCHAR-Length(1 to 65355)
- CHAR-Length(255)
Timestamp: It supports the traditional Unix timestamp with optional nanosecond precision
- Dates
- Decimals

Union type: It is a collection of heterogeneous data types.

<br>
Syntax: UNIONTYPE<int, double, array<string>, struct<a:int,b:string>><br>

Complex types:

Arrays: Syntax-ARRAY<data_type>

Maps: Syntax- MAP<primitive_type, data_type>

Structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>

Operations that can be performed on Hive:

Function	SQL Query
To retrieve information	SELECT from_columns FROM table WHERE conditions;
To select all values	SELECT * FROM table;
To select a particular category values	SELECT * FROM table WHERE rec_name = “value”;
To select for multiple criteria	SELECT * FROM TABLE WHERE rec1 = “value1” AND rec2 = “value2”;
For selecting specific columns	SELECT column_name FROM table;
To retrieve unique output records	SELECT DISTINCT column_name FROM table;
For sorting	SELECT col1, col2 FROM table ORDER BY col2;
For sorting backward	SELECT col1, col2 FROM table ORDER BY col2 DESC;
For counting rows from the table	SELECT COUNT(*) FROM table;
For grouping along with counting	SELECT owner, COUNT(*) FROM table GROUP BY owner;
For selecting maximum values	SELECT owner, COUNT(*) FROM table GROUP BY owner;
Selecting from multiple tables and joining	SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);

Watch this Hive Course For Beginners Tutorial

Metadata functions and queries used for operations:

Function	Hive query or commands
Selecting a database	USE database;
Listing databases	SHOW DATABASES;
listing table in a database	SHOW TABLES;
Describing format of a table	DESCRIBE (FORMATTED\|EXTENDED) table;
Creating a database	CREATE DATABASE db_name;
Dropping a database	DROP DATABASE db_name (CASCADE);

Command Line statements:

Function	Hive commands
To run the query	hive -e ‘select a.col from tab1 a’
To run a query in a silent mode	hive -S -e ‘select a.col from tab1 a’
To select hive configuration variables	hive -e ‘select a.col from tab1 a’ -hiveconf hive.root.logger=DEBUG,console
To use the initialization script	hive -i initialize.sql
To run the non-interactive script	hive -f script.sql
To run a script inside the shell	source file_name
To run the list command	dfs –ls /user
To run ls (bash command) from the shell	!ls
To set configuration variables	set mapred.reduce.tasks=32
Tab auto completion	set hive.<TAB>
To display all variables starting with hive	set
To revert all variables	reset
To add jar files to distributed cache	add jar jar_path
To display all the jars in the distributed cache	list jars
To delete jars from the distributed cache	delete jar jar_name

With this, we come to the end of the Apache Hive Cheatsheet. To get in-depth knowledge, check out our interactive, live-online Big Data Hadoop certification Training here, which comes with 24*7 support to guide you throughout your learning period. This training course helps you understand the Hadoop Hive, detailed architecture of Hive, comparing Hive with Pig and RDBMS, working with Hive Query Language, creation of database, etc.

Discover Big Data and Hadoop’s full potential with our comprehensive collection of cheat sheets, covering everything from fundamental concepts to advanced techniques in one convenient guide!