Introduction:
All the industries deal with the Big data that is a large amount of data and Hive is a tool that is used for the analysis of this Big Data. Apache Hive is a tool where the data is stored for analysis and querying. This cheat sheet guides you through the basic concepts and commands required to start with it.
You can also download the printable PDF of this Apache Hive cheat sheet
This Apache Hive cheat sheet will guide you to the basics of Hive which will be helpful for beginners and also for those who want to take a quick look at the important topics of Hive
Further, if you want to learn Apache Hive in-depth, you can refer to the tutorial blog on Hive.
Apache Hive: It is a data warehouse infrastructure based on Hadoop framework which is perfectly suitable for data summarization, analysis, and querying. It uses an SQL like language called HQL (Hive Query Language)
HQL: It is a query language used to write the custom map-reduce framework in Hive to perform more sophisticated analysis of the data
Table: The table in the hive is a table that contains logically stored data
Components of Hive:
- Meta store: Meta store is where the schemas of the Hive tables are stored, it stores the information about the tables and partitions that are in the warehouse.
- SerDe: Serializer, Deserializer which gives instructions to hive on how to process records
Hive interfaces:
- Hive interfaces include WEB UI
- Hive command line
- HD insight (windows server)
Hive Function Meta commands:
- Show functions: Lists Hive functions and operators
- Describe function [function name]: Displays short description of the particular function
- Describe function extended [function name]: Displays extended description of the particular function
Grab high-paying Big Data jobs with these Big Data Engineer Interview Questions!
Types of Hive Functions:
- UDF (User-defined Functions): It is a function that fetches one or more columns from a row as arguments and returns a single value
- UDTF (User-defined Tabular Functions): This function is used to produce multiple columns or rows of output by taking zero or more inputs
- Macros: It is a function that uses other Hive functions
User defined aggregate functions: A user defined function that takes multiple rows or columns and returns the aggregation of the data
User defined table generating functions: A function that takes a column from a single record and splitting it into multiple rows
Indexes: Indexes are created to the speedy access to columns in the database
Syntax: Create index <INDEX_NAME> on table <TABLE_NAME>
Get 100% Hike!
Master Most in Demand Skills Now!
Thrift: A thrift service is used to provide remote access from other processors
Meta store: This is a service that stores the metadata information such as table schemas
Hcatalog: It is a metadata and table management system for the Hadoop platform which enables the storage of data in any format.
Hive SELECT statement syntax using HQL:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
- Select: Select is a projection operator in HiveQL, which scans the table specified by the FROM clause
- Where: Where is a condition which specifies what to filter
- Group by: It uses the list of columns, which specifies how to aggregate the records
- Cluster by, distribute by, Sort by: Specifies the algorithm to sort, distribute and create a cluster, and the order for sorting
- Limit: This specifies how many records are to be retrieved
Partitioner: Partitioner controls the partitioning of keys of the intermediate map outputs, typically by a hash function which is the same as the number of reduced tasks for a job
Partitioning: It is used for distributing load horizontally. It is a way of dividing the tables into related parts based on values such as date, city, departments, etc.
Bucketing: It is a technique to decompose the datasets into more manageable parts
Hive commands in HQL:
Data Definition Language (DDL):
It is used to build or modify tables and objects stored in a database
Some of the DDL commands are as follows:
- To create a database in Hive: create database<data base name>
- To list out the databases created in a Hive warehouse: show databases
- To use the database created: USE <data base name>
- To describe the associated database in metadata: describe<data base name>
- To alter the database created: alter<data base name>
Data Manipulation Language (DML):
These statements are used to retrieve, store, modify, delete, insert and update data in a database
- Inserting data in a database: The load function is used to move the data into a particular Hive table.
LOAD data <LOCAL> inpath <file path> into table [tablename]
- Drop table: The drop table statements delete the data and metadata from the table:
drop table<table name>
- Aggregation: It is used to count different categories from the table :
Select count (DISTINCT category) from tablename;
- Grouping: Group command is used to group the result set, where the result of one table is stored in the other:
Select <category>, sum( amount) from <txt records> group by <category>
- To exit from the Hive shell: Use the command quit
Hive Architecture
Hive data types:
- Integral data types:
- Tinyint
- Smallint
- Int
- Bigint
- String types:
- VARCHAR-Length(1 to 65355)
- CHAR-Length(255)
- Timestamp: It supports the traditional Unix timestamp with optional nanosecond precision
- Union type: It is a collection of heterogeneous data types.
Syntax: UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
- Complex types:
Arrays: Syntax-ARRAY<data_type>
Maps: Syntax- MAP<primitive_type, data_type>
Structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
Operations that can be performed on Hive:
Function |
SQL Query |
To retrieve information |
SELECT from_columns FROM table WHERE conditions; |
To select all values |
SELECT * FROM table; |
To select a particular category values |
SELECT * FROM table WHERE rec_name = “value”; |
To select for multiple criteria |
SELECT * FROM TABLE WHERE rec1 = “value1” AND rec2 = “value2”; |
For selecting specific columns |
SELECT column_name FROM table; |
To retrieve unique output records |
SELECT DISTINCT column_name FROM table; |
For sorting |
SELECT col1, col2 FROM table ORDER BY col2; |
For sorting backward |
SELECT col1, col2 FROM table ORDER BY col2 DESC; |
For counting rows from the table |
SELECT COUNT(*) FROM table; |
For grouping along with counting |
SELECT owner, COUNT(*) FROM table GROUP BY owner; |
For selecting maximum values |
SELECT owner, COUNT(*) FROM table GROUP BY owner; |
Selecting from multiple tables and joining |
SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name); |
Watch this Hive Course For Beginners Tutorial
Metadata functions and queries used for operations:
Function |
Hive query or commands |
Selecting a database |
USE database; |
Listing databases |
SHOW DATABASES; |
listing table in a database |
SHOW TABLES; |
Describing format of a table |
DESCRIBE (FORMATTED|EXTENDED) table; |
Creating a database |
CREATE DATABASE db_name; |
Dropping a database |
DROP DATABASE db_name (CASCADE); |
Command Line statements:
Function |
Hive commands |
To run the query |
hive -e ‘select a.col from tab1 a’ |
To run a query in a silent mode |
hive -S -e ‘select a.col from tab1 a’ |
To select hive configuration variables |
hive -e ‘select a.col from tab1 a’ -hiveconf hive.root.logger=DEBUG,console |
To use the initialization script |
hive -i initialize.sql |
To run the non-interactive script |
hive -f script.sql |
To run a script inside the shell |
source file_name |
To run the list command |
dfs –ls /user |
To run ls (bash command) from the shell |
!ls |
To set configuration variables |
set mapred.reduce.tasks=32 |
Tab auto completion |
set hive.<TAB> |
To display all variables starting with hive |
set |
To revert all variables |
reset |
To add jar files to distributed cache |
add jar jar_path |
To display all the jars in the distributed cache |
list jars |
To delete jars from the distributed cache |
delete jar jar_name |
With this, we come to the end of the Apache Hive Cheatsheet. To get in-depth knowledge, check out our interactive, live-online Big Data Hadoop certification Training here, which comes with 24*7 support to guide you throughout your learning period. This training course helps you understand the Hadoop Hive, detailed architecture of Hive, comparing Hive with Pig and RDBMS, working with Hive Query Language, creation of database, etc.