Hive cheat sheet

Introduction: 

All the industries deal with the Big data that is a large amount of data and Hive is a tool that is used for the analysis of this Big Data. Apache Hive is a tool where the data is stored for analysis and querying. This cheat sheet guides you through the basic concepts and commands required to start with it.

You can also download the printable PDF of this Apache Hive cheat sheet

Hive Cheat Sheet

 

This Apache Hive cheat sheet will guide you to the basics of Hive which will be helpful for beginners and also for those who want to take a quick look at the important topics of Hive
Further, if you want to learn Apache Hive in-depth, you can refer to the tutorial blog on Hive.

 

Hive cheat sheet

 

Apache Hive: It is a data warehouse infrastructure based on Hadoop framework which is perfectly suitable for data summarization, analysis, and querying. It uses an SQL like language called HQL (Hive Query Language)
HQL: It is a query language used to write the custom map-reduce framework in Hive to perform more sophisticated analysis of the data
Table: The table in the hive is a table that contains logically stored data

 

Components of Hive:

  • Meta store: Meta store is where the schemas of the Hive tables are stored, it stores the information about the tables and partitions that are in the warehouse.
  • SerDe: Serializer, Deserializer which gives instructions to hive on how to process records

 

Hive interfaces:

  • Hive interfaces include WEB UI
  • Hive command line
  • HD insight (windows server)

 

Hive Function Meta commands:

  • Show functions: Lists Hive functions and operators
  • Describe function [function name]: Displays short description of the particular function
  • Describe function extended [function name]: Displays extended description of the particular function

 

Grab high-paying Big Data jobs with these Big Data Engineer Interview Questions!

Types of Hive Functions:

  • UDF (User-defined Functions): It is a function that fetches one or more columns from a row as arguments and returns a single value
  • UDTF (User-defined Tabular Functions): This function is used to produce multiple columns or rows of output by taking zero or more inputs
  • Macros: It is a function that uses other Hive functions

 

User defined aggregate functions: A user defined function that takes multiple rows or columns and returns the aggregation of the data
User defined table generating functions: A function that takes a column from a single record and splitting it into multiple rows
Indexes: Indexes are created to the speedy access to columns in the database

Syntax: Create index <INDEX_NAME> on table <TABLE_NAME>

Get 100% Hike!

Master Most in Demand Skills Now!


Thrift: A thrift service is used to provide remote access from other processors
Meta store: This is a service that stores the metadata information such as table schemas
Hcatalog: It is a metadata and table management system for the Hadoop platform which enables the storage of data in any format.

Hive SELECT statement syntax using HQL:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
  • Select: Select is a projection operator in HiveQL, which scans the table specified by the FROM clause
  • Where: Where is a condition which specifies what to filter
  • Group by: It uses the list of columns, which specifies how to aggregate the records
  • Cluster by, distribute by, Sort by: Specifies the algorithm to sort, distribute and create a cluster, and the order for sorting
  • Limit: This specifies how many records are to be retrieved

Partitioner: Partitioner controls the partitioning of keys of the intermediate map outputs, typically by a hash function which is the same as the number of reduced tasks for a job
Partitioning: It is used for distributing load horizontally. It is a way of dividing the tables into related parts based on values such as date, city, departments, etc.
Bucketing: It is a technique to decompose the datasets into more manageable parts

 

Hive commands in HQL:

Data Definition Language (DDL):

It is used to build or modify tables and objects stored in a database
Some of the DDL commands are as follows:

  • To create a database in Hive: create database<data base name>
  • To list out the databases created in a Hive warehouse: show databases
  • To use the database created: USE <data base name>
  • To describe the associated database in metadata: describe<data base name>
  • To alter the database created: alter<data base name>

 

Data Manipulation Language (DML):

These statements are used to retrieve, store, modify, delete, insert and update data in a database

 

  • Inserting data in a database: The load function is used to move the data into a particular Hive table.
LOAD data <LOCAL> inpath <file path> into table [tablename]
  • Drop table: The drop table statements delete the data and metadata from the table:
drop table<table name>
  • Aggregation: It is used to count different categories from the table :
Select count (DISTINCT category) from tablename;
  • Grouping: Group command is used to group the result set, where the result of one table is stored in the other:
Select <category>, sum( amount) from <txt records> group by <category>
  • To exit from the Hive shell: Use the command quit

 

Hive Architecture

Hive Architecture

Hive data types:

  • Integral data types:
    • Tinyint
    • Smallint
    • Int
    • Bigint
  • String types:
    • VARCHAR-Length(1 to 65355)
    • CHAR-Length(255)
  • Timestamp: It supports the traditional Unix timestamp with optional nanosecond precision
    • Dates
    • Decimals
  • Union type: It is a collection of heterogeneous data types.
    Syntax: UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
    
  • Complex types:
    Arrays: Syntax-ARRAY<data_type>
    Maps: Syntax- MAP<primitive_type, data_type>
    Structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>

Become a Big Data Architect

 

Operations that can be performed on Hive:

Function SQL Query
To retrieve information SELECT from_columns FROM table WHERE conditions;
To select all values SELECT * FROM table;
To select a particular category values SELECT * FROM table WHERE rec_name = “value”;
To select for multiple criteria SELECT * FROM TABLE WHERE rec1 = “value1” AND rec2 = “value2”;
For selecting specific columns SELECT column_name FROM table;
To retrieve unique output records SELECT DISTINCT column_name FROM table;
For sorting SELECT col1, col2 FROM table ORDER BY col2;
For sorting backward SELECT col1, col2 FROM table ORDER BY col2 DESC;
For counting rows from the table SELECT COUNT(*) FROM table;
For grouping along with counting SELECT owner, COUNT(*) FROM table GROUP BY owner;
For selecting maximum values SELECT owner, COUNT(*) FROM table GROUP BY owner;
Selecting from multiple tables and joining SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);

 

Watch this Hive Course For Beginners Tutorial

Video Thumbnail

 

Metadata functions and queries used for operations:

Function Hive query or commands
Selecting a database USE database;
Listing databases SHOW DATABASES;
listing table in a database SHOW TABLES;
Describing format of a table DESCRIBE (FORMATTED|EXTENDED) table;
Creating a database CREATE DATABASE db_name;
Dropping a database DROP DATABASE db_name (CASCADE);

 

Command Line statements:

Function Hive commands
To run the query hive -e ‘select a.col from tab1 a’
To run a query in a silent mode hive -S -e ‘select a.col from tab1 a’
To select hive configuration variables hive -e ‘select a.col from tab1 a’ -hiveconf hive.root.logger=DEBUG,console
To use the initialization script hive -i initialize.sql
To run the non-interactive script hive -f script.sql
To run a script inside the shell source file_name
To run the list command dfs –ls /user
To run ls (bash command) from the shell !ls
To set configuration variables set mapred.reduce.tasks=32
Tab auto completion set hive.<TAB>
To display all variables starting with hive set
To revert all variables reset
To add jar files to distributed cache add jar jar_path
To display all the jars in the distributed cache list jars
To delete jars from the distributed cache delete jar jar_name

 

With this, we come to the end of the Apache Hive Cheatsheet. To get in-depth knowledge, check out our interactive, live-online Big Data Hadoop certification Training here, which comes with 24*7 support to guide you throughout your learning period. This training course helps you understand the Hadoop Hive, detailed architecture of Hive, comparing Hive with Pig and RDBMS, working with Hive Query Language, creation of database, etc.

Our Big Data Courses Duration and Fees

Program Name
Start Date
Fees
Cohort starts on 11th Jan 2025
₹22,743
Cohort starts on 1st Feb 2025
₹22,743
Cohort starts on 25th Jan 2025
₹22,743

About the Author

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has 4+ years of experience in the Big data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.