Hive partitioning is a technique for organizing Hive tables efficiently. It divides a table into parts based on the values of one or more partition key columns. The partition key determines where each row is stored, so queries that filter on that key can skip the irrelevant parts of the table.
For example, suppose you have a College Management System holding the data of all the colleges in a country in one table, and you want to view each college's data separately.
Here, we can take the college column as the partition key and partition the college management table on it. The number of partitions will equal the number of distinct colleges, and each college's data can then be viewed as a separate partition of the table.
Creation of the table allcolleges:
> create table allcolleges(college string, section string)
> row format delimited
> fields terminated by ',';
Loading data into allcolleges :
> load data local inpath '/home/hduser/Desktop/allcolleges.csv' into table allcolleges;
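For reference, the input file should contain one row per line, with the fields in the declared column order (college, then section). A small allcolleges.csv might look like this (the college names below are made-up examples):
IIT Delhi,A
IIT Delhi,B
NIT Trichy,A
IIT Bombay,C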
Creating a partitioning table:
create table college_part(section string) PARTITIONED BY(college string);
For dynamic partitioning, we have to set these properties:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Making partition based on college field :
insert overwrite table college_part PARTITION (college) SELECT section,college from allcolleges;
Now the partitioned table is created with "college" as the partition key, and you will get one partition per college.
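To check the result, you can list the partitions and query a single one (the college name below is just an example value from your data):
> show partitions college_part;
> select section from college_part where college='IIT Delhi';
Because the table is partitioned on college, the second query only reads the matching partition's directory instead of scanning the whole table.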
Bucketing is a technique where tables or partitions are further subdivided into buckets, giving the data a better layout and making querying more efficient.
Suppose a table uses employee_id as the top-level partition and salary as the second-level partition; this creates a huge number of small partitions. Instead, if we bucket the employee table and use salary as the bucketing column, the value of this column is hashed into a user-defined number of buckets, and all records with the same salary are stored in the same bucket.
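The scheme described above can be sketched in HiveQL roughly as follows (the table name, columns, and bucket count are illustrative, not part of the example tables used below):
> create table emp_bucketed(employee_id int, name string, salary int)
> clustered by (salary) into 32 buckets;
Rows are assigned to a bucket by hashing the salary value modulo 32, so each bucket file holds a predictable slice of the data.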
For example, take an already existing table in your Hive(employees table).
And enable bucketing using the command:
set hive.enforce.bucketing=true;
Now, create a sample bucketed table:
> create table sample_bucket(name string, job_id int, salary int, state string)
> clustered by (state) into 4 buckets
> row format delimited
> fields terminated by ',';
Loading data into sample_bucket from the employees table that already exists in your Hive warehouse:
>from employees
>insert overwrite table sample_bucket
>select name,job_id,salary,state;
The data is now distributed across the buckets.
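Bucketing also enables efficient sampling: instead of scanning the whole table, you can read a single bucket with TABLESAMPLE:
> select * from sample_bucket tablesample(bucket 1 out of 4 on state);
This reads only one of the 4 bucket files, which is one of the main practical benefits of bucketing.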
To display the 4 bucket files that were created, list the table's directory in the Hive warehouse (by default under /user/hive/warehouse) with the following command:
>/Hadoop_YARN/hadoop-2.9.2/bin$ ./hadoop fs -ls /user/hive/warehouse/sample_bucket