+10 votes
in Big Data Hadoop & Spark by (1.5k points)

In Hive, partitioning and bucketing a table are both done on a column. But how exactly are they different?

2 Answers

+12 votes
by (13.2k points)

Partitioning

Partitioning divides a large amount of data into multiple slices based on the values of one or more table columns. This lets us separate the data according to our needs. For example, suppose we have data for all the employees of a company with a huge workforce, but we need to survey only the employees who belong to a particular category. Without partitioning, we would have to scan through all the entries to find them; if we partition the table by category, Hive only has to read the matching partition, which makes the survey very simple.
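For illustration, a partitioned version of such an employee table might look like this (table and column names here are hypothetical, not from the question):

```sql
-- Hypothetical employee table partitioned by category.
-- Each distinct category value becomes its own directory in HDFS,
-- so a query filtering on category reads only that directory.
CREATE TABLE employees_part (
  emp_id  INT,
  name    STRING,
  salary  INT
)
PARTITIONED BY (category STRING);
```

A query such as `SELECT * FROM employees_part WHERE category = 'engineering';` then scans only the `category=engineering` partition instead of the whole table.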

Bucketing

Bucketing puts data into a fixed number of more manageable, roughly equal parts. With partitioning, we might end up with many small partitions, one per distinct column value. With bucketing, we restrict the data to a number of buckets that is defined up front in the table definition.
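As a sketch (again with hypothetical names), a bucketed table fixes the bucket count in the DDL:

```sql
-- Hypothetical employee table bucketed on salary into 4 buckets.
-- Rows are assigned to a bucket by hashing the salary value,
-- so the table is stored as 4 files of roughly equal size.
CREATE TABLE employees_buck (
  emp_id  INT,
  name    STRING,
  salary  INT
)
CLUSTERED BY (salary) INTO 4 BUCKETS;
```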

Difference and Conclusion

When a field in our data has high cardinality (a large number of possible values), care should be taken not to partition on it: partitioning on a field with a large number of values can leave us with far too many directories in the file system. What bucketing does differently is that the number of files is fixed. Since you specify the number of buckets, Hive takes the field, calculates a hash of its value, and assigns the row to a bucket based on that hash. Both techniques can use multiple fields: we can partition on several columns (category, country of employee, etc.), and a table can likewise be clustered by more than one column, although bucketing on a single field is the common case.

So, bucketing is useful when the field has high cardinality and the data is spread (approximately) evenly among all buckets. Partitioning works best when the cardinality of the partitioning field is not too high and the partitions can be queried quickly afterwards.
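Conceptually, the bucket a row lands in can be previewed with Hive's `hash()` and `pmod()` functions (a sketch against the hypothetical table above; Hive's internal bucketing hash for complex types may differ):

```sql
-- For an integer column, bucket id = hash(value) mod number_of_buckets.
-- pmod() keeps the result non-negative even for negative hashes.
SELECT salary,
       pmod(hash(salary), 4) AS bucket_id
FROM   employees_buck;
```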

0 votes
by (31.4k points)

Hive partitioning is a technique to organize Hive tables in an efficient manner. It divides a table into related parts based on the values of one or more partition keys, and those keys determine how the data is stored.

For example, suppose you have a College Management System covering all the colleges of a country, where every college's data is stored together as a whole, and we want to view each college's data separately.

Here, we can take the college column as the partition key and partition the college management table on it. The number of partitions we get will be equal to the number of colleges, and each college's data can then be viewed separately in its own partition.

Creation of table allcolleges:

> create table allcolleges(college string, section string)
> row format delimited
> fields terminated by ',';

Loading data into allcolleges :

> load data local inpath '/home/hduser/Desktop/allcolleges.csv' into table allcolleges;

Creating a partitioning table:

create table college_part(section string) PARTITIONED BY(college string);

For dynamic partitioning, we have to set these properties:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Making partition based on college field :

insert overwrite table college_part PARTITION (college) SELECT section,college from allcolleges;

Now the partitioned table is created with "college" as the partition key, and you will get one partition (and one directory) per college.
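Querying a single college then touches only its partition, for example (the college name here is hypothetical):

```sql
-- Hive prunes all other partitions; only the directory
-- for college=IIT is scanned.
SELECT section
FROM   college_part
WHERE  college = 'IIT';
```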


Bucketing is a technique where the tables or partitions are further sub-categorized into buckets for better structure of data and efficient querying.

Suppose a table used employee_id as the top-level partition and salary as the second-level partition; that would create many small partitions. Instead, if we bucket the employee table and use salary as the bucketing column, the values of this column will be hashed into a user-defined number of buckets, and all records with the same salary will be stored in the same bucket.

For example, take an already existing table in your Hive(employees table).

And enable bucketing using the command:

set hive.enforce.bucketing=true;

Now, create a sample bucketed table:

> create table sample_bucket(name string, job_id int, salary int, state string)
> clustered by (state) into 4 buckets
> row format delimited
> fields terminated by ',';


 

Loading data into sample_bucket from the employees table (already existing in your Hive):

> from employees
> insert overwrite table sample_bucket
> select name, job_id, salary, state;

Your data is fetched into the buckets now.
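Once the table is bucketed, Hive can read individual buckets efficiently, for example with TABLESAMPLE (a sketch against the table created above):

```sql
-- Read only bucket 1 of the 4 buckets clustered on state.
SELECT *
FROM   sample_bucket
TABLESAMPLE (BUCKET 1 OUT OF 4 ON state);
```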

To display the 4 bucket files that were created in the first step, run the following command:

>/Hadoop_YARN/hadoop-2.9.2/bin$ ./hadoop fs -ls /usr/hive/

