Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

Is there a way to keep the duplicates in a collected set in Hive, or simulate the sort of aggregate collection that Hive provides using some other method? I want to aggregate all of the items in a column that have the same key into an array, with duplicates.

I.E.:

hash_id | num_of_cats

=====================

ad3jkfk            4

ad3jkfk            4

ad3jkfk            2

fkjh43f            1

fkjh43f            8

fkjh43f            8

rjkhd93            7

rjkhd93            4

rjkhd93            7

should return:

hash_agg | cats_aggregate

===========================

ad3jkfk   Array<int>(4,4,2)

fkjh43f   Array<int>(1,8,8)

rjkhd93   Array<int>(7,4,7)

1 Answer

0 votes
by (32.3k points)
edited by

After the release of Hive 13.0, collect_list(col) built-in aggregate function is supported in Hive, It returns the list of objects with duplicates. You should use it here:

SELECT

    hash_id, collect_list(num_of_cats) AS aggr_set

FROM

    <tablename>

WHERE

    <condition>

GROUP BY

    hash_id

;

If you are having any doubt regarding Hive, then you can refer the following video tutorial regarding the same:

Browse Categories

...