Top Answers to Hive Interview Questions

1. Differentiate between Pig and Hive.
Criteria               | Pig                            | Hive
Architecture           | Procedural data flow language  | SQL-type declarative language
Application            | Programming purposes           | Report creation
Operational field      | Client side                    | Server side
Support for Avro files | Yes                            | No
2. What is the definition of Hive? What is the present version of Hive? Explain ACID transactions in Hive.

Hive is an open-source data warehouse system used for analyzing and querying large data sets stored in Hadoop files. Its query language is similar to SQL. The present version of Hive is 0.13.1. Hive supports ACID (Atomicity, Consistency, Isolation, and Durability) transactions, which are provided at the row level. Hive supports the following operations for ACID transactions (a short sketch follows the list):

  • Insert
  • Delete
  • Update
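
Below is a minimal sketch of how these operations look on a transactional table, assuming a Hive release with full ACID support and the transaction manager enabled in hive-site.xml; the table name and values are illustrative:

CREATE TABLE employee_txn (id INT, name STRING, salary INT)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO TABLE employee_txn VALUES (1, 'John', 40000);   -- insert
UPDATE employee_txn SET salary = 45000 WHERE id = 1;        -- update
DELETE FROM employee_txn WHERE id = 1;                      -- delete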

Want to learn more about Hive? Go through this insightful blog ‘What is Hive?’

3. What is a Hive variable? What do we use it for?

A Hive variable is created in the Hive environment and is referenced by Hive scripts. It lets us pass values into a Hive query at the time the query starts executing, and such scripts are typically run with the source command.
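
A minimal sketch of setting and substituting a variable; the variable, table, and column names are illustrative:

hive> SET hivevar:emp_dept=HR;
hive> SELECT * FROM employee WHERE department = '${hivevar:emp_dept}';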


4. What kind of data warehouse application is suitable for Hive? What are the types of tables in Hive?

Hive is not considered a full database. The design rules and regulations of Hadoop and HDFS put restrictions on what Hive can do. However, Hive is most suitable for data warehouse applications because it:

  • Analyzes relatively static data
  • Does not require fast response times
  • Does not make rapid changes in the data

Hive doesn’t provide fundamental features required for Online Transaction Processing (OLTP).

Hive is suitable for data warehouse applications on large data sets. There are two types of tables in Hive (see the sketch after the list):

  • Managed table
  • External table
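
Here is a minimal sketch of creating each table type; the table names and the location are illustrative. Dropping a managed table deletes its data, whereas dropping an external table leaves the data in place:

CREATE TABLE managed_emp (id INT, name STRING);

CREATE EXTERNAL TABLE external_emp (id INT, name STRING)
LOCATION '/user/data/external_emp';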

Get a better understanding of Hive by going through this Hive Tutorial now!

5. Can we change settings within a Hive session? If yes, how?

Yes, we can change the settings within a Hive session using the SET command. It helps change Hive job settings for a specific query. For example, the following command ensures that buckets are populated according to the table definition:

hive> SET hive.enforce.bucketing=true;

We can see the current value of any property by using SET with the property name; SET by itself lists all the properties along with the values set by Hive.

hive> SET hive.enforce.bucketing;

hive.enforce.bucketing=true

This list does not include the Hadoop defaults, so we should use the command below:

SET -v

It will list all the properties including the Hadoop defaults in the system.

Interested in learning Hive? Well, we have a comprehensive master’s Big Data Hadoop Course to give you all concepts in Hive!

6. Is it possible to add 100 nodes when we already have 100 nodes in Hive? If yes, how?

Yes, we can add nodes by following the steps below:

  1. Take a new system; create a new username and password
  2. Install SSH and set up SSH connections with the master node
  3. Add the SSH public key (id_rsa.pub) to the authorized_keys file
  4. Add the new DataNode's host name, IP address, and other details to /etc/hosts and to the slaves file:
    192.168.1.102 slave3.in slave3
  5. Start the DataNode on the new node
  6. Log in to the new node, e.g., su hadoop or:
    ssh -X hadoop@192.168.1.103
  7. Start HDFS on the newly added slave node by using the following command:
    ./bin/hadoop-daemon.sh start datanode
  8. Check the output of the jps command on the new node

Go through this Hadoop Training in London to get a clear understanding of Hadoop!

7. Explain the concatenation function in Hive with an example.

The CONCAT function joins the input strings. We can specify any number of strings, separated by commas.
Example:

CONCAT('Intellipaat','-','is','-','a','-','eLearning','-','provider');

Output:

Intellipaat-is-a-eLearning-provider

Here, the separator '-' has to be specified between every pair of strings. If the separator is the same for every string, Hive provides another function:

CONCAT_WS

In this case, we specify the separator as the first argument, as follows:

CONCAT_WS('-','Intellipaat','is','a','eLearning','provider');

Output:

Intellipaat-is-a-eLearning-provider

8. Explain the Trim and Reverse functions in Hive with examples.

The TRIM function removes the leading and trailing spaces from a string.
Example:

TRIM(' INTELLIPAAT ');

Output:

INTELLIPAAT

To remove the leading space:

LTRIM(' INTELLIPAAT');

To remove the trailing space:

RTRIM('INTELLIPAAT ');

The REVERSE function reverses the characters in a string.
Example:

REVERSE('INTELLIPAAT');

Output:

TAAPILLETNI
9. How to change the column data type in Hive? Explain RLIKE in Hive.

We can change the column data type by using ALTER and CHANGE as follows:

ALTER TABLE table_name CHANGE column_name column_name new_datatype;

For example, to change the data type of the salary column from INT to BIGINT in the employee table:

ALTER TABLE employee CHANGE salary salary BIGINT;

RLIKE: It stands for 'regexp like' and is a special operator in Hive that matches a string against a Java regular expression. If any substring of A matches the pattern B, the expression evaluates to true.
Example:

'Intellipaat' RLIKE 'tell'  → true
'Intellipaat' RLIKE '^I.*'  → true (this is a regular expression)

Learn more about Apache Hive from this detailed blog post now!

10. What are the components used in Hive Query Processor?

The components of a Hive Query Processor include:

  • Logical Plan Generation
  • Physical Plan Generation
  • Execution Engine
  • Operators
  • UDFs and UDAFs
  • Optimizer
  • Parser
  • Semantic Analyzer
  • Type Checking
11. What are Buckets in Hive?

The data present in a table or partition is divided into different buckets. This division is done on the basis of the hash of a particular table column.
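
A minimal sketch of a bucketed table; the table and column names are illustrative, and the bucketing column and bucket count are chosen only for the example:

CREATE TABLE orders_bucketed (order_id INT, customer STRING)
CLUSTERED BY (order_id) INTO 8 BUCKETS
STORED AS ORC;

SET hive.enforce.bucketing=true;   -- needed on older Hive versions before inserting
INSERT INTO TABLE orders_bucketed SELECT order_id, customer FROM orders;   -- orders is an assumed source table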

12. Explain the process to access subdirectories recursively in Hive queries.

By using the below commands, we can access subdirectories recursively in Hive:

hive> SET mapred.input.dir.recursive=true;
hive> SET hive.mapred.supports.subdirectories=true;

Hive tables can then point to a higher-level directory, which is suitable for a directory structure like:

/data/country/state/city/
CTA

13. How to skip header rows from a table in Hive?

Suppose the log files contain header records such as:

System=….
Version=…
Sub-version=….

We do not want to include these three header lines in our Hive query results. To skip header lines from our tables in Hive, we set a table property that tells Hive how many header lines to ignore:

CREATE EXTERNAL TABLE employee (
  name STRING,
  job STRING,
  dob STRING,
  id INT,
  salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/user/data'
TBLPROPERTIES ("skip.header.line.count"="3");
14. What is the maximum size of a string data type supported by Hive? Explain the binary file formats supported by Hive.

The maximum size of a string data type supported by Hive is 2 GB.
Hive supports the text file format by default, and it also supports the binary formats: Sequence files, ORC files, Avro data files, and Parquet files.

Sequence file: A splittable, compressible, and row-oriented file with a general binary format.

ORC file: The Optimized Row Columnar (ORC) format is a record-columnar, column-oriented storage file. It divides the table into row splits (stripes), and within each split the values of each column are stored together, column by column.

Avro data file: Like a sequence file, it is splittable, compressible, and row-oriented, but it additionally supports schema evolution and multilingual binding.

Parquet file: In the Parquet format, column values are stored adjacent to one another within row groups, so the dataset is partitioned both horizontally (into row groups) and vertically (into columns).
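
A minimal sketch of choosing a file format at table-creation time; the table names are illustrative, and STORED AS AVRO/PARQUET assumes a reasonably recent Hive version:

CREATE TABLE logs_seq (id INT, msg STRING) STORED AS SEQUENCEFILE;
CREATE TABLE logs_orc (id INT, msg STRING) STORED AS ORC;
CREATE TABLE logs_avro (id INT, msg STRING) STORED AS AVRO;
CREATE TABLE logs_parquet (id INT, msg STRING) STORED AS PARQUET;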

Learn more about Hadoop from this Hadoop Training in New York to get ahead in your career!

15. What is the precedence order of Hive configuration?

Hive uses the following precedence hierarchy for setting properties (from the highest to the lowest):

  1. The SET command in Hive
  2. The command-line -hiveconf option
  3. hive-site.xml
  4. hive-default.xml
  5. hadoop-site.xml
  6. hadoop-default.xml
16. If you run a select * query in Hive, why does it not run MapReduce?

The hive.fetch.task.conversion property of Hive avoids the latency of MapReduce overhead: for simple queries that involve only SELECT, FILTER, LIMIT, etc., Hive runs a fetch task and skips the MapReduce stage entirely.
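
A minimal sketch; the table name is illustrative, and hive.fetch.task.conversion=more is the setting that allows such simple queries to run as a plain fetch:

hive> SET hive.fetch.task.conversion=more;
hive> SELECT * FROM employee LIMIT 10;   -- served by a fetch task; no MapReduce job is launched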

If you have any doubts or queries related to Hive, get them clarified from Hadoop experts on our Hive Community!

17. How can we improve the performance with ORC format tables in Hive?

We can store Hive data in a highly efficient manner in the Optimized Row Columnar (ORC) file format. It overcomes many of the limitations of other Hive file formats. We can improve the performance by using ORC files while reading, writing, and processing the data.

SET hive.compute.query.using.stats=true;
SET hive.stats.dbclass=fs;
CREATE TABLE orc_table (
  id INT,
  name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ':'
LINES TERMINATED BY '\n'
STORED AS ORC;

Need a reason to learn Apache Hadoop and Hive? Well, go through this blog post to find out why Hadoop is the new black!

18. Explain the functionality of ObjectInspector.

ObjectInspector helps analyze the internal structure of a row object and the individual structure of columns in Hive. It also provides a uniform way to access complex objects that can be stored in multiple formats in memory, such as:

  • An instance of Java class
  • A standard Java object
  • A lazily initialized object

ObjectInspector tells the structure of the object and also the ways to access the internal fields inside the object.

19. Whenever we run a Hive query, a new metastore_db is created. Why?

A local metastore is created when we run Hive in embedded mode; before creating it, Hive checks whether a metastore already exists. The metastore location is defined in the configuration file hive-site.xml by the property javax.jdo.option.ConnectionURL, whose default value is jdbc:derby:;databaseName=metastore_db;create=true. Because this default is a relative path, a new metastore_db is created in whichever directory Hive is started from; changing it to an absolute path makes Hive use the same metastore from any location.

Interested in learning Hadoop? Check out the Hadoop Training in Sydney!


20. Differentiate between Hive and HBase.
Hive                                        | HBase
It enables most SQL queries.                | It does not allow SQL queries.
Operations do not run in real time in Hive. | In HBase, operations run in real time on the database.
It is a data warehouse framework.           | It is a NoSQL database.
Hive runs on top of MapReduce.              | HBase runs on top of HDFS.
21. How can we access the subdirectories recursively?

By using the below commands, we can access subdirectories recursively in Hive:

hive> SET mapred.input.dir.recursive=true;
hive> SET hive.mapred.supports.subdirectories=true;

Hive tables can then point to a higher-level directory, which is suitable for the directory structure:

/data/country/state/city/
22. What are the uses of Hive Explode?

The explode function takes an array (or a map) as input and outputs its elements as separate table rows. Hive uses explode to convert complicated data types into the desired table format.
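
A minimal sketch of explode, with and without LATERAL VIEW; the table and column names are illustrative, and phone_numbers is assumed to be an ARRAY<STRING> column:

SELECT explode(phone_numbers) AS phone FROM contacts;

SELECT name, phone
FROM contacts
LATERAL VIEW explode(phone_numbers) p AS phone;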

Learn end-to-end Hadoop concepts through the Hadoop Course in Hyderabad to take your career to a whole new level!

23. What is the available mechanism for connecting applications when we run Hive as a server?
  1. Thrift Client: Using Thrift, we can call Hive commands from various programming languages, e.g., C++, PHP, Java, Python, and Ruby.
  2. JDBC Driver: It supports the Type 4 (pure Java) JDBC driver.
  3. ODBC Driver: It supports the ODBC protocol.
24. How do we write our own custom SerDe?

End users usually want to read their own data format rather than write to it, so it is often enough to implement a Deserializer instead of a full SerDe.

For example, RegexDeserializer will deserialize the data using the configuration parameter ‘regex’ and a list of column names.

If our SerDe supports DDL, we would probably want to implement a protocol based on DynamicSerDe, since writing a 'Thrift DDL' parser from scratch is non-trivial.
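
As a related illustration, here is a minimal sketch of plugging a regex-based SerDe into a table definition; the table name and regular expression are illustrative, and the built-in class org.apache.hadoop.hive.serde2.RegexSerDe is assumed to be available in the Hive version in use:

CREATE TABLE apache_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(\\S+) (.*)")
STORED AS TEXTFILE;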

Are you interested in learning Hadoop from experts? Enroll in our Hadoop Course in Bangalore now!

25. Mention the date data type in Hive. Name the Hive data type collection.

The TIMESTAMP data type stores dates in the java.sql.Timestamp format.

The three collection data types in Hive are (a short sketch follows the list):

  1. Array
  2. Map
  3. Struct
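
A minimal sketch showing all three collection types in one table definition; the table and column names are illustrative:

CREATE TABLE employee_details (
  name    STRING,
  skills  ARRAY<STRING>,
  phones  MAP<STRING, STRING>,
  address STRUCT<street:STRING, city:STRING, zip:INT>
);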
26. Can we run UNIX shell commands from Hive? Can Hive queries be executed from script files? If yes, how? Give an example.

Yes, we can run UNIX shell commands from Hive by prefixing the command with the ! mark. For example, !pwd at the Hive prompt will print the current directory.
We can execute Hive queries from script files using the source command.

Example:

hive> source /path/to/file/file_with_query.hql

 
