Hive Interview Questions

What is Hive?

Hive is a tool to perform ETL (extract, transform and load) and Data warehousing functions. It was developed as a sub-project of Hadoop Distributed File System (HDFS). Hive makes operations like data encapsulation, ad-hoc queries, analysing large sets of data simpler.

How to check Hive version?

To check Hive version, you can use Putty. Open Putty and type leo-ingesting.vip.name.com in the hostname. Click on open and then enter username and password. Follow the command set below to find the version of Hive.

$ bash
bash-3.00$ Hive
Hive history file=/tmp/rkost/Hive_job_log_rkost_201207010451_1212680168.txt
Hive> set mapred.job.queue.name=hdmi-technology;
Hive> select * from table LIMIT 1;

What is SerDe in Hive?

SerDe is the short form for Serializer/Deserializer. Hive uses SerDe to take care of serialization and deserialization. It also interprets the outcomes of serialization considering individual fields for processing. SerDe enables Hive to read the information from a table, and then write it back as outputs of HDFS in any format you want. Anyone can write his or her own SerDe for his or her choice of data formats.

What is Hive metastore?

Hive Metastore is the chief storehouse of Hive’s metadata. It stores metadata i.e. the schema and locations of Hive tables and partitions in the form of a relational database. It also stores data such as table name, table location, column names and types, sorting columns if any, partition columns, number of buckets in the table, storage handler, etc. The metastore updates itself whenever you create a table. You can directly access this information from the metastore.

When running Hive as a server what connection options are available?

You can connect to the Hive Server in three ways.

Thrift Client: You can use different programming languages such as C++, Java, PHP, Python etc. to call any Hive command with thrive client.
JDBC Driver: This type of connection supports Type 4 JDBC Driver i.e. pure Java.
ODBC Driver: This connection works on ODBC protocol.

hive-interview-questions

How to create an external table in Hive?

The following syntax will help you create an external table in Hive-

CREATE EXTERNAL TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[FIELDS TERMINATED BY char]
[STORED AS file_format]
[LOCATION hdfs_path];

How to run Hive script?

You have to input the following command to run Hive script.

Hive–f /home/cloudera/sample.sql

The complete path of the location of Script file must be present to successfully run the script.

What is lateral view in Hive?

A lateral view is used to apply the User Defined Tabular Function (UDTF) to every row of a table. It then joins the output and input rows to form a virtual table. Lateral view works in conjunction with user-defined table generating functions like explode().

What is hcatalog in Hive?

HCatalog in Hive is a metadata and table management system that enables storing data in any format irrespective of the structure. Both structured and unstructured data can be processed, stored and shared in HCatalog. This capability in combination with the ‘schema on read’ nature of Hadoop reduces cycle time and encourages exploration.

Explain map side join in Hive?

Map join is a feature in Hive that speeds up queries as it works without reducers. It enables a table to be loaded into memory to perform a join within a mapper, without using the Map/Reduce step. It loads a smaller table in memory and the joins in the map phase of the Map/Reduce operation.

What is bucketing in Hive? How bucketing works in Hive?

Bucketing in Hive allows the user to divide table data sets into parts that are more manageable. It gives an acceptable structure to Hive tables when performing queries on very large datasets. Bucketing works by subdividing partitions in a table containing huge datasets into parts that are more manageable or work directly on the table without partitions.

How to create a database in Hive?

In hive, a database refers to a namespace or a group of tables. To create a database, you have to use the Create Database command. The syntax to create a database is:

CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>

How to load JSON file in Hive?

To load a JSON file in Hive, you must first create a JSON table in hive. Then load and display the actual contents of the schema. After that, fetch the data values in the existing JSON hierarchy by using get_json_object().

Explain serialization in Hive?

Serialization is the process in which Hive converts objects in different programming languages to suitable formats as a stream of bytes, which can be stored in HDFS and used by Hive. This may be done with structured or unstructured data bytes. using SerDe. It optimises and saves the state of an object to recreate it when the need arises.

How will you optimize Hive performance?

You can optimise the performance of Hive in one of the following ways-

Tez Execution Engine – A Hive optimization technique increases the performance of hive queries by using Tez execution engine. It works on the framework of Hadoop Yarn and executes complex-directed acyclic graphs of general data processing tasks.
ORC File Format – Optimized Row Columnar (ORC) File Format drastically improves query performance as it stores data in a more optimized way than other formats.
Hive Partition – By using partitions, the entries in the different columns of the dataset are separated and stored in partitions. Hence, fetching the values requires querying only the required partitions. Thus, it reduces the time taken by a query to return results.
Bucketing in Hive – The data in the table are divided into more manageable portions or buckets that reduces the time taken to give results.
Vectorization– Use of vectorized query execution to perform scans, aggregations, filters, and joins in batches of 1024 rows at once reduces the time taken to perform operations.
Cost-Based Optimization– It this process, Hive optimizes the logical as well as physical execution plan of the query. The results depend on how the order joins, the type of join to performed, the degree of parallelism etc.
Hive Index – Among the best ways of optimising, it enhances the performance of the query. The original table is indexed to create a separate index table that acts as a reference.

What is the primary purpose of Hive in the Hadoop architecture?

Hadoop is used to crunch large amounts of data with the help of an array of tools. Hive is one such tool that is used to query and analyse huge datasets. It is an open source data warehousing system, which exclusively stores data in Hadoop storage.

What is partitioning in Hive?

Partition in Hive is used to organise tables into partitions by dividing a table into associated parts. The division depends on the values of the columns like date, city, and department. Every table in the Hive can have more than one partition keys to identify the partition.

What is the difference between static and dynamic partitioning in Hive?

Static Partition in Hive

In static partitioning, you must specify the partition column value in with every LOAD statement.
Static Partition is timesaving while loading data compared to dynamic partition.
You have the ability to alter the partition in the static partition
The partition column value can be obtained from the filename, day of date etc. without reading the entire file.

Dynamic Partition in Hive

To load the data from a non-partitioned table, you can use a dynamic partition to insert partitions individually.
Dynamic Partition takes more time to load data than static partition.
Dynamic partition is suitable for large data or to partition many columns
You can use a dynamic partition on hive external table and managed tables.

What is vertex and vectorization in Hive?

Vectorization allows Hive to process a group of rows together instead of processing a single row at a given time. Every batch is an array of a primitive type. In vectorization, operations of the entire column vector at one go, which improves the instruction pipelines as well as cache usage.

How to update records in Hive?

To update records in Hive, use the following syntax-

UPDATE <target table>
SET <set clause list>
[ WHERE <search condition> ]

Hive Interview Questions and Answers