
Sqoop and Impala

Sqoop

Sqoop is a bulk data transfer tool that allows simple import and export of data between structured data stores, such as relational databases, enterprise data warehouses, and NoSQL systems, and the Hadoop ecosystem.


Key features of Sqoop

It has the following features:

  • JDBC-based implementation is used
  • Auto-generation of tedious user-side code
  • Integration with Hive (see the sketch after this list)
  • Extensible backend
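
As a quick illustration of the Hive integration mentioned above, here is a minimal sketch that imports a table directly into Hive; the connection string, credentials, and table name are placeholders:

sqoop import --connect jdbc:mysql://localhost/database --username root --password aaaaa --table intellipaat_emp --hive-import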

Why Sqoop

  • Forcing MapReduce to access data in an RDBMS directly is repetitive, error-prone, and costlier than expected.
  • Data must be prepared for effective MapReduce consumption.

Important Sqoop control commands to import RDBMS data are as follows:

  • Append: Appends data to an existing dataset in HDFS. --append (see the sketch after this list)
  • Columns: Specifies the columns to import from the table. --columns <col,col,…>
  • Where: Specifies a WHERE clause to use during the import. --where <where clause>
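
For instance, a minimal sketch of an incremental load using the --append option; the connection details, table, and condition are placeholders:

sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --where "emp_id > 1000" --append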

The common large-object types in Sqoop are BLOB and CLOB. If an object is less than 16 MB, it is stored inline with the rest of the data. Larger objects are temporarily stored in a subdirectory named _lobs and are materialized in memory for further processing. If the inline LOB limit is set to zero (0), large objects are always stored in external storage.
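As an illustration, the inline storage threshold can be tuned with the --inline-lob-limit option (value in bytes); setting it to 0 forces all large objects into external storage. The connection details and table below are placeholders:

sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --inline-lob-limit 0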

Sqoop allows data to be exported and imported from a table based on a WHERE clause, and the syntax is as follows:

--columns <col1,col2,…>
--where <condition>
--query <SQL query>

Example:

sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --where "start_date > '2016-07-20'"
sqoop eval --connect jdbc:mysql://db.test.com/corp --query "SELECT * FROM intellipaat_emp LIMIT 20"
sqoop import --connect jdbc:mysql://localhost/database --username root --password aaaaa --columns "name,emp_id,jobtitle"
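
The export direction mentioned above works the same way; a minimal sketch, assuming an existing MySQL table and an HDFS directory holding the data to push back (names and paths are placeholders):

sqoop export --connect jdbc:mysql://localhost/database --username root --password aaaaa --table intellipaat_emp --export-dir /user/hadoop/intellipaat_emp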


Sqoop supports data import into the following services:

  • HDFS
  • Hive
  • HBase (see the sketch after this list)
  • HCatalog
  • Accumulo
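
For example, a minimal sketch of an import into HBase using the --hbase-table, --column-family, and --hbase-row-key options; the connection details, table, and column names are placeholders:

sqoop import --connect jdbc:mysql://localhost/database --username root --password aaaaa --table intellipaat_emp --hbase-table emp --column-family info --hbase-row-key emp_id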

Sqoop needs a connector to connect to different relational databases. Almost all database vendors make a JDBC connector available that is specific to their database, and Sqoop needs the JDBC driver of that database for further interaction. A JDBC driver alone is not sufficient: Sqoop requires both the JDBC driver and a connector to connect to a database.
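
When no vendor-specific connector is available, the generic JDBC connector can be forced with the --driver option; a minimal sketch, assuming the MySQL JDBC driver class and a placeholder connection string:

sqoop import --connect jdbc:mysql://db.test.com/corp --driver com.mysql.jdbc.Driver --table INTELLIPAAT_EMP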

Sqoop command to control the number of mappers

We can control the number of mappers by passing the --num-mappers parameter to the sqoop command. The --num-mappers argument controls the number of map tasks, which determines the degree of parallelism used. Start with a small number of map tasks and increase it gradually, because starting with a high number of mappers may degrade performance on the database side.

Syntax: -m, --num-mappers <n>
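
For example, a minimal sketch that runs an import with four parallel map tasks; the connection details and table are placeholders:

sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --num-mappers 4
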
Sqoop command to list all the databases in a MySQL server:
$ sqoop list-databases --connect jdbc:mysql://database.test.com/
Sqoop metastore

It is a tool used for hosting a shared metadata repository. Multiple and remote users can define and execute saved jobs that are stored in the metastore. End users are configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
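
A minimal sketch of defining a saved job against a shared metastore with --meta-connect; the metastore host, port, job name, and import details are placeholders:

sqoop job --create import_emp --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop -- import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP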

The purpose and usage of sqoop-merge:

This tool combines two datasets: entries in the newer dataset overwrite entries in the older dataset, so only the newest version of a record is preserved across both datasets.
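
A minimal sketch of a merge, assuming the new and old datasets were imported with the generated record class INTELLIPAAT_EMP and share the key column emp_id; all paths and names are placeholders:

sqoop merge --new-data /user/hadoop/emp_new --onto /user/hadoop/emp_old --target-dir /user/hadoop/emp_merged --jar-file INTELLIPAAT_EMP.jar --class-name INTELLIPAAT_EMP --merge-key emp_id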


Impala

Impala is an open-source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.
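
For a quick feel of how Impala is queried, a minimal sketch using impala-shell; the daemon host and table name are placeholders:

impala-shell -i impalad-host.example.com:21000 -q "SELECT COUNT(*) FROM intellipaat_emp"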

Goals of Impala

  • General-purpose SQL query engine:
    • Must work for both transactional and analytical workloads
    • Must support queries that take anywhere from milliseconds to hours
  • Runs directly within Hadoop:
    • Reads the Hadoop file formats that are broadly used
    • Talks to the Hadoop storage managers that are extensively used
    • Runs on the same nodes that run Hadoop processes
  • High performance:
    • Runtime code generation
    • Use of C++ in place of Java
    • Completely new execution engine that is not built on MapReduce

