Apache Pig
Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, together with the infrastructure for evaluating those programs. The advantage of Pig is that its programs are easily parallelized, which allows it to handle very large amounts of data. Programming on this platform is done using the textual language Pig Latin.
Pig Latin features
- Simple programming: it is easy to code, execute, and manage
- Better optimization: the system can automatically optimize the execution
- Extensibility: users can create their own functions to carry out highly specific processing tasks
Pig can be used for the following purposes:
- ETL data pipeline
- Research on raw data
- Iterative processing.
The scalar data types in Pig are int, float, double, long, chararray, and bytearray. The complex data types in Pig are map, tuple, and bag.
Map: A set of key-value pairs. The key must be of type chararray, while the value can be of any Pig data type, including a complex type.
Example: ['city'#'bang', 'pin'#560001]
Here, 'city' and 'pin' are keys mapping to their respective values.
Tuple: A fixed-length, ordered collection of fields; each field can hold a value of any data type.
Example: ('Bangalore', 560001)
Bag: An unordered collection of tuples; the tuples in a bag are separated by commas.
Example: {('Bangalore', 560001), ('Mysore', 570001), ('Mumbai', 400001)}
LOAD function:
The LOAD function loads data from the file system into a relation. It is a relational operator. The first step in a Pig data flow is to specify the input, which is done with the LOAD keyword.
The LOAD syntax is
LOAD 'mydata' [USING function] [AS schema];
Example: A = LOAD 'intellipaat.txt';
A = LOAD 'intellipaat.txt' USING PigStorage('\t');
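The optional AS clause attaches a schema to the loaded data. Below is a minimal sketch, with a hypothetical file name and field names, showing how scalar and complex types can be declared together:
A = LOAD 'employee_data' USING PigStorage(',') AS (
    name:chararray,                          -- scalar field
    address:tuple(city:chararray, pin:int),  -- ordered, fixed-length tuple
    projects:bag{t:(proj_name:chararray)},   -- unordered bag of tuples
    contact:map[]                            -- key-value pairs with chararray keys
);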
The relational operators in Pig:
foreach, order by, filter, group, distinct, join, limit.
foreach: It takes a set of expressions and applies them to every record in the data pipeline, passing the results to the next operator.
A = LOAD 'input' AS (emp_name:chararray, emp_id:long, emp_add:chararray, phone:chararray, preferences:map[]);
B = foreach A generate emp_name, emp_id;
filter: It contains a predicate and allows us to select which records will be retained in our data pipeline.
Syntax: alias = FILTER alias BY expression;
Alias indicates the name of the relation, BY is the required keyword, and the expression is a Boolean condition.
Example: M = FILTER N BY F5 == 4;
Apache Sqoop
Apache Sqoop is a tool that is widely used to transfer large amounts of data from Hadoop to relational database servers and vice versa. Sqoop can import various types of data from Oracle, MySQL, and other databases.
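As a minimal sketch of both directions (the connection string, table name, and HDFS directories here are hypothetical):
sqoop import --connect jdbc:mysql://db.example.com/corp --table EMPLOYEES --target-dir /user/hadoop/employees
sqoop export --connect jdbc:mysql://db.example.com/corp --table EMPLOYEES --export-dir /user/hadoop/employees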
Important Sqoop control commands to import RDBMS data
- Append: append data to an existing dataset in HDFS. --append
- Columns: columns to import from the table. --columns <col,col,...>
- Where: WHERE clause to use during import. --where <where clause>
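These options can be combined in a single import command; here is a sketch with a hypothetical connection string, table, and target directory (further --columns and --where examples appear below):
sqoop import --connect jdbc:mysql://db.example.com/corp --table INTELLIPAAT_EMP --columns "emp_id,name,start_date" --where "start_date > '2016-07-20'" --target-dir /user/hadoop/intellipaat_emp --append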
The common large object types in Sqoop are BLOB and CLOB. If the object is less than 16 MB, it is stored inline with the rest of the data and materialized in memory for processing. Larger objects are temporarily stored in files in a subdirectory named _lobs of the import target directory. If we set the inline LOB limit to zero (0), all large objects are placed in external storage.
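The inline limit is controlled with the --inline-lob-limit option; a sketch with a hypothetical connection string and table that pushes every large object to external storage:
sqoop import --connect jdbc:mysql://db.example.com/corp --table DOCUMENTS --inline-lob-limit 0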
Sqoop allows us to export and import data from a table based on a WHERE clause. The relevant options are:
--columns <col1,col2,...>
--where <condition>
--query <SQL query>
Example:
sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --where "start_date > '2016-07-20'"
sqoop eval --connect jdbc:mysql://db.test.com/corp --query "SELECT * FROM intellipaat_emp LIMIT 20"
sqoop import --connect jdbc:mysql://localhost/database --username root --password aaaaa --table <table> --columns "name,emp_id,jobtitle"
Sqoop supports importing data into the following services: HDFS, Hive, HBase, and Accumulo.
Sqoop needs a connector to connect to the different relational databases. Almost every database vendor makes a JDBC connector available that is specific to that database, and Sqoop also needs the JDBC driver of the database for interaction. A JDBC driver alone is therefore not enough: Sqoop needs both the JDBC driver and a connector to connect to a database.
Sqoop command to control the number of mappers
We can control the number of mappers by passing the --num-mappers parameter to the sqoop command. The --num-mappers argument controls the number of map tasks, which is the degree of parallelism used. Start with a small number of map tasks and increase gradually; choosing too many mappers can degrade performance on the database side.
Syntax: -m, --num-mappers <n>
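For example, a sketch with a hypothetical connection string and table that runs the import with four parallel map tasks:
sqoop import --connect jdbc:mysql://db.example.com/corp --table INTELLIPAAT_EMP --num-mappers 4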
Sqoop command to show all the databases in MySQL server
$ sqoop list-databases --connect jdbc:mysql://database.test.com/
Sqoop metastore
It is a tool that hosts a shared metadata repository. Multiple users and remote users can define and execute saved jobs defined in the metastore. End users configure the connection to the metastore in sqoop-site.xml or with the --meta-connect argument.
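A sketch of saving a job against a shared metastore; the metastore host, job name, and import arguments below are hypothetical (the metastore listens on port 16000 by default):
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop --create intellipaat_import -- import --connect jdbc:mysql://db.example.com/corp --table INTELLIPAAT_EMP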
The purpose of sqoop-merge:
This tool combines two datasets: entries in the newer dataset overwrite entries in the older dataset, preserving only the newest version of the records between the two data sets.
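A sketch of merging a newer import onto an older one; the directory names, generated record class, and key column are hypothetical:
sqoop merge --new-data /user/hadoop/emp_new --onto /user/hadoop/emp_old --target-dir /user/hadoop/emp_merged --jar-file INTELLIPAAT_EMP.jar --class-name INTELLIPAAT_EMP --merge-key emp_id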
Apache Hive
Apache Hive is data warehouse software that lets you read, write, and manage huge volumes of data stored in a distributed environment using SQL. It is possible to project structure onto data that is already in storage. Users can connect to Hive using a JDBC driver and a command-line tool.
Hive is an open-source system. We can use Hive for analyzing and querying large datasets stored in Hadoop files. Its query language is similar to SQL. The present version of Hive is 0.13.1.
Hive supports ACID transactions. ACID stands for Atomicity, Consistency, Isolation, and Durability. These transactions are provided at the row level through INSERT, DELETE, and UPDATE operations.
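A minimal HiveQL sketch with a hypothetical table; row-level ACID operations require a bucketed, ORC-backed table with the transactional property enabled (and the corresponding transaction manager settings in the Hive configuration):
CREATE TABLE emp_txn (emp_id INT, name STRING)
CLUSTERED BY (emp_id) INTO 4 BUCKETS   -- bucketing is required for ACID tables
STORED AS ORC                          -- ORC is the required file format
TBLPROPERTIES ('transactional'='true');
INSERT INTO TABLE emp_txn VALUES (101, 'Raj');
UPDATE emp_txn SET name = 'Ravi' WHERE emp_id = 101;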
Hive is not considered a full database. The design constraints of Hadoop and HDFS place restrictions on what Hive can do.
Hive is most suitable for the following data warehouse applications:
- Analyzing relatively static data
- Applications that do not require fast response times
- Data that does not change rapidly
Hive does not provide the fundamental features required for OLTP (Online Transaction Processing). It is suitable for data warehouse applications over large data sets.
The two types of tables in Hive
- Managed table
- External table
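A short HiveQL sketch with hypothetical table names and HDFS path; dropping the managed table removes its data as well, while dropping the external table leaves the files at the given location untouched:
CREATE TABLE managed_emp (emp_id INT, name STRING);           -- managed: Hive owns the data
CREATE EXTERNAL TABLE external_emp (emp_id INT, name STRING)
LOCATION '/user/hadoop/external_emp';                         -- external: data stays at this path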
We can change settings within a Hive session using the SET command. It helps to change Hive job settings for a particular query.
Example: the following command ensures that buckets are populated according to the table definition:
hive> SET hive.enforce.bucketing=true;
We can see the current value of any property by using SET with the property name. SET on its own will list all the properties with the values set by Hive.
hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true
This list does not include the Hadoop defaults, so we should use
SET -v
It will list all the properties in the system, including the Hadoop defaults.
Add a new node with the following steps:
1) Take a new system and create a new username and password
2) Install SSH and set up SSH connections with the master node (see the consolidated sketch after this list)
3) Add the SSH public key (id_rsa.pub) to the authorized_keys file
4) Add the new DataNode's hostname, IP address, and other details to /etc/hosts and the slaves file, for example:
192.168.1.102 slave3.in slave3
5) Start the DataNode on the new node
6) Log in to the new node, for example with su hadoop or ssh -X hadoop@<new-node>
7) Start HDFS on the newly added slave node by using the following command:
./bin/hadoop-daemon.sh start datanode
8) Check the output of the jps command on the new node.
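A consolidated shell sketch of steps 2 through 8, assuming a hadoop user on both machines, the address 192.168.1.102 used above, and default file locations (the slaves file path varies between Hadoop versions):
# On the master node: set up passwordless SSH to the new node (steps 2 and 3)
ssh-keygen -t rsa                      # skip if a key pair already exists
ssh-copy-id hadoop@192.168.1.102       # appends the public key to authorized_keys
# Register the new node (step 4)
echo "192.168.1.102 slave3.in slave3" | sudo tee -a /etc/hosts
echo "slave3.in" >> $HADOOP_HOME/etc/hadoop/slaves
# On the new node: start the DataNode and verify (steps 5 to 8)
ssh -X hadoop@192.168.1.102
./bin/hadoop-daemon.sh start datanode
jps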