Introduction to Hadoop and its Ecosystem, MapReduce and HDFS
Big Data, Factors constituting Big Data, Hadoop and the Hadoop Ecosystem, MapReduce – Concepts of Map, Reduce, Ordering, Concurrency, Shuffle and Reducing, Hadoop Distributed File System (HDFS) Concepts and their Importance, Deep Dive into MapReduce – Execution Framework, Partitioner, Combiner, Data Types, Key-Value Pairs, HDFS Deep Dive – Architecture, Data Replication, NameNode, DataNode, Data Flow, Parallel Copying with DistCp, Hadoop Archives
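To make the DistCp and Hadoop Archives topics concrete, here is a minimal command-line sketch (cluster addresses and paths are hypothetical):

    # Parallel copy between clusters with DistCp
    hadoop distcp hdfs://nn1:8020/data/logs hdfs://nn2:8020/backup/logs

    # Pack a directory of small files into a Hadoop archive (HAR)
    hadoop archive -archiveName logs.har -p /data logs /archives

    # Browse the archive through the har:// scheme
    hdfs dfs -ls har:///archives/logs.har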
Hands on Exercises
Installing Hadoop in Pseudo Distributed Mode, Understanding Important Configuration Files, their Properties and Daemon Threads, Accessing HDFS from the Command Line
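The HDFS shell mirrors familiar Unix file commands; a minimal session might look like this (paths are illustrative):

    hdfs dfs -mkdir -p /user/train/input          # create a directory tree
    hdfs dfs -put access.log /user/train/input    # copy a local file into HDFS
    hdfs dfs -ls /user/train/input                # list directory contents
    hdfs dfs -cat /user/train/input/access.log    # print a file to stdout
    hdfs dfs -rm /user/train/input/access.log     # delete a file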
MapReduce – Basic Exercises, Understanding the Hadoop Ecosystem, Introduction to Sqoop, use cases and Installation, Introduction to Hive, use cases and Installation, Introduction to Pig, use cases and Installation, Introduction to Oozie, use cases and Installation, Introduction to Flume, use cases and Installation, Introduction to YARN
Mini Project – Importing MySQL Data using Sqoop and Querying it using Hive
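A sketch of the mini project's two steps, assuming a MySQL database named retail with an orders table (all connection details are hypothetical):

    # Step 1: import the MySQL table into a Hive table with Sqoop
    sqoop import \
      --connect jdbc:mysql://dbhost/retail --username train -P \
      --table orders --hive-import --hive-table retail.orders

    # Step 2: query the imported data from the Hive CLI
    hive -e "SELECT order_status, COUNT(*) FROM retail.orders GROUP BY order_status;"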
A. Introduction to Hive
What Is Hive?, Hive Schema and Data Storage, Comparing Hive to Traditional Databases, Hive vs. Pig, Hive Use Cases, Interacting with Hive
B. Relational Data Analysis with Hive
Hive Databases and Tables, Basic HiveQL Syntax, Data Types, Joining Data Sets, Common Built-in Functions, Hands-On Exercise: Running Hive Queries on the Shell, Scripts, and Hue
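For example, a join plus aggregation in HiveQL might look like this (the customers and orders tables are hypothetical):

    hive -e "
      SELECT c.name, SUM(o.amount) AS total_spent
      FROM customers c
      JOIN orders o ON (c.id = o.customer_id)
      GROUP BY c.name;"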
C. Hive Data Management
Hive Data Formats, Creating Databases and Hive-Managed Tables, Loading Data into Hive, Altering Databases and Tables, Self-Managed Tables, Simplifying Queries with Views, Storing Query Results, Controlling Access to Data, Hands-On Exercise: Data Management with Hive
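A minimal sketch of creating a database and a Hive-managed table and loading data into it (names and paths are illustrative):

    hive -e "
      CREATE DATABASE IF NOT EXISTS sales;
      CREATE TABLE sales.orders (id INT, customer_id INT, amount DOUBLE)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
      LOAD DATA LOCAL INPATH '/tmp/orders.csv' INTO TABLE sales.orders;"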
D. Hive Optimization
Understanding Query Performance, Partitioning, Bucketing, Indexing Data
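As an illustration, partitioning and bucketing are declared when the table is created; queries that filter on the partition column then scan only the matching partitions (table and columns hypothetical):

    hive -e "
      CREATE TABLE logs (user_id INT, msg STRING)
        PARTITIONED BY (dt STRING)
        CLUSTERED BY (user_id) INTO 8 BUCKETS;
      -- prunes to a single partition instead of scanning the whole table
      SELECT COUNT(*) FROM logs WHERE dt = '2016-01-01';"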
E. Extending Hive
Topics : User-Defined Functions
F. Hands-On Exercises – Working with huge data sets and querying them extensively.
G. User-Defined Functions, Optimizing Queries, Tips and Tricks for Performance Tuning
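A common pattern for the UDF topics above is to package the function in a JAR, register it for the session, and call it like a built-in (the JAR path and class name here are hypothetical):

    hive -e "
      ADD JAR /tmp/my-udfs.jar;
      CREATE TEMPORARY FUNCTION normalize AS 'com.example.hive.NormalizeUDF';
      SELECT normalize(name) FROM customers LIMIT 10;"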
A. Introduction to Pig
What Is Pig?, Pig’s Features, Pig Use Cases, Interacting with Pig
B. Basic Data Analysis with Pig
Pig Latin Syntax, Loading Data, Simple Data Types, Field Definitions, Data Output, Viewing the Schema, Filtering and Sorting Data, Commonly-Used Functions, Hands-On Exercise: Using Pig for ETL Processing
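A minimal ETL-style Pig script run from the shell might look like this (the input file and fields are illustrative):

    cat > etl.pig <<'EOF'
    -- load raw tab-separated records, keep the large ones, sort, and store
    logs   = LOAD '/user/train/web.log' USING PigStorage('\t')
             AS (ip:chararray, url:chararray, bytes:int);
    big    = FILTER logs BY bytes > 1024;
    sorted = ORDER big BY bytes DESC;
    STORE sorted INTO '/user/train/big_requests';
    EOF
    pig etl.pig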
C. Processing Complex Data with Pig
Complex/Nested Data Types, Grouping, Iterating Grouped Data, Hands-On Exercise: Analyzing Data with Pig
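Grouping and then iterating the grouped bags is the core pattern here; a sketch with hypothetical order data:

    cat > totals.pig <<'EOF'
    -- group orders by customer and sum each customer's amounts
    orders  = LOAD '/user/train/orders.csv' USING PigStorage(',')
              AS (cust:chararray, amount:double);
    by_cust = GROUP orders BY cust;
    totals  = FOREACH by_cust GENERATE group AS cust, SUM(orders.amount) AS total;
    DUMP totals;
    EOF
    pig totals.pig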
D. Multi-Data set Operations with Pig
Techniques for Combining Data Sets, Joining Data Sets in Pig, Set Operations, Splitting Data Sets, Hands-On Exercise
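Joining two data sets in Pig reads much like SQL; a minimal sketch with hypothetical inputs:

    cat > join.pig <<'EOF'
    -- inner join customers to their orders on the shared key
    cust   = LOAD '/data/customers' AS (id:int, name:chararray);
    orders = LOAD '/data/orders'    AS (cust_id:int, amount:double);
    joined = JOIN cust BY id, orders BY cust_id;
    DUMP joined;
    EOF
    pig join.pig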
E. Extending Pig
Macros and Imports, UDFs, Using Other Languages to Process Data with Pig, Hands-On Exercise: Extending Pig with Streaming and UDFs
F. Pig Jobs
Job and Certification Support
Major Project, Hadoop Development, Cloudera Certification Tips and Guidance, Mock Interview Preparation, Practical Development Tips and Techniques, Certification Preparation
DataStage Course Content
Introduction to the IBM Information Server Architecture, the Server Suite components, the various tiers in the Information Server.
Understanding the IBM InfoSphere DataStage, the Job life cycle to develop, test, deploy and run data jobs, high performance parallel framework, real-time data integration.
Introduction to the design elements, various DataStage jobs, creating a massively parallel framework, scalable ETL features, working with DataStage jobs.
Understanding the DataStage Job, creating a Job that can effectively extract, transform and load data, cleansing and formatting data to improve its quality.
Parallelism, Partitioning and Collecting
Learning about data parallelism – pipeline parallelism and partitioning parallelism; the two types of data partitioning – key-based partitioning and keyless partitioning; a detailed understanding of partitioning techniques like round robin, entire, hash key, range and DB2 partitioning; data collecting techniques and types, such as the round robin, ordered, sorted merge and same collecting methods.
Job Stages of InfoSphere DataStage
Understanding the various job stages – data source, transformer, final database, the various parallel stages – general objects, debug and development stages, processing stage, file stage types, database stage, real time stage, restructure stage, data quality and sequence stages of InfoSphere DataStage.
Understanding the parallel job stage editors, the important types of stage editors in DataStage.
Working with Sequential File stages, understanding runtime column propagation (RCP), working with RCP in Sequential File stages, using the Sequential File stage as a source stage and a target stage.
Dataset and Fileset
Understanding the difference between dataset and fileset and how DataStage works in each scenario.
Sample Job Creation
Creating a sample DataStage job using the dataset and fileset types of data.
Properties of Sequential File stage and Data Set Stage
Learning about the various properties of Sequential File Stage and Dataset stage.
Lookup File Set Stage
Creating a lookup file set, working in parallel or sequential stage, learning about single input and output link.
Studying the Transformer Stage in DataStage, the basic working of this stage, its characteristics – single input, any number of outputs and a reject link – how it differs from other processing stages, the significance of the Transformer Editor, and the evaluation sequence in this stage.
Transformer Stage Functions & Features
Deep dive into Transformer functions – String, type conversion, null handling, mathematical, utility functions, understanding the various features like constraint, system variables, conditional job aborting, Operators and Trigger Tab.
Understanding the looping functionality in Transformer Stage, output with multiple rows for single input row, the procedure for looping, loop variable properties.
Teradata Enterprise Stage
Connecting to the Teradata Enterprise Stage, properties of connection.
Single partition and parallel execution
Generating data using Row Generator sequentially in a single partition, configuring to run in parallel.
Understanding the Aggregator Stage in DataStage, the two types of aggregation – hash mode and sort mode.
Different Stages Of Processing
A deep dive into the various stages in DataStage, the importance of the Copy, Filter and Modify stages in reducing the number of Transformer Stages.
Parameters and Value File
Understanding Parameter Set, storing DataStage and Quality Stage job parameters and default values in files, the procedure to deploy Parameter Sets function and its advantages.
Introduction to Funnel Stage, copying multiple input data sets into single output data set, the three modes – continuous funnel, sort funnel and sequence.
Topics – Understanding the Join Stage and its types, Join Stage Partitioning, performing various Join operations.
Understanding the Lookup Stage for processing using lookup operations, knowing when to use Lookup Stage, partitioning method for Lookup Stage, comparing normal and sparse lookup, doing lookup for a range of values using Range Lookup.
Learning about the Merge Stage, multiple input links and single output link, need for key partitioned and sorted input data set, specifying several reject links in Merge Stage, comparing the Join vs. Lookup vs. Merge Stages of processing.
FTP Enterprise Stage
Studying the FTP Enterprise Stage, transferring multiple files in parallel, invoking the FTP client, transferring to or from remote host using FTP protocol, FTP Enterprise Stage properties.
Understanding the Sort Stage, performing complex sort operations, learning about Stable Sort, removing duplicates.
Working with Teradata Connector in DataStage, configuring as a source, target or parallel in a lookup context for parallel or server jobs, learning about Teradata Parallel Transporter direct API for bulk operations and the Operators deployed.
Learning about the various Database Connector Stages for working with Balanced Optimization Tool.
ABAP Extract Stage
Understanding the ABAP Extract Stage, extracting data from SAP data repositories, generating ABAP extraction programs, executing SQL query and sending data to DataStage Server.
Development / Debug Stages
The various Stages for debugging the parallel job designs, controlling flow of multiple activities in a job sequence, understanding the various data sampling stages in a Debug/Development Stage like Head Stage, Tail Stage and Sample Stage.
Job Activity Stage
Learning about Job Activity Stage which specifies a DataStage Server or parallel job to execute.
Pentaho Course Content
Introduction to Pentaho Tool
Pentaho User Console, overview of Pentaho Business Intelligence and Analytics tools, database dimensional modeling, using Star Schema for querying large data sets, understanding fact tables and dimension tables, Snowflake Schema, principles of Slowly Changing Dimensions, knowledge of how high availability is supported for the DI server and BA server, managing Pentaho artifacts, knowledge of big data solution architectures
Hands-on Exercise – Schedule a report using the user console, create a model using database dimensional modeling techniques, create a Star Schema for querying large data sets, use fact tables and dimension tables, manage Pentaho artifacts
Designing data models for reporting, Pentaho support for predictive analytics, Design a Streamlined Data Refinery (SDR) solution for a client
Hands-on Exercise – Design data models for reporting, Perform predictive analytics on a data set, design a Streamlined Data Refinery (SDR) solution for a dummy client
Clustering in Pentaho
Understanding the basics of clustering in Pentaho Data Integration, creating a database connection, moving a CSV file input to table output and Microsoft Excel output, moving from Excel to data grid and log.
Hands-on Exercise – Create a database connection, move a CSV file input to table output and Microsoft Excel output, move data from Excel to data grid and log
The Pentaho Data Integration transformation steps, adding sequence, understanding calculator, Pentaho number range, string replace, selecting field value, sorting and splitting rows, string operation, unique row and value mapper, usage of metadata injection
Hands-on Exercise – Practice various steps to perform data integration transformation, add sequence, use calculator, work on number range, selecting field value, sorting and splitting rows, string operation, unique row and value mapper, use metadata injection
Working with the secure socket command, Pentaho null value and error handling, Pentaho mail, row filtering and prioritizing streams.
Hands-on Exercise – Work with secure socket command, Handle null values in the data, perform error handling, send email, get row filtered data, set stream priorities
Understanding Slowly Changing Dimensions, making ETL dynamic, dynamic transformation, creating folders, scripting, bulk loading, file management, working with Pentaho file transfer, repository, XML utility and file encryption.
Hands-on Exercise – Make ETL transformations dynamic, create folders, write scripts, load bulk data, perform file management operations, work with Pentaho file transfer, XML utility and file encryption
Type of Repository in Pentaho
Creating dynamic ETL, passing variable and value from job to transformation, deploying parameter with transformation, importance of Repository in Pentaho, database connection, environmental variable and repository import.
Hands-on Exercise – Create dynamic ETL, pass variable and value from job to transformation, deploy parameter with transformation, connect to a database, set Pentaho environmental variables, import a repository into the Pentaho workspace
Pentaho Repository & Report Designing
Working with Pentaho dashboards and reports, the effect of row banding, designing a report, working with Pentaho Server, creating line, bar and pie charts in Pentaho, how to achieve localization in reports
Hands-on Exercise – Create a Pentaho dashboard and report, check the effect of row banding, design a report, work with Pentaho Server, create line, bar and pie charts in Pentaho, implement localization in a report
Working with Pentaho Dashboard, passing parameters in Report and Dashboard, drill-down of Report, deploying Cubes for report creation, working with Excel sheet, Pentaho data integration for report creation.
Hands-on Exercise – Pass parameters in Report and Dashboard, deploy Cubes for report creation, drill down in a report to understand the entries, import data from an Excel sheet, perform data integration for report creation
What is a Cube? Creating a Cube and its benefits, working with Cubes, Report and Dashboard creation with a Cube.
Hands-on Exercise – Create a Cube, create report and dashboard with Cube
Multi Dimensional Expression
Understanding the basics of MultiDimensional eXpressions (MDX), understanding a tuple and its implicit dimensions, MDX sets, levels, members, dimension referencing, hierarchical navigation, and metadata.
Hands-on Exercise – Work with MDX, use MDX sets, levels, members, dimension referencing, hierarchical navigation, and metadata
Using Pentaho analytics to discover and blend various data types and sizes, including advanced analytics for visualizing data across multiple dimensions, extending Analyzer functionality, embedding BA server reports, Pentaho REST APIs
Hands-on Exercise – Blend various data types and sizes, Perform advanced analytics for visualizing data across multiple dimensions, Embed BA server report
Pentaho Data Integration (PDI) Development
Knowledge of the PDI steps used to create an ETL job, Describing the PDI / Kettle steps to create an ETL transformation, Describing the use of property files
Hands-on Exercise – Create an ETL transformation using PDI / Kettle steps, Use property files
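PDI transformations (.ktr) and jobs (.kjb) built in the designer can also be run headless with the pan and kitchen scripts that ship with PDI; for example (the file paths and the RUN_DATE parameter are hypothetical):

    # run a transformation with basic logging
    ./pan.sh -file=/opt/etl/load_sales.ktr -level=Basic

    # run a job, passing a named parameter
    ./kitchen.sh -file=/opt/etl/nightly.kjb -param:RUN_DATE=2016-01-01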
Hadoop ETL Connectivity
Deploying ETL capabilities to work on the Hadoop ecosystem, integrating with HDFS and moving data from the local file system to the distributed file system, deploying Apache Hive, designing MapReduce jobs, complete Hadoop integration with the ETL tool.
Hands-on Exercise – Deploy ETL capabilities for working on the Hadoop ecosystem, Integrate with HDFS and move data from local file to distributed file system, deploy Apache Hive, design MapReduce jobs
Creating dashboards in Pentaho
Creating interactive dashboards that provide rich graphical representations of data for improving key business performance.
Hands-on Exercise – Create interactive dashboards for visualizing graphical representation of data
Managing BA server logging, tuning Pentaho reports, monitoring the performance of a job or a transformation, Auditing in Pentaho
Hands-on Exercise – Manage logging in BA server, Fine tune Pentaho report, Monitor the performance of an ETL job
Integrating user security with other enterprise systems, Extending BA server content security, Securing data, Pentaho’s support for multi-tenancy, Using Kerberos with Pentaho
Hands-on Exercise – Configure security settings to implement high level security
Project 1 : General Manager Insight – Dash Board
Client : Cisco
Technology : Teradata, Informatica, SQL
Cisco is the worldwide leader in networking that transforms how people connect, communicate and collaborate. Its current portfolio of products and services is focused on three market segments – Enterprise and Service Provider, Small Business, and the Home. The solutions for each market are segmented into Architectures, which form the basis for how Cisco approaches each market.
Description : The GMI dashboard is a graphical representation of the performance of various Cisco business segments. The dashboard provides a pictorial representation of summary and detailed data about various subject areas in different dimensions such as entity level, product level and region level. The dashboard covers various measures of the business such as bookings, revenue and gross margin, and product forecasts, which are used by managers at different hierarchies of Cisco management.
Project 2 : Deploying Informatica ETL for business intelligence
Industry : General
Problem Statement : Disparate data needs to be converted into insights using Informatica
Topics : In this Informatica project you have access to all environments like dev, QA, UAT and production. You will first configure all the repositories in the various environments. You will receive the requirements from the client through a source-to-target mapping sheet. You will extract data from various source systems and load it into staging. From staging it will go to the operational data store, from there the data will go to the enterprise data warehouse, and from there it will be directly deployed for generating reports and deriving business insights.
- Access data from multiple sources
- Manage current & historic data with SCD
- Import source & target tables
Case Study – 1
Project: Banking products augmentation
Problem Statement: How to improve the profits of a bank by customizing the products and adding new products based on customer needs.
Topics: In this Informatica project you will construct a multidimensional model for the bank. You will create a set of diagrams depicting the star-join schemas needed to streamline the products as per customer requirements. You will implement slowly changing dimensions, understand the customer/account relationship and create a diagram describing the hierarchies. You will also recommend heterogeneous products for the customers of the bank.
- Deploy Star join schema
- Create demographic mini-dimensions
- Informatica Aggregator Transformations
Case Study – 2
Project: Employee data integration
Problem Statement: How to load a table with employee data using Informatica
Topics: In this Informatica case study you will create a design for a common framework that can be used for loading and updating the employee ID and other details lookup for multiple shared tables. Your design will address the regular loading of the shared tables. You will also keep track of when the regular load runs, when the lookup requests run, prioritization of requests if needed, and so on.
- Creating multiple shared tables
- Plug-and-play capability of framework
- Code and framework reusability.
Project 1 : Configuration and Logging
Industry : General
Problem Statement : How to integrate data from multiple sources into the SQL Server
Topics : In this SQL Server Integration Services (SSIS) project you will work extensively on integrating data from multiple heterogeneous sources into SQL Server. As part of the project you will learn to clean and standardize data and automate the administrative work. Some of the tasks that you will be performing are adding logs to an SSIS package, configuring it, and saving the configuration to an XML file. Upon completion of the project you will have hands-on experience in handling constraints, error row configuration and event handlers.
- Integrate data from heterogeneous sources
- Working with Connection Manager
- Deploying data modeling
Project : Report formatting using OBIEE
Industry : General
Problem Statement : How to find the revenue generated for a business
Topics : This is an Oracle Business Intelligence project that is associated with creating complex dashboards and performing formatting of the report. You will gain hands-on experience in filtering and sorting of the report data depending on the business requirements. This project will also help you understand how to convert the data into graphs for easy visualization and analysis. As part of the project you will gain experience in calculating the subtotal and grand total in a business scenario while finding the revenue generated.
- Filtering and sorting of Reports
- Deploying visualization & analysis techniques
- Designing an OBIEE dashboard.
Talend For Hadoop Project
1. Project – Jobs
Problem Statement – This project describes how to create a job using metadata. It includes the following actions:
Create XML File, Create Delimited File, Create Excel File, Create Database Connection
2. Hadoop Projects
A. Project – Working with Map Reduce, Hive, Sqoop
Problem Statement – This project describes how to import MySQL data using Sqoop and query it using Hive, and how to run the word count MapReduce job.
B. Project – Connecting Pentaho with Hadoop Eco-system
Problem Statement – It includes:
Quick Overview of ETL and BI, Configuring Pentaho to work with a Hadoop Distribution, Loading data into the Hadoop cluster, Transforming data in the Hadoop cluster, Extracting data from the Hadoop cluster
DataStage Projects
Project 1 : Making sense of financial data
Industry : Financial Services
Problem Statement : Extract value from multiple sources & varieties of data in the financial domain
Description : In this project you will learn how to work with disparate data in the financial services domain and come up with valuable business insights. You will deploy IBM InfoSphere DataStage for the entire Extract, Transform, Load process and leverage its parallel framework, either on-premises or in the cloud, for high-performance results. You will work on big data at rest and big data in motion as well.
- Creating DataStage jobs for ETL process
- Deploying DataStage Parallel Stage Editor
- Data Partitioning for getting consistent results
Project 2 : Enterprise IT data management
Industry : Information Technology
Problem Statement : Software enterprises have a lot of data, and it needs to be made sense of in order to derive valuable insights from it
Description : This project involves working with a company's existing data warehouse and deploying IBM DataStage on it for the various processes of extract, transform, and load. You will learn how DataStage manages high-performance parallel computing, how it implements extended metadata management and enterprise connectivity, and how it combines heterogeneous data.
- Enforce workload & business rules
- DataStage deployed on heterogeneous data
- Integrating real-time data at scale.
Project 3 : Medical drug discovery and development
Industry : Pharmaceutical
Problem Statement : A pharmaceutical company wants to speed up the process of drug discovery and development using ETL solutions.
Description : This project deals with the domain of drug molecule discovery and development. You will learn how DataStage helps make sense of the huge data warehouse that resides within the pharmaceutical domain, which includes data about patient history, existing molecules, the effects of existing drugs, and so on. The ETL tool DataStage will help make the process of drug discovery that much easier.
- Combining various types of data with ETL process
- Converting the data and transferring it for analysis
- Making the data ready for visualization & insights.
Project 4 : Finding the oil reserves in ocean
Industry : Oil and Gas
Problem Statement : Finding new oil reserves is a herculean task. Huge amounts of data need to be parsed in order to find where oil exists in the ocean. This is where there is a need for an ETL tool like DataStage.
Description : This project deals with the process of deploying an ETL tool like DataStage to parse petabytes of data for discovering new oil. This data could be in the form of geological data, sensor data, streaming data and so on. You will learn how DataStage can make sense of all this data.
- Working with cloud or on-premise data
- Deploying DataStage for static or streaming data
- Converting data into the right format for analysis
Project 1– Pentaho Interactive Report
Data– Sales, Customer, Product
Objective – In this Pentaho project you will work exclusively on creating Pentaho interactive reports for sales, customer and product data fields. As part of the project you will learn to create a data source and build a Mondrian cube, which is represented in an XML file. You will gain advanced experience in managing data sources, building and formatting Pentaho reports, changing the report template and scheduling reports.
Project 2– Pentaho Interactive Report
Objective – Build complex dashboard with drill down reports and charts for analysing business trends.
Project 3– Pentaho Interactive Report
Objective – To perform automation testing in an ETL environment: checking the correctness of data transformation, loading data into the data warehouse without any loss or truncation, rejecting, replacing and reporting invalid data, and creating unit tests to target exceptions