What is Big Data, Where does Hadoop fit in, Hadoop Distributed File System – Replications, Block Size, Secondary Namenode, High Availability, Understanding YARN – ResourceManager, NodeManager, Difference between 1.x and 2.x
Hadoop 2.x Cluster Architecture , Federation and High Availability, A Typical Production Cluster setup , Hadoop Cluster Modes, Common Hadoop Shell Commands, Hadoop 2.x Configuration Files, Cloudera Single node cluster
What is Graph, Graph Representation, Breadth first Search Algorithm, Graph Representation of Map Reduce, How to do the Graph Algorithm, Example of Graph Map Reduce,
Understanding Apache Pig, the features, various uses and learning to interact with Pig
The syntax of Pig Latin, the various definitions, data sort and filter, data types, deploying Pig for ETL, data loading, schema viewing, field definitions, functions commonly used.
Various data types including nested and complex, processing data with Pig, grouped data iteration, practical exercise
Data set joining, data set splitting, various methods for data set combining, set operations, hands-on exercise
Understanding user defined functions, performing data processing with other languages, imports and macros, using streaming and UDFs to extend Pig, practical exercises
Working with real data sets involving Walmart and Electronic Arts as case study
Understanding Hive, traditional database comparison with Hive, Pig and Hive comparison, storing data in Hive and Hive schema, Hive interaction and various use cases of Hive
Understanding HiveQL, basic syntax, the various tables and databases, data types, data set joining, various built-in functions, deploying Hive queries on scripts, shell and Hue.
The various databases, creation of databases, data formats in Hive, data modeling, Hive-managed Tables, self-managed Tables, data loading, changing databases and Tables, query simplification with Views, result storing of queries, data access control, managing data with Hive, Hive Metastore and Thrift server.
Learning performance of query, data indexing, partitioning and bucketing
Deploying user defined functions for extending Hive
Deploying Hive for huge volumes of data sets and large amounts of querying
Working extensively with User Defined Queries, learning how to optimize queries, various methods to do performance tuning.
What is Impala?, How Impala Differs from Hive and Pig, How Impala Differs from Relational Databases, Limitations and Future Directions, Using the Impala Shell
Data Storage Overview, Creating Databases and Tables, Loading Data into Tables, HCatalog, Impala Metadata Caching
Partitioning Overview, Partitioning in Impala and Hive
Selecting a File Format, Tool Support for File Formats, Avro Schemas, Using Avro with Hive and Sqoop, Avro Schema Evolution, Compression
What is Hbase, Where does it fits, What is NOSQL
What is Spark, Comparison between Spark and Hadoop, Components of Spark
Apache Spark- Introduction, Consistency, Availability, Partition, Unified Stack Spark, Spark Components, Scalding example, mahout, storm, graph
Explain python example, Show installing a spark, Explain driver program, Explaining spark context with example, Define weakly typed variable, Combine scala and java seamlessly, Explain concurrency and distribution., Explain what is trait, Explain higher order function with example, Define OFI scheduler, Advantages of Spark, Example of Lamda using spark, Explain Mapreduce with example
Multi Node Cluster Setup using Amazon ec2 – Creating 4 node cluster setup, Running Map Reduce Jobs on Cluster
Putting it all together and Connecting Dots, Working with Large data sets, Steps involved in analyzing large data
How ETL tools work in Big data Industry, Connecting to HDFS from ETL tool and moving data from Local system to HDFS, Moving Data from DBMS to HDFS, Working with Hive with ETL Tool, Creating Map Reduce job in ETL tool, End to End ETL PoC showing big data integration with ETL tool.
Configuration overview and important configuration file, Configuration parameters and values, HDFS parameters MapReduce parameters, Hadoop environment setup, ‘Include’ and ‘Exclude’ configuration files, Lab: MapReduce Performance Tuning
Namenode/Datanode directory structures and files, File system image and Edit log, The Checkpoint Procedure, Namenode failure and recovery procedure, Safe Mode, Metadata and Data backup, Potential problems and solutions / what to look for, Adding and removing nodes, Lab: MapReduce File system Recovery
Best practices of monitoring a cluster, Using logs and stack traces for monitoring and troubleshooting, Using open-source tools to monitor the cluster
How to schedule Jobs on the same cluster, FIFO Schedule, Fair Scheduler and its configuration
Multi Node Cluster Setup using Amazon ec2 – Creating 4 node cluster setup, Running Map Reduce Jobs on Cluster
ZOOKEEPER Introduction, ZOOKEEPER use cases, ZOOKEEPER Services, ZOOKEEPER data Model, Znodes and its types, Znodes operations, Znodes watches, Znodes reads and writes, Consistency Guarantees, Cluster management, Leader Election, Distributed Exclusive Lock, Important points
Why Oozie?, Installing Oozie, Running an example, Oozie- workflow engine, Example M/R action, Word count example, Workflow application, Workflow submission, Workflow state transitions, Oozie job processing, Oozie security, Why Oozie security?, Job submission, Multi tenancy and scalability, Time line of Oozie job, Coordinator, Bundle, Layers of abstraction, Architecture, Use Case 1: time triggers, Use Case 2: data and time triggers, Use Case 3: rolling window
Overview of Apache Flume, Physically distributed Data sources, Changing structure of Data, Closer look, Anatomy of Flume, Core concepts, Event, Clients, Agents, Source, Channels, Sinks, Interceptors, Channel selector, Sink processor, Data ingest, Agent pipeline, Transactional data exchange, Routing and replicating, Why channels?, Use case- Log aggregation, Adding flume agent, Handling a server farm, Data volume per agent, Example describing a single node flume deployment
HUE introduction, HUE ecosystem, What is HUE?, HUE real world view, Advantages of HUE, How to upload data in File Browser?, View the content, Integrating users, Integrating HDFS, Fundamentals of HUE FRONTEND
IMPALA Overview: Goals, User view of Impala: Overview, User view of Impala: SQL, User view of Impala: Apache HBase, Impala architecture, Impala state store, Impala catalogue service, Query execution phases, Comparing Impala to Hive
Why testing is important, Unit testing, Integration testing, Performance testing, Diagnostics, Nightly QA test, Benchmark and end to end tests, Functional testing, Release certification testing, Security testing, Scalability Testing, Commissioning and Decommissioning of Data Nodes Testing, Reliability testing, Release testing
Understanding the Requirement, preparation of the Testing Estimation, Test Cases, Test Data, Test bed creation, Test Execution, Defect Reporting, Defect Retest, Daily Status report delivery, Test completion, ETL testing at every stage (HDFS, HIVE, HBASE) while loading the input (logs/files/records etc) using sqoop/flume which includes but not limited to data verification, Reconciliation, User Authorization and Authentication testing (Groups, Users, Privileges etc), Report defects to the development team or manager and driving them to closure, Consolidate all the defects and create defect reports, Validating new feature and issues in Core Hadoop.
Report defects to the development team or manager and driving them to closure, Consolidate all the defects and create defect reports, Responsible for creating a testing Framework called MR Unit for testing of Map-Reduce programs.
Automation testing using the OOZIE, Data validation using the query surge tool.
Test plan for HDFS upgrade, Test automation and result
How to test install and configure
Cloudera Certification Tips and Guidance and Mock Interview Preparation, Practical Development Tips and Techniques
Introducing Scala and deployment of Scala for Big Data applications and Apache Spark analytics.
The importance of Scala, the concept of REPL (Read Evaluate Print Loop), deep dive into Scala pattern matching, type interface, higher order function, currying, traits, application space and Scala for data analysis.
Learning about the Scala Interpreter, static object timer in Scala, testing String equality in Scala, Implicit classes in Scala, the concept of currying in Scala, various classes in Scala.
Learning about the Classes concept, understanding the constructor overloading, the various abstract classes, the hierarchy types in Scala, the concept of object equality, the val and var methods in Scala.
Understanding Sealed traits, wild, constructor, tuple, variable pattern, and constant pattern.
Understanding traits in Scala, the advantages of traits, linearization of traits, the Java equivalent and avoiding of boilerplate code.
Implementation of traits in Scala and Java, handling of multiple traits extending.
Introduction to Scala collections, classification of collections, the difference between Iterator, and Iterable in Scala, example of list sequence in Scala.
The two types of collections in Scala, Mutable and Immutable collections, understanding lists and arrays in Scala, the list buffer and array buffer, Queue in Scala, double-ended queue Deque, Stacks, Sets, Maps, Tuples in Scala.
Introduction to Scala packages and imports, the selective imports, the Scala test classes, introduction to JUnit test class, JUnit interface via JUnit 3 suite for Scala test, packaging of Scala applications in Directory Structure, example of Spark Split and Spark Scala.
Introduction to Spark, how Spark overcomes the drawbacks of working MapReduce, understanding in-memory MapReduce, Spark Hadoop YARN, HDFS Revision, YARN Revision, the overview of Spark and how it is better Hadoop, deploying Spark without Hadoop.
Spark installation guide, working with Spark Shell, the concept of Resilient Distributed Datasets (RDD), learning to do functional programming in Spark, the architecture of Spark.
Deep dive into Spark RDDs, the RDD general operations, a read-only partitioned collection of records, using the concept of RDD for faster and efficient data processing.
Understanding the concept of Key-Value pair in RDDs, learning how Spark makes MapReduce operations faster, various operations of RDD.
Comparing the Spark applications with Spark Shell, creating a Spark application using Scala or Java, deploying a Spark application, the web user interface of Spark application, a real world example of Spark and configuring of Spark.
Learning about Spark parallel processing, deploying on a cluster, introduction to Spark partitions, file-based partitioning of RDDs, understanding of HDFS and data locality, mastering the technique of parallel operations.
Understanding the RDD persistence overview, distributed persistence, RDD lineage
Understanding the Spark streaming, creating a Spark stream application, processing of Spark stream, streaming request count and DStreams.
Introduction to Spark multi-batch operations, state operations, sliding window operations and advanced data sources.
Learning about the Spark common use cases, the concept of iterative algorithm in Spark, analyzing with Spark graph processing, introduction to K-Means and machine learning.
Introduction to various variables in Spark like shared variables, broadcast variables, learning about accumulators, the common performance issues and troubleshooting the performance problems.
Learning about Spark SQL, the context of SQL in Spark for providing structured data processing, understanding the Data Frames in Spark, learning to query and transform data in Data Frames, how Data Frame provides the benefit of both Spark RDD and Spark SQL, deploying Hive on Spark as the execution engine.
Learning about the scheduling and partitioning in Spark, scheduling within and around applications, static partitioning, dynamic sharing, fair scheduling, Spark master high availability, standby Masters with Zookeeper, Single Node Recovery With Local File System, High Order Functions.
Understanding how to design capacity planning in Spark, creation of Maps, Transformations, the concept of concurrency in Java and Scala.
Understanding about log analysis with Spark, first log analyzers in Spark, working with various buffers like array, compact and protocol buffer.
Pentaho user console, Oveview of Pentaho Business Intelligence and Analytics tools, database dimensional modelling, using Star Schema for querying large data sets, understanding fact tables and dimensions tables, Snowflake Schema, principles of Slowly Changing Dimensions, knowledge of how high availability is supported for the DI server and BA server, managing Pentaho artifacts Knowledge of big data solution architectures
Hands-on Exercise – Schedule a report using user console, Create model using database dimensional modeling techniques, create a Star Schema for querying large data sets, Use fact tables and dimensions tables, manage Pentaho artifacts
Designing data models for reporting, Pentaho support for predictive analytics, Design a Streamlined Data Refinery (SDR) solution for a client
Hands-on Exercise – Design data models for reporting, Perform predictive analytics on a data set, design a Streamlined Data Refinery (SDR) solution for a dummy client
Understanding the basics of clustering in Pentaho Data Integration, creating a database connection, moving a CSV file input to table output and Microsoft Excel output, moving from Excel to data grid and log.
Hands-on Exercise – Create a database connection, move a csv file input to table output and Microsoft excel output, move data from excel to data grid and log
The Pentaho Data Integration Transformation steps, adding sequence, understanding calculator, Penthao number range, string replace, selecting field value, sorting and splitting rows, string operation, unique row and value mapper, Usage of metadata injection
Hands-on Exercise – Practice various steps to perform data integration transformation, add sequence, use calculator, work on number range, selecting field value, sorting and splitting rows, string operation, unique row and value mapper, use metadata injection
Working with secure socket command, Pentaho null value and error handling, Pentaho mail, row filter and priorities stream.
Hands-on Exercise – Work with secure socket command, Handle null values in the data, perform error handling, send email, get row filtered data, set stream priorities
Understanding Slowly Changing Dimensions, making ETL dynamic, dynamic transformation, creating folders, scripting, bulk loading, file management, working with Pentaho file transfer, Repository, XML, Utility and File encryption.
Hands-on Exercise – Make ETL dynamic transformation, create folders, write scripts, load bulk data, perform file management ops, work with Pentaho file transfer, XML utility and File encryption
Creating dynamic ETL, passing variable and value from job to transformation, deploying parameter with transformation, importance of Repository in Pentaho, database connection, environmental variable and repository import.
Hands-on Exercise – Create dynamic ETL, pass variable and value from job to transformation, deploy parameter with transformation, connect to a database, set pentaho environmental variables, import a repository in the pentaho workspace
Working with Pentaho dashboard and Report, effect of row bending, designing a report, working with Pentaho Server, creation of line, bar and pie chart in Pentaho, How to achieve localization in reports
Hands-on Exercise – Create Pentaho dashboard and report, check effect of row bending, design a report, work with Pentaho Server, create line, bar and pie chart in Pentaho, Implement localization in a report
Working with Pentaho Dashboard, passing parameters in Report and Dashboard, drill-down of Report, deploying Cubes for report creation, working with Excel sheet, Pentaho data integration for report creation.
Hands-on Exercise – Pass parameters in Report and Dashboard, deploy Cubes for report creation, drill-down in report to understand the entries, import data from an excel sheet, Perform data integration for report creation
What is a Cube? Creation and benefit of Cube, working with Cube, Report and Dashboard creation with Cube.
Hands-on Exercise – Create a Cube, create report and dashboard with Cube
Understanding the basics of Multi Dimensional Expression (MDX), basics of MDX, understanding Tuple, its implicit dimensions, MDX sets, level, members, dimensions referencing, hierarchical navigation, and meta data.
Hands-on Exercise – Work with MDX, Use MDX sets, level, members, dimensions referencing, hierarchical navigation, and meta data
Pentaho analytics for discovering, blending various data types and sizes, including advanced analytics for visualizing data across multiple dimensions, extending Analyzer functionality, embedding BA server reports, Pentaho REST APIs
Hands-on Exercise – Blend various data types and sizes, Perform advanced analytics for visualizing data across multiple dimensions, Embed BA server report
Knowledge of the PDI steps used to create an ETL job, Describing the PDI steps to create an ETL transformation, Describing the use of property files
Hands-on Exercise – Create an ETL transformation using PDI steps, Use property files
Deploying ETL capabilities for working on the Hadoop ecosystem, integrating with HDFS and moving data from local file to distributed file system, deploying Apache Hive, designing MapReduce jobs, complete Hadoop integration with ETL tool.
Hands-on Exercise – Deploy ETL capabilities for working on the Hadoop ecosystem, Integrate with HDFS and move data from local file to distributed file system, deploy Apache Hive, design MapReduce jobs
Creating interactive dashboards for visualizing highly graphical representation of data for improving key business performance.
Hands-on Exercise – Create interactive dashboards for visualizing graphical representation of data
Managing BA server logging, tuning Pentaho reports, monitoring the performance of a job or a transformation, Auditing in Pentaho
Hands-on Exercise – Manage logging in BA server, Fine tune Pentaho report, Monitor the performance of an ETL job
Integrating user security with other enterprise systems, Extending BA server content security, Securing data, Pentaho’s support for multi-tenancy, Using Kerberos with Pentaho
Hands-on Exercise – Configure security settings to implement high level security
What is Python Language and features, Why Python and why it is different from other languages, Installation of Python, Anaconda Python distribution for Windows, Mac, Linux. Run a sample python script, working with Pyhton IDE’s. Running basic python commands – Data types, Variables,Keywords,etc
Hands-on Exercise – Install Anaconda Python distribution for your OS (Windows/Linux/Mac)
Indentation(Tabs and Spaces) and Code Comments (Pound # character); Variables and Names; Built-in Data Types in Python – Numeric: int, float, complex – Containers: list, tuple, set, dict – Text Sequence: Str (String) – Others: Modules, Classes, Instances, Exceptions, Null Object, Ellipsis Object – Constants: False, True, None, NotImplemented, Ellipsis, __debug__; Basic Operators: Arithmetic, Comparison, Assignment, Logical, Bitwise, Membership, Indentity; Slicing and The Slice Operator [n:m]; Control and Loop Statements: if, for, while, range(), break, continue, else;
Hands-on Exercise – Write your first Python program Write a Python Function (with and without parameters) Use Lambda expression Write a class, create a member function and a variable, Create an object Write a for loop to print all odd numbers
Classes – classes and objects, access modifiers, instance and class members OOPS paradigm – Inheritance, Polymorphism and Encapsulation in Python. Functions: Parameters and Return Types; Lambda Expressions, Making connection with Database for pulling data.
Open a File, Read from a File, Write into a File; Resetting the current position in a File; The Pickle (Serialize and Deserialize Python Objects); The Shelve (Overcome the limitation of Pickle); What is an Exception; Raising an Exception; Catching an Exception;
Hands-on Exercise – Open a text file and read the contents, Write a new line in the opened file, Use pickle to serialize a python object, deserialize the object, Raise an exception and catch it
Arrays and Matrices, ND-array object, Array indexing, Datatypes, Array math Broadcasting, Std Deviation, Conditional Prob, Covariance and Correlation.
Hands-on Exercise – Import numpy module, Create an array using ND-array, Calculate std deviation on an array of numbers, Calculate correlation between two variables
Builds on top of NumPy, SciPy and its characteristics, subpackages: cluster, fftpack, linalg, signal, integrate, optimize, stats; Bayes Theorem using SciPy
Hands-on Exercise – Import SciPy, Apply Bayes theorem using SciPy on the given dataset
Plotting Grapsh and Charts (Line, Pie, Bar, Scatter, Histogram, 3-D); Subplots; The Matplotlib API
Hands-on Exercise – Plot Line, Pie, Scatter, Histogram and other charts using Matplotlib
Dataframes, NumPy array to a dataframe; Import Data (csv, json, excel, sql database); Data operations: View, Select, Filter, Sort, Groupby, Cleaning, Join/Combine, Handling Missing Values; Introduction to Machine Learning(ML); Linear Regression; Time Series
Hands-on Exercise – Import Pandas, Use it to import data from a json file,,Select records by a group and apply filter on top of that, View the records, Perform Linear Regression analysis, Create a Time Series
Introduction to Natural Language Processing (NLP); NLP approach for Text Data; Environment Setup (Jupyter Notebook); Sentence Analysis; ML Algorithms in Scikit-Learn; What is Bag of Words Model; Feature Extraction from Text; Model Training; Search Grid; Multiple Parameters; Build a Pipeline
Hands-on Exercise – Setup Jupyter Notebook environment, Load a dataset in Jupyter, Use algorithm in Scikit-Learn package to perform ML techniques, Train a model Create a search grid
What is Web Scraping; Web Scraping Libraries (Beautifulsoup, Scrapy); Installation of Beautifulsoup; Install lxml Python Parser; Making a Soup Object using an input html; Navigating Py Objects in the Soup Tree; Searching the Tree; Output Print; Parsing Full or Partial
Hands-on Exercise – Install Beautifulsoup and lxml Python parser, Make a Soup object using an input html file, Navigate Py objects in the soup tree, Search tree, Print output
Understanding Hadoop and its various components; Hadoop ecosystem and Hadoop common; HDFS and MapReduce Architecture; Python scripting for MapReduce Jobs on Hadoop framework
Hands-on Exercise – Write a basic MapReduce Job in Python and connect with Hadoop Framework to perform the task
What is Spark,understanding RDDs, Spark Libs, writing Spark code using python,Spark Machine Libraries Mlib, Regression, Classification and Clustering using Spark MLlib
Hands-on Exercise – Implement sandbox, Run a python code in sandbox, Work with HDFS file system from sandbox
RDBMS, types of relational databases, challenges of RDBMS, NoSQL database, its significance, how NoSQL suits Big Data needs, Introduction to MongoDB and its advantages, MongoDB installation, JSON features, data types and examples.
Installing MongoDB, basic MongoDB commands and operations, MongoChef (MongoGUI) Installation, MongoDB Data types.
Hands-on Exercise – Install MongoDB, Install MongoChef (MongoGUI)
The need for NoSQL, types of NoSQL databases, OLTP, OLAP, limitations of RDBMS, ACID properties, CAP Theorem, Base property, learning about JSON/BSON, database collection & document, MongoDB uses, MongoDB Write Concern – Acknowledged, Replica Acknowledged, Unacknowledged, Journaled, Fsync.
Hands-on Exercise – Write a JSON document
Understanding CRUD and its functionality, CRUD concepts, MongoDB Query & Syntax, read and write queries and query optimization.
Hands-on Exercise – Use Insert query to Create a data entry, Use find query to Read data, Use update and replace queris to Update, Use delete query operations on a DB file
Concepts of data modeling, difference between MongoDB and RDBMS modeling, Model tree structure, operational strategies, monitoring and backup.
Hands-on Exercise – Write a data model tree structure for a family hierarchy
In this module you will learn MongoDB® Administration activities such as Health Check, Backup, Recovery, database sharding and profiling, Data Import/Export, Performance tuning etc.
Hands-on Exercise – Use shard key and hashed shard keys, Perform backup and recovery of a dummy dataset, Import data from a csv file, Export data to a csv file
Concepts of data aggregation and types, data indexing concepts, properties and variations.
Hands-on Exercise – Do aggregation using pipeline, sort, skip and limit, Create index on data using single key, using multikey
Understanding database security risks, MongoDB security concept and security approach, MongoDB integration with Java and Robomongo.
Hands-on Exercise – MongoDB integration with Java and Robomongo.
Implementing techniques to work with variety of unstructured data like images, videos, log data, and others, understanding GridFS MongoDB file system for storing data.
Hands-on Exercise – Work with variety of unstructured data like images, videos, log data, and others
Introduction to Java Programming, Defining Java, Need for Java, Platform Independent in Java, Define JRE,JVM, JDK, Important Features and Evolution of Java
Overview of Coding basics, Setting up the required environment, Knowing the available IDEs, Writing a Basic-level Java Program, Define Package, What are Java Comments?, Understanding the concept of Reserved Words, Introduction to Java Statements, What are Blocks in Java, Explain a Class, Different Methods
Big Data characteristics, understanding Hadoop distributed computing, the Bayesian Law, deploying Storm for real time analytics, the Apache Storm features, comparing Storm with Hadoop, Storm execution, learning about Tuple, Spout, Bolt.
Installing the Apache Storm, various types of run modes of Storm.
Understanding Apache Storm and the data model.
Installation of Apache Kakfa and its configuration.
Understanding of advanced Storm topics like Spouts, Bolts, Stream Groupings, Topology and its Life cycle, learning about Guaranteed Message Processing.
Various Grouping types in Storm, reliable and unreliable messages, Bolt structure and life cycle, understanding Trident topology for failure handling, process, Call Log Analysis Topology for analyzing call logs for calls made from one number to another.
Understanding of Trident Spouts and its different types, the various Trident Spout interface and components, familiarizing with Trident Filter, Aggregator and Functions, a practical and hands-on use case on solving call log problem using Storm Trident.
Various components, classes and interfaces in storm like – Base Rich Bolt Class, i RichBolt Interface, i RichSpout Interface, Base Rich Spout class and the various methodology of working with them.
Understanding Cassandra, its core concepts, its strengths and deployment.
Twitter Boot Stripping, detailed understanding of Boot Stripping, concepts of Storm, Storm Development Environment.
Getting started with HBase, Core concepts of HBase, Understanding HBase with an Example
Why HBase?, Where to use HBase?, What is NoSQL?
HDFS vs.HBase, HBase Use Cases, Data Modeling HBase
HBase Architecture, Main components of HBase Cluster
HBase Shell, HBase API, Primary Operations, Advanced Operations
Create a Table and Insert Data into it, Integration of Hive with HBase, Load Utility
Putting Folder to VM, File loading with both load Utility
Introduction to Cassandra, its strengths and deployment areas
Significance of NoSQL, RDBMS Replication, Key Challenges, types of NoSQL, benefits and drawbacks, salient features of NoSQL database. CAP Theorem, Consistency.
Installation, introduction to Cassandra, key concepts and deployment of non relational database, column-oriented database, Data Model – column, column family,
Token calculation, Configuration overview, Node tool, Validators, Comparators, Expiring column, QA
How Cassandra modelling varies from Relational database modelling, Cassandra modelling steps, introduction to Time Series modelling, comparing Column family Vs. Super Column family, Counter column family, Partitioners, Partitioners strategies, Replication, Gossip protocols, Read operation, Consistency, Comparison
Creation of multi node cluster, node settings, Key and Row cache, System Key space, understanding of Read Operation, Cassandra Commands overview, VNodes, Column family
JSON, Hector client, AVRO, Thrift, JAVA code writing method, Hector tag
Cassandra management, commands of node tool, MapReduce and Cassandra, Secondary index, Datastax Installation
Rules of Cassandra data modelling, increasing data writes, duplication, and reducing data reads, modelling data around queries, creating table for data queries
Understanding the Java application creation methodology, learning key drivers, deploying the IDE for Cassandra applications,cluster connection and data query implementation
Learning about Node Tool Utility, cluster management using Command Line Interface, Cassandra management and monitoring via DataStax Ops Center.
Cassandra client connectivity, connection pool internals, API, important features and concepts of Hector client, Thrift, JAVA code, Summarization.
Understanding what is Apache Kafka, the various components and use cases of Kafka, implementing Kafka on a single node.
Learning about the Kafka terminology, deploying single node Kafka with independent Zookeeper, adding replication in Kafka, working with Partitioning and Brokers, understanding Kafka consumers, the Kafka Writes terminology, various failure handling scenarios in Kafka.
Introduction to multi node cluster setup in Kafka, the various administration commands, leadership balancing and partition rebalancing, graceful shutdown of kafka Brokers and tasks, working with the Partition Reassignment Tool, cluster expending, assigning Custom Partition, removing of a Broker and improving Replication Factor of Partitions.
Understanding the need for Kafka Integration, successfully integrating it with Apache Flume, steps in integration of Flume with Kafka as a Source.
Detailed understanding of the Kafka and Flume Integration, deploying Kafka as a Sink and as a Channel, introduction to PyKafka API and setting up the PyKafka Environment.
Connecting Kafka using PyKafka, writing your own Kafka Producers and Consumers, writing a random JSON Producer, writing a Consumer to read the messages from a topic, writing and working with a File Reader Producer, writing a Consumer to store topics data into a file.
Project 1 – Working with MapReduce, Hive, Sqoop
This project is involved with working on the various Hadoop components like MapReduce, Apache Hive and Apache Sqoop. Work with Sqoop to import data from relational database management system like MySQL data into HDFS. Deploy Hive for summarizing data, querying and analysis. Convert SQL queries using HiveQL for deploying MapReduce on the transferred data. You will gain considerable proficiency in Hive, and Sqoop after completion of this project.
Project 2 – Work on MovieLens data for finding top records
Data – MovieLens dataset
In this project you will work exclusively on data collected through MovieLens available rating data sets. The project involves the following important components:
Project 3 – Hadoop YARN Project – End to End PoC
In this project you will work on a live Hadoop YARN project. YARN is part of the Hadoop 2.0 ecosystem that lets Hadoop to decouple from MapReduce and deploy more competitive processing and wider array of applications. You will work on the YARN central Resource Manager. The salient features of this project include:
Project 4 – Partitioning Tables in Hive
This project involves working with Hive table data partitioning. Ensuring the right partitioning helps to read the data, deploy it on the HDFS, and run the MapReduce jobs at a much faster rate. Hive lets you partition data in multiple ways like:
This will give you hands-on experience in partitioning of Hive tables manually, deploying single SQL execution in dynamic partitioning, bucketing of data so as to break it into manageable chunks.
Project 5 – Connecting Pentaho with Hadoop Ecosystem
This project lets you connect Pentaho with the Hadoop ecosystem. Pentaho works well with HDFS, HBase, Oozie and Zookeeper. You will connect the Hadoop cluster with Pentaho data integration, analytics, Pentaho server and report designer. Some of the components of this project include the following:
Project 6 – Multi-node cluster setup
This is a project that gives you opportunity to work on real world Hadoop multi-node cluster setup in a distributed environment. The major components of this project involve:
You will get a complete demonstration of working with various Hadoop cluster master and slave nodes, installing Java as a prerequisite for running Hadoop, installation of Hadoop and mapping the nodes in the Hadoop cluster.
Project 7 – Hadoop Testing using MR
In this project you will gain proficiency in Hadoop MapReduce code testing using MRUnit. You will learn about real world scenarios of deploying MRUnit, Mockito, and PowerMock. Some of the important aspects of this project include:
After completion of this project you will be well-versed in test driven development and will be able to write light-weight test units that work specifically on the Hadoop architecture.
Project 8 – Hadoop Weblog Analytics
Data – Weblogs
This project is involved with making sense of all the web log data in order to derive valuable insights from it. You will work with loading the server data onto a Hadoop cluster using various techniques. The various modules of this project include:
The web log data can include various URLs visited, cookie data, user demographics, location, date and time of web service access, etc. In this project you will transport the data using Apache Flume or Kafka, workflow and data cleansing using MapReduce, Pig or Spark. The insight thus derived can be used for analyzing customer behavior and predict buying patterns.
Project 9 – Hadoop Maintenance
This project is involved with working on the Hadoop cluster for maintaining and managing it. You will work on a number of important tasks like:
Project 1: Movie Recommendation
Topics – This is a project wherein you will gain hands-on experience in deploying Apache Spark for movie recommendation. You will be introduced to the Spark Machine Learning Library, a guide to MLlib algorithms and coding which is a machine learning library. Understand how to deploy collaborative filtering, clustering, regression, and dimensionality reduction in MLlib. Upon completion of the project you will gain experience in working with streaming data, sampling, testing and statistics.
Project 2: Twitter API Integration for tweet Analysis
Topics – With this project you will learn to integrate Twitter API for analyzing tweets. You will write codes on the server side using any of the scripting languages like PHP, Ruby or Python, for requesting the Twitter API and get the results in JSON format. You will then read the results and perform various operations like aggregation, filtering and parsing as per the need to come up with tweet analysis.
Project 3: Data Exploration Using Spark SQL – Wikipedia data set
Topics – This project lets you work with Spark SQL. You will gain experience in working with Spark SQL for combining it with ETL applications, real time analysis of data, performing batch analysis, deploying machine learning, creating visualizations and processing of graphs.
Project 1– Pentaho Interactive Report
Data– Sales, Customer, Product
Objective – In this Pentaho project you will be exclusively working on creating Pentaho interactive reports for sales, customer and product data fields. As part of the project you will learn to create a data source, build a Mondrian cube which is represented in an XML file. You will gain advanced experience in managing data sources, building and formatting Pentaho report, change the report template and scheduling of reports.
Project 2– Pentaho Interactive Report
Objective – Build complex dashboard with drill down reports and charts for analysing business trends.
Project 3– Pentaho Interactive Report
Objective – To do automation testing in ETL environment, Check the correctness of data transformation, Data loading in datawarehouse without any loss or truncation, Rejecting, Replacing and Reporting invalid data, Creation of unit tests to target exceptions
Project 1: – Python Web Scraping for Data Science
In this project you will be introduced to the process of web scraping using Python. It involves installation of Beautiful Soup, web scraping libraries, working on common data and page format on the web, learning the important kinds of objects, Navigable String, deploying the searching tree, navigation options, parser, search tree, searching by CSS class, list, function and keyword argument.
Objective – To generate a password using Python code which would be tough to guess
Domain – Finance
Objective – The project aims to find the most impacting factors in preferences of pre-paid model, also identifies which are all the variables highly correlated with impacting factors
Domain – Stock Market
Objective – This project focuses on Machine Learning by creating predictive data model to predict future stock prices
Project 5 : Server logs/Firewall logs
Objective – This includes the process of loading the server logs into the cluster using Flume. It can then be refined using Pig Script, Ambari and HCatlog. You can then visualize it using elastic search and excel.
This project task includes:
Java is one of the most popular programming languages for working with MongoDB. This project tells you how to work with the MongoDB Java Driver, and using MongoDB as a Java Developer. Become proficient in creating a table for inserting video using Java programming. Some of the tasks and steps involved are as below–
Project – Library Management System
Problem Statement – It creates library management system project which includes following functionalities:
Add book, Add Member, Issue Book, Return Book, Available Book etc.
Project 1 – Integrate Hive & Java with HBase
Topics : This is project that gives you hands-on experience to connect Hive and Java with HBase. Hive is used for querying using HiveQL that translates SQL-like queries into MapReduce jobs on Hadoop framework. In this project you will do HBase Installation, create Hive for HBase, import the data onto Hive from HBase, use HiveQL for Hive Table data querying and analyzing, and managing the HBase Table. You will also learn to Integrate Java with HBase to run HBase queries using Java applications that you
Type : Deploying the IDE for Cassandra applications
Topics : This project gives you a hands-on experience in installing and working with Apache Cassandra which is a high performance and extremely scalable database for distributed data with no single point of failure. You will deploy the Java Integrated Development Environment for running Cassandra, learn about the key drivers, work with Cassandra applications in a cluster setup and implement data querying techniques.
Type : Multi Broker Kafka Implementation
Topics : In this project you will learn about the Apache Kakfa which is a platform for handling real-time data feeds. You will exclusively work with Kafka brokers, understand partitioning, Kafka consumers, the terminology used for Kafka writes and failure handling in Kafka, understand how to deploy a single node Kafka with independent Zookeeper. Upon completion of the project you will gain considerable experience in working in a real world scenario for processing streaming data within an enterprise infrastructure.
Intellipaat’s Masters program is a structured learning path specially designed by industry experts which ensures that you transform into Big Data expert. Individual courses at Intellipaat focus on one or two specializations. However, if you have to masters big data then this program is for you
Intellipaat is the pioneer of Big Data Architect training we provide:
Intellipaat offers the self-paced training and online instructor-led training.
Hadoop developer, Hadoop admin, Hadoop analyst, Hadoop testing, Spark & Scala, Pentaho, Python, MongoDB are online instructor-led courses
Java, Apache Storm, Hbase, Cassandra, Apache kafka are self-paced courses
If you have any queries you can contact our 24/7 dedicated support to raise a ticket. We provide you email support and solution to your queries. If the query is not resolved by email we can arrange for a one-on-one session with our trainers. The best part is that you can contact Intellipaat even after completion of training to get support and assistance. There is also no limit on the number of queries you can raise when it comes to doubt clearance and query resolution.
We provide you with the opportunity to work on 28 real world projects wherein you can apply your knowledge and skills that you acquired through our training, making you perfectly industry- ready
Yes, Intellipaat does provide you with placement assistance. We have tie-ups with 80+ organizations including Ericsson, Cisco, Cognizant, TCS, among others that are looking for Hadoop professionals and we would be happy to assist you with the process of preparing yourself for the interview and the job
Upon successful completion of training you have to take a set of quizzes, complete the projects and upon review and on scoring over 60% marks in the qualifying quiz the official Intellipaat verified certificate is awarded.The Intellipaat Certification is a seal of approval and is highly recognized in 80+ corporations around the world including many in the Fortune 500 list of companies.
Preferably 8 GB RAM (Windows or Mac) with a good internet connection
All the instructors are from the industry with over 18+ years’ experience. They are subjects experts and each of them has gone through rigorous selection process.
This is a comprehensive course that is designed to clear multiple certifications viz.
The entire training course content is in line with respective certification program and helps you clear the requisite certification exam with ease and get the best jobs in the top MNCs.
As part of this training you will be working on real time projects and assignments that have immense implications in the real world industry scenario thus helping you fast track your career effortlessly.
At the end of this training program there will be quizzes that perfectly reflect the type of questions asked in the respective certification exams and helps you score better marks in certification exam.
Intellipaat Course Completion certificate will be awarded on the completion of Project work (on expert review) and upon scoring of at least 60% marks in the quiz. Intellipaat certification is well recognized in top 80+ MNCs like Ericsson, Cisco, Cognizant, Sony, Mu Sigma, Saint-Gobain, Standard Chartered, TCS, Genpact, Hexaware, etc. The entire training course content is in line with respective certification program and helps you clear the requisite certification exam with ease and get the best jobs in the top MNCs.
This is a comprehensive course that is designed to clear multiple certifications viz. CCA Spark and Hadoop Developer (CCA175) Pentaho Business Analytics Certification Exam C100 DEV: MongoDB Certified Developer Associate Exam Java SE Programmer Certification Apache Cassandra Data Stax Certification The entire training course content is in line with respective certification program and helps you clear the requisite certification exam with ease and get the best jobs in the top MNCs.Intellipaat enjoys strong relationship with 80+ MNCs across the globe. We have a dedicated team who will help you with your resume building once you complete the course and your resume will be forwarded to partner MNCs. Intellipaat don’t charge any extra fees for passing the resume to our partners and clients
"PMI®", "PMP®" and "PMI-ACP®" are registered marks of the Project Management Institute, Inc.
The Open Group®, TOGAF® are trademarks of The Open Group.
The Swirl logoTM is a trade mark of AXELOS Limited.
ITIL® is a registered trade mark of AXELOS Limited.
PRINCE2® is a Registered Trade Mark of AXELOS Limited.
Certified ScrumMaster® (CSM) and Certified Scrum Trainer® (CST) are registered trademarks of SCRUM ALLIANCE®
Professional Scrum Master is a registered trademark of Scrum.org