What is Big Data, Where does Hadoop fit in, Hadoop Distributed File System – Replications, Block Size, Secondary Namenode, High Availability, Understanding YARN – ResourceManager, NodeManager, Difference between 1.x and 2.x
Hadoop 2.x Cluster Architecture, Federation and High Availability, A Typical Production Cluster setup, Hadoop Cluster Modes, Common Hadoop Shell Commands, Hadoop 2.x Configuration Files, Cloudera Single node cluster
What is a Graph, Graph Representation, Breadth first Search Algorithm, Graph Representation of Map Reduce, How to do the Graph Algorithm, Example of Graph Map Reduce
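As a rough illustration of how breadth first search can be phrased as repeated map and reduce rounds, here is a minimal single-process sketch in plain Python. It does not use Hadoop; the function names (`bfs_map`, `bfs_reduce`, `bfs_mapreduce`) and the sample graph are hypothetical and chosen only for this example.

```python
from collections import defaultdict

INF = float("inf")

def bfs_map(node, dist, neighbors):
    """Map step: re-emit the node's own distance and offer dist + 1 to each neighbor."""
    yield node, dist
    if dist != INF:
        for neighbor in neighbors:
            yield neighbor, dist + 1

def bfs_reduce(candidates):
    """Reduce step: keep the smallest distance proposed for a node."""
    return min(candidates)

def bfs_mapreduce(graph, source):
    """Run map/reduce rounds until no distance changes (one round per BFS level)."""
    dist = {node: INF for node in graph}
    dist[source] = 0
    while True:
        grouped = defaultdict(list)  # stands in for the shuffle phase
        for node, neighbors in graph.items():
            for key, value in bfs_map(node, dist[node], neighbors):
                grouped[key].append(value)
        new_dist = {node: bfs_reduce(values) for node, values in grouped.items()}
        if new_dist == dist:
            return dist
        dist = new_dist

# Hypothetical adjacency-list graph; every neighbor also appears as a key.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
distances = bfs_mapreduce(graph, "A")
```

Each pass through the loop corresponds to one MapReduce job in a real cluster, which is why graph algorithms of this kind are usually expressed as a chain of jobs.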
Understanding Apache Pig, the features, various uses and learning to interact with Pig
The syntax of Pig Latin, the various definitions, data sort and filter, data types, deploying Pig for ETL, data loading, schema viewing, field definitions, functions commonly used.
Various data types including nested and complex, processing data with Pig, grouped data iteration, practical exercise
Data set joining, data set splitting, various methods for data set combining, set operations, hands-on exercise
Understanding user defined functions, performing data processing with other languages, imports and macros, using streaming and UDFs to extend Pig, practical exercises
Working with real data sets involving Walmart and Electronic Arts as case study
Understanding Hive, traditional database comparison with Hive, Pig and Hive comparison, storing data in Hive and Hive schema, Hive interaction and various use cases of Hive
Understanding HiveQL, basic syntax, the various tables and databases, data types, data set joining, various built-in functions, deploying Hive queries on scripts, shell and Hue.
The various databases, creation of databases, data formats in Hive, data modeling, Hive-managed Tables, self-managed Tables, data loading, changing databases and Tables, query simplification with Views, result storing of queries, data access control, managing data with Hive, Hive Metastore and Thrift server.
Learning performance of query, data indexing, partitioning and bucketing
Deploying user defined functions for extending Hive
Deploying Hive for huge volumes of data sets and large amounts of querying
Working extensively with User Defined Queries, learning how to optimize queries, various methods to do performance tuning.
What is Impala?, How Impala Differs from Hive and Pig, How Impala Differs from Relational Databases, Limitations and Future Directions, Using the Impala Shell
Data Storage Overview, Creating Databases and Tables, Loading Data into Tables, HCatalog, Impala Metadata Caching
Partitioning Overview, Partitioning in Impala and Hive
Selecting a File Format, Tool Support for File Formats, Avro Schemas, Using Avro with Hive and Sqoop, Avro Schema Evolution, Compression
What is HBase, Where does it fit, What is NoSQL
What is Spark, Comparison between Spark and Hadoop, Components of Spark
Apache Spark – Introduction, Consistency, Availability, Partition, Unified Stack Spark, Spark Components, Scalding example, Mahout, Storm, graph
Explain Python example, Show how to install Spark, Explain driver program, Explain Spark context with example, Define weakly typed variable, Combine Scala and Java seamlessly, Explain concurrency and distribution, Explain what is a trait, Explain higher order function with example, Define OFI scheduler, Advantages of Spark, Example of Lambda using Spark, Explain MapReduce with example
Multi Node Cluster Setup using Amazon EC2 – Creating 4 node cluster setup, Running MapReduce Jobs on Cluster
Putting it all together and Connecting Dots, Working with Large data sets, Steps involved in analyzing large data
How ETL tools work in Big data Industry, Connecting to HDFS from ETL tool and moving data from Local system to HDFS, Moving Data from DBMS to HDFS, Working with Hive with ETL Tool, Creating Map Reduce job in ETL tool, End to End ETL PoC showing big data integration with ETL tool.
Configuration overview and important configuration files, Configuration parameters and values, HDFS parameters, MapReduce parameters, Hadoop environment setup, ‘Include’ and ‘Exclude’ configuration files, Lab: MapReduce Performance Tuning
Namenode/Datanode directory structures and files, File system image and Edit log, The Checkpoint Procedure, Namenode failure and recovery procedure, Safe Mode, Metadata and Data backup, Potential problems and solutions / what to look for, Adding and removing nodes, Lab: MapReduce File system Recovery
Best practices of monitoring a cluster, Using logs and stack traces for monitoring and troubleshooting, Using open-source tools to monitor the cluster
How to schedule Jobs on the same cluster, FIFO Scheduler, Fair Scheduler and its configuration
Multi Node Cluster Setup using Amazon EC2 – Creating 4 node cluster setup, Running MapReduce Jobs on Cluster
ZooKeeper Introduction, ZooKeeper use cases, ZooKeeper Services, ZooKeeper data Model, Znodes and their types, Znode operations, Znode watches, Znode reads and writes, Consistency Guarantees, Cluster management, Leader Election, Distributed Exclusive Lock, Important points
Why Oozie?, Installing Oozie, Running an example, Oozie- workflow engine, Example M/R action, Word count example, Workflow application, Workflow submission, Workflow state transitions, Oozie job processing, Oozie security, Why Oozie security?, Job submission, Multi tenancy and scalability, Time line of Oozie job, Coordinator, Bundle, Layers of abstraction, Architecture, Use Case 1: time triggers, Use Case 2: data and time triggers, Use Case 3: rolling window
Overview of Apache Flume, Physically distributed Data sources, Changing structure of Data, Closer look, Anatomy of Flume, Core concepts, Event, Clients, Agents, Source, Channels, Sinks, Interceptors, Channel selector, Sink processor, Data ingest, Agent pipeline, Transactional data exchange, Routing and replicating, Why channels?, Use case- Log aggregation, Adding flume agent, Handling a server farm, Data volume per agent, Example describing a single node flume deployment
HUE introduction, HUE ecosystem, What is HUE?, HUE real world view, Advantages of HUE, How to upload data in File Browser?, View the content, Integrating users, Integrating HDFS, Fundamentals of HUE FRONTEND
IMPALA Overview: Goals, User view of Impala: Overview, User view of Impala: SQL, User view of Impala: Apache HBase, Impala architecture, Impala state store, Impala catalogue service, Query execution phases, Comparing Impala to Hive
Why testing is important, Unit testing, Integration testing, Performance testing, Diagnostics, Nightly QA test, Benchmark and end to end tests, Functional testing, Release certification testing, Security testing, Scalability Testing, Commissioning and Decommissioning of Data Nodes Testing, Reliability testing, Release testing
Understanding the Requirement, preparation of the Testing Estimation, Test Cases, Test Data, Test bed creation, Test Execution, Defect Reporting, Defect Retest, Daily Status report delivery, Test completion, ETL testing at every stage (HDFS, Hive, HBase) while loading the input (logs/files/records etc.) using Sqoop/Flume, which includes but is not limited to data verification, Reconciliation, User Authorization and Authentication testing (Groups, Users, Privileges etc.), Reporting defects to the development team or manager and driving them to closure, Consolidating all the defects and creating defect reports, Validating new features and issues in Core Hadoop.
Creating a testing framework using MRUnit for testing MapReduce programs.
Automation testing using Oozie, Data validation using the QuerySurge tool.
Test plan for HDFS upgrade, Test automation and result
How to test installation and configuration
Cloudera Certification Tips and Guidance and Mock Interview Preparation, Practical Development Tips and Techniques
R language for statistical programming, the various features of R, introduction to R Studio, the statistical packages, familiarity with different data types and functions, learning to deploy them in various scenarios, use SQL to apply ‘join’ function, components of R Studio like code editor, visualization and debugging tools, learn about rbind.
R Functions, code compilation and data in well-defined format called R-Packages, learn about R-Package structure, Package metadata and testing, CRAN (Comprehensive R Archive Network), Vector creation and variables values assignment.
R functionality, Rep Function, generating Repeats, Sorting and generating Factor Levels, Transpose and Stack Function.
Introduction to matrix and vector in R, understanding the various functions like Merge, Strsplit, Matrix manipulation, rowSums, rowMeans, colMeans, colSums, sequencing, repetition, indexing and other functions.
Understanding subscripts in plots in R, how to obtain parts of vectors, using subscripts with arrays, as logical variables, with lists, understanding how to read data from external files.
Generate plot in R, Graphs, Bar Plots, Line Plots, Histogram, components of Pie Chart.
Understanding Analysis of Variance (ANOVA) statistical technique, working with Pie Charts, Histograms, deploying ANOVA with R, one way ANOVA, two way ANOVA.
K-Means Clustering for Cluster & Affinity Analysis, Cluster Algorithm, cohesive subset of items, solving clustering issues, working with large datasets, association rule mining affinity analysis for data mining and analysis and learning co-occurrence relationships.
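To make the clustering algorithm concrete, the following is a minimal one-dimensional K-Means sketch in plain Python (the course itself works in R). The function name `kmeans_1d`, the sample points and the starting centroids are assumptions made for this illustration only.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its cluster."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data with two obvious groups and hand-picked starting centroids.
final_centroids, final_clusters = kmeans_1d([1, 2, 3, 10, 11, 12], [1, 12])
```

With real data the result depends on the starting centroids, which is why implementations usually run several random restarts.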
Introduction to Association Rule Mining, the various concepts of Association Rule Mining, various methods to predict relations between variables in large datasets, the algorithm and rules of Association Rule Mining, understanding single cardinality.
Understanding what is Simple Linear Regression, the various equations of Line, Slope, Y-Intercept Regression Line, deploying analysis using Regression, the least square criterion, interpreting the results, standard error to estimate and measure of variation.
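The least square criterion and the standard error of the estimate can be worked through by hand. Below is a small sketch in plain Python (not R) that computes the slope, the y-intercept and the standard error for a made-up data set; the function names and the numbers are illustrative assumptions.

```python
import math

def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def standard_error(xs, ys, slope, intercept):
    """Standard error of the estimate: sqrt(SSE / (n - 2))."""
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))

# Made-up sample data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_line(xs, ys)
std_err = standard_error(xs, ys, slope, intercept)
```

For this sample the fitted line is approximately y = 2.2 + 0.6x.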
Scatter Plots, Two variable Relationship, Simple Linear Regression analysis, Line of best fit
Deep understanding of the measure of variation, the concept of co-efficient of determination, F-Test, the test statistic with an F-distribution, advanced regression in R, prediction linear regression.
Logistic Regression Mean, Logistic Regression in R.
Advanced logistic regression, understanding how to do prediction using logistic regression, ensuring the model is accurate, understanding sensitivity and specificity, confusion matrix, what is ROC, a graphical plot illustrating binary classifier system, ROC curve in R for determining sensitivity/specificity trade-offs for a binary classifier.
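Sensitivity and specificity follow directly from the four cells of the confusion matrix. The sketch below, in plain Python rather than R, counts those cells for a made-up set of labels; `confusion_counts` and the label vectors are hypothetical names and data.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),
    }

# Made-up labels: 4 actual positives, 6 actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

cm = confusion_counts(actual, predicted)
sensitivity = cm["tp"] / (cm["tp"] + cm["fn"])  # true positive rate
specificity = cm["tn"] / (cm["tn"] + cm["fp"])  # true negative rate
```

Repeating this calculation at different probability cut-offs gives the points that make up a ROC curve.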
Detailed understanding of ROC, area under ROC Curve, converting the variable, data set partitioning, understanding how to check for multicollinearity, how two or more variables are highly correlated, building of model, advanced data set partitioning, interpreting of the output, predicting the output, detailed confusion matrix, deploying the Hosmer-Lemeshow test for checking whether the observed event rates match the expected event rates.
Data analysis with R, understanding the Wald test, McFadden’s pseudo R-squared, the significance of the area under ROC Curve, Kolmogorov-Smirnov Chart, which is a non-parametric test of a one-dimensional probability distribution.
Connecting to various databases from the R environment, deploying the ODBC tables for reading the data, visualization of the performance of the algorithm using Confusion Matrix.
Creating an integrated environment for deploying R on Hadoop platform, working with R Hadoop, RMR package and R Hadoop Integrated Programming Environment, R programming for MapReduce jobs and Hadoop execution.
Logistic Regression Case Study
In this case study you will get a detailed understanding of the advertisement spends of a company that will help to drive more sales. You will deploy logistic regression to forecast the future trends, detect patterns, uncover insights and more all through the power of R programming. Due to this the future advertisement spends can be decided and optimized for higher revenues.
Multiple Regression Case Study
You will understand how to compare the miles per gallon (MPG) of a car based on the various parameters. You will deploy multiple regression and note down the MPG for car make, model, speed, load conditions, etc. It includes the model building, model diagnostic, checking the ROC curve, among other things.
Receiver Operating Characteristic (ROC) case study
You will work with various data sets in R, deploy data exploration methodologies, build scalable models, predict the outcome with highest precision, diagnose the model that you have created with various real world data, check the ROC curve and more.
Introduction to the search engine Apache Lucene, understanding the inverted index, documents and fields.
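The idea of an inverted index can be sketched in a few lines of plain Python. This is not Lucene's API or its on-disk format; `tokenize`, `build_inverted_index`, `search_all` and the sample documents are hypothetical names and data used only to show the term-to-documents mapping.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_all(index, query):
    """Return ids of documents containing every term in the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Made-up documents for illustration.
docs = {
    1: "Hadoop stores data in HDFS",
    2: "Spark processes data in memory",
    3: "Lucene builds an inverted index",
}
index = build_inverted_index(docs)
```

A query then becomes a lookup per term followed by a set intersection, instead of a scan over every document.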
Introduction to the various query types available in Lucene and clear understanding of these.
Understanding the prerequisites for using Apache Lucene, learning about the querying process, analyzers, scoring boosting, faceting, grouping, highlighting, the various types of geographical and spatial searches, introduction to Apache Tika.
Demonstration of the Apache Lucene workings.
Understanding the Analyzer, Query Parser in Apache Lucene, Query Object, Stopword.
Understanding the various aspects of Apache Lucene like Scoring, Boosting, Highlighting, Faceting and Grouping
Introduction to Apache Solr, the advantages of Apache Solr over Apache Lucene, the basic system requirements for using Apache Solr, introduction to Cores in Apache Solr.
Introduction to the Apache Solr indexing, index using built-in data import handler and post tool, understanding the Solrj Client and configuration of Solrj Client.
Demonstrating the Book Store use cases with Solr Indexing with practical examples, learning to build Schema, the field, field types, CopyField and Dynamic Field, understanding how to add, explore, update, and delete using Solrj.
The various aspects of Apache Solr search like sorting, pagination, an overview of the request parameters, faceting and highlighting.
Understanding the Request Handlers, defining and mapping to search components, highlighting and faceting, updating managed schemas, request parameters hardwiring, adding fields to default search, the various types of Analyzers, Parsers, Tokenizers.
Grouping of results in Apache Solr, Parse queries functions, fuzzy query in Apache Solr.
The extended features in Apache Solr, learning about Pseudo-fields, Pseudo-Joins, Spell Check, suggestions, Geospatial Search, multi-language search, stop words and synonyms.
Understanding the concept of Multicore in Solr, the creation of Multicore in Solr, the need of Multicore, Joining of data, Replication and Ping Handler.
Understanding SolrCloud, the concept of Sharding, indexing, and replication in Apache SolrCloud, the working of Apache SolrCloud, distributed requests, read and write side fault tolerance, cluster coordination using Apache ZooKeeper.
Introduction to Splunk, Splunk developer roles and responsibilities
Writing Splunk query for search, sharing, saving, scheduling and exporting search results
Creation of alert, explaining alerts and viewing fired alerts
Introduction to Tags in Splunk, deploying Tags for Splunk search, understanding event types and utility, generating and implementing event types in Search
Search Command study, search practices in general, detailed understanding of search, search field performance with different commands like table, multikv, rename, rex & erex
Using following commands and their functions: addcoltotals, addtotals, top, rare, stats
Explore the available visualizations, create charts and time charts, omit null values and format results
Calculating and analyzing results, value conversion, round and format values, using eval command, conditional statements, filtering calculated search results
Understanding Search Transactions
Learn about data lookups, example, lookup table, defining and configuring automatic lookup, deploying lookup in reports and searches
Creating search charts, reports and dashboards
Working with raw data for data extraction, transformation, parsing and preview
Introduction to the Splunk 3 tier architecture, understanding the Server settings, control, preferences and licensing, the most important components of Splunk tool, the hardware requirements, conditions for installation of Splunk.
Understanding how to install and configure Splunk, index creation, input configuration in standalone server, the search preferences, installing Splunk in the Linux environment.
Installing Splunk in the Linux environment, the various prerequisites, configuration of Splunk in Linux.
Introduction to the Splunk Distributed Management Console, index clustering, forwarder management and distributed search in Splunk environment, providing the right authentication to users, access control.
Introducing the Splunk app, managing the Splunk app, the various add-ons in Splunk app, deleting and installing apps from SplunkBase, deploying the various app permissions, deploying the Splunk app, apps on forwarder.
Understanding the index time configuration file and search time configuration file.
Learning about the index time and search time configuration files in Splunk, installing the forwarders, configuring the output and inputs.conf, managing the Universal Forwarders.
Deploying the Splunk tool, the Splunk deployment Server, setting up the Splunk deployment environment, deploying the clients grouping in Splunk.
Understanding the Splunk Indexes, the default Splunk Indexes, segregating the Splunk Indexes, learning about Splunk Buckets and Bucket Classification, estimating index storage, creating new index.
Understanding the concept of role inheritance, Splunk authentications, native authentications, LDAP authentications.
Splunk installation, configuration, data inputs, app management, Splunk important concepts, parsing machine-generated data, search indexer and forwarder.
Introduction to Splunk Configuration Files, Universal Forwarder, Forwarder Management, data management, troubleshooting and monitoring.
Converting machine-generated data into operational intelligence, setting up Dashboard, Reports and Charts, integrating Search Head Clustering & Indexer Clustering.
Understanding the input methods, deploying scripted, Windows, network and agentless input types, fine-tuning it all.
Splunk User authentication and Job Role assignment, learning to manage, monitor and optimize Splunk Indexes.
Understanding parsing of machine-generated data, manipulation of raw data, previewing and parsing, data field extraction.
Distributed search concepts, improving search performance, large scale deployment and overcoming execution hurdles, working with Splunk Distributed Management Console for monitoring the entire operation.
Introducing Scala and deployment of Scala for Big Data applications and Apache Spark analytics.
The importance of Scala, the concept of REPL (Read Evaluate Print Loop), deep dive into Scala pattern matching, type inference, higher order function, currying, traits, application space and Scala for data analysis.
Learning about the Scala Interpreter, static object timer in Scala, testing String equality in Scala, Implicit classes in Scala, the concept of currying in Scala, various classes in Scala.
Learning about the Classes concept, understanding the constructor overloading, the various abstract classes, the hierarchy types in Scala, the concept of object equality, the val and var methods in Scala.
Understanding Sealed traits, wildcard, constructor, tuple, variable pattern, and constant pattern.
Understanding traits in Scala, the advantages of traits, linearization of traits, the Java equivalent and avoiding of boilerplate code.
Implementation of traits in Scala and Java, handling of multiple traits extending.
Introduction to Scala collections, classification of collections, the difference between Iterator, and Iterable in Scala, example of list sequence in Scala.
The two types of collections in Scala, Mutable and Immutable collections, understanding lists and arrays in Scala, the list buffer and array buffer, Queue in Scala, double-ended queue Deque, Stacks, Sets, Maps, Tuples in Scala.
Introduction to Scala packages and imports, the selective imports, the Scala test classes, introduction to JUnit test class, JUnit interface via JUnit 3 suite for Scala test, packaging of Scala applications in Directory Structure, example of Spark Split and Spark Scala.
Introduction to Spark, how Spark overcomes the drawbacks of working with MapReduce, understanding in-memory MapReduce, Spark Hadoop YARN, HDFS Revision, YARN Revision, the overview of Spark and how it is better than Hadoop, deploying Spark without Hadoop.
Spark installation guide, working with Spark Shell, the concept of Resilient Distributed Datasets (RDD), learning to do functional programming in Spark, the architecture of Spark.
Deep dive into Spark RDDs, the RDD general operations, a read-only partitioned collection of records, using the concept of RDD for faster and efficient data processing.
Understanding the concept of Key-Value pair in RDDs, learning how Spark makes MapReduce operations faster, various operations of RDD.
Comparing the Spark applications with Spark Shell, creating a Spark application using Scala or Java, deploying a Spark application, the web user interface of Spark application, a real world example of Spark and configuring of Spark.
Learning about Spark parallel processing, deploying on a cluster, introduction to Spark partitions, file-based partitioning of RDDs, understanding of HDFS and data locality, mastering the technique of parallel operations.
Understanding the RDD persistence overview, distributed persistence, RDD lineage
Understanding the Spark streaming, creating a Spark stream application, processing of Spark stream, streaming request count and DStreams.
Introduction to Spark multi-batch operations, state operations, sliding window operations and advanced data sources.
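A sliding window operation combines the most recent micro-batches each time the window slides. The plain Python sketch below imitates that behaviour without Spark; `windowed_counts`, its parameters (measured here in number of batches rather than seconds) and the sample batches are assumptions for illustration, not Spark Streaming's API.

```python
from collections import Counter, deque

def windowed_counts(batches, window_length, slide_interval):
    """Emit item counts over the last `window_length` batches,
    once every `slide_interval` batches."""
    window = deque(maxlen=window_length)  # old batches fall out automatically
    results = []
    for i, batch in enumerate(batches, start=1):
        window.append(Counter(batch))
        if i % slide_interval == 0:
            total = Counter()
            for counts in window:
                total.update(counts)
            results.append(dict(total))
    return results

# Made-up stream of four micro-batches.
batches = [["a", "b"], ["a"], ["b", "b"], ["c"]]
results = windowed_counts(batches, window_length=2, slide_interval=1)
```

Each emitted result covers the two most recent batches, so an item stops being counted once its batch leaves the window.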
Learning about the Spark common use cases, the concept of iterative algorithm in Spark, analyzing with Spark graph processing, introduction to K-Means and machine learning.
Introduction to various variables in Spark like shared variables, broadcast variables, learning about accumulators, the common performance issues and troubleshooting the performance problems.
Learning about Spark SQL, the context of SQL in Spark for providing structured data processing, understanding the Data Frames in Spark, learning to query and transform data in Data Frames, how Data Frame provides the benefit of both Spark RDD and Spark SQL, deploying Hive on Spark as the execution engine.
Learning about the scheduling and partitioning in Spark, scheduling within and around applications, static partitioning, dynamic sharing, fair scheduling, Spark master high availability, standby Masters with ZooKeeper, Single Node Recovery With Local File System, Higher Order Functions.
Understanding how to design capacity planning in Spark, creation of Maps, Transformations, the concept of concurrency in Java and Scala.
Understanding about log analysis with Spark, first log analyzers in Spark, working with various buffers like array, compact and protocol buffer.
Big Data characteristics, understanding Hadoop distributed computing, the Bayesian Law, deploying Storm for real time analytics, the Apache Storm features, comparing Storm with Hadoop, Storm execution, learning about Tuple, Spout, Bolt.
Installing the Apache Storm, various types of run modes of Storm.
Understanding Apache Storm and the data model.
Installation of Apache Kafka and its configuration.
Understanding of advanced Storm topics like Spouts, Bolts, Stream Groupings, Topology and its Life cycle, learning about Guaranteed Message Processing.
Various Grouping types in Storm, reliable and unreliable messages, Bolt structure and life cycle, understanding Trident topology for failure handling, process, Call Log Analysis Topology for analyzing call logs for calls made from one number to another.
Understanding of Trident Spouts and its different types, the various Trident Spout interface and components, familiarizing with Trident Filter, Aggregator and Functions, a practical and hands-on use case on solving call log problem using Storm Trident.
Various components, classes and interfaces in Storm like the BaseRichBolt class, IRichBolt interface, IRichSpout interface, BaseRichSpout class and the various methods of working with them.
Understanding Cassandra, its core concepts, its strengths and deployment.
Twitter Bootstrapping, detailed understanding of Bootstrapping, concepts of Storm, Storm Development Environment.
Introduction to Cassandra, its strengths and deployment areas
Significance of NoSQL, RDBMS Replication, Key Challenges, types of NoSQL, benefits and drawbacks, salient features of NoSQL database. CAP Theorem, Consistency.
Installation, introduction to Cassandra, key concepts and deployment of non-relational database, column-oriented database, Data Model – column, column family
Token calculation, Configuration overview, Node tool, Validators, Comparators, Expiring column, QA
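For token calculation, evenly spaced initial tokens on a ring with one token per node are commonly computed as shown in this plain Python sketch. It assumes the usual token ranges of Murmur3Partitioner (-2^63 to 2^63 - 1) and RandomPartitioner (0 to 2^127 - 1); the function names are illustrative, and clusters that use VNodes assign tokens automatically instead.

```python
def murmur3_tokens(node_count):
    """Evenly spaced tokens over the Murmur3Partitioner range [-2**63, 2**63 - 1]."""
    step = 2**64 // node_count
    return [step * i - 2**63 for i in range(node_count)]

def random_partitioner_tokens(node_count):
    """Evenly spaced tokens over the RandomPartitioner range [0, 2**127 - 1]."""
    step = 2**127 // node_count
    return [step * i for i in range(node_count)]

# Hypothetical 4-node cluster.
tokens = murmur3_tokens(4)
legacy_tokens = random_partitioner_tokens(4)
```

Each value would be set as one node's initial token so that every node owns an equal slice of the ring.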
How Cassandra modelling varies from Relational database modelling, Cassandra modelling steps, introduction to Time Series modelling, comparing Column family Vs. Super Column family, Counter column family, Partitioners, Partitioners strategies, Replication, Gossip protocols, Read operation, Consistency, Comparison
Creation of multi node cluster, node settings, Key and Row cache, System Key space, understanding of Read Operation, Cassandra Commands overview, VNodes, Column family
JSON, Hector client, AVRO, Thrift, JAVA code writing method, Hector tag
Cassandra management, commands of node tool, MapReduce and Cassandra, Secondary index, Datastax Installation
Rules of Cassandra data modelling, increasing data writes, duplication, and reducing data reads, modelling data around queries, creating table for data queries
Understanding the Java application creation methodology, learning key drivers, deploying the IDE for Cassandra applications, cluster connection and data query implementation
Learning about Node Tool Utility, cluster management using Command Line Interface, Cassandra management and monitoring via DataStax Ops Center.
Cassandra client connectivity, connection pool internals, API, important features and concepts of Hector client, Thrift, JAVA code, Summarization.
The Architecture of Couchbase, understanding Couchbase distributed NoSQL database engine, vBuckets for information distribution on Couchbase cluster, user and system requirements, Couchbase downloading and installation.
Couchbase single node deployment for development purpose
Managing the Couchbase environment with the Web Console tool, configuring the Couchbase server and management, working with Couchbase data buckets, default bucket sizing, and administration.
Methods for deploying Couchbase in a multi node cluster: first with all Couchbase Servers on one machine, and second with each Couchbase Server on its own machine.
The Couchbase Command-line Interface tools for managing and monitoring single node and multi node clusters, Servers and vBuckets, developing Reports for log data collection.
RDBMS, types of relational databases, challenges of RDBMS, NoSQL database, its significance, how NoSQL suits Big Data needs, Introduction to MongoDB and its advantages, MongoDB installation, JSON features, data types and examples.
Installing MongoDB, basic MongoDB commands and operations, MongoChef (MongoGUI) Installation, MongoDB Data types.
The need for NoSQL, types of NoSQL databases, OLTP, OLAP, limitations of RDBMS, ACID properties, CAP Theorem, Base property, learning about JSON/BSON, database collection & document, MongoDB uses, MongoDB Write Concern – Acknowledged, Replica Acknowledged, Unacknowledged, Journaled, Fsync.
Understanding CRUD and its functionality, CRUD concepts, MongoDB Query & Syntax, read and write queries and query optimization.
Concepts of data modeling, difference between MongoDB and RDBMS modeling, Model tree structure, operational strategies, monitoring and backup.
In this module you will learn MongoDB® Administration activities such as Health Check, Backup, Recovery, database sharding and profiling, Data Import/Export, Performance tuning etc.
Concepts of data aggregation and types, data indexing concepts, properties and variations.
Understanding database security risks, MongoDB security concept and security approach, MongoDB integration with Java and Robomongo.
Implementing techniques to work with variety of unstructured data like images, videos, log data, and others, understanding GridFS MongoDB file system for storing data.
Project 1 – Working with MapReduce, Hive, Sqoop
This project involves working on various Hadoop components like MapReduce, Apache Hive and Apache Sqoop. Work with Sqoop to import data from a relational database management system like MySQL into HDFS. Deploy Hive for summarizing data, querying and analysis. Convert SQL queries using HiveQL for deploying MapReduce on the transferred data. You will gain considerable proficiency in Hive and Sqoop after completion of this project.
Project 2 – Work on MovieLens data for finding top records
Data – MovieLens dataset
In this project you will work exclusively on data collected through MovieLens available rating data sets. The project involves the following important components:
Project 3 – Hadoop YARN Project – End to End PoC
In this project you will work on a live Hadoop YARN project. YARN is part of the Hadoop 2.0 ecosystem that lets Hadoop decouple from MapReduce and support more competitive processing and a wider array of applications. You will work on the YARN central Resource Manager. The salient features of this project include:
Project 4 – Partitioning Tables in Hive
This project involves working with Hive table data partitioning. Ensuring the right partitioning helps to read the data, deploy it on the HDFS, and run the MapReduce jobs at a much faster rate. Hive lets you partition data in multiple ways like:
This will give you hands-on experience in partitioning of Hive tables manually, deploying single SQL execution in dynamic partitioning, bucketing of data so as to break it into manageable chunks.
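Partitioning and bucketing can be pictured as a rule for where each row is stored. The plain Python sketch below builds a `column=value` directory path per partition and picks a bucket by hashing a column. It does not use Hive; the function names and sample row are hypothetical, and `zlib.crc32` is only a stand-in for Hive's own hash function.

```python
import zlib

def partition_path(table, row, partition_cols):
    """Build a directory path with one `column=value` level per partition column."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([table] + parts)

def bucket_for(value, num_buckets):
    """Pick a bucket by hashing the value (crc32 is a stand-in, not Hive's hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

# Made-up row for illustration.
row = {"id": 42, "country": "US", "year": 2016, "amount": 19.99}
path = partition_path("sales", row, ["country", "year"])
bucket = bucket_for(row["id"], num_buckets=4)
```

A query that filters on the partition columns only needs to read the matching directories, while bucketing splits each partition into a fixed number of roughly equal files.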
Project 5 – Connecting Pentaho with Hadoop Ecosystem
This project lets you connect Pentaho with the Hadoop ecosystem. Pentaho works well with HDFS, HBase, Oozie and Zookeeper. You will connect the Hadoop cluster with Pentaho data integration, analytics, Pentaho server and report designer. Some of the components of this project include the following:
Project 6 – Multi-node cluster setup
This project gives you the opportunity to work on a real-world Hadoop multi-node cluster setup in a distributed environment. The major components of this project involve:
You will get a complete demonstration of working with various Hadoop cluster master and slave nodes, installing Java as a prerequisite for running Hadoop, installation of Hadoop and mapping the nodes in the Hadoop cluster.
Project 7 – Hadoop Testing using MR
In this project you will gain proficiency in Hadoop MapReduce code testing using MRUnit. You will learn about real world scenarios of deploying MRUnit, Mockito, and PowerMock. Some of the important aspects of this project include:
After completion of this project you will be well-versed in test driven development and will be able to write light-weight test units that work specifically on the Hadoop architecture.
Project 8 – Hadoop Weblog Analytics
Data – Weblogs
This project involves making sense of web log data in order to derive valuable insights from it. You will load the server data onto a Hadoop cluster using various techniques. The various modules of this project include:
The web log data can include the URLs visited, cookie data, user demographics, location, date and time of web service access, etc. In this project you will transport the data using Apache Flume or Kafka, and handle workflow and data cleansing using MapReduce, Pig or Spark. The insights thus derived can be used to analyze customer behavior and predict buying patterns.
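As a rough illustration of the cleansing step, the Python sketch below parses web server log lines, drops malformed or failed requests, and counts hits per URL. It assumes the logs are in Apache Common Log Format; the sample lines are invented, and in the project itself this work would run on MapReduce, Pig or Spark rather than in a single Python process.

```python
import re
from collections import Counter

# Apache Common Log Format (assumed here; real logs may use another format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it is malformed."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def clean(lines):
    """Yield only well-formed records for successful (status 200) requests."""
    for line in lines:
        record = parse_line(line)
        if record is not None and record["status"] == "200":
            yield record


# Invented sample data for illustration only.
sample = [
    '10.0.0.1 - - [10/Oct/2015:13:55:36 +0000] "GET /products/42 HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2015:13:56:01 +0000] "GET /products/42 HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2015:13:57:12 +0000] "GET /cart HTTP/1.1" 404 512',
    'corrupted line',
]

hits_per_url = Counter(record["url"] for record in clean(sample))
print(hits_per_url)
```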
Project 9 – Hadoop Maintenance
This project involves maintaining and managing a Hadoop cluster. You will work on a number of important tasks like:
Project Title – Restaurant Revenue Prediction
Data set – Sales
Project Description – This project involves predicting the sales of a restaurant on the basis of certain objective measurements. It gives you real-world industry experience in handling multiple use cases and deriving solutions, along with insight into feature engineering and selection.
Project 1 – Understanding Cold Start Problem in Data Science
Topics: This project involves understanding the cold start problem associated with recommender systems. You will gain hands-on experience in information filtering and in working on systems with no historical data to refer to, as when launching a new product. You will gain proficiency in working with personalized applications that recommend movies, books, songs, news and the like. This project includes the following:
Project 2 – Recommendation for Movie, Summary
Topics: This is a real-world project that gives you hands-on experience in working with a movie recommender system. Depending on what movies a particular user likes, you will be in a position to provide data-driven recommendations. This project involves understanding recommender systems, information filtering, predicting ‘rating’, learning about user ‘preference’ and so on. You will work exclusively on data related to user details, movie details and others. The main components of the project include the following:
The Market Basket Analysis (MBA) case study
This case study covers the modeling technique of Market Basket Analysis, where you will learn about loading data, various techniques for plotting the items and running the algorithms. It includes finding out which items go hand in hand and hence can be grouped together. This is used in various real-world scenarios such as a supermarket shopping cart.
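Finding items that "go hand in hand" usually comes down to two measures: support (how often a pair appears together) and confidence (how often the second item appears given the first). The Python sketch below computes both for item pairs. The baskets and the `min_support` threshold are invented for illustration, and the case study itself may use a different tool or algorithm.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            # One rule per direction: a -> b and b -> a.
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules


# Invented shopping baskets for illustration only.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "butter"],
]

rules = pair_rules(baskets, min_support=0.5)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every basket containing milk also contains butter, so the rule milk -> butter has a confidence of 1.0.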
Topics : This project involves working with the Couchbase command-line interface tools that are used for managing clusters in a multi-node or single-node setup, working with vBuckets in Couchbase server, and deploying Reports for log data collection. You will gain hands-on experience in deploying commands like start, stop and report status for log collection. It also includes working with Couchbase-cli, the cbcollect_info tool and so on. Upon completion of the project you will be proficient in using the Couchbase CLI for managing and monitoring clusters and for data replication using XDCR.
Project – Running Function Queries on Apache Solr
Topics : In this project you will learn about Function Queries and apply them to search results obtained from Apache Solr. You will understand how Function Queries are used to modify search results based on certain conditions. It involves working on an index store that holds the dimensions of boxes with arbitrary names, sorting all the boxes through search and then modifying the search results using Function Queries based on new parameters. Some of the query parsers used are DisMax, Extended DisMax and standard.
Topics: This project gives you hands-on experience in working with the Splunk tool. You will have the data set of employee details in a text file based on which you will create a dashboard and report. Then you will deploy the various Splunk commands to perform row operations, extract certain data fields, edit the event, add tags, search with tag name for event and then save the tag search. Upon completion of this project you will learn to create a searchable repository using data that is captured, correlated and indexed in real time and ultimately visualize it using dashboard, report and alert.
Type – Field Extraction
Topics : In this project you will learn to extract fields from events using the Splunk field extraction technique. You will gain knowledge of the basics of field extraction, understand the use of the field extractor, the field extraction page in Splunk Web and field extraction configuration in files. You will learn about the regular expression and delimiter methods of field extraction. Upon completion of the project you will gain expertise in building Splunk dashboards and using the extracted field data in them to create rich visualizations in an enterprise setup.
Project 1: Movie Recommendation
Topics – This is a project wherein you will gain hands-on experience in deploying Apache Spark for movie recommendation. You will be introduced to MLlib, the Spark machine learning library, with a guide to its algorithms and coding. You will understand how to deploy collaborative filtering, clustering, regression, and dimensionality reduction in MLlib. Upon completion of the project you will gain experience in working with streaming data, sampling, testing and statistics.
Project 2: Twitter API Integration for tweet Analysis
Topics – With this project you will learn to integrate the Twitter API for analyzing tweets. You will write server-side code in a scripting language such as PHP, Ruby or Python to send requests to the Twitter API and get the results in JSON format. You will then read the results and perform operations like aggregation, filtering and parsing as needed to produce the tweet analysis.
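The parsing, filtering and aggregation steps can be sketched in Python as below. The JSON payload is a hand-written stand-in rather than real API output, and its field names (`text`, `lang`, `entities.hashtags`) are an assumption about the response shape; the call to the Twitter API itself is left out.

```python
import json
from collections import Counter

# Hand-written stand-in for an API response; not real Twitter data.
raw = """[
  {"text": "Loving #hadoop and #spark", "lang": "en",
   "entities": {"hashtags": [{"text": "hadoop"}, {"text": "spark"}]}},
  {"text": "More #Spark today", "lang": "en",
   "entities": {"hashtags": [{"text": "Spark"}]}},
  {"text": "Vive #hadoop", "lang": "fr",
   "entities": {"hashtags": [{"text": "hadoop"}]}}
]"""

# Parsing: turn the JSON text into Python lists and dicts.
tweets = json.loads(raw)

# Filtering: keep only English-language tweets.
english = [tweet for tweet in tweets if tweet.get("lang") == "en"]

# Aggregation: count hashtags case-insensitively across the filtered tweets.
hashtag_counts = Counter(
    tag["text"].lower()
    for tweet in english
    for tag in tweet.get("entities", {}).get("hashtags", [])
)

print(hashtag_counts.most_common())
```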
Project 3: Data Exploration Using Spark SQL – Wikipedia data set
Topics – This project lets you work with Spark SQL. You will gain experience in working with Spark SQL for combining it with ETL applications, real time analysis of data, performing batch analysis, deploying machine learning, creating visualizations and processing of graphs.
Project 1. Call Log Analysis using Trident
Topics : In this project you will be working on call logs to decipher the data and gather valuable insights using Apache Storm Trident. You will extensively work with data about calls made from one number to another. The aim of this project is to resolve the call log issues with Trident stream processing and low latency distributed querying. You will gain hands-on experience in working with Spouts and Bolts along with various Trident functions, filters, aggregation, joins and grouping.
Project 2. Twitter Data Analysis using Trident
Topics : This is a project that involves working with Twitter data and processing it to extract patterns. Apache Storm Trident is well suited to real-time analysis of tweets, and working with it simplifies the task of live Twitter feed analysis. In this project you will gain real-world experience of working with Spouts, Bolts, and Trident filters, joins, aggregation, functions and grouping.
Project 3. US Presidential Election Result analysis using Trident DRPC Query
Topics : This is a project that lets you work on the US presidential election results and predict who is leading and trailing on a real-time basis. For this you will work exclusively with the Trident Distributed Remote Procedure Call (DRPC) server. After completing the project you will know how to access data residing on a remote computer or network and use it for real-time processing, analysis and prediction.
Type : Deploying the IDE for Cassandra applications
Topics : This project gives you hands-on experience in installing and working with Apache Cassandra, which is a high-performance and extremely scalable database for distributed data with no single point of failure. You will deploy the Java Integrated Development Environment for running Cassandra, learn about the key drivers, work with Cassandra applications in a cluster setup and implement data querying techniques.
Java is one of the most popular programming languages for working with MongoDB. This project shows you how to work with the MongoDB Java Driver and how to use MongoDB as a Java developer. You will become proficient in creating a collection (MongoDB's counterpart to a table) for inserting video using Java programming. Some of the tasks and steps involved are as below:
Intellipaat’s Combo program is a structured learning path specially designed by industry experts to ensure that you transform into a Big Data and Data Science expert. Individual courses at Intellipaat focus on one or two specializations. However, if you want to master Big Data and Data Science, then this program is for you.
Intellipaat is a pioneer of Big Data and Data Science training. We provide:
Intellipaat offers self-paced training and online instructor-led training.
Hadoop developer, Hadoop admin, Hadoop analyst, Hadoop testing, Spark & Scala, Apache Storm, Data Science with R, Data Science with SAS, Splunk, and Deep learning are online instructor-led courses.
Java, HBase, Cassandra, Apache Kafka, Couchbase, Apache Solr, Linux, and Mahout are self-paced courses.
If you have any queries you can contact our 24/7 dedicated support to raise a ticket. We provide you email support and solution to your queries. If the query is not resolved by email we can arrange for a one-on-one session with our trainers. The best part is that you can contact Intellipaat even after completion of training to get support and assistance. There is also no limit on the number of queries you can raise when it comes to doubt clearance and query resolution.
We provide you with the opportunity to work on 48 real-world projects in which you can apply the knowledge and skills you acquired through our training, making you industry ready.
Yes, Intellipaat does provide you with placement assistance. We have tie-ups with 80+ organizations including Ericsson, Cisco, Cognizant, and TCS, among others, that are looking for Hadoop professionals, and we would be happy to assist you with preparing for the interview and the job.
Upon successful completion of training you have to take a set of quizzes and complete the projects. After review, and on scoring over 60% marks in the qualifying quiz, the official Intellipaat verified certificate is awarded. The Intellipaat Certification is a seal of approval and is highly recognized in 80+ corporations around the world, including many in the Fortune 500 list of companies.
Preferably 8 GB RAM (Windows or Mac) with a good internet connection
All the instructors are from the industry, with 18+ years of experience. They are subject experts and each of them has gone through a rigorous selection process.
This is a comprehensive course that is designed to clear multiple certifications viz.
The entire training course content is in line with the respective certification programs and helps you clear the requisite certification exams and pursue jobs in top MNCs.
As part of this training you will work on real-time projects and assignments that reflect real-world industry scenarios, helping you fast-track your career.
At the end of this training program there will be quizzes that reflect the type of questions asked in the respective certification exams and help you score better marks in them.
Intellipaat Course Completion certificate will be awarded on the completion of Project work (on expert review) and upon scoring of at least 60% marks in the quiz. Intellipaat certification is well recognized in top 80+ MNCs like Ericsson, Cisco, Cognizant, Sony, Mu Sigma, Saint-Gobain, Standard Chartered, TCS, Genpact, Hexaware, etc.