In this Hadoop and R Programming training combo course, you will learn to analyze, extract and compute large volumes of structured and unstructured data using Hadoop and R Programming.
It is an all-in-one course designed to give a 360-degree overview of Hadoop architecture and its implementation on real-time projects, along with core concepts of R Programming applied to import various formats of data for statistical computing and graphics. The major topics include Hadoop and its ecosystem, core concepts of MapReduce and HDFS, introduction to HBase architecture, Hadoop Cluster Setup, Hadoop Administration and Maintenance. The course further trains you on data structures, variables, control flow, functions and getting data into the R environment, overview of statistics in R, descriptive statistics, inferential statistics, linear regression, sophisticated graphics in R, R Programming for mapping and GIS and integrating R Programming with Hadoop.
After the completion of this Hadoop all-in-one course, you will be able to:
1. Hadoop Projects
1. Project – Working with MapReduce, Hive, Sqoop
Problem Statement – It describes how to import MySQL data using sqoop and querying it using hive and also describes how to run the word count MapReduce job.
2. Project – Work on Movie lens data for finding top records
Data – Movie Lens dataset
Problem Statement – It includes:
3. Project – Hadoop Yarn Project – End to End PoC
Problem Statement – It includes:
4. Project – Partitioning Tables
Problem Statement – It describes the parting and How to perform portioning. It includes:
5. Project – Sales Commission
Data – Sales
Problem Statement – In this we calculate the commission according to the sales.
6. Project – Connecting Pentaho with Hadoop Ecosystem
Problem Statement – It includes:
7. Project – Multinode Cluster Setup
Problem Statement – It includes following actions:
8. Project – Hadoop Testing using MR
Problem Statement – It describes that how to test map reduce codes with MR unit.
9. Project – Hadoop Weblog Analytics
Data – Weblogs
Problem Statement – The goal is to enable the participants to have a feel of the actual data sets in a production environment and how to load the data into a Hadoop cluster using various techniques. Once data is loaded, the next goal is to perform basic analytics on this data.
2. R Programming Project – Restaurant Revenue Prediction
Data – Revenue Dataset
Problem Statement – It predicts the annual restaurant sales based on the objective measurements. It uses following data fields:
It also includes:
The architecture of Hadoop 2.0 cluster, what is High Availability and Federation, how to setup a production cluster, the various shell commands in Hadoop, understanding configuration files in Hadoop 2.0, installing single node cluster with Cloudera Manager, understanding Spark, Scala, Sqoop, Pig and Flume.
Introducing Big Data & Hadoop, what is Big Data and where does Hadoop fits in, two important Hadoop ecosystem componentsnamely Map Reduce and HDFS, in-depth Hadoop Distributed File System – Replications, Block Size, Secondary Name node, High Availability, in-depth YARN – Resource Manager, Node Manager.
Hands-on Exercise – HDFS working mechanism, data replication process, how to determine the size of the block, understanding a DataNode and NameNode.
Learning the working mechanism of MapReduce, understanding the mapping and reducing stages in MR, the various terminologies in MR like Input Format, Output Format, Partitioners, Combiners, Shuffle and Sort
Hands-on Exercise – How to write a Word Count program in MapReduce, how to write a custom Partitioner, what is a MapReduce Combiner, how to run a job in a local job runner, deploying unit test, what is a map side join and reduce side join, what is a tool runner, how to use counters, dataset joining with map side and reduce side joins.
Introducing Hadoop Hive, detailed architecture of Hive, comparing Hive with Pig and RDBMS, working with Hive Query Language, creation of database, table, Group by and other clauses, the various types of Hive tables, Hcatalog, storing the Hive Results, Hive partitioning and Buckets.
Hands-on Exercise – Database creation in Hive, dropping a database, Hive table creation, how to change the database, data loading, Hive table creation, dropping and altering table, pulling data by writing Hive queries with filter conditions, table partitioning in Hive, what is a group by clause
The indexing in Hive, the Map side Join in Hive, working with complex data types, the Hive User-defined Functions, Introduction to Impala, comparing Hive with Impala, the detailed architecture of Impala
Hands-on Exercise – How to work with Hive queries, the process of joining table and writing indexes, external table and sequence table deployment, data storage in a different table.
Apache Pig introduction, its various features, the various data types and schema in Hive, the available functions in Pig, Hive Bags, Tuples and Fields.
Hands-on Exercise – Working with Pig in MapReduce and local mode, loading of data, limiting data to 4 rows, storing the data into file, working with Group By,Filter By,Distinct,Cross,Split in Hive.
Apache Sqoop introduction, overview, importing and exporting data, performance improvement with Sqoop, Sqoop limitations, introduction to Flume and understanding the architecture of Flume, what is HBase and the CAP theorem.
Hands-on Exercise – Working with Flume to generating of Sequence Number and consuming it, using the Flume Agent to consume the Twitter data, using AVRO to create Hive Table, AVRO with Pig, creating Table in HBase, deploying Disable, Scan and Enable Table.
Using Scala for writing Apache Spark applications, detailed study of Scala, the need for Scala, the concept of object oriented programing, executing the Scala code, the various classes in Scala like Getters,Setters, Constructors, Abstract ,Extending Objects, Overriding Methods, the Java and Scala interoperability, the concept of functional programming and anonymous functions, Bobsrockets package, comparing the mutable and immutable collections.
Hands-on Exercise – Writing Spark application using Scala, understanding the robustness of Scala for Spark real-time analytics operation.
Detailed Apache Spark, its various features, comparing with Hadoop, the various Spark components, combining HDFS with Spark, Scalding, introduction to Scala, importance of Scala and RDD.
Hands-on Exercise – The Resilient Distributed Dataset in Spark and how it helps to speed up big data processing.
Understanding the Spark RDD operations, comparison of Spark with MapReduce, what is a Spark transformation, loading data in Spark, types of RDD operations viz. transformation and action, what is Key Value pair.
Hands-on Exercise – How to deploy RDD with HDFS, using the in-memory dataset, using file for RDD, how to define the base RDD from external file, deploying RDD via transformation, using the Map and Reduce functions, working on word count and count log severity.
The detailed Spark SQL, the significance of SQL in Spark for working with structured data processing, Spark SQL JSON support, working with XML data, and parquet files, creating HiveContext, writing Data Frame to Hive, How to read a JDBC file, significance of a Spark Data Frame, how to create a Data Frame, what is schema manual inferring, how to work with CSV files, JDBC table reading, data conversion from Data Frame to JDBC, Spark SQL user-defined functions. shared variable and accumulators, how to query and transform data in Data Frames, how Data Frame provides the benefits of both Spark RDD and Spark SQL, deploying Hive on Spark as the execution engine.
Hands-on Exercise – Data querying and transformation using Data Frames, finding out the benefits of Data Frames over Spark SQL and Spark RDD.
Introduction to Spark MLlib, understanding the various algorithms, what is Spark iterative algorithm, Spark graph processing analysis, introducing machine learning, K-Means clustering, Spark variables like shared and broadcast variables, what are accumulators.
Hands-on Exercise – Writing spark code using Mlib.
Introduction to Spark streaming, the architecture of Spark Streaming, working with the Spark streaming program, processing data using Spark streaming, requesting count and Dstream, multi-batch and sliding window operations and working with advanced data sources.
Hands-on Exercise – Deploying Spark streaming for data in motion and checking the output is as per the requirement.
Create a four node Hadoop cluster setup, running the MapReduce Jobs on the Hadoop cluster, successfully running the MapReduce code, working with the Cloudera Manager setup.
Hands-on Exercise – The method to build a multi-node Hadoop cluster using an Amazon EC2 instance, working with the Cloudera Manager.
The overview of Hadoop configuration, the importance of Hadoop configuration file, the various parameters and values of configuration, the HDFS parameters and MapReduce parameters, setting up the Hadoop environment, the Include’ and Exclude configuration files, the administration and maintenance of Name node, Data node directory structures and files, What is a File system image, understanding Edit log.
Hands-on Exercise – The process of performance tuning in MapReduce.
Introduction to the Checkpoint Procedure, Name node failure and how to ensure the recovery procedure, Safe Mode, Metadata and Data backup, the various potential problems and solutions, what to look for, how to add and remove nodes.
Hands-on Exercise – How to go about ensuring the MapReduce File system Recovery for various different scenarios, JMX monitoring of the Hadoop cluster, how to use the logs and stack traces for monitoring and troubleshooting, using the Job Scheduler for scheduling jobs in the same cluster, getting the MapReduce job submission flow, FIFO schedule, getting to know the Fair Scheduler and its configuration.
How ETL tools work in Big data Industry, Introduction to ETL and Data warehousing. Working with prominent use cases of Big data in ETL industry, End to End ETL PoC showing big data integration with ETL tool.
Hands-on Exercise – Connecting to HDFS from ETL tool and moving data from Local system to HDFS, Moving Data from DBMS to HDFS, Working with Hive with ETL Tool, Creating Map Reduce job in ETL tool
Working towards the solution of the Hadoop project solution, its problem statements and the possible solution outcomes, preparing for the Cloudera Certifications, points to focus for scoring the highest marks, tips for cracking Hadoop interview questions.
Hands-on Exercise – The project of a real-world high value Big Data Hadoop application and getting the right solution based on the criteria set by the Intellipaat team.
Why testing is important, Unit testing, Integration testing, Performance testing, Diagnostics, Nightly QA test, Benchmark and end to end tests, Functional testing, Release certification testing, Security testing, Scalability Testing, Commissioning and Decommissioning of Data Nodes Testing, Reliability testing, Release testing
Understanding the Requirement, preparation of the Testing Estimation, Test Cases, Test Data, Test bed creation, Test Execution, Defect Reporting, Defect Retest, Daily Status report delivery, Test completion, ETL testing at every stage (HDFS, HIVE, HBASE) while loading the input (logs/files/records etc) using sqoop/flume which includes but not limited to data verification, Reconciliation, User Authorization and Authentication testing (Groups, Users, Privileges etc), Report defects to the development team or manager and driving them to closure, Consolidate all the defects and create defect reports, Validating new feature and issues in Core Hadoop.
Report defects to the development team or manager and driving them to closure, Consolidate all the defects and create defect reports, Responsible for creating a testing Framework called MR Unit for testing of Map-Reduce programs.
Automation testing using the OOZIE, Data validation using the query surge tool.
Test plan for HDFS upgrade, Test automation and result
How to test install and configure
R language for statistical programming, the various features of R, introduction to R Studio, the statistical packages, familiarity with different data types and functions, learning to deploy them in various scenarios, use SQL to apply ‘join’ function, components of R Studio like code editor, visualization and debugging tools, learn about R-bind.
R Functions, code compilation and data in well-defined format called R-Packages, learn about R-Package structure, Package metadata and testing, CRAN (Comprehensive R Archive Network), Vector creation and variables values assignment.
R functionality, Rep Function, generating Repeats, Sorting and generating Factor Levels, Transpose and Stack Function.
Introduction to matrix and vector in R, understanding the various functions like Merge, Strsplit, Matrix manipulation, rowSums, rowMeans, colMeans, colSums, sequencing, repetition, indexing and other functions.
Understanding subscripts in plots in R, how to obtain parts of vectors, using subscripts with arrays, as logical variables, with lists, understanding how to read data from external files.
Generate plot in R, Graphs, Bar Plots, Line Plots, Histogram, components of Pie Chart.
Understanding Analysis of Variance (ANOVA) statistical technique, working with Pie Charts, Histograms, deploying ANOVA with R, one way ANOVA, two way ANOVA.
K-Means Clustering for Cluster & Affinity Analysis, Cluster Algorithm, cohesive subset of items, solving clustering issues, working with large datasets, association rule mining affinity analysis for data mining and analysis and learning co-occurrence relationships.
Introduction to Association Rule Mining, the various concepts of Association Rule Mining, various methods to predict relations between variables in large datasets, the algorithm and rules of Association Rule Mining, understanding single cardinality.
Understanding what is Simple Linear Regression, the various equations of Line, Slope, Y-Intercept Regression Line, deploying analysis using Regression, the least square criterion, interpreting the results, standard error to estimate and measure of variation.
Scatter Plots, Two variable Relationship, Simple Linear Regression analysis, Line of best fit
Deep understanding of the measure of variation, the concept of co-efficient of determination, F-Test, the test statistic with an F-distribution, advanced regression in R, prediction linear regression.
Logistic Regression Mean, Logistic Regression in R.
Advanced logistic regression, understanding how to do prediction using logistic regression, ensuring the model is accurate, understanding sensitivity and specificity, confusion matrix, what is ROC, a graphical plot illustrating binary classifier system, ROC curve in R for determining sensitivity/specificity trade-offs for a binary classifier.
Detailed understanding of ROC, area under ROC Curve, converting the variable, data set partitioning, understanding how to check for multicollinearlity, how two or more variables are highly correlated, building of model, advanced data set partitioning, interpreting of the output, predicting the output, detailed confusion matrix, deploying the Hosmer-Lemeshow test for checking whether the observed event rates match the expected event rates.
Data analysis with R, understanding the WALD test, MC Fadden’s pseudo R-squared, the significance of the area under ROC Curve, Kolmogorov Smirnov Chart which is non-parametric test of one dimensional probability distribution.
Connecting to various databases from the R environment, deploying the ODBC tables for reading the data, visualization of the performance of the algorithm using Confusion Matrix.
Creating an integrated environment for deploying R on Hadoop platform, working with R Hadoop, RMR package and R Hadoop Integrated Programming Environment, R programming for MapReduce jobs and Hadoop execution.
Logistic Regression Case Study
In this case study you will get a detailed understanding of the advertisement spends of a company that will help to drive more sales. You will deploy logistic regression to forecast the future trends, detect patterns, uncover insights and more all through the power of R programming. Due to this the future advertisement spends can be decided and optimized for higher revenues.
Multiple Regression Case Study
You will understand how to compare the miles per gallon (MPG) of a car based on the various parameters. You will deploy multiple regression and note down the MPG for car make, model, speed, load conditions, etc. It includes the model building, model diagnostic, checking the ROC curve, among other things.
Receiver Operating Characteristic (ROC) case study
You will work with various data sets in R, deploy data exploration methodologies, build scalable models, predict the outcome with highest precision, diagnose the model that you have created with various real world data, check the ROC curve and more.
Domain – Restaurant Revenue Prediction
Data set – Sales
Project Description – This project involves predicting the sales of a restaurant on the basis of certain objective measurements. This project will give real time industry experience on handling multiple use cases and derive the solution. This project gives insights about feature engineering and selection.
Domain – Data Analytics
Objective – To predict about the class of a flower using its petal’s dimensions
Domain – Finance
Objective – The project aims to find the most impacting factors in preferences of pre-paid model, also identifies which are all the variables highly correlated with impacting factors
Domain – Stock Market
Objective – This project focuses on Machine Learning by creating predictive data model to predict future stock prices
This course is designed for clearing Cloudera Spark and Hadoop Developer Certification (CCA175), Cloudera CCA Administrator Exam (CCA131) and R Certification exams. At the end of the course, there will be a quiz and project assignments. Once you complete them, you will be awarded with Intellipaat Course Completion Certificate.
Hadoop Architect: Hadoop Architect is a professional who organizes, manages and governs Hadoop on a very large cluster. The most important thing Hadoop Architect must have is rich experience in Hive, HBase, MapReduce, PIG and so on. Hadoop Developers: Hadoop Developer is a person who just loves programming and he must have knowledge about Core, Java, SQL and other languages along with remarkable skills. Hadoop QA Professional : Hadoop QA professional is a person who tests and rectify glitches in Hadoop Hadoop Administrator: Hadoop Administrator is a person who admins Hadoop and its Data base system. He has a well and good understanding of Hadoop principles and its hardware systems. Others: There can be some other jobs which could be assigned to some other professional as well. For example there can be a Hadoop trainer, Hadoop consultant, Hadoop engineers & also senior Hadoop engineers, big data Engineers, Hadoop developers and also Java Engineers (DSE Team).
Java 1.6.x or higher, preferably from Sun -see HadoopJavaVersions Linux and Windows are the supported operating systems, but BSD, Mac OS/X, and Open Solaris are known to work.
R is now considered to be not just the most popular open-source analytic tool in the world but the most popular analytic tool in the world. Estimates about number of users range from 250000 to over 2 million. If you look at online popularity, R is the hands-down winner. It has more blogs, discussion groups and email lists than any other tool including SAS. Kdnuggets, a popular website on data mining, conducts annual surveys on popularity of various analytic tools and R was again the top choice in most of the surveys. There is a definite shortage of trained resources who can do analytics with R. The few who do have the right skills find themselves in great demand as organizations look to ramp up their R capabilities.
You will work on multiple case studies as part of the course. The case studies will involve hands-on work on huge datasets.
Learn how to use R for data manipulation and predictive modeling. Develop a comfort level with R as a tool for data analysis and apply statistical algorithms to build analytical models.
In Intellipaat self-paced training program you will receive recorded sessions, course material, Quiz, related software’s and assignments.The courses are designed such that you will get real world exposure and focused on clearing relevant certification exam. After completion of training you can take quiz which enable you to check your knowledge and enables you to clear relevant certification at higher marks/grade also you will be able to work on the technology independently.
In Self-paced courses trainer is not available whereas in Online training trainer will be available for answering queries at the same time. In self-paced course we provide email support for doubt clearance or any query related to training also if you face some unexpected challenges we will arrange live class with trainer.
All Courses are highly interactive to provide good exposure. You can learn at your own place and at your leisure time. Prices of self-paced is training is 75% cheaper than online training. You will have lifetime access hence you can refer it anytime during your project work or job.
Yes, at the top of the page of course details you can see sample videos.
As soon as you enroll to the course, your LMS (The Learning Management System) Access will be Functional. You will immediately get access to our course content in the form of a complete set of previous class recordings, PPTs, PDFs, assignments and access to our 24×7 support team. You can start learning right away.
24/7 access to video tutorials and Email Support along with online interactive session support with trainer for issue resolving.
Yes, You can pay difference amount between Online training and Self-paced course and you can be enrolled in next online training batch.
Yes, we will provide you the links of the software to download which are open source and for proprietary tools we will provide you trail version if available.
Please send an email . You can also chat with us to get an instant solution.
Intellipaat verified certificates will be awarded based on successful completion of course projects. There are set of quizzes after each couse module that you need to go through . After successful submission, official Intellipaat verified certificate will be given to you.
Towards the end of the Course, you will have to work on a Training project. This will help you understand how the different components of course are related to each other.
Classes are conducted via LIVE Video Streaming, where you get a chance to meet the instructor by speaking, chatting and sharing your screen. You will always have the access to videos and PPT. This would give you a clear insight about how the classes are conducted, quality of instructors and the level of Interaction in the Class.
Yes, We do keep launching multiple offers, please see offer page.
We will help you with the issue and doubts regarding the course. You can attempt the quiz again.
Training in Cities: Bangalore, Hyderabad, Chennai, Delhi, Kolkata, UK, London, Chicago, San Francisco, Dallas, Washington, New York, Orlando, Boston