R Case Studies
Logistic Regression Case Study
In this case study you will analyze a company's advertising spend to understand how it drives sales. You will use logistic regression in R to detect patterns, uncover insights and predict outcomes, so that future advertising spend can be planned and optimized for higher revenue.
Multiple Regression Case Study
You will learn how to predict the miles per gallon (MPG) of a car from various parameters. You will build a multiple regression model that relates MPG to car make, model, speed, load conditions, etc. The case study covers model building, model diagnostics and checking the ROC curve, among other things.
Receiver Operating Characteristic (ROC) Case Study
You will work with various data sets in R, apply data exploration methods, build scalable models, predict outcomes with the highest possible precision, diagnose the models you have built against real-world data, check the ROC curve and more.
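As a minimal sketch of what "checking the ROC curve" involves, the example below sweeps a decision threshold over predicted scores and computes the area under the resulting curve. It is written in Python rather than R purely as a language-neutral illustration; the function names `roc_points` and `auc` and the sample data are assumptions, not part of the case study.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs for every threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Illustrative data: 1 = positive class, scores are model probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 4))
```

An area close to 1.0 suggests the model ranks positives above negatives almost always, while 0.5 corresponds to random guessing.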
What Hadoop Projects Will You Be Working On?
Project 1 : Working with MapReduce, Hive, Sqoop
Industry : General
Problem Statement : How to successfully import data using Sqoop into HDFS for data analysis.
Topics : As part of this project you will work with Hadoop components such as MapReduce, Apache Hive and Apache Sqoop. You will use Sqoop to import data from a relational database management system such as MySQL into HDFS, and deploy Hive for summarizing, querying and analyzing the data. You will write SQL-style queries in HiveQL, which Hive runs as MapReduce jobs on the transferred data. You will gain considerable proficiency in Hive and Sqoop after completing this project.
- Sqoop data transfer from RDBMS to Hadoop
- Coding in Hive Query Language
- Data querying and analysis.
Project 2 : Work on MovieLens data for finding top movies
Industry : Media and Entertainment
Problem Statement : How to create the top ten movies list using the MovieLens data.
Topics : In this project you will work exclusively on the publicly available MovieLens rating data sets. The project involves writing a MapReduce program to analyze the MovieLens data and create a list of the top ten movies. You will also use Apache Pig and Apache Hive to work with distributed data sets and analyze them.
- MapReduce program for working on the data file
- Apache Pig for analyzing data
- Apache Hive data warehousing and querying
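The map and reduce steps behind a top-movies list can be sketched in plain Python, without Hadoop. This is only an illustration under stated assumptions: each input line is assumed to be tab-separated as `user_id, movie_id, rating, timestamp`, and the helper names (`map_rating`, `reduce_ratings`, `top_movies`) and the `min_ratings` cut-off are hypothetical.

```python
from collections import defaultdict


def map_rating(line):
    """Map step: turn one rating line into a (movie_id, rating) pair."""
    _user_id, movie_id, rating, _timestamp = line.rstrip("\n").split("\t")
    return movie_id, float(rating)


def reduce_ratings(pairs):
    """Reduce step: group ratings by movie and compute (average, count)."""
    grouped = defaultdict(list)
    for movie_id, rating in pairs:
        grouped[movie_id].append(rating)
    return {m: (sum(r) / len(r), len(r)) for m, r in grouped.items()}


def top_movies(lines, n=10, min_ratings=1):
    """Return the n best movies as (movie_id, average, count) tuples."""
    stats = reduce_ratings(map_rating(line) for line in lines)
    eligible = [(m, avg, cnt) for m, (avg, cnt) in stats.items()
                if cnt >= min_ratings]
    # Highest average first; ties broken by rating count, then by movie id.
    eligible.sort(key=lambda t: (-t[1], -t[2], t[0]))
    return eligible[:n]


sample = ["1\t10\t5\t0", "2\t10\t4\t0", "1\t20\t3\t0", "2\t30\t5\t0"]
print(top_movies(sample, n=2))
```

Raising `min_ratings` is one way to keep movies with only a handful of ratings from dominating the list.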
Project 3 : Hadoop YARN Project – End to End PoC
Industry : Banking
Problem Statement : How to bring the daily (incremental) data into the Hadoop Distributed File System.
Topics : In this project, transaction data is recorded and stored daily in an RDBMS. This data is transferred every day into HDFS for further Big Data analytics. You will work on a live Hadoop YARN cluster. YARN is part of the Hadoop 2.0 ecosystem; it decouples Hadoop's resource management from MapReduce, so that other processing engines and a wider array of applications can run on the same cluster. You will work with the central YARN ResourceManager.
- Using Sqoop commands to bring the data into HDFS
- End to End flow of transaction data
- Working with the data from HDFS
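The idea behind a daily incremental import can be sketched as follows: only rows whose check column exceeds the last imported value are transferred, and that value is then advanced. Sqoop's append-style incremental import follows the same principle, but the code below is a plain-Python illustration, not Sqoop itself; the column name `txn_id` and the function `incremental_import` are hypothetical.

```python
def incremental_import(source_rows, last_value, check_column="txn_id"):
    """Return the rows added since last_value and the new high-water mark."""
    new_rows = [row for row in source_rows if row[check_column] > last_value]
    # If nothing is new, keep the previous high-water mark unchanged.
    new_last = max((row[check_column] for row in new_rows), default=last_value)
    return new_rows, new_last


# Day 1: the table holds three transactions and nothing has been imported yet.
table = [{"txn_id": 1, "amount": 50}, {"txn_id": 2, "amount": 75},
         {"txn_id": 3, "amount": 20}]
day1_rows, last_value = incremental_import(table, last_value=0)

# Day 2: two more transactions arrive; only those two are transferred.
table += [{"txn_id": 4, "amount": 10}, {"txn_id": 5, "amount": 99}]
day2_rows, last_value = incremental_import(table, last_value)
print(len(day1_rows), len(day2_rows), last_value)
```

Tracking a high-water mark this way assumes the check column only ever increases, which holds for auto-incrementing keys but not for rows that are updated in place.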
Project 4 : Table Partitioning in Hive
Industry : Banking
Problem Statement : How to improve the query speed using Hive data partitioning.
Topics : This project involves working with Hive table data partitioning. Choosing the right partitioning helps to read the data, deploy it on HDFS and run MapReduce jobs at a much faster rate. Hive lets you partition data in multiple ways. This project will give you hands-on experience in partitioning Hive tables manually, using a single SQL statement for dynamic partitioning, and bucketing data to break it into manageable chunks.
- Manual Partitioning
- Dynamic Partitioning
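Why partitioning speeds up queries can be shown with a small plain-Python model. Rows are grouped into `column=value` partitions, mirroring the directory naming Hive uses, and a query that filters on the partition column only scans the matching partition. The functions `write_partitioned` and `query_partition` and the sample columns are hypothetical; this is a conceptual sketch, not Hive code.

```python
from collections import defaultdict


def write_partitioned(rows, partition_key):
    """Group rows by the value of partition_key, one 'key=value' bucket each."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{partition_key}={row[partition_key]}"].append(row)
    return dict(partitions)


def query_partition(partitions, partition_key, value):
    """Return matching rows plus the number of rows that had to be scanned."""
    rows = partitions.get(f"{partition_key}={value}", [])
    return rows, len(rows)


transactions = [
    {"txn_id": 1, "txn_date": "2017-01-01"},
    {"txn_id": 2, "txn_date": "2017-01-01"},
    {"txn_id": 3, "txn_date": "2017-01-02"},
]
parts = write_partitioned(transactions, "txn_date")
rows, scanned = query_partition(parts, "txn_date", "2017-01-02")
print(sorted(parts), scanned)
```

Here the partition for each row is derived from the data itself, which is the idea behind dynamic partitioning; with manual partitioning the target partition would be named explicitly for each load.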
Project 5 : Connecting Pentaho with Hadoop Ecosystem
Industry : Social Network
Problem Statement : How to deploy ETL for data analysis activities.
Topics : This project lets you connect Pentaho with the Hadoop ecosystem. Pentaho works well with HDFS, HBase, Oozie and ZooKeeper. You will connect the Hadoop cluster with Pentaho data integration, analytics, the Pentaho server and the report designer. This project will give you complete working knowledge of the Pentaho ETL tool.
- Working knowledge of ETL and Business Intelligence
- Configuring Pentaho to work with Hadoop Distribution
- Extracting, transforming and loading data into the Hadoop cluster
Project 6 : Multi-node cluster setup
Industry : General
Problem Statement : How to set up a Hadoop real-time cluster on Amazon EC2.
Topics : This project gives you the opportunity to work on a real-world Hadoop multi-node cluster setup in a distributed environment. You will get a complete demonstration of working with the master and slave nodes of a Hadoop cluster, installing Java as a prerequisite for running Hadoop, installing Hadoop and mapping the nodes in the Hadoop cluster.
- Hadoop installation and configuration
- Running multi-node Hadoop on a 4-node cluster on Amazon EC2
- Deploying a MapReduce job on the Hadoop cluster.
Project 7 : Hadoop Testing using MRUnit
Industry : General
Problem Statement : How to test MapReduce applications
Topics : In this project you will gain proficiency in testing Hadoop MapReduce code using MRUnit. You will learn about real-world scenarios for deploying MRUnit, Mockito and PowerMock. This will give you hands-on experience with the various testing tools for Hadoop MapReduce. After completing this project you will be well versed in test-driven development and will be able to write lightweight unit tests that work specifically on the Hadoop architecture.
- Writing JUnit tests using MRUnit for MapReduce applications
- Mocking static methods using PowerMock & Mockito
- MapReduce Driver for testing the map and reduce pair
Project 8 : Hadoop Weblog Analytics
Industry : Internet services
Problem Statement : How to derive insights from web log data
Topics : This project is about making sense of web log data in order to derive valuable insights from it. You will load the server data onto a Hadoop cluster using various techniques. Web log data can include the URLs visited, cookie data, user demographics, location, and the date and time of web service access. You will transport the data using Apache Flume or Kafka, and handle workflow and data cleansing using MapReduce, Pig or Spark. The insights derived can be used to analyze customer behavior and predict buying patterns.
- Aggregation of log data
- Apache Flume for data transportation
- Processing of data and generating analytics
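The cleansing and aggregation steps can be sketched on a single machine as follows. The example assumes the web server writes logs in the Common Log Format, which the project description does not specify; the pattern `LOG_PATTERN` and the helpers `parse_line` and `hits_per_url` are hypothetical names for illustration.

```python
import re
from collections import Counter

# Assumed layout: host ident user [time] "METHOD url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if malformed."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def hits_per_url(lines):
    """Count requests per URL, silently dropping malformed lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # data cleansing: skip lines that do not parse
        counts[record["url"]] += 1
    return counts


logs = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.7 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    'corrupted entry',
    '10.0.0.7 - - [10/Oct/2000:13:57:12 -0700] "GET /cart HTTP/1.0" 404 512',
]
print(hits_per_url(logs).most_common())
```

On a cluster the same parse-then-count logic would typically be expressed as a MapReduce, Pig or Spark job so that it scales across many log files.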
Project 9 : Hadoop Maintenance
Industry : General
Problem Statement : How to administer a Hadoop cluster
Topics : This project involves maintaining and managing a Hadoop cluster. You will work on a number of important tasks, including recovering data, recovering from failure, adding machines to and removing them from the Hadoop cluster, and onboarding users on Hadoop.
- Working with name node directory structure
- Audit logging, data node block scanner, balancer.
- Failover, fencing, DISTCP, Hadoop file formats.
Project 10 : Twitter Sentiment Analysis
Industry : Social Media
Problem Statement : Find out how people reacted to India's demonetization move by analyzing their tweets.
Description : This project involves analyzing what people are saying in their tweets about the demonetization decision taken by the Indian government. You look for key words and phrases and analyze them using a dictionary that assigns each word a value based on the sentiment it conveys.
- Download the Tweets & Load into Pig Storage
- Divide tweets into words to calculate sentiment
- Rating the words from +5 to -5 using the AFINN dictionary
- Filtering the Tweets and analyzing sentiment.
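The word-splitting and scoring steps above can be sketched in a few lines. The lexicon below is a tiny illustrative stand-in with made-up scores in the same -5 to +5 range; it is not copied from the real AFINN list, and the function names `tokenize`, `tweet_score` and `classify` are hypothetical.

```python
import re

# Illustrative scores only; a real run would load the full AFINN word list.
LEXICON = {"good": 3, "great": 3, "support": 2,
           "bad": -3, "terrible": -3, "chaos": -2}


def tokenize(tweet):
    """Lower-case the tweet and split it into words."""
    return re.findall(r"[a-z']+", tweet.lower())


def tweet_score(tweet, lexicon=LEXICON):
    """Sum the scores of all known words; unknown words count as 0."""
    return sum(lexicon.get(word, 0) for word in tokenize(tweet))


def classify(tweet, lexicon=LEXICON):
    """Label a tweet by the sign of its total score."""
    score = tweet_score(tweet, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for text in ["Great move, good for the economy!",
             "Terrible chaos at the banks",
             "Queue at the ATM"]:
    print(tweet_score(text), classify(text))
```

Summing word scores ignores negation and sarcasm, so results from this kind of dictionary approach are best read as a rough aggregate signal rather than a per-tweet verdict.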
Project 11 : Analyzing IPL T20 Cricket
Industry : Sports & Entertainment
Problem Statement : Analyze the entire cricket match and get answers to any question regarding the details of the match.
Description : This project involves working with the IPL dataset that has information regarding batting, bowling, runs scored, wickets taken, and more. This dataset is taken as input and then it is processed so that the entire match can be analyzed based on the user queries or needs.
- Load the data into HDFS
- Analyze the data using Apache Pig or Hive
- Producing the right output for user queries
Apache Spark Projects
Project 1 – Movie Recommendation
Industry : Entertainment
Problem Statement : How to recommend the most appropriate movie to a user based on their taste
Topics : This is a hands-on Apache Spark project applied to the real-world problem of movie recommendation. It helps you gain essential knowledge of Spark MLlib, Spark's machine learning library. You will learn how to apply collaborative filtering, regression, clustering and dimensionality reduction using Spark MLlib. Upon finishing the project you will have first-hand experience in Apache Spark streaming data analysis, sampling, testing and statistics, among other vital skills.
- Apache Spark MLlib component
- Statistical analysis
- Regression & clustering
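To make the idea of collaborative filtering concrete, here is a small neighbourhood-based sketch in plain Python. Note that this swaps in a simpler technique: Spark MLlib's collaborative filtering is based on ALS matrix factorization, whereas the code below weights other users' ratings by cosine similarity. The sample ratings and the function names `cosine` and `recommend` are assumptions for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two users' {movie: rating} dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, n=3):
    """Rank unseen movies by the similarity-weighted average of others' ratings."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for movie, rating in other_ratings.items():
            if movie in ratings[user]:
                continue  # skip movies the user has already rated
            scores[movie] = scores.get(movie, 0.0) + sim * rating
            weights[movie] = weights.get(movie, 0.0) + sim
    predicted = {m: scores[m] / weights[m] for m in scores}
    return sorted(predicted, key=lambda m: (-predicted[m], m))[:n]


ratings = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 4, "C": 5, "D": 1},
    "carol": {"A": 1, "D": 5},
}
print(recommend(ratings, "alice"))
```

Because bob's tastes match alice's far more closely than carol's do, his high rating for movie C outweighs carol's enthusiasm for movie D.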
Project 2 – Twitter API Integration for Tweet Analysis
Industry : Social Media
Problem Statement : Analyzing the user sentiment based on the tweet
Topics : This is a hands-on Twitter analysis project that uses the Twitter API to analyze tweets. You will integrate the Twitter API and program in Python or PHP to develop the essential server-side code. Finally, you will be able to read the results of various operations by filtering, parsing and aggregating them according to the tweet analysis requirement.
- Making requests to Twitter API
- Building the server-side code
- Filtering, parsing & aggregating data
Project 3 – Data Exploration Using Spark SQL – Wikipedia data set
Industry : Internet
Problem Statement : Making sense of Wikipedia data using Spark SQL.
Topics : In this project you will use Spark SQL to analyze the Wikipedia data. You will gain hands-on experience in integrating Spark SQL into various applications such as batch analysis, machine learning, data visualization and processing, and ETL processes, along with real-time analysis of data.
- Machine learning using Spark
- Deploying data visualization
- Spark SQL integration