Hadoop 2.x Cluster Architecture, Federation and High Availability, A Typical Production Cluster setup, Hadoop Cluster Modes, Common Hadoop Shell Commands, Hadoop 2.x Configuration Files, Cloudera Single node cluster, Hive, Pig, Sqoop, Flume, Scala and Spark.
Introducing Big Data & Hadoop, what is Big Data and where does Hadoop fits in, two important Hadoop ecosystem componentsnamely Map Reduce and HDFS, in-depth Hadoop Distributed File System – Replications, Block Size, Secondary Name node, High Availability, in-depth YARN – Resource Manager, Node Manager.
Hands-on Exercise – Working with HDFS, replicating the data, determining block size, familiarizing with Namenode and Datanode.
Detailed understanding of the working of MapReduce, the mapping and reducing process, the working of Driver, Combiners, Partitioners, Input Formats, Output Formats, Shuffle and Sor
Hands-on Exercise – The detailed methodology for writing the Word Count Program in MapReduce, writing custom partitioner, MapReduce with Combiner, Local Job Runner Mode, Unit Test, ToolRunner, MapSide Join, Reduce Side Join, Using Counters, Joining two datasets using Map-Side Join &Reduce-Side Join
Introducing Hadoop Hive, detailed architecture of Hive, comparing Hive with Pig and RDBMS, working with Hive Query Language, creation of database, table, Group by and other clauses, the various types of Hive tables, Hcatalog, storing the Hive Results, Hive partitioning and Buckets.
Hands-on Exercise – Creating of Hive database, how to drop database, changing the database, creating of Hive table, loading of data, dropping the table and altering it, writing hive queries to pull data using filter conditions, group by clauses, partitioning Hive tables
The indexing in Hive, the Map side Join in Hive, working with complex data types, the Hive User-defined Functions, Introduction to Impala, comparing Hive with Impala, the detailed architecture of Impala
Hands-on Exercise – Working with Hive queries, writing indexes, joining table, deploying external table, sequence table and storing data in another table.
Apache Pig introduction, its various features, the various data types and schema in Hive, the available functions in Pig, Hive Bags, Tuples and Fields.
Hands-on Exercise – Working with Pig in MapReduce and local mode, loading of data, limiting data to 4 rows, storing the data into file, working with Group By,Filter By,Distinct,Cross,Split in Hive.
Introduction to Apache Sqoop, Sqoop overview, basic imports and exports, how to improve Sqoop performance, the limitation of Sqoop, introduction to Flume and its Architecture, introduction to HBase, the CAP theorem.
Hands-on Exercise – Working with Flume to generating of Sequence Number and consuming it, using the Flume Agent to consume the Twitter data, using AVRO to create Hive Table, AVRO with Pig, creating Table in HBase, deploying Disable, Scan and Enable Table.
Using Scala for writing Apache Spark applications, detailed study of Scala, the need for Scala, the concept of object oriented programing, executing the Scala code, the various classes in Scala like Getters,Setters, Constructors, Abstract ,Extending Objects, Overriding Methods, the Java and Scala interoperability, the concept of functional programming and anonymous functions, Bobsrockets package, comparing the mutable and immutable collections.
Hands-on Exercise – Writing Spark application using Scala, understanding the robustness of Scala for Spark real-time analytics operation.
Detailed Apache Spark, its various features, comparing with Hadoop, the various Spark components, combining HDFS with Spark, Scalding, introduction to Scala, importance of Scala and RDD.
Hands-on Exercise – The Resilient Distributed Dataset in Spark and how it helps to speed up big data processing.
The RDD operation in Spark, the Spark transformations, actions, data loading, comparing with MapReduce, Key Value Pair.
Hands-on Exercise – How to deploy RDD with HDFS, using the in-memory dataset, using file for RDD, how to define the base RDD from external file, deploying RDD via transformation, using the Map and Reduce functions, working on word count and count log severity.
The detailed Spark SQL, the significance of SQL in Spark for working with structured data processing, Spark SQL JSON support, working with XML data, and parquet files, creating HiveContext, writing Data Frame to Hive, reading of JDBC files, the importance of Data Frames in Spark, creating Data Frames, schema manual inferring, working with CSV files, reading of JDBC tables, converting from Data Frame to JDBC, the user-defined functions in Spark SQL, shared variable and accumulators, how to query and transform data in Data Frames, how Data Frame provides the benefits of both Spark RDD and Spark SQL, deploying Hive on Spark as the execution engine.
Hands-on Exercise – Data querying and transformation using Data Frames, finding out the benefits of Data Frames over Spark SQL and Spark RDD.
Different Algorithms, the concept of iterative algorithm in Spark, analyzing with Spark graph processing, introduction to K-Means and machine learning, various variables in Spark like shared variables, broadcast variables, learning about accumulators.
Hands-on Exercise – Writing spark code using Mlib.
Introduction to Spark streaming, the architecture of Spark Streaming, working with the Spark streaming program, processing data using Spark streaming, requesting count and Dstream, multi-batch and sliding window operations and working with advanced data sources.
Hands-on Exercise – Deploying Spark streaming for data in motion and checking the output is as per the requirement.
Create a four node Hadoop cluster setup, running the MapReduce Jobs on the Hadoop cluster, successfully running the MapReduce code, working with the Cloudera Manager setup.
Hands-on Exercise – The method to build a multi-node Hadoop cluster using an Amazon EC2 instance, working with the Cloudera Manager.
The overview of Hadoop configuration, the importance of Hadoop configuration file, the various parameters and values of configuration, the HDFS parameters and MapReduce parameters, setting up the Hadoop environment, the Include’ and Exclude configuration files, the administration and maintenance of Name node, Data node directory structures and files, File system image and Edit log
Hands-on Exercise – The method to do performance tuning of MapReduce program.
Introduction to the Checkpoint Procedure, Name node failure and how to ensure the recovery procedure, Safe Mode, Metadata and Data backup, the various potential problems and solutions, what to look for, how to add and remove nodes.
Hands-on Exercise – How to go about ensuring the MapReduce File system Recovery for various different scenarios, JMX monitoring of the Hadoop cluster, how to use the logs and stack traces for monitoring and troubleshooting, using the Job Scheduler for scheduling jobs in the same cluster, getting the MapReduce job submission flow, FIFO schedule, getting to know the Fair Scheduler and its configuration.
Advanced Hadoop administration functions, using the Quorum Journal Manager, configuring the Hadoop federation and security, fundamentals of the Hadoop Platform Security, working with Kerberos authentication, configuring Kerberos on Hadoop cluster.
Hands-on Exercise – Detailed procedure for configuring the Kerberos authentication with the Hadoop cluster and checking the results of the configuration.
How ETL tools work in Big data Industry, Introduction to ETL and Data warehousing. Working with prominent use cases of Big data in ETL industry, End to End ETL PoC showing big data integration with ETL tool.
Hands-on Exercise – Connecting to HDFS from ETL tool and moving data from Local system to HDFS, Moving Data from DBMS to HDFS, Working with Hive with ETL Tool, Creating Map Reduce job in ETL tool
Working towards the solution of the Hadoop IBM project solution, its problem statements and the possible solution outcomes, preparing for the Cloudera Certifications, points to focus for scoring the highest marks, tips for cracking Hadoop interview questions.
Hands-on Exercise – The IBM project of a real-world high value Big Data Hadoop application and getting the right solution based on the criteria set by the IBM team.
Why testing is important, Unit testing, Integration testing, Performance testing, Diagnostics, Nightly QA test, Benchmark and end to end tests, Functional testing, Release certification testing, Security testing, Scalability Testing, Commissioning and Decommissioning of Data Nodes Testing, Reliability testing, Release testing
Understanding the Requirement, preparation of the Testing Estimation, Test Cases, Test Data, Test bed creation, Test Execution, Defect Reporting, Defect Retest, Daily Status report delivery, Test completion, ETL testing at every stage (HDFS, HIVE, HBASE) while loading the input (logs/files/records etc) using sqoop/flume which includes but not limited to data verification, Reconciliation, User Authorization and Authentication testing (Groups, Users, Privileges etc), Report defects to the development team or manager and driving them to closure, Consolidate all the defects and create defect reports, Validating new feature and issues in Core Hadoop.
Report defects to the development team or manager and driving them to closure, Consolidate all the defects and create defect reports, Responsible for creating a testing Framework called MR Unit for testing of Map-Reduce programs.
Automation testing using the OOZIE, Data validation using the query surge tool.
Test plan for HDFS upgrade, Test automation and result
How to test install and configure
Introducing Scala and deployment of Scala for Big Data applications and Apache Spark analytics.
The importance of Scala, the concept of REPL (Read Evaluate Print Loop), deep dive into Scala pattern matching, type interface, higher order function, currying, traits, application space and Scala for data analysis.
Learning about the Scala Interpreter, static object timer in Scala, testing String equality in Scala, Implicit classes in Scala, the concept of currying in Scala, various classes in Scala.
Learning about the Classes concept, understanding the constructor overloading, the various abstract classes, the hierarchy types in Scala, the concept of object equality, the val and var methods in Scala.
Understanding Sealed traits, wild, constructor, tuple, variable pattern, and constant pattern.
Understanding traits in Scala, the advantages of traits, linearization of traits, the Java equivalent and avoiding of boilerplate code.
Implementation of traits in Scala and Java, handling of multiple traits extending.
Introduction to Scala collections, classification of collections, the difference between Iterator, and Iterable in Scala, example of list sequence in Scala.
The two types of collections in Scala, Mutable and Immutable collections, understanding lists and arrays in Scala, the list buffer and array buffer, Queue in Scala, double-ended queue Deque, Stacks, Sets, Maps, Tuples in Scala.
Introduction to Scala packages and imports, the selective imports, the Scala test classes, introduction to JUnit test class, JUnit interface via JUnit 3 suite for Scala test, packaging of Scala applications in Directory Structure, example of Spark Split and Spark Scala.
Introduction to Spark, how Spark overcomes the drawbacks of working MapReduce, understanding in-memory MapReduce, Spark Hadoop YARN, HDFS Revision, YARN Revision, the overview of Spark and how it is better Hadoop, deploying Spark without Hadoop.
Spark installation guide, working with Spark Shell, the concept of Resilient Distributed Datasets (RDD), learning to do functional programming in Spark, the architecture of Spark.
Deep dive into Spark RDDs, the RDD general operations, a read-only partitioned collection of records, using the concept of RDD for faster and efficient data processing.
Understanding the concept of Key-Value pair in RDDs, learning how Spark makes MapReduce operations faster, various operations of RDD.
Comparing the Spark applications with Spark Shell, creating a Spark application using Scala or Java, deploying a Spark application, the web user interface of Spark application, a real world example of Spark and configuring of Spark.
Learning about Spark parallel processing, deploying on a cluster, introduction to Spark partitions, file-based partitioning of RDDs, understanding of HDFS and data locality, mastering the technique of parallel operations.
Understanding the RDD persistence overview, distributed persistence, RDD lineage
Understanding the Spark streaming, creating a Spark stream application, processing of Spark stream, streaming request count and DStreams.
Introduction to Spark multi-batch operations, state operations, sliding window operations and advanced data sources.
Learning about the Spark common use cases, the concept of iterative algorithm in Spark, analyzing with Spark graph processing, introduction to K-Means and machine learning.
Introduction to various variables in Spark like shared variables, broadcast variables, learning about accumulators, the common performance issues and troubleshooting the performance problems.
Learning about Spark SQL, the context of SQL in Spark for providing structured data processing, understanding the Data Frames in Spark, learning to query and transform data in Data Frames, how Data Frame provides the benefit of both Spark RDD and Spark SQL, deploying Hive on Spark as the execution engine.
Learning about the scheduling and partitioning in Spark, scheduling within and around applications, static partitioning, dynamic sharing, fair scheduling, Spark master high availability, standby Masters with Zookeeper, Single Node Recovery With Local File System, High Order Functions.
Understanding how to design capacity planning in Spark, creation of Maps, Transformations, the concept of concurrency in Java and Scala.
Understanding about log analysis with Spark, first log analyzers in Spark, working with various buffers like array, compact and protocol buffer.
Big Data characteristics, understanding Hadoop distributed computing, the Bayesian Law, deploying Storm for real time analytics, the Apache Storm features, comparing Storm with Hadoop, Storm execution, learning about Tuple, Spout, Bolt.
Installing the Apache Storm, various types of run modes of Storm.
Understanding Apache Storm and the data model.
Installation of Apache Kakfa and its configuration.
Understanding of advanced Storm topics like Spouts, Bolts, Stream Groupings, Topology and its Life cycle, learning about Guaranteed Message Processing.
Various Grouping types in Storm, reliable and unreliable messages, Bolt structure and life cycle, understanding Trident topology for failure handling, process, Call Log Analysis Topology for analyzing call logs for calls made from one number to another.
Understanding of Trident Spouts and its different types, the various Trident Spout interface and components, familiarizing with Trident Filter, Aggregator and Functions, a practical and hands-on use case on solving call log problem using Storm Trident.
Various components, classes and interfaces in storm like – Base Rich Bolt Class, i RichBolt Interface, i RichSpout Interface, Base Rich Spout class and the various methodology of working with them.
Understanding Cassandra, its core concepts, its strengths and deployment.
Twitter Boot Stripping, detailed understanding of Boot Stripping, concepts of Storm, Storm Development Environment.
Introduction to Data Science, Use cases, Need of Business Analytics, Data Science Life Cycle, Different tools available for Data Science
Installing R and R-Studio, R packages, R Operators, if statements and loops (for, while, repeat, break, next), switch case
Importing and Exporting data from external source, Data exploratory analysis, R Data Structure (Vector, Scalar, Matrices, Array, Data frame, List), Functions, Apply Functions
Bar Graph (Simple, Grouped, Stacked), Histogram, Pi Chart, Line Chart, Box (Whisker) Plot, Scatter Plot, Correlogram
Terminologies of Statistics ,Measures of Centers, Measures of Spread, Probability, Normal Distribution, Binary Distribution, Hypothesis Testing, Chi Square Test, ANOVA
Supervised Learning – Linear Regression ,Bivariate Regression, Multiple Regression Analysis, Correlation( Positive, negative and neutral), Industrial Case Study, Machine Learning Use-Cases, Machine Learning Process Flow, Machine Learning Categories
What is Classification and its use cases?, What is Decision Tree?, Algorithm for Decision Tree Induction, Creating a Perfect Decision Tree, Confusion Matrix
Random Forest, What is Naive Bayes?
R language for statistical programming, the various features of R, introduction to R Studio, the statistical packages, familiarity with different data types and functions, learning to deploy them in various scenarios, use SQL to apply ‘join’ function, components of R Studio like code editor, visualization and debugging tools, learn about R-bind.
R Functions, code compilation and data in well-defined format called R-Packages, learn about R-Package structure, Package metadata and testing, CRAN (Comprehensive R Archive Network), Vector creation and variables values assignment.
R functionality, Rep Function, generating Repeats, Sorting and generating Factor Levels, Transpose and Stack Function.
Introduction to matrix and vector in R, understanding the various functions like Merge, Strsplit, Matrix manipulation, rowSums, rowMeans, colMeans, colSums, sequencing, repetition, indexing and other functions.
Understanding subscripts in plots in R, how to obtain parts of vectors, using subscripts with arrays, as logical variables, with lists, understanding how to read data from external files.
Generate plot in R, Graphs, Bar Plots, Line Plots, Histogram, components of Pie Chart.
Understanding Analysis of Variance (ANOVA) statistical technique, working with Pie Charts, Histograms, deploying ANOVA with R, one way ANOVA, two way ANOVA.
K-Means Clustering for Cluster & Affinity Analysis, Cluster Algorithm, cohesive subset of items, solving clustering issues, working with large datasets, association rule mining affinity analysis for data mining and analysis and learning co-occurrence relationships.
Introduction to Association Rule Mining, the various concepts of Association Rule Mining, various methods to predict relations between variables in large datasets, the algorithm and rules of Association Rule Mining, understanding single cardinality.
Understanding what is Simple Linear Regression, the various equations of Line, Slope, Y-Intercept Regression Line, deploying analysis using Regression, the least square criterion, interpreting the results, standard error to estimate and measure of variation.
Scatter Plots, Two variable Relationship, Simple Linear Regression analysis, Line of best fit
Deep understanding of the measure of variation, the concept of co-efficient of determination, F-Test, the test statistic with an F-distribution, advanced regression in R, prediction linear regression.
Logistic Regression Mean, Logistic Regression in R.
Advanced logistic regression, understanding how to do prediction using logistic regression, ensuring the model is accurate, understanding sensitivity and specificity, confusion matrix, what is ROC, a graphical plot illustrating binary classifier system, ROC curve in R for determining sensitivity/specificity trade-offs for a binary classifier.
Detailed understanding of ROC, area under ROC Curve, converting the variable, data set partitioning, understanding how to check for multicollinearlity, how two or more variables are highly correlated, building of model, advanced data set partitioning, interpreting of the output, predicting the output, detailed confusion matrix, deploying the Hosmer-Lemeshow test for checking whether the observed event rates match the expected event rates.
Data analysis with R, understanding the WALD test, MC Fadden’s pseudo R-squared, the significance of the area under ROC Curve, Kolmogorov Smirnov Chart which is non-parametric test of one dimensional probability distribution.
Connecting to various databases from the R environment, deploying the ODBC tables for reading the data, visualization of the performance of the algorithm using Confusion Matrix.
Creating an integrated environment for deploying R on Hadoop platform, working with R Hadoop, RMR package and R Hadoop Integrated Programming Environment, R programming for MapReduce jobs and Hadoop execution.
Logistic Regression Case Study
In this case study you will get a detailed understanding of the advertisement spends of a company that will help to drive more sales. You will deploy logistic regression to forecast the future trends, detect patterns, uncover insights and more all through the power of R programming. Due to this the future advertisement spends can be decided and optimized for higher revenues.
Multiple Regression Case Study
You will understand how to compare the miles per gallon (MPG) of a car based on the various parameters. You will deploy multiple regression and note down the MPG for car make, model, speed, load conditions, etc. It includes the model building, model diagnostic, checking the ROC curve, among other things.
Receiver Operating Characteristic (ROC) case study
You will work with various data sets in R, deploy data exploration methodologies, build scalable models, predict the outcome with highest precision, diagnose the model that you have created with various real world data, check the ROC curve and more.
Introduction to Base SAS, Installation of SAS tool, Getting started with SAS, various SAS Windows – Log, Explorer, Output, Search, Editor, etc. working with data sets, overview of SAS Functions, Library Types and programming files
Import/Export Raw Data files, reading and sub setting the data set, various statements like WHERE, SET, Merge
Hands-on Exercise – Import Excel file in workspace, Read data, Export the workspace to save data
Various SAS Operators – Arithmetic, Logical, Comparison, various SAS Functions – NUMERIC, CHARACTER, IS NULL, CONTAINS, LIKE, Input/Put, Date/Time, Conditional Statements (Do While, Do Until, If, Else)
Hands-on Exercise – Apply logical, arithmetic operators and SAS functions to perform operations
Understanding about Input Buffer, PDV (Backend), learning what is Missover
Defining and Using KEEP and DROP statements, apply these statements, Format and Labels in SAS.
Hands-on Exercise – Use KEEP and DROP statements
Understanding Delimiter, dataline rules, DLM, Delimiter DSD, raw data files and execution, list input for standard data.
Hands-on Exercise – Use delimiter rules on raw data files
The various SAS standard Procedures built-in for popular programs – PROC SORT, PROC FREQ, PROC SUMMARY, PROC RANK, PROC EXPORT, PROC DATASET, PROC TRANSPOSE, , PROC CORR etc.
Hands-on Exercise – Use SORT, FREQ, SUMMARY, EXPORT and other procedures
Reading standard and non-standard numeric inputs with Formatted inputs, Column Pointer Controls, Controlling while a record loads, Line pointer control / Absolute line pointer control, Single Trailing , Multiple IN and OUT statements, DATA LINES statement and rules, List Input Method, comparing Single Trailing and Double Trailing.
Hands-on Exercise – Read standard and non-standard numeric inputs with Formatted inputs, Control while a record loads, Control a Line pointer, Write Multiple IN and OUT statements
SAS FORMAT statements – standard and user-written, associating a format with a variable, working with SAS FORMAT, deploying it on PROC Data sets, comparing ATTRIB and FORMAT statements.
Hands-on Exercise – Format a variable, deploy format rule on PROC DATA set, Use ATTRIB statement
Understanding PROC GCHART, various Graphs, Bar Charts – Pie, Bar, 3D, plotting variables with PROC GPLOT.
Hands-on Exercise – Plot graphs using PROC GPLOT Display charts using PROC GCHART
SAS advanced data discovery and visualization, point-and-click analytics capabilities, powerful reporting tools.
Character Functions, Numeric Functions, Converting Variable Type.
Hands-on Exercise – Use Functions in data transformation
Introduction to ODS, Data Optimization, How to generate files (rtf, pdf, html, doc) using SAS
Hands-on Exercise – Optimize data, generate rtf, pdf, html and doc files
Macro Syntax, Macro Variables, Positional Parameters in a Macro, Macro Step
Hands-on Exercise – Write a macro, Use positional parameters
SQL Statements in SAS, SELECT, CASE, JOIN, UNION, Sorting Data
Hands-on Exercise – Create sql query to select and add a condition
Use a CASE in select query
Base SAS web-based interface and ready-to-use programs, advanced data manipulation, storage and retrieval, descriptive statistics.
Hands-on Exercise – Use web UI to do statistical operations
Report Enhancement, Global Statements, User-defined Formats, PROC SORT, ODS Destinations, ODS Listing, PROC FREQ, PROC Means, PROC UNIVARIATE, PROC REPORT, PROC PRINT
Hands-on Exercise – Use PROC SORT to sort the results, List ODS, Find mean using PROC Means, print using PROC PRINT
Introduction to Splunk, Splunk developer roles and responsibilities
Writing Splunk query for search, Autocomplete to build a search, time range, refine search, work with events, identify the contents of search, control a search job
Hands-on Exercise – Write a basic search query
Understand Fields, Use Fields in Search, Use Fields Sidebar, regex field extraction using Field Extractor (FX), delimiter field Extraction using FX
Hands-on Exercise – Use Fields in Search, Use Fields Sidebar, Use Field Extractor (FX), delimit field Extraction using FX
Writing Splunk query for search, sharing, saving, scheduling and exporting search results
Hands-on Exercise – Schedule a search, Save a search result, Share and export a search result
Creation of alert, explaining alerts and viewing fired alerts
Hands-on Exercise – Create an alert, view fired alerts
Describe and Configure Scheduled Reports
Introduction to Tags in Splunk, deploying Tags for Splunk search, understanding event types and utility, generating and implementing event types in Search
Hands-on Exercise – Deploy tags for Splunk search, generate and implement event types in Search
Define Macros, Arguments and Variables in a Macro
Hands-on Exercise – Define a Macro with arguments and use variables in it
GET, POST, and Search workflow actions
Hands-on Exercise – Create GET, POST, and Search workflow
Search Command study, search practices in general, search pipeline, specify indexes in search, syntax highlighting, autocomplete, search commands like tables, fields, sort, multikv, rename, rex & erex
Hands-on Exercise – Create search pipeline, specify indexes in search, highlight syntax, use autocomplete feature, use search commands like tables, fields, sort, multikv, rename, rex & erex
Using Top, Rare, Stats Commands
Hands-on Exercise – Use Top, Rare, Stats Commands
Using following commands and their functions: addcoltotals, addtotals,top, rare,stats
Hands-on Exercise – Create reports using following commands and their functions: addcoltotals, addtotals
iplocation, geostats, geom, addtotals commands
Hands-on Exercise – Track ip using iplocation, get geo data using geostats
Explore the available visualizations, create charts and time charts, omit null values and format results
Hands-on Exercise – Create time charts, omit null values and format results
Calculating and analyzing results, value conversion, roundoff and format values, using eval command, conditional statements, filtering calculated search results
Hands-on Exercise – Calculate and analyze results, perform coversion on a data value, roundoff a numbers, use eval command, write conditional statements,apply filters on calculated search results
Search with Transactions, Report on Transactions, Group events using fields and time, Transaction vs Stats
Hands-on Exercise – Generate Report on Transactions, Group events using fields and time
Learn about data lookups, example, lookup table, defining and configuring automatic lookup, deploying lookup in reports and searches
Hands-on Exercise – Define and configure automatic lookup, deploy lookup in reports and searches
Creating search charts, reports and dashboards, Editing reports and Dashboard, Adding reports to dashboard
Hands-on Exercise – Create search charts, reports and dashboards, Edit reports and Dashboard, Add reports to dashboard
Working with raw data for data extraction, transformation, parsing and preview
Hands-on Exercise – Extract useful data from raw data, perform transformation, parse different values and preview
Describe Pivot, Relationship between data model and pivot, select a data model object, create a pivot report, instant pivot from a search, add a pivot report to dashboard
Hands-on Exercise – Select a data model object, create a pivot report, create instant pivot from a search, add a pivot report to dashboard
What is Splunk CIM, Using the CIM Add-On to normalize data
Hands-on Exercise – Use the CIM Add-On to normalize data
Introduction to the Splunk 3 tier architecture, understanding the Server settings, control, preferences and licensing, the most important components of Splunk tool, the hardware requirements, conditions for installation of Splunk.
Understanding how to install and configure Splunk, index creation, input configuration in standalone server, the search preferences, installing Splunk in the Linux environment.
Installing Splunk in the Linux environment, the various prerequisites, configuration of Splunk in Linux.
Introduction to the Splunk Distributed Management Console, index clustering, forwarder management and distributed search in Splunk environment, providing the right authentication to users, access control.
Introducing the Splunk app, managing the Splunk app, the various add-ons in Splunk app, deleting and installing apps from SplunkBase, deploying the various app permissions, deploying the Splunk app, apps on forwarder.
Understanding the index time configuration file and search time configuration file.
Learning about the index time and search time configuration files in Splunk, installing the forwarders, configuring the output and inputs.conf, managing the Universal Forwarders.
Deploying the Splunk tool, the Splunk deployment Server, setting up the Splunk deployment environment, deploying the clients grouping in Splunk.
Understanding the Splunk Indexes, the default Splunk Indexes, segregating the Splunk Indexes, learning about Splunk Buckets and Bucket Classification, estimating index storage, creating new index.
Understanding the concept of role inheritance, Splunk authentications, native authentications, LDAP authentications.
Splunk installation, configuration, data inputs, app management, Splunk important concepts, parsing machine-generated data, search indexer and forwarder.
Introduction to Splunk Configuration Files, Universal Forwarder, Forwarder Management, data management, troubleshooting and monitoring.
Converting machine-generated data into operational intelligence, setting up Dashboard, Reports and Charts, integrating Search Head Clustering & Indexer Clustering.
Understanding the input methods, deploying scripted, Windows, network and agentless input types, fine-tuning it all.
Splunk User authentication and Job Role assignment, learning to manage, monitor and optimize Splunk Indexes.
Understanding parsing of machine-generated data, manipulation of raw data, previewing and parsing, data field extraction.
Distributed search concepts, improving search performance, large scale deployment and overcoming execution hurdles, working with Splunk Distributed Management Console for monitoring the entire operation.
The domain of machine learning and its implications to the artificial intelligence sector, the advantages of machine learning over other conventional methodologies.
Introduction to Deep Learning within machine learning, how it differs from all others methods of machine learning, training the system with training data, supervised and unsupervised learning, classification and regression supervised learning, clustering and association unsupervised learning, the algorithms used in these types of learning.
Introduction to TensorFlowopen source software library for designing, building and training Deep Learning models, Python Library behind TensorFlow, Tensor Processing Unit (TPU) programmable AI accelerator by Google.
Mapping the human mind with Deep Neural Networks, the various building block of Artificial Neural Networks, the architecture of DNN, its building blocks, the concept of reinforcement learning in DNN, the various parameters, layers, activation functions and optimization algorithms in DNN.
Introduction to GPUs and how they differ from CPUs, the importance of GPUs in training Deep Learning Networks, the forward pass and backward pass training technique, the GPU constituent with simpler core and concurrent hardware.
What is Python Language and features, Why Python and why it is different from other languages, Installation of Python, Anaconda Python distribution for Windows, Mac, Linux. Run a sample python script, working with Pyhton IDE’s. Running basic python commands – Data types, Variables,Keywords,etc
Hands-on Exercise – Install Anaconda Python distribution for your OS (Windows/Linux/Mac)
Indentation(Tabs and Spaces) and Code Comments (Pound # character); Variables and Names; Built-in Data Types in Python – Numeric: int, float, complex – Containers: list, tuple, set, dict – Text Sequence: Str (String) – Others: Modules, Classes, Instances, Exceptions, Null Object, Ellipsis Object – Constants: False, True, None, NotImplemented, Ellipsis, __debug__; Basic Operators: Arithmetic, Comparison, Assignment, Logical, Bitwise, Membership, Indentity; Slicing and The Slice Operator [n:m]; Control and Loop Statements: if, for, while, range(), break, continue, else;
Hands-on Exercise – Write your first Python program Write a Python Function (with and without parameters) Use Lambda expression Write a class, create a member function and a variable, Create an object Write a for loop to print all odd numbers
Classes – classes and objects, access modifiers, instance and class members OOPS paradigm – Inheritance, Polymorphism and Encapsulation in Python. Functions: Parameters and Return Types; Lambda Expressions, Making connection with Database for pulling data.
Open a File, Read from a File, Write into a File; Resetting the current position in a File; The Pickle (Serialize and Deserialize Python Objects); The Shelve (Overcome the limitation of Pickle); What is an Exception; Raising an Exception; Catching an Exception;
Hands-on Exercise – Open a text file and read the contents, Write a new line in the opened file, Use pickle to serialize a python object, deserialize the object, Raise an exception and catch it
Arrays and Matrices, ND-array object, Array indexing, Datatypes, Array math Broadcasting, Std Deviation, Conditional Prob, Covariance and Correlation.
Hands-on Exercise – Import numpy module, Create an array using ND-array, Calculate std deviation on an array of numbers, Calculate correlation between two variables
Builds on top of NumPy, SciPy and its characteristics, subpackages: cluster, fftpack, linalg, signal, integrate, optimize, stats; Bayes Theorem using SciPy
Hands-on Exercise – Import SciPy, Apply Bayes theorem using SciPy on the given dataset
Plotting Grapsh and Charts (Line, Pie, Bar, Scatter, Histogram, 3-D); Subplots; The Matplotlib API
Hands-on Exercise – Plot Line, Pie, Scatter, Histogram and other charts using Matplotlib
Dataframes, NumPy array to a dataframe; Import Data (csv, json, excel, sql database); Data operations: View, Select, Filter, Sort, Groupby, Cleaning, Join/Combine, Handling Missing Values; Introduction to Machine Learning(ML); Linear Regression; Time Series
Hands-on Exercise – Import Pandas, Use it to import data from a json file,,Select records by a group and apply filter on top of that, View the records, Perform Linear Regression analysis, Create a Time Series
Introduction to Natural Language Processing (NLP); NLP approach for Text Data; Environment Setup (Jupyter Notebook); Sentence Analysis; ML Algorithms in Scikit-Learn; What is Bag of Words Model; Feature Extraction from Text; Model Training; Search Grid; Multiple Parameters; Build a Pipeline
Hands-on Exercise – Setup Jupyter Notebook environment, Load a dataset in Jupyter, Use algorithm in Scikit-Learn package to perform ML techniques, Train a model Create a search grid
What is Web Scraping; Web Scraping Libraries (Beautifulsoup, Scrapy); Installation of Beautifulsoup; Install lxml Python Parser; Making a Soup Object using an input html; Navigating Py Objects in the Soup Tree; Searching the Tree; Output Print; Parsing Full or Partial
Hands-on Exercise – Install Beautifulsoup and lxml Python parser, Make a Soup object using an input html file, Navigate Py objects in the soup tree, Search tree, Print output
Understanding Hadoop and its various components; Hadoop ecosystem and Hadoop common; HDFS and MapReduce Architecture; Python scripting for MapReduce Jobs on Hadoop framework
Hands-on Exercise – Write a basic MapReduce Job in Python and connect with Hadoop Framework to perform the task
What is Spark,understanding RDDs, Spark Libs, writing Spark code using python,Spark Machine Libraries Mlib, Regression, Classification and Clustering using Spark MLlib
Hands-on Exercise – Implement sandbox, Run a python code in sandbox, Work with HDFS file system from sandbox
What is data visualization, Comparision and benefits against reading raw numbers, Real usage examples from various business domains, Some quick powerful examples using Tableau without going into the technical details of Tableau
Installation of Tableau Desktop, Architecture of Tableau, Interface of Tableau (Layout, Toolbars, Data Pane, Analytics Pane etc), How to start with Tableau, Ways to share and exporting the work done in Tableau
Hands-on Exercise – Play with the tableau desktop, interface to learn its user interface, Share an existing work, Export an existing work
Connection to Excels, PDFs and Cubes, Managing Metadata and Extracts, Data Preparation and dealing with NULL values, Data Joins (Inner, Left, Right, Outer) and Union, Cross Database joining, Data Blending
Hands-on Exercise – Connect to an excel sheet and import data, Use metadata and extracts, Handle NULL values, Clean up the data before the actual use, Perform various join techniques, Perform data blending from more than one sources
Marks, Highlighting, Sort and Group, Working with Sets (Creation of sets, Editing sets, IN/OUT, Sets in Hierarchies)
Hands-on Exercise – Create and edit sets using Marks, Highlight desired items, Make groups, Applying sorting on result, Make hierachies in the created set
Filters (Addition and Removal), Filtering continuous dates, dimensions, measures, Interactive Filters
Hands-on Exercise – Add Filter on data set by date/dimensions/measures, Use interactive filter to views, Remove some filters to see the result
Formatting Data (Labels, Annotations, Tooltips, Edit axes), Formatting Pane (Menu, Settings, Font, Alignment, Copy-Paste), Trend and Reference Lines, Forecasting, k-means Cluster Analysis in Tableau
Hands-on Exercise – Apply labels, annotations, tooltips to graphs, Edit the attributes of axes, Set a reference line, Do k-means cluster analysis on a dataset
Coordinate points, Plotting Longitude and Latitude, Editing Unrecognized Locations, Custom Geocoding, Polygon Maps, WMS: Web Mapping Services, Background Image (Add Image, Plot Points on Image, Generate coordinates from Image)
Hands-on Exercise – Plot latitude and longitude on geo map, Edit locations on the map, Create custom geocoding, Use images of a map and plot points on it, find coordinates in the image, Create a polygon map, Use WMS
Calculation Syntax and Functions in Tableau, Types of Calculations (Table, String, Logic, Date, Number, Aggregate), LOD Expressions (concept and syntax), Aggregation and Replication with LOD Expressions, Nested LOD Expressions
Create Parameters, Parameters in Calculations, Using Parameters with Filters, Column Selection Parameters, Chart Selection Parameters
Hands-on Exercise – Create new parameters to apply on a filter, Pass parameters to filters to selet columns, Pass parameters to filters to select charts
Dual Axes Graphs, Histogram (Single and Dual Axes), Box Plot, Pareto Chart, Motion Chart, Funnel Chart, Waterfall Chart, Tree Map, Heat Map, Market Basket analysis
Hands-on Exercise – Plot a histogram, heat map, tree map, funnel chart and others using the same data set, Do market basket analysis on a given dataset
Build and Format a Dashboard (Size, Views, Objects, Legends and Filters), Best Practices for Creative and Interactive Dashboards using Actions, Create Stories (Intro of Story Points, Creating and Updating Story Points, Adding Visuals in Stories, Annotations with Description)
Hands-on Exercise – Create a dashboard view, Include objects, legends and filters, Make the dashboard interactive, Create and edit a story with visual effects, annotation, description
Introduction to R Language, Applications and Use Cases of R, Deploying R on Tableau Platform, Learning R functions in Tableau, Integration with Hadoop
Hands-on Exercise – Deploy R on tableau, Create a line graph using R interface, Connect tableau with Hadoop and extract data
Getting started with HBase, Core concepts of HBase, Understanding HBase with an Example
Why HBase?, Where to use HBase?, What is NoSQL?
HDFS vs.HBase, HBase Use Cases, Data Modeling HBase
HBase Architecture, Main components of HBase Cluster
HBase Shell, HBase API, Primary Operations, Advanced Operations
Create a Table and Insert Data into it, Integration of Hive with HBase, Load Utility
Putting Folder to VM, File loading with both load Utility
Introduction to Cassandra, its strengths and deployment areas
Significance of NoSQL, RDBMS Replication, Key Challenges, types of NoSQL, benefits and drawbacks, salient features of NoSQL database. CAP Theorem, Consistency.
Installation, introduction to Cassandra, key concepts and deployment of non relational database, column-oriented database, Data Model – column, column family,
Token calculation, Configuration overview, Node tool, Validators, Comparators, Expiring column, QA
How Cassandra modelling varies from Relational database modelling, Cassandra modelling steps, introduction to Time Series modelling, comparing Column family Vs. Super Column family, Counter column family, Partitioners, Partitioners strategies, Replication, Gossip protocols, Read operation, Consistency, Comparison
Creation of multi node cluster, node settings, Key and Row cache, System Key space, understanding of Read Operation, Cassandra Commands overview, VNodes, Column family
JSON, Hector client, AVRO, Thrift, JAVA code writing method, Hector tag
Cassandra management, commands of node tool, MapReduce and Cassandra, Secondary index, Datastax Installation
Rules of Cassandra data modelling, increasing data writes, duplication, and reducing data reads, modelling data around queries, creating table for data queries
Understanding the Java application creation methodology, learning key drivers, deploying the IDE for Cassandra applications,cluster connection and data query implementation
Learning about Node Tool Utility, cluster management using Command Line Interface, Cassandra management and monitoring via DataStax Ops Center.
Cassandra client connectivity, connection pool internals, API, important features and concepts of Hector client, Thrift, JAVA code, Summarization.
RDBMS, types of relational databases, challenges of RDBMS, NoSQL database, its significance, how NoSQL suits Big Data needs, Introduction to MongoDB and its advantages, MongoDB installation, JSON features, data types and examples.
Installing MongoDB, basic MongoDB commands and operations, MongoChef (MongoGUI) Installation, MongoDB Data types.
Hands-on Exercise – Install MongoDB, Install MongoChef (MongoGUI)
The need for NoSQL, types of NoSQL databases, OLTP, OLAP, limitations of RDBMS, ACID properties, CAP Theorem, Base property, learning about JSON/BSON, database collection & document, MongoDB uses, MongoDB Write Concern – Acknowledged, Replica Acknowledged, Unacknowledged, Journaled, Fsync.
Hands-on Exercise – Write a JSON document
Understanding CRUD and its functionality, CRUD concepts, MongoDB Query & Syntax, read and write queries and query optimization.
Hands-on Exercise – Use Insert query to Create a data entry, Use find query to Read data, Use update and replace queris to Update, Use delete query operations on a DB file
Concepts of data modeling, difference between MongoDB and RDBMS modeling, Model tree structure, operational strategies, monitoring and backup.
Hands-on Exercise – Write a data model tree structure for a family hierarchy
In this module you will learn MongoDB® Administration activities such as Health Check, Backup, Recovery, database sharding and profiling, Data Import/Export, Performance tuning etc.
Hands-on Exercise – Use shard key and hashed shard keys, Perform backup and recovery of a dummy dataset, Import data from a csv file, Export data to a csv file
Concepts of data aggregation and types, data indexing concepts, properties and variations.
Hands-on Exercise – Do aggregation using pipeline, sort, skip and limit, Create index on data using single key, using multikey
Understanding database security risks, MongoDB security concept and security approach, MongoDB integration with Java and Robomongo.
Hands-on Exercise – MongoDB integration with Java and Robomongo.
Implementing techniques to work with variety of unstructured data like images, videos, log data, and others, understanding GridFS MongoDB file system for storing data.
Hands-on Exercise – Work with variety of unstructured data like images, videos, log data, and others
The Architecture of Couchbase, understanding Couchbase distributed NoSQL database engine, vBuckets for information distribution on Couchbase cluster, user and system requirements, Couchbase downloading and installation.
Couchbase single node deployment for development purpose
Managing the Couchbase environment with the Web Console tool, configuring the Couchbase server and management, working with Couchbase data buckets, default bucket sizing, and administration.
Methods for deploying Couchbase in multi node cluster – all Couchbase Servers on one machine and second with each Couchbase Server on own machine.
The Couchbase Command-line Interface tools for managing and monitoring single node and multi node clusters, Severs and vBuckets, developing Reports for log data collection.
Classification and Recommendation, Clustering in Mahout, Pattern Mining, Understanding machine Learning, Using Model diagram to decide the approach, Data flow, Supervised and Unsupervised learning
Concept of Recommendation, Recommendations by E-commerce site, Comparison between User Recommendations and Item recommendation, Define recommenders and Classifiers, Process of Collaborative Filtering, Explaining Pearson coefficient algorithm, Euclidean distance measure, Implementing a recommender using map reduce
Defining Clustering, User-to-user similarity, Clustering Illustration, Euclidean distance measure, Distance measure vector, Understanding the process of Clustering, Vectorizing documents-Unstructured data
Document clustering, Sequence-to-sparse Utility, K-Mean Clustering
Terminology, Predictor and Target variable, Classifiable DataKey Challenges in Classification algorithm, Vectorizing Continuous data, Classification Examples, Logic Regression and its examples
Clustering, Clustering Process, Transaction Clustering, Different techniques of Vectorization, Distance measure, Clustering algorithm-K-MEAN, Clustering Application-1, Clustering Application-2, Sentiment Analyzer
Pearson Coefficient, Collaborative Filtering Process, Collaborative Filtering, Similarity Algorithms, Pearson Correlation, Euclidean Distance Measure -Frequent Pattern & Association rules, Frequent Pattern Growth
Introduction to the search engine, the Apache Lucene, understanding the inverted index, documents and fields & documents.
Introduction to the various query types available in Lucene and clear understanding of these.
Understanding the prerequisites for using Apache Lucene, learning about the querying process, analyzers, scoring boosting, faceting, grouping, highlighting, the various types of geographical and spatial searches, introduction to Apache Tika.
Demonstration of the Apache Lucene workings.
Understanding the Analyzer, Query Parser in Apache Lucene, Query Object, Stopword.
Understanding the various aspects of Apache Lucene like Scoring, Boosting, Highlighting, Faceting and Grouping
Introduction to Apache Solr, the advantages of Apache Solr over Apache Lucene, the basic system requirements for using Apache Solr, introduction to Cores in Apache Solr.
Introduction to the Apache Solr indexing, index using built-in data import handler and post tool, understanding the Solrj Client and configuration of Solrj Client.
Demonstrating the Book Store use cases with Solr Indexing with practical examples, learning to build Schema, the field, field types, CopyField and Dynamic Field, understanding how to add, explore, update, and delete using Solrj.
The various aspects of Apache Solr search like sorting, pagination, an overview of the request parameters, faceting and highlighting.
Understanding the Request Handlers, defining and mapping to search components, highlighting and faceting, updating managed schemas, request parameters hardwiring, adding fields to default search, the various types of Analyzers, Parsers, Tokenizers.
Grouping of results in Apache Solr, Parse queries functions, fuzzy query in Apache Solr.
The extended features in Apache Solr, learning about Pseudo-fields, Pseudo-Joins, Spell Check, suggestions, Geospatial Search, multi-language search, stop words and synonyms.
Understanding the concept of Multicore in Solr, the creation of Multicore in Solr, the need of Multicore, Joining of data, Replication and Ping Handler.
Understanding the SolrCloud, the concept of Sharding, indexing, and replication in Apache SolrCloud, the working of Apache SolrCloud, distributed requests, reading and writng slide fault tolerance, cluster coordination using Apache ZooKeeper.
Introduction to Java Programming, Defining Java, Need for Java, Platform Independent in Java, Define JRE,JVM, JDK, Important Features and Evolution of Java
Overview of Coding basics, Setting up the required environment, Knowing the available IDEs, Writing a Basic-level Java Program, Define Package, What are Java Comments?, Understanding the concept of Reserved Words, Introduction to Java Statements, What are Blocks in Java, Explain a Class, Different Methods
Understanding what is Apache Kafka, the various components and use cases of Kafka, implementing Kafka on a single node.
Learning about the Kafka terminology, deploying single node Kafka with independent Zookeeper, adding replication in Kafka, working with Partitioning and Brokers, understanding Kafka consumers, the Kafka Writes terminology, various failure handling scenarios in Kafka.
Introduction to multi node cluster setup in Kafka, the various administration commands, leadership balancing and partition rebalancing, graceful shutdown of kafka Brokers and tasks, working with the Partition Reassignment Tool, cluster expending, assigning Custom Partition, removing of a Broker and improving Replication Factor of Partitions.
Understanding the need for Kafka Integration, successfully integrating it with Apache Flume, steps in integration of Flume with Kafka as a Source.
Detailed understanding of the Kafka and Flume Integration, deploying Kafka as a Sink and as a Channel, introduction to PyKafka API and setting up the PyKafka Environment.
Connecting Kafka using PyKafka, writing your own Kafka Producers and Consumers, writing a random JSON Producer, writing a Consumer to read the messages from a topic, writing and working with a File Reader Producer, writing a Consumer to store topics data into a file.
Administering Open Source System (Unix Systems), The Role of an administrator, Open Source Licensing, Acquiring your Linux Distribution
The Installation Process of Linux Red Hat System, Structuring the File system, Selecting the software Packages, Performing Installation
Managing Boot Process, Following Boot Scripts Sequence, Assigning services with chk config, The /etc directory configuration Hierarchy
Booting into Rescue Mode, Reinstalling the Boot Loader, Booting into Single-User Mode
PAM-Pluggable Authentication Modulet, What Do We Mean By Home Directory of file users, Syntax for Chage is, How to Change User Features of A User
The Linux Groupmod Command, The Linux g password Command, Linux PS Command, Procs, Memory, Swap, Proc Command, Pkill Linux Command, Syslog, View Newlog Entries
Manipulating portable tar archives, How to install software with red hat packet manager(RPM), What is RPM-REDHAT package manager
How to rebuild a source RPM(SRPM) package, Static ip configuration, View network settings of an Ethernet Adapter, Assigning IP Address to an Interface, Configuring and testing IPV6 connectivity, Stand alone server, Running services through XINTED
Creating Linux partition, Mounting a file system, How to create a user, How to add a user into a group in Linux
Mounting File System, How to mount Specific file system
How to Configure SAMBA Server, Examine the Steps in Reporting PCI Devices Bug, What is UDEV, To Know How to Add or Remove a Linux Kernel Modules/Drivers
Define LVS, installing LVS, understand meaning of Linux Director, to know about Testing and Debugging, what are Real servers and Ipfail.
Project 1 – Working with MapReduce, Hive, Sqoop
This project is involved with working on the various Hadoop components like MapReduce, Apache Hive and Apache Sqoop. Work with Sqoop to import data from relational database management system like MySQL data into HDFS. Deploy Hive for summarizing data, querying and analysis. Convert SQL queries using HiveQL for deploying MapReduce on the transferred data. You will gain considerable proficiency in Hive, and Sqoop after completion of this project.
Project 2 – Work on MovieLens data for finding top records
Data – MovieLens dataset
In this project you will work exclusively on data collected through MovieLens available rating data sets. The project involves the following important components:
Project 3 – Hadoop YARN Project – End to End PoC
In this project you will work on a live Hadoop YARN project. YARN is part of the Hadoop 2.0 ecosystem that lets Hadoop to decouple from MapReduce and deploy more competitive processing and wider array of applications. You will work on the YARN central Resource Manager. The salient features of this project include:
Project 4 – Partitioning Tables in Hive
This project involves working with Hive table data partitioning. Ensuring the right partitioning helps to read the data, deploy it on the HDFS, and run the MapReduce jobs at a much faster rate. Hive lets you partition data in multiple ways like:
This will give you hands-on experience in partitioning of Hive tables manually, deploying single SQL execution in dynamic partitioning, bucketing of data so as to break it into manageable chunks.
Project 5 – Connecting Pentaho with Hadoop Ecosystem
This project lets you connect Pentaho with the Hadoop ecosystem. Pentaho works well with HDFS, HBase, Oozie and Zookeeper. You will connect the Hadoop cluster with Pentaho data integration, analytics, Pentaho server and report designer. Some of the components of this project include the following:
Project 6 – Multi-node cluster setup
This is a project that gives you opportunity to work on real world Hadoop multi-node cluster setup in a distributed environment. The major components of this project involve:
You will get a complete demonstration of working with various Hadoop cluster master and slave nodes, installing Java as a prerequisite for running Hadoop, installation of Hadoop and mapping the nodes in the Hadoop cluster.
Project 7 – Hadoop Testing using MR
In this project you will gain proficiency in Hadoop MapReduce code testing using MRUnit. You will learn about real world scenarios of deploying MRUnit, Mockito, and PowerMock. Some of the important aspects of this project include:
After completion of this project you will be well-versed in test driven development and will be able to write light-weight test units that work specifically on the Hadoop architecture.
Project 8 – Hadoop Weblog Analytics
Data – Weblogs
This project is involved with making sense of all the web log data in order to derive valuable insights from it. You will work with loading the server data onto a Hadoop cluster using various techniques. The various modules of this project include:
The web log data can include various URLs visited, cookie data, user demographics, location, date and time of web service access, etc. In this project you will transport the data using Apache Flume or Kafka, workflow and data cleansing using MapReduce, Pig or Spark. The insight thus derived can be used for analyzing customer behavior and predict buying patterns.
Project 9 – Hadoop Maintenance
This project is involved with working on the Hadoop cluster for maintaining and managing it. You will work on a number of important tasks like:
Project 1: Movie Recommendation
Topics – This is a project wherein you will gain hands-on experience in deploying Apache Spark for movie recommendation. You will be introduced to the Spark Machine Learning Library, a guide to MLlib algorithms and coding which is a machine learning library. Understand how to deploy collaborative filtering, clustering, regression, and dimensionality reduction in MLlib. Upon completion of the project you will gain experience in working with streaming data, sampling, testing and statistics.
Project 2: Twitter API Integration for tweet Analysis
Topics – With this project you will learn to integrate Twitter API for analyzing tweets. You will write codes on the server side using any of the scripting languages like PHP, Ruby or Python, for requesting the Twitter API and get the results in JSON format. You will then read the results and perform various operations like aggregation, filtering and parsing as per the need to come up with tweet analysis.
Project 3: Data Exploration Using Spark SQL – Wikipedia data set
Topics – This project lets you work with Spark SQL. You will gain experience in working with Spark SQL for combining it with ETL applications, real time analysis of data, performing batch analysis, deploying machine learning, creating visualizations and processing of graphs.
Project 1. Call Log Analysis using Trident
Topics : In this project you will be working on call logs to decipher the data and gather valuable insights using Apache Storm Trident. You will extensively work with data about calls made from one number to another. The aim of this project is to resolve the call log issues with Trident stream processing and low latency distributed querying. You will gain hands-on experience in working with Spouts and Bolts along with various Trident functions, filters, aggregation, joins and grouping.
Project 2. Twitter Data Analysis using Trident
Topics : This is a project that involves working with Twitter data and processing it to extract patterns out of it. The Apache Storm Trident is the perfect framework for real-time analysis of tweets. Working with Trident you will be able to simplify the task of live Twitter feed analysis. In this project you will gain real world experience of working with Spouts, Bolts, and Trident filters, joins, aggregation, functions and grouping.
Project 3. US Presidential Election Result analysis using Trident DRPC Query
Topics : This is a project that lets you work on the US presidential election results and predict who is leading and trailing on a real-time basis. For this you exclusively work with Trident distributed Remote Procedure Call server. After completion of the project you will learn how to access data residing in a remote computer or network and deploy it for real-time processing, analysis and prediction.
Domain – Restaurant Revenue Prediction
Data set – Sales
Project Description – This project involves predicting the sales of a restaurant on the basis of certain objective measurements. This project will give real time industry experience on handling multiple use cases and derive the solution. This project gives insights about feature engineering and selection.
Domain – Data Analytics
Objective – To predict about the class of a flower using its petal’s dimensions
Domain – Finance
Objective – The project aims to find the most impacting factors in preferences of pre-paid model, also identifies which are all the variables highly correlated with impacting factors
Domain – Stock Market
Objective – This project focuses on Machine Learning by creating predictive data model to predict future stock prices
Project 1 – Build analytical solution for patients taking medicines
Domain: Health Care
Objective – This project aims to find out descriptive statistics & subset for specific clinical data problems. It will give them brief insight about BASE SAS procedures and data steps.
Project 2 – Build revenue projections reports
Objective – This project will give you hands-on experience in working with the SAS data analytics and business intelligence tool. You will be working on the data entered in a business enterprise setup, aggregate, retrieve and manage that data. You will learn to create insightful reports and graphs and come up with statistical and mathematical analysis to scientifically predict the revenue projection for a particular future time frame. Upon completion of the project you will be well-versed in the practical aspects of data analytics, predictive modeling, and data mining.
Domain: Finance Market
Objective – The project aims to find the most impacting factors in preferences of pre-paid model, also identifies which are all the variables highly correlated with impacting factors
Objective – k-Means Cluster analysis on Iris dataset to predict about the class of a flower using its petal’s dimensions
Project 1 – Understanding Cold Start Problem in Data Science
Topics: This project involves understanding of the cold start problem associated with the recommender systems. You will gain hands-on experience in information filtering, working on systems with zero historical data to refer to, as in the case of launching a new product. You will gain proficiency in working with personalized applications like movies, books, songs, news and such other recommendations. This project includes the following:
Project 2 – Recommendation for Movie, Summary
Topics: This is real world project that gives you hands-on experience in working with a movie recommender system. Depending on what movies are liked by a particular user, you will be in a position to provider data-driven recommendations. This project involves understanding recommender systems, information filtering, predicting ‘rating’, learning about user ‘preference’ and so on. You will exclusively work on data related to user details, movie details and others. The main components of the project include the following:
The Market Basket Analysis (MBA) case study
This case study is associated with the modeling technique of Market Basket Analysis where you will learn about loading of data, various techniques for plotting the items and running the algorithms. It includes finding out what are the items that go hand in hand and hence can be clubbed together. This is used for various real world scenarios like a supermarket shopping cart and so on.
Topics: This project gives you hands-on experience in working with the Splunk tool. You will have the data set of employee details in a text file based on which you will create a dashboard and report. Then you will deploy the various Splunk commands to perform row operations, extract certain data fields, edit the event, add tags, search with tag name for event and then save the tag search. Upon completion of this project you will learn to create a searchable repository using data that is captured, correlated and indexed in real time and ultimately visualize it using dashboard, report and alert.
Type – Field Extraction
Topics : In this project you will learn to extract fields from events using the Splunk field extraction technique. You will gain knowledge in the basics of field extractions, understand the use of field extractor, the field extraction page in Splunk web and field extract configuration in files. Learn about the regular expression and delimiters method of field extraction. Upon completion of the project you will gain expertise in building Splunk dashboard and use the extracted fields data in it to create rich visualizations in an enterprise setup.
Project 1: – Python Web Scraping for Data Science
In this project you will be introduced to the process of web scraping using Python. It involves installation of Beautiful Soup, web scraping libraries, working on common data and page format on the web, learning the important kinds of objects, Navigable String, deploying the searching tree, navigation options, parser, search tree, searching by CSS class, list, function and keyword argument.
Objective – To generate a password using Python code which would be tough to guess
Domain – Finance
Objective – The project aims to find the most impacting factors in preferences of pre-paid model, also identifies which are all the variables highly correlated with impacting factors
Domain – Stock Market
Objective – This project focuses on Machine Learning by creating predictive data model to predict future stock prices
Project 5 : Server logs/Firewall logs
Objective – This includes the process of loading the server logs into the cluster using Flume. It can then be refined using Pig Script, Ambari and HCatlog. You can then visualize it using elastic search and excel.
This project task includes:
Project 1 – Tableau Interactive Dashboard
Data Set – Sales
Objective – This project is involved with working on a Tableau dashboard for sales data. You will gain in-depth experience in working with dashboard objects, learn about visualizing data, highlight action, and dashboard shortcuts. With a few clicks you will be able to combine multiple data sources, add filters and drill down specific information. You will be proficient in creating real time visualizations that are interactive within minutes.
Upon completion of this project you will understand how to create a single point of access for all your sales data, ways of dissecting and analyzing sales from multiple angles, coming up with a sales strategy for improved business revenues.
Domain – Crime Statistics (Public Domain)
Objective –The Project aims to show the types of crimes and their frequency that happen in the District of Columbia. Also to provide the details of the crimes like, the area/location and day of the week the crime has happened
Problem statement :
Police departments are often called upon to put more “feet on the street” to prevent crime and keep order. But with limited resources, it’s impossible to be everywhere at once. This visualization shows where crimes take place by type and which day of the week. This kind of information gives local police more guidance on where they should deploy their crime prevention efforts.
Domain – Healthcare
Objective –Visual Mapping between Vaccination rate and Measles outbreak
Problem statement :
Lab Environment : Tableau Desktop (Better to use latest version which is 10.3 or any version later 10.0 will also have the impact)
Project 1 – Integrate Hive & Java with HBase
Topics : This is project that gives you hands-on experience to connect Hive and Java with HBase. Hive is used for querying using HiveQL that translates SQL-like queries into MapReduce jobs on Hadoop framework. In this project you will do HBase Installation, create Hive for HBase, import the data onto Hive from HBase, use HiveQL for Hive Table data querying and analyzing, and managing the HBase Table. You will also learn to Integrate Java with HBase to run HBase queries using Java applications that you
Type : Deploying the IDE for Cassandra applications
Topics : This project gives you a hands-on experience in installing and working with Apache Cassandra which is a high performance and extremely scalable database for distributed data with no single point of failure. You will deploy the Java Integrated Development Environment for running Cassandra, learn about the key drivers, work with Cassandra applications in a cluster setup and implement data querying techniques.
Java is one of the most popular programming languages for working with MongoDB. This project tells you how to work with the MongoDB Java Driver, and using MongoDB as a Java Developer. Become proficient in creating a table for inserting video using Java programming. Some of the tasks and steps involved are as below–
Topics : This project involves working with the Couchbase command-line interface tools that are used for managing of clusters in a multi-node or single node setup, working with vBuckets in Couchbase server, deploying Reports for log data collection. You will gain hands-on experience in deploying commands like start, stop and report status for log collection. It also includes working with Couchbase-cli, cbcollect_info tool and so on. Upon completion of the project you will be proficient in using Couchbase CLI for managing and monitoring clusters, data replication using XDCR.
Project – Running Function Queries on Apache Solr
Topics : In this project you will learn about the Function Queries and deploy it on the search results got in Apache Solr. You will understand how exactly the Function Queries are used to modify the search results based on certain conditions. It involves working on the index store that has dimensions of a box with arbitrary names, sort all the boxes through search and then modify the search results using Function Queries based on new parameters. Some of the query parsers used are DisMax, Extended DisMax and standard.
Project – Connection and Backups with NFS Server
Topics: How to Connect with NFS server, How to do Backup, How to restore backups, How to use tar and untar
Project – Library Management System
Problem Statement – It creates library management system project which includes following functionalities:
Add book, Add Member, Issue Book, Return Book, Available Book etc.
Type : Multi Broker Kafka Implementation
Topics : In this project you will learn about the Apache Kakfa which is a platform for handling real-time data feeds. You will exclusively work with Kafka brokers, understand partitioning, Kafka consumers, the terminology used for Kafka writes and failure handling in Kafka, understand how to deploy a single node Kafka with independent Zookeeper. Upon completion of the project you will gain considerable experience in working in a real world scenario for processing streaming data within an enterprise infrastructure.
Intellipaat’s Combo program is a structured learning path specially designed by industry experts and ensures that you transform into Big Data Data Science expert. Individual courses at Intellipaat focus on one or two specializations. However, if you have to masters Big Data Data Science then this program is for you
Intellipaat is the pioneer of Big Data Data Science training we provide:
Intellipaat offers the self-paced training and online instructor-led training.
Hadoop developer, Hadoop admin, Hadoop analyst, Hadoop testing, Spark & Scala, Apache Storm, Data Science with R, Data Science with SAS, Splunk, Deep learning are online instructor-led courses
Java, Hbase, Cassandra, Apache kafka, Couchbase, Apache Solr, Linux, Mahout are self-paced courses
If you have any queries you can contact our 24/7 dedicated support to raise a ticket. We provide you email support and solution to your queries. If the query is not resolved by email we can arrange for a one-on-one session with our trainers. The best part is that you can contact Intellipaat even after completion of training to get support and assistance. There is also no limit on the number of queries you can raise when it comes to doubt clearance and query resolution.
We provide you with the opportunity to work on 48 real world projects wherein you can apply your knowledge and skills that you acquired through our training, making you perfectly industry ready
Yes, Intellipaat does provide you with placement assistance. We have tie-ups with 80+ organizations including Ericsson, Cisco, Cognizant, TCS, among others that are looking for Hadoop professionals and we would be happy to assist you with the process of preparing yourself for the interview and the job
Upon successful completion of training you have to take a set of quizzes, complete the projects and upon review and on scoring over 60% marks in the qualifying quiz the official Intellipaat verified certificate is awarded.The Intellipaat Certification is a seal of approval and is highly recognized in 80+ corporations around the world including many in the Fortune 500 list of companies.
Preferably 8 GB RAM (Windows or Mac) with a good internet connection
All the instructors are from the industry with over 18+ years’ experience. They are subjects experts and each of them has gone through rigorous selection process.
This is a comprehensive course that is designed to clear multiple certifications viz.
The entire training course content is in line with respective certification program and helps you clear the requisite certification exam with ease and get the best jobs in the top MNCs.
As part of this training you will be working on real time projects and assignments that have immense implications in the real world industry scenario thus helping you fast track your career effortlessly.
At the end of this training program there will be quizzes that perfectly reflect the type of questions asked in the respective certification exams and helps you score better marks in certification exam.
Intellipaat Course Completion certificate will be awarded on the completion of Project work (on expert review) and upon scoring of at least 60% marks in the quiz. Intellipaat certification is well recognized in top 80+ MNCs like Ericsson, Cisco, Cognizant, Sony, Mu Sigma, Saint-Gobain, Standard Chartered, TCS, Genpact, Hexaware, etc.
"PMI®", "PMP®" and "PMI-ACP®" are registered marks of the Project Management Institute, Inc.
The Open Group®, TOGAF® are trademarks of The Open Group.
The Swirl logoTM is a trade mark of AXELOS Limited.
ITIL® is a registered trade mark of AXELOS Limited.
PRINCE2® is a Registered Trade Mark of AXELOS Limited.
Certified ScrumMaster® (CSM) and Certified Scrum Trainer® (CST) are registered trademarks of SCRUM ALLIANCE®
Professional Scrum Master is a registered trademark of Scrum.org