Introducing Scala and deployment of Scala for Big Data applications and Apache Spark analytics, Scala REPL, Lazy Values, Control Structures in Scala, Directed Acyclic Graph (DAG), First Spark Application Using SBT/Eclipse, Spark Web UI, Spark in Hadoop Ecosystem.
The importance of Scala, the concept of the REPL (Read-Evaluate-Print Loop), a deep dive into Scala pattern matching, type inference, higher-order functions, currying, traits, application space and Scala for data analysis
Learning about the Scala Interpreter, static object timer in Scala and testing string equality in Scala, implicit classes in Scala, the concept of currying in Scala and various classes in Scala
Learning about classes, understanding constructor overloading, abstract classes, the type hierarchy in Scala, the concept of object equality and val and var declarations in Scala
Understanding sealed traits and the wildcard, constructor, tuple, variable and constant patterns
Understanding traits in Scala, the advantages of traits, linearization of traits, the Java equivalent, and avoiding boilerplate code
Implementation of traits in Scala and Java and extending multiple traits
Introduction to Scala collections, classification of collections, the difference between Iterator and Iterable in Scala and example of list sequence in Scala
The two types of collections in Scala, Mutable and Immutable collections, understanding lists and arrays in Scala, the list buffer and array buffer, queue in Scala and double-ended queue Deque, Stacks, Sets, Maps and Tuples in Scala
Introduction to Scala packages and imports, the selective imports, the Scala test classes, introduction to JUnit test class, JUnit interface via JUnit 3 suite for Scala test, packaging of Scala applications in Directory Structure and examples of Spark Split and Spark Scala
Introduction to Spark, how Spark overcomes the drawbacks of working with MapReduce, understanding in-memory MapReduce, interactive operations on MapReduce, Spark stack, fine vs. coarse-grained update, Spark Hadoop YARN, HDFS Revision, YARN Revision, the overview of Spark and how it is better than Hadoop, deploying Spark without Hadoop, Spark history server and Cloudera distribution
Spark installation guide, Spark configuration, memory management, executor memory vs. driver memory, working with Spark Shell, the concept of resilient distributed datasets (RDD), learning to do functional programming in Spark and the architecture of Spark
Spark RDD, creating RDDs, RDD partitioning, operations and transformations on RDDs, a deep dive into Spark RDDs, the general RDD operations, the RDD as a read-only partitioned collection of records, using RDDs for faster and more efficient data processing, RDD actions such as collect, count, collectAsMap and saveAsTextFile, and pair RDD functions
Understanding the concept of Key-Value pair in RDDs, learning how Spark makes MapReduce operations faster, various operations of RDD, MapReduce interactive operations, fine and coarse-grained update and Spark stack
Comparing the Spark applications with Spark Shell, creating a Spark application using Scala or Java, deploying a Spark application, Scala built application, creation of mutable list, set and set operations, list, tuple, concatenating list, creating application using SBT, deploying application using Maven, the web user interface of Spark application, a real-world example of Spark and configuring of Spark
Learning about Spark parallel processing, deploying on a cluster, introduction to Spark partitions, file-based partitioning of RDDs, understanding of HDFS and data locality, mastering the technique of parallel operations, comparing repartition and coalesce and RDD actions
The execution flow in Spark, an overview of RDD persistence, Spark terminology, distributed shared memory vs. RDD, RDD limitations, Spark shell arguments, distributed persistence, RDD lineage, and Key-Value pair operations available through implicit conversions, such as countByKey, reduceByKey, sortByKey and aggregateByKey
Introduction to Machine Learning, types of Machine Learning, introduction to MLlib, various ML algorithms supported by MLlib, Linear Regression, Logistic Regression, Decision Tree, Random Forest, K-means clustering techniques, building a Recommendation Engine
Hands-on Exercise: Building a Recommendation Engine
Why Kafka, what is Kafka, Kafka architecture, Kafka workflow, configuring Kafka cluster, basic operations, Kafka monitoring tools, integrating Apache Flume and Apache Kafka
Hands-on Exercise: Configuring Single Node Single Broker Cluster, Configuring Single Node Multi Broker Cluster, Producing and consuming messages, Integrating Apache Flume and Apache Kafka.
Introduction to Spark Streaming, features of Spark Streaming, Spark Streaming workflow, initializing StreamingContext, Discretized Streams (DStreams), Input DStreams and Receivers, transformations on DStreams, Output Operations on DStreams, Windowed Operators and why they are useful, important Windowed Operators, Stateful Operators.
Hands-on Exercise: Twitter Sentiment Analysis, streaming using netcat server, Kafka-Spark Streaming and Spark-Flume Streaming
Introduction to various variables in Spark like shared variables and broadcast variables, learning about accumulators, the common performance issues and troubleshooting the performance problems
Learning about Spark SQL, the context of SQL in Spark for providing structured data processing, JSON support in Spark SQL, working with XML data, Parquet files, creating a Hive context, writing Data Frames to Hive, understanding the Data Frames in Spark, creating Data Frames, manually inferring a schema, working with CSV files, reading JDBC tables, Data Frame to JDBC, user-defined functions in Spark SQL, shared variables and accumulators, learning to query and transform data in Data Frames, how Data Frames provide the benefits of both Spark RDD and Spark SQL and deploying Hive on Spark as the execution engine
Learning about scheduling and partitioning in Spark, hash partitioning, range partitioning, scheduling within and across applications, static partitioning, dynamic sharing, fair scheduling, mapPartitionsWithIndex, zip, groupByKey, Spark master high availability, standby masters with ZooKeeper, Single-node Recovery with Local File System and Higher-Order Functions
Introduction to Python Language, features, the advantages of Python over other programming languages, Python installation, Windows, Mac & Linux distribution for Anaconda Python, deploying Python IDE, basic Python commands, data types, variables, keywords and more.
Hands-on Exercise – Installing Python Anaconda for the Windows, Linux and Mac.
Built-in data types in Python, tabs and spaces indentation, code comments with the pound (#) character, variables and names, numeric types (int, float, complex), containers (list, tuple, set, dict), text sequences (str), exceptions, instances, classes, modules, the Ellipsis object, the Null object (None), Debug, basic operators (comparison, arithmetic, logical, bitwise), slicing and the slice operator, and loop and control statements (while, for, if, else, break, continue).
Hands-on Exercise – Write your first Python program; write a Python function (with and without parameters); use a lambda expression; write a class with a member function and a variable; create an object; write a for loop to print all odd numbers.
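The exercise items above could look roughly like the following minimal sketch. Every name in it (`greet`, `add`, `square`, `Counter`, `odd_numbers`) is illustrative and not taken from the course material.

```python
# Illustrative names only; none of these come from the course material.

def greet():
    """Function without parameters."""
    return "Hello, Python!"


def add(a, b):
    """Function with parameters."""
    return a + b


# Lambda expression: an anonymous one-line function.
square = lambda x: x * x


class Counter:
    """Class with a member variable and a member function."""

    def __init__(self):
        self.count = 0  # member variable

    def increment(self):  # member function
        self.count += 1
        return self.count


def odd_numbers(limit):
    """Collect the odd numbers from 1 up to and including limit."""
    odds = []
    for n in range(1, limit + 1):  # plain for loop
        if n % 2 == 1:
            odds.append(n)
    return odds


if __name__ == "__main__":
    print(greet())
    print(add(2, 3))
    print(square(4))
    counter = Counter()  # create an object
    counter.increment()
    print(counter.count)
    for n in odd_numbers(10):
        print(n)
```

Returning the odd numbers as a list instead of printing inside the function keeps the loop easy to check; the printing happens only in the small driver at the bottom.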
How to write object-oriented programs in Python, classes and objects in Python, the OOP paradigm, important OOP concepts like polymorphism, inheritance and encapsulation, Python functions, return types and parameters, lambda expressions, connecting to a database and pulling the data.
Introduction to arrays and matrices, indexing of arrays, datatypes, broadcasting of array math, standard deviation, conditional probability, correlation and covariance.
Hands-on Exercise – How to import the NumPy module, creating an array using ndarray, calculating the standard deviation of an array of numbers, calculating the correlation between two variables.
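As a rough illustration of the two calculations in this exercise, the sketch below uses only the Python standard library instead of NumPy, so that it runs without extra packages. It computes the population standard deviation (the quantity `numpy.std` returns by default) and the Pearson correlation coefficient. The helper `pearson_correlation` and the sample lists `hours` and `scores` are made up for the example.

```python
import math
import statistics


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sample data, invented for this sketch.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [2.0, 4.0, 6.0, 8.0, 10.0]

spread = statistics.pstdev(hours)      # population standard deviation
r = pearson_correlation(hours, scores)  # perfectly linear data gives r = 1

if __name__ == "__main__":
    print(spread)
    print(r)
```

With NumPy installed, the same values should come from `numpy.std(hours)` and the off-diagonal entry of `numpy.corrcoef(hours, scores)`.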
Introduction to SciPy and its functions, building on top of NumPy, cluster, linalg, signal, optimize, integrate, subpackages, SciPy with Bayes Theorem.
Hands-on Exercise – Importing of SciPy, applying the Bayes theorem on the given dataset.
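The Bayes theorem step of this exercise can be sketched in plain Python as below. The course dataset is not shown here, so the function `bayes_posterior` and the three probabilities passed to it are purely hypothetical, and no SciPy call is used.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(H | E) given P(H), P(E | H) and P(E | not H).

    Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E),
    with P(E) expanded by the law of total probability.
    """
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    if evidence == 0:
        raise ValueError("the evidence has zero probability")
    return likelihood * prior / evidence


# Hypothetical numbers: a condition with 1% prevalence, a test that detects
# it 90% of the time and raises a false alarm 5% of the time.
posterior = bayes_posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)

if __name__ == "__main__":
    print(round(posterior, 3))  # roughly 0.154
```

Even with a fairly accurate test, the low prior keeps the posterior small, which is the usual point of working such an example by hand before applying it to a dataset.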
How to plot graphs and charts with Python, various aspects of line, scatter, bar, histogram and 3D plots, the Matplotlib API, subplots.
Hands-on Exercise – using Matplotlib to create Pie, Scatter, Line and Histogram charts.
Introduction to Python dataframes, importing data from JSON, CSV, Excel, SQL database, NumPy array to dataframe, various data operations like selecting, filtering, sorting, viewing, joining, combining, how to handle missing values, time series analysis, linear regression.
Hands-on Exercise – working on importing data from JSON files, selecting record by a group, applying filter on top, viewing records, analyzing with linear regression, and creation of time series.
What is natural language processing, working with NLP on text data, setting up the environment using Jupyter Notebook, analyzing sentences, the Scikit-Learn machine learning algorithms, the bag-of-words model, extracting features from text, grid search, model training, multiple parameters, building a pipeline.
Hands-on Exercise – setting up the Jupyter notebook environment, loading of a dataset in Jupyter, algorithms in Scikit-Learn package for performing machine learning techniques, training a model to search a grid.
Introduction to web scraping in Python, the various web scraping libraries, the Beautiful Soup and Scrapy Python packages, installing Beautiful Soup, installing the Python parser lxml, creating a soup object from input HTML, searching the tree, full or partial parsing, printing output.
Hands-on Exercise – Installation of Beautiful Soup and the lxml Python parser, making a soup object from an input HTML file, navigating the soup tree using Python objects.
Introduction to Python for Hadoop, the basics of the Hadoop ecosystem, Hadoop common, the architecture of MapReduce and HDFS, deploying Python coding for MapReduce jobs on Hadoop framework.
Hands-on Exercise – How to write a MapReduce job with Python, connecting to the Hadoop framework and performing the tasks.
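A MapReduce job in Python is commonly illustrated with a word count. The sketch below simulates the map, shuffle and reduce phases locally in one process; on a real cluster the mapper and reducer would typically be separate scripts reading standard input and run through Hadoop Streaming. The function names `mapper`, `reducer` and `run_job` are illustrative only.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts of each word.

    Expects pairs sorted by key, which is what the shuffle phase provides.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_job(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    sample = ["spark and hadoop", "Spark streaming"]
    for word, count in run_job(sample).items():
        print(word, count)
```

Sorting the mapper output before reducing mirrors the shuffle-and-sort step that the framework performs between the two phases.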
Introduction to Apache Spark, importance of RDD, the Spark libraries, deploying Spark code with Python, the machine learning library of Spark MLlib, deploying Spark MLlib for classification, clustering and regression.
Hands-on Exercise – How to implement Python in a sandbox, working with the HDFS file system.
Project 1: Movie Recommendation
Topics – In this project you will gain hands-on experience in deploying Apache Spark for movie recommendation. You will be introduced to MLlib, the Spark machine learning library, with a guide to its algorithms and coding. You will understand how to deploy collaborative filtering, clustering, regression and dimensionality reduction in MLlib. Upon completion of the project you will have gained experience in working with streaming data, sampling, testing and statistics.
Project 2: Twitter API Integration for tweet Analysis
Topics – With this project you will learn to integrate the Twitter API for analyzing tweets. You will write server-side code in a scripting language such as PHP, Ruby or Python to query the Twitter API and get the results in JSON format. You will then read the results and perform operations such as aggregation, filtering and parsing as needed to produce the tweet analysis.
Project 3: Data Exploration Using Spark SQL – Wikipedia dataset
Topics – This project lets you work with Spark SQL. You will gain experience in working with Spark SQL for combining it with ETL applications, real time analysis of data, performing batch analysis, deploying machine learning, creating visualizations and processing of graphs.
Project 1: Analyzing the naming pattern using Python
Industry: General
Problem Statement: How to analyze the trends and most popular baby names
Topics: In this Python project you will work with data that the United States Social Security Administration (SSA) has made available on the frequency of baby names from 1880 through 2016. The project requires analyzing the data using different methods. You will visualize the most frequent names, determine the naming trends, and come up with the most popular names for a certain year.
Project 2: Python Web Scraping for Data Science
In this project you will be introduced to the process of web scraping using Python. It involves installing Beautiful Soup and other web scraping libraries, working with common data and page formats on the web, learning the important kinds of objects such as NavigableString, searching the tree, navigation options, the parser, and searching by CSS class, list, function and keyword argument.
Project 3: Predicting customer churn in a Telecom Company
Industry – Telecommunications
Problem Statement – How to increase the profitability of a telecom major by reducing the churn rate
Topics: In this project you will work with a telecom company's customer dataset, which contains details of its telephone subscribers. The columns hold data such as phone number, call minutes during various times of the day, the charges incurred, lifetime account duration, and whether or not the customer has churned by unsubscribing from services. The goal is to predict whether a customer will eventually churn or not.
Project 4: Server logs/Firewall logs
Objective – This project covers loading server logs into the cluster using Flume. The logs can then be refined using Pig scripts, Ambari and HCatalog, and visualized using Elasticsearch and Excel.
This project task includes:
This course is designed to help you clear the Apache Spark certification examination of any reputed company. At the end of the course there will be a quiz and project assignments; once you complete them, you will be awarded the Intellipaat Course Completion Certificate.
You will get lifetime access to high-quality interactive tutorials along with lifetime access to the complete course material. There will be 24/7 access to video tutorials with email support. If you get stuck on an unexpected problem, we will provide online interactive sessions with a trainer to resolve the issue.
We provide 24/7 support by email to resolve issues and clear doubts for self-paced training.
In online instructor-led training, the trainer will be available to help you with your queries regarding the course. If required, the support team can also provide live support by accessing your machine remotely. This ensures that all doubts and problems faced during labs and project work are clarified round the clock.