Databricks is a software company founded by the creators of Apache Spark. The company has also created well-known open-source projects such as Delta Lake, MLflow, and Koalas, which span data engineering, data science, and machine learning. Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks.
Databricks in Azure
Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform. Azure Databricks offers three environments:
- Databricks SQL
- Databricks data science and engineering
- Databricks machine learning
Databricks SQL provides a user-friendly platform. It helps analysts who work with SQL to run queries on Azure Data Lake, create multiple visualizations, and build and share dashboards.
Databricks Data Science and Engineering
Databricks data science and engineering provides an interactive working environment for data engineers, data scientists, and machine learning engineers. The two ways to send data through the big data pipeline are:
- Ingest into Azure through Azure Data Factory in batches
- Stream real-time by using Apache Kafka, Event Hubs, or IoT Hub
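As a rough sketch of the streaming path, the source options for a Kafka-compatible endpoint (such as Event Hubs with its Kafka surface) might look like the following; the broker address and topic name are placeholder assumptions, not real endpoints:

```python
# Hypothetical Kafka/Event Hubs source options for Spark Structured Streaming.
# The server and topic names below are illustrative assumptions.
kafka_options = {
    "kafka.bootstrap.servers": "my-eventhubs.servicebus.windows.net:9093",
    "subscribe": "device-telemetry",   # topic carrying the real-time stream
    "startingOffsets": "latest",       # begin from new events only
}

# In a Databricks notebook, these options would feed the pipeline roughly as:
# df = spark.readStream.format("kafka").options(**kafka_options).load()
```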
Databricks Machine Learning
Databricks machine learning is a complete machine learning environment. It provides managed services for experiment tracking, model training, feature development and management, and model serving.
Pros and Cons of Azure Databricks
Moving ahead in this blog, we will discuss the pros and cons of Azure Databricks and understand how good it really is.
Pros:
- It can process large amounts of data, and since it is part of Azure, the data is cloud-native.
- The clusters are easy to set up and configure.
- It has an Azure Synapse Analytics connector, as well as the ability to connect to Azure DB.
- It is integrated with Active Directory.
- It supports multiple languages. Scala is the main language, but it also works well with Python, SQL, and R.
Cons:
- Git integration is limited; version control is handled mainly through Databricks Repos rather than arbitrary external versioning tools.
- It currently supports only HDInsight, not Azure Batch or AZTK.
Databricks SQL allows you to run quick ad hoc SQL queries on the data lake. Integrating with Azure Active Directory enables you to run complete Azure-based solutions by using Databricks SQL. By integrating with Azure databases, Databricks SQL can store data in Synapse Analytics, Azure Cosmos DB, Data Lake Store, and Blob Storage. By integrating with Power BI, Databricks SQL allows users to discover and share insights more easily. BI tools, such as Tableau Software, can also be used for accessing Databricks.
Databricks SQL objects can be automated through the REST API. There are three such objects:
- Visualization: A graphical presentation of the result of running a query
- Dashboard: A presentation of query visualizations and commentary
- Alert: A notification that a field returned by a query has reached a threshold
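The alert concept above boils down to a simple threshold check on a field returned by a query; this minimal sketch (with an illustrative field value and threshold) shows the idea:

```python
# Minimal sketch of the alert concept: notify when a query result field
# meets or exceeds a threshold. The values used are illustrative assumptions.
def should_alert(value: float, threshold: float) -> bool:
    """Return True when the monitored field reaches the threshold."""
    return value >= threshold

# e.g. a scheduled query returns today's error count; alert at 100 or more
todays_errors = 105
fire_alert = should_alert(todays_errors, threshold=100)
```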
The following terms will help you run SQL queries in Databricks SQL:
- Query: A valid SQL statement
- SQL endpoint: A resource where SQL queries are executed
- Query history: A list of previously executed queries and their characteristics
- User and group: A user is an individual who has access to the system. A set of users is known as a group.
- Personal access token: An opaque string used to authenticate to the REST API.
- Access control list: A set of permissions attached to a principal that requires access to an object. The ACL specifies the object and the actions allowed on it.
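As a sketch of how a personal access token is used to authenticate REST API calls, the snippet below builds an authenticated request with the standard library; the workspace URL and token value are placeholder assumptions to replace with your own:

```python
import urllib.request

# Placeholder workspace URL and personal access token (assumed values)
WORKSPACE = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXX"

def build_request(path: str) -> urllib.request.Request:
    """Build a REST API request authenticated with a personal access token."""
    return urllib.request.Request(
        WORKSPACE + path,
        headers={"Authorization": f"Bearer {TOKEN}"},
    )

# Example: a request against the clusters endpoint of REST API 2.0
req = build_request("/api/2.0/clusters/list")
# urllib.request.urlopen(req) would send it against a real workspace
```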
Databricks Data Science & Engineering
Databricks Data Science & Engineering is sometimes also called the Workspace. It is an analytics platform based on Apache Spark.
Databricks Data Science & Engineering comprises complete open-source Apache Spark cluster technologies and capabilities. Spark in Databricks Data Science & Engineering includes the following components:
- Spark SQL and DataFrames: This is the Spark module for working with structured data. A DataFrame is a distributed collection of data that is organized into named columns. It is very similar to a table in a relational database or a data frame in R or Python.
- Streaming: Real-time data processing and analysis for analytical and interactive applications. It integrates with HDFS, Flume, and Kafka.
- MLlib: Short for Machine Learning Library, it consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives.
- GraphX: Graphs and graph computation for a broad scope of use cases from cognitive analytics to data exploration.
- Spark Core API: This has the support for R, SQL, Python, Scala, and Java.
Workspace is the place for accessing all Azure Databricks assets. It organizes objects into folders and provides access to data objects and computational resources.
The workspace contains:
- Dashboard: It provides access to visualizations.
- Library: A package available to a notebook or job running on the cluster. We can also add our own libraries.
- Repo: A folder whose contents are co-versioned together by syncing them to a local Git repository.
- Experiment: A collection of MLflow runs for training an ML model.
Azure Databricks supports the UI, REST API, and command-line interface (CLI).
- UI: It provides a user-friendly interface to workspace folders and their resources.
- Rest API: There are two versions, REST API 2.0 and REST API 1.2. REST API 2.0 has features of REST API 1.2 along with some additional features. So, REST API 2.0 is the preferred version.
- CLI: It is an open-source project that is available on GitHub. CLI is built on REST API 2.0.
- Databricks File System (DBFS): It is an abstraction layer over the Blob store. It contains directories that can contain files or more directories.
- Database: It is a collection of information that can be managed and updated.
- Table: Tables can be queried with Apache Spark SQL and Apache Spark APIs.
- Metastore: It stores information about various tables and partitions in the data warehouse.
To run computations in Azure Databricks, we need to know about the following:
- Cluster: It is a set of computation resources and configurations on which we can run notebooks and jobs. These are of two types:
- All-purpose: We create an all-purpose cluster by using UI, CLI, or REST API. We can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative, interactive analysis.
- Job: The Azure Databricks job scheduler creates a job cluster when we run a job on a new job cluster and terminates the cluster when the job is complete. We cannot restart a job cluster.
- Pool: It has a set of idle, ready-to-use instances that reduce cluster start and auto-scaling times. If the pool does not have enough resources, it expands itself. When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a different cluster.
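To illustrate the job-cluster idea, a job definition can ask the scheduler for a new cluster that exists only for the duration of the run. The sketch below builds such a payload in the style of the Databricks Jobs API; the name, node type, runtime version, and notebook path are illustrative assumptions:

```python
import json

# Sketch of a job definition requesting a new job cluster (field values assumed)
job_spec = {
    "name": "nightly-etl",
    "new_cluster": {                      # job cluster: created for the run,
        "spark_version": "13.3.x-scala2.12",  # terminated when the job completes
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Repos/etl/nightly"},
}

payload = json.dumps(job_spec)  # body that would be POSTed to the Jobs API
```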
Azure Databricks offers several runtimes, which are the core components that run on its clusters:
- Databricks Runtime includes Apache Spark and adds numerous other features that improve big data analytics.
- Databricks Runtime for machine learning is built on Databricks runtime and provides a ready environment for machine learning and data science.
- Databricks Runtime for genomics is a version of Databricks runtime that is optimized for working with genomic and biomedical data.
- Databricks Light is the Azure Databricks packaging of the open-source Apache Spark runtime.
- Workload: There are two types of workloads with respect to the pricing schemes:
- Data engineering workload: This workload works on a job cluster.
- Data analytics workload: This workload runs on an all-purpose cluster.
- Execution context: It is the state of a REPL environment. It supports Python, R, Scala, and SQL.
The following concepts are needed to understand how machine learning models are built:
- Model: This is a mathematical function that represents the relation between inputs and outputs. Machine learning consists of training and inference steps. We can train a model by using an existing data set and using that to predict the outcomes of new data.
- Run: It is a collection of parameters, metrics, and tags that are related to training a machine learning model.
- Experiment: It is the primary unit of organization and access control for runs. Every MLflow run belongs to an experiment.
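The run and experiment concepts above can be sketched as a simple data model. The class and field names below are illustrative, not MLflow's actual API:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Run:
    """A run: the parameters, metrics, and tags from one training attempt."""
    params: Dict[str, str]
    metrics: Dict[str, float]
    tags: Dict[str, str] = field(default_factory=dict)

@dataclass
class Experiment:
    """An experiment groups runs and is the unit of access control for them."""
    name: str
    runs: List[Run] = field(default_factory=list)

# One training attempt recorded under an experiment (values assumed)
exp = Experiment("churn-model")
exp.runs.append(Run(params={"lr": "0.1"}, metrics={"auc": 0.91}))
```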
Authentication and Authorization
- User and group: A user is an individual who has access to the system. A set of users is a group.
- Access control list: Access control list (ACL) is a set of permissions that are attached to a principal, which requires access to an object. ACL specifies the object and the actions allowed on it.
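A minimal sketch of the ACL idea, with hypothetical principals, objects, and permission names:

```python
# Hypothetical ACL: permissions attached to a (principal, object) pair.
# Principal, object, and permission names are illustrative assumptions.
acl = {
    ("alice", "dashboard:sales"): {"CAN_VIEW", "CAN_EDIT"},
    ("analysts", "dashboard:sales"): {"CAN_VIEW"},
}

def is_allowed(principal: str, obj: str, action: str) -> bool:
    """Check whether the ACL grants the action on the object to the principal."""
    return action in acl.get((principal, obj), set())
```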
Databricks Machine Learning
Databricks machine learning is an integrated end-to-end machine learning platform incorporating managed services for experiment tracking, model training, feature development and management, and feature and model serving. Databricks machine learning automates the creation of a cluster that is optimized for machine learning. Databricks Runtime ML clusters include the most popular machine learning libraries such as TensorFlow, PyTorch, Keras, and XGBoost. It also includes libraries, such as Horovod, that are required for distributed training.
With Databricks machine learning, we can:
- Train models either manually or with AutoML
- Track training parameters and models by using experiments with MLflow tracking
- Create feature tables and access them for model training and inference
- Share, manage, and serve models by using Model Registry
We also have access to all of the capabilities of Azure Databricks workspace such as notebooks, clusters, jobs, data, Delta tables, security and admin controls, and many more.
Azure Databricks is an easy, fast, and collaborative Apache Spark-based analytics platform. It accelerates innovation by bringing together data science, data engineering, and business. This takes collaboration a step further and makes the process of data analytics more productive, secure, scalable, and optimized for Azure.