Downloading Spark and Getting Started
The main purpose of this chapter is to download and run Spark in local mode on a single computer.
The first step to using Spark is to download and unpack it. Visit http://spark.apache.org/downloads.html, select the package type of “Pre-built for Hadoop 2.4 and later,” and click “Direct Download.” This will download a compressed TAR file, or tarball, called spark-1.2.0-bin-hadoop2.4.tgz.
To unpack it, open a terminal, change to the directory where you downloaded Spark, and untar the file. This will create a new directory with the same name but without the final .tgz suffix. Use the following commands to accomplish this:
tar -xf spark-1.2.0-bin-hadoop2.4.tgz
cd spark-1.2.0-bin-hadoop2.4
ls
In the line containing the tar command, the x flag tells tar we are extracting files, and the f flag specifies the name of the tarball. The ls command lists the contents of the Spark directory.
Introduction to Spark’s Python and Scala Shells
Spark comes with interactive shells that enable ad hoc data analysis. Spark’s shells allow you to interact with data that is distributed on disk or in memory across many machines, and Spark takes care of automatically distributing this processing.
Let’s walk through an example.
The first step is to open up one of Spark’s shells.
- To open the Python version of the Spark shell, which we also refer to as the PySpark Shell, go into your Spark directory and type: bin/pyspark (Or bin\pyspark in Windows.)
- To open the Scala version of the shell, type: bin/spark-shell
Let’s create an RDD in the shell from a local text file and do some very simple ad hoc analysis by following Example 2-1 for Python or Example 2-2 for Scala.
Example 2-1. Python line count
>>> lines = sc.textFile("README.md") # Create an RDD called lines
>>> lines.count() # Count the number of items in this RDD
127
>>> lines.first() # First item in this RDD, i.e. first line of README.md
u'# Apache Spark'
Example 2-2. Scala line count
scala> val lines = sc.textFile("README.md") // Create an RDD called lines
lines: spark.RDD[String] = MappedRDD[...]
scala> lines.count() // Count the number of items in this RDD
res0: Long = 127
scala> lines.first() // First item in this RDD, i.e. first line of README.md
res1: String = # Apache Spark
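To take the ad hoc analysis one small step further, here is a hedged sketch of a follow-up command in the same PySpark session. It assumes the lines RDD from Example 2-1 is still defined and uses filter(), a standard RDD operation, to keep only the lines that mention Python; the exact output depends on the contents of your README.md.
>>> python_lines = lines.filter(lambda line: "Python" in line)  # Keep only lines containing "Python"
>>> python_lines.count()   # How many such lines are there?
>>> python_lines.first()   # Look at the first matching line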
Introduction to Core Spark Concepts
Every Spark application consists of a driver program. The driver program contains your application’s main function, defines distributed datasets on the cluster, and applies operations to them.
Driver programs access Spark through a SparkContext object. In the shell, a SparkContext is automatically created for you as the variable called sc.
Once you have a SparkContext, you can use it to build RDDs. To run operations on them, driver programs typically manage a number of nodes called executors. The figure below shows how Spark executes on a cluster.
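As a rough illustration of this division of labor, here is a hedged sketch you could type into the PySpark shell; the variable names are made up for this example. The driver defines a distributed dataset and the operations on it, while the actual computation is scheduled on executors (in local mode, everything runs in a single process).
>>> nums = sc.parallelize([1, 2, 3, 4])    # driver defines a distributed dataset
>>> squares = nums.map(lambda x: x * x)    # driver describes an operation on it
>>> squares.collect()                      # the work runs on executors; results return to the driver
[1, 4, 9, 16]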
Spark can be linked into standalone applications in either Java, Scala, or Python. The main difference from using it in the shell is that you need to initialize your own SparkContext. After that, the API is the same.
The process of linking to Spark varies by language. In Java and Scala, you give your application a Maven dependency on the spark-core artifact. As of the time of writing, the latest Spark version is 1.2.0, and the Maven coordinates for that are:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.2.0
In Python, you simply write applications as Python scripts, but you must run them using the bin/spark-submit script included in Spark.
Initializing Spark in Python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)
Initializing Spark in Scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = new SparkContext(conf)
Initializing Spark in Java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setMaster("local").setAppName("My App");
JavaSparkContext sc = new JavaSparkContext(conf);
These examples show the minimal way to initialize a SparkContext, where you pass two parameters:
- A cluster URL, namely local in these examples, which tells Spark how to connect to a cluster. local is a special value that runs Spark on one thread on the local machine, without connecting to a cluster.
- An application name, namely My App in these examples. This will identify your application on the cluster manager’s UI if you connect to a cluster.
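Putting the Python pieces together, here is a hedged sketch of a minimal standalone application; the file name my_script.py is only a placeholder for this example. It initializes a SparkContext as shown above, repeats the line count from Example 2-1, and then shuts Spark down with sc.stop().
# my_script.py -- a hypothetical file name for this sketch.
# Run it with the spark-submit script included in Spark:
#   bin/spark-submit my_script.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)

lines = sc.textFile("README.md")   # Create an RDD from a local text file
print(lines.count())               # Count the number of lines, as in Example 2-1

sc.stop()                          # Shut down Spark when the application is done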