
Downloading Spark and Getting Started


In this chapter we will download Spark and run it in local mode on a single computer.


Downloading Spark

The first step to using Spark is to download and unpack it. Visit http://spark.apache.org/downloads.html, select the package type of “Pre-built for Hadoop 2.4 and later,” and click “Direct Download.” This will download a compressed TAR file, or tarball, called spark-1.2.0-bin-hadoop2.4.tgz.

To unpack it, open a terminal, change to the directory where you downloaded Spark, and untar the file. This will create a new directory with the same name but without the final .tgz suffix. Use the commands below to accomplish this:

cd ~

tar -xf spark-1.2.0-bin-hadoop2.4.tgz

cd spark-1.2.0-bin-hadoop2.4

ls


In the line containing the tar command, the x flag tells tar we are extracting files, and the f flag specifies the name of the tarball. The ls command lists the contents of the Spark directory.


Introduction to Spark’s Python and Scala Shells

Spark comes with interactive shells that enable ad hoc data analysis. Spark’s shells allow you to interact with data that is distributed on disk or in memory across many machines, and Spark takes care of automatically distributing this processing.


Let’s walk through an example.

The first step is to open up one of Spark’s shells.

  • To open the Python version of the Spark shell, which we also refer to as the PySpark shell, go into your Spark directory and type: bin/pyspark (or bin\pyspark on Windows.)
  • To open the Scala version of the shell, type: bin/spark-shell

Let’s create an RDD in the shell from a local text file and do some very simple ad hoc analysis by following Example 2-1 for Python or Example 2-2 for Scala.


Example 2-1. Python line count

>>> lines = sc.textFile("README.md") # Create an RDD called lines

>>> lines.count() # Count the number of items in this RDD
127

>>> lines.first() # First item in this RDD, i.e. first line of README.md
u'# Apache Spark'


Example 2-2. Scala line count

scala> val lines = sc.textFile("README.md") // Create an RDD called lines
lines: spark.RDD[String] = MappedRDD[...]

scala> lines.count() // Count the number of items in this RDD
res0: Long = 127

scala> lines.first() // First item in this RDD, i.e. first line of README.md
res1: String = # Apache Spark


Introduction to Core Spark Concepts

A Spark application consists of a driver program. The driver program contains your application’s main function, defines distributed datasets on the cluster, and applies operations to them.

Driver programs access Spark through a SparkContext object. In the shell, a SparkContext is automatically created for you as a variable called sc.
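For example, in the PySpark shell you can inspect this variable directly. This is a minimal sketch; the exact object address and master URL depend on your machine and on how the shell was launched:

>>> sc # The SparkContext created automatically by the shell
<pyspark.context.SparkContext object at 0x...>

>>> sc.master # The cluster URL the shell connected to, e.g. u'local[*]'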

Once you have a SparkContext, you can use it to build RDDs. To run these operations, driver programs typically manage a number of nodes called executors. The figure below shows how Spark executes on a cluster.

Figure: Components for distributed execution in Spark


Standalone Applications

Spark can be linked into standalone applications in Java, Scala, or Python. The main difference from using it in the shell is that you need to initialize your own SparkContext. After that, the API is the same.

The process of linking to Spark varies by language. In Java and Scala, you give your application a Maven dependency on the spark-core artifact. As of the time of writing, the latest Spark version is 1.2.0, and the Maven coordinates for that are:

groupId = org.apache.spark

artifactId = spark-core_2.10

version = 1.2.0
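If you build with Maven directly, these coordinates would typically be expressed as a dependency entry similar to the following sketch of a pom.xml fragment (not a complete build file):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
</dependency>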

In Python, you simply write applications as Python scripts, but you must run them using the bin/spark-submit script included in Spark.
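For example, assuming a script named my_script.py (a hypothetical filename), you would launch it from the Spark directory like this:

bin/spark-submit my_script.py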


Initializing Spark in Python

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")

sc = SparkContext(conf = conf)


Initializing Spark in Scala

import org.apache.spark.SparkConf

import org.apache.spark.SparkContext

import org.apache.spark.SparkContext._

val conf = new SparkConf().setMaster("local").setAppName("My App")

val sc = new SparkContext(conf)


Initializing Spark in Java

import org.apache.spark.SparkConf;

import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setMaster("local").setAppName("My App");

JavaSparkContext sc = new JavaSparkContext(conf);


These examples show the minimal way to initialize a SparkContext, where you pass two parameters:

  • A cluster URL, namely local in these examples, which tells Spark how to connect to a cluster. local is a special value that runs Spark on one thread on the local machine, without connecting to a cluster.
  • An application name, namely My App in these examples. This will identify your application on the cluster manager’s UI if you connect to a cluster.
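Putting these pieces together, the following is a minimal sketch of a standalone Python application that initializes its own SparkContext and repeats the line count from Example 2-1. The filename line_count.py is hypothetical; as noted above, the script is run with bin/spark-submit:

# line_count.py - a standalone version of Example 2-1 (hypothetical filename)
# Run from the Spark directory with: bin/spark-submit line_count.py

from pyspark import SparkConf, SparkContext

# Initialize a SparkContext with the local cluster URL and an application name
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)

lines = sc.textFile("README.md")  # Create an RDD called lines
print(lines.count())              # Count the number of items in this RDD
print(lines.first())              # First item in this RDD, i.e. first line of README.md

sc.stop()                         # Shut down Spark when the application finishes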

"0 Responses on Downloading Spark and Getting Started"

Training in Cities

Bangalore, Hyderabad, Chennai, Delhi, Kolkata, UK, London, Chicago, San Francisco, Dallas, Washington, New York, Orlando, Boston

100% Secure Payments. All major credit & debit cards accepted Or Pay by Paypal.


Sales Offer

  • To avail this offer, enroll before 25th October 2016.
  • This offer cannot be combined with any other offer.
  • This offer is valid on selected courses only.
  • Please use coupon codes mentioned below to avail the offer
DW offer

Sign Up or Login to view the Free Downloading Spark and Getting Started.