Downloading Spark and Getting Started
The main purpose of this chapter is to download and run Spark in local mode on a single computer.
The first step to using Spark is to download and unpack it. Visit http://spark.apache.org/downloads.html, select the package type of “Pre-built for Hadoop 2.4 and later,” and click “Direct Download.” This will download a compressed TAR file, or tarball, called spark-1.2.0-bin-hadoop2.4.tgz.
To unpack it, open a terminal, change to the directory where you downloaded Spark, and untar the file. This will create a new directory with the same name but without the final .tgz suffix. Use the following commands to accomplish this:
tar -xf spark-1.2.0-bin-hadoop2.4.tgz
cd spark-1.2.0-bin-hadoop2.4
ls
In the line containing the tar command, the x flag tells tar we are extracting files, and the f flag specifies the name of the tarball. The ls command lists the contents of the Spark directory.
Introduction to Spark’s Python and Scala Shells
Spark comes with interactive shells that enable ad hoc data analysis. Spark’s shells allow you to interact with data that is distributed on disk or in memory across many machines, and Spark takes care of automatically distributing this processing.
Let’s walk through an example.
The first step is to open up one of Spark’s shells.
- To open the Python version of the Spark shell, which we also refer to as the PySpark Shell, go into your Spark directory and type: bin/pyspark (Or bin\pyspark in Windows.)
- To open the Scala version of the shell, type: bin/spark-shell
Let’s create an RDD in the shell from a local text file and do some very simple ad hoc analysis by following Example 2-1 for Python or Example 2-2 for Scala.
Example 2-1. Python line count
>>> lines = sc.textFile("README.md") # Create an RDD called lines
>>> lines.count() # Count the number of items in this RDD
127
>>> lines.first() # First item in this RDD, i.e. first line of README.md
u'# Apache Spark'
Example 2-2. Scala line count
scala> val lines = sc.textFile("README.md") // Create an RDD called lines
lines: spark.RDD[String] = MappedRDD[...]
scala> lines.count() // Count the number of items in this RDD
res0: Long = 127
scala> lines.first() // First item in this RDD, i.e. first line of README.md
res1: String = # Apache Spark
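To take the ad hoc analysis one small step further, here is a hedged sketch of a follow-up command in the same PySpark session. It assumes the lines RDD from Example 2-1 is still defined and uses filter(), a standard RDD operation, to keep only the lines that mention Python; the exact output depends on the contents of your README.md.
>>> python_lines = lines.filter(lambda line: "Python" in line)  # Keep only lines containing "Python"
>>> python_lines.count()   # How many such lines are there?
>>> python_lines.first()   # Look at the first matching line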
Introduction to Core Spark Concepts
Every Spark application consists of a driver program. The driver program contains your application’s main function, defines distributed datasets on the cluster, and applies operations to them.
Driver programs access Spark through a SparkContext object. In the shell, a SparkContext is automatically created for you as the variable called sc.
Once you have a SparkContext, you can use it to build RDDs. To run operations on them, driver programs typically manage a number of nodes called executors. The figure below shows how Spark executes on a cluster.
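As a rough illustration of this division of labor, here is a hedged sketch you could type into the PySpark shell; the variable names are made up for this example. The driver defines a distributed dataset and the operations on it, while the actual computation is scheduled on executors (in local mode, everything runs in a single process).
>>> nums = sc.parallelize([1, 2, 3, 4])    # driver defines a distributed dataset
>>> squares = nums.map(lambda x: x * x)    # driver describes an operation on it
>>> squares.collect()                      # the work runs on executors; results return to the driver
[1, 4, 9, 16]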
Spark can be linked into standalone applications in either Java, Scala, or Python. The main difference from using it in the shell is that you need to initialize your own SparkContext. After that, the API is the same.
The process of linking to Spark varies by language. In Java and Scala, you give your application a Maven dependency on the spark-core artifact. As of the time of writing, the latest Spark version is 1.2.0, and the Maven coordinates for that are:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.2.0
In Python, you simply write applications as Python scripts, but you must run them using the bin/spark-submit script included in Spark.
Initializing Spark in Python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)
Initializing Spark in Scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = new SparkContext(conf)
Initializing Spark in Java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setMaster("local").setAppName("My App");
JavaSparkContext sc = new JavaSparkContext(conf);
These examples show the minimal way to initialize a SparkContext, where you pass two parameters:
- A cluster URL, namely local in these examples, which tells Spark how to connect to a cluster. local is a special value that runs Spark on one thread on the local machine, without connecting to a cluster.
- An application name, namely My App in these examples. This will identify your application on the cluster manager’s UI if you connect to a cluster.
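Putting the Python pieces together, here is a hedged sketch of a minimal standalone application; the file name my_script.py is only a placeholder for this example. It initializes a SparkContext as shown above, repeats the line count from Example 2-1, and then shuts Spark down with sc.stop().
# my_script.py -- a hypothetical file name for this sketch.
# Run it with the spark-submit script included in Spark:
#   bin/spark-submit my_script.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)

lines = sc.textFile("README.md")   # Create an RDD from a local text file
print(lines.count())               # Count the number of lines, as in Example 2-1

sc.stop()                          # Shut down Spark when the application is done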