I have been working with Spark lately, resulting in this second post related to it.

This post assumes that you know the fundamentals of Spark. If not, then maybe you should go here first.

There are several alternatives for setting up a Spark cluster. The most basic one is stand-alone mode, which does not require any external cluster-management tools. Stand-alone mode is generally sufficient for a small cluster of up to about 10 nodes.

Now, before you get bored, let's start with the cluster set-up:

For this example we will assume that we have three nodes with the host-names Node1, Node2 and Node3.
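If these host-names are not already resolvable on your network, add entries for all three nodes to /etc/hosts on every machine. The IP addresses below are placeholders for illustration; substitute your own:

```shell
# /etc/hosts (on each of the three nodes; the IPs shown are examples)
192.168.1.101   Node1
192.168.1.102   Node2
192.168.1.103   Node3
```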

I have used Ubuntu 14.04 LTS for this tutorial. However, the following steps should work with newer versions of Ubuntu and also with other Debian-based Linux distros.

Before we begin with Spark, we need to install some other dependencies.

Installing java:

The following set of commands will install Java 8 on your system. You can skip this step if you already have Java 8 installed. If you have an older version of Java installed, it is recommended to upgrade to Java 8.

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

The WebUpd8 PPA repository also provides a package that sets the corresponding environment variables:

$ sudo apt-get install oracle-java8-set-default

In order to verify that Java 8 is successfully installed, fire the following command:
$ java -version

and the output should be similar to:

java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

You might also want to install Scala, which is generally the preferred language for Spark programming.

You need to download Scala from here and extract the files to some location, for example /usr/local/scala/. Alternatively, you can fire the following set of commands to achieve the same:

$ wget http://www.scala-lang.org/files/archive/scala-2.10.6.tgz
$ sudo mkdir /usr/local/scala
$ sudo tar xvf scala-2.10.6.tgz -C /usr/local/scala/

Now, in order to make Scala reachable from any location on your file system, we need to set some environment variables.

Go to your home folder: $ cd ~
Open the .bashrc file in your favorite editor: $ vi .bashrc
Append the following lines at the end of the file:

export SCALA_HOME=/usr/local/scala/scala-2.10.6
export PATH=$SCALA_HOME/bin:$PATH

Source the modified .bashrc file with this command in order to make the changes effective:
$ source .bashrc

To verify a successful Scala install, fire this command:
$ scala -version

It should return the following output:

Scala code runner version 2.10.6 -- Copyright 2002-2013, LAMP/EPFL

Note: We have used the 2.10 version of Scala because, in order to use the latest stable Scala version (2.11), we would need to manually build Spark from its source, which is quite time consuming. Moreover, Spark does not yet support its JDBC component for Scala 2.11. Reference: http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211

So, in case your requirements are such that you must use Scala 2.11, you can download the Spark source and build it by following the instructions given at this link.
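For reference, the linked build guide describes (at the time of writing) roughly the following steps to produce a Scala 2.11 build. Treat this as a sketch: the exact Maven profiles depend on your Hadoop version, and -Phadoop-2.4 below is only an example.

```shell
# From the root of the Spark source tree:
# switch the build to Scala 2.11, then compile with Maven
$ ./dev/change-version-to-2.11.sh
$ mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
```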

Now we are set to install Spark.

Download Spark from this page: http://spark.apache.org/downloads.html

From the package-type drop-down, select the pre-built package matching your Hadoop version. Also, as mentioned in the note above, you always have the option to download the source from the same link and build Spark tailored to your needs.

Once the download is complete, extract the package to an appropriate location.
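At this point you can already run Spark locally, but to actually form the three-node stand-alone cluster, a little extra configuration is needed. The sketch below assumes Spark is extracted to the same path on all three nodes (here /usr/local/spark, an example location), that Node1 acts as the master, and that "user" is a placeholder for your account name:

```shell
# On Node1: enable passwordless SSH to the workers (skip if already set up)
$ ssh-keygen -t rsa
$ ssh-copy-id user@Node2
$ ssh-copy-id user@Node3

# On every node: tell Spark which host is the master (conf/spark-env.sh)
$ echo "export SPARK_MASTER_IP=Node1" >> /usr/local/spark/conf/spark-env.sh

# On Node1: list the worker nodes in conf/slaves, one host-name per line
$ printf "Node2\nNode3\n" > /usr/local/spark/conf/slaves

# On Node1: start the master and all workers
$ /usr/local/spark/sbin/start-all.sh
```

Once started, the master's web UI (by default on port 8080 of Node1) should list both workers.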

We are all set. Let's test Spark with an example script. Go to the bin directory under the extracted package and fire this command from the terminal:

$ ./run-example SparkPi 10

You should get output similar to the following:

Pi is roughly 3.14634

Bingo!!! The next step to get started with Spark is here: http://spark.apache.org/docs/1.1.1/quick-start.html

Queries, doubts, suggestions? Comments are welcome... :D
