How to set up a Spark stand-alone cluster on Linux

I have been working with Spark quite a bit lately, which has resulted in this second post about it.

This post assumes that you know the fundamentals of Spark. If not, then maybe you should go here first.

There are several ways to set up a Spark cluster. The most basic one is stand-alone mode, which doesn't require any external cluster management tools. Stand-alone mode is generally sufficient for small clusters of up to about 10 nodes.

Now, before you get bored, let's start with the cluster setup.

For this example we will assume that we have three nodes with the host names Node1, Node2 and Node3.



1) Download and install Spark from here on all nodes. For the installation process you can refer to this post. You can skip this step if you have already installed Spark on your nodes.
Note: Your Spark home directory path must be the same on all the nodes. If it isn't, you can create a symbolic link to match the directory on all nodes, as shown below.
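For example, a symbolic link could look like this (both paths here are hypothetical; substitute your actual install locations):

sudo ln -s /opt/spark-1.6.0 /usr/local/spark    # make /usr/local/spark the common Spark home on this node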

2) If you have not configured your host names yet, add the respective host name to the /etc/hostname file on each node.
For example, on Node1 the file should have the following content:

Node1

Restart your nodes after this change (or see the alternative below).
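Alternatively, on systemd-based distributions you can usually apply the new host name without a reboot:

sudo hostnamectl set-hostname Node1    # run on Node1; use the matching name on each node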

3) Configure the hosts file on each node:
Append the following lines to the /etc/hosts file:

<IP of Node1> Node1
<IP of Node2> Node2
<IP of Node3> Node3

Replace <IP of NodeX> with the IP address of the respective node.

Ping each node from the others using the host names (i.e. Node1, Node2, etc.) to make sure the above configuration is OK.
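For example, a quick check from Node1:

ping -c 2 Node2    # should resolve via the /etc/hosts entry and get replies
ping -c 2 Node3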

4) Set up a password-less SSH connection between the master and slave nodes:

Suppose we want to make Node1 the master node; then we need to set up password-less SSH connections between Node1 and the slave nodes.

You can do this with the following commands.

Generate an SSH key pair:

Run this command on all three nodes:

ssh-keygen
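If you prefer to skip the interactive prompts, ssh-keygen can also be run non-interactively; this sketch assumes an RSA key with an empty passphrase, which is what password-less login needs:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa    # empty passphrase, default key location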

Copy the public key to the remote hosts.

On Node1, which is the master node:
ssh-copy-id -i ~/.ssh/id_rsa.pub Node1

ssh-copy-id -i ~/.ssh/id_rsa.pub Node2


ssh-copy-id -i ~/.ssh/id_rsa.pub Node3


Then fire the following command from all nodes, including Node1:

ssh-copy-id -i ~/.ssh/id_rsa.pub Node1
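To verify, log in from Node1 to each node; it should not ask for a password:

ssh Node2 hostname    # should print "Node2" without a password prompt
ssh Node3 hostname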

5) Configure the Spark config files:

a) spark-env.sh

Go to your <spark-home>/conf directory.

You will find a spark-env.sh.template file there. Rename it to spark-env.sh and add the following line to it:

SPARK_MASTER_IP=<IP of Node1>

Replace <IP of Node1> with the IP address of Node1 (the master).

Do this for all nodes.
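Put together, the whole step looks roughly like this on each node (keeping the same <IP of Node1> placeholder as above):

cd <spark-home>/conf
mv spark-env.sh.template spark-env.sh
echo "SPARK_MASTER_IP=<IP of Node1>" >> spark-env.sh    # substitute the master's actual IP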

b) slaves

In the <spark-home>/conf directory, rename slaves.template to slaves.

Add the following lines to the slaves file:

Node1
Node2
Node3

Again, do this on all the other nodes (or copy the files over, as shown below).
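Since the <spark-home> path is identical on every node (step 1 requires this), one way to avoid editing each node by hand is to copy the finished config files from the master, for example:

scp <spark-home>/conf/spark-env.sh <spark-home>/conf/slaves Node2:<spark-home>/conf/
scp <spark-home>/conf/spark-env.sh <spark-home>/conf/slaves Node3:<spark-home>/conf/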

6) Boot up the cluster:

Go to the <spark-home>/sbin directory and fire the following command:

./start-all.sh

If the above command completes successfully, your cluster should be up, and you can monitor it at the following URL:

http://<IP of master node>:8080 
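As a quick sanity check, you can also run jps (the JVM process lister that ships with the JDK) on the master, and point a Spark shell at the cluster; the stand-alone master listens on port 7077 by default:

jps    # on Node1 this should list a Master (and a Worker, since Node1 is also in the slaves file)
<spark-home>/bin/spark-shell --master spark://<IP of Node1>:7077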


Feel free to throw your doubts, troubles and suggestions into the comments section.


