In addition to running on the Mesos or YARN cluster managers, Apache Spark also provides a simple standalone deploy mode, which can be launched on a single machine as well. To install Spark in standalone mode, we simply need a compiled version of Spark that matches the Hadoop version we are using. If you have not installed Hadoop on the machine, or if Spark will not be using an external HDFS, then you can choose any version. You can download the Spark build of your preference from the Apache Spark downloads page.
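For example, downloading and unpacking could look like the following sketch, assuming you pick the same build used later in this post (spark-1.2.1-bin-hadoop2.4); adjust the version and URL to the package you actually chose.

# Download a pre-built Spark package (version and URL are just an example)
wget https://archive.apache.org/dist/spark/spark-1.2.1/spark-1.2.1-bin-hadoop2.4.tgz
# Extract it to the directory you want to use as SPARK_HOME
tar -xzf spark-1.2.1-bin-hadoop2.4.tgz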
Once you have downloaded and unpacked Spark, there are a few simple configurations you have to make in the following files.
- conf/slaves (one worker hostname per line; for a single-machine setup this is simply localhost)
- conf/spark-env.sh (add the following entries)
- SPARK_HOME=/home/supun/Supun/Softwares/Spark/spark-1.2.1-bin-hadoop2.4
- SPARK_WORKER_MEMORY=2g
- SPARK_WORKER_INSTANCES=2
- SPARK_WORKER_DIR=/home/supun/Supun/Softwares/Spark/worker
- SPARK_MASTER_WEBUI_PORT=5001
- SPARK_WORKER_WEBUI_PORT=5002
Here, SPARK_WORKER_MEMORY is the amount of memory to allocate to each worker node. SPARK_WORKER_INSTANCES is the number of worker instances to run on this machine; here I have created only two worker nodes. SPARK_WORKER_DIR is the directory where workers store job-related files such as logs.
SPARK_MASTER_WEBUI_PORT is the port of the web-based dashboard of the master (if this is not set, Spark defaults to 8080; if port 8080 is in use by some other application, it increments the port by one and uses 8081, and so on). SPARK_WORKER_WEBUI_PORT is the starting port for the web-based dashboards of the worker nodes. If there is more than one worker node, port numbers are assigned automatically starting from the value set here (e.g. in our scenario, the port numbers will be 5002 and 5003, since there are two worker nodes).
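Putting the above together, a sketch of the two files for this scenario would look as follows. The paths are the ones used in this post, so replace them with your own; the templates shipped with Spark can be copied as a starting point.

# Create the two config files from the bundled templates
cp conf/spark-env.sh.template conf/spark-env.sh
cp conf/slaves.template conf/slaves

# conf/slaves: one worker host per line; for a single-machine cluster, just:
localhost

# conf/spark-env.sh: the settings described above
export SPARK_HOME=/home/supun/Supun/Softwares/Spark/spark-1.2.1-bin-hadoop2.4
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_DIR=/home/supun/Supun/Softwares/Spark/worker
export SPARK_MASTER_WEBUI_PORT=5001
export SPARK_WORKER_WEBUI_PORT=5002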
Now the configurations are all done. To start the Spark cluster master, navigate to <SPARK_HOME>/sbin and execute the following.
./start-master.sh
If all the configurations have been done correctly, once the master is started, you should be able to access the master's web UI at http://localhost:5001
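You can also confirm that the master came up, and find its spark:// URL, by looking at the log file that start-master.sh writes (the path below assumes the default log location under SPARK_HOME; adjust it if you have set SPARK_LOG_DIR).

# Show the tail of the master's log file (default naming pattern)
tail -n 20 $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.master.Master-*.out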
When starting the worker nodes, the master tries to access the workers through SSH. Therefore, first check whether ssh is installed on the machine by executing "which ssh" and "which sshd". If ssh/sshd is not installed, install it using:
sudo apt-get install ssh
Now that the software is installed, try the following command (Ubuntu) to check whether ssh can access localhost without a password.
ssh localhost
If this prompts for a password (the local machine user's password), then execute the following.
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
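Depending on how strict your system is about SSH key permissions, you may also need to restrict access to the authorized_keys file; after that, ssh localhost should log you in without prompting for a password.

# Optional: tighten permissions on the authorized_keys file (required on some setups)
chmod 600 ~/.ssh/authorized_keys
# Verify that passwordless login now works
ssh localhost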
Now, to start the worker nodes, navigate to <SPARK_HOME>/sbin again and execute the following.
./start-slaves.sh
Once the worker nodes are up, you should see them listed in the master web UI. You should also be able to access the web UI of each worker node (http://localhost:5002 and http://localhost:5003 in this scenario).
Finally, to run any application on this Spark cluster, use the master URL (of the form spark://<hostname>:7077) displayed at the top-left corner of the master web UI.
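For example, here is a minimal sketch of submitting the SparkPi example that ships with Spark to this cluster. The master URL and the examples jar name below are assumptions; use the exact spark:// URL shown in your master web UI and the examples jar bundled with your Spark build, and run the command from SPARK_HOME.

# Submit the bundled SparkPi example to the standalone cluster
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  lib/spark-examples-*.jar \
  10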
Written by Supun Setunga