In addition to running on the Mesos or YARN cluster managers, Apache Spark also provides a simple standalone deploy mode, which can be launched on a single machine as well. To install Spark in standalone mode, we simply need a compiled version of Spark that matches the Hadoop version we are using. If you have not installed Hadoop on the machine, or if Spark will not be using an external HDFS, then you can choose any version. You can download the Spark build of your preference from the Apache Spark downloads page.
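For example, downloading and unpacking could look like the following sketch, assuming you pick the same build used later in this post (spark-1.2.1-bin-hadoop2.4); adjust the version and URL to the package you actually chose.

# Download a pre-built Spark package (version and URL are just an example)
wget https://archive.apache.org/dist/spark/spark-1.2.1/spark-1.2.1-bin-hadoop2.4.tgz
# Extract it to the directory you want to use as SPARK_HOME
tar -xzf spark-1.2.1-bin-hadoop2.4.tgz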
Once you have downloaded and unpacked Spark, there are a few simple configurations you have to make in the following files.
- conf/slaves (one worker hostname per line; for a single-machine setup this is simply localhost)
- conf/spark-env.sh (add the following entries)
- SPARK_HOME=/home/supun/Supun/Softwares/Spark/spark-1.2.1-bin-hadoop2.4
- SPARK_WORKER_MEMORY=2g
- SPARK_WORKER_INSTANCES=2
- SPARK_WORKER_DIR=/home/supun/Supun/Softwares/Spark/worker
- SPARK_MASTER_WEBUI_PORT=5001
- SPARK_WORKER_WEBUI_PORT=5002
Here, SPARK_WORKER_MEMORY is the amount of memory to allocate to each worker node. SPARK_WORKER_INSTANCES is the number of worker instances to run on this machine; here I have created only two worker nodes. SPARK_WORKER_DIR is the directory where workers store job-related files such as logs.
SPARK_MASTER_WEBUI_PORT is the port of the web-based dashboard of the master (if this is not set, Spark defaults to 8080; if port 8080 is in use by some other application, it increments the port by one and uses 8081, and so on). SPARK_WORKER_WEBUI_PORT is the starting port for the web-based dashboards of the worker nodes. If there is more than one worker node, port numbers are assigned automatically starting from the value set here (e.g. in our scenario, the port numbers will be 5002 and 5003, since there are two worker nodes).
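Putting the above together, a sketch of the two files for this scenario would look as follows. The paths are the ones used in this post, so replace them with your own; the templates shipped with Spark can be copied as a starting point.

# Create the two config files from the bundled templates
cp conf/spark-env.sh.template conf/spark-env.sh
cp conf/slaves.template conf/slaves

# conf/slaves: one worker host per line; for a single-machine cluster, just:
localhost

# conf/spark-env.sh: the settings described above
export SPARK_HOME=/home/supun/Supun/Softwares/Spark/spark-1.2.1-bin-hadoop2.4
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_DIR=/home/supun/Supun/Softwares/Spark/worker
export SPARK_MASTER_WEBUI_PORT=5001
export SPARK_WORKER_WEBUI_PORT=5002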
Now the configurations are all done. To start the Spark cluster master, navigate to <SPARK_HOME>/sbin and execute the following.
./start-master.sh
If all the configurations have been done correctly, once the master is started, you should be able to access the master's web UI at http://localhost:5001
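You can also confirm that the master came up, and find its spark:// URL, by looking at the log file that start-master.sh writes (the path below assumes the default log location under SPARK_HOME; adjust it if you have set SPARK_LOG_DIR).

# Show the tail of the master's log file (default naming pattern)
tail -n 20 $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.master.Master-*.out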
When starting the worker nodes, the master tries to access the workers through SSH. Therefore, first check whether ssh is installed on the machine by executing "which ssh" and "which sshd". If ssh/sshd is not installed, install it using:
sudo apt-get install ssh
Now that the software is installed, try the following command (Ubuntu) to check whether ssh can access localhost without a password.
ssh localhost
If this prompts for a password (the local machine user's password), then execute the following.
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
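Depending on how strict your system is about SSH key permissions, you may also need to restrict access to the authorized_keys file; after that, ssh localhost should log you in without prompting for a password.

# Optional: tighten permissions on the authorized_keys file (required on some setups)
chmod 600 ~/.ssh/authorized_keys
# Verify that passwordless login now works
ssh localhost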
Now, to start the worker nodes, navigate to <SPARK_HOME>/sbin again and execute the following.
./start-slaves.sh
Once the worker nodes are up, you should see them listed in the master web UI. You should also be able to access the web UI of each worker node (http://localhost:5002 and http://localhost:5003 in this scenario).
Finally, to run any application on this Spark cluster, use the master URL (of the form spark://<hostname>:7077) displayed at the top-left corner of the master web UI.
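For example, here is a minimal sketch of submitting the SparkPi example that ships with Spark to this cluster. The master URL and the examples jar name below are assumptions; use the exact spark:// URL shown in your master web UI and the examples jar bundled with your Spark build, and run the command from SPARK_HOME.

# Submit the bundled SparkPi example to the standalone cluster
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  lib/spark-examples-*.jar \
  10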
Written by Supun Setunga