Setting up a Spark Standalone Cluster on a Local Machine
In addition to running on the Mesos or YARN cluster managers, Apache Spark also provides a simple standalone deploy mode, which can be launched on a single machine as well. To install Spark in standalone mode, we simply need a compiled version of Spark that matches the Hadoop version we are using. If you haven't installed Hadoop on the machine, or if Spark will not be using an external HDFS, then you can choose any version. You can download the Spark build you prefer from the Apache Spark downloads page.
Once you have downloaded and unpacked Spark, there are a few simple configurations to make in the following files.
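- conf/slaves: lists the hosts on which worker nodes are started, one per line. For a single-machine setup it only needs to contain localhost.
- conf/spark-env.sh: add the following entries.
  - SPARK_HOME=/home/supun/Supun/Softwares/Spark/spark-1.2.1-bin-hadoop2.4
  - SPARK_WORKER_MEMORY=2g
  - SPARK_WORKER_INSTANCES=2
  - SPARK_WORKER_DIR=/home/supun/Supun/Softwares/Spark/wroker
  - SPARK_MASTER_WEBUI_PORT=5001
  - SPARK_WORKER_WEBUI_PORT=5002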
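As a rough sketch, the spark-env.sh entries above could be written out and sanity-checked as follows. The scratch directory created with mktemp is only a stand-in for <SPARK_HOME>/conf and is an assumption made for illustration. Note that the worker web UI variable is spelled SPARK_WORKER_WEBUI_PORT, which is the name Spark reads.

```shell
#!/bin/sh
# Scratch directory standing in for <SPARK_HOME>/conf (hypothetical, for illustration only).
conf_dir="$(mktemp -d)"

# Write the settings used in this post. SPARK_HOME and SPARK_WORKER_DIR are
# left out here because their paths are specific to each machine.
cat > "$conf_dir/spark-env.sh" <<'EOF'
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_INSTANCES=2
SPARK_MASTER_WEBUI_PORT=5001
SPARK_WORKER_WEBUI_PORT=5002
EOF

# Source the file and echo the values back as a quick sanity check.
. "$conf_dir/spark-env.sh"
echo "workers: $SPARK_WORKER_INSTANCES, worker memory: $SPARK_WORKER_MEMORY"
echo "master UI port: $SPARK_MASTER_WEBUI_PORT, first worker UI port: $SPARK_WORKER_WEBUI_PORT"
```

For a real setup, the same lines would go into <SPARK_HOME>/conf/spark-env.sh instead of a temporary file.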
Here, SPARK_WORKER_MEMORY is the amount of memory each worker node may use for Spark applications. SPARK_WORKER_INSTANCES is the number of worker nodes to start on the machine; here I have created only two. SPARK_WORKER_DIR is the directory in which worker nodes store job-related files such as logs.
SPARK_MASTER_WEBUI_PORT is the port of the web-based dashboard of the master. (If this is not set, Spark uses 8080. If port 8080 is in use by some other application, Spark increments the port by one and uses 8081.) SPARK_WORKER_WEBUI_PORT is the starting port for the web-based dashboards of the worker nodes. If there is more than one worker node, port numbers are assigned automatically, counting up from the value set here. (In our scenario the port numbers will be 5002 and 5003, since there are two worker nodes.)
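The worker port numbering described above can be illustrated with a small sketch. The worker_ui_ports function is a hypothetical helper written for this post, not a Spark command, and it assumes none of the ports are already taken by another application.

```shell
#!/bin/sh
# Print the web UI port expected for each worker, counting up from the
# starting port. worker_ui_ports is a hypothetical helper, not part of Spark.
worker_ui_ports() {
    start_port=$1
    instances=$2
    i=0
    while [ "$i" -lt "$instances" ]; do
        echo $((start_port + i))
        i=$((i + 1))
    done
}

# Values from this post: starting worker UI port 5002, two worker instances.
worker_ui_ports 5002 2
```

With the values from this post, this prints 5002 and 5003, one per line.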
Now the configurations are all done. To start the Spark cluster master, navigate to <SPARK_HOME>/sbin and execute the following.
./start-master.sh
If all the configurations have been done correctly, once the master is started you should be able to access the master's web UI at: http://localhost:5001
When starting worker nodes, the start script connects to each worker host through ssh. Therefore, first check whether ssh is installed on the machine by executing "which ssh" and "which sshd". If ssh/sshd has not been installed, install it on Ubuntu using:
sudo apt-get install ssh
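The check for ssh and sshd can also be scripted. The following is a minimal sketch, where check_tool is a hypothetical helper name and command -v plays the role of "which".

```shell
#!/bin/sh
# Report whether a command can be found on the current PATH.
# check_tool is a hypothetical helper name used only in this sketch.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

# On some systems sshd lives in /usr/sbin, which may not be on a regular
# user's PATH, so a "not found" for sshd is worth double-checking.
for tool in ssh sshd; do
    check_tool "$tool" || true
done
```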
Once ssh is installed, run the following command to check whether ssh can access localhost without a password.
ssh localhost
If this asks for a password (the local machine user's password), then execute the following to generate a key pair and authorize it for logins to this machine.
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
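To confirm that the key is accepted without typing anything, the login can be tested non-interactively. This is a sketch only: can_ssh_without_password is a hypothetical helper name, and the check also fails if the host key for localhost has not been accepted yet, so it helps to run "ssh localhost" by hand once first.

```shell
#!/bin/sh
# Return success only if ssh can log in to the given host without prompting.
# can_ssh_without_password is a hypothetical helper name.
can_ssh_without_password() {
    # BatchMode=yes makes ssh fail instead of asking for a password or passphrase.
    # ConnectTimeout keeps the check from hanging if sshd is not running.
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

if can_ssh_without_password localhost; then
    echo "password-less ssh to localhost works"
else
    echo "ssh to localhost still needs a password, or sshd is not reachable"
fi
```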
Once ssh to localhost works without a password, start the worker nodes by executing the following from <SPARK_HOME>/sbin.
./start-slaves.sh
Once the worker nodes are up, you should see them listed in the master web UI. You should also be able to access the web UI of each worker node (http://localhost:5002 and http://localhost:5003 in this scenario).
Finally, to run any application on this Spark cluster, use the master URL displayed at the top-left corner of the master web UI (it has the form spark://HOST:PORT).
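If you prefer to check the dashboards from a terminal, something along these lines may help. It is a sketch that assumes curl is installed and uses the ports configured in this post; ui_url is a hypothetical helper.

```shell
#!/bin/sh
# Build the dashboard URL for a given port on the local machine.
# ui_url is a hypothetical helper name.
ui_url() {
    echo "http://localhost:$1"
}

# Ports from this post: master on 5001, workers on 5002 and 5003.
for port in 5001 5002 5003; do
    url="$(ui_url "$port")"
    # -s silences progress output, -f treats HTTP errors as failures,
    # --max-time stops the check from hanging on an unresponsive port.
    if curl -sf --max-time 5 -o /dev/null "$url"; then
        echo "$url is up"
    else
        echo "$url is not reachable (or curl is not installed)"
    fi
done
```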
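As an example of using that URL, the sketch below starts an interactive spark-shell against the cluster. It assumes the master registered under this machine's hostname and listens on the default master port 7077; if the URL shown in the master web UI differs, use that one instead. master_url is a hypothetical helper.

```shell
#!/bin/sh
# Build a standalone master URL; 7077 is the default master port.
# master_url is a hypothetical helper name.
master_url() {
    echo "spark://$1:${2:-7077}"
}

# Assumption: the master registered under this machine's hostname.
# Copy the exact URL from the master web UI if it differs.
MASTER_URL="$(master_url "$(hostname)")"
echo "master URL: $MASTER_URL"

# Launch spark-shell against the cluster only if SPARK_HOME points at a Spark directory.
if [ -n "${SPARK_HOME:-}" ] && [ -x "$SPARK_HOME/bin/spark-shell" ]; then
    "$SPARK_HOME/bin/spark-shell" --master "$MASTER_URL"
else
    echo "set SPARK_HOME to the Spark directory to launch spark-shell"
fi
```

Once the shell connects, the application should appear under the running applications in the master web UI.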