Setting up a Spark Standalone Cluster on a Local Machine

In addition to running on the Mesos or YARN cluster managers, Apache Spark also provides a simple standalone deploy mode, which can be launched on a single machine as well. To install Spark in standalone mode, we simply need a compiled version of Spark that matches the Hadoop version we are using. If Hadoop is not installed on the machine, or if Spark will not be using an external HDFS, then you can choose any version. You can download your preferred version of Spark from here.

Once you have downloaded and unpacked Spark, there are a few simple configurations to make in the following files.

  • conf/slaves
Open the <SPARK_HOME>/conf/slaves file in a text editor and add "localhost" on a new line.


  • conf/spark-env.sh
Create a new file <SPARK_HOME>/conf/spark-env.sh and add the following. Spark by default ships with a template for spark-env.sh, found in the conf directory under the name "spark-env.sh.template". You can use that template as a starting point by renaming it to "spark-env.sh" and modifying it.
    • SPARK_HOME=/home/supun/Supun/Softwares/Spark/spark-1.2.1-bin-hadoop2.4
    • SPARK_WORKER_MEMORY=2g
    • SPARK_WORKER_INSTANCES=2
    • SPARK_WORKER_DIR=/home/supun/Supun/Softwares/Spark/worker
    • SPARK_MASTER_WEBUI_PORT=5001
    • SPARK_WORKER_WEBUI_PORT=5002
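
Since spark-env.sh is just a shell script that Spark sources at startup, one way to create it from the bundled template is sketched below. The paths and port values are the ones used in this post and are only illustrative; adjust them for your own machine.

```shell
# From the Spark installation directory: copy the bundled template,
# then append the settings described above (values are illustrative).
cd /home/supun/Supun/Softwares/Spark/spark-1.2.1-bin-hadoop2.4
cp conf/spark-env.sh.template conf/spark-env.sh
cat >> conf/spark-env.sh <<'EOF'
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_DIR=/home/supun/Supun/Softwares/Spark/worker
SPARK_MASTER_WEBUI_PORT=5001
SPARK_WORKER_WEBUI_PORT=5002
EOF
```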

Here, SPARK_WORKER_MEMORY is the amount of memory to allocate to each worker node. SPARK_WORKER_INSTANCES is the number of worker instances to run; here I have created only two workers. SPARK_WORKER_DIR is the directory where the workers store job-related files such as logs.

SPARK_MASTER_WEBUI_PORT is the port of the web-based dashboard of the master. (If this is not set, Spark defaults to 8080. If port 8080 is in use by some other application, Spark increments the port by one and tries 8081, and so on.) SPARK_WORKER_WEBUI_PORT is the starting port for the web-based dashboards of the worker nodes. If there is more than one worker node, port numbers are assigned automatically, starting from the value set here. (e.g. in our scenario, the port numbers will be 5002 and 5003, since there are two workers.)

Now the configurations are all done. To start the Spark cluster master, navigate to <SPARK_HOME>/sbin and execute the following.
     ./start-master.sh

If all the configurations have been done correctly, then once the master is started you should be able to access the master's web UI at http://localhost:5001
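
To double-check that the master JVM actually came up, you can also list the running Java processes with jps (a tool that ships with the JDK); a process named Master should appear.

```shell
# List running JVMs; a "Master" process should be present once
# start-master.sh succeeds. If it is missing, check the master log
# under <SPARK_HOME>/logs for the reason.
jps
```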

When starting the worker nodes, the master tries to access the workers through SSH. Therefore, first check whether SSH is installed on the machine by executing "which ssh" and "which sshd". If ssh/sshd are not installed, install them using:
      sudo apt-get install ssh

Now, with SSH installed, try the following command (on Ubuntu) to check whether SSH can access localhost without a password.
      ssh localhost

If this asks for a password (the local machine user's password), then execute the following.
      ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
      cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Then, to start the worker nodes, execute the following from the same directory (<SPARK_HOME>/sbin).
      ./start-slaves.sh
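
If a worker fails to start, its log is usually the quickest diagnostic. By default Spark writes worker logs under <SPARK_HOME>/logs; the exact file name includes your user name and host name, so the wildcard pattern below is only a sketch based on Spark's default log naming.

```shell
# Tail the worker logs; the file name pattern is an assumption based on
# Spark's default log naming and may differ on your setup.
tail -n 50 logs/spark-*-org.apache.spark.deploy.worker.Worker-*.out
```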


Once the worker nodes are up, you should be able to see them listed in the master web UI. You can also access the web UI of each worker node (http://localhost:5002 and http://localhost:5003 in this scenario).

Finally, to run any application on this Spark cluster, use the master URL displayed at the top-left corner of the master web UI.
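
For example, submitting the SparkPi example bundled with Spark might look like the following. The master URL (spark://localhost:7077) and the examples jar path are assumptions based on this post's Spark 1.2.1 layout; substitute the URL shown in your own master web UI and the examples jar shipped with your build.

```shell
# From <SPARK_HOME>: submit the bundled SparkPi example to the standalone master.
# The master URL and jar path below are assumptions; use the URL shown at the
# top of your master web UI.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  lib/spark-examples-1.2.1-hadoop2.4.0.jar 100
```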




