In the previous post we discussed how to connect a Jupyter notebook to PySpark. Going forward, in this post I will discuss how you can run Python scripts, and analyze and build machine learning models, on top of data stored in WSO2 Data Analytics servers. You may use a vanilla Data Analytics Server (DAS) or any other analytics server, such as ESB Analytics, APM Analytics, or IS Analytics, for this purpose.
Prerequisites
- Install Jupyter
- Download WSO2 Data Analytics Server (DAS) 3.1.0
- Download and extract the Spark 1.6.2 binary
- Download pyrolite-4.13.jar
Configure the Analytics Server
In this scenario, the Analytics Server acts as both the external Spark cluster and the data source. Hence, the Analytics Server needs to be started in cluster mode. To do that, open <DAS_HOME>/repository/conf/axis2/axis2.xml and enable clustering as follows:
<clustering class="org.wso2.carbon.core.clustering.hazelcast.HazelcastClusteringAgent" enable="true">
When the Analytics Server starts in cluster mode, it creates a Spark cluster as well (or, if it is pointed to an external Spark cluster, it joins that cluster). The Analytics Server also creates a Spark application, which by default grabs all the cores available in the cluster. When we later connect python/pyspark to the same cluster, it creates another Spark application; since no cores are left for it, that application stays in the "WAITING" state and never runs. To avoid this, we need to limit the amount of resources allocated to the CarbonAnalytics Spark application. To do so, open the <DAS_HOME>/repository/conf/analytics/spark/spark-defaults.conf file and set/change the following parameters.
carbon.spark.master.count 1

# Worker
spark.worker.cores 4
spark.worker.memory 4g
spark.executor.memory 2g
spark.cores.max 2

Note that "spark.worker.cores" (4) is the total number of cores we allocate to Spark, while "spark.cores.max" (2) is the maximum number of cores allocated to each Spark application.
Since we are not running the DAS/Analytics Server as a minimum HA cluster, we also need to set the following property in the <DAS_HOME>/repository/conf/etc/tasks-config.xml file.
<taskServerCount>1</taskServerCount>
Now start the server by navigating to <DAS_HOME> and executing:
./bin/wso2server.sh
Once the server is up, check whether the Spark cluster is correctly configured by navigating to the Spark master web UI at http://localhost:8081/. It should show something similar to the screenshot below.
Note that the number of cores allocated to the worker is 4 (2 used), and the number of cores allocated to the CarbonAnalytics application is 2. You can also see the Spark master URL at the top-left corner (spark://10.100.5.116:7077). This URL is used by pyspark and other clients to connect to, and submit jobs to, this Spark cluster.
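If you prefer to verify this from a script rather than the browser, the standalone Spark master web UI also exposes the same cluster status as JSON. Below is a minimal sketch, assuming the master UI is reachable at http://localhost:8081; the /json endpoint and its field names are standard Spark standalone-master behaviour, not anything DAS-specific.

import json
import urllib.request

# Fetch the cluster status from the standalone master's JSON endpoint (assumed to be http://localhost:8081/json).
status = json.loads(urllib.request.urlopen("http://localhost:8081/json").read().decode("utf-8"))

print("Master URL :", status["url"])
for worker in status["workers"]:
    print("Worker     : %s - %d cores (%d used)" % (worker["host"], worker["cores"], worker["coresused"]))
for app in status["activeapps"]:
    print("Application: %s - %d cores" % (app["name"], app["cores"]))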
Now, to run a Python script on top of this Analytics Server, we have two options:
- Connect an IPython/Jupyter notebook and execute the Python script from its UI.
- Execute the raw Python script using spark-submit.
Connect Jupyter Notebook (Option I)
Open ~/.bashrc and add the following entries:
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=/home/supun/Supun/Softwares/anaconda3/bin/python
export SPARK_HOME="/home/supun/Supun/Softwares/spark-1.6.2-bin-hadoop2.6"
export PATH="/home/supun/Supun/Softwares/spark-1.6.2-bin-hadoop2.6/bin:$PATH"
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib:$PYTHONPATH
export SPARK_CLASSPATH=/home/supun/Downloads/pyrolite-4.13.jar
When we run a Python script on top of Spark, pyspark submits it as a job to the Spark cluster (the Analytics Server, in this case). Therefore we need to add all the external jars to Spark's classpath, so that the Spark executors know where to look for the classes at runtime. That is, we need to add the absolute paths of the jars located in the <DAS_HOME>/repository/libs directory as well as the <DAS_HOME>/repository/components/plugins directory to the Spark classpath, separated by colons (:) as below.
export SPARK_CLASSPATH=/home/supun/Downloads/wso2das-3.1.0/repository/components/plugins/abdera_1.0.0.wso2v3.jar:/home/supun/Downloads/wso2das-3.1.0/repository/components/plugins/ajaxtags_1.3.0.beta-rc7-wso2v1.jar.......
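Typing out every jar by hand is tedious, so you can generate this value with a small script. The following is a minimal sketch, assuming DAS is extracted at /home/supun/Downloads/wso2das-3.1.0; it prints the colon-separated jar list, which you can then append to the SPARK_CLASSPATH entry above.

import glob
import os

# Assumed DAS install location; change this to match your environment.
das_home = "/home/supun/Downloads/wso2das-3.1.0"
jar_dirs = [os.path.join(das_home, "repository", "libs"),
            os.path.join(das_home, "repository", "components", "plugins")]

# Collect the absolute paths of every jar and join them with ':' to form the classpath value.
jars = sorted(jar for d in jar_dirs for jar in glob.glob(os.path.join(d, "*.jar")))
print(":".join(jars))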
To make the changes take effect, run:
source ~/.bashrc
Create a new directory to be used as the Python workspace (say "python-workspace"). This directory will be used to store the scripts we create in the notebook. Navigate to that directory and start the notebook, specifying the master URL of the remote Spark cluster as below.
pyspark --master spark://10.100.5.116:7077 --conf "spark.driver.extraJavaOptions=-Dwso2_custom_conf_dir=/home/supun/Downloads/wso2das-3.1.0/repository/conf"
Finally, navigate to http://localhost:8888/ to access the notebook, and create a new Python script via New --> Python 3.
Then check the Spark master UI (http://localhost:8081) again. You should see that a second application named "PySparkShell" has started as well, and is using the remaining 2 cores (see below).
Retrieve Data
In the new Python script we created in Jupyter, we can use any Spark Python API. To perform Spark operations with Python, we need a SparkContext and an SQLContext. When we start Jupyter with pyspark, a SparkContext is created by default; it can be accessed through the object 'sc'.
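For example, you can quickly inspect the default context in a notebook cell before replacing it; these are standard SparkContext attributes, nothing DAS-specific.

# 'sc' is created automatically by pyspark when the notebook starts.
print(sc.version)    # Spark version, e.g. 1.6.2
print(sc.master)     # should print the master URL, e.g. spark://10.100.5.116:7077
print(sc.defaultParallelism)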
We can also create our own SparkContext with any additional configurations, but to create a new one we first need to stop the existing context.
from pyspark import SparkContext, SparkConf, SQLContext

# Set the additional properties.
sparkConf = (SparkConf()
             .set(key="spark.driver.allowMultipleContexts", value="true")
             .set(key="spark.executor.extraJavaOptions",
                  value="-Dwso2_custom_conf_dir=/home/supun/Downloads/wso2das-3.1.0/repository/conf"))

# Stop the default SparkContext created by pyspark, and create a new SparkContext using the above SparkConf.
sc.stop()
sparkCtx = SparkContext(conf=sparkConf)

# Check the Spark master.
print(sparkConf.get("spark.master"))

# Create a SQL context and query a table stored in the Analytics Server.
sqlCtx = SQLContext(sparkCtx)
df = sqlCtx.sql("SELECT * FROM table1")
df.show()
'df' is a Spark DataFrame, so you can now perform any Spark operation on top of it. You can also use the spark-mllib and spark-ml packages to build machine learning models. See [1] for such a sample, which trains a Random Forest classification model on top of data stored in WSO2 DAS.
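As a rough illustration of the kind of model referred to in [1], the sketch below trains a Random Forest classifier with spark-mllib on the DataFrame retrieved above. The column names "label", "f1", and "f2" are hypothetical; replace them with the actual columns of your DAS table.

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

# Convert each row into a LabeledPoint (assumes a numeric "label" column and numeric feature columns "f1", "f2").
labeled = df.map(lambda row: LabeledPoint(row["label"], [row["f1"], row["f2"]]))
train, test = labeled.randomSplit([0.7, 0.3], seed=42)

# Train a small Random Forest classification model.
model = RandomForest.trainClassifier(train, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=10, featureSubsetStrategy="auto",
                                     impurity="gini", maxDepth=5)

# Evaluate on the held-out split.
predictions = model.predict(test.map(lambda p: p.features))
labels_and_preds = test.map(lambda p: p.label).zip(predictions)
accuracy = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(test.count())
print("Test accuracy: %.3f" % accuracy)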
Running a Python Script without the Jupyter Notebook (Option II)
Other than running Python scripts through the notebook, we can also run a raw Python script directly on top of Spark using spark-submit. For that we can use the same Python script as above, with a slight modification: in the notebook scenario there is a default SparkContext ("sc") created by the notebook, but here there is no such default context, so the sc.stop() call is not needed (leaving it in will cause errors). Once that line is removed, save the script with a .py extension and run it as below:
<SPARK_HOME>/bin/spark-submit --master spark://10.100.5.116:7077 --conf "spark.driver.extraJavaOptions=-Dwso2_custom_conf_dir=/home/supun/Downloads/wso2das-3.1.0/repository/conf" PySpark-Sample.py
You can refer to [2] for a Python script that does the same as the notebook version discussed earlier.
References
[1] https://github.com/SupunS/play-ground/blob/master/python/pyspark/PySpark-Sample.ipynb
[2] https://github.com/SupunS/play-ground/blob/master/python/pyspark/PySpark-Sample.py
Written by Supun Setunga