Connect iPython/Jupyter Notebook to pyspark

Written by Supun Setunga, September 15, 2016

Prerequisites

Install Jupyter.
Download and uncompress the Spark 1.6.2 binary.
Download pyrolite-4.13.jar.

Set Environment Variables

Open ~/.bashrc and add the following entries:

export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=/home/supun/Supun/Softwares/anaconda3/bin/python
export SPARK_HOME="/home/supun/Supun/Softwares/spark-1.6.2-bin-hadoop2.6"
export PATH="/home/supun/Supun/Softwares/spark-1.6.2-bin-hadoop2.6/bin:$PATH"
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib:$PYTHONPATH
export SPARK_CLASSPATH=/home/supun/Downloads/pyrolite-4.13.jar

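If you use external third-party libraries such as spark-csv, add each jar's absolute path to the Spark classpath, separated by colons (:), as below. Because this assignment replaces any earlier SPARK_CLASSPATH value, include the pyrolite jar in the list as well.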

export SPARK_CLASSPATH=<path/to/third/party/jar1>:<path/to/third/party/jar2>:..:<path/to/third/party/jarN>
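As a rough illustration of the colon-separated format, the value can be assembled from a list of jar paths. This is only a minimal sketch: the helper name `build_spark_classpath` and the spark-csv jar path are hypothetical, and only the pyrolite path comes from the entries above.

```python
def build_spark_classpath(jar_paths):
    """Join absolute jar paths with colons, the format SPARK_CLASSPATH uses."""
    for path in jar_paths:
        # The post asks for absolute paths, so reject anything relative.
        if not path.startswith("/"):
            raise ValueError("not an absolute path: %s" % path)
    return ":".join(jar_paths)


jars = [
    "/home/supun/Downloads/pyrolite-4.13.jar",  # from the entries above
    "/path/to/spark-csv.jar",                   # hypothetical example jar
]
print("export SPARK_CLASSPATH=" + build_spark_classpath(jars))
```

The printed line can then be pasted into ~/.bashrc in place of the earlier SPARK_CLASSPATH entry.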

To make the changes take effect, run:

source ~/.bashrc
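To confirm that the variables are visible to newly started processes, a quick check from Python can help. The sketch below assumes the variable names from the entries above are the ones that matter; the helper `missing_spark_env` is hypothetical.

```python
import os

# Variable names taken from the ~/.bashrc entries above.
REQUIRED_VARS = [
    "PYSPARK_DRIVER_PYTHON",
    "PYSPARK_DRIVER_PYTHON_OPTS",
    "PYSPARK_PYTHON",
    "SPARK_HOME",
    "PYTHONPATH",
    "SPARK_CLASSPATH",
]


def missing_spark_env(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_spark_env()
if missing:
    print("Not set: " + ", ".join(missing))
else:
    print("All Spark-related variables are set.")
```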

Get Started

Create a new directory to use as the Python workspace (say, "python-workspace"). The scripts we create in the notebook will be stored here. Navigate to that directory and run the following to start the notebook.

pyspark

Spark starts in local mode here. You can check the Spark UI at http://localhost:4040 (the default port).

To connect to a remote Spark cluster, specify its master URL when starting the notebook, as below.

pyspark --master spark://10.100.5.116:7077
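A malformed master URL is an easy mistake to make, so it can be worth validating the value before passing it to --master. The following is a minimal sketch using only the Python standard library. The helper name `parse_master_url` is hypothetical, and it only handles the spark://host:port form shown above.

```python
from urllib.parse import urlparse


def parse_master_url(url):
    """Return (host, port) for a spark://host:port URL, or raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme != "spark":
        raise ValueError("expected a spark:// URL, got: %r" % url)
    # Accessing .port raises ValueError if the port is not a valid number.
    host, port = parsed.hostname, parsed.port
    if host is None or port is None:
        raise ValueError("host and port are both required: %r" % url)
    return host, port


print(parse_master_url("spark://10.100.5.116:7077"))
```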

Finally, navigate to http://localhost:8888/ to access the notebook.
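If the page does not load, it can help to check whether anything is listening on the port at all. This is a small sketch using the standard library; `is_port_open` is a hypothetical helper, and the host and port are taken from the URL above.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


print("Notebook reachable:", is_port_open("localhost", 8888))
```

The same helper can be pointed at the Spark UI port to check that the driver is up.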

Use Spark within python

To run Spark operations from Python, we need a SparkContext and a SQLContext. When we start Jupyter through pyspark, a SparkContext is created by default and is available as the object 'sc'.
We can also create our own SparkContext with additional configuration. To create a new one, we first need to stop the existing SparkContext.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Set any additional properties.
sparkConf = SparkConf().set("spark.driver.allowMultipleContexts", "true")

# Stop the default SparkContext created by pyspark, then create a new one using the above SparkConf.
sc.stop()
sparkCtx = SparkContext(conf=sparkConf)

# Check the Spark master the new context is connected to.
print(sparkCtx.master)

# Create a SQL context.
sqlCtx = SQLContext(sparkCtx)

# Query a table; "table1" must already be registered with this SQL context.
df = sqlCtx.sql("SELECT * FROM table1")
df.show()

'df' is a Spark DataFrame. You can now run any Spark operation on it. You can also use the spark-mllib and spark-ml packages to build machine learning models.
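The query above assumes that a table named table1 is already registered with the SQL context. As a minimal sketch of how such a table might be created for experimentation, the helper below builds a small DataFrame and registers it as a temporary table. The helper name `register_demo_table`, the column names and the sample rows are all made up for illustration. `createDataFrame` and `registerTempTable` are the Spark 1.6 calls; later Spark versions favour `createOrReplaceTempView`.

```python
def register_demo_table(sql_ctx, name="table1"):
    """Create a tiny DataFrame and register it as a temporary table.

    sql_ctx is expected to be a pyspark.sql.SQLContext, such as the
    sqlCtx object created above. The rows and columns are made-up data.
    """
    rows = [(1, "alice"), (2, "bob"), (3, "carol")]
    demo_df = sql_ctx.createDataFrame(rows, ["id", "name"])
    demo_df.registerTempTable(name)
    return demo_df
```

In the notebook, calling `register_demo_table(sqlCtx)` before the `sqlCtx.sql(...)` query should let the SELECT return these three rows.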

Tags: pyspark python wso2das
