Connect iPython/Jupyter Notebook to pyspak

Prerequisites


  • Install jupyter
  • Download and uncompress spark 1.6.2 binary.
  • Dowload pyrolite-4.13.jar


Set Environment Variables

open ~/.bashrc and add the following entries: 
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook' pyspark
export PYSPARK_PYTHON=/home/supun/Supun/Softwares/anaconda3/bin/python
export SPARK_HOME="/home/supun/Supun/Softwares/spark-1.6.2-bin-hadoop2.6"
export PATH="/home/supun/Supun/Softwares/spark-1.6.2-bin-hadoop2.6/bin:$PATH"
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib:$PYTHONPATH
export SPARK_CLASSPATH=/home/supun/Downloads/pyrolite-4.13.jar

If you are using some external third-party libraries such as spark-csv, then add that jar's absolute path to Spark Class path, seperated by colons (:) as below.
export SPARK_CLASSPATH=<path/to/third/party/jar1>:<path/to/third/party/jar2>:..:<path/to/third/party/jarN>

To make the changes take effect, run:
source ~/.bashrc 


Get Started


Create a new directory, to be used as the python workspace (say "python-workspace"). This directory will be used to store the scripts we create in the notebook. Navigate to that created directory, and run the following to start the notebook.

pyspark

Here spark will start in local mode. You can check the Spark UI at http://localhost:5001

If you need to connect to a remote spark cluster, then specify the master URL of the remote spark cluster as below, when starting the notebook.
pyspark --master spark://10.100.5.116:7077"

Finally navigate to http://localhost:8888/ to access the notebook.


Use Spark within python

To do spark operations with python, we are going to need the Spark Context and SQLContext. When we start jupyter with pyspark, it will create a spark context by default. This can be accessed using the object 'sc'.
We can also create our own spark context, with any additional configurations as well. But to create a new one, we need to stop the existing spark context first.
from pyspark import SparkContext, SparkConf, SQLContext

# Set the additional propeties.
sparkConf = (SparkConf().set(key="spark.driver.allowMultipleContexts",value="true"))

# Stop the default SparkContext created by pyspark. And create a new SparkContext using the above SparkConf.
sc.stop()
sparkCtx = SparkContext(conf=sparkConf)

# Check spark master.
print(sparkConf.get("spark.master"))

# Create a SQL context.
sqlCtx = SQLContext(sparkCtx)

df = sqlCtx.sql("SELECT * FROM table1")
df.show()

'df' is a spark dataframe. Now you can do any spark operation on top of that dataframe. You can also use spark-mllib and spark-ml packages and build machine learning models as well.

Share:

0 comments