Connect IPython/Jupyter Notebook to pyspark
Prerequisites
- Install jupyter
- Download and uncompress the Spark 1.6.2 binary.
- Download pyrolite-4.13.jar
Set Environment Variables
Open ~/.bashrc and add the following entries:
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=/home/supun/Supun/Softwares/anaconda3/bin/python
export SPARK_HOME="/home/supun/Supun/Softwares/spark-1.6.2-bin-hadoop2.6"
export PATH="/home/supun/Supun/Softwares/spark-1.6.2-bin-hadoop2.6/bin:$PATH"
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib:$PYTHONPATH
export SPARK_CLASSPATH=/home/supun/Downloads/pyrolite-4.13.jar
If you are using external third-party libraries such as spark-csv, add each jar's absolute path to the Spark classpath, separated by colons (:), as below.
export SPARK_CLASSPATH=<path/to/third/party/jar1>:<path/to/third/party/jar2>:..:<path/to/third/party/jarN>
To make the changes take effect, run:
source ~/.bashrc
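To verify that the variables are visible to Python, you can run a quick sanity check in any Python interpreter started from the same shell session (a minimal sketch; None means the variable is not set):

import os

# These should print the paths configured above.
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYTHONPATH"))
print(os.environ.get("SPARK_CLASSPATH"))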
Get Started
Create a new directory to be used as the Python workspace (say "python-workspace"). This directory will be used to store the scripts we create in the notebook. Navigate to the newly created directory and run the following to start the notebook.
pyspark
Here, Spark starts in local mode. You can check the Spark UI at http://localhost:5001
If you need to connect to a remote Spark cluster instead, specify the master URL of the remote cluster when starting the notebook, as below.
pyspark --master spark://10.100.5.116:7077
Finally, navigate to http://localhost:8888/ to access the notebook.
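Once the notebook is open, you can confirm which master you are connected to from a notebook cell. A minimal check, assuming the default 'sc' object that pyspark creates:

# Default SparkContext created by pyspark.
print(sc.version)   # e.g. 1.6.2
print(sc.master)    # local[*] in local mode, or the spark://... URL of the remote cluster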
Use Spark within Python
To perform Spark operations from Python, we need the SparkContext and the SQLContext. When we start Jupyter with pyspark, it creates a SparkContext by default, which can be accessed through the object 'sc'. We can also create our own SparkContext with any additional configuration, but to do so we first need to stop the existing one.
from pyspark import SparkContext, SparkConf, SQLContext

# Set the additional properties.
sparkConf = SparkConf().set(key="spark.driver.allowMultipleContexts", value="true")

# Stop the default SparkContext created by pyspark, and create a new
# SparkContext using the above SparkConf.
sc.stop()
sparkCtx = SparkContext(conf=sparkConf)

# Check the Spark master.
print(sparkConf.get("spark.master"))

# Create a SQL context.
sqlCtx = SQLContext(sparkCtx)
df = sqlCtx.sql("SELECT * FROM table1")
df.show()
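If you added the spark-csv jar (and its dependencies) to SPARK_CLASSPATH as described above, you can also load a CSV file straight into a DataFrame instead of querying a table. A sketch, assuming spark-csv 1.x and a hypothetical file path:

# Read a CSV file into a Spark DataFrame using the spark-csv data source.
csvDf = (sqlCtx.read
         .format("com.databricks.spark.csv")
         .option("header", "true")        # first line contains column names
         .option("inferSchema", "true")   # infer column types
         .load("/path/to/data.csv"))      # hypothetical path
csvDf.printSchema()
csvDf.show()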
'df' is a Spark DataFrame, so you can now perform any Spark operation on top of it. You can also use the spark-mllib and spark-ml packages to build machine learning models.
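For example, here is a minimal spark-ml sketch on top of such a DataFrame; the column names 'col1', 'col2', and 'label' are hypothetical and should be replaced with columns from your own table.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assemble the feature columns into a single vector column.
assembler = VectorAssembler(inputCols=["col1", "col2"], outputCol="features")
train = assembler.transform(df).select("features", "label")

# Fit a simple logistic regression model and inspect its predictions.
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
model = lr.fit(train)
model.transform(train).select("label", "prediction").show()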